Details

  • Baidu introduces Qianfan-OCR, a 4B-parameter vision-language model that handles table extraction, formula recognition, chart understanding, and key information extraction in a single pass without multi-stage pipelines.
  • Posts leading benchmark scores: 93.12 on OmniDocBench v1.5 and 79.8 on OlmOCR Bench, the best among end-to-end models; 880 on OCRBench, the highest overall, pipelines included; and an 87.9 average on key information extraction, surpassing Gemini-3.1-Pro.
  • The key innovation is Layout-as-Thought: an optional thinking phase in which the model emits bounding boxes, element types, and reading order as tokens, restoring explicit layout analysis within an end-to-end setup.
  • Excels at chart understanding, where traditional OCR+LLM pipelines score 0.0 on CharXiv: as a unified VLM, it captures spatial relationships, axis labels, and data points.
  • Deployment-friendly: runs on a single A100 GPU at 1.024 pages/sec with W8A8 quantization under vLLM; available on the Baidu Qianfan Platform, with open weights on HuggingFace.
  • Developed by Baidu's ERNIE Team; the paper describes a unified architecture integrating parsing, layout, and understanding, and the model supports prompt-driven tasks such as image-to-Markdown conversion and document QA.
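
To make the Layout-as-Thought idea more concrete, the sketch below shows how a downstream consumer might use the three pieces of layout information the thinking phase is said to produce (bounding box, element type, reading order). The `LayoutElement` class, its field names, the element-type strings, and the `to_markdown` helper are all hypothetical; the model's actual token format is not described here, so this only illustrates why explicit reading order matters when assembling Markdown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LayoutElement:
    """Hypothetical container for one layout element (not the model's real format)."""
    bbox: Tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) in pixels
    kind: str                        # assumed element type: "title", "text", "formula", ...
    order: int                       # position in the reading order
    text: str                        # recognized content of the element


def in_reading_order(elements: List[LayoutElement]) -> List[LayoutElement]:
    """Sort elements by their reading-order index, not by their input position."""
    return sorted(elements, key=lambda e: e.order)


def to_markdown(elements: List[LayoutElement]) -> str:
    """Assemble a Markdown string, formatting each element according to its type."""
    parts = []
    for element in in_reading_order(elements):
        if element.kind == "title":
            parts.append("# " + element.text)
        elif element.kind == "formula":
            parts.append("$$" + element.text + "$$")
        else:
            parts.append(element.text)
    return "\n\n".join(parts)


# Elements deliberately listed out of order, as a detector might emit them.
page = [
    LayoutElement((40, 120, 560, 300), "text", 1, "Total due: 42.00"),
    LayoutElement((40, 40, 560, 100), "title", 0, "Invoice"),
]
markdown = to_markdown(page)
```

Here the title is emitted first even though it appears second in the input list, because assembly follows the reading-order index rather than detection order.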

Impact

Baidu's Qianfan-OCR advances end-to-end document AI: it tops OCRBench at 880, ahead of Qwen3-VL-4B's 873, and leads Gemini-3.1-Pro on key information extraction. That puts pressure on rivals such as Alibaba's Qwen-VL and on global players in specialized OCR, where pipelines falter on charts and layouts. The single-model approach lowers deployment costs through single-GPU serving, widening access for enterprises in intelligent office automation and accelerating adoption in sectors that handle invoices, reports, and multilingual documents. It also fits a broader trend in vision-language models toward domain-specific enhancements such as layout reasoning, and could steer R&D toward efficient 4B-scale VLMs that bridge OCR precision with semantic tasks. Over the next 12-24 months, open weights on HuggingFace could boost fine-tuning for education and B2B applications, narrowing the gap with larger Western models while highlighting China's push in multimodal AI amid U.S. export controls.
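
The deployment-cost point can be made concrete with a rough capacity estimate based on the reported single-A100 throughput of 1.024 pages/sec. This is a back-of-the-envelope sketch: it assumes the rate is sustained and scales linearly across GPUs, and the `utilization` parameter and both function names are illustrative, not drawn from the paper.

```python
import math

PAGES_PER_SECOND = 1.024   # reported single-A100 throughput (W8A8, vLLM)
SECONDS_PER_DAY = 86_400


def pages_per_day(gpus: int = 1, utilization: float = 1.0) -> float:
    """Estimated pages per day, assuming linear scaling across GPUs."""
    return PAGES_PER_SECOND * SECONDS_PER_DAY * gpus * utilization


def gpus_needed(daily_pages: int, utilization: float = 1.0) -> int:
    """Smallest GPU count whose estimated daily capacity covers the workload."""
    return math.ceil(daily_pages / pages_per_day(1, utilization))


single_gpu_daily = pages_per_day()          # one GPU, fully utilized
gpus_for_million = gpus_needed(1_000_000)   # GPUs for one million pages per day
```

Under these assumptions a single GPU handles roughly 88,000 pages per day; actual throughput will vary with page complexity, batch size, and whether the optional thinking phase is enabled.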