An expanded 50-document, 8-model benchmark evaluated by 5 independent AI judges confirms that Claude Sonnet 4.6, GPT-5.4, and GLM-OCR outperform all traditional OCR engines — and that local models still have a role in specific scenarios.
We benchmarked 8 OCR models — 5 local engines and 3 LLM-based vision models via OpenRouter — across 50 financial documents spanning invoices, SEC filings, financial reports, general documents, and degraded scans. The results reveal a clear capability tier shift.
Claude Sonnet 4.6 (88.9), GPT-5.4 (86.6), and GLM-OCR (84.5) occupy the top three positions. The gap between the best LLM model and the best purely local model (GLM-OCR) is 4.4 points — but the gap to traditional OCR like EasyOCR is a massive 36.9 points.
GLM-OCR leads on invoices (91.0) and financial mixed docs (82.9). Claude leads on degraded scans (86.1). GPT-5.4 tops SEC filings (97.1) and general documents (88.9). Category context drives model selection more than overall rank.
EasyOCR collapsed to 28.6 on degraded documents and Docling dropped to 44.9. The strongest vision-language models held up: Claude scored 86.1 and GLM-OCR 83.7. Robustness to poor scan quality now requires a vision-language model or LLM.
Despite competitive text accuracy (85.9), Gemini 3.1 Pro scores only 71.8 on completeness — 18.8 points below Claude. On longer documents it consistently misses content. Strong accuracy does not imply full extraction.
Scores averaged across 5 independent AI judges (Gemini 2.5 Flash, Claude Sonnet, GPT-4.1 Mini, Qwen 2.5 VL 72B, DeepSeek V3) to reduce individual model bias. LLM models used the OpenRouter vision API.
| Rank | Model | Text Acc. (30%) | Table Struct. (25%) | Numeric Prec. (20%) | Layout (15%) | Complete. (10%) | Total |
|---|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 (LLM) · Anthropic · via OpenRouter | 89.3 | 86.6 | 95.4 | 85.9 | 90.6 | 88.9 |
| 2 | GPT-5.4 (LLM) · OpenAI · via OpenRouter | 87.0 | 84.0 | 93.2 | 83.5 | 88.6 | 86.6 |
| 3 | GLM-OCR · Zhipu AI · 0.9B params, local | 86.2 | 78.9 | 92.6 | 80.8 | 85.9 | 84.5 |
| 4 | Gemini 3.1 Pro (LLM) · Google · via OpenRouter | 85.9 | 78.1 | 86.1 | 80.8 | 71.8 | 79.8 |
| 5 | docTR · Mindee · CRNN+DBNet, local | 72.8 | 60.9 | 83.6 | 66.6 | 78.9 | 71.9 |
| 6 | Docling · IBM · Granite 258M, local | 63.7 | 53.7 | 75.4 | 59.0 | 68.5 | 63.4 |
| 7 | Tesseract · Google · LSTM, local | 62.4 | 47.0 | 71.5 | 56.0 | 67.8 | 60.9 |
| 8 | EasyOCR · JaidedAI · CRAFT+CRNN, local | 51.9 | 41.2 | 63.8 | 46.4 | 58.5 | 52.0 |
Key insight: Table structure shows the widest spread of any criterion, 45.4 points between Claude (86.6) and EasyOCR (41.2), and it is the lowest-scoring criterion for every local engine. Numeric precision is the highest-scoring criterion for every model, which suggests digit recognition is comparatively mature while structured layout understanding remains the frontier challenge. Completeness is where Gemini falls behind its LLM peers, a hidden weakness that doesn't surface in accuracy-only metrics.
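The per-criterion spread can be recomputed from the results table. The sketch below is only an illustration: the dictionary layout and the function names `criterion_spread` and `weakest_criterion` are assumptions, and the numbers are copied from the table.

```python
# Criterion averages copied from the results table.
# Column order: text accuracy, table structure, numeric precision, layout, completeness.
CRITERIA = ["text", "table", "numeric", "layout", "completeness"]
SCORES = {
    "Claude Sonnet 4.6": [89.3, 86.6, 95.4, 85.9, 90.6],
    "GPT-5.4":           [87.0, 84.0, 93.2, 83.5, 88.6],
    "GLM-OCR":           [86.2, 78.9, 92.6, 80.8, 85.9],
    "Gemini 3.1 Pro":    [85.9, 78.1, 86.1, 80.8, 71.8],
    "docTR":             [72.8, 60.9, 83.6, 66.6, 78.9],
    "Docling":           [63.7, 53.7, 75.4, 59.0, 68.5],
    "Tesseract":         [62.4, 47.0, 71.5, 56.0, 67.8],
    "EasyOCR":           [51.9, 41.2, 63.8, 46.4, 58.5],
}


def criterion_spread(scores):
    """Return {criterion: best score minus worst score across models}."""
    spread = {}
    for i, name in enumerate(CRITERIA):
        column = [row[i] for row in scores.values()]
        spread[name] = round(max(column) - min(column), 1)
    return spread


def weakest_criterion(row):
    """Name of the lowest-scoring criterion for a single model."""
    return CRITERIA[row.index(min(row))]


spread = criterion_spread(SCORES)
widest = max(spread, key=spread.get)  # criterion with the largest spread
```

On these figures `widest` is table structure, while `weakest_criterion` shows that the lowest criterion differs by model (layout for Claude, completeness for Gemini).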
Category-level analysis shows no model performs uniformly. The gap between a model's best and worst category reveals specialisation: LLM models are broadly strong; local models are brittle outside their comfort zone.
| Model | Invoices & Receipts | SEC Filings | Financial Mixed | General Documents | Degraded Scans |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 (LLM) | 89.0 | 97.0 | 81.0 | 88.1 | 86.1 |
| GPT-5.4 (LLM) | 87.0 | 97.1 | 79.2 | 88.9 | 76.2 |
| GLM-OCR | 91.0 | 80.3 | 82.9 | 84.1 | 83.7 |
| Gemini 3.1 Pro (LLM) | 79.9 | 83.5 | 67.1 | 86.7 | 72.2 |
| docTR | 88.2 | 70.5 | 67.4 | 67.3 | 66.4 |
| Docling | 74.7 | 63.8 | 42.7 | 74.8 | 44.9 |
| Tesseract | 80.6 | 53.2 | 51.7 | 64.9 | 47.6 |
| EasyOCR | 70.3 | 58.6 | 19.5 | 61.7 | 28.6 |
The GLM-OCR surprise: GLM-OCR scores 91.0 on invoices and 82.9 on financial mixed, beating both Claude and GPT-5.4 in those categories despite ranking 3rd overall. Its vision-language architecture appears well suited to dense, tabular financial documents. For organisations running offline or in air-gapped environments, GLM-OCR is the only local model competitive with the LLM tier.
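The best-to-worst category gap mentioned earlier can be computed from the category table. This is a minimal sketch; the name `category_range` and the data layout are illustrative assumptions, and the scores are copied from the table.

```python
# Category scores copied from the category table.
# Column order: invoices, SEC filings, financial mixed, general, degraded.
CATEGORY_SCORES = {
    "Claude Sonnet 4.6": [89.0, 97.0, 81.0, 88.1, 86.1],
    "GPT-5.4":           [87.0, 97.1, 79.2, 88.9, 76.2],
    "GLM-OCR":           [91.0, 80.3, 82.9, 84.1, 83.7],
    "Gemini 3.1 Pro":    [79.9, 83.5, 67.1, 86.7, 72.2],
    "docTR":             [88.2, 70.5, 67.4, 67.3, 66.4],
    "Docling":           [74.7, 63.8, 42.7, 74.8, 44.9],
    "Tesseract":         [80.6, 53.2, 51.7, 64.9, 47.6],
    "EasyOCR":           [70.3, 58.6, 19.5, 61.7, 28.6],
}


def category_range(scores):
    """Gap between a model's best and worst category score."""
    return round(max(scores) - min(scores), 1)


ranges = {model: category_range(s) for model, s in CATEGORY_SCORES.items()}
most_consistent = min(ranges, key=ranges.get)   # smallest best-to-worst gap
least_consistent = max(ranges, key=ranges.get)  # largest best-to-worst gap
```

On these figures GLM-OCR has the narrowest range (10.7 points) and EasyOCR the widest (50.8 points).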
Figure: Average total score per model across all 50 documents and 5 judges, with LLM-based models highlighted.
The LLM tier gap: Claude, GPT-5.4, and GLM-OCR form a distinct performance tier (84–89), 12.6 points clear of docTR (71.9), the best traditional engine; Gemini 3.1 Pro (79.8) sits between the two groups. This is not a marginal improvement: it reflects a fundamental architectural shift from pixel-pattern recognition to semantic document understanding. For any production financial workflow, this tier represents the new baseline.
A four-stage reproducible pipeline covering document collection, multi-model extraction, multi-judge evaluation, and score aggregation — designed to eliminate single-judge and single-document bias.
Sourced from SROIE (10 scanned receipts, ICDAR 2019), MultiFin (10 SEC EDGAR filings with ground truth), Sujet Finance Vision 10K (5 financial docs), and OmniDocBench (15 general documents). Added 10 synthetically degraded scans (halved resolution + Gaussian noise) to stress-test robustness. All 50 document images are included in the repository.
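The degradation step (halved resolution plus Gaussian noise) might look roughly like the sketch below. It is not the benchmark's actual code: the 2x2 block averaging, the noise level `sigma`, the fixed `seed`, and the function name `degrade` are all assumptions, and the image is represented as a plain list of grayscale rows to keep the example dependency-free.

```python
import random


def degrade(pixels, sigma=12.0, seed=0):
    """Halve resolution by 2x2 block averaging, then add Gaussian noise.

    `pixels` is a grayscale image as a list of rows of ints in 0..255.
    `sigma` and `seed` are illustrative values, not the benchmark's settings.
    A trailing odd row or column is dropped.
    """
    rng = random.Random(seed)
    height, width = len(pixels), len(pixels[0])
    out = []
    for y in range(0, height - 1, 2):
        row = []
        for x in range(0, width - 1, 2):
            # Average each 2x2 block into one output pixel.
            block = (pixels[y][x] + pixels[y][x + 1]
                     + pixels[y + 1][x] + pixels[y + 1][x + 1]) / 4
            # Add zero-mean Gaussian noise and clip to the valid range.
            noisy = block + rng.gauss(0.0, sigma)
            row.append(min(255, max(0, round(noisy))))
        out.append(row)
    return out


page = [[255] * 8 for _ in range(6)]  # a blank 8x6 "scan"
small = degrade(page)                 # 4x3 noisy version
```

A real pipeline would more likely use an imaging library for resampling; the pure-Python version only makes the two operations explicit.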
5 local engines: GLM-OCR (0.9B, Zhipu AI), docTR (Mindee CRNN+DBNet), Docling (IBM Granite-258M), Tesseract 5.3 (Google), EasyOCR (JaidedAI). 3 LLM-based vision models via OpenRouter: Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro. All images resized to ≤ 2048px before LLM API calls to stay within model limits. Zero total failures across 400 extractions.
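The resize constraint can be expressed as a small helper that computes target dimensions. This sketch assumes the 2048px limit applies to the longer side and that aspect ratio is preserved; the source states neither, and `fit_within` is a hypothetical name.

```python
def fit_within(width, height, max_side=2048):
    """Return (width, height) scaled so the longer side is at most max_side.

    Aspect ratio is preserved and images already within the limit are
    returned unchanged (no upscaling).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


a4_at_300dpi = fit_within(2480, 3508)  # a typical A4 scan at 300 DPI
```

The resulting dimensions would then be passed to whatever image library performs the actual resampling.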
Gemini 2.5 Flash, Claude Sonnet, GPT-4.1 Mini, Qwen 2.5 VL 72B, and DeepSeek V3 each scored every (model, document) pair on 5 weighted criteria via OpenRouter. Where ground truth existed (SROIE, MultiFin), judges compared against it directly — raising score reliability above blind LLM assessment.
For each (model, document) pair, all 5 judge scores were averaged into a single robust score. Category and criterion breakdowns were computed to reveal where each model excels. Rankings were derived from unweighted document averages to avoid dataset composition bias.
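The aggregation described above can be sketched in two steps: average the judges for each pair, then average over documents. The function names are illustrative, the judge scores in the usage line are made up, and the per-category document counts are inferred from the dataset description (10 receipts, 10 SEC filings, 5 financial documents, 15 general documents, 10 degraded scans).

```python
from statistics import mean


def pair_score(judge_scores):
    """Average the judges' scores for one (model, document) pair."""
    return mean(judge_scores)


def overall_from_categories(category_means, doc_counts):
    """Unweighted document average, rebuilt from per-category means.

    Each category contributes in proportion to its number of documents,
    which is equivalent to averaging over all documents directly.
    """
    total_docs = sum(doc_counts.values())
    weighted = sum(category_means[c] * doc_counts[c] for c in doc_counts)
    return weighted / total_docs


# Document counts per category (inferred from the dataset description).
DOC_COUNTS = {"invoices": 10, "sec": 10, "financial_mixed": 5,
              "general": 15, "degraded": 10}

# Category means copied from the category table.
CLAUDE = {"invoices": 89.0, "sec": 97.0, "financial_mixed": 81.0,
          "general": 88.1, "degraded": 86.1}
EASYOCR = {"invoices": 70.3, "sec": 58.6, "financial_mixed": 19.5,
           "general": 61.7, "degraded": 28.6}

example_pair = pair_score([80, 85, 90, 75, 70])  # hypothetical judge scores
claude_overall = overall_from_categories(CLAUDE, DOC_COUNTS)
easyocr_overall = overall_from_categories(EASYOCR, DOC_COUNTS)
```

Under these assumptions the recomputed overall scores agree with the published totals (88.9 and 52.0) to within rounding.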
| Criterion | Weight | What it measures | Why it matters for finance |
|---|---|---|---|
| Text Accuracy | 30% | Character-level fidelity — misspellings, wrong characters, missing words | A single wrong character in a company name or account number can invalidate a document |
| Table Structure | 25% | Columns, rows, headers, merged cells preserved correctly | Financial statements are predominantly tabular — broken tables = broken data |
| Numeric Precision | 20% | Decimals, currencies, amounts, dates exactly correct | $1,000,000 vs $10,000,000: a single extra digit is catastrophic |
| Layout Preservation | 15% | Reading order, sections, headers, formatting maintained | Regulatory filings have legally significant section ordering |
| Completeness | 10% | All visible text captured — nothing major omitted | Missing footnotes or disclaimers in financial docs creates compliance risk |
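
The weighting in the criteria table amounts to a simple weighted sum. The sketch below illustrates it with the published criterion averages for Claude Sonnet 4.6; the names `WEIGHTS` and `weighted_total` are assumptions, not the benchmark's code.

```python
# Criterion weights from the criteria table.
WEIGHTS = {"text": 0.30, "table": 0.25, "numeric": 0.20,
           "layout": 0.15, "completeness": 0.10}


def weighted_total(scores, weights=WEIGHTS):
    """Combine per-criterion scores (0-100) into one weighted score."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[name] * weight for name, weight in weights.items())


# Published criterion averages for Claude Sonnet 4.6.
claude = {"text": 89.3, "table": 86.6, "numeric": 95.4,
          "layout": 85.9, "completeness": 90.6}
claude_total = weighted_total(claude)
```

Applied to the published criterion averages, this gives about 89.5 for Claude Sonnet 4.6, slightly above the published total of 88.9. The source does not say exactly how the totals were aggregated, so the function should be read as an illustration of the weighting only.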
Model selection depends on your document mix, deployment constraints, and accuracy requirements. The LLM tier is now the default for cloud-connected workloads.
Use Claude Sonnet 4.6 as the primary engine. Best overall score (88.9), highest completeness (90.6), strongest on degraded scans (86.1). Deploy via Anthropic API or OpenRouter. For SEC filings specifically, GPT-5.4 edges ahead at 97.1.
Use GLM-OCR as the only local model in the LLM performance tier. At 84.5 overall it beats every other local engine by 12+ points, with the best local performance on invoices (91.0) and financial mixed documents (82.9).
Use docTR as a cost-efficient baseline when LLM API costs are prohibitive. At 71.9 overall it is the best purely local model after GLM-OCR and has strong numeric precision (83.6). Avoid for degraded documents.
EasyOCR's collapse on financial mixed (19.5) and degraded scans (28.6) makes it unsuitable for any financial workflow. Tesseract's SEC filing score (53.2), previously a strength, has been overtaken by all 7 other models, and it now trails Docling overall (60.9 vs 63.4). Gemini 3.1 Pro is worth monitoring, but its completeness gap (71.8) needs resolution before production use on long-form documents.
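The recommendations above boil down to picking the highest-scoring model for a document category, optionally restricted to local engines. The sketch below is one way to encode that. `pick_model` and the category keys are hypothetical, the scores are copied from the category table, and cost or latency are deliberately ignored.

```python
CATEGORIES = ["invoices", "sec", "financial_mixed", "general", "degraded"]
LOCAL_MODELS = {"GLM-OCR", "docTR", "Docling", "Tesseract", "EasyOCR"}

# Category scores copied from the category table (same order as CATEGORIES).
SCORES = {
    "Claude Sonnet 4.6": [89.0, 97.0, 81.0, 88.1, 86.1],
    "GPT-5.4":           [87.0, 97.1, 79.2, 88.9, 76.2],
    "GLM-OCR":           [91.0, 80.3, 82.9, 84.1, 83.7],
    "Gemini 3.1 Pro":    [79.9, 83.5, 67.1, 86.7, 72.2],
    "docTR":             [88.2, 70.5, 67.4, 67.3, 66.4],
    "Docling":           [74.7, 63.8, 42.7, 74.8, 44.9],
    "Tesseract":         [80.6, 53.2, 51.7, 64.9, 47.6],
    "EasyOCR":           [70.3, 58.6, 19.5, 61.7, 28.6],
}


def pick_model(category, local_only=False):
    """Return the top-scoring model for a category.

    With local_only=True, only engines that run offline are considered,
    as in an air-gapped deployment.
    """
    idx = CATEGORIES.index(category)
    candidates = {
        model: scores[idx]
        for model, scores in SCORES.items()
        if not local_only or model in LOCAL_MODELS
    }
    return max(candidates, key=candidates.get)


cloud_choice = pick_model("sec")                   # any model allowed
offline_choice = pick_model("sec", local_only=True)  # local engines only
```

A production router would also weigh API cost and data-residency rules, which this benchmark does not measure.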
The performance gap between LLM-based and traditional OCR is not marginal — it reflects three fundamental architectural advantages that matter specifically for financial documents.
LLMs understand that "1,234,567" in a balance sheet column is a currency amount, not a phone number or date. Traditional OCR treats every token identically. This semantic awareness is a likely reason the three LLMs average 82.9 on table structure and 91.6 on numeric precision, against 50.7 and 73.6 for the four traditional engines.
Claude and GLM-OCR score 86.1 and 83.7 on degraded scans, while EasyOCR collapses to 28.6. Vision-language models infer missing characters from context; CRAFT+CRNN models fail when pixel patterns are corrupted below detection thresholds.
GLM-OCR's 0.9B GRPO-trained model competes with frontier LLMs through two optimisations: joint multi-task reinforcement learning across text, table, formula, and KIE tasks, and a PP-DocLayout-V3 stage that separates layout analysis from text recognition. This is why it leads on invoices and financial mixed — categories with dense tabular content.
Gemini 3.1 Pro's completeness score of 71.8, against Claude's 90.6, points to a weakness in how the model handles long documents: it produces confident, accurate text for the regions it covers but appears to omit other sections. This looks like a general long-document limitation of LLMs rather than a shortfall in OCR capability per se.