AI Research · April 2026

LLM-Based Models Dominate Financial Document OCR

An expanded 50-document, 8-model benchmark evaluated by 5 independent AI judges confirms that Claude Sonnet 4.6, GPT-5.4, and GLM-OCR outperform all traditional OCR engines — and that local models still have a role in specific scenarios.

8 OCR Models Tested · 50 Financial Documents · 5 Independent Judges · 2,000 Total Evaluations

Executive Summary

Four findings that matter

We benchmarked 8 OCR models — 5 local engines and 3 LLM-based vision models via OpenRouter — across 50 financial documents spanning invoices, SEC filings, financial reports, general documents, and degraded scans. The results reveal a clear capability tier shift.

1. LLM and vision-language models sweep the top 3

Claude Sonnet 4.6 (88.9), GPT-5.4 (86.6), and GLM-OCR (84.5) occupy the top three positions. The gap between the best LLM-based model and the best purely local model (GLM-OCR) is 4.4 points, but the gap to traditional OCR such as EasyOCR is 36.9 points.

2. No single model wins every category

GLM-OCR leads on invoices (91.0) and financial mixed docs (82.9). Claude leads on degraded scans (86.1). GPT-5.4 tops SEC filings (97.1) and general documents (88.9). Category context drives model selection more than overall rank.

3. Degraded scans remain the sharpest stress test

EasyOCR collapsed to 28.6 on degraded documents. Docling dropped to 44.9. The strongest models held up, with Claude at 86.1 and GLM-OCR at 83.7. Robustness to scan quality now requires a vision-language model or LLM.

4. Gemini 3.1 Pro underdelivers on completeness

Despite competitive text accuracy (85.9), Gemini 3.1 Pro scores only 71.8 on completeness — 18.8 points below Claude. On longer documents it consistently misses content. Strong accuracy does not imply full extraction.

Model Rankings

Overall performance across 50 documents

Scores averaged across 5 independent AI judges (Gemini 2.5 Flash, Claude Sonnet, GPT-4.1 Mini, Qwen 2.5 VL 72B, DeepSeek V3) to reduce individual model bias. LLM models used the OpenRouter vision API.

Rank	Model	Text Acc. (30%)	Table Struct. (25%)	Numeric Prec. (20%)	Layout (15%)	Complete. (10%)	Total
1	Claude Sonnet 4.6 (LLM) · Anthropic · via OpenRouter	89.3	86.6	95.4	85.9	90.6	88.9
2	GPT-5.4 (LLM) · OpenAI · via OpenRouter	87.0	84.0	93.2	83.5	88.6	86.6
3	GLM-OCR · Zhipu AI · 0.9B params, local	86.2	78.9	92.6	80.8	85.9	84.5
4	Gemini 3.1 Pro (LLM) · Google · via OpenRouter	85.9	78.1	86.1	80.8	71.8	79.8
5	docTR · Mindee · CRNN+DBNet, local	72.8	60.9	83.6	66.6	78.9	71.9
6	Docling · IBM · Granite 258M, local	63.7	53.7	75.4	59.0	68.5	63.4
7	Tesseract · Google · LSTM, local	62.4	47.0	71.5	56.0	67.8	60.9
8	EasyOCR · JaidedAI · CRAFT+CRNN, local	51.9	41.2	63.8	46.4	58.5	52.0

Key insight: Table structure shows the widest spread of any criterion, 45.4 points between Claude (86.6) and EasyOCR (41.2), and it is the lowest-scoring criterion for five of the eight models (every local engine, including GLM-OCR). Numeric precision is the highest-scoring criterion for every model, suggesting digit recognition is comparatively mature while structured layout understanding remains the frontier challenge. Completeness is where Gemini falls behind its LLM peers, a weakness that does not surface in accuracy-only metrics.
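The spread figures in this insight can be recomputed directly from the rankings table. The sketch below is only an illustration: the scores are copied from the table, while the names `CRITERIA`, `SCORES`, `criterion_spread`, and `strongest_criterion` are hypothetical and not part of the benchmark code.

```python
# Criterion averages copied from the rankings table (same column order).
CRITERIA = ["text_accuracy", "table_structure", "numeric_precision",
            "layout", "completeness"]

SCORES = {
    "Claude Sonnet 4.6": [89.3, 86.6, 95.4, 85.9, 90.6],
    "GPT-5.4":           [87.0, 84.0, 93.2, 83.5, 88.6],
    "GLM-OCR":           [86.2, 78.9, 92.6, 80.8, 85.9],
    "Gemini 3.1 Pro":    [85.9, 78.1, 86.1, 80.8, 71.8],
    "docTR":             [72.8, 60.9, 83.6, 66.6, 78.9],
    "Docling":           [63.7, 53.7, 75.4, 59.0, 68.5],
    "Tesseract":         [62.4, 47.0, 71.5, 56.0, 67.8],
    "EasyOCR":           [51.9, 41.2, 63.8, 46.4, 58.5],
}


def criterion_spread(scores):
    """Return {criterion: best score minus worst score}, rounded to 0.1."""
    spread = {}
    for i, name in enumerate(CRITERIA):
        column = [row[i] for row in scores.values()]
        spread[name] = round(max(column) - min(column), 1)
    return spread


def strongest_criterion(scores):
    """Return {model: name of its highest-scoring criterion}."""
    return {model: CRITERIA[row.index(max(row))]
            for model, row in scores.items()}


spread = criterion_spread(SCORES)
strongest = strongest_criterion(SCORES)
widest = max(spread, key=spread.get)  # criterion with the largest spread
```

Here `widest` identifies table structure as the criterion that separates the models most, and `strongest` maps every model to numeric precision.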

Performance by Document Type

Where each model wins — and where it breaks

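Category-level analysis shows no model performs uniformly. The gap between a model's best and worst category reveals specialisation: Claude's category scores span 16.0 points and GLM-OCR's 10.7, while the traditional engines are brittle outside their comfort zone (EasyOCR falls from 70.3 on invoices to 19.5 on financial mixed documents).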

Model	Invoices & Receipts	SEC Filings	Financial Mixed	General Documents	Degraded Scans
Claude Sonnet 4.6 (LLM)	89.0	97.0	81.0	88.1	86.1
GPT-5.4 (LLM)	87.0	97.1	79.2	88.9	76.2
GLM-OCR	91.0	80.3	82.9	84.1	83.7
Gemini 3.1 Pro (LLM)	79.9	83.5	67.1	86.7	72.2
docTR	88.2	70.5	67.4	67.3	66.4
Docling	74.7	63.8	42.7	74.8	44.9
Tesseract	80.6	53.2	51.7	64.9	47.6
EasyOCR	70.3	58.6	19.5	61.7	28.6

The GLM-OCR surprise: GLM-OCR scores 91.0 on invoices and 82.9 on financial mixed — beating both Claude and GPT-5.4 in those categories despite ranking 3rd overall. Its vision-language architecture is specifically tuned for financial documents. For organisations running offline or air-gapped, GLM-OCR is the only local model competitive with the LLM tier.
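The category scores tie back to the overall ranking. As a rough check, weighting each category by its document count reproduces the reported totals to within rounding. The sketch assumes the five categories contain 10, 10, 5, 15, and 10 documents in column order (receipts, SEC filings, financial mixed, general, degraded), which matches the dataset sizes given in the methodology; the function and variable names are hypothetical.

```python
# Assumed documents per category, in the column order of the table above.
DOC_COUNTS = [10, 10, 5, 15, 10]

# Category scores copied from the table above.
CATEGORY_SCORES = {
    "Claude Sonnet 4.6": [89.0, 97.0, 81.0, 88.1, 86.1],
    "GPT-5.4":           [87.0, 97.1, 79.2, 88.9, 76.2],
    "GLM-OCR":           [91.0, 80.3, 82.9, 84.1, 83.7],
    "Gemini 3.1 Pro":    [79.9, 83.5, 67.1, 86.7, 72.2],
    "docTR":             [88.2, 70.5, 67.4, 67.3, 66.4],
    "Docling":           [74.7, 63.8, 42.7, 74.8, 44.9],
    "Tesseract":         [80.6, 53.2, 51.7, 64.9, 47.6],
    "EasyOCR":           [70.3, 58.6, 19.5, 61.7, 28.6],
}

# Totals as reported in the overall rankings.
REPORTED_TOTALS = {
    "Claude Sonnet 4.6": 88.9, "GPT-5.4": 86.6, "GLM-OCR": 84.5,
    "Gemini 3.1 Pro": 79.8, "docTR": 71.9, "Docling": 63.4,
    "Tesseract": 60.9, "EasyOCR": 52.0,
}


def overall_score(category_scores, doc_counts):
    """Document-count-weighted mean of the category scores."""
    weighted = sum(s * n for s, n in zip(category_scores, doc_counts))
    return weighted / sum(doc_counts)


recomputed = {model: overall_score(scores, DOC_COUNTS)
              for model, scores in CATEGORY_SCORES.items()}

# Largest disagreement between recomputed and reported totals.
max_error = max(abs(recomputed[m] - REPORTED_TOTALS[m]) for m in recomputed)
```

Because every document counts equally, the 15 general documents carry three times the weight of the 5 financial mixed documents in the overall score.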

Visual Comparison

Model performance at a glance

Average total score across all 50 documents and 5 judges. LLM-based models are marked (LLM).

Model	Total
Claude Sonnet 4.6 (LLM)	88.9
GPT-5.4 (LLM)	86.6
GLM-OCR	84.5
Gemini 3.1 Pro (LLM)	79.8
docTR	71.9
Docling	63.4
Tesseract	60.9
EasyOCR	52.0

The LLM tier gap: Claude, GPT-5.4, and GLM-OCR form a distinct performance tier (84–89). Gemini 3.1 Pro (79.8) sits below it, and the best traditional engine, docTR (71.9), trails the tier by 12.6 points. This is not a marginal improvement: it reflects a fundamental architectural shift from pixel-pattern recognition to semantic document understanding. For any production financial workflow, this tier represents the new baseline.

Methodology

How we built this benchmark

A four-stage reproducible pipeline covering document collection, multi-model extraction, multi-judge evaluation, and score aggregation — designed to eliminate single-judge and single-document bias.

Phase 1 — Document Collection

Curated 50 documents across 5 categories

Sourced from SROIE (10 scanned receipts, ICDAR 2019), MultiFin (10 SEC EDGAR filings with ground truth), Sujet Finance Vision 10K (5 financial docs), and OmniDocBench (15 general documents). Added 10 synthetically degraded scans (halved resolution + Gaussian noise) to stress-test robustness. All 50 document images are included in the repository.
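The degradation step (halved resolution plus Gaussian noise) can be sketched in a few lines. This is not the benchmark's actual code: it assumes "halved resolution" means halving each dimension by 2x2 block averaging, the noise level `sigma=15.0` is an arbitrary illustrative value, and images are plain lists of grayscale rows so that the example needs no imaging library.

```python
import random


def halve_resolution(image):
    """Downsample a grayscale image (rows of 0-255 ints) by averaging 2x2 blocks."""
    height = len(image) // 2 * 2      # ignore a trailing odd row
    width = len(image[0]) // 2 * 2    # ignore a trailing odd column
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) // 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to each pixel, clamped to 0-255."""
    return [
        [min(255, max(0, round(pixel + rng.gauss(0.0, sigma)))) for pixel in row]
        for row in image
    ]


def degrade(image, sigma=15.0, seed=0):
    """Halve the resolution, then add noise; seeded so results are repeatable."""
    rng = random.Random(seed)
    return add_gaussian_noise(halve_resolution(image), sigma, rng)


# Tiny synthetic "page": white background with a single black pixel.
page = [[255] * 8 for _ in range(6)]
page[2][3] = 0
degraded = degrade(page)
```

Seeding the generator keeps the degraded set reproducible, which matters when the same 10 scans have to be fed to all 8 models.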

Phase 2 — OCR Extraction

Ran 8 models on all 50 documents (400 extractions)

5 local engines: GLM-OCR (0.9B, Zhipu AI), docTR (Mindee CRNN+DBNet), Docling (IBM Granite-258M), Tesseract 5.3 (Google), EasyOCR (JaidedAI). 3 LLM-based vision models via OpenRouter: Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro. All images resized to ≤ 2048px before LLM API calls to stay within model limits. Zero total failures across 400 extractions.
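The resize constraint is simple to express. The helper below is a minimal sketch that assumes "≤ 2048px" refers to the longest side and that aspect ratio is preserved; `fit_within` is a hypothetical name, not a function from the benchmark repository.

```python
def fit_within(width, height, max_side=2048):
    """Return (width, height) scaled down so the longest side is <= max_side.

    Aspect ratio is preserved; images already within the limit are unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # int() floors, so neither side can exceed max_side after scaling.
    return max(1, int(width * scale)), max(1, int(height * scale))


landscape = fit_within(4096, 2048)   # oversized scan, scaled by 0.5
small = fit_within(1000, 800)        # already within the limit
a4_scan = fit_within(2480, 3508)     # A4 page at 300 dpi
```

The resulting size can then be passed to whatever imaging library performs the actual resampling before the API call.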

Phase 3 — Multi-Judge Evaluation

5 independent AI judges scored each extraction (2,000 evaluations)

Gemini 2.5 Flash, Claude Sonnet, GPT-4.1 Mini, Qwen 2.5 VL 72B, and DeepSeek V3 each scored every (model, document) pair on 5 weighted criteria via OpenRouter. Where ground truth existed (SROIE, MultiFin), judges compared against it directly — raising score reliability above blind LLM assessment.

Phase 4 — Aggregation & Analysis

Averaged scores across judges, segmented by category and criterion

For each (model, document) pair, all 5 judge scores were averaged into a single score. Category and criterion breakdowns were computed to reveal where each model excels. Rankings were derived from unweighted document averages, so every document counts equally and larger categories contribute proportionally more to the total.
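The two-step aggregation described here can be sketched as follows. The data layout, a flat list of `(model, document, judge, score)` tuples, and the sample values are hypothetical; only the averaging order follows the description above.

```python
from collections import defaultdict
from statistics import mean


def aggregate(evaluations):
    """Aggregate judge scores into one score per model.

    Step 1: average the judge scores for each (model, document) pair.
    Step 2: average those per-document scores for each model, giving
            every document equal weight.
    """
    per_pair = defaultdict(list)
    for model, document, _judge, score in evaluations:
        per_pair[(model, document)].append(score)

    per_model = defaultdict(list)
    for (model, _document), scores in per_pair.items():
        per_model[model].append(mean(scores))

    return {model: mean(doc_scores) for model, doc_scores in per_model.items()}


# Hypothetical scores: two judges, two documents for model_a, one for model_b.
sample = [
    ("model_a", "doc1", "judge1", 80.0),
    ("model_a", "doc1", "judge2", 90.0),
    ("model_a", "doc2", "judge1", 70.0),
    ("model_a", "doc2", "judge2", 70.0),
    ("model_b", "doc1", "judge1", 60.0),
    ("model_b", "doc1", "judge2", 80.0),
]
totals = aggregate(sample)
```

Averaging judges first keeps a single harsh or lenient judge from dominating any one document before the document-level mean is taken.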

Evaluation Framework

Five criteria, weighted for financial use cases

Criterion	Weight	What it measures	Why it matters for finance
Text Accuracy	30%	Character-level fidelity — misspellings, wrong characters, missing words	A single wrong character in a company name or account number can invalidate a document
Table Structure	25%	Columns, rows, headers, merged cells preserved correctly	Financial statements are predominantly tabular — broken tables = broken data
Numeric Precision	20%	Decimals, currencies, amounts, dates exactly correct	$1,000,000 vs $10,000,000: a single extra or transposed digit is catastrophic
Layout Preservation	15%	Reading order, sections, headers, formatting maintained	Regulatory filings have legally significant section ordering
Completeness	10%	All visible text captured — nothing major omitted	Missing footnotes or disclaimers in financial docs creates compliance risk
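One plausible way to combine the five criteria is a plain weighted sum using the weights in the table. The source does not spell out the exact formula the judges applied, so treat this as an assumption; the criterion keys, `weighted_total`, and the example scores are hypothetical.

```python
# Weights from the evaluation framework table.
WEIGHTS = {
    "text_accuracy": 0.30,
    "table_structure": 0.25,
    "numeric_precision": 0.20,
    "layout_preservation": 0.15,
    "completeness": 0.10,
}


def weighted_total(criterion_scores):
    """Combine per-criterion scores (0-100) into a single weighted total."""
    missing = WEIGHTS.keys() - criterion_scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)


# Hypothetical scores from one judge for one extraction.
example = {
    "text_accuracy": 90.0,
    "table_structure": 80.0,
    "numeric_precision": 100.0,
    "layout_preservation": 70.0,
    "completeness": 60.0,
}
total = weighted_total(example)  # 27 + 20 + 20 + 10.5 + 6
```

With completeness weighted at only 10%, an extraction that drops whole sections can still post a high total, which is why the criterion breakdown is worth reading alongside the overall score.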

Recommendations

What to use, and when

Model selection depends on your document mix, deployment constraints, and accuracy requirements. The LLM tier is now the default for cloud-connected workloads.

For cloud / API pipelines

Use Claude Sonnet 4.6 as the primary engine. Best overall score (88.9), highest completeness (90.6), strongest on degraded scans (86.1). Deploy via Anthropic API or OpenRouter. For SEC filings specifically, GPT-5.4 edges ahead at 97.1.

For offline / air-gapped deployment

Use GLM-OCR as the only local model in the LLM performance tier. At 84.5 overall it beats every other local engine by 12+ points, with the best local performance on invoices (91.0) and financial mixed documents (82.9).

For high-volume batch processing

Use docTR as a cost-efficient baseline when LLM API costs are prohibitive. At 71.9 overall it is the best purely local model after GLM-OCR and has strong numeric precision (83.6). Avoid for degraded documents.

What to avoid

EasyOCR's collapse on financial mixed (19.5) and degraded scans (28.6) makes it unsuitable for any financial workflow. Tesseract's SEC filing score (53.2), previously a strength, is now the lowest of all 8 models, and overall it trails Docling (60.9 vs 63.4). Gemini 3.1 Pro is worth monitoring, but its completeness gap (71.8) needs resolution before production use on long-form documents.
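These recommendations amount to a small routing rule. The sketch below encodes them under stated assumptions: the parameter names and document-type labels are hypothetical, and sending cost-sensitive degraded scans to GLM-OCR is an inference from "avoid docTR for degraded documents", not something the benchmark prescribes.

```python
def recommend_engine(online, document_type="general", cost_sensitive=False):
    """Map deployment constraints to an engine, following the guidance above."""
    if not online:
        # Offline / air-gapped: the only local model in the top tier.
        return "GLM-OCR"
    if cost_sensitive:
        # docTR is the low-cost baseline, but it is weak on degraded scans.
        # Assumption: fall back to the other local option in that case.
        return "GLM-OCR" if document_type == "degraded_scan" else "docTR"
    if document_type == "sec_filing":
        # GPT-5.4 edges ahead of Claude on SEC filings (97.1 vs 97.0).
        return "GPT-5.4"
    return "Claude Sonnet 4.6"


choices = {
    "air_gapped": recommend_engine(online=False),
    "cloud_sec": recommend_engine(online=True, document_type="sec_filing"),
    "cloud_default": recommend_engine(online=True),
    "batch_invoices": recommend_engine(True, "invoice", cost_sensitive=True),
    "batch_degraded": recommend_engine(True, "degraded_scan", cost_sensitive=True),
}
```

In practice the thresholds for "cost-sensitive" depend on volume and API pricing, neither of which this benchmark measured.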

Model Architecture Analysis

Why the LLM tier outperforms traditional OCR

The performance gap between LLM-based and traditional OCR is not marginal. It reflects architectural advantages that matter specifically for financial documents, alongside one local exception and one limitation that cuts the other way.

Semantic context understanding

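LLMs understand that "1,234,567" in a balance sheet column is a currency amount, not a phone number or date. Traditional OCR treats every token identically. This semantic awareness shows in the scores: the three API-based LLMs average about 32 points higher on table structure and about 18 points higher on numeric precision than the four traditional engines (docTR, Docling, Tesseract, EasyOCR).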

Noise and degradation robustness

Claude and GLM-OCR score 86.1 and 83.7 on degraded scans while EasyOCR collapses to 28.6. Vision-language models infer missing characters from context; CRAFT+CRNN models fail when pixel patterns are corrupted below detection thresholds.

GLM-OCR: the local exception

GLM-OCR's 0.9B GRPO-trained model competes with frontier LLMs through two optimisations: joint multi-task reinforcement learning across text, table, formula, and KIE tasks, and a PP-DocLayout-V3 stage that separates layout analysis from text recognition. This is why it leads on invoices and financial mixed — categories with dense tabular content.

The completeness gap

Gemini 3.1 Pro's 71.8 completeness score against Claude's 90.6 points to a weakness in how the model handles long documents: it appears to generate confident, accurate text for the regions it attends to while omitting other sections. Omission on long inputs is a known LLM limitation unrelated to OCR capability per se.