Evaluating OCR models on their ability to extract text from financial documents
This benchmark evaluates 8 OCR models (5 local + 3 LLM-based) across 50 financial documents, scored by an ensemble of 5 vision AI judges on 5 weighted criteria.
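Text accuracy (30% weight): character-level fidelity to the source document text
Table structure (25% weight): tables, columns, and rows correctly preserved
Numeric precision (20% weight): numbers, dates, and currency amounts correct
Layout (15% weight): reading order, indentation, and formatting intact
Completeness (10% weight): all visible text captured with no omissions
Total score: weighted average of all criteria (higher is better)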
Each OCR output was scored by 5 vision AI models. Final scores are the mean across all judges.
| Rank | Model | Total Score (weighted avg) | Text Acc. (30% wt) | Table Struct. (25% wt) | Numeric Prec. (20% wt) | Layout (15% wt) | Completeness (10% wt) |
|---|---|---|---|---|---|---|---|
Average scores showing how each judge rates each OCR model. Reveals judge leniency/strictness and model consistency across evaluators.
Across 2,000 evaluations (5 judges × 8 models × 50 documents), judge scoring patterns are explained by two primary effects: the model's overall capability and the judge's inherent strictness. There is minimal evidence of a specific judge systematically favoring or penalizing a particular OCR engine.
In practice, judges agree strongly on relative model ordering — LLM-based models (Gemini, GPT, Claude) generally outrank local models on complex layouts — even though their absolute score scales differ. This means the rankings are robust to evaluator choice.
Why this matters: The final rankings reflect true model quality, not evaluator favoritism.
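One way to check the two-effect claim is an additive decomposition: each judge-model mean is modelled as grand mean + judge effect + model effect, and whatever remains is the judge-model interaction. The sketch below is a minimal illustration under that assumption, not the benchmark's actual analysis; the judge names, model names, and all scores are hypothetical.

```python
from statistics import mean

# Hypothetical mean scores per (judge, model); illustrative values only.
scores = {
    "judge_a": {"model_x": 80.0, "model_y": 60.0},
    "judge_b": {"model_x": 70.0, "model_y": 50.0},
}

def decompose(scores):
    """Split scores into grand mean, judge effects, model effects, and residuals."""
    judges = list(scores)
    models = list(next(iter(scores.values())))
    grand = mean(scores[j][m] for j in judges for m in models)
    # Judge effect: how far a judge's average sits from the grand mean (strictness).
    judge_effect = {j: mean(scores[j][m] for m in models) - grand for j in judges}
    # Model effect: how far a model's average sits from the grand mean (capability).
    model_effect = {m: mean(scores[j][m] for j in judges) - grand for m in models}
    # Residual: what the two main effects fail to explain (judge-model interaction).
    residual = {
        (j, m): scores[j][m] - (grand + judge_effect[j] + model_effect[m])
        for j in judges
        for m in models
    }
    return grand, judge_effect, model_effect, residual

grand, judge_effect, model_effect, residual = decompose(scores)
print("judge effects:", judge_effect)
print("model effects:", model_effect)
print("largest |residual|:", max(abs(r) for r in residual.values()))
```

Residuals close to zero are what "minimal evidence of favoritism" would look like; a large residual for one judge-model pair would point to a judge treating that specific engine differently.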
The 5 judges show meaningful calibration differences: some apply consistently higher absolute scores (lenient) while others compress the scoring range (low discrimination). For example, on degraded scan documents, score spreads between models are amplified by stricter judges and compressed by lenient ones.
Despite these stylistic differences, judge rankings of models are highly correlated. The multi-judge average smooths out individual calibration quirks and produces a more reliable consensus score than any single judge would.
Why this matters: Raw score differences between judges reflect calibration, not unreliability.
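Agreement on ordering despite different score scales can be measured with a rank correlation. The sketch below computes Spearman's rho (Pearson correlation of the ranks) for two judges; it is an assumption that this is a suitable statistic here, since the page does not say which one was used, and the scores are hypothetical.

```python
from statistics import mean

def ranks(values):
    """Average ranks (1 = smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rho; undefined (division by zero) if either list is constant."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-model averages from a lenient and a strict judge.
lenient_judge = [90.0, 85.0, 80.0, 70.0]
strict_judge = [75.0, 60.0, 55.0, 30.0]
rho = spearman(lenient_judge, strict_judge)
print(f"rank agreement: {rho:.2f}")
```

In this toy example the two judges differ by 15 to 40 points in absolute terms yet order the models identically, so rho is 1.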
Judge consensus is tightest on clean SEC filing documents (small variance across judges) and loosest on degraded scans, where partially extracted text leaves more room for interpretation. On doc_041–doc_050, score variance across judges can exceed 15 points for weaker models.
If only one judge can be used, the judge with the most compressed scoring range is, on average, the most consistent, although a compressed range also discriminates less between models. For ranking purposes, averaging all 5 judges produces the lowest uncertainty and is the recommended aggregation method.
Why this matters: Answers "which judge should I trust and why?" — use the ensemble average.
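Judge disagreement on a single document can be summarised by the spread of the five judge scores. The sketch below reports the range and the population standard deviation alongside the ensemble mean; the page does not state which spread statistic it uses, so both are shown as assumptions, and the document labels and scores are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical scores from 5 judges for one OCR model on two documents.
judge_scores = {
    "clean_filing": [88.0, 90.0, 89.0, 91.0, 87.0],
    "degraded_scan": [40.0, 55.0, 35.0, 60.0, 45.0],
}

def summarise(scores):
    """Return the ensemble mean plus two measures of judge disagreement."""
    return {
        "mean": mean(scores),                 # consensus score used for ranking
        "range": max(scores) - min(scores),   # widest gap between any two judges
        "stdev": pstdev(scores),              # population standard deviation
    }

summary = {doc: summarise(s) for doc, s in judge_scores.items()}
for doc, stats in summary.items():
    print(f"{doc}: mean={stats['mean']:.1f} "
          f"range={stats['range']:.1f} stdev={stats['stdev']:.2f}")
```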
Pricing is per million tokens (as of April 2026). For vision/OCR tasks, input tokens include image tokenisation (~750–1,500 tokens per document page). Blended rate = 80% input + 20% output, a typical ratio for OCR workloads. Local models are open-source and self-hosted; API cost is $0 but infrastructure cost varies.
| Model | Type | Input ($/M tokens) | Output ($/M tokens) | Blended* ($/M tokens) | Est. cost/doc (1,500 in + 500 out) | Benchmark Score | Score / $1 blended (higher = better value) |
|---|---|---|---|---|---|---|---|
* Blended = 80 % input + 20 % output. Efficiency (score/$1) calculated on blended rate; open-source models shown as "Free — ∞ value".
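The cost columns follow directly from the per-token prices. The sketch below reproduces the arithmetic as described: the 80/20 blend, the per-document estimate at 1,500 input + 500 output tokens, and score per blended dollar. The prices ($1.00 input, $4.00 output) and the score of 80 are hypothetical placeholders, not figures from the benchmark.

```python
import math

INPUT_SHARE, OUTPUT_SHARE = 0.80, 0.20  # blend stated in the table footnote
TOKENS_IN, TOKENS_OUT = 1_500, 500      # per-document estimate from the column header

def blended_rate(input_price, output_price):
    """Blended $/M tokens from input and output $/M token prices."""
    return INPUT_SHARE * input_price + OUTPUT_SHARE * output_price

def cost_per_doc(input_price, output_price):
    """Estimated dollars per document; prices are per million tokens."""
    return (TOKENS_IN * input_price + TOKENS_OUT * output_price) / 1_000_000

def score_per_dollar(score, blended):
    """Benchmark score per blended dollar; free (self-hosted) models get infinity."""
    return math.inf if blended == 0 else score / blended

# Hypothetical API model: $1.00 input, $4.00 output, benchmark score 80.
example_blended = blended_rate(1.00, 4.00)
example_cost = cost_per_doc(1.00, 4.00)
example_value = score_per_dollar(80.0, example_blended)
print(f"blended=${example_blended:.2f}/M  cost/doc=${example_cost:.4f}  "
      f"score per $1={example_value:.1f}")
```

Note that the per-document estimate uses a 75/25 token split (1,500 in, 500 out), so it is computed from the raw prices rather than from the 80/20 blended rate.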
Chart legend: circles = LLM-based API models; squares = open-source local models.
Average score per model per document category. Highlights where each OCR engine excels or struggles.
| Category |
|---|
| Preview | Doc | Category | Winner |
|---|---|---|---|
The 50 documents were sourced from 4 public datasets: SROIE (receipts), MultiFin (SEC filings), Sujet (financial reports), and Omni OCR Benchmark (general docs), plus deliberately degraded scans to stress-test robustness.
All 8 models were applied to every document: 5 local engines + 3 LLM-based vision models via OpenRouter. Total: 400 extractions.
Each judge received the source image and OCR text, then scored 5 criteria (0–100). Using 5 independent judges reduces single-model bias and anchors rankings in consensus. Total: 2,000 evaluations (5 judges × 8 models × 50 documents).
Final score per model–document pair = arithmetic mean across all 5 judges. Weighted total = 30% text accuracy + 25% table structure + 20% numeric precision + 15% layout + 10% completeness.
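The aggregation can be sketched in a few lines: average each criterion across the judges, then apply the stated weights. The criterion keys and the judge scores below are hypothetical; only the weights come from the text. Because both steps are linear, averaging first and weighting second gives the same total as weighting each judge's scores and then averaging.

```python
from statistics import mean

# Weights as stated in the methodology.
WEIGHTS = {
    "text_accuracy": 0.30,
    "table_structure": 0.25,
    "numeric_precision": 0.20,
    "layout": 0.15,
    "completeness": 0.10,
}

# Hypothetical 0-100 scores from 5 judges for one model-document pair.
judge_scores = [
    {"text_accuracy": 88, "table_structure": 78, "numeric_precision": 100, "layout": 65, "completeness": 60},
    {"text_accuracy": 92, "table_structure": 82, "numeric_precision": 100, "layout": 75, "completeness": 60},
    {"text_accuracy": 90, "table_structure": 80, "numeric_precision": 100, "layout": 70, "completeness": 55},
    {"text_accuracy": 91, "table_structure": 80, "numeric_precision": 100, "layout": 70, "completeness": 65},
    {"text_accuracy": 89, "table_structure": 80, "numeric_precision": 100, "layout": 70, "completeness": 60},
]

def criterion_means(scores):
    """Arithmetic mean of each criterion across all judges."""
    return {c: mean(s[c] for s in scores) for c in WEIGHTS}

def weighted_total(means):
    """Weighted sum of per-criterion means using the stated weights."""
    return sum(WEIGHTS[c] * means[c] for c in WEIGHTS)

means = criterion_means(judge_scores)
total = weighted_total(means)
print(f"per-criterion means: {means}")
print(f"weighted total: {total:.1f}")
```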