Financial Document OCR Benchmark

Model Rankings

Rank	Model	Total Score (weighted avg)	Text Acc. (30% wt)	Table Struct. (25% wt)	Numeric Prec. (20% wt)	Layout (15% wt)	Completeness (10% wt)

Judge vs Model Comparison Matrix

Average score that each judge assigns to each OCR model. The matrix reveals judge leniency or strictness and how consistently each model is rated across evaluators.

Judges score consistently across all 8 OCR models

Across 2,000 evaluations (5 judges × 8 models × 50 documents), judge scoring patterns are explained by two primary effects: the model's overall capability and the judge's inherent strictness. There is minimal evidence of a specific judge systematically favoring or penalizing a particular OCR engine.

In practice, judges agree strongly on relative model ordering — LLM-based models (Gemini, GPT, Claude) generally outrank local models on complex layouts — even though their absolute score scales differ. This means the rankings are robust to evaluator choice.

Why this matters: The final rankings reflect true model quality, not evaluator favoritism.
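The two-effect reading above can be checked with a simple additive decomposition. The following is a minimal sketch, not the benchmark's own analysis code: the function name, the dictionary layout, and the numbers are all assumptions made for illustration. It fits score ≈ grand mean + judge effect + model effect and returns what is left over; residuals near zero are consistent with no judge favoring or penalizing a particular engine.

```python
from statistics import mean

def additive_residuals(scores):
    """scores[judge][model] is the average score that judge gave that model.

    Fits score ~ grand_mean + judge_effect + model_effect and returns the
    residual for every (judge, model) cell.
    """
    judges = list(scores)
    models = list(next(iter(scores.values())))
    grand = mean(scores[j][m] for j in judges for m in models)
    # Judge effect: how lenient or strict a judge is relative to the grand mean.
    judge_effect = {j: mean(scores[j][m] for m in models) - grand for j in judges}
    # Model effect: how strong a model is relative to the grand mean.
    model_effect = {m: mean(scores[j][m] for j in judges) - grand for m in models}
    return {
        (j, m): scores[j][m] - (grand + judge_effect[j] + model_effect[m])
        for j in judges
        for m in models
    }

# Made-up numbers, purely illustrative: judge_a is 10 points more lenient
# than judge_b but both see the same 20-point gap between the models.
example = {
    "judge_a": {"model_x": 80.0, "model_y": 60.0},
    "judge_b": {"model_x": 70.0, "model_y": 50.0},
}
residuals = additive_residuals(example)
```

In this toy case every residual is zero, because leniency and model strength explain the scores completely. A large residual in one cell would be the signature of a judge-specific preference.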

Judges differ in strictness and range, not in relative preferences

The 5 judges show meaningful calibration differences: some apply consistently higher absolute scores (lenient) while others compress the scoring range (low discrimination). For example, on degraded scan documents, score spreads between models are amplified by stricter judges and compressed by lenient ones.

Despite these stylistic differences, judge rankings of models are highly correlated. The multi-judge average smooths out individual calibration quirks and produces a more reliable consensus score than any single judge would.

Why this matters: Raw score differences between judges reflect calibration, not unreliability.
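One way to make "highly correlated rankings despite different scales" concrete is a Spearman rank correlation between two judges' per-model averages. The sketch below uses only the standard library; the helper names and the score lists are hypothetical, and the page does not state which correlation measure was actually used.

```python
from statistics import mean

def ranks(values):
    """Rank values from 1 (lowest) upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Undefined (division by zero) if either list is constant.
    """
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

# Hypothetical per-model averages from a lenient and a strict judge.
lenient = [92.0, 88.0, 85.0, 70.0]
strict = [75.0, 60.0, 52.0, 30.0]
rho = spearman(lenient, strict)
```

Here the absolute scores differ by 17 to 40 points, yet the two judges order the models identically, so the rank correlation is at its maximum.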

Degraded scans surface the largest judge disagreements

Judge consensus is tightest on clean SEC filing documents (small variance across judges) and loosest on degraded scans, where partially extracted text leaves more room for interpretation. On doc_041–doc_050, score variance across judges can exceed 15 points for weaker models.

If only a single judge can be used, the judge with the most compressed scoring range is, on average, the most consistent. For ranking purposes, however, averaging all 5 judges produces the lowest uncertainty and is the recommended aggregation method.

Why this matters: This answers the question "which judge should I trust, and why?" The answer is the ensemble average.
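Judge disagreement on a single model-document pair can be summarized by the spread (max minus min) and the standard deviation of the five judge scores, alongside the ensemble mean. This is a small sketch under assumed data: the score lists are invented, and the page does not specify which dispersion statistic its "variance" figure refers to.

```python
from statistics import mean, pstdev

def judge_disagreement(judge_scores):
    """Return (spread, population standard deviation) for one model-document pair."""
    return max(judge_scores) - min(judge_scores), pstdev(judge_scores)

# Invented scores from 5 judges, for illustration only.
clean_filing = [88.0, 90.0, 89.0, 91.0, 87.0]
degraded_scan = [40.0, 55.0, 62.0, 48.0, 35.0]

clean_spread, clean_sd = judge_disagreement(clean_filing)
degraded_spread, degraded_sd = judge_disagreement(degraded_scan)

# The ensemble score for the degraded document is the plain mean of the judges.
ensemble = mean(degraded_scan)
```

With these numbers the clean document has a 4-point spread and the degraded one a 27-point spread, which mirrors the pattern described above; averaging the five judges gives a single consensus value for the harder document.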

Key Findings

Cost vs Quality Analysis

Pricing per million tokens (as of April 2026). For vision/OCR tasks, input tokens include image tokenisation (~750–1 500 tokens per document page). Blended rate = 80 % input + 20 % output — typical ratio for OCR workloads. Local models are open-source and self-hosted; API cost is $0 but infrastructure cost varies.

Model	Type	Input ($/M tokens)	Output ($/M tokens)	Blended* ($/M tokens)	Est. cost/doc (1 500 in + 500 out)	Benchmark Score	Score / $1 blended (higher = better value)

* Blended = 80 % input + 20 % output. Efficiency (score/$1) calculated on blended rate; open-source models shown as "Free — ∞ value".
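The cost columns follow from three small formulas. The sketch below writes them out; the function names and the example prices are hypothetical and are not taken from any row of the table.

```python
def blended_rate(input_per_m, output_per_m, input_share=0.8):
    """Blended $/M tokens: 80 % input + 20 % output by default."""
    return input_share * input_per_m + (1 - input_share) * output_per_m

def cost_per_doc(input_per_m, output_per_m, tokens_in=1500, tokens_out=500):
    """Estimated $ per document for 1 500 input and 500 output tokens."""
    return (tokens_in * input_per_m + tokens_out * output_per_m) / 1_000_000

def score_per_dollar(score, blended):
    """Benchmark score per $1 of blended rate; free models get infinite value."""
    return float("inf") if blended == 0 else score / blended

# Hypothetical prices: $2.00 per million input tokens, $10.00 per million output.
blended = blended_rate(2.00, 10.00)         # about 3.60 $/M tokens
per_doc = cost_per_doc(2.00, 10.00)         # about $0.008 per document
value = score_per_dollar(90.0, blended)     # about 25 score points per $1
free_value = score_per_dollar(90.0, 0.0)    # infinite, as for self-hosted models
```

Note that the per-document estimate uses the token counts from the column header (1 500 in, 500 out) directly, which is a 75/25 split rather than the 80/20 mix used for the blended rate, so the two columns are not simple multiples of each other.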

[Chart: cost vs quality by model. Circles = LLM-based API models | Squares = open-source local models.]

Category Breakdown

Average score per model per document category. Highlights where each OCR engine excels or struggles.

Per-Document Comparison

Preview	Doc	Category	Winner

How the Benchmark Was Built

Collected 50 Diverse Financial Documents

Sourced from 4 public datasets: SROIE (receipts), MultiFin (SEC filings), Sujet (financial reports), Omni OCR Benchmark (general docs), plus deliberately degraded scans to stress-test robustness.

SROIE receipts ×10 | MultiFin SEC filings ×10 | Sujet financial ×5 | Omni general ×15 | Degraded scans ×10

Ran 8 OCR Models on Each Document

Applied all 8 models to every document: 5 local engines + 3 LLM-based vision models via OpenRouter. Total: 400 extractions.

GLM-OCR (open-source, local) | docTR (Mindee) | Docling (IBM) | Tesseract 5 | EasyOCR | Gemini 3.1 Pro | GPT-5.4 | Claude Sonnet 4.6

Sent Outputs to 5 Independent AI Judges

Each judge received the source image and OCR text, then scored 5 criteria (0–100). Using 5 independent judges reduces single-model bias and anchors rankings in consensus. Total: 2,000 evaluations (5 judges × 8 models × 50 documents).

Gemini 2.5 Flash | Claude Sonnet | GPT-4.1 Mini | Qwen 2.5 VL 72B | DeepSeek V3

Aggregated Scores Across Judges

Final score per model–document pair = arithmetic mean across all 5 judges. Weighted total = 30% text accuracy + 25% table structure + 20% numeric precision + 15% layout + 10% completeness.
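The weighting and the judge average can be written out as follows. This is a minimal sketch: the weights and criterion keys come from the rubric, while the function names and the sample scores are assumptions, and it recomputes each weighted total from the five criteria rather than trusting a judge-reported total.

```python
from statistics import mean

# Criterion weights as stated in the scoring rubric.
WEIGHTS = {
    "text_accuracy": 0.30,
    "table_structure": 0.25,
    "numeric_precision": 0.20,
    "layout_preservation": 0.15,
    "completeness": 0.10,
}

def weighted_total(criteria):
    """Weighted sum of one judge's five criterion scores (each 0-100)."""
    return sum(WEIGHTS[name] * criteria[name] for name in WEIGHTS)

def pair_score(judge_results):
    """Final score for one model-document pair: mean weighted total across judges."""
    return mean(weighted_total(result) for result in judge_results)

# Invented scores from two judges (the benchmark uses five).
judge_1 = {"text_accuracy": 90, "table_structure": 80, "numeric_precision": 100,
           "layout_preservation": 70, "completeness": 60}
judge_2 = {"text_accuracy": 80, "table_structure": 80, "numeric_precision": 80,
           "layout_preservation": 80, "completeness": 80}

total_1 = weighted_total(judge_1)          # 27 + 20 + 20 + 10.5 + 6 = 83.5
final = pair_score([judge_1, judge_2])     # mean of 83.5 and 80.0 = 81.75
```

Because the weights sum to 1, a judge who gives the same score on every criterion produces a weighted total equal to that score.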

2,000 evaluations | Mean aggregation | Weighted scoring | Category breakdowns

Judge Prompt Structure

Each vision judge received the following prompt template per evaluation:

You are evaluating the quality of OCR (Optical Character Recognition) output.
SOURCE IMAGE: [attached image of the document]
OCR OUTPUT: [extracted text from the OCR model]
Score the OCR output on the following criteria (0-100 each):
1. text_accuracy (30% weight): How accurately does the extracted text match the source?
2. table_structure (25% weight): Are tables, rows, and columns correctly preserved?
3. numeric_precision (20% weight): Are all numbers, dates, and amounts correct?
4. layout_preservation (15% weight): Is reading order and formatting maintained?
5. completeness (10% weight): Is all visible text captured with no omissions?
Return a JSON object with keys: text_accuracy, table_structure, numeric_precision, layout_preservation, completeness, total_score (weighted sum), notes (brief explanation).
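A judge reply in the JSON shape requested by this prompt could be parsed and sanity-checked along these lines. This is a sketch only: the function name, the tolerance, and the sample reply are assumptions, and the page does not describe how replies were actually validated.

```python
import json

# Criterion weights as stated in the prompt.
WEIGHTS = {
    "text_accuracy": 0.30,
    "table_structure": 0.25,
    "numeric_precision": 0.20,
    "layout_preservation": 0.15,
    "completeness": 0.10,
}

def parse_judge_reply(raw, tolerance=1.0):
    """Parse a judge's JSON reply and check it against the prompt's contract.

    Raises KeyError if a criterion is missing and ValueError if a score is
    outside 0-100. Adds the recomputed weighted sum and a flag saying whether
    the judge's own total_score agrees with it within `tolerance` points.
    """
    data = json.loads(raw)
    for key in WEIGHTS:
        value = data[key]
        if not isinstance(value, (int, float)) or not 0 <= value <= 100:
            raise ValueError(f"{key} must be a number in 0-100, got {value!r}")
    recomputed = sum(WEIGHTS[key] * data[key] for key in WEIGHTS)
    data["recomputed_total"] = recomputed
    data["total_matches"] = abs(data["total_score"] - recomputed) <= tolerance
    return data

# Invented reply, for illustration only.
reply = (
    '{"text_accuracy": 90, "table_structure": 80, "numeric_precision": 100, '
    '"layout_preservation": 70, "completeness": 60, "total_score": 83.5, '
    '"notes": "minor column drift in the second table"}'
)
parsed = parse_judge_reply(reply)
```

Recomputing the weighted sum guards against arithmetic slips in the judge's own total_score, which is a plausible failure mode when a language model is asked to do the weighting itself.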

Test Documents (50 total)

invoice_receipt (10) — SROIE: real-world retail receipt images
sec_filing (10) — MultiFin: scanned SEC 10-K and 10-Q pages
financial_mixed (5) — Sujet: mixed financial statement images
general_doc (15) — Omni OCR Benchmark: varied document types
degraded_scan (10) — SROIE receipts with added blur, noise, and compression

Aggregation Method

Final score per model–document pair = arithmetic mean of total_score across all 5 judges. Category averages pool all documents of that type. Overall model score = unweighted mean across all 50 documents.
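The pooling step can be sketched as below for a single model. The function name, document ids, and scores are hypothetical; only the aggregation rules (category averages pool documents of that type, the overall score is an unweighted mean over documents) come from the description above.

```python
from collections import defaultdict
from statistics import mean

def summarize(pair_scores, categories):
    """Aggregate one model's per-document scores.

    pair_scores: {doc_id: final score for this model on that document}
    categories:  {doc_id: category name}
    Returns (category averages, overall unweighted mean across documents).
    """
    by_category = defaultdict(list)
    for doc_id, score in pair_scores.items():
        by_category[categories[doc_id]].append(score)
    category_avg = {name: mean(values) for name, values in by_category.items()}
    overall = mean(list(pair_scores.values()))
    return category_avg, overall

# Invented scores for three documents, for illustration only.
scores = {"doc_a": 90.0, "doc_b": 80.0, "doc_c": 40.0}
cats = {"doc_a": "invoice_receipt", "doc_b": "invoice_receipt", "doc_c": "degraded_scan"}
category_avg, overall = summarize(scores, cats)
```

The overall score is taken over documents, not over category averages, so larger categories carry more weight: here the overall mean is 70.0, whereas the mean of the two category averages (85.0 and 40.0) would be 62.5.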

Scoring Rubric

Criterion	Weight	What's measured
text_accuracy	30%	Character-level fidelity to source
table_structure	25%	Tables, columns, rows preserved
numeric_precision	20%	Numbers, dates, amounts correct
layout_preservation	15%	Reading order and formatting
completeness	10%	All text captured, no omissions
