Financial Document OCR Benchmark

Evaluating OCR models on their ability to extract text from financial documents

This benchmark evaluates 8 OCR models (5 local + 3 LLM-based) across 50 financial documents, scored by an ensemble of 5 vision AI judges on 5 weighted criteria.

Note: Documents span 5 categories: receipts, SEC filings, financial mixed, general docs, and degraded scans.
Missing scores indicate that no output was available for that model–document pair.
Scoring Criteria Explained

Text Accuracy (30%): Character-level fidelity to the source document text
Table Structure (25%): Tables, columns, and rows correctly preserved
Numeric Precision (20%): Numbers, dates, and currency amounts correct
Layout Preservation (15%): Reading order, indentation, and formatting intact
Completeness (10%): All visible text captured with no omissions
Total Score (100%): Weighted average of all criteria (higher is better)

AI Judges Used

Each OCR output was scored by 5 vision AI models. Final scores are the mean across all judges.

Model Rankings

Rank | Model | Total Score (weighted avg) | Text Acc. (30% wt) | Table Struct. (25% wt) | Numeric Prec. (20% wt) | Layout (15% wt) | Completeness (10% wt)


Judge vs Model Comparison Matrix

Average scores showing how each judge rates each OCR model. Reveals judge leniency/strictness and model consistency across evaluators.

Judges score consistently across all 8 OCR models

Across 2,000 evaluations (5 judges × 8 models × 50 documents), judge scoring patterns are explained by two primary effects: the model's overall capability and the judge's inherent strictness. There is minimal evidence of a specific judge systematically favoring or penalizing a particular OCR engine.

In practice, judges agree strongly on relative model ordering — LLM-based models (Gemini, GPT, Claude) generally outrank local models on complex layouts — even though their absolute score scales differ. This means the rankings are robust to evaluator choice.

Why this matters: The final rankings reflect true model quality, not evaluator favoritism.
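The "two primary effects" reading above can be checked with a simple additive decomposition: each cell of the judge-by-model matrix is modelled as grand mean + judge effect + model effect, and whatever is left over is the judge-specific preference. The sketch below is illustrative only. The judge and model names and all scores are hypothetical, and the source does not say which statistical method was actually used.

```python
from statistics import mean

def additive_effects(scores):
    """Split a judge-by-model score matrix into main effects and residuals.

    scores[judge][model] is the average score that judge gave that model.
    """
    judges = list(scores)
    models = list(next(iter(scores.values())))
    grand = mean(scores[j][m] for j in judges for m in models)
    # Judge effect: how lenient (+) or strict (-) a judge is overall.
    judge_eff = {j: mean(scores[j][m] for m in models) - grand for j in judges}
    # Model effect: how strong (+) or weak (-) a model is overall.
    model_eff = {m: mean(scores[j][m] for j in judges) - grand for m in models}
    # Residual: what is left after both main effects, i.e. any
    # judge-specific preference for or against a particular model.
    resid = {
        (j, m): scores[j][m] - (grand + judge_eff[j] + model_eff[m])
        for j in judges
        for m in models
    }
    return grand, judge_eff, model_eff, resid

# Hypothetical matrix: judge_b is uniformly 10 points stricter than judge_a.
scores = {
    "judge_a": {"model_x": 90.0, "model_y": 70.0, "model_z": 50.0},
    "judge_b": {"model_x": 80.0, "model_y": 60.0, "model_z": 40.0},
}
grand, judge_eff, model_eff, resid = additive_effects(scores)
max_resid = max(abs(r) for r in resid.values())
```

In this toy matrix the residuals are all zero, which is what "no judge favors a particular engine" looks like in the extreme case. Large residuals concentrated on one judge–model pair would suggest favoritism.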

Judges differ in strictness and range, not in relative preferences

The 5 judges show meaningful calibration differences: some apply consistently higher absolute scores (lenient) while others compress the scoring range (low discrimination). For example, on degraded scan documents, score spreads between models are amplified by stricter judges and compressed by lenient ones.

Despite these stylistic differences, judge rankings of models are highly correlated. The multi-judge average smooths out individual calibration quirks and produces a more reliable consensus score than any single judge would.

Why this matters: Raw score differences between judges reflect calibration, not unreliability.
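One way to see that calibration differences leave rankings intact is a rank correlation between two judges' model scores. The sketch below implements Spearman's rho in plain Python as an illustration. The score lists are hypothetical, and the source does not state which correlation measure it used.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Assumes each list contains at least two distinct values.
    """
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

# Hypothetical average scores for four models from two judges.
lenient_judge = [92.0, 88.0, 85.0, 80.0]  # high scores, narrow range
strict_judge = [75.0, 60.0, 52.0, 30.0]   # lower scores, wide range
rho = spearman(lenient_judge, strict_judge)
```

Here the two judges use very different scales, yet `rho` is 1.0 because they order the four models identically.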

Degraded scans surface the largest judge disagreements

Judge consensus is tightest on clean SEC filing documents (small variance across judges) and loosest on degraded scans, where partially extracted text leaves more room for interpretation. On doc_041 through doc_050, score variance across judges can exceed 15 points for weaker models.

If only one judge can be used, the judge with the narrowest scoring range on average is the most consistent choice. For ranking purposes, however, averaging all 5 judges yields the lowest uncertainty and is the recommended aggregation method.

Why this matters: The answer to "which judge should I trust, and why?" is to use the ensemble average.
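Judge disagreement on a single document can be measured directly from the five per-judge scores. The text does not say whether the 15-point figure refers to a range or a standard deviation, so this sketch reports both. The document IDs are reused only as labels, and the scores are hypothetical.

```python
from statistics import pstdev

def judge_disagreement(scores_by_doc):
    """For each document, summarise how far apart the judges' scores are."""
    summary = {}
    for doc_id, scores in scores_by_doc.items():
        summary[doc_id] = {
            "range": max(scores) - min(scores),  # widest gap between two judges
            "stdev": pstdev(scores),             # population standard deviation
        }
    return summary

# Hypothetical total scores from 5 judges for one model on two documents.
example = {
    "doc_012": [88.0, 90.0, 87.0, 89.0, 91.0],  # clean document, tight consensus
    "doc_047": [40.0, 58.0, 45.0, 62.0, 51.0],  # degraded scan, loose consensus
}
disagreement = judge_disagreement(example)

# Flag documents where the judges are more than 15 points apart.
flagged = [doc for doc, stats in disagreement.items() if stats["range"] > 15]
```

With these illustrative numbers only the degraded-scan document is flagged.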

Key Findings

    Cost vs Quality Analysis

    Pricing per million tokens (as of April 2026). For vision/OCR tasks, input tokens include image tokenisation (~750–1,500 tokens per document page). Blended rate = 80% input + 20% output, a typical ratio for OCR workloads. Local models are open-source and self-hosted; API cost is $0 but infrastructure cost varies.

    Model | Type | Input ($/M tokens) | Output ($/M tokens) | Blended* ($/M tokens) | Est. cost/doc (1,500 in + 500 out) | Benchmark Score | Score / $1 blended (higher = better value)

    * Blended = 80% input + 20% output. Efficiency (score/$1) is calculated on the blended rate; open-source models are shown as "Free — ∞ value".
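The cost columns follow from simple arithmetic, sketched here under the stated 80/20 blend and the 1,500-in / 500-out token assumption. The prices and score in the example are hypothetical placeholders, not figures from the benchmark, and the function names are illustrative.

```python
def blended_rate(input_per_m, output_per_m, input_share=0.8):
    """Blended $/M tokens: 80% input price + 20% output price by default."""
    return input_share * input_per_m + (1 - input_share) * output_per_m

def cost_per_doc(input_per_m, output_per_m, tokens_in=1500, tokens_out=500):
    """Estimated $ per document from token counts and per-million prices."""
    return (tokens_in * input_per_m + tokens_out * output_per_m) / 1_000_000

def score_per_dollar(score, blended):
    """Benchmark score per $1 of blended rate; free models get infinity."""
    return float("inf") if blended == 0 else score / blended

# Hypothetical API model: $2.00 per M input tokens, $10.00 per M output tokens.
rate = blended_rate(2.00, 10.00)        # 0.8 * 2.00 + 0.2 * 10.00
per_doc = cost_per_doc(2.00, 10.00)     # (1500 * 2.00 + 500 * 10.00) / 1e6
value = score_per_dollar(72.0, rate)    # hypothetical score of 72
local_value = score_per_dollar(72.0, 0.0)  # self-hosted model, $0 API cost
```

Note that the per-document estimate uses the fixed token counts rather than the blended rate, so the two columns are related but not interchangeable.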

    [Chart: cost vs quality. Circles = LLM-based API models; squares = open-source local models.]

    Category Breakdown

    Average score per model per document category. Highlights where each OCR engine excels or struggles.

    Category

    Per-Document Comparison


    Preview | Doc | Category | Winner

    How the Benchmark Was Built

    1. Collected 50 Diverse Financial Documents

    Sourced from 4 public datasets: SROIE (receipts), MultiFin (SEC filings), Sujet (financial reports), Omni OCR Benchmark (general docs), plus deliberately degraded scans to stress-test robustness.

    SROIE receipts ×10 | MultiFin SEC filings ×10 | Sujet financial ×5 | Omni general ×15 | Degraded scans ×10

    2. Ran 8 OCR Models on Each Document

    Applied all 8 models to every document: 5 local engines + 3 LLM-based vision models via OpenRouter. Total: 400 extractions.

    GLM-OCR (open-source, local) | docTR (Mindee) | Docling (IBM) | Tesseract 5 | EasyOCR | Gemini 3.1 Pro | GPT-5.4 | Claude Sonnet 4.6

    3. Sent Outputs to 5 Independent AI Judges

    Each judge received the source image and OCR text, then scored 5 criteria (0–100). Using 5 independent judges reduces single-model bias and anchors rankings in consensus. Total: 2,000 evaluations (5 judges × 8 models × 50 documents).

    Gemini 2.5 Flash | Claude Sonnet | GPT-4.1 Mini | Qwen 2.5 VL 72B | DeepSeek V3

    4. Aggregated Scores Across Judges

    Final score per model–document pair = arithmetic mean across all 5 judges. Weighted total = 30% text accuracy + 25% table structure + 20% numeric precision + 15% layout + 10% completeness.

    2,000 evaluations | Mean aggregation | Weighted scoring | Category breakdowns
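The aggregation in step 4 can be sketched as follows: average each criterion across judges, then apply the published weights. The weights come from the text above. The dictionary keys, function names, and example scores are hypothetical, and two judges are used instead of five to keep the example short.

```python
from statistics import mean

# Criterion weights as published in the scoring criteria.
WEIGHTS = {
    "text_accuracy": 0.30,
    "table_structure": 0.25,
    "numeric_precision": 0.20,
    "layout": 0.15,
    "completeness": 0.10,
}

def weighted_total(criteria):
    """Weighted total (0-100) from per-criterion scores (each 0-100)."""
    return sum(WEIGHTS[name] * criteria[name] for name in WEIGHTS)

def aggregate(judge_scores):
    """Average each criterion across judges, then compute the weighted total."""
    means = {name: mean(js[name] for js in judge_scores) for name in WEIGHTS}
    return means, weighted_total(means)

# Hypothetical scores from two judges for one model-document pair.
judge_scores = [
    {"text_accuracy": 90.0, "table_structure": 80.0, "numeric_precision": 70.0,
     "layout": 60.0, "completeness": 100.0},
    {"text_accuracy": 80.0, "table_structure": 60.0, "numeric_precision": 90.0,
     "layout": 80.0, "completeness": 80.0},
]
means, total = aggregate(judge_scores)
```

Because both steps are linear, averaging across judges first and weighting second gives the same total as weighting each judge's scores and then averaging.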