A comprehensive toolkit for benchmarking various OCR (Optical Character Recognition) methods against ground truth data from FUNSD dataset annotations.
This repository contains tools to:
- Sample datasets for reproducible benchmarking
- Generate ground truth from FUNSD document annotations
- Run and evaluate various OCR methods
- Visualize and compare results
The benchmark uses the FUNSD (Form Understanding in Noisy Scanned Documents) dataset, which consists of noisy scanned forms with annotations for text, layout, and form understanding tasks.
The full dataset is organized as follows:

```
dataset/
  testing_data/
    images/       # PNG files of scanned forms
    annotations/  # JSON annotations with text fields, bounding boxes, etc.
```
The benchmarking toolkit can create reproducible samples from this dataset:

```
dataset/
  sample/
    images/            # Sampled subset of images
    annotations/       # Corresponding annotations
    ground_truth.json  # Extracted text for evaluation
    sample_info.json   # Sampling parameters for reproducibility
```
This benchmark compares the following OCR methods:
| Method | Type | Description | Requirements |
|---|---|---|---|
| Tesseract | Traditional | Industry-standard open-source OCR engine | pip install pytesseract opencv-python |
| EasyOCR | Deep Learning | Multi-language OCR using CRAFT text detector and CRNN recognizer | pip install easyocr |
| PaddleOCR | Deep Learning | Efficient OCR system by Baidu | pip install paddlepaddle paddleocr |
| DocTR | Deep Learning | Document Text Recognition library by Mindee | pip install python-doctr |
| Docling | Deep Learning | Document processor with OCR capabilities | pip install docling |
| KerasOCR | Deep Learning | Uses CRAFT text detector and Keras CRNN recognizer | pip install keras-ocr |
| Amazon Textract | Cloud API | AWS OCR service for documents | AWS credentials + pip install boto3 |
| VLM Models | Vision-Language | OpenRouter vision models (Qwen, Mistral, Pixtral) | OpenRouter API key + pip install openai |
- Python 3.8+
- Tesseract OCR installed on your system (if using that method)
- Required Python packages (install with `pip install -r requirements.txt`)

1. Clone this repository:

```
git clone https://github.com/yourusername/ocr_benchmarking.git
cd ocr_benchmarking
```

2. Install dependencies:

```
make install
```
To create a reproducible sample of the dataset:

```
make sample SEED=42 SAMPLE_SIZE=10
```

This will randomly select 10 images from the dataset using seed 42 for reproducibility.
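The core idea behind reproducible sampling is to draw from a random generator seeded explicitly, so the same seed always selects the same files. A minimal sketch of that idea (function and directory names here are illustrative, not `sample_dataset.py`'s actual API):

```python
import random
import shutil
from pathlib import Path

def sample_images(source_dir: str, dest_dir: str, sample_size: int, seed: int) -> list:
    """Copy a reproducible random subset of images into dest_dir/images."""
    # Sort candidates first so their order is deterministic across filesystems,
    # then draw from a generator seeded independently of the global RNG.
    images = sorted(Path(source_dir, "images").glob("*.png"))
    chosen = random.Random(seed).sample(images, sample_size)

    dest = Path(dest_dir, "images")
    dest.mkdir(parents=True, exist_ok=True)
    for img in chosen:
        shutil.copy(img, dest / img.name)
    return sorted(img.name for img in chosen)
```

Running this twice with the same seed over the same source directory yields the same file list, which is what makes benchmark runs comparable.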
Generate ground truth text from the FUNSD annotations:

```
make ground-truth
```

For more control over how text is extracted from annotations, you can run the script directly:

```
python generate_ground_truth.py \
  --annotations-dir dataset/sample/annotations \
  --output-file dataset/sample/ground_truth.json \
  --vertical-tolerance 15
```

The `--vertical-tolerance` parameter controls how text elements are grouped into lines based on their vertical position. A higher value will group more elements into the same line, while a lower value will create more separate lines.
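The grouping idea behind `--vertical-tolerance` can be sketched roughly as follows. This is a simplified illustration over assumed `(text, x, y)` word tuples, not the script's actual implementation:

```python
def group_into_lines(words, vertical_tolerance=15):
    """Group (text, x, y) word boxes into text lines.

    A word joins the current line when its y coordinate is within
    `vertical_tolerance` pixels of the line's last word; otherwise it
    starts a new line. Words are ordered left-to-right within a line.
    """
    lines = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(y - lines[-1][-1][2]) <= vertical_tolerance:
            lines[-1].append((text, x, y))
        else:
            lines.append([(text, x, y)])
    return [" ".join(t for t, _, _ in sorted(line, key=lambda w: w[1]))
            for line in lines]
```

With a large tolerance, slightly skewed words on the same form row merge into one line; with a very small tolerance, they split into separate lines.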
Use the `--debug` flag to see detailed information about how text elements are grouped into lines:

```
python generate_ground_truth.py --debug
```

You can also generate ground truth using Vision-Language Models (VLMs) via OpenRouter:

```
make ground-truth-vlm
```

This requires an OpenRouter API key set in your environment:

```
export OPENROUTER_API_KEY="your_api_key"
```

You can specify a different VLM model:

```
make ground-truth-vlm VLM_MODEL="anthropic/claude-3-5-sonnet"
```

Or run the script directly with more options:

```
python generate_vlm_ground_truth.py \
  --image-dir dataset/sample/images \
  --output-file dataset/sample/ground_truth_vlm.json \
  --model "anthropic/claude-3-5-sonnet" \
  --retries 3
```

You can evaluate OCR methods against different ground truth sources:
```
# Evaluate against annotation-based ground truth
make eval

# Evaluate against VLM-based ground truth
make eval-vlm
```

This allows you to compare how different OCR methods perform against different ground truth standards.
Run the benchmark with selected OCR methods:

```
make benchmark METHODS="tesseract easyocr paddleocr"
```

This will automatically save the results to a file named after your sample directory (e.g., results/result_sample_test.json).
If you just want to extract text from images without evaluating against a ground truth:

```
make only-extract-text METHODS="tesseract"
```

This will save the results to a file named `result_<sample_name>.json` in the results directory. For example, if your sample directory is dataset/sample_test, the results will be saved to results/result_sample_test.json.
You can specify a different sample directory:

```
make only-extract-text METHODS="tesseract" SAMPLE_DIR="dataset/my_custom_sample"
```

This is useful when you want to:
- Process images with OCR methods without having ground truth available
- Generate OCR results to share with others
- Batch process a set of images with multiple OCR methods
To evaluate previously saved results against a ground truth:

```
make eval
```

By default, this uses the results file based on your sample directory name. You can specify a different sample directory or results file:
```
# Evaluate results for a specific sample directory
make eval SAMPLE_DIR="dataset/my_custom_sample"

# Evaluate a specific results file against the ground truth
make eval RESULT_FILE="result_custom.json"
```

To evaluate against VLM-generated ground truth instead:

```
make eval-vlm
```

The same options apply for specifying custom sample directories or result files:

```
make eval-vlm SAMPLE_DIR="dataset/my_custom_sample"
make eval-vlm RESULT_FILE="result_custom.json"
```

Run the entire pipeline (sample, ground truth, and benchmark):
```
make all
```

Equivalently, run the underlying scripts step by step:

```
python sample_dataset.py --source-dir dataset/testing_data --dest-dir dataset/sample --sample-size 10 --seed 42

python generate_ground_truth.py --annotations-dir dataset/sample/annotations --output-file dataset/sample/ground_truth.json

python run_benchmark.py --image-dir dataset/sample/images --ground-truth dataset/sample/ground_truth.json --methods tesseract easyocr paddleocr
```

You can save OCR results to avoid reprocessing images:
```
python run_benchmark.py --save-results --results-file my_results.json
```

Load saved results and evaluate:

```
python run_benchmark.py --load-results --results-file my_results.json
```

Evaluate only (same as --load-results):

```
python run_benchmark.py --eval-only --results-file my_results.json
```

The benchmark evaluates OCR methods using several complementary metrics to provide a comprehensive understanding of performance:
**Description:** Measures how similar the extracted text is to the ground truth using Python's `difflib.SequenceMatcher`.

**Calculation:** The ratio of matching elements to the total number of elements in both sequences, returning a value between 0.0 (no similarity) and 1.0 (perfect match).

**Use case:** Provides a general measure of overall text similarity that accounts for additions, deletions, and substitutions.
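Since the metric is built on the standard library, it can be reproduced in a few lines:

```python
from difflib import SequenceMatcher

def similarity_ratio(extracted: str, reference: str) -> float:
    """Ratio of matching elements to total elements in both strings (0.0-1.0)."""
    return SequenceMatcher(None, extracted, reference).ratio()

# A single dropped period still scores close to 1.0:
print(similarity_ratio("Invoice No 1234", "Invoice No. 1234"))
```

Note that the ratio is symmetric in length but not in argument order in general, so the benchmark should pass arguments consistently (OCR output first, ground truth second, or vice versa) across all methods.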
**Description:** A common metric in speech recognition and OCR that measures the edit distance between words.

**Calculation:**

```
WER = (S + D + I) / N
```

Where:
- S = number of substituted words
- D = number of deleted words
- I = number of inserted words
- N = total number of words in the reference text

**Use case:** Lower is better. WER is particularly useful for assessing how many word-level corrections would be needed to transform the OCR output into the ground truth.
**Description:** Similar to WER but at the character level, which provides finer-grained assessment.

**Calculation:**

```
CER = (S + D + I) / N
```

Where:
- S = number of substituted characters
- D = number of deleted characters
- I = number of inserted characters
- N = total number of characters in the reference text

**Use case:** Lower is better. CER is useful for languages where word boundaries are not clear or when character-level accuracy is important.
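Both formulas reduce to a Levenshtein edit distance over the appropriate units, words for WER and characters for CER. A minimal reference implementation (illustrating the formulas, not necessarily the benchmark's exact code):

```python
def levenshtein(ref, hyp):
    """Minimum substitutions + deletions + insertions turning hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution (free if equal)
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over word sequences, divided by N words."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over characters, divided by N chars."""
    return levenshtein(list(reference), list(hypothesis)) / len(reference)
```

Because N is the reference length, both metrics can exceed 1.0 when the OCR output contains many spurious insertions.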
**Description:** Percentage of reference words that appear in the extracted text.

**Calculation:** The ratio of words from the ground truth that appear in the OCR output, regardless of order or frequency.

**Use case:** Useful for scenarios where the presence of key terms is more important than their exact positioning or order.
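One plausible set-based reading of this metric ("regardless of order or frequency") is the following sketch; the benchmark's actual tokenization and case handling may differ:

```python
def word_match_rate(reference: str, extracted: str) -> float:
    """Fraction of distinct ground-truth words found anywhere in the OCR output."""
    ref_words = set(reference.lower().split())
    if not ref_words:
        return 0.0  # avoid division by zero on empty ground truth
    return len(ref_words & set(extracted.lower().split())) / len(ref_words)
```

Because it ignores order and repetition, this score can be high even when the OCR output scrambles the document layout badly.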
**Description:** How long each method takes to process images.

**Calculation:** Measured in seconds per image.

**Use case:** Important for real-time applications or when processing large volumes of documents.
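Per-image timing needs nothing more than a monotonic clock around each call; roughly:

```python
import time

def timed_ocr(method, image_paths):
    """Run an OCR callable over images, recording seconds per image.

    `method` is any function mapping an image path to extracted text,
    e.g. an entry from the OCR_METHODS dictionary.
    """
    texts, seconds = {}, {}
    for path in image_paths:
        start = time.perf_counter()  # monotonic; unaffected by wall-clock changes
        texts[path] = method(path)
        seconds[path] = time.perf_counter() - start
    return texts, seconds
```

Note that the first image processed by a deep-learning method often pays a one-time model-loading cost, so per-image averages are more meaningful after a warm-up run.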
When interpreting the benchmark results, consider:

- **Task-specific priorities**: For some applications, accuracy might be more important than speed, while for others, the inverse might be true.
- **Document type sensitivity**: Some OCR methods perform better on certain document types (handwritten, printed, forms, etc.).
- **Language considerations**: Performance can vary significantly depending on the language and script.
- **Error patterns**: Look beyond the raw metrics to understand the types of errors each method makes.
- **Ground truth quality**: Remember that the evaluation is only as good as the ground truth it's compared against. FUNSD annotations and VLM-generated ground truths may have different characteristics.
The benchmark generates several visualizations to help interpret results:

- **Comparison charts**: Bar charts comparing all OCR methods across each metric.
- **Heatmaps**: Show where each method excels or struggles.
- **Time vs. accuracy plots**: Help identify methods that best balance speed and accuracy.
- **Per-image results**: Detailed metrics for each image to identify patterns based on document type or complexity.
To add a new OCR method, edit the `ocr_methods.py` file and add a function with the following signature:

```python
def ocr_your_method(image_path: str) -> str:
    """Extract text from image using your method.

    Installation: !pip install your-requirements

    Args:
        image_path (str): Path to the image file

    Returns:
        str: Extracted text from the image
    """
    # Your implementation here
    return extracted_text
```

Then add it to the OCR_METHODS dictionary at the bottom of the file:
```python
OCR_METHODS = {
    # Existing methods...
    "your_method": ocr_your_method,
}
```

To use cloud OCR services, set the required environment variables:
```
# Azure Computer Vision
export AZURE_VISION_KEY="your_api_key"
export AZURE_VISION_ENDPOINT="your_endpoint"

# Amazon Textract
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_REGION_NAME="your_region"

# OpenRouter (VLM models)
export OPENROUTER_API_KEY="your_api_key"
```

Results are saved in the results/ directory:
- Individual JSON files with extracted text for each method
- Complete OCR results with extracted text and processing times
- Visualization plots comparing performance metrics
- Similarity scores compared to ground truth
- A summary table highlighting the best-performing methods
- JSON export of evaluation metrics
This project is licensed under the MIT License - see the LICENSE file for details.
- FUNSD dataset - For providing the dataset of forms
- Various OCR libraries and their maintainers