A comprehensive toolkit for benchmarking various OCR (Optical Character Recognition) methods against ground truth data from FUNSD dataset annotations.
This repository contains tools to:
- Sample datasets for reproducible benchmarking
- Generate ground truth from FUNSD document annotations
- Run and evaluate various OCR methods
- Visualize and compare results
The benchmark uses the FUNSD (Form Understanding in Noisy Scanned Documents) dataset, which consists of noisy scanned forms with annotations for text, layout, and form understanding tasks.
The full dataset is organized as follows:

```
dataset/
  testing_data/
    images/       # PNG files of scanned forms
    annotations/  # JSON annotations with text fields, bounding boxes, etc.
```
The benchmarking toolkit can create reproducible samples from this dataset:

```
dataset/
  sample/
    images/            # Sampled subset of images
    annotations/       # Corresponding annotations
    ground_truth.json  # Extracted text for evaluation
    sample_info.json   # Sampling parameters for reproducibility
```
This benchmark compares the following OCR methods:
| Method | Type | Description | Requirements |
|---|---|---|---|
| Tesseract | Traditional | Industry-standard open-source OCR engine | pip install pytesseract opencv-python |
| EasyOCR | Deep Learning | Multi-language OCR using CRAFT text detector and CRNN recognizer | pip install easyocr |
| PaddleOCR | Deep Learning | Efficient OCR system by Baidu | pip install paddlepaddle paddleocr |
| DocTR | Deep Learning | Document Text Recognition library by Mindee | pip install python-doctr |
| Docling | Deep Learning | Document processor with OCR capabilities | pip install docling |
| KerasOCR | Deep Learning | Uses CRAFT text detector and Keras CRNN recognizer | pip install keras-ocr |
| Amazon Textract | Cloud API | AWS OCR service for documents | AWS credentials + pip install boto3 |
| VLM Models | Vision-Language | OpenRouter vision models (Qwen, Mistral, Pixtral) | OpenRouter API key + pip install openai |
- Python 3.8+
- Tesseract OCR installed on your system (if using that method)
- Required Python packages (install with `pip install -r requirements.txt`)

1. Clone this repository:

```
git clone https://github.com/yourusername/ocr_benchmarking.git
cd ocr_benchmarking
```

2. Install dependencies:

```
make install
```
To create a reproducible sample of the dataset:

```
make sample SEED=42 SAMPLE_SIZE=10
```

This will randomly select 10 images from the dataset using seed 42 for reproducibility.
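The core idea behind reproducible sampling is to draw from a random generator seeded explicitly, so the same seed always selects the same files. A minimal sketch of that idea (function and directory names here are illustrative, not `sample_dataset.py`'s actual API):

```python
import random
import shutil
from pathlib import Path

def sample_images(source_dir: str, dest_dir: str, sample_size: int, seed: int) -> list:
    """Copy a reproducible random subset of images into dest_dir/images."""
    # Sort candidates first so their order is deterministic across filesystems,
    # then draw from a generator seeded independently of the global RNG.
    images = sorted(Path(source_dir, "images").glob("*.png"))
    chosen = random.Random(seed).sample(images, sample_size)

    dest = Path(dest_dir, "images")
    dest.mkdir(parents=True, exist_ok=True)
    for img in chosen:
        shutil.copy(img, dest / img.name)
    return sorted(img.name for img in chosen)
```

Running this twice with the same seed over the same source directory yields the same file list, which is what makes benchmark runs comparable.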
Generate ground truth text from the FUNSD annotations:

```
make ground-truth
```

For more control over how text is extracted from annotations, you can run the script directly:

```
python generate_ground_truth.py \
  --annotations-dir dataset/sample/annotations \
  --output-file dataset/sample/ground_truth.json \
  --vertical-tolerance 15
```

The `--vertical-tolerance` parameter controls how text elements are grouped into lines based on their vertical position. A higher value will group more elements into the same line, while a lower value will create more separate lines.
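The grouping idea behind `--vertical-tolerance` can be sketched roughly as follows. This is a simplified illustration over assumed `(text, x, y)` word tuples, not the script's actual implementation:

```python
def group_into_lines(words, vertical_tolerance=15):
    """Group (text, x, y) word boxes into text lines.

    A word joins the current line when its y coordinate is within
    `vertical_tolerance` pixels of the line's last word; otherwise it
    starts a new line. Words are ordered left-to-right within a line.
    """
    lines = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(y - lines[-1][-1][2]) <= vertical_tolerance:
            lines[-1].append((text, x, y))
        else:
            lines.append([(text, x, y)])
    return [" ".join(t for t, _, _ in sorted(line, key=lambda w: w[1]))
            for line in lines]
```

With a large tolerance, slightly skewed words on the same form row merge into one line; with a very small tolerance, they split into separate lines.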
Use the `--debug` flag to see detailed information about how text elements are grouped into lines:

```
python generate_ground_truth.py --debug
```

You can also generate ground truth using Vision-Language Models (VLMs) via OpenRouter:

```
make ground-truth-vlm
```

This requires an OpenRouter API key set in your environment:

```
export OPENROUTER_API_KEY="your_api_key"
```

You can specify a different VLM model:

```
make ground-truth-vlm VLM_MODEL="anthropic/claude-3-5-sonnet"
```

Or run the script directly with more options:

```
python generate_vlm_ground_truth.py \
  --image-dir dataset/sample/images \
  --output-file dataset/sample/ground_truth_vlm.json \
  --model "anthropic/claude-3-5-sonnet" \
  --retries 3
```

You can evaluate OCR methods against different ground truth sources:
```
# Evaluate against annotation-based ground truth
make eval

# Evaluate against VLM-based ground truth
make eval-vlm
```

This allows you to compare how different OCR methods perform against different ground truth standards.
Run the benchmark with selected OCR methods:

```
make benchmark METHODS="tesseract easyocr paddleocr"
```

This will automatically save the results to a file named after your sample directory (e.g., results/result_sample_test.json).
If you just want to extract text from images without evaluating against a ground truth:

```
make only-extract-text METHODS="tesseract"
```

This will save the results to a file named `result_<sample_name>.json` in the results directory. For example, if your sample directory is dataset/sample_test, the results will be saved to results/result_sample_test.json.
You can specify a different sample directory:

```
make only-extract-text METHODS="tesseract" SAMPLE_DIR="dataset/my_custom_sample"
```

This is useful when you want to:
- Process images with OCR methods without having ground truth available
- Generate OCR results to share with others
- Batch process a set of images with multiple OCR methods
To evaluate previously saved results against a ground truth:

```
make eval
```

By default, this uses the results file based on your sample directory name. You can specify a different sample directory or results file:
```
# Evaluate results for a specific sample directory
make eval SAMPLE_DIR="dataset/my_custom_sample"

# Evaluate a specific results file against the ground truth
make eval RESULT_FILE="result_custom.json"
```

To evaluate against VLM-generated ground truth instead:

```
make eval-vlm
```

The same options apply for specifying custom sample directories or result files:

```
make eval-vlm SAMPLE_DIR="dataset/my_custom_sample"
make eval-vlm RESULT_FILE="result_custom.json"
```

Run the entire pipeline (sample, ground truth, and benchmark):
```
make all
```

Equivalently, run the underlying scripts step by step:

```
python sample_dataset.py --source-dir dataset/testing_data --dest-dir dataset/sample --sample-size 10 --seed 42

python generate_ground_truth.py --annotations-dir dataset/sample/annotations --output-file dataset/sample/ground_truth.json

python run_benchmark.py --image-dir dataset/sample/images --ground-truth dataset/sample/ground_truth.json --methods tesseract easyocr paddleocr
```

You can save OCR results to avoid reprocessing images:
```
python run_benchmark.py --save-results --results-file my_results.json
```

Load saved results and evaluate:

```
python run_benchmark.py --load-results --results-file my_results.json
```

Evaluate only (same as --load-results):

```
python run_benchmark.py --eval-only --results-file my_results.json
```

The benchmark evaluates OCR methods using several complementary metrics to provide a comprehensive understanding of performance:
**Description:** Measures how similar the extracted text is to the ground truth using Python's `difflib.SequenceMatcher`.

**Calculation:** The ratio of matching elements to the total number of elements in both sequences, returning a value between 0.0 (no similarity) and 1.0 (perfect match).

**Use case:** Provides a general measure of overall text similarity that accounts for additions, deletions, and substitutions.
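Since the metric is built on the standard library, it can be reproduced in a few lines:

```python
from difflib import SequenceMatcher

def similarity_ratio(extracted: str, reference: str) -> float:
    """Ratio of matching elements to total elements in both strings (0.0-1.0)."""
    return SequenceMatcher(None, extracted, reference).ratio()

# A single dropped period still scores close to 1.0:
print(similarity_ratio("Invoice No 1234", "Invoice No. 1234"))
```

Note that the ratio is symmetric in length but not in argument order in general, so the benchmark should pass arguments consistently (OCR output first, ground truth second, or vice versa) across all methods.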
**Description:** A common metric in speech recognition and OCR that measures the edit distance between words.

**Calculation:**

```
WER = (S + D + I) / N
```

Where:
- S = number of substituted words
- D = number of deleted words
- I = number of inserted words
- N = total number of words in the reference text

**Use case:** Lower is better. WER is particularly useful for assessing how many word-level corrections would be needed to transform the OCR output into the ground truth.
**Description:** Similar to WER but at the character level, which provides finer-grained assessment.

**Calculation:**

```
CER = (S + D + I) / N
```

Where:
- S = number of substituted characters
- D = number of deleted characters
- I = number of inserted characters
- N = total number of characters in the reference text

**Use case:** Lower is better. CER is useful for languages where word boundaries are not clear or when character-level accuracy is important.
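Both formulas reduce to a Levenshtein edit distance over the appropriate units, words for WER and characters for CER. A minimal reference implementation (illustrating the formulas, not necessarily the benchmark's exact code):

```python
def levenshtein(ref, hyp):
    """Minimum substitutions + deletions + insertions turning hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution (free if equal)
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over word sequences, divided by N words."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over characters, divided by N chars."""
    return levenshtein(list(reference), list(hypothesis)) / len(reference)
```

Because N is the reference length, both metrics can exceed 1.0 when the OCR output contains many spurious insertions.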
**Description:** Percentage of reference words that appear in the extracted text.

**Calculation:** The ratio of words from the ground truth that appear in the OCR output, regardless of order or frequency.

**Use case:** Useful for scenarios where the presence of key terms is more important than their exact positioning or order.
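One plausible set-based reading of this metric ("regardless of order or frequency") is the following sketch; the benchmark's actual tokenization and case handling may differ:

```python
def word_match_rate(reference: str, extracted: str) -> float:
    """Fraction of distinct ground-truth words found anywhere in the OCR output."""
    ref_words = set(reference.lower().split())
    if not ref_words:
        return 0.0  # avoid division by zero on empty ground truth
    return len(ref_words & set(extracted.lower().split())) / len(ref_words)
```

Because it ignores order and repetition, this score can be high even when the OCR output scrambles the document layout badly.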
**Description:** How long each method takes to process images.

**Calculation:** Measured in seconds per image.

**Use case:** Important for real-time applications or when processing large volumes of documents.
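Per-image timing needs nothing more than a monotonic clock around each call; roughly:

```python
import time

def timed_ocr(method, image_paths):
    """Run an OCR callable over images, recording seconds per image.

    `method` is any function mapping an image path to extracted text,
    e.g. an entry from the OCR_METHODS dictionary.
    """
    texts, seconds = {}, {}
    for path in image_paths:
        start = time.perf_counter()  # monotonic; unaffected by wall-clock changes
        texts[path] = method(path)
        seconds[path] = time.perf_counter() - start
    return texts, seconds
```

Note that the first image processed by a deep-learning method often pays a one-time model-loading cost, so per-image averages are more meaningful after a warm-up run.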
When interpreting the benchmark results, consider:

- **Task-specific priorities**: For some applications, accuracy might be more important than speed, while for others, the inverse might be true.
- **Document type sensitivity**: Some OCR methods perform better on certain document types (handwritten, printed, forms, etc.).
- **Language considerations**: Performance can vary significantly depending on the language and script.
- **Error patterns**: Look beyond the raw metrics to understand the types of errors each method makes.
- **Ground truth quality**: Remember that the evaluation is only as good as the ground truth it's compared against. FUNSD annotations and VLM-generated ground truths may have different characteristics.
The benchmark generates several visualizations to help interpret results:

- **Comparison charts**: Bar charts comparing all OCR methods across each metric.
- **Heatmaps**: Show where each method excels or struggles.
- **Time vs. accuracy plots**: Help identify methods that best balance speed and accuracy.
- **Per-image results**: Detailed metrics for each image to identify patterns based on document type or complexity.
To add a new OCR method, edit the `ocr_methods.py` file and add a function with the following signature:

```python
def ocr_your_method(image_path: str) -> str:
    """Extract text from image using your method.

    Installation: !pip install your-requirements

    Args:
        image_path (str): Path to the image file

    Returns:
        str: Extracted text from the image
    """
    # Your implementation here
    return extracted_text
```

Then add it to the OCR_METHODS dictionary at the bottom of the file:
```python
OCR_METHODS = {
    # Existing methods...
    "your_method": ocr_your_method,
}
```

To use cloud OCR services, set the required environment variables:
```
# Azure Computer Vision
export AZURE_VISION_KEY="your_api_key"
export AZURE_VISION_ENDPOINT="your_endpoint"

# Amazon Textract
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_REGION_NAME="your_region"

# OpenRouter (VLM models)
export OPENROUTER_API_KEY="your_api_key"
```

Results are saved in the results/ directory:
- Individual JSON files with extracted text for each method
- Complete OCR results with extracted text and processing times
- Visualization plots comparing performance metrics
- Similarity scores compared to ground truth
- A summary table highlighting the best-performing methods
- JSON export of evaluation metrics
This project is licensed under the MIT License - see the LICENSE file for details.
- FUNSD dataset - For providing the dataset of forms
- Various OCR libraries and their maintainers