A partial implementation of DocVLM: Make Your VLM an Efficient Reader ([Original Paper](https://arxiv.org/abs/2412.08746))
This project implements the core concept of DocVLM, which enhances Vision-Language Models (VLMs) by adding an OCR modality alongside traditional Image and Text modalities to significantly improve document understanding accuracy.
- ✅ Multi-modal Architecture: Image + Text + OCR modalities
- ✅ Enhanced Document Understanding: Leverages OCR text for better accuracy
- ✅ DocVQA Evaluation: Tested on document question-answering tasks
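The core idea can be sketched as follows. This is a toy illustration, not the project's actual code: the function name, token counts, and ordering are made up, and the "OCR tokens" stand in for the compressed OCR-encoder output that the paper projects into the LLM's embedding space.

```python
def build_llm_input(image_tokens, ocr_tokens, text_tokens):
    """DocVLM-style input: visual tokens, OCR tokens, then the question.

    Each argument is a list of embedding vectors already projected into the
    LLM's hidden space; the LLM attends over the concatenation of all three,
    so answers can be grounded in both pixels and recognized text.
    """
    return image_tokens + ocr_tokens + text_tokens

# Toy vectors standing in for real embeddings (hidden size 4 for illustration).
img = [[0.1] * 4] * 16   # e.g. 16 visual patch tokens
ocr = [[0.2] * 4] * 8    # e.g. 8 compressed OCR tokens
txt = [[0.3] * 4] * 5    # question tokens

seq = build_llm_input(img, ocr, txt)
```

In the real model the concatenated sequence is fed to the LLM backbone unchanged; only the OCR projection is newly trained.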
Due to resource and accessibility constraints, this is a partial implementation with the following adaptations:
- OCR Encoder: The original paper uses AWS's closed-source DocFormerV2. Instead, I implement OCR embeddings based on the LayTextLLM approach from ByteDance (LayTextLLM Paper)
- OCR Engine: Using EasyOCR for text extraction instead of enterprise-grade solutions
- Training Strategy: Simplified training pipeline (no separate OCR-LLM and Vision alignment phases)
- Model: The implementation targets the Qwen 2.5 VL model.
- Dataset: Trained only on DocVQA, InfographicQA, and OCR_VQA (vs. the broader dataset mix in the original paper)
- Batch Size: Limited to batch size 1 due to GPU memory constraints
- Image Resolution: Resized to 616
- GPU Requirements: Minimum 44GB VRAM needed for training
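Since DocFormerV2 is not public, the OCR modality follows LayTextLLM's "a bounding box is worth one token" idea: each OCR box is normalized, projected through a small linear layer into a single LLM-sized embedding, and interleaved with the word's text tokens. The sketch below is illustrative only (toy dimensions, random weights, and a stub word embedder standing in for the LLM's tokenizer); in the real pipeline the (text, box) pairs would come from EasyOCR's `readtext` output.

```python
import random

EMBED_DIM = 8  # toy dimension; a real model uses the LLM's hidden size

def normalize_bbox(bbox, width, height):
    """Scale pixel coordinates (x1, y1, x2, y2) into [0, 1]."""
    x1, y1, x2, y2 = bbox
    return [x1 / width, y1 / height, x2 / width, y2 / height]

def linear(vec, weight, bias):
    """Plain matrix-vector product: one row of `weight` per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

class BoxProjector:
    """Projects a 4-d normalized box into one embedding: one 'layout token'."""
    def __init__(self, dim=EMBED_DIM, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(4)]
                       for _ in range(dim)]
        self.bias = [0.0] * dim

    def __call__(self, norm_bbox):
        return linear(norm_bbox, self.weight, self.bias)

def interleave(ocr_words, projector, embed_text):
    """LayTextLLM-style sequence: [layout token][word tokens][layout token]..."""
    seq = []
    for text, bbox in ocr_words:
        seq.append(projector(bbox))   # one token per bounding box
        seq.extend(embed_text(text))  # followed by the word's text tokens
    return seq

# Usage with toy OCR output on a 100x100 image; in practice the words and
# pixel boxes would be parsed from EasyOCR results.
words = [("Invoice", normalize_bbox((10, 10, 90, 30), 100, 100)),
         ("Total",   normalize_bbox((10, 40, 60, 60), 100, 100))]
toy_embed = lambda text: [[0.0] * EMBED_DIM]  # stub: one token per word
seq = interleave(words, BoxProjector(), toy_embed)
```

Collapsing each box to a single token keeps the sequence short compared to serializing four coordinates as text, which is the efficiency argument LayTextLLM makes.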
Make sure you have uv installed for dependency management.

```bash
uv sync
uv run docvlm-train --help
uv run docvlm-eval --help
```

Due to limited compute resources, I could not complete a single full training epoch.
The following graph shows the validation loss of the model after one hour of training on 6,000 samples (one epoch = 180,673 samples).
Fully validating the approach would require at least 30 hours of training on an H100.
```bibtex
@misc{nacson2024docvlmmakevlmefficient,
  title={DocVLM: Make Your VLM an Efficient Reader},
  author={Mor Shpigel Nacson and Aviad Aberdam and Roy Ganz and Elad Ben Avraham and Alona Golts and Yair Kittenplon and Shai Mazor and Ron Litman},
  year={2024},
  eprint={2412.08746},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.08746},
}

@article{lu2024bounding,
  title={A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding},
  author={Lu, Jinghui and Yu, Haiyang and Wang, Yanjie and Ye, Yongjie and Tang, Jingqun and Yang, Ziwei and Wu, Binghong and Liu, Qi and Feng, Hao and Wang, Han and others},
  journal={arXiv preprint arXiv:2407.01976},
  year={2024}
}
```

This is a research implementation. Feel free to open issues or contribute improvements!
There is plenty left to improve (data batching, better OCR, the alignment training phases, learning-rate tuning, GPU training optimizations, ...), but this is a first step toward the paper's full experimental setup. Under the current conditions, I don't plan to continue the project.
