A partial implementation of DocVLM: Make Your VLM an Efficient Reader ([Original Paper](https://arxiv.org/abs/2412.08746))
This project implements the core concept of DocVLM, which enhances Vision-Language Models (VLMs) by adding an OCR modality alongside traditional Image and Text modalities to significantly improve document understanding accuracy.
- ✅ Multi-modal Architecture: Image + Text + OCR modalities
- ✅ Enhanced Document Understanding: Leverages OCR text for better accuracy
- ✅ DocVQA Evaluation: Tested on document question-answering tasks
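The core idea can be sketched as follows. This is a toy illustration, not the project's actual code: the function name, token counts, and ordering are made up, and the "OCR tokens" stand in for the compressed OCR-encoder output that the paper projects into the LLM's embedding space.

```python
def build_llm_input(image_tokens, ocr_tokens, text_tokens):
    """DocVLM-style input: visual tokens, OCR tokens, then the question.

    Each argument is a list of embedding vectors already projected into the
    LLM's hidden space; the LLM attends over the concatenation of all three,
    so answers can be grounded in both pixels and recognized text.
    """
    return image_tokens + ocr_tokens + text_tokens

# Toy vectors standing in for real embeddings (hidden size 4 for illustration).
img = [[0.1] * 4] * 16   # e.g. 16 visual patch tokens
ocr = [[0.2] * 4] * 8    # e.g. 8 compressed OCR tokens
txt = [[0.3] * 4] * 5    # question tokens

seq = build_llm_input(img, ocr, txt)
```

In the real model the concatenated sequence is fed to the LLM backbone unchanged; only the OCR projection is newly trained.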
Due to resource and accessibility constraints, this is a partial implementation with the following adaptations:
- OCR Encoder: The original paper uses AWS's closed-source DocFormerV2. Instead, I implement OCR embeddings based on the LayTextLLM approach from ByteDance (LayTextLLM Paper)
- OCR Engine: Using EasyOCR for text extraction instead of enterprise-grade solutions
- Training Strategy: Simplified training pipeline (no separate OCR-LLM and Vision alignment phases)
- Model: The implementation targets the Qwen 2.5 VL model.
- Dataset: Trained only on DocVQA, InfographicQA, and OCR_VQA (vs. the broader dataset mix in the original paper)
- Batch Size: Limited to batch size 1 due to GPU memory constraints
- Image Resolution: Resized to 616
- GPU Requirements: Minimum 44GB VRAM needed for training
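Since DocFormerV2 is not public, the OCR modality follows LayTextLLM's "a bounding box is worth one token" idea: each OCR box is normalized, projected through a small linear layer into a single LLM-sized embedding, and interleaved with the word's text tokens. The sketch below is illustrative only (toy dimensions, random weights, and a stub word embedder standing in for the LLM's tokenizer); in the real pipeline the (text, box) pairs would come from EasyOCR's `readtext` output.

```python
import random

EMBED_DIM = 8  # toy dimension; a real model uses the LLM's hidden size

def normalize_bbox(bbox, width, height):
    """Scale pixel coordinates (x1, y1, x2, y2) into [0, 1]."""
    x1, y1, x2, y2 = bbox
    return [x1 / width, y1 / height, x2 / width, y2 / height]

def linear(vec, weight, bias):
    """Plain matrix-vector product: one row of `weight` per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

class BoxProjector:
    """Projects a 4-d normalized box into one embedding: one 'layout token'."""
    def __init__(self, dim=EMBED_DIM, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(4)]
                       for _ in range(dim)]
        self.bias = [0.0] * dim

    def __call__(self, norm_bbox):
        return linear(norm_bbox, self.weight, self.bias)

def interleave(ocr_words, projector, embed_text):
    """LayTextLLM-style sequence: [layout token][word tokens][layout token]..."""
    seq = []
    for text, bbox in ocr_words:
        seq.append(projector(bbox))   # one token per bounding box
        seq.extend(embed_text(text))  # followed by the word's text tokens
    return seq

# Usage with toy OCR output on a 100x100 image; in practice the words and
# pixel boxes would be parsed from EasyOCR results.
words = [("Invoice", normalize_bbox((10, 10, 90, 30), 100, 100)),
         ("Total",   normalize_bbox((10, 40, 60, 60), 100, 100))]
toy_embed = lambda text: [[0.0] * EMBED_DIM]  # stub: one token per word
seq = interleave(words, BoxProjector(), toy_embed)
```

Collapsing each box to a single token keeps the sequence short compared to serializing four coordinates as text, which is the efficiency argument LayTextLLM makes.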
Make sure you have uv installed for dependency management.

```bash
uv sync
uv run docvlm-train --help
uv run docvlm-eval --help
```

Due to limited compute resources, I could not complete a single full training epoch.
The following graph shows the validation loss of the model after one hour of training on 6,000 samples (one epoch = 180,673 samples).
Fully validating the approach would require at least 30 hours of training on an H100.
```bibtex
@misc{nacson2024docvlmmakevlmefficient,
  title={DocVLM: Make Your VLM an Efficient Reader},
  author={Mor Shpigel Nacson and Aviad Aberdam and Roy Ganz and Elad Ben Avraham and Alona Golts and Yair Kittenplon and Shai Mazor and Ron Litman},
  year={2024},
  eprint={2412.08746},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.08746},
}

@article{lu2024bounding,
  title={A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding},
  author={Lu, Jinghui and Yu, Haiyang and Wang, Yanjie and Ye, Yongjie and Tang, Jingqun and Yang, Ziwei and Wu, Binghong and Liu, Qi and Feng, Hao and Wang, Han and others},
  journal={arXiv preprint arXiv:2407.01976},
  year={2024}
}
```

This is a research implementation. Feel free to open issues or contribute improvements!
There is plenty left to improve (data batching, better OCR, the alignment training phases, learning-rate tuning, GPU training optimizations, ...), but this is a first step toward the paper's full experimental setup. Under the current conditions, I don't plan to continue the project.
