📚 DocVLM: Make Your VLM an Efficient Reader Implementation

A partial implementation of DocVLM: Make Your VLM an Efficient Reader (original paper: https://arxiv.org/abs/2412.08746)

🎯 Overview

This project implements the core concept of DocVLM, which enhances Vision-Language Models (VLMs) by adding an OCR modality alongside traditional Image and Text modalities to significantly improve document understanding accuracy.
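As a rough illustration of this idea (all dimensions and token counts below are made up for the example, not taken from the paper), the OCR encoder's output can be treated as a third token sequence that is concatenated with the visual and textual embeddings before being fed to the language model:

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
hidden_dim = 64          # LLM embedding size
num_image_tokens = 16    # tokens from the vision encoder
num_ocr_tokens = 32      # tokens from the OCR encoder
num_text_tokens = 8      # tokens from the question/prompt

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(num_image_tokens, hidden_dim))
ocr_emb = rng.normal(size=(num_ocr_tokens, hidden_dim))
text_emb = rng.normal(size=(num_text_tokens, hidden_dim))

# DocVLM-style input: the OCR sequence joins the usual image and text
# sequences along the token axis before entering the language model.
llm_input = np.concatenate([image_emb, ocr_emb, text_emb], axis=0)
print(llm_input.shape)  # (56, 64)
```

In the real model these would of course be learned embeddings from the respective encoders rather than random arrays; the sketch only shows where the extra modality enters the pipeline.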

🔍 Key Features

  • Multi-modal Architecture: Image + Text + OCR modalities
  • Enhanced Document Understanding: Leverages OCR text for better accuracy
  • DocVQA Evaluation: Tested on document question-answering tasks

⚠️ Implementation Limitations

Due to resource and accessibility constraints, this is a partial implementation with the following adaptations:

🔧 Technical Adaptations

  • OCR Encoder: The original paper uses AWS's closed-source DocFormerV2. Instead, I implement OCR embeddings based on the LayTextLLM approach from ByteDance (LayTextLLM Paper)
  • OCR Engine: Using EasyOCR for text extraction instead of enterprise-grade solutions
  • Training Strategy: Simplified training pipeline (no separate OCR-LLM and Vision alignment phases)
  • Base Model: The implementation focuses on the Qwen 2.5 VL model.
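The LayTextLLM substitution can be sketched roughly as follows: each OCR bounding box is projected to a single embedding ("a bounding box is worth one token") and interleaved with the embeddings of the recognized text. Everything below (the dimensions, the toy whitespace text embedder, the sample lines) is illustrative and not the actual implementation in this repo or the LayTextLLM paper:

```python
import numpy as np

hidden_dim = 64
rng = np.random.default_rng(0)
# In practice this projection would be a learned layer.
W_box = rng.normal(size=(4, hidden_dim)) * 0.02

def box_token(bbox, img_w, img_h):
    """Map one (x0, y0, x1, y1) box to a single layout embedding."""
    x0, y0, x1, y1 = bbox
    norm = np.array([x0 / img_w, y0 / img_h, x1 / img_w, y1 / img_h])
    return norm @ W_box

def toy_embed(text):
    """Toy text embedder: one random vector per whitespace token."""
    return rng.normal(size=(len(text.split()), hidden_dim))

def interleave(ocr_lines, img_w, img_h, embed_text):
    """For each OCR line, emit [layout token, text tokens...]."""
    seq = []
    for text, bbox in ocr_lines:
        seq.append(box_token(bbox, img_w, img_h)[None, :])
        seq.append(embed_text(text))
    return np.concatenate(seq, axis=0)

lines = [("Invoice No. 42", (50, 30, 300, 60)),
         ("Total: $19.99", (50, 700, 280, 730))]
seq = interleave(lines, img_w=1000, img_h=1000, embed_text=toy_embed)
# Line 1: 1 layout + 3 text tokens; line 2: 1 layout + 2 text tokens.
print(seq.shape)  # (7, 64)
```

The appeal of this scheme is that layout information costs only one extra token per OCR line, keeping the sequence short compared with encoding coordinates as text.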

💻 Resource Constraints

  • Dataset: Trained only on DocVQA, InfographicVQA, and OCR_VQA (vs. the larger dataset mixture used in the original)
  • Batch Size: Limited to batch size 1 due to GPU memory constraints
  • Image Resolution: Resized to 616
  • GPU Requirements: Minimum 44GB VRAM needed for training

🚀 Installation & Usage

Prerequisites

Make sure you have uv installed for dependency management.

Install Dependencies

uv sync

Training

uv run docvlm-train --help

Evaluation

uv run docvlm-eval --help

📊 Preliminary Results

Due to limited compute resources, I could not complete even a single training epoch.

The following graph shows the validation loss of a model trained for 1 hour on 6,000 samples (one epoch = 180,673 samples).

[validation loss graph]

Fully validating the approach would require at least 30 hours of training on an H100.

📚 Citations

Original DocVLM Paper (AWS Team)

@misc{nacson2024docvlmmakevlmefficient,
      title={DocVLM: Make Your VLM an Efficient Reader}, 
      author={Mor Shpigel Nacson and Aviad Aberdam and Roy Ganz and Elad Ben Avraham and Alona Golts and Yair Kittenplon and Shai Mazor and Ron Litman},
      year={2024},
      eprint={2412.08746},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.08746}, 
}

LayTextLLM Paper (ByteDance Team)

@article{lu2024bounding,
  title={A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding},
  author={Lu, Jinghui and Yu, Haiyang and Wang, Yanjie and Ye, Yongjie and Tang, Jingqun and Yang, Ziwei and Wu, Binghong and Liu, Qi and Feng, Hao and Wang, Han and others},
  journal={arXiv preprint arXiv:2407.01976},
  year={2024}
}

🤝 Contributing

This is a research implementation. Feel free to open issues or contribute improvements!

📝 Note

There is still much to improve (data batching, better OCR, training alignment phases, learning-rate tuning, GPU training optimizations, ...). This is only a first step toward a complete reproduction of the paper's experiments. Under current conditions, I do not plan to continue the project.
