A Gradio-based demonstration application for the Tencent HunyuanOCR model, focused on optical character recognition (OCR) tasks such as text detection, extraction, and coordinate formatting from images. Users can upload images, customize prompts (e.g., for Chinese/English text), and generate structured outputs with advanced generation controls.
- Image Upload and Processing: Supports direct upload or clipboard paste; processes images via PIL for text recognition.
- Custom Prompts: Tailor queries like "检测并识别图片中的文字,将文本坐标格式化输出。" (Detect and recognize text in the image, format text coordinates output) for precise extraction.
- Advanced Generation Controls: Adjustable max new tokens (up to 8192) for handling complex documents.
- Output Handling: Cleaned text to remove repetitions; interactive textbox with copy button for easy use.
- Custom Theme: SteelBlueTheme with gradients and enhanced typography for a professional interface.
- Examples Integration: Built-in sample images for quick testing (e.g., documents, receipts).
- Queueing Support: Handles up to 10 concurrent inferences for smooth multi-user access.
- Error Resilience: Graceful handling of loading and generation errors with informative messages.
- Python 3.10 or higher.
- CUDA-compatible GPU (recommended for bfloat16; falls back to CPU with float32).
- Git for cloning submodules.
- Hugging Face account (optional, for model caching via
huggingface_hub).
-
Clone the repository:
git clone https://github.com/PRITHIVSAKTHIUR/HunyuanOCR-Demo.git cd HunyuanOCR-Demo -
Install dependencies: Create a
requirements.txtfile with the following content, then run:pip install -r requirements.txtrequirements.txt content:
git+https://github.com/huggingface/transformers@82a06db03535c49aa987719ed0746a76093b1ec4 git+https://github.com/huggingface/accelerate.git git+https://github.com/huggingface/diffusers.git git+https://github.com/huggingface/peft.git huggingface_hub gradio==6.0.1 qwen-vl-utils sentencepiece opencv-python torchvision supervision matplotlib easydict kernels einops addict hf_xet torch numpy av -
Start the application:
python app.pyThe demo launches at
http://localhost:7860(or the provided URL if using Spaces).
-
Upload Image: Drag-and-drop or paste an image (e.g., scanned document, sign, or multilingual text).
-
Set Prompt: Enter a custom query in the textbox. Default: "检测并识别图片中的文字,将文本坐标格式化输出。" for formatted text with coordinates.
-
Configure Settings:
- Expand "Advanced Settings" to adjust max new tokens for longer outputs.
-
Run Inference: Click "Perform OCR" to process. Results appear in the output textbox.
-
View Results:
- Text: Structured OCR output (e.g., detected text with bounding box coordinates).
- Copy or edit the interactive output as needed.
- Upload a Chinese receipt image.
- Use default prompt for coordinate-formatted extraction.
- Set max new tokens to 2048 for detailed results.
- Output: List of text segments with positions like "Text: '价格', Coordinates: [x1, y1, x2, y2]".
- Model Loading Errors: Verify CUDA setup; check console for
torch.version.cuda. Useattn_implementation="eager"to avoid SDPA issues. - Out of Memory: Reduce max new tokens or use CPU fallback; monitor with
nvidia-smi. - Import Issues: Install
spacesonly for Hugging Face Spaces deployment; it's mocked locally. - Repeated Output: Automatically cleaned via
clean_repeated_substrings; increase threshold if needed. - Generation Fails: Ensure prompt is valid; test with default for baseline.
- UI Launch: If
ssr_mode=Falsecauses issues, set toTruefor server-side rendering.
Contributions are encouraged! Open issues for bugs or enhancements (e.g., batch processing, additional post-processing). Fork, create a branch, and submit a pull request with tests. Potential areas:
- Integration with other OCR models.
- Export options (e.g., JSON coordinates).
- Multilingual prompt templates.
Repository: https://github.com/PRITHIVSAKTHIUR/HunyuanOCR-Demo.git
Apache License 2.0. See LICENSE for details.
Built by Prithiv Sakthi. Report issues via the repository.