PDF to markdown using vision LLMs — tables, layouts, and structure preserved
Convert scanned PDFs into searchable text locally using Vision LLMs (olmOCR). 100% private, offline, and free. Features a modern Web UI & CLI.
Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.
PyMidscene - A Python SDK implementation of Midscene.js | AI-driven natural-language UI automation: skip selectors and describe the operation in plain language instead. Fully compatible with the official cache format.
AI Video Editor Pipeline with Vision LLM Models
AI-powered OCR for Diablo II: Resurrected - batch-extract item tooltips from screenshots using Vision LLMs (OpenAI, Groq, OpenRouter, LM Studio/Ollama). No Tesseract or EasyOCR needed.
A feature-rich desktop GUI for Ollama with Vision, RAG, and JSON support.
A Python-based incident detection engine that analyzes video feeds for motion, detects objects, and uses large language models (LLMs) to generate semantic descriptions of incidents. Designed for extensibility with custom detectors and processors.
Free OCR powered by LLMs using OpenRouter — extract text from images with no API costs. Works with image URLs and Base64 inputs using free vision-capable models.
Multimodal AI-powered medical assistant with LLMs, speech, and image understanding.
Free, offline OCR using local LLMs with Ollama. Convert images to text with vision-enabled models running entirely on your machine — no cloud, no API costs, full privacy.
🖼️ Extract text from images locally using Ollama's LLMs—100% free, offline, and private. No API keys or cloud costs necessary.
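The Ollama-based offline OCR tools above share one basic pattern: base64-encode the image, send it with a text prompt to a local vision model, and read back the extracted text. A minimal sketch of the request body for Ollama's `/api/generate` endpoint — the model name and prompt below are placeholder assumptions, not taken from any of the listed projects:

```python
import base64

def build_ocr_payload(image_bytes: bytes,
                      model: str = "llama3.2-vision",  # placeholder model name
                      prompt: str = "Extract all text from this image as markdown.") -> dict:
    """Build the JSON body for a POST to Ollama's /api/generate endpoint.

    Ollama accepts vision input as a list of base64-encoded images sent
    alongside the text prompt; with stream=False the whole response
    arrives as a single JSON object instead of a token stream.
    """
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

# Typical usage (assumes an Ollama server running on the default port):
#   import requests
#   with open("scan.png", "rb") as f:
#       resp = requests.post("http://localhost:11434/api/generate",
#                            json=build_ocr_payload(f.read()))
#   print(resp.json()["response"])
```

Because everything runs against a local server, no image data ever leaves the machine — which is the privacy property these projects advertise.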
AI-powered tool that extracts structured data from bank statement images using LLaMA Vision and displays it in clean JSON and table formats. Built with Streamlit and pandas for fast, accurate financial document parsing.
A FastAPI-based backend service that extracts structured information from academic marksheets (images or PDFs) using OCR and an LLM, and returns a normalized JSON response with confidence scores.
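The marksheet-extraction service above illustrates a useful design for LLM-backed extractors: pair every extracted field with a confidence score so callers can flag low-confidence values for human review. A minimal sketch of such a normalized response shape — the field names and the 0.8 threshold are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ExtractedField:
    value: str          # text produced by the OCR/LLM stage
    confidence: float   # 0.0-1.0, as reported by the extraction pipeline

@dataclass
class MarksheetResponse:
    student_name: ExtractedField
    roll_number: ExtractedField
    subjects: list = field(default_factory=list)  # per-subject ExtractedFields

    def needs_review(self, threshold: float = 0.8) -> list:
        """Return the names of top-level fields below the confidence threshold."""
        flagged = []
        for name in ("student_name", "roll_number"):
            if getattr(self, name).confidence < threshold:
                flagged.append(name)
        return flagged

    def to_json_dict(self) -> dict:
        # asdict() recurses into nested dataclasses, giving a JSON-ready dict
        return asdict(self)
```

For example, `MarksheetResponse(ExtractedField("Jane Doe", 0.95), ExtractedField("12-3456", 0.61)).needs_review()` returns `["roll_number"]`, letting the API surface exactly which values a human should double-check.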
🤖 A Discord bot that scrapes daily tech comics (XKCD, MonkeyUser, Turnoff.us) and uses Vision LLMs (Llama-4 via Groq) to explain the jokes.
🤖 Automate UI interactions with ease using the PyMidscene Python SDK, leveraging Midscene.js for AI-driven, natural language commands.
Multi-engine image generation filter for Open WebUI. Features automated prompt enhancement, multi-language support, and real-time Vision QC scoring. Supports A1111, ComfyUI, and OpenAI backends with integrated performance telemetry.
This repository focuses on customizing the Qwen2.5-Vision model for specific tasks. It provides step-by-step guidance, scripts, and best practices for fine-tuning the model on custom datasets. Ideal for developers and researchers, it ensures optimal performance and accuracy tailored to unique use cases.
Automated data extraction from PDF receipts to Excel using Vision LLM (tested with Qwen3-VL and olmOCR 2).