Famous Vision Language Models and Their Architectures (updated Jan 11, 2026; Markdown)
ComfyUI-QwenVL custom node: Integrates the Qwen-VL series, including Qwen2.5-VL and the latest Qwen3-VL, with GGUF support for advanced multimodal AI in text generation, image understanding, and video analysis.
A continuously updated collection and survey of vision-language model papers and models, maintained as a GitHub repository.
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
Fast GPU OCR server. 270 img/s on FUNSD. TensorRT FP16, PP-OCRv5, HTTP + gRPC.
Reinforcement Learning of Vision Language Models with Self Visual Perception Reward
Mark web pages for use with vision-language models
Local Video RAG Engine. A FastAPI microservice for video understanding: Scene Detection + Whisper ASR + Qwen3-VL. Optimized for Apple Silicon (MLX) & Windows/Linux (Llama.cpp).
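A core step in a video RAG pipeline like the one above is attaching the ASR transcript to the detected scenes. A minimal sketch in plain Python, assuming illustrative data shapes (the function name and tuple layouts are not the repository's actual API):

```python
# Sketch: align Whisper ASR segments to detected scene boundaries so each
# scene carries its own transcript text. Data shapes are assumptions:
# scenes as (start_s, end_s) pairs, segments as (start_s, end_s, text).

def assign_transcript_to_scenes(scenes, segments):
    """Return one transcript string per scene, joined from all ASR
    segments whose time range overlaps that scene."""
    transcripts = []
    for s_start, s_end in scenes:
        parts = [text for t_start, t_end, text in segments
                 if t_start < s_end and t_end > s_start]  # any time overlap
        transcripts.append(" ".join(parts))
    return transcripts
```

A segment that straddles a scene cut (like `(9, 12)` below) is intentionally attached to both scenes, so neither loses context:

```python
assign_transcript_to_scenes(
    [(0, 10), (10, 20)],
    [(0, 4, "hello"), (9, 12, "world"), (15, 18, "end")],
)
# → ["hello world", "world end"]
```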
An AI agent that can control your screen to complete any task.
🛠️ Build and train multimodal models easily with LLaVA-OneVision 1.5, an open framework designed for seamless integration of vision and language tasks.
Qwen-VL base model for use with Autodistill.
A robotic sequential grasping system integrating YOLO detection and Qwen-VLM fine-tuning, enabling a full loop from manual teaching to LLM-based logical manipulation.
🤖 The Next-Gen AI Agent. Unlike normal agents, it goes beyond text and can control your Desktop & Android.
Creates text from video and audio using Qwen-VL and Whisper.
A computer vision system for automated analysis of index cards from a collection of coin forgeries using Qwen2.5-VL vision-language model. Developed for the imagines nummorum project.
Enable local integration of Qwen3.5 models with ComfyUI for text generation and multimodal visual tasks, featuring automatic model management and precision control.
EYUAI
A specialized ComfyUI toolkit for Qwen Image Edit workflows. It provides official training resolution calibration, real-time UI aspect ratio feedback, and intelligent image scaling (Crop/Pad/Stretch) to ensure optimal inference quality for Qwen-series image editing and generation.
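The Crop/Pad/Stretch scaling modes mentioned above come down to simple aspect-ratio arithmetic. A hedged sketch (the mode names mirror the description; the function itself is illustrative, not the node's real implementation):

```python
# Sketch of Crop/Pad/Stretch scaling math for fitting an image into a
# target resolution before inference.

def fit_dimensions(src_w, src_h, dst_w, dst_h, mode):
    """Return (scaled_w, scaled_h) prior to any final crop or pad.
    'stretch' ignores aspect ratio; 'crop' scales to cover the target
    (then center-crop); 'pad' scales to fit inside it (then letterbox)."""
    if mode == "stretch":
        return dst_w, dst_h
    scale_cover = max(dst_w / src_w, dst_h / src_h)  # crop: fill, overflow
    scale_fit = min(dst_w / src_w, dst_h / src_h)    # pad: fit, underflow
    scale = scale_cover if mode == "crop" else scale_fit
    return round(src_w * scale), round(src_h * scale)
```

For a 1920×1080 source and a 1024×1024 target, `"pad"` yields 1024×576 (letterboxed top and bottom) while `"crop"` yields 1820×1024 (sides trimmed).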
Generate vivid, human-like captions for portrait images using the Qwen2.5-VL-7B model. Outputs dense descriptions covering emotion, posture, clothing, and environment.
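If the captioner is prompted to emit labeled fields, the dense description can be split back into structured attributes with a small parser. The field names and `Field: value` output format here are assumptions for illustration, not the repository's documented schema:

```python
# Sketch: parse a field-tagged caption (an assumed "Field: value" format)
# into a dict of the attribute categories named in the description.

FIELDS = ("emotion", "posture", "clothing", "environment")

def parse_caption(text):
    """Parse lines like 'Emotion: calm' into a dict keyed by lowercase
    field name; lines without a recognized field are ignored."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in FIELDS:
            result[key.strip().lower()] = value.strip()
    return result
```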