feat: vision LLM description pipeline for image documents #111
Open
nv78 wants to merge 1 commit into claude/add-multimodal-support-QBQca from
Conversation
When a user uploads an image file (JPEG, PNG, GIF, WebP, BMP, TIFF) via the document upload flow, the new vision pipeline takes over instead of attempting text extraction through Tika:

1. Reads the raw image bytes from the upload.
2. Calls describe_image() in the new services/vision_service.py module, which invokes the configured vision-capable LLM (GPT-4o with detail=high, or Claude 3.5 Sonnet) with a detailed indexing prompt that asks the model to transcribe text, describe charts/diagrams/objects, and capture layout.
3. Stores the resulting description as document_text in the documents table (media_type='image', mime_type=<actual mime>).
4. Feeds the description into the existing chunk_document Ray task so it is split, embedded, and indexed, making the image fully searchable via RAG.

services/vision_service.py:

- describe_image(bytes, mime_type, prompt?) -> str
- _describe_openai(): uses the openai.OpenAI client with an image_url content block (detail=high for better OCR accuracy)
- _describe_anthropic(): uses the anthropic.Anthropic client with a source block
- Respects the ENABLE_MULTIMODAL and MAX_IMAGE_BYTES config flags
- Never raises: returns a placeholder string on failure so the document record is always created

documents/handler.py:

- New elif category == "image" branch reads the bytes, calls describe_image(), stores the description, and enqueues chunking via a Ray remote call
- Video and audio remain as binary-only stubs (PRs 3 & 4)

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu
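For orientation, here is a minimal usage sketch of the new entry point. The import path is assumed from backend/services/vision_service.py and the file name is arbitrary; this snippet is illustrative and not part of the PR.

```python
# Hypothetical smoke test of the new service from a backend Python shell.
from services.vision_service import describe_image  # path assumed from the PR's file list

with open("chart.png", "rb") as f:  # any supported JPEG/PNG/GIF/WebP/BMP/TIFF file
    image_bytes = f.read()

# Returns the LLM-generated description, or a placeholder string if multimodal
# support is disabled, the image exceeds the size limit, or the provider call fails.
description = describe_image(image_bytes, "image/png")
print(description[:500])
```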
Summary
When a user uploads an image file (JPEG, PNG, GIF, WebP, BMP, TIFF) as a document, instead of attempting extraction through Apache Tika (which produces nothing useful for images), the new vision pipeline generates a rich text description via GPT-4o or Claude 3.5 Sonnet and feeds it into the existing RAG chunking + embedding pipeline. The image becomes fully searchable.
New file: backend/services/vision_service.py

- describe_image(bytes, mime_type, prompt?): main entry point; dispatches to the correct provider based on AgentConfig.DEFAULT_AGENT_MODEL_TYPE
- _describe_openai(): calls openai.OpenAI with an image_url content block using detail=high for better OCR accuracy
- _describe_anthropic(): calls anthropic.Anthropic with a source content block
- _INDEXING_PROMPT: asks the model to transcribe text verbatim, describe charts/diagrams/objects/layout, and be thorough for search indexing
- Respects the ENABLE_MULTIMODAL and MAX_IMAGE_BYTES config flags
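To make the provider calls concrete, here is a sketch of what the two helpers might look like. The model identifiers, token limit, and prompt wording are assumptions; only the client classes, the image_url/source content blocks, and detail=high come from the description above.

```python
# Illustrative sketch of the provider helpers; not the PR's exact code.
import base64

import anthropic
from openai import OpenAI

_INDEXING_PROMPT = (  # wording assumed; the PR only describes the prompt's intent
    "Transcribe any text in the image verbatim, describe charts, diagrams, and "
    "objects, and note the overall layout. Be thorough: this text is used for search indexing."
)

def _describe_openai(image_bytes: bytes, mime_type: str, prompt: str = _INDEXING_PROMPT) -> str:
    # GPT-4o with an image_url content block; detail="high" improves OCR accuracy.
    client = OpenAI()
    b64 = base64.b64encode(image_bytes).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime_type};base64,{b64}", "detail": "high"}},
            ],
        }],
    )
    return resp.choices[0].message.content or ""

def _describe_anthropic(image_bytes: bytes, mime_type: str, prompt: str = _INDEXING_PROMPT) -> str:
    # Claude 3.5 Sonnet with a base64 "source" content block.
    client = anthropic.Anthropic()
    b64 = base64.b64encode(image_bytes).decode()
    msg = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": mime_type, "data": b64}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return "".join(block.text for block in msg.content if block.type == "text")
```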
Updated: backend/api_endpoints/documents/handler.py

- New elif category == "image" branch in IngestDocumentsHandler:
  - calls describe_image() on the uploaded bytes
  - stores the description as document_text (with media_type='image', mime_type)
  - enqueues chunk_document_fn.remote() so the description is split, embedded, and indexed
- Video and audio remain as binary-only stubs, to be wired up in PRs 3 & 4.
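A rough sketch of what that branch does, factored as a helper for readability. Only describe_image() and chunk_document_fn are named in the PR; the document-insert helper, argument names, and return value here are hypothetical.

```python
# Illustrative outline of the new image branch in IngestDocumentsHandler.
from services.vision_service import describe_image  # path assumed

def _ingest_image_document(user_id, filename, mime_type, image_bytes):
    # 1. Generate the description (never raises; returns a placeholder on failure).
    description = describe_image(image_bytes, mime_type)

    # 2. Store the description as the document's text so the image behaves like
    #    any other document (insert_document_record is a hypothetical helper; the
    #    real handler writes to the documents table with media_type='image').
    document_id = insert_document_record(
        user_id=user_id,
        filename=filename,
        document_text=description,
        media_type="image",
        mime_type=mime_type,
    )

    # 3. Reuse the existing Ray chunking task: the description is split,
    #    embedded, and indexed just like extracted text.
    chunk_document_fn.remote(document_id, description)
    return document_id
```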
Test plan
- Upload an image → document created with media_type='image'; rows appear in the chunks table after the Ray task completes
- Oversized image → describe_image returns a size-limit placeholder, document still created
- ENABLE_MULTIMODAL=false → placeholder text stored, no LLM call made

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu
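The ENABLE_MULTIMODAL=false case lends itself to an automated check. A sketch, assuming the flag and the provider helpers are patchable attributes on the vision_service module (the real config mechanism may differ):

```python
# Hypothetical pytest for the disabled-multimodal path; attribute names assumed.
from unittest import mock

from services import vision_service  # import path assumed

def test_disabled_multimodal_returns_placeholder_without_llm_call():
    with mock.patch.object(vision_service, "ENABLE_MULTIMODAL", False), \
         mock.patch.object(vision_service, "_describe_openai") as openai_call, \
         mock.patch.object(vision_service, "_describe_anthropic") as anthropic_call:
        result = vision_service.describe_image(b"\x89PNG\r\n", "image/png")

    # A non-empty placeholder string is returned so the document is still created...
    assert isinstance(result, str) and result
    # ...and neither provider is ever called.
    openai_call.assert_not_called()
    anthropic_call.assert_not_called()
```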