
feat: native tabular (CSV/Excel) and plain-text ingestion#114

Open
nv78 wants to merge 1 commit into claude/video-frame-analysis from claude/tabular-and-text-ingestion

Conversation


@nv78 nv78 commented Mar 24, 2026

Problem

CSV / spreadsheets: Tika extracts these as a flat whitespace dump with no column headers or row relationships — the data is nearly impossible to find semantically. A CSV with columns Date, Revenue, Region becomes 2024-01-01 42000 North 2024-01-02 38000 South … — meaningless without structure.

Plain text files (TXT, MD, JSON, etc.): These went through Tika unnecessarily, adding a network round-trip to the Tika server and occasionally mangling the encoding of files containing 8-bit characters.

Solution

A _text_subcategory() router dispatches each file to the right parser before it ever reaches Tika. Tika is now only used for the formats it's actually good at: PDF, DOCX, DOC, RTF, PPT.

New file: backend/services/tabular_service.py

ingest_tabular(bytes, filename, mime_type) -> str

  • CSV / TSV — stdlib csv module, zero dependencies
  • XLSX / ODS — pandas + openpyxl, handles multiple sheets; each sheet becomes a ## Sheet: name section
  • XLS — pandas + xlrd (legacy Excel)
  • Output: Markdown table with column headers preserved — the LLM (and semantic search) can now understand column relationships
  • Up to 500 rows rendered as a full Markdown table; additional rows appended as raw CSV lines so they're still indexed
  • Up to 50 columns per table (wider sheets truncated with '…')
  • Never raises — returns a human-readable placeholder on failure
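To make the contract above concrete, here is a minimal sketch of the CSV/TSV path of ingest_tabular(). The function name and signature come from the PR; the constant names, the placeholder wording, and all internals are assumptions for illustration (the real implementation also handles XLSX/ODS/XLS via pandas):

```python
import csv
import io

MAX_TABLE_ROWS = 500  # caps taken from the PR description
MAX_COLS = 50

def ingest_tabular(data: bytes, filename: str, mime_type: str) -> str:
    """Render tabular bytes as a Markdown table (CSV/TSV path only)."""
    try:
        text = data.decode("utf-8", errors="replace")
        delimiter = "\t" if filename.lower().endswith(".tsv") else ","
        rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
        if not rows:
            return f"[empty tabular file: {filename}]"
        header, body = rows[0][:MAX_COLS], rows[1:]
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        for row in body[:MAX_TABLE_ROWS]:
            lines.append("| " + " | ".join(row[:MAX_COLS]) + " |")
        # Overflow rows are appended as raw CSV lines so the embedder
        # still indexes them, even though they are not in the table.
        for row in body[MAX_TABLE_ROWS:]:
            lines.append(delimiter.join(row[:MAX_COLS]))
        return "\n".join(lines)
    except Exception as exc:
        # Never raise into the ingestion pipeline; store a readable placeholder.
        return f"[could not parse tabular file {filename}: {exc}]"
```

The key property is the last except clause: a corrupt spreadsheet degrades to a placeholder string instead of failing the whole ingest batch.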

ingest_plaintext(bytes, filename) -> str

  • UTF-8 decode with latin-1 fallback (no silent data loss)
  • Wraps content in a fenced code block tagged with the file extension so the LLM understands the format during retrieval
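A sketch of what those two bullets imply, again with the signature from the PR and the body assumed for illustration:

```python
def ingest_plaintext(data: bytes, filename: str) -> str:
    """Decode plain-text bytes and wrap them in a tagged code fence."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this
        # fallback cannot fail and no data is silently dropped.
        text = data.decode("latin-1")
    # Tag the fence with the extension so retrieval sees the format.
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return f"```{ext}\n{text}\n```"
```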

Updated: backend/api_endpoints/documents/handler.py

  • New _TABULAR_MIMES and _PLAINTEXT_MIMES sets
  • Extended _EXT_TO_MIME fallback table covers: csv tsv xls xlsx ods txt md rst py js json xml html
  • _text_subcategory(mime, filename) returns 'tabular' | 'plaintext' | 'document' — uses extension fallback when MIME is generic text/plain
  • Routing in IngestDocumentsHandler:
| Subcategory | Files | Parser |
| --- | --- | --- |
| tabular | CSV, TSV, XLS, XLSX, ODS | ingest_tabular() |
| plaintext | TXT, MD, RST, PY, JS, JSON, XML, HTML | ingest_plaintext() |
| document | PDF, DOCX, DOC, RTF, PPT | Apache Tika (unchanged) |

Test plan

  • Upload data.csv with headers → document_text contains Markdown table with column headers; ask "what is the revenue for region North?" → correct answer
  • Upload report.xlsx with 3 sheets → all 3 sheets in document_text with ## Sheet: headings
  • Upload notes.txt → plain decode, no Tika call, encoding preserved
  • Upload schema.json → code-fenced JSON stored and searchable
  • Upload report.pdf → Tika path still used, no regression
  • Upload CSV with 600 rows → first 500 as Markdown table, remaining 100 appended as plain CSV
  • Upload XLS without xlrd installed → human-readable placeholder stored

Depends on: claude/video-frame-analysis, claude/audio-transcription-pipeline, claude/image-document-analysis, claude/add-multimodal-support-QBQca

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu
