feat: native tabular (CSV/Excel) and plain-text ingestion #114
Open
nv78 wants to merge 1 commit into claude/video-frame-analysis from
Problem: CSV and spreadsheet files sent through Tika return a flat
whitespace dump that loses all column/row structure, making the data
nearly impossible to find via semantic search. Plain-text files (TXT,
MD, JSON, etc.) went through Tika unnecessarily, adding latency and
sometimes garbling encoding.
Solution: add a text subcategory router in IngestDocumentsHandler that
dispatches to the right parser before reaching Tika.
services/tabular_service.py (new):
- ingest_tabular(bytes, filename, mime_type) -> str
  - CSV/TSV: stdlib csv module, renders as a Markdown table
  - XLSX/ODS: pandas + openpyxl; handles multiple sheets, and each becomes a
    '## Sheet: name' section with a Markdown table
  - XLS: pandas + xlrd (legacy Excel)
  - Up to 500 rows rendered as a full Markdown table; additional rows
    appended as plain CSV so they are still indexed by the embedder
  - Up to 50 columns per table (wider sheets are truncated with '…')
  - Never raises; returns a placeholder string on parse failure
- ingest_plaintext(bytes, filename) -> str
  - UTF-8 decode with latin-1 fallback
  - Wraps content in a fenced code block tagged with the file extension
    so the LLM understands the format during retrieval
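The CSV path described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's actual code: the name csv_to_markdown, the helper functions, the placeholder wording, and the latin-1 fallback for CSV input are all assumptions. Only the 500-row and 50-column limits come from the description.

```python
import csv
import io

MAX_TABLE_ROWS = 500  # row cap stated in the description above
MAX_COLUMNS = 50      # column cap stated in the description above


def _decode(data: bytes) -> str:
    """Decode as UTF-8, falling back to latin-1 (assumed for CSV input)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")  # maps every byte, so it cannot fail


def _md_row(cells):
    # Escape pipes and flatten newlines so a cell cannot break the table.
    cleaned = [c.replace("|", "\\|").replace("\n", " ").strip() for c in cells]
    return "| " + " | ".join(cleaned) + " |"


def csv_to_markdown(data: bytes, delimiter: str = ",") -> str:
    """Hypothetical helper: render CSV bytes as a Markdown table."""
    try:
        reader = csv.reader(io.StringIO(_decode(data), newline=""), delimiter=delimiter)
        rows = [row for row in reader if row]  # skip blank lines
        if not rows:
            return "[empty table]"
        header, body = rows[0], rows[1:]
        width = min(len(header), MAX_COLUMNS)
        truncated = len(header) > MAX_COLUMNS

        def fit(row):
            # Pad short rows, clip wide ones, and mark clipped tables with '…'.
            cells = (row + [""] * width)[:width]
            return cells + ["…"] if truncated else cells

        lines = [_md_row(fit(header)), _md_row(["---"] * len(fit(header)))]
        lines += [_md_row(fit(row)) for row in body[:MAX_TABLE_ROWS]]

        # Rows past the cap are appended as plain CSV so they stay indexable.
        overflow = body[MAX_TABLE_ROWS:]
        if overflow:
            buf = io.StringIO()
            csv.writer(buf, lineterminator="\n").writerows(overflow)
            lines += ["", buf.getvalue().rstrip("\n")]
        return "\n".join(lines)
    except Exception as exc:  # never raise; return a placeholder instead
        return f"[could not parse table: {exc}]"
```

Calling csv_to_markdown(b"Date,Revenue,Region\n2024-01-01,42000,North\n") yields a table whose first line is "| Date | Revenue | Region |", so each value stays attached to its column header. Passing delimiter="\t" would cover TSV under the same assumptions.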
documents/handler.py:
- New _TABULAR_MIMES and _PLAINTEXT_MIMES sets
- Extended _EXT_TO_MIME table covers csv, tsv, xls, xlsx, ods, txt, md,
  rst, py, js, json, xml, html
- _text_subcategory(mime, filename) returns 'tabular' | 'plaintext' |
  'document'; falls back to the extension when the MIME type is generic
- IngestDocumentsHandler routes:
  - tabular   -> ingest_tabular()   -> chunk
  - plaintext -> ingest_plaintext() -> chunk
  - document  -> Tika (PDF/DOCX/DOC/RTF/PPT, unchanged)
https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu
Problem

CSV / spreadsheets: Tika extracts these as a flat whitespace dump with no column headers or row relationships, so the data is nearly impossible to find semantically. A CSV with columns Date, Revenue, Region becomes 2024-01-01 42000 North 2024-01-02 38000 South …, which is meaningless without structure.

Plain text files (TXT, MD, JSON, etc.): These went through Tika unnecessarily, adding a network round trip to the Tika server and occasionally mangling encoding on files with 8-bit characters.

Solution

A _text_subcategory() router dispatches each file to the right parser before it ever reaches Tika. Tika is now only used for the formats it's actually good at: PDF, DOCX, DOC, RTF, PPT.

New file: backend/services/tabular_service.py

- ingest_tabular(bytes, filename, mime_type) -> str
  - CSV/TSV: stdlib csv module, zero dependencies
  - XLSX/ODS: pandas + openpyxl, handles multiple sheets; each sheet becomes a ## Sheet: name section
  - XLS: pandas + xlrd (legacy Excel)
  - Up to 50 columns per table (wider sheets are truncated with …)
- ingest_plaintext(bytes, filename) -> str

Updated: backend/api_endpoints/documents/handler.py

- _TABULAR_MIMES and _PLAINTEXT_MIMES sets
- _EXT_TO_MIME fallback table covers: csv tsv xls xlsx ods txt md rst py js json xml html
- _text_subcategory(mime, filename) returns 'tabular' | 'plaintext' | 'document'; uses extension fallback when MIME is generic text/plain
- IngestDocumentsHandler routes:
  - tabular -> ingest_tabular()
  - plaintext -> ingest_plaintext()
  - document -> Tika

Test plan

- data.csv with headers → document_text contains a Markdown table with column headers; ask "what is the revenue for region North?" → correct answer
- report.xlsx with 3 sheets → all 3 sheets in document_text with ## Sheet: headings
- notes.txt → plain decode, no Tika call, encoding preserved
- schema.json → code-fenced JSON stored and searchable
- report.pdf → Tika path still used, no regression
- […] xlrd installed → human-readable placeholder stored

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu
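For the notes.txt and schema.json checks in the test plan, the plain-text path might look roughly like the sketch below. It follows the described behavior (UTF-8 with latin-1 fallback, fenced block tagged with the extension). Growing the fence when the content already contains backticks is an added assumption, not something the PR states.

```python
import os

TICK = "`"


def ingest_plaintext(data: bytes, filename: str) -> str:
    """Sketch: decode bytes and wrap them in an extension-tagged code fence."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        text = data.decode("latin-1")  # maps every byte, so it cannot fail

    # 'schema.json' -> 'json'; files without an extension get an untagged fence.
    ext = os.path.splitext(filename)[1].lstrip(".").lower()

    # Assumption: lengthen the fence until it no longer appears in the content,
    # so embedded backtick runs cannot close the block early.
    fence = TICK * 3
    while fence in text:
        fence += TICK

    body = text.rstrip("\n")
    return f"{fence}{ext}\n{body}\n{fence}"
```

Because the decode happens locally, no Tika call is involved, and a latin-1 encoded file such as one containing "café" keeps its accented characters.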