
feat: native tabular (CSV/Excel) and plain-text ingestion#114

Open
nv78 wants to merge 1 commit into claude/video-frame-analysis from claude/tabular-and-text-ingestion

Conversation


@nv78 nv78 commented Mar 24, 2026

Problem

CSV / spreadsheets: Tika extracts these as a flat whitespace dump with no column headers or row relationships — the data is nearly impossible to find semantically. A CSV with columns Date, Revenue, Region becomes 2024-01-01 42000 North 2024-01-02 38000 South … — meaningless without structure.

Plain text files (TXT, MD, JSON, etc.): These went through Tika unnecessarily, adding a network round-trip to the Tika server and occasionally mangling the encoding of files containing 8-bit characters.

Solution

A _text_subcategory() router dispatches each file to the right parser before it ever reaches Tika. Tika is now only used for the formats it's actually good at: PDF, DOCX, DOC, RTF, PPT.

New file: backend/services/tabular_service.py

ingest_tabular(bytes, filename, mime_type) -> str

  • CSV / TSV — stdlib csv module, zero dependencies
  • XLSX / ODS — pandas + openpyxl, handles multiple sheets; each sheet becomes a ## Sheet: name section
  • XLS — pandas + xlrd (legacy Excel)
  • Output: Markdown table with column headers preserved — the LLM (and semantic search) can now understand column relationships
  • Up to 500 rows rendered as a full Markdown table; additional rows appended as raw CSV lines so they're still indexed
  • Up to 50 columns per table (wider sheets truncated with '…')
  • Never raises — returns a human-readable placeholder on failure
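To make the contract above concrete, here is a minimal sketch of the CSV/TSV path of ingest_tabular(). The function name and signature come from the PR; the constant names, the placeholder wording, and all internals are assumptions for illustration (the real implementation also handles XLSX/ODS/XLS via pandas):

```python
import csv
import io

MAX_TABLE_ROWS = 500  # caps taken from the PR description
MAX_COLS = 50

def ingest_tabular(data: bytes, filename: str, mime_type: str) -> str:
    """Render tabular bytes as a Markdown table (CSV/TSV path only)."""
    try:
        text = data.decode("utf-8", errors="replace")
        delimiter = "\t" if filename.lower().endswith(".tsv") else ","
        rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
        if not rows:
            return f"[empty tabular file: {filename}]"
        header, body = rows[0][:MAX_COLS], rows[1:]
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        for row in body[:MAX_TABLE_ROWS]:
            lines.append("| " + " | ".join(row[:MAX_COLS]) + " |")
        # Overflow rows are appended as raw CSV lines so the embedder
        # still indexes them, even though they are not in the table.
        for row in body[MAX_TABLE_ROWS:]:
            lines.append(delimiter.join(row[:MAX_COLS]))
        return "\n".join(lines)
    except Exception as exc:
        # Never raise into the ingestion pipeline; store a readable placeholder.
        return f"[could not parse tabular file {filename}: {exc}]"
```

The key property is the last except clause: a corrupt spreadsheet degrades to a placeholder string instead of failing the whole ingest batch.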

ingest_plaintext(bytes, filename) -> str

  • UTF-8 decode with latin-1 fallback (no silent data loss)
  • Wraps content in a fenced code block tagged with the file extension so the LLM understands the format during retrieval
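A sketch of what those two bullets imply, again with the signature from the PR and the body assumed for illustration:

```python
def ingest_plaintext(data: bytes, filename: str) -> str:
    """Decode plain-text bytes and wrap them in a tagged code fence."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this
        # fallback cannot fail and no data is silently dropped.
        text = data.decode("latin-1")
    # Tag the fence with the extension so retrieval sees the format.
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return f"```{ext}\n{text}\n```"
```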

Updated: backend/api_endpoints/documents/handler.py

  • New _TABULAR_MIMES and _PLAINTEXT_MIMES sets
  • Extended _EXT_TO_MIME fallback table covers: csv tsv xls xlsx ods txt md rst py js json xml html
  • _text_subcategory(mime, filename) returns 'tabular' | 'plaintext' | 'document' — uses extension fallback when MIME is generic text/plain
  • Routing in IngestDocumentsHandler:
| Subcategory | Files | Parser |
| --- | --- | --- |
| tabular | CSV, TSV, XLS, XLSX, ODS | ingest_tabular() |
| plaintext | TXT, MD, RST, PY, JS, JSON, XML, HTML | ingest_plaintext() |
| document | PDF, DOCX, DOC, RTF, PPT | Apache Tika (unchanged) |

Test plan

  • Upload data.csv with headers → document_text contains Markdown table with column headers; ask "what is the revenue for region North?" → correct answer
  • Upload report.xlsx with 3 sheets → all 3 sheets in document_text with ## Sheet: headings
  • Upload notes.txt → plain decode, no Tika call, encoding preserved
  • Upload schema.json → code-fenced JSON stored and searchable
  • Upload report.pdf → Tika path still used, no regression
  • Upload CSV with 600 rows → first 500 as Markdown table, remaining 100 appended as plain CSV
  • Upload XLS without xlrd installed → human-readable placeholder stored

Depends on: claude/video-frame-analysis, claude/audio-transcription-pipeline, claude/image-document-analysis, claude/add-multimodal-support-QBQca

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu
