feat(indexing): add index-file-document tool for file uploads #88

Draft
adityamparikh wants to merge 2 commits into apache:main from adityamparikh:index-text

Summary

Adds a new MCP tool, index-file-document, that lets users upload files (PDF, Word, Excel, PowerPoint, etc.) through their AI chat client and have the extracted text indexed into Solr for full-text search.

Closes #69

How it works

When a user uploads a file in an AI chat client like Claude Desktop, the client handles text extraction — not the MCP server. Here's the flow:

  1. User uploads a file (e.g., report.pdf) in Claude Desktop
  2. Claude Desktop extracts text from the PDF before Claude ever sees it — Claude receives the readable text content, not the raw binary bytes
  3. Claude calls the index-file-document tool, passing the extracted text as content and the original filename (report.pdf) as filename
  4. The MCP server indexes a SolrInputDocument with id (auto-generated UUID), content (the full text), and filename (for filtering/display)
  5. User can now search over the indexed content using existing search tools

This design means no binary parsing library (Tika, Docling, etc.) is needed on the server side — the AI chat client already does the heavy lifting of text extraction before invoking MCP tools. This keeps the server lightweight and avoids ~100MB of transitive dependencies.
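Step 4 of the flow above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: a plain `Map` stands in for `SolrInputDocument` so the example runs without SolrJ on the classpath, and the class and method names (`FileDocumentSketch`, `createDocument`) are hypothetical. Only the three field names come from the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

class FileDocumentSketch {

    // Builds the three fields named in step 4: id, content, filename.
    // Assumption: a Map is used here in place of SolrInputDocument.
    static Map<String, Object> createDocument(String content, String filename) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", UUID.randomUUID().toString()); // auto-generated UUID
        doc.put("content", content);                 // text already extracted by the chat client
        doc.put("filename", filename);               // kept for filtering and display
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = createDocument("Quarterly revenue grew.", "report.pdf");
        System.out.println(doc.keySet()); // [id, content, filename]
    }
}
```

Because the server only ever sees text, nothing in this path needs to know whether the upload was a PDF, a spreadsheet, or a slide deck.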

Tool signature

index-file-document(collection, content, filename)
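| Parameter  | Description                                                                       |
|------------|-----------------------------------------------------------------------------------|
| collection | Solr collection to index into                                                     |
| content    | Text content extracted from the file by the chat client                           |
| filename   | Original filename with extension (e.g. report.pdf); stored as a searchable field  |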

Changes

  • FileDocumentCreator (new) — @Component that creates a SolrInputDocument with id, content, and filename fields. Does not implement SolrDocumentCreator because it requires a filename parameter in addition to content.
  • IndexingDocumentCreator — Added FileDocumentCreator dependency and createSchemalessDocumentsFromFile() delegation method
  • IndexingService — New indexFileDocument() MCP tool with @PreAuthorize("isAuthenticated()")
  • AGENTS.md — Updated architecture docs
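The delegation chain listed above might look roughly like the sketch below. The class names and the `createSchemalessDocumentsFromFile` / `indexFileDocument` method names come from this description; everything else is an assumption made for illustration, including the `create` method name, all parameter and return types, and the use of a `Map` in place of `SolrInputDocument`. Spring and MCP annotations are shown only as comments so the example stays self-contained.

```java
import java.util.List;
import java.util.Map;

class DelegationSketch {

    // Stand-in for the new @Component. The method name create() is hypothetical.
    static class FileDocumentCreator {
        Map<String, Object> create(String content, String filename) {
            return Map.of("content", content, "filename", filename);
        }
    }

    // Existing facade: gains a FileDocumentCreator dependency and one delegation method.
    static class IndexingDocumentCreator {
        private final FileDocumentCreator fileDocumentCreator;

        IndexingDocumentCreator(FileDocumentCreator fileDocumentCreator) {
            this.fileDocumentCreator = fileDocumentCreator;
        }

        // Assumption: returns a list, inferred only from the plural method name.
        List<Map<String, Object>> createSchemalessDocumentsFromFile(String content, String filename) {
            return List.of(fileDocumentCreator.create(content, filename));
        }
    }

    // Stand-in for the service. In the PR this method is exposed as an MCP tool
    // and guarded by @PreAuthorize("isAuthenticated()").
    static class IndexingService {
        private final IndexingDocumentCreator documentCreator;

        IndexingService(IndexingDocumentCreator documentCreator) {
            this.documentCreator = documentCreator;
        }

        // Assumption: returns the document count; the real return type is not stated.
        int indexFileDocument(String collection, String content, String filename) {
            List<Map<String, Object>> docs =
                    documentCreator.createSchemalessDocumentsFromFile(content, filename);
            // A real implementation would send docs to the named Solr collection here.
            return docs.size();
        }
    }

    public static void main(String[] args) {
        IndexingService service =
                new IndexingService(new IndexingDocumentCreator(new FileDocumentCreator()));
        System.out.println(service.indexFileDocument("docs", "extracted text", "report.pdf")); // 1
    }
}
```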

Test plan

  • FileDocumentCreatorTest — 9 unit tests: valid input, null/empty/blank content, null/empty filename, oversized content, unique IDs, multiline content
  • FileIndexingTest — Spring Boot integration test through IndexingDocumentCreator
  • IndexingServiceTest — 2 new Testcontainers integration tests verifying index-then-search round-trip (search by content, search by filename)
  • IndexingServiceTest.UnitTests — 2 new mocked unit tests for the MCP tool method
  • Existing test constructors updated for new FileDocumentCreator parameter
  • ./gradlew build passes with all tests green
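The input cases that FileDocumentCreatorTest is described as covering suggest validation along the following lines. This is a guess at the shape of the checks, not the PR's implementation: the size limit (`MAX_CONTENT_CHARS`), the exception type, and the `validate` / `rejects` helper names are all hypothetical, since the description names the test cases but not the threshold or the failure behavior.

```java
class FileDocumentValidationSketch {

    // Hypothetical limit: the PR mentions an oversized-content test but not the threshold.
    static final int MAX_CONTENT_CHARS = 1_000_000;

    // Assumption: invalid input is rejected with IllegalArgumentException.
    static void validate(String content, String filename) {
        if (content == null || content.isBlank()) {
            throw new IllegalArgumentException("content must not be null, empty, or blank");
        }
        if (content.length() > MAX_CONTENT_CHARS) {
            throw new IllegalArgumentException("content exceeds the maximum allowed size");
        }
        if (filename == null || filename.isEmpty()) {
            throw new IllegalArgumentException("filename must not be null or empty");
        }
    }

    // Convenience helper for the demo: true when validate() rejects the input.
    static boolean rejects(String content, String filename) {
        try {
            validate(content, filename);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String oversized = "x".repeat(MAX_CONTENT_CHARS + 1);
        System.out.println(rejects(null, "report.pdf"));                 // true
        System.out.println(rejects("   ", "report.pdf"));                // true
        System.out.println(rejects("some text", ""));                    // true
        System.out.println(rejects(oversized, "report.pdf"));            // true
        System.out.println(rejects("line one\nline two", "report.pdf")); // false
    }
}
```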

🤖 Generated with Claude Code

claude and others added 2 commits March 31, 2026 20:58
Add a new MCP tool `index-file-document` that accepts text content
extracted by AI chat clients from uploaded files (PDF, Word, Excel,
etc.) and indexes it into Solr for full-text search.

New components:
- FileDocumentCreator: creates SolrInputDocument with id, content,
  and filename fields from pre-extracted text
- IndexingDocumentCreator: new createSchemalessDocumentsFromFile()
  delegation method
- IndexingService: new indexFileDocument() MCP tool

Closes apache#69

Signed-off-by: Aditya Parikh <adityamparikh@gmail.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: adityamparikh <aditya.m.parikh@gmail.com>

Development

Successfully merging this pull request may close these issues.

Ability to index and search over markdown documents