list_documents() returns all documents -> unusable with large workspaces #300

Description

@PhunkyBob

Problem

When a workspace contains a large number of indexed documents (e.g. 10 000+), using list_documents() as an LLM tool returns the full metadata of every document in a single JSON payload. This causes context window overflows and 500 errors on hosted LLM providers (tested with Azure OpenAI + gpt-oss-120).

Root cause: _load_workspace() loads all document metadata from _meta.json into self.documents at startup, and the only discovery mechanism available to the LLM is iterating the entire dict.
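
To make the scaling concrete, here is a rough sketch with synthetic metadata. The entry shape, the description length, and the helper names `fake_meta` and `list_documents_payload` are made up for illustration and are not taken from PageIndex; only the `doc_name` / `doc_description` field names come from this issue.

```python
import json


def fake_meta(i):
    # Synthetic entry (assumption): real metadata may carry more fields.
    return {
        "doc_name": f"document_{i}.pdf",
        "doc_description": "A short description of the document. " * 3,
    }


def list_documents_payload(n_docs):
    # Mimics "dump every entry of the documents dict into one JSON string".
    documents = {f"doc-{i}": fake_meta(i) for i in range(n_docs)}
    return json.dumps(documents)


# The payload grows roughly linearly with the number of documents.
for n in (10, 1_000, 10_000):
    print(n, "docs ->", len(list_documents_payload(n)), "characters")
```

Whatever the exact per-document size is in a real workspace, the tool output scales with the workspace rather than with the question being asked.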

Expected behavior

The library should provide a way to discover relevant documents without loading all metadata into the LLM context.

Options:

  • A search_documents(query) method that filters by keyword on doc_name / doc_description
  • A global summary / table-of-contents across the workspace that the LLM can use for routing
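
A minimal sketch of the first option, assuming the workspace keeps metadata in a dict keyed by document id with `doc_name` and `doc_description` fields. The `limit` parameter, the standalone-function form, and the returned JSON shape are assumptions for illustration, not existing PageIndex API.

```python
import json


def search_documents(documents, query, limit=20):
    """Return a compact JSON list of documents whose name or description
    contains every keyword of `query` (case-insensitive substring match)."""
    keywords = query.lower().split()
    matches = []
    for doc_id, meta in documents.items():
        name = meta.get("doc_name") or ""
        description = meta.get("doc_description") or ""
        haystack = f"{name} {description}".lower()
        # An empty query has no keywords, so it matches the first `limit` docs.
        if all(keyword in haystack for keyword in keywords):
            matches.append(
                {"doc_id": doc_id, "doc_name": name, "doc_description": description}
            )
            if len(matches) >= limit:
                break  # cap the payload regardless of workspace size
    return json.dumps({"query": query, "count": len(matches), "documents": matches})
```

With a cap like `limit`, the size of the tool result is bounded by the number of hits returned rather than by the number of documents in the workspace.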

Environment

  • PageIndex: latest (cloned from main)
  • Workspace size: ~10 000 documents
  • LLM provider: Azure OpenAI
  • Model: gpt-oss-120
