Observations from Building a Documentation Agent (Consumer Perspective) #3

@RohitDocs

First, thank you

This spec is excellent, just as I mentioned to you on LinkedIn. The clarity around truncation limits, llms.txt discovery, and markdown availability would have saved us significant trial-and-error. Thank you for putting this together and sharing it with the community.

Our context

We built a documentation Q&A system using vector search and LLMs. Along the way, we encountered challenges that your spec also addresses, and we worked around some gaps ourselves. We are sharing these observations in case any are useful. Please feel free to pick what applies; we recognize RAG/indexing may be intentionally out of scope.


1. Chunking Strategy

We split documents into smaller segments (500-2000 chars) for vector embedding. The spec's 50K page-size guidance helps with truncation, but we found:

  • Content with clear section boundaries chunked cleanly
  • Content that flowed across sections caused context loss
  • Metadata (source, section title) needed to be preserved per chunk

Potential gap: Guidance on structuring content for clean splitting (self-contained paragraphs, explicit section markers).
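
To make the chunking observation concrete, here is a minimal sketch of section-aware splitting. It is not our production code: the `Chunk` type, the `chunk_markdown` name, and the assumptions that headings start with `#` and paragraphs are separated by blank lines are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str   # document the chunk came from
    section: str  # nearest heading above the chunk
    text: str


def chunk_markdown(doc: str, source: str, max_chars: int = 2000) -> list[Chunk]:
    """Split on markdown headings, then pack whole paragraphs up to max_chars.

    Assumption: a paragraph longer than max_chars is kept whole, not cut.
    """
    chunks: list[Chunk] = []
    section = ""
    paragraphs: list[str] = []

    def flush() -> None:
        # Pack the buffered paragraphs of the current section into chunks.
        buf = ""
        for para in paragraphs:
            if buf and len(buf) + len(para) + 2 > max_chars:
                chunks.append(Chunk(source, section, buf))
                buf = ""
            buf = f"{buf}\n\n{para}" if buf else para
        if buf:
            chunks.append(Chunk(source, section, buf))
        paragraphs.clear()

    for block in doc.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            flush()  # a heading closes the previous section
            heading, _, rest = block.partition("\n")
            section = heading.lstrip("#").strip()
            if rest.strip():
                paragraphs.append(rest.strip())
        else:
            paragraphs.append(block)
    flush()
    return chunks
```

Because chunks never cross a heading, each one carries the section title it belongs to, which is the per-chunk metadata mentioned above. Content that flows across sections would still lose context with this approach.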


2. Content Categorization

We manually categorized content by type (API, config, user guide) and business domain to improve retrieval relevance. The current llms.txt is flat.

Potential gap: Optional metadata fields or hierarchical structure in llms.txt.
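
As an illustration of the kind of categorization we did by hand, the sketch below reads link lines of the form `- [Title](url): description` from an llms.txt file and tags each with a content type. The `TYPE_RULES` mapping and the `categorize` function are hypothetical; the URL fragments are assumptions, not anything the spec defines.

```python
import re

# Hypothetical mapping from URL path fragments to content types.
TYPE_RULES = {"/api/": "api", "/config/": "config", "/guides/": "user-guide"}

# Matches "- [Title](url)" with an optional ": description" suffix.
LINK = re.compile(r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<desc>.*))?$")


def categorize(llms_txt: str) -> list[dict]:
    """Return one record per link, with its H2 section and a guessed type."""
    entries = []
    section = ""
    for line in llms_txt.splitlines():
        line = line.strip()
        if line.startswith("## "):
            section = line[3:].strip()
            continue
        m = LINK.match(line)
        if not m:
            continue
        url = m["url"]
        # First matching rule wins; anything unmatched is "other".
        content_type = next(
            (t for frag, t in TYPE_RULES.items() if frag in url), "other"
        )
        entries.append(
            {"title": m["title"], "url": url, "section": section, "type": content_type}
        )
    return entries
```

If the file itself carried optional type or domain metadata, consumers would not need to guess from URL patterns like this.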


3. Multi-Model Considerations

We tested across multiple models with different context windows. The 50K recommendation works for some but not all.

Potential gap: Tiered size guidance (e.g., 10K/50K/100K) for different model capabilities.
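
One way a consumer could apply tiered guidance is sketched below. The 10K/50K/100K budgets come from the suggestion above and are read here as characters; the context-window thresholds and the `page_budget` name are assumptions for illustration only.

```python
# Hypothetical tiers: (minimum context window in tokens, page budget in characters).
# Ordered from largest window to smallest.
TIERS = [(128_000, 100_000), (32_000, 50_000), (0, 10_000)]


def page_budget(context_tokens: int) -> int:
    """Pick the page-size budget for a model with the given context window."""
    for min_tokens, budget in TIERS:
        if context_tokens >= min_tokens:
            return budget
    return TIERS[-1][1]  # fall back to the smallest tier
```

A pipeline could then split or summarize any page longer than `page_budget(window)` before sending it to the model.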


4. Vector Search / Embedding Quality

Certain content structures embedded well; others didn't. Dense, jargon-heavy paragraphs retrieved better than verbose explanations spanning multiple concepts.

Potential gap: Guidance on embedding-friendly content structure (if RAG is in scope).


5. HTML Parsing Challenges

Some HTML patterns caused extraction issues:

  • Deeply nested non-semantic <div> structures
  • Content inside <script> or <template> tags
  • Inline styles bloating converted markdown

The spec mentions HTML abstractly.

Potential gap: Concrete "do this, not that" examples.
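
For reference, the patterns listed above can be handled on the consumer side roughly as follows. This is a sketch using Python's standard `html.parser`, assuming the goal is to keep `<script>`, `<style>`, and `<template>` content out of the extracted text; documentation that exists only inside those tags would need a different approach. The `TextExtractor` and `visible_text` names are illustrative.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping tags whose content is not rendered."""

    SKIP = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs (including inline style="...") are ignored on purpose.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text nodes of an HTML fragment, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Nested `<div>` wrappers add no text of their own here, but they also carry no structure, so headings and lists expressed only through `<div>` classes are lost. Semantic tags would let an extractor preserve them.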


6. Versioning and Change Detection

We built incremental processing pipelines with versioned backups. The spec treats docs as static.

Potential gap: Guidance on version/diff-friendly formats for agents that cache or index content.
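
The core of incremental processing can be sketched with content hashes. This is a simplified illustration, not our pipeline: `content_hash`, `changed_pages`, and the in-memory `seen` dictionary are hypothetical stand-ins for whatever persistent store an indexer uses.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a page's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_pages(pages: dict[str, str], seen: dict[str, str]) -> list[str]:
    """Return URLs whose content differs from the stored hash.

    `seen` maps URL to the last hash processed and is updated in place.
    """
    changed = []
    for url, text in pages.items():
        digest = content_hash(text)
        if seen.get(url) != digest:
            changed.append(url)
            seen[url] = digest
    return changed
```

Only the returned URLs need re-chunking and re-embedding. If docs exposed a version or last-modified signal, consumers could skip fetching unchanged pages entirely instead of hashing them after download.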


To be clear

These may or may not be applicable to your vision; every use case is different. The spec's focus on real-time coding agents is clear, and some of these observations come from a RAG/indexing perspective that may be out of scope.

Thank you again for this work.
