Title: Observations from Building a Documentation Agent (Consumer Perspective)
First, thank you
This spec is excellent, just as I mentioned to you on LinkedIn. The clarity around truncation limits, llms.txt discovery, and markdown availability would have saved us significant trial-and-error. Thank you for putting this together and sharing it with the community.
Our context
We built a documentation Q&A system using vector search and LLMs. Along the way, we encountered challenges that your spec also addresses, and we navigated some gaps ourselves. We are sharing these observations in case any are useful. Please feel free to pick what applies; we recognize RAG/indexing may be intentionally out of scope.
1. Chunking Strategy
We split documents into smaller segments (500-2000 chars) for vector embedding. The spec's 50K page-size guidance helps with truncation, but we found:
- Content with clear section boundaries chunked cleanly
- Content that flowed across sections caused context loss
- Metadata (source, section title) needed to be preserved per chunk
Potential gap: Guidance on structuring content for clean splitting (self-contained paragraphs, explicit section markers).
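To make the point concrete, here is a minimal sketch of section-aware chunking that keeps source and section title on every chunk. It is an illustration under stated assumptions, not a definitive implementation: `Chunk`, `split_sections`, `chunk_markdown`, and the 2000-character default are hypothetical names and values, the input is assumed to be markdown with `#`-style headings, and heading-like lines inside code fences are not handled.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^#{1,6}\s+(.+?)\s*$")


@dataclass
class Chunk:
    source: str   # where the text came from (hypothetical field)
    section: str  # nearest heading above the text
    text: str


def split_sections(markdown: str):
    """Yield (section_title, body) pairs, splitting on heading lines."""
    title, lines = "(preamble)", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            yield title, "\n".join(lines)
            title, lines = match.group(1), []
        else:
            lines.append(line)
    yield title, "\n".join(lines)


def chunk_markdown(source: str, markdown: str, max_chars: int = 2000) -> list[Chunk]:
    """Pack whole paragraphs into chunks that never cross a section boundary."""
    chunks: list[Chunk] = []
    for title, body in split_sections(markdown):
        current = ""
        for para in re.split(r"\n\s*\n", body):
            para = para.strip()
            if not para:
                continue
            candidate = f"{current}\n\n{para}" if current else para
            if current and len(candidate) > max_chars:
                chunks.append(Chunk(source, title, current))
                current = para
            else:
                # A single paragraph longer than max_chars is kept whole.
                current = candidate
        if current:
            chunks.append(Chunk(source, title, current))
    return chunks
```

Because paragraphs are never cut and sections never merge, content written as self-contained paragraphs under explicit headings survives this kind of splitting intact, while prose that depends on the previous section loses its context.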
2. Content Categorization
We manually categorized content by type (API, config, user guide) and business domain to improve retrieval relevance. The current llms.txt is flat.
Potential gap: Optional metadata fields or hierarchical structure in llms.txt.
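As a rough illustration of why per-entry metadata helps, the sketch below filters candidate pages by type and domain before any similarity ranking. Everything here is hypothetical: the `Entry` fields, the label values, and `filter_entries` are invented for the example and are not part of llms.txt or the spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    url: str
    doc_type: str  # hypothetical labels such as "api", "config", "user-guide"
    domain: str    # hypothetical business domain such as "billing"


def filter_entries(
    entries: list[Entry],
    doc_type: Optional[str] = None,
    domain: Optional[str] = None,
) -> list[Entry]:
    """Keep entries matching every filter that was supplied."""
    return [
        e
        for e in entries
        if (doc_type is None or e.doc_type == doc_type)
        and (domain is None or e.domain == domain)
    ]
```

With a flat list, an agent has to fetch or embed every page to learn what kind of content it holds; with even two optional fields it can narrow the candidates first.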
3. Multi-Model Considerations
We tested across multiple models with different context windows. The 50K recommendation works for some but not all.
Potential gap: Tiered size guidance (e.g., 10K/50K/100K) for different model capabilities.
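A tiered scheme could be checked mechanically. The sketch below assumes the tiers are measured in characters and uses the 10K/50K/100K values from the example above; `TIERS` and `smallest_fitting_tier` are hypothetical names, and the unit is an assumption rather than something the spec states.

```python
from bisect import bisect_left
from typing import Optional

# Hypothetical tiers, assumed to be character counts, sorted ascending.
TIERS = (10_000, 50_000, 100_000)


def smallest_fitting_tier(page_chars: int, tiers: tuple[int, ...] = TIERS) -> Optional[int]:
    """Return the smallest tier that holds the page, or None if none does."""
    index = bisect_left(tiers, page_chars)
    return tiers[index] if index < len(tiers) else None
```

A docs build could run a check like this per page and report which tier of models can read each page without truncation.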
4. Vector Search / Embedding Quality
Certain content structures embedded well; others didn't. Dense, jargon-heavy paragraphs retrieved better than verbose explanations spanning multiple concepts.
Potential gap: Guidance on embedding-friendly content structure (if RAG is in scope).
5. HTML Parsing Challenges
Some HTML patterns caused extraction issues:
- Deeply nested non-semantic <div> structures
- Content inside <script> or <template> tags
- Inline styles bloating converted markdown
The spec mentions HTML abstractly. Potential gap: Concrete "do this, not that" examples.
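For reference, here is a minimal sketch of the extraction side, using Python's standard `html.parser`. It assumes the goal is to keep only visible text, so it drops anything inside `<script>`, `<style>`, and `<template>` and ignores all attributes, inline styles included. `TextExtractor` and `extract_text` are hypothetical names, and a real converter would also need to preserve headings, lists, and links.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "template"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping content inside SKIPPED_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Attributes (including inline styles) are ignored entirely.
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.parts.append(text)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Note what this implies for authors: any documentation text that only exists inside a `<template>` or is injected by a `<script>` is invisible to an extractor like this one, and nested `<div>` wrappers contribute nothing that helps recover structure.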
6. Versioning and Change Detection
We built incremental processing pipelines with versioned backups. The spec treats docs as static.
Potential gap: Guidance on version/diff-friendly formats for agents that cache or index content.
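One common way to do change detection on the consumer side is content hashing, sketched below. This is an illustration under assumptions, not a description of any particular pipeline: `fingerprint` and `diff_manifests` are hypothetical names, and a manifest is assumed to be a plain mapping from page URL to the SHA-256 of its content.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a stable hash of a page's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {url: hash} manifests and list what needs reprocessing."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(url for url in shared if old[url] != new[url]),
    }
```

If a docs site published something equivalent (per-page hashes or last-modified values), an indexing agent could skip this fetch-and-hash step and re-embed only the pages listed under "added" and "changed".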
To be clear
These may or may not be applicable to your vision; every use case is different. The spec's focus on real-time coding agents is clear, and some of these observations come from a RAG/indexing perspective that may be out of scope.
Thank you again for this work.