# refdocs

A local CLI tool that indexes markdown documentation and exposes fast fuzzy search with intelligent chunking. Designed to give LLM coding agents efficient, token-conscious access to project documentation without MCP servers, network calls, or full-file context dumps.

## Architecture

```
refdocs/
├── src/
│   ├── index.ts     # CLI entrypoint (commander)
│   ├── indexer.ts   # Walks target dir, chunks md files, builds search index
│   ├── chunker.ts   # Splits markdown by heading hierarchy into right-sized chunks
│   ├── search.ts    # MiniSearch wrapper, query + rank + format results
│   └── config.ts    # Reads .refdocs.json config
├── .refdocs.json    # Example config
├── package.json
├── tsconfig.json
└── README.md
```

## Tech Stack

- **Runtime**: Node/Bun (target `bun build --compile` for single binary)
- **Language**: TypeScript, strict mode
- **Search engine**: MiniSearch — pure JS, ~7kb, fuzzy matching, field boosting, prefix search
- **CLI framework**: Commander
- **Markdown parsing**: markdown-it or remark for heading extraction (evaluate which is lighter)
- **Zero external services** — no network calls, no API keys, everything local

## Config

`.refdocs.json` at project root:

```json
{
  "paths": ["ref-docs"],
  "index": ".refdocs-index.json",
  "chunkMaxTokens": 800,
  "chunkMinTokens": 100,
  "boostFields": {
    "title": 2,
    "headings": 1.5,
    "body": 1
  }
}
```

- `paths` — array of directories to index (relative to project root)
- `index` — where to persist the serialized search index (gitignored)
- `chunkMaxTokens` — upper bound for chunk size, rough estimate (chars / 4)
- `chunkMinTokens` — minimum chunk size; merge small sections with their parent
- `boostFields` — field relevance weights for search ranking
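
The config loader might be sketched as below. The field names and defaults follow the example config above; the function name and the shallow merge-with-defaults behavior are assumptions, not a committed API:

```typescript
// config.ts sketch — load .refdocs.json, falling back to defaults.
import { readFileSync } from "node:fs";

export interface RefdocsConfig {
  paths: string[];
  index: string;
  chunkMaxTokens: number;
  chunkMinTokens: number;
  boostFields: Record<string, number>;
}

const DEFAULTS: RefdocsConfig = {
  paths: ["ref-docs"],
  index: ".refdocs-index.json",
  chunkMaxTokens: 800,
  chunkMinTokens: 100,
  boostFields: { title: 2, headings: 1.5, body: 1 },
};

export function loadConfig(path = ".refdocs.json"): RefdocsConfig {
  let raw: string;
  try {
    raw = readFileSync(path, "utf8");
  } catch {
    return DEFAULTS; // no config file: run with defaults
  }
  // Shallow merge: a user-supplied boostFields replaces the default map wholesale.
  return { ...DEFAULTS, ...JSON.parse(raw) };
}
```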

## CLI Commands

### `refdocs index`

Walk all configured paths, chunk every `.md` file, build and persist the MiniSearch index.

- Parse each markdown file into chunks split by heading boundaries (h1 > h2 > h3)
- Each chunk gets metadata: `{ id, file, title, headings, body, startLine, endLine }`
- Small sections (below `chunkMinTokens`) merge into their parent heading's chunk
- Large sections (above `chunkMaxTokens`) split at paragraph boundaries
- Serialize index to `.refdocs-index.json`
- Print summary: files indexed, chunks created, index size

### `refdocs search <query>`

Fuzzy search the index and return the top chunks.

- Load persisted index (error if not built yet)
- Run MiniSearch with fuzzy matching (`fuzzy: 0.2`), prefix search enabled
- Return top 3 results by default
- Output format: each chunk preceded by a comment with source file and line range

**Flags:**
- `-n, --results <count>` — number of results (default: 3, max: 10)
- `-f, --file <pattern>` — filter results to files matching glob
- `--json` — output results as JSON array instead of formatted text
- `--raw` — output chunk body only, no metadata header (for piping)
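
A sketch of how the `-f` and `-n` flags might be applied after MiniSearch has ranked the chunks. The minimal `*`-only glob support here is an assumption (a real implementation might use a glob library), as are the type and function names:

```typescript
// search.ts sketch — post-process ranked results for the -f and -n flags.
export interface RankedChunk {
  file: string;
  score: number;
}

// Minimal glob: `*` matches any run of characters; everything else is literal.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

export function applyFlags(
  results: RankedChunk[],
  opts: { file?: string; results?: number } = {},
): RankedChunk[] {
  let out = results;
  if (opts.file) {
    const re = globToRegExp(opts.file);
    out = out.filter((r) => re.test(r.file));
  }
  const n = Math.min(opts.results ?? 3, 10); // default 3, hard cap 10
  return out.slice(0, n);
}
```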

### `refdocs list`

List all indexed files and their chunk counts. Useful for verifying what's in the index.

### `refdocs info <file>`

Show all chunks for a specific file with their headings and token estimates.

## Chunking Strategy

This is the core value of the tool. Chunks must be:

1. **Semantically coherent** — never split mid-section. Heading boundaries are the primary split points.
2. **Right-sized for LLM context** — 100-800 tokens. Big enough to be useful, small enough not to waste context.
3. **Hierarchical** — each chunk carries its full heading breadcrumb (e.g. `Configuration > Database > Connections`) so the LLM understands where the chunk fits.

Algorithm:

1. Parse markdown into AST
2. Walk AST and split at heading nodes (h1, h2, h3)
3. Each section becomes a candidate chunk with its heading breadcrumb
4. If a chunk is below `chunkMinTokens`, merge it with its previous sibling or parent
5. If a chunk is above `chunkMaxTokens`, split at paragraph boundaries (double newline)
6. Attach metadata: source file path, line range, heading trail
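
A condensed sketch of steps 1-3 and 6, with the chars / 4 token estimate from the config section: a line scanner stands in for the AST walk, and the min/max merge and split passes are elided. Names and shapes are illustrative:

```typescript
// chunker.ts sketch — split markdown at h1-h3 headings, carrying a breadcrumb.
export interface Chunk {
  headings: string[]; // breadcrumb trail, e.g. ["Configuration", "Database"]
  body: string;
  startLine: number;
  endLine: number;
}

// Rough token estimate per the config section: chars / 4.
export const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

export function chunkByHeadings(markdown: string): Chunk[] {
  const lines = markdown.split("\n");
  const chunks: Chunk[] = [];
  let trail: string[] = [];
  let current: Chunk | null = null;

  lines.forEach((line, i) => {
    const m = /^(#{1,3})\s+(.*)$/.exec(line); // h1-h3 are split points; h4+ stays in the body
    if (m) {
      if (current) chunks.push(current);
      const depth = m[1].length;
      trail = [...trail.slice(0, depth - 1), m[2]]; // truncate trail to parent depth
      current = { headings: [...trail], body: "", startLine: i + 1, endLine: i + 1 };
    } else if (current) {
      current.body += line + "\n";
      current.endLine = i + 1;
    }
  });
  if (current) chunks.push(current);
  return chunks;
}
```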

## Output Format

Default output for `refdocs search "data transformers"`:

```
# [1] spatie-laravel-data/transformers.md:15-48
# Transformers > Built-in Transformers

Transformers are used to convert data properties when...
<chunk body here>

---

# [2] spatie-laravel-data/creating-data-objects.md:72-95
# Creating Data Objects > Casting and Transforming

When creating a data object from a request...
<chunk body here>
```
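
A header layout like the one above could come from a small formatter along these lines; the exact spacing and the type and function names are guesses at the final shape:

```typescript
// Format ranked hits as numbered, breadcrumbed text blocks separated by ---.
export interface SearchHit {
  file: string;
  lines: [number, number];
  headings: string[];
  body: string;
}

export function formatResults(hits: SearchHit[]): string {
  return hits
    .map((hit, i) => {
      const header =
        `# [${i + 1}] ${hit.file}:${hit.lines[0]}-${hit.lines[1]}\n` +
        `# ${hit.headings.join(" > ")}`;
      return `${header}\n\n${hit.body.trim()}`;
    })
    .join("\n\n---\n\n");
}
```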
| 125 | + |
| 126 | +JSON output (`--json`) returns: |
| 127 | + |
| 128 | +```json |
| 129 | +[ |
| 130 | + { |
| 131 | + "score": 12.45, |
| 132 | + "file": "spatie-laravel-data/transformers.md", |
| 133 | + "lines": [15, 48], |
| 134 | + "headings": ["Transformers", "Built-in Transformers"], |
| 135 | + "body": "..." |
| 136 | + } |
| 137 | +] |
| 138 | +``` |
| 139 | + |
| 140 | +## Design Principles |
| 141 | + |
| 142 | +- **No runtime dependencies beyond the binary** — everything bundles into one file |
| 143 | +- **Fast** — indexing a typical ref-docs folder (50 files) should take <1s. Search should be <50ms. |
| 144 | +- **Deterministic** — same docs, same index. No embeddings, no ML, no probabilistic retrieval. |
| 145 | +- **Composable** — output is plain text or JSON. Pipe it wherever you want. |
| 146 | +- **Offline** — works air-gapped, on a plane, in a container with no egress |
| 147 | + |
| 148 | +## Code Style |
| 149 | + |
| 150 | +- TypeScript strict mode, no `any` |
| 151 | +- Pure functions where possible, side effects at the edges (CLI entrypoint, file I/O) |
| 152 | +- No classes unless genuinely needed — prefer modules with exported functions |
| 153 | +- Error messages should be actionable: "Index not found. Run `refdocs index` first." |
| 154 | +- Tests with Vitest, focus on chunker logic and search relevance |
| 155 | + |
| 156 | +## Future Considerations (not MVP) |
| 157 | + |
| 158 | +- `refdocs watch` — rebuild index on file change |
| 159 | +- `refdocs add <url>` — fetch a URL, convert to markdown, save to ref-docs |
| 160 | +- `refdocs update` — re-pull docs from configured upstream sources (git repos, URLs) |
| 161 | +- MCP server mode — expose search as an MCP tool for editors that prefer it |
| 162 | +- Token counting with tiktoken instead of chars/4 estimate |
| 163 | +- Embedding-based search as optional mode (would require onnxruntime or similar) |