This repository was archived by the owner on Feb 14, 2026. It is now read-only.

Commit 03226d4

Chris Arter and claude committed
Initial commit
CLI tool that indexes markdown documentation and exposes fast fuzzy search with intelligent chunking. Includes CI workflow for tests and GitHub Packages publishing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0 parents  commit 03226d4

34 files changed

Lines changed: 5105 additions & 0 deletions

.github/workflows/publish.yml

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
name: Publish to GitHub Packages

on:
  push:
    tags:
      - "v*"

permissions:
  contents: read
  packages: write

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 22
          registry-url: https://npm.pkg.github.com

      - run: npm ci
      - run: npm test
      - run: npx tsc

      - run: npm publish
        env:
          NODE_AUTH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

.github/workflows/test.yml

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
name: Tests

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 22

      - run: npm ci
      - run: npm test

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
node_modules/
dist/
.refdocs-index.json
refdocs
*.tsbuildinfo

.refdocs.json

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
{
  "paths": ["ref-docs"],
  "index": ".refdocs-index.json",
  "chunkMaxTokens": 800,
  "chunkMinTokens": 100,
  "boostFields": {
    "title": 2,
    "headings": 1.5,
    "body": 1
  }
}

CLAUDE.md

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
# refdocs

A local CLI tool that indexes markdown documentation and exposes fast fuzzy search with intelligent chunking. Designed to give LLM coding agents efficient, token-conscious access to project documentation without MCP servers, network calls, or full-file context dumps.

## Architecture

```
refdocs/
├── src/
│   ├── index.ts      # CLI entrypoint (commander)
│   ├── indexer.ts    # Walks target dir, chunks md files, builds search index
│   ├── chunker.ts    # Splits markdown by heading hierarchy into right-sized chunks
│   ├── search.ts     # MiniSearch wrapper, query + rank + format results
│   └── config.ts     # Reads .refdocs.json config
├── .refdocs.json     # Example config
├── package.json
├── tsconfig.json
└── README.md
```

## Tech Stack

- **Runtime**: Node/Bun (target `bun build --compile` for single binary)
- **Language**: TypeScript, strict mode
- **Search engine**: MiniSearch — pure JS, ~7kb, fuzzy matching, field boosting, prefix search
- **CLI framework**: Commander
- **Markdown parsing**: markdown-it or remark for heading extraction (evaluate which is lighter)
- **Zero external services** — no network calls, no API keys, everything local

## Config

`.refdocs.json` at project root:

```json
{
  "paths": ["ref-docs"],
  "index": ".refdocs-index.json",
  "chunkMaxTokens": 800,
  "chunkMinTokens": 100,
  "boostFields": {
    "title": 2,
    "headings": 1.5,
    "body": 1
  }
}
```

- `paths` — array of directories to index (relative to project root)
- `index` — where to persist the serialized search index (gitignored)
- `chunkMaxTokens` — upper bound for chunk size, rough estimate (chars / 4)
- `chunkMinTokens` — minimum chunk size; merge small sections with their parent
- `boostFields` — field relevance weights for search ranking

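The config fields above could be resolved roughly as follows. This is a minimal sketch, not the code in `src/config.ts`: the `RefdocsConfig` type, `DEFAULT_CONFIG`, `resolveConfig`, and `estimateTokens` are hypothetical names, and treating the example values as defaults is an assumption.

```typescript
// Hypothetical shape of .refdocs.json; field names come from the example config.
interface RefdocsConfig {
  paths: string[];
  index: string;
  chunkMaxTokens: number;
  chunkMinTokens: number;
  boostFields: { title: number; headings: number; body: number };
}

// Assumption: the example values double as the defaults.
const DEFAULT_CONFIG: RefdocsConfig = {
  paths: ["ref-docs"],
  index: ".refdocs-index.json",
  chunkMaxTokens: 800,
  chunkMinTokens: 100,
  boostFields: { title: 2, headings: 1.5, body: 1 },
};

// Merge a partial user config over the defaults (hypothetical helper).
function resolveConfig(user: Partial<RefdocsConfig>): RefdocsConfig {
  const merged: RefdocsConfig = {
    ...DEFAULT_CONFIG,
    ...user,
    boostFields: { ...DEFAULT_CONFIG.boostFields, ...user.boostFields },
  };
  if (merged.chunkMinTokens >= merged.chunkMaxTokens) {
    throw new Error("chunkMinTokens must be smaller than chunkMaxTokens");
  }
  return merged;
}

// The rough token estimate described above: characters divided by four.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```

With this sketch, `resolveConfig({ paths: ["docs"] })` keeps the user's `paths` and fills every other field from the defaults.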
## CLI Commands

### `refdocs index`

Walk all configured paths, chunk every `.md` file, build and persist the MiniSearch index.

- Parse each markdown file into chunks split by heading boundaries (h1 > h2 > h3)
- Each chunk gets metadata: `{ id, file, title, headings, body, startLine, endLine }`
- Small sections (below `chunkMinTokens`) merge into their parent heading's chunk
- Large sections (above `chunkMaxTokens`) split at paragraph boundaries
- Serialize index to `.refdocs-index.json`
- Print summary: files indexed, chunks created, index size

### `refdocs search <query>`

Fuzzy search the index and return the top chunks.

- Load persisted index (error if not built yet)
- Run MiniSearch with fuzzy matching (fuzzy: 0.2), prefix search enabled
- Return top 3 results by default
- Output format: each chunk preceded by a comment with source file and line range

**Flags:**
- `-n, --results <count>` — number of results (default: 3, max: 10)
- `-f, --file <pattern>` — filter results to files matching glob
- `--json` — output results as JSON array instead of formatted text
- `--raw` — output chunk body only, no metadata header (for piping)

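The `-n, --results` bounds listed above might be enforced with a small helper like the one below. This is a sketch only: `resolveResultCount` is a hypothetical name, and clamping values above 10 (rather than rejecting them) is an assumption, since the spec states only the default and the maximum.

```typescript
// Hypothetical helper: normalise the raw -n/--results value.
function resolveResultCount(raw: string | undefined): number {
  const DEFAULT_RESULTS = 3; // default stated in the flag description
  const MAX_RESULTS = 10; // maximum stated in the flag description
  if (raw === undefined) return DEFAULT_RESULTS;
  const parsed = Number.parseInt(raw, 10);
  if (Number.isNaN(parsed) || parsed < 1) {
    // Actionable message, in the spirit of the Code Style section.
    throw new Error(`Invalid --results value "${raw}". Use a number from 1 to ${MAX_RESULTS}.`);
  }
  // Assumption: oversized requests are clamped instead of rejected.
  return Math.min(parsed, MAX_RESULTS);
}
```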
### `refdocs list`

List all indexed files and their chunk counts. Useful for verifying what's in the index.

### `refdocs info <file>`

Show all chunks for a specific file with their headings and token estimates.

## Chunking Strategy

This is the core value of the tool. Chunks must be:

1. **Semantically coherent** — never split mid-section. Heading boundaries are the primary split points.
2. **Right-sized for LLM context** — 100-800 tokens. Big enough to be useful, small enough to not waste context.
3. **Hierarchical** — each chunk carries its full heading breadcrumb (e.g. `Configuration > Database > Connections`) so the LLM understands where the chunk fits.

Algorithm:
1. Parse markdown into AST
2. Walk AST and split at heading nodes (h1, h2, h3)
3. Each section becomes a candidate chunk with its heading breadcrumb
4. If chunk < minTokens, merge with previous sibling or parent
5. If chunk > maxTokens, split at paragraph boundaries (double newline)
6. Attach metadata: source file path, line range, heading trail

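The algorithm above can be sketched in a few dozen lines. This is an illustration under stated simplifications, not the code in `src/chunker.ts`: it scans lines with a regular expression instead of building an AST (so `#` lines inside fenced code would be misread as headings), it merges a small section into the preceding chunk in document order, and split parts keep the parent's line range. All names (`Chunk`, `splitSections`, `mergeSmall`, `splitLarge`, `chunkMarkdown`) are hypothetical.

```typescript
interface Chunk {
  headings: string[]; // breadcrumb, e.g. ["Configuration", "Database"]
  body: string;
  startLine: number; // 1-based, inclusive
  endLine: number; // 1-based, inclusive
}

const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Steps 1-3 (simplified): one candidate section per h1/h2/h3 heading.
function splitSections(source: string): Chunk[] {
  const sections: Chunk[] = [];
  const trail = ["", "", ""]; // current h1, h2, h3 titles
  let headings: string[] = [];
  let body: string[] = [];
  let startLine = 1;

  const flush = (endLine: number): void => {
    const text = body.join("\n").trim();
    if (text !== "" || headings.length > 0) {
      sections.push({ headings, body: text, startLine, endLine });
    }
  };

  const lines = source.split("\n");
  lines.forEach((line, i) => {
    const match = /^(#{1,3})\s+(.+)$/.exec(line);
    if (match === null) {
      body.push(line);
      return;
    }
    flush(i); // previous section ends on the line before this heading
    const [, hashes = "", title = ""] = match;
    const level = hashes.length;
    trail[level - 1] = title.trim();
    for (let deeper = level; deeper < 3; deeper++) trail[deeper] = "";
    headings = trail.slice(0, level).filter((t) => t !== "");
    body = [];
    startLine = i + 1;
  });
  flush(lines.length);
  return sections;
}

// Step 4 (simplified): fold undersized sections into the preceding chunk.
function mergeSmall(sections: Chunk[], minTokens: number): Chunk[] {
  const merged: Chunk[] = [];
  for (const section of sections) {
    const previous = merged[merged.length - 1];
    if (previous !== undefined && estimateTokens(section.body) < minTokens) {
      const heading = section.headings[section.headings.length - 1] ?? "";
      previous.body = [previous.body, heading, section.body].filter((s) => s !== "").join("\n\n");
      previous.endLine = section.endLine;
    } else {
      merged.push({ ...section });
    }
  }
  return merged;
}

// Step 5: greedily pack paragraphs until the next one would exceed maxTokens.
function splitLarge(chunk: Chunk, maxTokens: number): Chunk[] {
  if (estimateTokens(chunk.body) <= maxTokens) return [chunk];
  const parts: string[] = [];
  let buffer = "";
  for (const paragraph of chunk.body.split(/\n{2,}/)) {
    const candidate = buffer === "" ? paragraph : `${buffer}\n\n${paragraph}`;
    if (buffer !== "" && estimateTokens(candidate) > maxTokens) {
      parts.push(buffer);
      buffer = paragraph;
    } else {
      buffer = candidate;
    }
  }
  if (buffer !== "") parts.push(buffer);
  return parts.map((body) => ({ ...chunk, body })); // line ranges not refined here
}

function chunkMarkdown(source: string, minTokens = 100, maxTokens = 800): Chunk[] {
  return mergeSmall(splitSections(source), minTokens).flatMap((c) => splitLarge(c, maxTokens));
}
```

A single paragraph larger than `maxTokens` stays oversized in this sketch; a fuller version would need a secondary split point, for example sentence boundaries.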
## Output Format

Default output for `refdocs search "data transformers"`:

```
# [1] spatie-laravel-data/transformers.md:15-48
# Transformers > Built-in Transformers

Transformers are used to convert data properties when...
<chunk body here>

---

# [2] spatie-laravel-data/creating-data-objects.md:72-95
# Creating Data Objects > Casting and Transforming

When creating a data object from a request...
<chunk body here>
```

JSON output (`--json`) returns:

```json
[
  {
    "score": 12.45,
    "file": "spatie-laravel-data/transformers.md",
    "lines": [15, 48],
    "headings": ["Transformers", "Built-in Transformers"],
    "body": "..."
  }
]
```

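A formatter for the default text output could look like the sketch below. The `SearchResult` type mirrors the JSON shape shown above; `formatResult` and `formatResults` are hypothetical names rather than the functions in `src/search.ts`.

```typescript
// Mirrors the fields of the --json output.
interface SearchResult {
  score: number;
  file: string;
  lines: [number, number]; // [startLine, endLine]
  headings: string[];
  body: string;
}

// One result: a two-line comment header, a blank line, then the chunk body.
function formatResult(result: SearchResult, rank: number): string {
  const [start, end] = result.lines;
  return [
    `# [${rank}] ${result.file}:${start}-${end}`,
    `# ${result.headings.join(" > ")}`,
    "",
    result.body,
  ].join("\n");
}

// Results are separated by a `---` rule with a blank line on each side.
function formatResults(results: SearchResult[]): string {
  return results.map((r, i) => formatResult(r, i + 1)).join("\n\n---\n\n");
}
```

The `--json` mode would then amount to `JSON.stringify(results, null, 2)` over the same array, and `--raw` to printing `result.body` alone.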
## Design Principles

- **No runtime dependencies beyond the binary** — everything bundles into one file
- **Fast** — indexing a typical ref-docs folder (50 files) should take <1s. Search should be <50ms.
- **Deterministic** — same docs, same index. No embeddings, no ML, no probabilistic retrieval.
- **Composable** — output is plain text or JSON. Pipe it wherever you want.
- **Offline** — works air-gapped, on a plane, in a container with no egress

## Code Style

- TypeScript strict mode, no `any`
- Pure functions where possible, side effects at the edges (CLI entrypoint, file I/O)
- No classes unless genuinely needed — prefer modules with exported functions
- Error messages should be actionable: "Index not found. Run `refdocs index` first."
- Tests with Vitest, focus on chunker logic and search relevance

## Future Considerations (not MVP)

- `refdocs watch` — rebuild index on file change
- `refdocs add <url>` — fetch a URL, convert to markdown, save to ref-docs
- `refdocs update` — re-pull docs from configured upstream sources (git repos, URLs)
- MCP server mode — expose search as an MCP tool for editors that prefer it
- Token counting with tiktoken instead of chars/4 estimate
- Embedding-based search as optional mode (would require onnxruntime or similar)

README.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
# refdocs

Index your markdown docs. Search them fast. Get back only what matters.

Built for LLM coding agents that need token-conscious access to project documentation — no network calls, no API keys, no MCP servers. Just a single binary and a JSON index file.

```bash
$ refdocs search "database connections"

# [1] config/database.md:12-34
# Configuration > Database > Connections

Connection pooling is configured via the `pool` key in your
database config. Each connection type supports `min`, `max`,
and `idle_timeout` options...

---

# [2] guides/troubleshooting.md:88-104
# Troubleshooting > Database > Connection Refused

If you see "ECONNREFUSED", check that your database server
is running and the host/port in your config matches...
```

refdocs chunks markdown at heading boundaries into 100-800 token pieces, indexes them with fuzzy search, and returns only the relevant chunks — not entire files.

## Install

From GitHub Packages:

```bash
npm install -g @dynamik-dev/refdocs --registry=https://npm.pkg.github.com
```

Or build from source:

```bash
bun install && bun run build
```

Produces a standalone `./refdocs` binary. Or run directly:

```bash
bun src/index.ts <command>
```

## Usage

```bash
# Point at your docs directory
echo '{ "paths": ["docs"] }' > .refdocs.json

# Build the index
refdocs index
# Indexed 42 files -> 156 chunks (45.2 KB, 320ms)

# Search
refdocs search "authentication"
refdocs search "config" -n 5             # top 5 results
refdocs search "api" -f "api/**/*.md"    # filter by file glob
refdocs search "hooks" --json            # structured output
refdocs search "auth" --raw              # body only, for piping

# Inspect the index
refdocs list                             # files and chunk counts
refdocs info "api/auth.md"               # chunks in a specific file
```

## How it works

1. **Index** — parses each `.md` file into an AST, splits at h1/h2/h3 boundaries, merges small sections, splits large ones at paragraph breaks. Each chunk keeps its full heading breadcrumb (`Config > Database > Connections`).

2. **Search** — fuzzy matching (20% edit tolerance) with prefix search and field boosting. Titles weighted 2x, headings 1.5x, body 1x. Results ranked by TF-IDF. File-level glob filtering via `-f`.

3. **Output** — human-readable by default, `--json` for structured consumption, `--raw` for piping. Each result includes source file, line range, and heading trail.

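To make the "20% edit tolerance" concrete, the sketch below computes how many character edits a query term is allowed and checks a candidate against that budget. It illustrates the idea only and is not MiniSearch's implementation: `editDistance`, `maxEdits`, and `fuzzyMatches` are hypothetical names, and scaling the fraction by term length with rounding is an assumption about how a fractional `fuzzy` value is interpreted.

```typescript
// Classic Levenshtein distance: insertions, deletions, substitutions.
function editDistance(a: string, b: string): number {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current.push(Math.min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost));
    }
    previous = current;
  }
  return previous[b.length];
}

// Assumption: the edit budget is the term length times the fuzzy fraction, rounded.
function maxEdits(term: string, fuzzy = 0.2): number {
  return Math.round(term.length * fuzzy);
}

// True when the indexed term is within the query term's edit budget.
function fuzzyMatches(query: string, indexed: string, fuzzy = 0.2): boolean {
  return editDistance(query, indexed) <= maxEdits(query, fuzzy);
}
```

Under these assumptions a 14-character query such as `authentication` tolerates three edits, while a 2-character query tolerates none, so very short terms only match exactly or by prefix.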
## Configuration

`.refdocs.json` at project root:

```json
{
  "paths": ["docs"],
  "index": ".refdocs-index.json",
  "chunkMaxTokens": 800,
  "chunkMinTokens": 100,
  "boostFields": { "title": 2, "headings": 1.5, "body": 1 }
}
```

All fields optional. See [Configuration](docs/configuration.md) for details.

## Documentation

- [Getting Started](docs/getting-started.md) — installation, quick start, and overview
- [CLI Reference](docs/cli-reference.md) — commands, flags, output formats, and exit codes
- [Configuration](docs/configuration.md) — `.refdocs.json` options with defaults and examples
- [Chunking](docs/chunking.md) — the 3-pass splitting algorithm and chunk structure
- [Search](docs/search.md) — fuzzy matching, boosting, scoring, and index persistence

## Tech

| Dependency | Role |
|------------|------|
| [MiniSearch](https://github.com/lucaong/minisearch) | Full-text fuzzy search (~7kb, pure JS) |
| [Commander](https://github.com/tj/commander.js) | CLI framework |
| [mdast-util-from-markdown](https://github.com/syntax-tree/mdast-util-from-markdown) | Markdown AST parsing |
| [picomatch](https://github.com/micromatch/picomatch) | Glob pattern matching |

Zero external services. Works offline, in containers, on planes.
