Add AGENT.md, CLAUDE.md symlink, and pre-push git hook #156
Changes from all commits
6d2db27
2f6777c
f1ffbfd
2fad3cb
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,6 +13,7 @@ matches.json | |
| .queue/ | ||
| .queues/ | ||
| .idea | ||
| .cursor | ||
| .scalability-test-queue | ||
|
|
||
| # Test queue directories | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,5 @@ | ||
| #!/usr/bin/env sh | ||
| . "$(dirname -- "$0")/_/husky.sh" | ||
|
|
||
| echo "Building, running tests, and linting before commit..." | ||
| npm run build && npm run test && npm run lint | ||
| echo "Building, running tests, linting, and checking formatting before commit..." | ||
| npm run build && npm run test && npm run lint && npm run format:check | ||
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,5 @@ | ||||||||||||
| #!/usr/bin/env sh | ||||||||||||
| . "$(dirname -- "$0")/_/husky.sh" | ||||||||||||
|
||||||||||||
| . "$(dirname -- "$0")/_/husky.sh" | |
| # Allow bypassing this hook by setting HUSKY=0, similar to husky.sh | |
| if [ "${HUSKY:-1}" = "0" ]; then | |
| exit 0 | |
| fi |
|
Collaborator
should be maybe this is the reason sonnet is not listening
Collaborator
Also, honestly, I'd remove almost all of the overview/architectural guidance. Almost all of it is inferable by the agent. I would leave only a couple of commands and instructions on when to use them; that's it, ultra thin. As for the overview, we could add a marble diagram to compress the data for agents. Wdyt?
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,215 @@ | ||
| # Semantic Code Search Indexer — Agent Guide | ||
|
|
||
| ## Project Overview | ||
|
|
||
| A high-performance CLI tool that parses codebases into semantically meaningful chunks and indexes them into Elasticsearch for AI-powered code search. It uses **tree-sitter** for AST-based parsing of 15+ languages and Elasticsearch's **ELSER** model for semantic embeddings. | ||
|
|
||
| **Primary use case:** Run on a schedule (cron) to keep an Elasticsearch index in sync with git repositories. | ||
|
|
||
| ## Architecture | ||
|
|
||
| ```text | ||
| src/ | ||
| ├── index.ts # CLI entrypoint (commander) | ||
| ├── config.ts # All configuration — env vars with typed getters | ||
| ├── commands/ # CLI commands (index, setup, search, queue:*, etc.) | ||
| │ ├── index_command.ts # Main `index` command — orchestrates full + incremental indexing | ||
| │ ├── full_index_producer.ts # Enqueues all files for a full reindex | ||
| │ ├── incremental_index_command.ts # Only processes git-changed files | ||
| │ ├── worker_command.ts # Consumer worker that processes the queue | ||
| │ ├── setup_command.ts # Creates/updates ES index mappings | ||
| │ └── ... # Other utility commands | ||
| ├── languages/ # Language configurations (one file per language) | ||
| │ ├── index.ts # Registry — exports all configs, validation | ||
| │ ├── typescript.ts # Example: tree-sitter queries for TS | ||
| │ ├── templates/ # Scaffold templates for new languages | ||
| │ └── ... | ||
| └── utils/ | ||
| ├── parser.ts # Core parsing engine — tree-sitter + custom parsers | ||
| ├── elasticsearch.ts # ES client, bulk indexing, document ID generation | ||
| ├── sqlite_queue.ts # SQLite-backed persistent work queue | ||
| ├── indexer_worker.ts # Batch consumer with p-queue concurrency | ||
| ├── git_helper.ts # Git operations (clone, pull, diff, file hashes) | ||
| ├── otel_provider.ts # OpenTelemetry setup (logs + metrics) | ||
| ├── metrics.ts # OTel metrics instrumentation | ||
| ├── logger.ts # Structured logging | ||
| ├── language_validator.ts # Validates language configs at startup | ||
| ├── constants.ts # Shared string constants | ||
| └── shared_extensions.ts # Cross-language file extension resolution | ||
| ``` | ||
|
|
||
| ### Key Data Flow | ||
|
|
||
| 1. **Producer** (`full_index_producer.ts` / `incremental_index_command.ts`) walks the repo, parses files via `parser.ts`, and enqueues `CodeChunk` documents into the **SQLite queue**. | ||
| 2. **Worker** (`indexer_worker.ts`) dequeues batches and bulk-indexes them into Elasticsearch via `elasticsearch.ts`. | ||
| 3. The `index` command orchestrates both producer and worker together. | ||
|
|
||
| ### Parser System | ||
|
|
||
| - **Tree-sitter languages** (TypeScript, JavaScript, Python, Go, Java, C, C++, Bash): AST-based parsing with S-expression queries to extract functions, classes, imports, exports, etc. | ||
| - **Custom parsers** (Markdown, YAML, JSON, text, Gradle, Handlebars, Properties): Regex/line-based splitting with language-specific chunking strategies. | ||
| - Each `LanguageConfiguration` defines: `name`, `fileSuffixes`, `parser` (tree-sitter grammar or parser type constant), `queries` (tree-sitter S-expressions), `importQueries`, `exportQueries`. | ||
| - Chunks include: content, language, file path, git metadata, line ranges, symbols, imports/exports, directory info, and a content-based `chunk_hash` for deduplication. | ||
|
|
||
| ### Queue System | ||
|
|
||
| - **SQLite queue** (`sqlite_queue.ts`): Persistent, crash-safe work queue using `better-sqlite3`. | ||
| - Supports: enqueue/dequeue, retry with max 3 attempts, stale task requeue (5min timeout with PID tracking), WAL mode for concurrent reads. | ||
|
|
||
| - Queue state persists across restarts — incomplete indexing jobs resume automatically. | ||
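The queue semantics listed above (retry with a maximum of 3 attempts, requeue of stale tasks after 5 minutes) can be illustrated with a minimal in-memory sketch. This is not the repository's `sqlite_queue.ts`: the class name, method names, and task shape below are assumptions made for illustration, and PID tracking and SQLite persistence are left out.

```typescript
// Hypothetical in-memory model of the queue semantics described above.
// The real queue is SQLite-backed; every name and shape here is an assumption.
type TaskStatus = 'pending' | 'processing' | 'failed';

interface Task {
  id: number;
  payload: string;
  status: TaskStatus;
  attempts: number;
  claimedAt: number | null; // epoch ms when a worker claimed the task
}

const MAX_ATTEMPTS = 3;
const STALE_AFTER_MS = 5 * 60 * 1000;

class InMemoryQueue {
  private tasks: Task[] = [];
  private nextId = 1;

  enqueue(payload: string): number {
    const id = this.nextId++;
    this.tasks.push({ id, payload, status: 'pending', attempts: 0, claimedAt: null });
    return id;
  }

  // Claim up to `limit` pending tasks and count the attempt.
  dequeue(limit: number, now: number): Task[] {
    const batch = this.tasks.filter((t) => t.status === 'pending').slice(0, limit);
    for (const t of batch) {
      t.status = 'processing';
      t.attempts += 1;
      t.claimedAt = now;
    }
    return batch;
  }

  // Remove a task that was processed successfully.
  complete(id: number): void {
    this.tasks = this.tasks.filter((t) => t.id !== id);
  }

  // Return a failed task to the queue, or park it once attempts are exhausted.
  fail(id: number): void {
    const t = this.tasks.find((task) => task.id === id);
    if (!t) return;
    t.status = t.attempts >= MAX_ATTEMPTS ? 'failed' : 'pending';
    t.claimedAt = null;
  }

  // Release claims older than the stale timeout; returns how many were released.
  requeueStale(now: number): number {
    let released = 0;
    for (const t of this.tasks) {
      if (t.status === 'processing' && t.claimedAt !== null && now - t.claimedAt > STALE_AFTER_MS) {
        t.status = t.attempts >= MAX_ATTEMPTS ? 'failed' : 'pending';
        t.claimedAt = null;
        released += 1;
      }
    }
    return released;
  }

  statusOf(id: number): TaskStatus | undefined {
    return this.tasks.find((t) => t.id === id)?.status;
  }
}
```

Passing `now` explicitly keeps the sketch deterministic; a persistent implementation would more likely read the clock itself and store the claim time in a column.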
|
|
||
| ### Configuration | ||
|
|
||
| All config is via environment variables, loaded from `.env` (or `.env.test` in test mode). Key prefixes: | ||
| - `ELASTICSEARCH_*` — Connection settings (endpoint, cloud ID, API key, username/password) | ||
| - `SCS_IDXR_*` — Indexer-specific settings (chunk size, overlap, inference ID, OTEL, etc.) | ||
| - `OTEL_*` — OpenTelemetry standard env vars | ||
|
|
||
| See `src/config.ts` for the full list with defaults. | ||
|
|
||
| ## Git Hooks (Husky) | ||
|
|
||
| > **🚨 CRITICAL: NEVER bypass Husky hooks.** Do not use `--no-verify`, `HUSKY=0`, or any | ||
| > other mechanism to skip hooks. No exceptions. If a hook fails, **fix the issue** — do not | ||
| > work around the hook. Broken code must never be committed or pushed. This is a hard rule. | ||
|
|
||
| Two hooks gate every commit and push: | ||
|
|
||
| | Hook | Runs | Purpose | | ||
| |------|------|---------| | ||
| | **pre-commit** | `build → test → lint → format:check` | Catches errors before they enter history | | ||
| | **pre-push** | `build → test → lint → format:check` | Safety net — blocks pushes even if commit hooks were skipped | | ||
|
|
||
| Both hooks run the same four checks. If any step fails, the operation is aborted. | ||
|
Comment on lines +69 to +82
Collaborator
I think this belongs in personal agent hooks, where it is actually enforced for dumb models.
||
|
|
||
| ## Commands to Verify Before Committing | ||
|
|
||
| The hooks run these automatically, but you can run them manually: | ||
|
|
||
| ```bash | ||
| # 1. Build (compiles TypeScript, must succeed cleanly) | ||
| npm run build | ||
|
|
||
| # 2. Run all unit tests | ||
| npm run test | ||
|
|
||
| # 3. Lint (ESLint + Prettier) | ||
| npm run lint | ||
|
|
||
| # 4. Check formatting (no writes, fails if unformatted) | ||
| npm run format:check | ||
| ``` | ||
|
|
||
| To auto-fix formatting issues: `npm run format` | ||
|
|
||
| ### Additional Verification Commands | ||
|
|
||
| ```bash | ||
| # Type-check only (no emit) — faster than full build for iterating | ||
| npm run type:check | ||
|
|
||
| # Format check (CI-friendly, no writes) | ||
| npm run format:check | ||
|
|
||
| # Run a specific test file | ||
| npx vitest run tests/unit/parser.test.ts | ||
|
|
||
| # Run tests in watch mode during development | ||
| npm run test:watch | ||
| ``` | ||
|
|
||
| ### Integration Tests | ||
|
|
||
| Integration tests require a running Elasticsearch instance and are **not** part of the pre-commit hook: | ||
|
|
||
| ```bash | ||
| # Full integration test lifecycle (setup → run → teardown) | ||
| npm run test:integration | ||
| ``` | ||
|
|
||
| Integration tests use a separate vitest config (`vitest.integration.config.ts`) with 3-minute timeouts and fork-based isolation. They live in `tests/integration/`. | ||
|
|
||
| ## Test Conventions | ||
|
|
||
| - **Framework:** Vitest 4.x with globals enabled (`describe`, `it`, `expect` available without imports) | ||
| - **Unit tests:** `tests/unit/*.test.ts` — mock external dependencies (ES client, filesystem, git) | ||
| - **Integration tests:** `tests/integration/*.integration.test.ts` — require real Elasticsearch | ||
| - **Fixtures:** `tests/fixtures/` — sample source files for each supported language | ||
| - **Snapshots:** `tests/unit/__snapshots__/` — parser output snapshots (update with `npx vitest run -u`) | ||
| - **Test setup:** `tests/setup.ts` sets `NODE_ENV=test` and configures the test environment | ||
| - **Mocking:** Vitest auto-cleanup is enabled (`mockReset`, `restoreMocks`, `clearMocks`). Use `vi.mock()` for module mocks and `vi.spyOn()` for spies. | ||
| - **Config in tests:** Import from `../src/config` and set values via the config object setters (e.g., `indexingConfig.maxChunkSizeBytes = 500`). Tests use `.env.test` if it exists. | ||
| - **Pool:** Tests use `forks` pool. File parallelism is enabled locally, disabled in CI. | ||
|
|
||
| ## Code Style & Formatting | ||
|
|
||
| - **Prettier** enforced via ESLint plugin. Config in `.prettierrc.json`: | ||
| - Single quotes, trailing commas (es5), 2-space indent, 120 char print width, LF line endings | ||
| - **ESLint** config in `eslint.config.js` (flat config format): | ||
| - Uses `typescript-eslint` recommended rules + prettier plugin | ||
| - Ignores: `dist/`, `.repos/`, `tests/fixtures/`, `libs/es-query/` | ||
| - **TypeScript:** Strict mode, ES2021 target, CommonJS modules | ||
| - `tsconfig.json` — includes tests, `noEmit: true` (for type-checking) | ||
| - `tsconfig.build.json` — excludes tests, emits to `dist/` | ||
|
|
||
| ## Adding a New Language | ||
|
|
||
| 1. Run the scaffold command: `ts-node src/index.ts scaffold-language --name <language> --extensions ".ext"` | ||
| - Or manually copy from `src/languages/templates/` | ||
| 2. Create `src/languages/<name>.ts` implementing `LanguageConfiguration` | ||
| 3. Register it in `src/languages/index.ts` (add import + add to `languageConfigurations` object) | ||
| 4. Add a test fixture in `tests/fixtures/<name>.<ext>` | ||
| 5. Add parser tests in `tests/unit/parser.test.ts` (or a new test file) | ||
| 6. The language validator runs at startup and will warn about misconfigured extensions or missing queries | ||
|
|
||
| ### Language Configuration Shape | ||
|
|
||
| ```typescript | ||
| interface LanguageConfiguration { | ||
| name: string; // Language identifier | ||
| fileSuffixes: string[]; // File extensions (e.g., ['.ts', '.tsx']) | ||
| parser: TreeSitterParser | string; // Tree-sitter grammar or parser type constant | ||
| queries?: string[]; // Tree-sitter S-expression queries for code extraction | ||
| importQueries?: string[]; // Queries for import extraction | ||
| exportQueries?: string[]; // Queries for export extraction | ||
| // ... additional optional fields | ||
| } | ||
| ``` | ||
|
|
||
| ## Common Patterns | ||
|
|
||
| ### Elasticsearch Client | ||
|
|
||
| The ES client is lazily initialized in `src/utils/elasticsearch.ts`. Commands that don't need ES (like `dump-tree`, `scaffold`) work without ES credentials. Use `getClient()` to access the singleton, `setClient()` for test injection. | ||
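The lazy-singleton pattern described above can be sketched as follows. Only the names `getClient()` and `setClient()` come from the guide; the `SearchClient` interface and `createClient()` factory are hypothetical stand-ins, and the actual module presumably constructs a real Elasticsearch client from configuration.

```typescript
// Hypothetical stand-in for the real Elasticsearch client type.
interface SearchClient {
  ping(): Promise<boolean>;
}

let client: SearchClient | undefined;
let constructed = 0; // counts factory calls, for illustration only

// Assumed factory; the real module would build its client from config.
function createClient(): SearchClient {
  constructed += 1;
  return { ping: async () => true };
}

// Create the client on first use, so commands that never call this
// function do not need credentials.
function getClient(): SearchClient {
  if (!client) {
    client = createClient();
  }
  return client;
}

// Replace the singleton, e.g. with a mock in unit tests.
function setClient(replacement: SearchClient | undefined): void {
  client = replacement;
}
```

In a test, calling `setClient(mock)` before exercising the code under test means no real client is ever constructed.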
|
|
||
| ### Logging | ||
|
|
||
| Use the structured logger from `src/utils/logger.ts`: | ||
|
|
||
| ```typescript | ||
| import { logger, createLogger } from './logger'; | ||
| logger.info('message', { key: 'value' }); | ||
| const scopedLogger = createLogger('my-module'); | ||
| ``` | ||
|
|
||
|
|
||
| Logging is suppressed in test mode (`NODE_ENV=test`) unless `SCS_IDXR_FORCE_LOGGING=true`. | ||
|
|
||
| ### Metrics | ||
|
|
||
| OTel metrics are instrumented in `src/utils/metrics.ts`. Counters and histograms track: chunks indexed, files processed, parse errors, queue depth, batch durations. Metrics are cached (5s TTL) to avoid blocking the event loop. | ||
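A time-to-live cache of the kind mentioned above might look like the sketch below. It is an assumption about the general technique, not the code in `metrics.ts`; the helper name `cachedValue` and its injectable clock are hypothetical.

```typescript
// Hypothetical TTL cache: recompute a value at most once per `ttlMs`.
// `now` is injectable so the behavior can be tested without real timers.
function cachedValue<T>(compute: () => T, ttlMs: number, now: () => number = Date.now): () => T {
  let value: T | undefined;
  let expiresAt = -Infinity; // forces a compute on the first read

  return () => {
    const t = now();
    if (t >= expiresAt) {
      value = compute();
      expiresAt = t + ttlMs;
    }
    return value as T;
  };
}
```

With a 5000 ms TTL, a metric callback that fires every second would run the expensive read (for example, a queue-depth query) only once every five seconds.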
|
|
||
| ### Error Handling | ||
|
|
||
| - Parser errors are logged but don't crash the indexer — individual files are skipped | ||
| - Queue failures retry up to 3 times with stale task detection (5min timeout) | ||
| - Bulk indexing errors are logged per-document; successful documents in a batch are not retried | ||
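The per-document handling of bulk errors can be sketched against a simplified model of the Elasticsearch bulk response (an `errors` flag plus one `items` entry per operation). The interfaces and the `splitBulkResult` helper are hypothetical and cover only `index` operations; how the repository actually logs and tracks failures is not shown in this diff.

```typescript
// Simplified, assumed model of a bulk response; only `index` actions are covered.
interface BulkItemResult {
  _id: string;
  status: number;
  error?: { type: string; reason: string };
}

interface BulkResponseLike {
  errors: boolean;
  items: Array<{ index?: BulkItemResult }>;
}

interface BulkOutcome {
  succeeded: string[];
  failed: Array<{ id: string; reason: string }>;
}

// Split a bulk response so that only failed documents are reported or retried.
function splitBulkResult(response: BulkResponseLike): BulkOutcome {
  const outcome: BulkOutcome = { succeeded: [], failed: [] };
  for (const item of response.items) {
    const result = item.index;
    if (!result) continue; // skip action types this sketch does not model
    if (result.error) {
      outcome.failed.push({ id: result._id, reason: result.error.reason });
    } else {
      outcome.succeeded.push(result._id);
    }
  }
  return outcome;
}
```

A caller could log each entry in `failed` and leave the `succeeded` documents alone, which matches the behavior described in the list above.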
|
|
||
| ## Important Gotchas | ||
|
|
||
| 1. **`src/config.ts` MUST be the first import** in `src/index.ts` — it loads `.env` before anything reads `process.env`. | ||
| 2. **Chunk hashing intentionally excludes file path and line numbers** — identical code in different files maps to the same document ID for deduplication/aggregation. | ||
| 3. **Git env vars are stripped in `parser.ts`** (`GIT_DIR`, `GIT_WORK_TREE`, `GIT_INDEX_FILE`) to avoid issues when running inside git hooks (husky). | ||
| 4. **The SQLite queue uses WAL mode** — don't delete the `-wal` or `-shm` files while the process is running. | ||
| 5. **`p-queue` is pinned to v6** (CommonJS) — later versions are ESM-only and incompatible with the CommonJS build. | ||
| 6. **`tree-sitter` is pinned to `^0.25.0`** via `overrides` in `package.json` — native module version must match all grammar packages. | ||
| 7. **Integration tests manage their own ES indices** — they create and tear down indices; never run them against a production cluster. | ||
| 8. **The `semantic_text` field type requires ELSER inference** — set `SCS_IDXR_DISABLE_SEMANTIC_TEXT=true` for local testing without an inference endpoint. | ||
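Gotcha 2 above (chunk hashing excludes file path and line numbers) can be illustrated with a short sketch. The `ChunkLike` shape, the `chunkHash` helper, and the choice of SHA-256 are assumptions; the diff does not show which fields or hash function the indexer actually uses.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical chunk shape; field names are illustrative only.
interface ChunkLike {
  content: string;
  filePath: string;
  startLine: number;
  endLine: number;
}

// Hash only the content, so identical code in different files or at
// different line ranges produces the same identifier.
function chunkHash(chunk: ChunkLike): string {
  return createHash('sha256').update(chunk.content, 'utf8').digest('hex');
}
```

Two chunks with the same `content` but different `filePath` values hash identically, which is what allows deduplication across files.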
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| AGENT.md |
This hook sources `.husky/_/husky.sh`, but that path isn't present in the repository. This will emit a shell error on every commit and may break `HUSKY=0` bypass behavior. Consider committing the Husky `./.husky/_` helper files or inlining the minimal logic (including honoring `HUSKY=0`) and removing the source line.