Restructure processing to enable multi-language file support by blopker · Pull Request #234 · blopker/codebook

blopker · 2026-03-20T22:00:27Z

No description provided.

Unify multi-language support into .scm query files using @injection.* capture tags, eliminating the separate regions.rs module.

Three tag forms:

- @injection.{lang}: static injection (e.g. @injection.html for HTML blocks in markdown)
- @injection.content + @injection.language: dynamic injection (e.g. fenced code blocks where the language comes from the info string)
- Existing tags (@string, @comment, @identifier.*): word extraction

The recursive extract_all_words function replaces the previous extract_regions → extract_nodes → extract_words pipeline. Adding injection support to any language now only requires adding @injection.* captures to its .scm file; no Rust code changes are needed.

Restores markdown.scm with injection captures and the Markdown tree-sitter grammar. Deletes regions.rs.
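To make the recursion and the offset bookkeeping concrete, here is a toy sketch. It does not use tree-sitter or the real .scm queries: the two "languages" and their rules (`outer`, `inner`, braces as the injection boundary) are invented for illustration, and `Found`, `push_words`, and the `base` parameter are hypothetical names. The idea it shows is that an injected region is re-processed by the same function, with the region's start added to every offset so results point back into the top-level document.

```rust
/// A word plus its byte offset in the *top-level* document.
#[derive(Debug, PartialEq, Eq)]
pub struct Found<'a> {
    pub word: &'a str,
    pub start: usize,
}

/// Push every alphabetic run in `text`, shifting offsets by `base`.
fn push_words<'a>(text: &'a str, base: usize, out: &mut Vec<Found<'a>>) {
    let mut start = None;
    for (i, c) in text.char_indices() {
        if c.is_alphabetic() {
            if start.is_none() {
                start = Some(i);
            }
        } else if let Some(s) = start.take() {
            out.push(Found { word: &text[s..i], start: base + s });
        }
    }
    if let Some(s) = start {
        out.push(Found { word: &text[s..], start: base + s });
    }
}

/// Toy rules (assumptions standing in for per-language queries):
/// - "outer": text inside `{...}` is injected as "inner"; the rest is checked.
/// - "inner": only text after `#` on a line is checked.
pub fn extract_all_words<'a>(text: &'a str, lang: &str, base: usize, out: &mut Vec<Found<'a>>) {
    match lang {
        "outer" => {
            let mut pos = 0;
            while let Some(rel) = text[pos..].find('{') {
                let open = pos + rel;
                push_words(&text[pos..open], base + pos, out);
                let close = text[open..].find('}').map_or(text.len(), |c| open + c);
                // Recurse into the injected region, carrying its start offset.
                extract_all_words(&text[open + 1..close], "inner", base + open + 1, out);
                pos = (close + 1).min(text.len());
            }
            push_words(&text[pos..], base + pos, out);
        }
        "inner" => {
            let mut offset = 0;
            for line in text.split_inclusive('\n') {
                if let Some(hash) = line.find('#') {
                    push_words(&line[hash + 1..], base + offset + hash + 1, out);
                }
                offset += line.len();
            }
        }
        _ => {} // Unknown languages contribute no words in this sketch.
    }
}
```

Calling `extract_all_words(doc, "outer", 0, &mut out)` on a document with an embedded `{...}` region yields words whose `start` values index directly into `doc`, which is the property the injected-region offset test in this PR checks for the real pipeline.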

- Use HashSet<TextRange> in checker.rs to deduplicate identical spans, matching the old main branch behavior that used HashSet per word.
- Add test for duplicate span deduplication in checker.rs.
- Add test for injected region byte offset correctness (verifies offsets from Python code blocks map back to the right document position).
- Add test for no duplicate spans in block quotes.
- Fix misleading comment on bash code block test: mkdir passes because bash.scm doesn't capture command invocations, not because of a bash dictionary.
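A minimal sketch of span deduplication with a HashSet, for readers unfamiliar with the pattern. The `TextRange` fields and the `dedupe_spans` helper are assumptions made for this example; the actual type in checker.rs may be shaped differently.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the checker's span type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct TextRange {
    pub start_byte: usize,
    pub end_byte: usize,
}

/// Collapse identical spans, then sort so the output order is deterministic
/// (HashSet iteration order is not).
pub fn dedupe_spans(spans: &[TextRange]) -> Vec<TextRange> {
    let unique: HashSet<TextRange> = spans.iter().copied().collect();
    let mut out: Vec<TextRange> = unique.into_iter().collect();
    out.sort();
    out
}
```

Two captures that report the same byte range for the same word collapse into a single location, so one misspelling is reported once.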

Replace per-call Query::new with a static COMPILED_QUERIES map that compiles all .scm queries once on first access. Since queries come from include_str! and never change at runtime, this avoids recompilation on every recursive injection call and panics immediately on invalid queries rather than hiding failures until a user opens that file type.
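The compile-once pattern can be sketched with `std::sync::LazyLock` (stable since Rust 1.80). This is an illustration under assumptions, not the PR's code: `CompiledQuery`, `compile`, `QUERY_SOURCES`, and `query_for` are hypothetical, the "compiler" is a toy that only checks parenthesis balance, and inline string literals stand in for the `include_str!` sources.

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

/// Hypothetical stand-in for a compiled tree-sitter query.
#[derive(Debug)]
pub struct CompiledQuery {
    pub capture_count: usize,
}

/// Toy "compiler": rejects unbalanced parentheses and counts `@` captures.
fn compile(source: &str) -> Result<CompiledQuery, String> {
    let mut depth = 0i32;
    for c in source.chars() {
        match c {
            '(' => depth += 1,
            ')' => depth -= 1,
            _ => {}
        }
        if depth < 0 {
            return Err("unexpected ')'".to_string());
        }
    }
    if depth != 0 {
        return Err("unclosed '('".to_string());
    }
    Ok(CompiledQuery { capture_count: source.matches('@').count() })
}

/// In the real project these strings would come from include_str!.
const QUERY_SOURCES: &[(&str, &str)] = &[
    ("python", "(comment) @comment (string) @string"),
    ("markdown", "(html_block) @injection.html"),
];

/// Compiled on first access; every query is compiled in that one pass,
/// so an invalid query for any language panics right then.
pub static COMPILED_QUERIES: LazyLock<HashMap<&'static str, CompiledQuery>> =
    LazyLock::new(|| {
        QUERY_SOURCES
            .iter()
            .map(|(lang, src)| {
                let query = compile(src)
                    .unwrap_or_else(|e| panic!("invalid query for {lang}: {e}"));
                (*lang, query)
            })
            .collect()
    });

/// Look up a compiled query without recompiling.
pub fn query_for(lang: &str) -> Option<&'static CompiledQuery> {
    COMPILED_QUERIES.get(lang)
}
```

Because the whole map is built in one initializer, a broken query for a rarely used language fails on the first lookup of any language, which is the early-failure behavior described above.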

Verify that include_tags and exclude_tags are correctly applied inside injected code blocks (e.g. Python inside markdown), not just at the top-level language.

- Fix weak alias test: now asserts wrld has 2 locations (one per injected block) instead of just checking word presence.
- Restructure check_words to filter correct words before insertion, matching the old behavior where the debug_assert only fires on misspelled words with duplicate locations (actual query bugs). Correct words with overlapping captures (e.g. Erlang atoms that are both @string.special and @identifier.function) are filtered out before the assert.
- Update examples/example.md with pathological multi-language test cases covering aliases, unknown languages, HTML blocks, block quotes, many fenced blocks, and edge cases.

The blanket (atom) @string.special pattern overlapped with (function_clause name: (atom) @identifier.function), producing duplicate captures at the same byte range. Replace with specific parent-context patterns (module_attribute, tuple, map_field, call) that don't overlap with function names. The debug_assert in checker.rs now fires on all candidates (not just misspelled words), matching the original intent of catching inefficient queries during development.

Change WordCandidate from owning a String to borrowing &str from the source document text. This removes one heap allocation per extracted word. The lifetime chain works because splitter::split returns SplitRef borrowing from its input, which borrows from document_text through the tree-sitter node text or directly. Also removes an unnecessary String allocation for injection language text by borrowing from the tree-sitter provider bytes directly.
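A small sketch of what a borrowing candidate looks like. The field names and the `candidates` function are assumptions for illustration; the point is only that `word` is a slice of the document, so producing a candidate copies no text.

```rust
/// Hypothetical borrowed candidate: `word` points into the document text
/// instead of owning a separate String.
#[derive(Debug, PartialEq, Eq)]
pub struct WordCandidate<'a> {
    pub word: &'a str,
    pub start_byte: usize,
    pub end_byte: usize,
}

/// Collect alphabetic runs as candidates borrowing from `document_text`.
pub fn candidates(document_text: &str) -> Vec<WordCandidate<'_>> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in document_text.char_indices() {
        if c.is_alphabetic() {
            if start.is_none() {
                start = Some(i);
            }
        } else if let Some(s) = start.take() {
            out.push(WordCandidate {
                word: &document_text[s..i],
                start_byte: s,
                end_byte: i,
            });
        }
    }
    // Flush a word that runs to the end of the text.
    if let Some(s) = start {
        out.push(WordCandidate {
            word: &document_text[s..],
            start_byte: s,
            end_byte: document_text.len(),
        });
    }
    out
}
```

The lifetime `'a` ties each candidate to the document, so the compiler rejects any use of a candidate after the document text has been dropped.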

Add split_into() that appends to a caller-provided Vec, avoiding a fresh Vec allocation per word boundary segment in extract_words_from_text. The Vec is allocated once per text node and reused across all words.
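The buffer-reuse idea can be sketched as follows. The signature and the splitting rules here (underscores and lower-to-upper case transitions) are assumptions for the example and may not match the real splitter; `count_parts` is a hypothetical caller showing one Vec reused across many words.

```rust
/// Hypothetical splitter: appends the parts of `word` to `out` instead of
/// returning a fresh Vec. Splits on `_` and on lower-to-upper transitions.
pub fn split_into<'a>(word: &'a str, out: &mut Vec<&'a str>) {
    let mut start = 0;
    let mut prev_lower = false;
    for (i, c) in word.char_indices() {
        if c == '_' {
            if i > start {
                out.push(&word[start..i]);
            }
            start = i + 1; // '_' is one byte, so this stays on a char boundary
            prev_lower = false;
        } else {
            if c.is_uppercase() && prev_lower {
                out.push(&word[start..i]);
                start = i;
            }
            prev_lower = c.is_lowercase();
        }
    }
    if start < word.len() {
        out.push(&word[start..]);
    }
}

/// Caller side: one buffer, allocated once and cleared between words.
pub fn count_parts(text: &str) -> usize {
    let mut buf: Vec<&str> = Vec::new();
    let mut total = 0;
    for word in text.split_whitespace() {
        buf.clear(); // keeps the allocated capacity
        split_into(word, &mut buf);
        total += buf.len();
    }
    total
}
```

`Vec::clear` drops the elements but keeps the capacity, so after the first few words the loop typically runs without further allocation.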

blopker added 16 commits March 20, 2026 09:07

Attempt 1

0decb10

Attempt 2

fdd4931

Add tests for tag filters through injected regions

fcabf2d

Remove old doc

2eb40a9

Format and spelling

c0c7672

Example

4104246

Reuse splitter Vec across word boundary iterations

76380a5

Update changelog with unreleased changes

d2bb135

Update queries README with injection docs, remove em dashes

3c59c97

Format

5a936db

blopker merged commit 7f78997 into main Mar 20, 2026
16 checks passed

blopker merged 16 commits into main from feature/new-pipeline