Restructure processing to enable multi-language file support #234
Merged
Unify multi-language support into .scm query files using @injection.* capture tags, eliminating the separate regions.rs module.

Three injection tag forms:
- @injection.{lang}: static injection (e.g. @injection.html for HTML blocks in markdown)
- @injection.content + @injection.language: dynamic injection (e.g. fenced code blocks where the language comes from the info string)
- Existing tags (@string, @comment, @identifier.*): word extraction

The recursive extract_all_words function replaces the previous extract_regions → extract_nodes → extract_words pipeline. Adding injection support to any language is now just a matter of adding @injection.* captures to its .scm file; no Rust code changes are needed.

Restores markdown.scm with injection captures and the Markdown tree-sitter grammar. Deletes regions.rs.
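To make the three tag forms concrete, here is a minimal sketch of how a capture name could be classified before deciding whether to recurse into an injected language or extract words. The `CaptureKind` enum and `classify_capture` function are hypothetical names for illustration, not the PR's actual code; the sketch assumes capture names arrive without the leading `@`, which is how tree-sitter reports them.

```rust
/// How a capture name from a .scm query is interpreted.
/// Hypothetical type for illustration; not taken from the PR.
#[derive(Debug, PartialEq, Eq)]
enum CaptureKind<'a> {
    /// `@injection.{lang}`: the captured node is parsed as `lang`.
    StaticInjection(&'a str),
    /// `@injection.content`: text whose language is named by a sibling capture.
    InjectionContent,
    /// `@injection.language`: the node whose text names the language.
    InjectionLanguage,
    /// Any other tag (`@string`, `@comment`, `@identifier.*`): extract words.
    Words,
}

/// Classify a capture name (given without the leading `@`).
fn classify_capture(name: &str) -> CaptureKind<'_> {
    match name.strip_prefix("injection.") {
        Some("content") => CaptureKind::InjectionContent,
        Some("language") => CaptureKind::InjectionLanguage,
        // Any other non-empty suffix is treated as a static language name.
        Some(lang) if !lang.is_empty() => CaptureKind::StaticInjection(lang),
        // No `injection.` prefix (or an empty suffix): ordinary word extraction.
        _ => CaptureKind::Words,
    }
}
```

Under this scheme a recursive extractor would re-enter itself for the two injection cases and fall through to word extraction otherwise, which is why a new injection needs only a new capture in the .scm file.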
- Use HashSet<TextRange> in checker.rs to deduplicate identical spans, matching the old main branch behavior that used a HashSet per word.
- Add test for duplicate span deduplication in checker.rs.
- Add test for injected region byte offset correctness (verifies that offsets from Python code blocks map back to the right document position).
- Add test for no duplicate spans in block quotes.
- Fix misleading comment on bash code block test: mkdir passes because bash.scm doesn't capture command invocations, not because of a bash dictionary.
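The deduplication described above can be sketched as follows. This is an illustration only: the `TextRange` fields and the `collect_locations` helper are assumptions, and the real types in checker.rs may differ.

```rust
use std::collections::{HashMap, HashSet};

/// Byte span of a word in the document (field names are assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct TextRange {
    start_byte: usize,
    end_byte: usize,
}

/// Group candidate spans by word. Storing spans in a HashSet means two
/// captures that report the same byte range collapse into one location.
fn collect_locations<'a>(
    candidates: &[(&'a str, TextRange)],
) -> HashMap<&'a str, HashSet<TextRange>> {
    let mut locations: HashMap<&'a str, HashSet<TextRange>> = HashMap::new();
    for &(word, range) in candidates {
        locations.entry(word).or_default().insert(range);
    }
    locations
}
```

A word reported twice at the same range ends up with one location, while the same word at a different range keeps its own entry.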
Replace per-call Query::new with a static COMPILED_QUERIES map that compiles all .scm queries once on first access. Since queries come from include_str! and never change at runtime, this avoids recompilation on every recursive injection call and panics immediately on invalid queries rather than hiding failures until a user opens that file type.
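A compile-once map of this kind might look like the sketch below, using only the standard library. `CompiledQuery`, `compile`, and the inline query strings are stand-ins: the PR stores real tree-sitter queries loaded via include_str!, and may use a different lazy-initialization primitive than `OnceLock`.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Stand-in for a compiled tree-sitter query (hypothetical).
#[derive(Debug)]
struct CompiledQuery {
    capture_count: usize,
}

/// Hypothetical query sources; the PR embeds the .scm files with include_str!.
const QUERY_SOURCES: &[(&str, &str)] = &[
    ("markdown", "(html_block) @injection.html"),
    ("python", "(comment) @comment (string) @string"),
];

/// Stand-in compile step: rejects empty sources and counts `@` captures.
fn compile(source: &str) -> Result<CompiledQuery, String> {
    if source.trim().is_empty() {
        return Err("empty query".to_string());
    }
    Ok(CompiledQuery { capture_count: source.matches('@').count() })
}

/// Compile every query on first access and reuse the map afterwards.
/// An invalid query panics here, on first use, for every language at once.
fn compiled_queries() -> &'static HashMap<&'static str, CompiledQuery> {
    static COMPILED_QUERIES: OnceLock<HashMap<&'static str, CompiledQuery>> = OnceLock::new();
    COMPILED_QUERIES.get_or_init(|| {
        QUERY_SOURCES
            .iter()
            .map(|&(lang, src)| {
                let query = compile(src)
                    .unwrap_or_else(|e| panic!("invalid query for {lang}: {e}"));
                (lang, query)
            })
            .collect()
    })
}
```

Every call after the first returns the same map, so recursive injection calls pay only a hash lookup.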
Verify that include_tags and exclude_tags are correctly applied inside injected code blocks (e.g. Python inside markdown), not just at the top-level language.
- Fix weak alias test: now asserts wrld has 2 locations (one per injected block) instead of just checking word presence.
- Restructure check_words to filter correct words before insertion, matching the old behavior where the debug_assert only fires on misspelled words with duplicate locations (actual query bugs). Correct words with overlapping captures (e.g. Erlang atoms that are both @string.special and @identifier.function) are filtered out before the assert.
- Update examples/example.md with pathological multi-language test cases covering aliases, unknown languages, HTML blocks, block quotes, many fenced blocks, and edge cases.
The blanket (atom) @string.special pattern overlapped with (function_clause name: (atom) @identifier.function), producing duplicate captures at the same byte range. Replace with specific parent-context patterns (module_attribute, tuple, map_field, call) that don't overlap with function names. The debug_assert in checker.rs now fires on all candidates (not just misspelled words), matching the original intent of catching inefficient queries during development.
Change WordCandidate from owning a String to borrowing &str from the source document text. This removes one heap allocation per extracted word. The lifetime chain works because splitter::split returns SplitRef borrowing from its input, which borrows from document_text through the tree-sitter node text or directly. Also removes an unnecessary String allocation for injection language text by borrowing from the tree-sitter provider bytes directly.
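The borrowing change can be illustrated with a small sketch. The `WordCandidate` fields and the `candidates_from_text` helper are assumptions made for the example; the real struct and the splitter's word rules are not shown in this PR text.

```rust
/// A word that borrows from the source document instead of owning a String.
/// Field names are assumed for illustration.
#[derive(Debug, PartialEq, Eq)]
struct WordCandidate<'a> {
    word: &'a str,
    start_byte: usize,
}

/// Extract alphabetic runs as candidates. Every `word` is a slice of
/// `document_text`, so no per-word heap allocation takes place.
fn candidates_from_text<'a>(document_text: &'a str) -> Vec<WordCandidate<'a>> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, ch) in document_text.char_indices() {
        match (ch.is_alphabetic(), start) {
            // A word begins at this byte offset.
            (true, None) => start = Some(i),
            // A word just ended: slice it out of the original text.
            (false, Some(s)) => {
                out.push(WordCandidate { word: &document_text[s..i], start_byte: s });
                start = None;
            }
            _ => {}
        }
    }
    // Flush a word that runs to the end of the text.
    if let Some(s) = start {
        out.push(WordCandidate { word: &document_text[s..], start_byte: s });
    }
    out
}
```

The lifetime `'a` ties each candidate to the document text, so the compiler rejects any use of a candidate after the text is dropped.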
Add split_into() that appends to a caller-provided Vec, avoiding a fresh Vec allocation per word boundary segment in extract_words_from_text. The Vec is allocated once per text node and reused across all words.
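The append-into-a-buffer pattern can be sketched as below. Only the name `split_into` comes from the commit message; the signature, the splitting rules (underscores and lower-to-upper case boundaries), and the `count_segments` caller are assumptions for illustration.

```rust
/// Append the segments of `word` to `out` instead of returning a new Vec.
/// Splits on `_` and on lower-to-upper case boundaries (assumed rules).
fn split_into<'a>(word: &'a str, out: &mut Vec<&'a str>) {
    let mut start = 0;
    let mut prev_lower = false;
    for (i, ch) in word.char_indices() {
        if ch == '_' {
            if i > start {
                out.push(&word[start..i]);
            }
            start = i + 1; // '_' is one byte wide
            prev_lower = false;
        } else {
            if ch.is_uppercase() && prev_lower {
                out.push(&word[start..i]);
                start = i;
            }
            prev_lower = ch.is_lowercase();
        }
    }
    if start < word.len() {
        out.push(&word[start..]);
    }
}

/// Hypothetical caller: one buffer is allocated and reused for every word.
fn count_segments(words: &[&str]) -> usize {
    let mut segments = Vec::new();
    let mut total = 0;
    for &word in words {
        segments.clear(); // keeps the allocation, drops the old contents
        split_into(word, &mut segments);
        total += segments.len();
    }
    total
}
```

Because `Vec::clear` retains capacity, the buffer stops allocating once it has grown to fit the largest word seen so far.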