
Restructure processing to enable multi-language file support #234

Merged
blopker merged 16 commits into main from feature/new-pipeline
Mar 20, 2026

Conversation

@blopker
Owner

@blopker blopker commented Mar 20, 2026

No description provided.

blopker added 16 commits March 20, 2026 09:07
Unify multi-language support into .scm query files using @injection.*
capture tags, eliminating the separate regions.rs module.

Three injection tag forms:
- @injection.{lang}: static injection (e.g. @injection.html for
  HTML blocks in markdown)
- @injection.content + @injection.language: dynamic injection
  (e.g. fenced code blocks where the language comes from the info string)
- Existing tags (@string, @comment, @identifier.*): word extraction

The recursive extract_all_words function replaces the previous
extract_regions → extract_nodes → extract_words pipeline. Adding
injection support to any language now only requires adding @injection.*
captures to its .scm file; no Rust code changes are needed.
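
The three tag forms above could be told apart with a small classifier like the sketch below. This is only an illustration: `CaptureKind` and `classify_capture` are hypothetical names, not taken from this PR, and the sketch assumes capture names arrive without the leading `@`.

```rust
/// How a capture name from a .scm query is interpreted.
/// Hypothetical type for illustration; not the PR's actual code.
#[derive(Debug, PartialEq, Eq)]
enum CaptureKind<'a> {
    /// `@injection.content`: the node whose text is re-parsed.
    InjectionContent,
    /// `@injection.language`: the node whose text names the language.
    InjectionLanguage,
    /// `@injection.{lang}`: static injection, language taken from the tag.
    StaticInjection(&'a str),
    /// Any other tag (`@string`, `@comment`, `@identifier.*`): word extraction.
    WordExtraction(&'a str),
}

/// Classify a capture name, assuming the leading `@` is already stripped.
fn classify_capture(name: &str) -> CaptureKind<'_> {
    match name.strip_prefix("injection.") {
        Some("content") => CaptureKind::InjectionContent,
        Some("language") => CaptureKind::InjectionLanguage,
        Some(lang) if !lang.is_empty() => CaptureKind::StaticInjection(lang),
        // No `injection.` prefix (or an empty suffix): treat as a word tag.
        _ => CaptureKind::WordExtraction(name),
    }
}
```

With this shape, `classify_capture("injection.html")` yields a static injection for `html`, while `classify_capture("identifier.function")` falls through to word extraction.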

Restores markdown.scm with injection captures and the Markdown
tree-sitter grammar. Deletes regions.rs.
- Use HashSet<TextRange> in checker.rs to deduplicate identical spans,
  matching the old main branch behavior that used HashSet per word.
- Add test for duplicate span deduplication in checker.rs.
- Add test for injected region byte offset correctness (verifies
  offsets from Python code blocks map back to the right document
  position).
- Add test for no duplicate spans in block quotes.
- Fix misleading comment on bash code block test: mkdir passes because
  bash.scm doesn't capture command invocations, not because of a bash
  dictionary.
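
A minimal sketch of the span deduplication described above, assuming `TextRange` is a plain hashable byte span. The struct definition and the `dedup_spans` helper are stand-ins for illustration, not the types in checker.rs.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the project's `TextRange`: a half-open byte span.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct TextRange {
    start_byte: usize,
    end_byte: usize,
}

/// Keep the first occurrence of each span, preserving the original order.
fn dedup_spans(spans: &[TextRange]) -> Vec<TextRange> {
    let mut seen = HashSet::new();
    spans
        .iter()
        .copied()
        // `insert` returns false when the span was already present.
        .filter(|span| seen.insert(*span))
        .collect()
}
```

Two captures that cover the same bytes (for example, overlapping query patterns) then collapse into a single reported location.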
Replace per-call Query::new with a static COMPILED_QUERIES map that
compiles all .scm queries once on first access. Since queries come from
include_str! and never change at runtime, this avoids recompilation on
every recursive injection call and panics immediately on invalid queries
rather than hiding failures until a user opens that file type.
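
The compile-once pattern can be sketched with `std::sync::OnceLock`, as below. Everything except the `COMPILED_QUERIES` name is an assumption: `CompiledQuery` and `compile` stand in for tree-sitter's query type and constructor, and the inline query strings stand in for the `include_str!` sources.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Hypothetical stand-in for a compiled tree-sitter query.
#[derive(Debug)]
struct CompiledQuery {
    pattern_count: usize,
}

/// Stand-in for query compilation. It only counts lines that open a pattern;
/// the real code would build an actual tree-sitter query here.
fn compile(source: &str) -> Result<CompiledQuery, String> {
    let pattern_count = source
        .lines()
        .filter(|line| line.trim_start().starts_with('('))
        .count();
    if pattern_count == 0 {
        return Err("query contains no patterns".to_string());
    }
    Ok(CompiledQuery { pattern_count })
}

/// Illustrative query sources; the PR embeds real .scm files instead.
const QUERY_SOURCES: &[(&str, &str)] = &[
    ("markdown", "(html_block) @injection.html"),
    ("python", "(string) @string\n(comment) @comment"),
];

/// Compile every query on first access and reuse the map afterwards.
fn compiled_queries() -> &'static HashMap<&'static str, CompiledQuery> {
    static COMPILED_QUERIES: OnceLock<HashMap<&'static str, CompiledQuery>> = OnceLock::new();
    COMPILED_QUERIES.get_or_init(|| {
        QUERY_SOURCES
            .iter()
            .map(|&(lang, source)| {
                // Fail loudly on first access instead of per file type later.
                let query = compile(source)
                    .unwrap_or_else(|err| panic!("invalid query for {}: {}", lang, err));
                (lang, query)
            })
            .collect()
    })
}
```

Because the initializer visits every entry, a single broken query panics the first time any language is requested, which is the early-failure behavior the commit message describes.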
Verify that include_tags and exclude_tags are correctly applied inside
injected code blocks (e.g. Python inside markdown), not just at the
top-level language.
- Fix weak alias test: now asserts wrld has 2 locations (one per
  injected block) instead of just checking word presence.
- Restructure check_words to filter correct words before insertion,
  matching the old behavior where the debug_assert only fires on
  misspelled words with duplicate locations (actual query bugs).
  Correct words with overlapping captures (e.g. Erlang atoms that
  are both @string.special and @identifier.function) are filtered
  out before the assert.
- Update examples/example.md with pathological multi-language test
  cases covering aliases, unknown languages, HTML blocks, block
  quotes, many fenced blocks, and edge cases.
The blanket (atom) @string.special pattern overlapped with
(function_clause name: (atom) @identifier.function), producing
duplicate captures at the same byte range. Replace with specific
parent-context patterns (module_attribute, tuple, map_field, call)
that don't overlap with function names.

The debug_assert in checker.rs now fires on all candidates (not just
misspelled words), matching the original intent of catching inefficient
queries during development.
Change WordCandidate from owning a String to borrowing &str from the
source document text. This removes one heap allocation per extracted
word. The lifetime chain works because splitter::split returns
SplitRef borrowing from its input, which borrows from document_text
through the tree-sitter node text or directly.

Also removes an unnecessary String allocation for injection language
text by borrowing from the tree-sitter provider bytes directly.
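
As a rough illustration of the borrowing change, the sketch below builds candidates whose `word` field is a slice of the document text rather than an owned `String`. The field names and the `candidates` helper are assumptions, and the simple alphabetic-run splitting here is a placeholder for the real extraction.

```rust
/// Hypothetical shape of a borrowed candidate: `word` points into the document.
#[derive(Debug, PartialEq, Eq)]
struct WordCandidate<'a> {
    word: &'a str,
    start_byte: usize,
    end_byte: usize,
}

/// Collect alphabetic runs as candidates without copying any text.
fn candidates(document_text: &str) -> Vec<WordCandidate<'_>> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    // `char_indices` yields byte offsets on char boundaries, so slicing is safe.
    for (i, ch) in document_text.char_indices() {
        match (ch.is_alphabetic(), start) {
            (true, None) => start = Some(i),
            (false, Some(s)) => {
                out.push(WordCandidate { word: &document_text[s..i], start_byte: s, end_byte: i });
                start = None;
            }
            _ => {}
        }
    }
    // Flush a word that runs to the end of the text.
    if let Some(s) = start {
        out.push(WordCandidate {
            word: &document_text[s..],
            start_byte: s,
            end_byte: document_text.len(),
        });
    }
    out
}
```

The lifetime `'a` ties every candidate to `document_text`, so the compiler rejects any use of a candidate after the source text is dropped.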
Add split_into() that appends to a caller-provided Vec, avoiding a
fresh Vec allocation per word boundary segment in extract_words_from_text.
The Vec is allocated once per text node and reused across all words.
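
A sketch of the append-into-caller-buffer idea follows. Only the `split_into` name comes from the commit message; the signature, the camelCase/underscore splitting rule, and the `segment_counts` helper are assumptions made for illustration.

```rust
/// Append the segments of `word` to `out` instead of returning a new Vec.
/// The splitting rule (underscores and lower-to-upper transitions) is a
/// placeholder; the project's real splitter may differ.
fn split_into<'a>(word: &'a str, out: &mut Vec<&'a str>) {
    let mut start = 0;
    let mut prev_lower = false;
    for (i, ch) in word.char_indices() {
        if ch == '_' {
            if start < i {
                out.push(&word[start..i]);
            }
            start = i + 1; // '_' is one byte wide
            prev_lower = false;
        } else {
            if ch.is_uppercase() && prev_lower {
                out.push(&word[start..i]);
                start = i;
            }
            prev_lower = ch.is_lowercase();
        }
    }
    if start < word.len() {
        out.push(&word[start..]);
    }
}

/// Reuse one buffer across many words: clear it, refill it, read it.
fn segment_counts(words: &[&str]) -> Vec<usize> {
    let mut buf = Vec::new(); // allocated once, reused for every word
    let mut counts = Vec::with_capacity(words.len());
    for &word in words {
        buf.clear(); // keeps capacity, drops old segments
        split_into(word, &mut buf);
        counts.push(buf.len());
    }
    counts
}
```

`Vec::clear` retains the allocated capacity, so after the first few words the buffer typically stops reallocating.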
@blopker blopker merged commit 7f78997 into main Mar 20, 2026
16 checks passed