
Commit 47e74d4

Add settings to include or exclude types of code (tag support) (#232)
* feat: namespace tree-sitter capture tags for tag-based filtering

  Rename all capture names in .scm query files to use a dot-separated hierarchy (e.g., @identifier.function instead of @func_declaration) so users can filter spell checking by tag via include_tags/exclude_tags in codebook.toml. Add query tag reference README and document the new config options in the main README.

* test: validate capture names in .scm files against allowed tag list

  Adds a test that checks every capture name across all language queries matches the allowed tag taxonomy (comment, string, identifier, etc.). This prevents ad-hoc tag names from being introduced.

* feat: implement tag-based filtering for spell checking

  Add include_tags/exclude_tags to ConfigSettings with prefix-based matching. exclude_tags takes precedence over include_tags, matching how ignore_paths takes precedence over include_paths. The parser now extracts capture names from tree-sitter queries and skips captures whose tags don't pass the filter. Text mode (no tree-sitter) ignores tag filters since there are no captures.

* Allow build

* docs: add changelog entry for tag-based filtering

* Refactor settings
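The filtering rule described in the commit message (prefix-based matching, with `exclude_tags` taking precedence over `include_tags`, and an empty `include_tags` meaning "check everything") can be sketched roughly as follows. This is an illustration only, not the code from this commit: `tag_matches` is a hypothetical helper, `should_check_tag` is written here as a free function rather than the `ConfigSettings` method, and the choice to match only on whole dot-separated segments (so that `comment` does not match a tag such as `commentary`) is an assumption.

```rust
/// Hypothetical helper: true when `tag` equals `prefix` or is a
/// dot-separated descendant of it (e.g. "comment" vs. "comment.line").
/// Matching on whole segments is an assumption, not confirmed by the commit.
fn tag_matches(tag: &str, prefix: &str) -> bool {
    match tag.strip_prefix(prefix) {
        Some(rest) => rest.is_empty() || rest.starts_with('.'),
        None => false,
    }
}

/// Sketch of the filter: exclusions win, then an empty include list
/// means "check everything", otherwise the tag must match an include prefix.
fn should_check_tag(include_tags: &[&str], exclude_tags: &[&str], tag: &str) -> bool {
    if exclude_tags.iter().any(|p| tag_matches(tag, p)) {
        return false;
    }
    include_tags.is_empty() || include_tags.iter().any(|p| tag_matches(tag, p))
}
```

Under these assumptions, `include_tags = ["comment", "string"]` with `exclude_tags = ["string.heredoc"]` would check `comment.line` and `string`, but skip `string.heredoc` and `identifier.function`.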
1 parent aad4c6f commit 47e74d4

37 files changed

Lines changed: 873 additions & 378 deletions

.claude/settings.local.json

Lines changed: 2 additions & 1 deletion
@@ -13,7 +13,8 @@
       "Bash(git remote get-url:*)",
       "Bash(gh issue list:*)",
       "Bash(gh issue view:*)",
-      "Bash(gh repo view:*)"
+      "Bash(gh repo view:*)",
+      "Bash(cargo build:*)"
     ]
   }
 }

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -1,3 +1,8 @@
+[Unreleased]
+
+- Add tag-based filtering (`include_tags`/`exclude_tags`) to control which parts of code are spell-checked (comments, strings, identifiers, etc.)
+- Rename tree-sitter capture names to use dot-separated namespace convention (e.g., `@identifier.function` instead of `@func_declaration`)
+
 [0.3.34]
 
 - Fix crash in Termux by falling back to bundled Mozilla CA roots on Android (#230)

README.md

Lines changed: 34 additions & 62 deletions
@@ -293,6 +293,19 @@ ignore_patterns = [
 # Set to 2 to check words with 2 or more characters
 min_word_length = 3
 
+# Filter which parts of your code are spell-checked by tag.
+# Tags use a dot-separated hierarchy (e.g., "comment", "identifier.function").
+# Matching is prefix-based: "comment" matches "comment", "comment.line",
+# "comment.block", etc.
+#
+# Only check these tags (if set, everything else is excluded)
+# Default: [] (empty = check everything)
+include_tags = ["comment", "string"]
+#
+# Exclude these tags from checking (takes precedence over include_tags)
+# Default: []
+exclude_tags = ["string.heredoc"]
+
 # Whether to use global configuration (project config only)
 # Set to false to completely ignore global settings
 # Default: true
@@ -355,6 +368,26 @@ ignore_patterns = [
 
 **Tip**: Include the identifier in your pattern. `'vim\.opt\.[a-z]+'` skips `showmode` in `vim.opt.showmode`, but `'vim\.opt\.'` alone won't (it only matches up to the dot).
 
+### Tag-Based Filtering
+
+Codebook categorizes every piece of text it checks using **tags** — dot-separated labels like `comment`, `string`, `identifier.function`, etc. You can use `include_tags` and `exclude_tags` to control which categories are spell-checked.
+
+Matching is **prefix-based**: `"comment"` matches `comment`, `comment.line`, `comment.block`, etc. `include_tags` narrows what is checked (allowlist), and `exclude_tags` removes from that set (blocklist, takes precedence). This works the same way as `include_paths`/`ignore_paths`.
+
+```toml
+# Only check comments and strings, ignore all identifiers
+include_tags = ["comment", "string"]
+
+# Check everything except variable and parameter names
+exclude_tags = ["identifier.variable", "identifier.parameter"]
+
+# Both can be combined: check comments and strings, but skip heredocs
+include_tags = ["comment", "string"]
+exclude_tags = ["string.heredoc"]
+```
+
+For the full list of available tags, see the [query tag reference](crates/codebook/src/queries/README.md).
+
 ### LSP Initialization Options
 
 Editors can pass `initializationOptions` when starting the Codebook LSP for LSP-specific options. Refer to your editor's documentation for how to apply these options. All values are optional, omit them for the default behavior:
@@ -451,68 +484,7 @@ For plain text dictionaries, use `TextRepo::new()` instead and add to `TEXT_DICT
 
 ## Adding New Programming Language Support
 
-Codebook uses Tree-sitter support additional programming languages. Here's how to add support for a new language:
-
-### 1. Create a Tree-sitter Query
-
-Each language needs a Tree-sitter query file that defines which parts of the code should be checked for spelling issues. The query needs to capture:
-
-- Identifiers (variable names, function names, class names, etc.)
-- String literals
-- Comments
-
-Create a new `.scm` file in `codebook/crates/codebook/src/queries/` named after your language (e.g., `java.scm`).
-
-### 2. Understand the Language's AST
-
-To write an effective query, you need to understand the Abstract Syntax Tree (AST) structure of your language. Use these tools:
-
-- [Tree-sitter Playground](https://tree-sitter.github.io/tree-sitter/7-playground.html): Interactively explore how Tree-sitter parses code
-- [Tree-sitter Visualizer](https://blopker.github.io/ts-visualizer/): Visualize the AST of your code in a more detailed way
-
-A good approach is to:
-
-1. Write sample code with identifiers, strings, and comments
-2. Paste it into the playground/visualizer
-3. Observe the node types used for each element
-4. Create capture patterns that target only definition nodes, not usages
-
-### 3. Update the Language Settings
-
-Add your language to `codebook/crates/codebook/src/queries.rs`:
-
-1. Add a new variant to the `LanguageType` enum
-2. Add a new entry to the `LANGUAGE_SETTINGS` array with:
-   - The language type
-   - File extensions for your language
-   - Language identifiers
-   - Path to your query file
-
-### 4. Add the Tree-sitter Grammar
-
-Make sure the appropriate Tree-sitter grammar is added as a dependency in `Cargo.toml` and update the `language()` function in `queries.rs` to return the correct language parser.
-
-### 5. Test Your Implementation
-
-Run the tests to ensure your query is valid:
-
-```bash
-cargo test -p codebook queries::tests::test_all_queries_are_valid
-```
-
-Additional language tests should go in `codebook/tests`. There are many example tests to copy.
-
-You can also test with real code files to verify that Codebook correctly identifies spelling issues in your language. Example files should go in `examples/` and contain at least one spelling error to pass integration tests.
-
-### Tips for Writing Effective Queries
-
-- Focus on capturing definitions, not usages
-- Include only nodes that contain user-defined text (not keywords)
-- Test with representative code samples
-- Start simple and add complexity as needed
-- Look at existing language queries for patterns
-
-If you've successfully added support for a new language, please consider contributing it back to Codebook with a pull request!
+See the [query development guide](crates/codebook/src/queries/README.md) for instructions on adding Tree-sitter queries for new languages, the tag naming convention, and tips for writing effective queries.
 
 ## Running Tests
 
crates/codebook-config/src/helpers.rs

Lines changed: 0 additions & 87 deletions
@@ -1,5 +1,3 @@
-use crate::settings::ConfigSettings;
-use glob::Pattern;
 use log::error;
 use regex::{Regex, RegexBuilder};
 use std::env;
@@ -57,86 +55,6 @@ pub(crate) fn unix_cache_dir() -> PathBuf {
     env::temp_dir().join("codebook").join("cache")
 }
 
-/// Insert a word into the allowlist, returning true when it was newly added.
-pub(crate) fn insert_word(settings: &mut ConfigSettings, word: &str) -> bool {
-    let word = word.to_ascii_lowercase();
-    if settings.words.contains(&word) {
-        return false;
-    }
-    settings.words.push(word);
-    settings.words.sort();
-    settings.words.dedup();
-    true
-}
-
-/// Insert a path into the ignore list, returning true when it was newly added.
-pub(crate) fn insert_ignore(settings: &mut ConfigSettings, file: &str) -> bool {
-    let file = file.to_string();
-    if settings.ignore_paths.contains(&file) {
-        return false;
-    }
-    settings.ignore_paths.push(file);
-    settings.ignore_paths.sort();
-    settings.ignore_paths.dedup();
-    true
-}
-
-/// Insert a path into the include list, returning true when it was newly added.
-pub(crate) fn insert_include(settings: &mut ConfigSettings, file: &str) -> bool {
-    let file = file.to_string();
-    if settings.include_paths.contains(&file) {
-        return false;
-    }
-    settings.include_paths.push(file);
-    settings.include_paths.sort();
-    settings.include_paths.dedup();
-    true
-}
-
-/// Resolve configured dictionary IDs, providing a default when none are set.
-pub(crate) fn dictionary_ids(settings: &ConfigSettings) -> Vec<String> {
-    if settings.dictionaries.is_empty() {
-        vec!["en_us".to_string()]
-    } else {
-        settings.dictionaries.clone()
-    }
-}
-
-fn match_pattern(pattern: &[String], path_str: &str) -> bool {
-    pattern.iter().any(|pattern| {
-        Pattern::new(pattern)
-            .map(|p| p.matches(path_str))
-            .unwrap_or(false)
-    })
-}
-
-/// Determine whether a path should be included based on the configured glob patterns.
-pub(crate) fn should_include_path(settings: &ConfigSettings, path: &Path) -> bool {
-    if settings.include_paths.is_empty() {
-        return true;
-    }
-    let path_str = path.to_string_lossy();
-    match_pattern(&settings.include_paths, &path_str)
-}
-
-/// Determine whether a path should be ignored based on the configured glob patterns.
-pub(crate) fn should_ignore_path(settings: &ConfigSettings, path: &Path) -> bool {
-    let path_str = path.to_string_lossy();
-    match_pattern(&settings.ignore_paths, &path_str)
-}
-
-/// Check if a word is explicitly allowed.
-pub(crate) fn is_allowed_word(settings: &ConfigSettings, word: &str) -> bool {
-    let word = word.to_ascii_lowercase();
-    settings.words.iter().any(|w| w == &word)
-}
-
-/// Check if a word should be flagged.
-pub(crate) fn should_flag_word(settings: &ConfigSettings, word: &str) -> bool {
-    let word = word.to_ascii_lowercase();
-    settings.flag_words.iter().any(|w| w == &word)
-}
-
 /// Compile user-provided ignore regex patterns, dropping invalid entries.
 /// Patterns are compiled with multiline mode so `^` and `$` match line boundaries.
 pub(crate) fn build_ignore_regexes(patterns: &[String]) -> Vec<Regex> {
@@ -154,11 +72,6 @@ pub(crate) fn build_ignore_regexes(patterns: &[String]) -> Vec<Regex> {
         .collect()
 }
 
-/// Retrieve the configured minimum word length.
-pub(crate) fn min_word_length(settings: &ConfigSettings) -> usize {
-    settings.min_word_length
-}
-
 pub(crate) fn expand_tilde<P: AsRef<Path>>(path_user_input: P) -> Option<PathBuf> {
     let p = path_user_input.as_ref();
     if !p.starts_with("~") {

crates/codebook-config/src/lib.rs

Lines changed: 30 additions & 20 deletions
@@ -2,7 +2,8 @@ mod helpers;
 mod settings;
 mod watched_file;
 use crate::helpers::expand_tilde;
-use crate::settings::ConfigSettings;
+pub use crate::settings::ConfigSettings;
+
 use crate::watched_file::WatchedFile;
 use log::debug;
 use log::info;
@@ -32,6 +33,7 @@ pub trait CodebookConfig: Sync + Send + Debug {
     fn should_flag_word(&self, word: &str) -> bool;
     fn get_ignore_patterns(&self) -> Option<Vec<Regex>>;
     fn get_min_word_length(&self) -> usize;
+    fn should_check_tag(&self, tag: &str) -> bool;
     fn cache_dir(&self) -> &Path;
 }
 
@@ -474,51 +476,51 @@ impl CodebookConfigFile {
 impl CodebookConfig for CodebookConfigFile {
     /// Add a word to the project configs allowlist
     fn add_word(&self, word: &str) -> Result<bool, io::Error> {
-        Ok(self.update_project_settings(|settings| helpers::insert_word(settings, word)))
+        Ok(self.update_project_settings(|settings| settings.insert_word(word)))
     }
     /// Add a word to the global configs allowlist
     fn add_word_global(&self, word: &str) -> Result<bool, io::Error> {
-        Ok(self.update_global_settings(|settings| helpers::insert_word(settings, word)))
+        Ok(self.update_global_settings(|settings| settings.insert_word(word)))
     }
 
     /// Add a file to the ignore list
     fn add_ignore(&self, file: &str) -> Result<bool, io::Error> {
-        Ok(self.update_project_settings(|settings| helpers::insert_ignore(settings, file)))
+        Ok(self.update_project_settings(|settings| settings.insert_ignore(file)))
     }
 
     /// Add a file to the include list
     fn add_include(&self, file: &str) -> Result<bool, io::Error> {
-        Ok(self.update_project_settings(|settings| helpers::insert_include(settings, file)))
+        Ok(self.update_project_settings(|settings| settings.insert_include(file)))
     }
 
     /// Get dictionary IDs from effective configuration
     fn get_dictionary_ids(&self) -> Vec<String> {
         let snapshot = self.snapshot();
-        helpers::dictionary_ids(&snapshot)
+        snapshot.dictionary_ids()
     }
 
     /// Check if a path is included based on the effective configuration
     fn should_include_path(&self, path: &Path) -> bool {
         let snapshot = self.snapshot();
-        helpers::should_include_path(&snapshot, path)
+        snapshot.should_include_path(path)
     }
 
     /// Check if a path should be ignored based on the effective configuration
     fn should_ignore_path(&self, path: &Path) -> bool {
         let snapshot = self.snapshot();
-        helpers::should_ignore_path(&snapshot, path)
+        snapshot.should_ignore_path(path)
     }
 
     /// Check if a word is in the effective allowlist
     fn is_allowed_word(&self, word: &str) -> bool {
         let snapshot = self.snapshot();
-        helpers::is_allowed_word(&snapshot, word)
+        snapshot.is_allowed_word(word)
     }
 
     /// Check if a word should be flagged according to effective configuration
     fn should_flag_word(&self, word: &str) -> bool {
         let snapshot = self.snapshot();
-        helpers::should_flag_word(&snapshot, word)
+        snapshot.should_flag_word(word)
     }
 
     /// Get the list of user-defined ignore patterns
@@ -534,7 +536,11 @@ impl CodebookConfig for CodebookConfigFile {
 
     /// Get the minimum word length which should be checked
     fn get_min_word_length(&self) -> usize {
-        helpers::min_word_length(&self.snapshot())
+        self.snapshot().min_word_length()
+    }
+
+    fn should_check_tag(&self, tag: &str) -> bool {
+        self.snapshot().should_check_tag(tag)
     }
 
     fn cache_dir(&self) -> &Path {
@@ -576,7 +582,7 @@ impl CodebookConfigMemory {
 impl CodebookConfig for CodebookConfigMemory {
     fn add_word(&self, word: &str) -> Result<bool, io::Error> {
         let mut settings = self.settings.write().unwrap();
-        Ok(helpers::insert_word(&mut settings, word))
+        Ok(settings.insert_word(word))
     }
 
     fn add_word_global(&self, word: &str) -> Result<bool, io::Error> {
@@ -585,37 +591,37 @@ impl CodebookConfig for CodebookConfigMemory {
 
     fn add_ignore(&self, file: &str) -> Result<bool, io::Error> {
         let mut settings = self.settings.write().unwrap();
-        Ok(helpers::insert_ignore(&mut settings, file))
+        Ok(settings.insert_ignore(file))
     }
 
     fn add_include(&self, file: &str) -> Result<bool, io::Error> {
         let mut settings = self.settings.write().unwrap();
-        Ok(helpers::insert_include(&mut settings, file))
+        Ok(settings.insert_include(file))
     }
 
     fn get_dictionary_ids(&self) -> Vec<String> {
         let snapshot = self.snapshot();
-        helpers::dictionary_ids(&snapshot)
+        snapshot.dictionary_ids()
     }
 
     fn should_include_path(&self, path: &Path) -> bool {
         let snapshot = self.snapshot();
-        helpers::should_include_path(&snapshot, path)
+        snapshot.should_include_path(path)
     }
 
     fn should_ignore_path(&self, path: &Path) -> bool {
         let snapshot = self.snapshot();
-        helpers::should_ignore_path(&snapshot, path)
+        snapshot.should_ignore_path(path)
     }
 
     fn is_allowed_word(&self, word: &str) -> bool {
         let snapshot = self.snapshot();
-        helpers::is_allowed_word(&snapshot, word)
+        snapshot.is_allowed_word(word)
     }
 
     fn should_flag_word(&self, word: &str) -> bool {
         let snapshot = self.snapshot();
-        helpers::should_flag_word(&snapshot, word)
+        snapshot.should_flag_word(word)
     }
 
     fn get_ignore_patterns(&self) -> Option<Vec<Regex>> {
@@ -624,7 +630,11 @@ impl CodebookConfig for CodebookConfigMemory {
     }
 
     fn get_min_word_length(&self) -> usize {
-        helpers::min_word_length(&self.snapshot())
+        self.snapshot().min_word_length()
+    }
+
+    fn should_check_tag(&self, tag: &str) -> bool {
+        self.snapshot().should_check_tag(tag)
     }
 
     fn cache_dir(&self) -> &Path {
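
The call sites in this file now use methods on `ConfigSettings` in place of the free functions removed from `helpers.rs`, but `settings.rs` itself is not part of this excerpt. A minimal sketch of what two of those methods might look like, with bodies adapted from the removed helpers, is given here. The struct is a cut-down stand-in with only two fields, and the exact signatures in the real `settings.rs` are an assumption.

```rust
/// Cut-down stand-in for the real `ConfigSettings` (assumption: only the
/// two fields needed for this sketch are included).
#[derive(Debug, Default)]
struct ConfigSettings {
    words: Vec<String>,
    dictionaries: Vec<String>,
}

impl ConfigSettings {
    /// Insert a word into the allowlist, returning true when it was newly added.
    /// Body adapted from the removed `helpers::insert_word`.
    fn insert_word(&mut self, word: &str) -> bool {
        let word = word.to_ascii_lowercase();
        if self.words.contains(&word) {
            return false;
        }
        self.words.push(word);
        self.words.sort();
        self.words.dedup();
        true
    }

    /// Resolve configured dictionary IDs, providing a default when none are set.
    /// Body adapted from the removed `helpers::dictionary_ids`.
    fn dictionary_ids(&self) -> Vec<String> {
        if self.dictionaries.is_empty() {
            vec!["en_us".to_string()]
        } else {
            self.dictionaries.clone()
        }
    }
}
```

Moving the logic onto the struct lets both `CodebookConfigFile` and `CodebookConfigMemory` call `settings.insert_word(word)` or `snapshot.dictionary_ids()` directly, which is the shape the call sites in the diff take.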
