Goal
Allow users to exclude files and/or content regions (e.g., license headers, generated code) from duplicate scanning.
Scope
- CLI flags: --ignore-glob (repeatable), --ignore-file PATTERN_FILE.
- Pattern file format: one glob per line; # comments and blank lines are ignored.
- Content region exclusion: configurable regex patterns applied before tokenizing (e.g., --ignore-region 're:^# Generated.*?\n').
- Update DuplicateFinder to apply filters pre-tokenization.
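
A minimal sketch of how the pattern file and the glob filter might fit together. The function names (parse_pattern_file, is_ignored, filter_paths) are hypothetical and not part of the existing code. The sketch assumes fnmatch-style globs matched against relative POSIX-style paths, where * also matches path separators; the real matching semantics are still to be decided.

```python
import fnmatch


def parse_pattern_file(text):
    """Return the globs in a pattern file, skipping blank lines and # comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        patterns.append(line)
    return patterns


def is_ignored(path, patterns):
    """True if the path matches any ignore glob (assumed fnmatch semantics)."""
    # fnmatchcase avoids platform-dependent case folding.
    return any(fnmatch.fnmatchcase(path, pat) for pat in patterns)


def filter_paths(paths, patterns):
    """Drop ignored paths so they never enter the scanned set."""
    return [p for p in paths if not is_ignored(p, patterns)]


if __name__ == "__main__":
    globs = parse_pattern_file("# vendored code\n\nvendor/*\n*.min.js\n")
    print(filter_paths(["src/a.py", "vendor/lib/x.py", "static/app.min.js"], globs))
```

Patterns from repeated --ignore-glob flags and from --ignore-file could simply be concatenated into one list before filtering.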
Acceptance Criteria
- Files matching ignore globs are skipped (not counted in scanned set).
- Region regex patterns remove matched text before tokenization.
- Tests: file exclusion, region removal reduces shingles, no false negatives for remaining content.
- README section documenting patterns + examples.
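
The region-removal criteria above could be exercised along these lines. This is a sketch only: compile_region, strip_regions, and shingles are hypothetical helpers, the whitespace tokenizer and shingle size of 3 stand in for whatever DuplicateFinder actually uses, and compiling with re.MULTILINE (so that ^ matches at every line start) is an assumption about how the re: prefix should be interpreted.

```python
import re


def compile_region(spec):
    """Compile an --ignore-region value, dropping an optional 're:' prefix."""
    body = spec[3:] if spec.startswith("re:") else spec
    # MULTILINE is assumed so '^' anchors at each line, not only at file start.
    return re.compile(body, re.MULTILINE)


def strip_regions(text, regions):
    """Remove every match of every region pattern before tokenization."""
    for rx in regions:
        text = rx.sub("", text)
    return text


def shingles(text, k=3):
    """Set of k-token windows over whitespace-split tokens (stand-in tokenizer)."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


if __name__ == "__main__":
    source = "# Generated by tool v1\nx = compute(a, b)\nreturn x\n"
    cleaned = strip_regions(source, [compile_region(r"re:^# Generated.*?\n")])
    # Fewer shingles after stripping, and every remaining one was in the original.
    print(len(shingles(source)), len(shingles(cleaned)))
    print(shingles(cleaned) <= shingles(source))
```

Checking that the cleaned shingle set is a subset of the original is one way to express "no false negatives for remaining content" in a test.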
Non-Goals
- Language-aware comment stripping (future tokenizer feature).
Future
- Central config file (duplicate-finder.toml) for ignore patterns.