bugfix(audit): word-boundary terminology match — fix false positive on substring#10

Open
CryptoJones wants to merge 1 commit into main from bugfix/audit-terminology-word-boundary

@CryptoJones (Owner)

TerminologyUsedCheck did a naive substring search:

    if term.lower() not in other_text.lower(): flag()

That produces a false match whenever a defined term happens to appear
inside a longer word in some other planning file:

  term 'lien'   matches 'client', 'aliens'
  term 'tier'   matches 'tiers', 'frontier'
  term 'rebate' matches 'rebated'
  term 'state'  matches 'estate', 'statement', 'statehouse'

When that happens, the term is silently treated as "used elsewhere" and
never flagged, which undermines the whole point of the check: catching
dead glossary entries.
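
To make the failure mode concrete, here is a minimal sketch of the pre-fix condition. The helper name `is_used_naive` and the sample sentence are illustrative assumptions, not code from the repository; only the `term.lower() in other_text.lower()` test comes from the description above.

```python
def is_used_naive(term: str, other_text: str) -> bool:
    """Substring test equivalent to the pre-fix condition (hypothetical helper)."""
    return term.lower() in other_text.lower()


# Hypothetical text from some other planning file.
other_text = "Each client reviews the frontier estate statement."

for term in ("lien", "tier", "state"):
    # Each term is reported as used, although none appears as a whole word:
    # 'lien' hides in 'client', 'tier' in 'frontier', 'state' in 'estate'.
    print(term, is_used_naive(term, other_text))
```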

Fix: use the same word-boundary regex, with kebab-friendly boundary
characters, that the patterns.py fix uses:

  (?<![\w-]){re.escape(term)}(?![\w-])

Matches multi-word and kebab-case terms as complete tokens; refuses to
match inside a longer identifier.
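
A small sketch of how that pattern behaves, assuming it is compiled per term. The function name `term_pattern` is hypothetical, and `re.IGNORECASE` is an assumption made here to mirror the `.lower()` calls in the old check; the PR description does not say how case is handled after the fix.

```python
import re


def term_pattern(term: str) -> "re.Pattern[str]":
    """Compile the boundary-aware pattern for one term (hypothetical helper)."""
    # The lookbehind and lookahead reject any neighbouring word character
    # or hyphen, so the term must stand alone as a complete token.
    return re.compile(rf"(?<![\w-]){re.escape(term)}(?![\w-])", re.IGNORECASE)


# Whole-token occurrences match, including multi-word and kebab-case terms.
print(bool(term_pattern("lien").search("a tax lien applies")))
print(bool(term_pattern("tax lien").search("A Tax Lien was filed.")))
print(bool(term_pattern("rate-limit").search("the rate-limit applies")))

# Occurrences inside a longer word or kebab identifier do not match.
print(bool(term_pattern("lien").search("the client pays")))
print(bool(term_pattern("limit").search("rate-limit-window")))
```

Including `-` in the boundary class is what keeps `limit` from matching inside `rate-limit-window`; a plain `\b` would treat the hyphen as a word boundary and accept it.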

Test added: define `lien` in DOMAIN.md; the clean_project fixture's
rendered files mention `client` everywhere (which contains `lien` as
a substring). Pre-fix, the check silently swallowed the warning;
post-fix, `lien` is correctly flagged as orphaned.
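
The scenario that test exercises can be sketched in isolation as follows. This is not the repository's test: `orphaned_terms`, the file names, and the file contents are all hypothetical stand-ins for `TerminologyUsedCheck` and the clean_project fixture, whose real interfaces are not shown in this PR description.

```python
import re


def orphaned_terms(terms: list[str], other_files: dict[str, str]) -> list[str]:
    """Return defined terms that no other file uses as a whole token."""
    orphans = []
    for term in terms:
        pattern = re.compile(rf"(?<![\w-]){re.escape(term)}(?![\w-])", re.IGNORECASE)
        if not any(pattern.search(text) for text in other_files.values()):
            orphans.append(term)
    return orphans


# Hypothetical rendered planning files that mention 'client' but never 'lien'
# as a standalone token ('tax-lien' is a different kebab-case identifier).
files = {
    "PLAN.md": "The client signs off on each milestone.",
    "RISKS.md": "Client churn is the main risk; a tax-lien search is out of scope.",
}

print(orphaned_terms(["lien", "client"], files))
```

With the substring check, `lien` would have been found inside `client` and never reported.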

148/148 tests pass; ruff + mypy clean.
