
Normalize component search tokens for better matching #2424

Open
Mbeaulne wants to merge 2 commits into 06-18-expand_component_search_indexing_fields from 06-18-normalize_component_search_tokens_for_better_matching

Conversation

Mbeaulne commented Jun 18, 2026

Collaborator

Description

Improves the component search index by normalizing indexed and query text beyond simple lowercasing. Specifically:

  • Identifier splitting: snake_case, kebab-case, and camelCase component names are split into individual words before indexing, so a query like "train model" matches a component named train-model or train_model, and "load csv file" matches loadCSVFile.
  • Lightweight stemming: a stemToken function reduces common English inflections (plurals via -s/-ies, sibilant plurals via -es, gerunds via -ing, past tense via -ed) to their base forms. Both the original token and its stem are stored in the index, so queries like "training", "datasets", or "batch" match components described with "train", "dataset", or "batches".
  • Normalized query tokenization: The same normalizeSearchText pipeline is applied to query text before scoring, ensuring query tokens and indexed tokens are in the same form.
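
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (splitIdentifierText, stemToken, normalizeSearchText) come from the description and commit message, but every function body, regex, and length threshold here is an assumption.

```typescript
// Hypothetical sketch of the normalization pipeline; bodies are assumptions,
// only the function names are taken from the PR text.

/** Insert spaces at snake_case, kebab-case, and camelCase boundaries. */
function splitIdentifierText(text: string): string {
  return text
    .replace(/[_-]+/g, " ") // train_model, train-model -> "train model"
    .replace(/([A-Z])([A-Z][a-z])/g, "$1 $2") // CSVFile -> "CSV File"
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2"); // loadCSV -> "load CSV"
}

/** Strip a few common English suffixes; deliberately crude. */
function stemToken(token: string): string {
  if (token.length <= 3) return token; // too short to stem safely
  if (/ies$/.test(token)) return token.slice(0, -3) + "y"; // queries -> query
  if (/(ch|sh|ss|x|z)es$/.test(token)) return token.slice(0, -2); // batches -> batch
  if (/ing$/.test(token) && token.length > 5) return token.slice(0, -3); // training -> train
  if (/ed$/.test(token) && token.length > 4) return token.slice(0, -2); // loaded -> load
  // Plain plural, skipping -ss/-is/-us endings such as "status".
  if (/s$/.test(token) && !/(ss|is|us)$/.test(token)) return token.slice(0, -1);
  return token;
}

/** Split, lowercase, tokenize, and emit each word plus its stem (deduplicated). */
function normalizeSearchText(text: string): string[] {
  const words = splitIdentifierText(text)
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0);
  const tokens: string[] = [];
  for (const word of words) {
    for (const form of [word, stemToken(word)]) {
      if (tokens.indexOf(form) === -1) tokens.push(form);
    }
  }
  return tokens;
}
```

Under these assumptions, normalizeSearchText("train-model") and normalizeSearchText("train model") both produce train, model, and normalizeSearchText("loadCSVFile") produces load, csv, file, so the query and the indexed name end up in the same form.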

Related Issue and Pull requests

Type of Change

  • Bug fix
  • New feature
  • Improvement
  • Cleanup/Refactor
  • Breaking change
  • Documentation update

Checklist

  • I have tested this does not break current pipelines / runs functionality
  • I have tested the changes on staging

Screenshots (if applicable)

Test Instructions

  1. Run the existing test suite (componentSearchIndex.test.ts) to verify the new normalization cases pass:
    • Snake/kebab/camelCase names matched by space-separated queries.
    • Stemmed query terms (training, datasets, batch) matching indexed descriptions.
  2. Manually search for components using inflected or hyphenated terms in the UI and confirm relevant results surface.

Additional Comments

The stemmer is intentionally minimal — it handles the most common English suffixes without introducing a full NLP dependency. Both the raw token and its stem are stored so that exact matches are never lost.
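
To illustrate the point about storing both forms, here is a small sketch of an index that records each word under its raw token and its stem. TinyIndex, expandForms, and demoStem are hypothetical names made up for this example, and demoStem is a crude stand-in, not the stemToken added in this PR.

```typescript
// Hypothetical mini-index; not the data structure used in componentSearchIndex.ts.
type Stemmer = (token: string) => string;

// Stand-in stemmer for the demo: drops "es" after a sibilant, otherwise a trailing "s".
const demoStem: Stemmer = (t) =>
  /(ch|sh|ss|x|z)es$/.test(t) ? t.slice(0, -2) : /[^s]s$/.test(t) ? t.slice(0, -1) : t;

/** Raw form and stemmed form of one word, deduplicated. */
function expandForms(word: string, stem: Stemmer): string[] {
  const stemmed = stem(word);
  return stemmed === word ? [word] : [word, stemmed];
}

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
}

class TinyIndex {
  // token -> ids of documents containing it (as raw token or as stem)
  private postings: { [token: string]: string[] } = Object.create(null);

  constructor(private stem: Stemmer) {}

  add(id: string, text: string): void {
    for (const word of tokenize(text)) {
      for (const form of expandForms(word, this.stem)) {
        const ids = this.postings[form] || (this.postings[form] = []);
        if (ids.indexOf(id) === -1) ids.push(id);
      }
    }
  }

  /** Ids of documents that contain every query word in raw or stemmed form. */
  search(query: string): string[] {
    let result: string[] | null = null;
    for (const word of tokenize(query)) {
      const hits: string[] = [];
      for (const form of expandForms(word, this.stem)) {
        for (const id of this.postings[form] || []) {
          if (hits.indexOf(id) === -1) hits.push(id);
        }
      }
      result = result === null ? hits : result.filter((id) => hits.indexOf(id) !== -1);
    }
    return result || [];
  }
}
```

With a document indexed as "Split batches of records", both the query "batches" (raw hit) and the query "batch" (stem hit) find it, which is the behavior the paragraph above describes.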

github-actions Bot commented Jun 18, 2026


🎩 Preview

A preview build has been created at: 06-18-normalize_component_search_tokens_for_better_matching/0494d71

Comment thread src/services/componentSearchIndex.ts
- splitIdentifierText: anchor first capital group to a single char to remove
  O(n²) regex backtracking on long uppercase runs (behavior-preserving)
- stemToken: guard -is/-us endings so status/analysis/axis aren't over-stemmed

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
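
As a rough illustration of the two changes in this commit message: the regexes and the stripPlural helper below are assumptions reconstructed from the wording, since the actual diff is not shown here. A leading `[A-Z]+` group can be retried from every position in a long run of capitals, while a single `[A-Z]` cannot, and both place the word boundary at the same spot.

```typescript
// Hypothetical before/after patterns; the real regexes in the PR may differ.
const greedyBoundary = /([A-Z]+)([A-Z][a-z])/g; // may backtrack over long uppercase runs
const anchoredBoundary = /([A-Z])([A-Z][a-z])/g; // one capital, then the next word start

function splitWith(pattern: RegExp, text: string): string {
  return text.replace(pattern, "$1 $2");
}

// Both patterns should insert the space in the same place on these inputs.
const samples = ["loadCSVFile", "parseHTTPResponse", "XMLHttpRequest", "ABCDEFGHIJ", "already split"];
const sameOutput = samples.every(
  (s) => splitWith(greedyBoundary, s) === splitWith(anchoredBoundary, s)
);

// Hypothetical plural rule, with and without the -is/-us guard.
function stripPlural(token: string, guarded: boolean): string {
  if (!/s$/.test(token) || /ss$/.test(token)) return token;
  if (guarded && /(is|us)$/.test(token)) return token; // status, analysis, axis
  return token.slice(0, -1);
}
```

In this sketch, stripPlural("status", false) returns "statu", while stripPlural("status", true) leaves "status" intact and stripPlural("datasets", true) still returns "dataset".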