SentenceBoundaryDetector: unambiguous cache key to avoid wrong cached boundaries #151

@ysdede

Description

Location: src/lib/transcription/SentenceBoundaryDetector.ts (around lines 423-432, 253-255)

Finding: generateCacheKey returns a 32-bit integer hash converted to a string. Different input texts can hash to the same value (a collision), in which case performNLP returns the sentence boundaries that were cached for a different text.
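
To make the failure mode concrete, here is a minimal sketch of a collision. The body of generateCacheKey is not quoted in this issue, so hashKey32 below is a hypothetical stand-in that assumes the common "hash * 31 + charCode" pattern truncated to 32 bits; the real method may differ in detail, but any 32-bit hash has collisions of this kind.

```typescript
// Hypothetical stand-in for the current generateCacheKey (assumption: the
// real body is not shown in this issue). Computes hash * 31 + charCode,
// wrapped to a signed 32-bit integer, and returns it as a string.
function hashKey32(text: string): string {
  let hash = 0;
  for (let i = 0; i < text.length; i++) {
    hash = ((hash << 5) - hash + text.charCodeAt(i)) | 0;
  }
  return hash.toString();
}

// Cache keyed by the hash string, as described in the finding.
const cache = new Map<string, string[]>();
cache.set(hashKey32("Aa"), ["Aa"]); // result computed for the text "Aa"

// "BB" produces the same 32-bit hash as "Aa", so this lookup for "BB"
// returns the entry that was stored for "Aa".
const wrongHit = cache.get(hashKey32("BB"));
```

With this stand-in, both texts produce the key "2112", so the second lookup is a false cache hit.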

Suggested fix: Make the cache key unambiguous. Options:

  • Use the full input text as the key (simplest and safest for the default small cache size of 100).
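  • Or at minimum prefix the hash with the text length, e.g. len:${text.length}:${hash}. This makes collisions far less likely but does not remove them: two different texts of the same length can still share a hash, so only the full-text key maps every entry uniquely to its original text.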
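
The two key shapes could be sketched as follows. The names fullTextKey, lengthPrefixedKey and hash32 are hypothetical, and hash32 is an assumed stand-in for the existing 32-bit hash, whose real body is not quoted in this issue.

```typescript
// Hypothetical stand-in for the existing 32-bit hash (assumption).
function hash32(text: string): number {
  let hash = 0;
  for (let i = 0; i < text.length; i++) {
    hash = ((hash << 5) - hash + text.charCodeAt(i)) | 0;
  }
  return hash;
}

// Option 1: the text itself is the key, so distinct texts never share a key.
function fullTextKey(text: string): string {
  return text;
}

// Option 2: length-prefixed hash, in the len:<length>:<hash> shape.
function lengthPrefixedKey(text: string): string {
  return `len:${text.length}:${hash32(text)}`;
}

// "Aa" and "\u0840" share a hash under this stand-in but differ in length,
// so option 2 gives them different keys.
const separatedByLength = lengthPrefixedKey("Aa") !== lengthPrefixedKey("\u0840");

// "Aa" and "BB" share both hash and length, so option 2 gives them the same key.
const sameKeyUnderOption2 = lengthPrefixedKey("Aa") === lengthPrefixedKey("BB");

// Option 1 keeps them apart.
const distinctUnderOption1 = fullTextKey("Aa") !== fullTextKey("BB");
```

Under these assumptions the length prefix only filters out collisions between texts of different lengths, which is why the full-text key is the safer choice for a cache of 100 entries.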

Update generateCacheKey accordingly and ensure performNLP and any cache lookup/add use the same key shape. If using raw text as key, ensure cache size limits (e.g. LRU) still work with potentially long keys.
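
If the raw text is used as the key, eviction by entry count can stay as it is. Below is a minimal sketch of a bounded LRU cache keyed by the full text. TextKeyedLruCache is a hypothetical class, not the detector's actual cache; it relies on Map preserving insertion order, and the default of 100 entries mirrors the cache size mentioned above.

```typescript
// Hypothetical bounded cache keyed by the full input text (not the
// detector's actual cache implementation).
class TextKeyedLruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;

  constructor(maxEntries = 100) {
    this.maxEntries = maxEntries;
  }

  get(text: string): V | undefined {
    if (!this.entries.has(text)) return undefined;
    const value = this.entries.get(text) as V;
    // Delete and re-insert so this entry becomes the most recently used.
    this.entries.delete(text);
    this.entries.set(text, value);
    return value;
  }

  set(text: string, value: V): void {
    this.entries.delete(text);
    this.entries.set(text, value);
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Usage: with room for two entries, touching "a" makes "b" the eviction candidate.
const lru = new TextKeyedLruCache<number[]>(2);
lru.set("First sentence.", [15]);
lru.set("Second sentence.", [16]);
lru.get("First sentence.");
lru.set("Third sentence.", [15]);
```

The entry limit bounds the number of cached texts, not their total size, so very long inputs still cost memory in proportion to their length. A cap on total key characters would be one way to bound that, if it turns out to matter.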

Verification: generateCacheKey at lines 423-431 returns hash.toString() (a 32-bit value); performNLP at lines 253-255 passes this key to this.cache.get(cacheKey) and uses the cached result.

Metadata

Labels: priority:high (Important next work, impacts stability/performance/usability)