⚡ Bolt: Optimize regex matching in keyword density analysis #229
- Extracted `_TITLE_PATTERNS` and `_COMPANY_PATTERNS` to module-level pre-compiled regex objects in `cli/utils/keyword_density.py` to eliminate recompilation overhead on each method call.
- Optimized keyword counting in `_count_keywords_in_resume` by avoiding `re.IGNORECASE` in `re.findall`. Instead, lowercased the `all_text` blob once and lowercased each keyword string beforehand.

Co-authored-by: anchapin <6326294+anchapin@users.noreply.github.com>
Reviewer's Guide

Optimizes keyword density analysis performance by pre-compiling job title/company regex patterns and switching keyword counting to operate on lowercased text instead of using case-insensitive regex flags in tight loops.

Class diagram for optimized keyword density analysis and regex patterns

classDiagram
class KeywordDensityModule {
+_TITLE_PATTERNS: list
+_COMPANY_PATTERNS: list
}
class KeywordDensityAnalyzer {
-_extract_job_details(job_description: str) Tuple_str_str
-_count_keywords_in_resume(resume_data: ResumeYAML, keywords: List_Tuple_str_int) Dict_str_int
-_get_all_text(resume_data: ResumeYAML) str
}
KeywordDensityModule <.. KeywordDensityAnalyzer : uses
class _TITLE_PATTERNS {
+pattern_1: Pattern %% (?:job title|position|title):\s*([^\n]+) with IGNORECASE MULTILINE
+pattern_2: Pattern %% ^([^\n]+)\s*[-|]\s*[^|]+$ with IGNORECASE MULTILINE
+pattern_3: Pattern %% #\s*([^\n]+) with IGNORECASE MULTILINE
}
class _COMPANY_PATTERNS {
+pattern_1: Pattern %% (?:company|organization):\s*([^\n]+) with IGNORECASE
+pattern_2: Pattern %% (?:at|from)\s+([A-Z][^\n]+?)(?:\s+[-\u2014]|\s+$) with IGNORECASE
}
KeywordDensityModule o-- _TITLE_PATTERNS
KeywordDensityModule o-- _COMPANY_PATTERNS
KeywordDensityAnalyzer --> _TITLE_PATTERNS : search(job_description)
KeywordDensityAnalyzer --> _COMPANY_PATTERNS : search(job_description)
💡 What:
- Extracted `title_patterns` and `company_patterns` to module-level pre-compiled regex objects (`_TITLE_PATTERNS`, `_COMPANY_PATTERNS`) in `cli/utils/keyword_density.py`.
- Optimized the `_count_keywords_in_resume` method to avoid using the `re.IGNORECASE` flag inside a tight loop with `re.findall`. Now, it lowercases the entire resume text once and lowercases each keyword string before matching.

🎯 Why:
- Calling `re.compile` within hot code paths, and particularly `re.IGNORECASE` evaluation inside text-search loops, introduces significant performance overhead, especially as the text length or the number of keywords increases.

📊 Impact:
- Drops `re.IGNORECASE` in favor of pre-lowercasing the target strings.

🔬 Measurement:
- Run `python -m pytest tests/test_keyword_density.py` to ensure all keyword matching, job parsing, and density analysis logic operates exactly as before without regressions.

PR created automatically by Jules for task 15327457375750365719 started by @anchapin
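
The lowercase-once counting described above might look something like the sketch below. It is illustrative only: the function name `count_keywords`, the `\b` word-boundary pattern, and the choice to key the result by the original keyword are all assumptions, since the PR text does not show the actual pattern used in `_count_keywords_in_resume`. The integer in each keyword tuple is carried through unused because its meaning is not stated here.

```python
import re
from typing import Dict, List, Tuple


def count_keywords(all_text: str, keywords: List[Tuple[str, int]]) -> Dict[str, int]:
    """Count case-insensitive keyword occurrences without re.IGNORECASE."""
    text_lower = all_text.lower()  # lowercase the whole text blob once
    counts: Dict[str, int] = {}
    for keyword, _value in keywords:  # second tuple element is not used here
        keyword_lower = keyword.lower()
        # Assumed pattern: whole-word match on the escaped, lowercased keyword.
        pattern = r"\b" + re.escape(keyword_lower) + r"\b"
        counts[keyword] = len(re.findall(pattern, text_lower))
    return counts
```

For plain ASCII input this should give the same counts as matching the original text with `re.IGNORECASE`. The two approaches can differ for a few non-ASCII characters whose case mappings are not one-to-one, so the existing tests are a sensible guard.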
Summary by Sourcery
Optimize keyword density analysis by precompiling regex patterns and reducing per-call regex overhead in resume and job parsing.
Enhancements: