Skip to content

reset error patterns at the start of each LogAnalyzer.analyze call#14

Open
HrachShah wants to merge 2 commits into
mainfrom
fix/analyzer-reset-error-patterns-per-call
Open

reset error patterns at the start of each LogAnalyzer.analyze call#14
HrachShah wants to merge 2 commits into
mainfrom
fix/analyzer-reset-error-patterns-per-call

Conversation

@HrachShah

@HrachShah HrachShah commented Jun 20, 2026

Copy link
Copy Markdown
Owner

LogAnalyzer accumulates error groups in self._error_patterns across calls. Re-using the same analyzer instance for two batches therefore leaked the first batch's groups into the second batch's result — calling analyzer.analyze(batch2) reported error_groups containing entries from batch1 even though batch1 was never passed in.

analyze_log_entries masked the issue by constructing a fresh LogAnalyzer per call, but the class is part of the public API and reset() is the only documented way to clear state — easy to forget.

Clear self._error_patterns = {} at the top of analyze() so each call starts from a clean slate. Added a regression test that runs two distinct error batches through the same analyzer instance and asserts the second result contains only the second batch's groups.

All 12 analyzer tests pass (11 existing + 1 new). The 12 pre-existing failures in test_cli.py / test_parsers.py are unrelated Click 9.x exit-code mismatches in the upstream suite.

Summary by Sourcery

Ensure LogAnalyzer analysis runs are stateless across calls and refine syslog log-level detection.

Bug Fixes:

  • Reset accumulated error patterns at the start of each LogAnalyzer.analyze call so error groups from previous batches are not leaked into subsequent results.
  • Adjust syslog parser log-level detection to use the message portion of the line, avoiding misclassification based on non-message content.

Tests:

  • Add a regression test verifying that running analyze twice on different batches with the same LogAnalyzer instance produces independent error groups.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed accumulation of error groups when the same analyzer instance processes multiple separate batches of logs
    • Improved log-level detection accuracy by focusing analysis on message fields rather than full log lines
  • Tests

    • Added test case to verify analyzer instances properly reset between multiple analyses

Zo Bot added 2 commits May 20, 2026 13:41
…g detect_log_level on the full line causes false positives when host or process names contain words like 'error' or 'warn'; now it runs on the message portion only
@sourcery-ai

sourcery-ai Bot commented Jun 20, 2026

Copy link
Copy Markdown

Reviewer's Guide

LogAnalyzer.analyze is updated to clear internal error pattern state at the start of each call to avoid leaking error groups across batches, a regression test is added to cover repeated analyze calls on the same instance, and syslog parsing is slightly adjusted to detect log levels from the message portion instead of the entire line.

Sequence diagram for repeated LogAnalyzer.analyze calls with state reset

sequenceDiagram
    actor Client
    participant LogAnalyzer
    participant ErrorPatterns as _error_patterns
    participant AnalysisResult

    Client->>LogAnalyzer: analyze(entries_batch1)
    LogAnalyzer->>ErrorPatterns: [accumulate patterns for batch1]
    LogAnalyzer->>AnalysisResult: AnalysisResult()
    LogAnalyzer-->>Client: AnalysisResult for batch1

    Client->>LogAnalyzer: analyze(entries_batch2)
    LogAnalyzer->>ErrorPatterns: [reset to empty]
    LogAnalyzer->>ErrorPatterns: [accumulate patterns for batch2]
    LogAnalyzer->>AnalysisResult: AnalysisResult()
    LogAnalyzer-->>Client: AnalysisResult for batch2 (only batch2 groups)
Loading

File-Level Changes

Change Details Files
Ensure LogAnalyzer.analyze starts from a clean error pattern state on each call and add a regression test for repeated analysis on the same instance.
  • Reset internal error pattern storage at the top of the analyze method to avoid leaking state between calls.
  • Document behavior in-line so callers understand they should not rely on internal aggregation across batches.
  • Add a test that runs two different error batches through the same analyzer and asserts that error_groups for the second batch only reflect the second batch.
src/log_analyzer_cli/analyzer.py
tests/test_analyzer.py
Adjust syslog parser log level detection to operate on the parsed message instead of the full raw line.
  • Remove earlier detection of log level from the entire line.
  • Change log level detection to use the extracted message field from the regex groups, likely improving level classification accuracy for syslog entries.
src/log_analyzer_cli/parsers/syslog.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Two bug fixes: SyslogParser.parse now computes log level from the extracted message field only (not the full raw line), and LogAnalyzer.analyze() clears _error_patterns at the start of each call. A regression test verifies the analyzer produces isolated error groups across repeated invocations on the same instance.

Changes

Syslog Level Detection and Analyzer State Leak Fixes

Layer / File(s) Summary
Syslog log-level detection scoped to message field
src/log_analyzer_cli/parsers/syslog.py
Removes the early detect_log_level(line) call on the full raw line; after regex groups are extracted, level is assigned via detect_log_level(groups.get("message", "")) using only the parsed message text.
Analyzer _error_patterns reset and regression test
src/log_analyzer_cli/analyzer.py, tests/test_analyzer.py
LogAnalyzer.analyze() resets _error_patterns to {} at entry; the new test test_repeated_analyze_does_not_leak_error_groups calls analyze() twice with distinct error batches and asserts each result holds exactly one isolated error group.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐇 Hop hop, the rabbit tidies up his trail,
Old patterns cleared so stale groups never prevail,
The syslog reads the message, not the whole long line,
Each analyze() starts fresh — how terribly fine!
No leaky buckets, no leftover crumbs,
Just clean little fixes from clever little thumbs. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly describes the main change: clearing error patterns at the start of LogAnalyzer.analyze() calls to prevent state leakage across multiple invocations.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/analyzer-reset-error-patterns-per-call

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • Resetting self._error_patterns inside analyze() changes the class’s stateful semantics; if callers relied on cumulative behavior, consider instead keeping _error_patterns as a local per-call structure (or a parameter) and using reset() explicitly where needed to avoid surprising existing users.
  • The new test reaches into analyzer._error_patterns and depends on private state; consider asserting behavior only through the public analyze/result API to keep tests resilient to internal refactors.
  • The change in syslog.parse() from detect_log_level(line) to detect_log_level(groups.get("message", "")) may alter level detection for messages where the level appears outside the captured message; it might be worth capturing or documenting the intended difference in behavior to ensure it aligns with real syslog formats.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Resetting `self._error_patterns` inside `analyze()` changes the class’s stateful semantics; if callers relied on cumulative behavior, consider instead keeping `_error_patterns` as a local per-call structure (or a parameter) and using `reset()` explicitly where needed to avoid surprising existing users.
- The new test reaches into `analyzer._error_patterns` and depends on private state; consider asserting behavior only through the public `analyze`/result API to keep tests resilient to internal refactors.
- The change in `syslog.parse()` from `detect_log_level(line)` to `detect_log_level(groups.get("message", ""))` may alter level detection for messages where the level appears outside the captured `message`; it might be worth capturing or documenting the intended difference in behavior to ensure it aligns with real syslog formats.

## Individual Comments

### Comment 1
<location path="src/log_analyzer_cli/parsers/syslog.py" line_range="100" />
<code_context>
                 if groups.get("pid"):
                     metadata["pid"] = groups["pid"]

+                level = detect_log_level(groups.get("message", ""))
+                
                 return ParsedEntry(
</code_context>
<issue_to_address>
**issue (bug_risk):** Restricting log-level detection to only the message body may miss levels present in the prefix or structured fields.

Previously `detect_log_level` received the full raw line, so it could pick up level markers in prefixes (e.g., `ERROR`, `WARN`, numeric priorities) outside the `message` field. With the new call using only `groups["message"]`, those indicators will be ignored in formats where the level isn't inside `message`, which may regress detection for some syslog variants. Consider calling `detect_log_level` on `message` first and falling back to the full `line` when `message` is empty or the result is unknown, to retain robustness while favoring the new behavior.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

if groups.get("pid"):
metadata["pid"] = groups["pid"]

level = detect_log_level(groups.get("message", ""))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Restricting log-level detection to only the message body may miss levels present in the prefix or structured fields.

Previously detect_log_level received the full raw line, so it could pick up level markers in prefixes (e.g., ERROR, WARN, numeric priorities) outside the message field. With the new call using only groups["message"], those indicators will be ignored in formats where the level isn't inside message, which may regress detection for some syslog variants. Consider calling detect_log_level on message first and falling back to the full line when message is empty or the result is unknown, to retain robustness while favoring the new behavior.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_analyzer.py (1)

152-176: 💤 Low value

Regression test correctly validates state isolation.

The test effectively verifies that error groups don't leak across analyze() calls on the same instance. The assertions properly check pattern difference and count isolation.

Optional: Add symmetric assertion for completeness

For extra thoroughness, you could also assert the first result's count:

 first_result = analyzer.analyze(first_batch)
 second_result = analyzer.analyze(second_batch)
 
 assert len(first_result.error_groups) == 1
+assert first_result.error_groups[0].count == 1
 assert len(second_result.error_groups) == 1
 assert second_result.error_groups[0].pattern != first_result.error_groups[0].pattern
 assert second_result.error_groups[0].count == 1

This makes the test more symmetric and explicit about both results being isolated, though the existing assertions already validate the key invariant.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_analyzer.py` around lines 152 - 176, To make the
test_repeated_analyze_does_not_leak_error_groups test more symmetric and
explicit about state isolation, add an assertion to verify that the first_result
also has an error count of 1, similar to the existing assertion for
second_result. After the line asserting second_result.error_groups[0].count ==
1, add a new assertion that checks first_result.error_groups[0].count == 1. This
ensures both analyze() calls are validated for proper isolation and prevents any
regression where the first batch's error group count could be incorrectly
modified.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/log_analyzer_cli/parsers/syslog.py`:
- Line 100: The CLI filters log entries using detect_log_level on the raw line
before parsing, while the syslog parser detects level using detect_log_level
only on the message field, creating a mismatch. The CLI filtering logic (in
cli.py around line 209) should be refactored to filter based on the parsed
entry's level field instead of calling detect_log_level on the raw line. This
ensures consistency between what the CLI-level filter accepts and what level is
actually stored in the parsed entry by the syslog parser's level detection at
line 100.

---

Nitpick comments:
In `@tests/test_analyzer.py`:
- Around line 152-176: To make the
test_repeated_analyze_does_not_leak_error_groups test more symmetric and
explicit about state isolation, add an assertion to verify that the first_result
also has an error count of 1, similar to the existing assertion for
second_result. After the line asserting second_result.error_groups[0].count ==
1, add a new assertion that checks first_result.error_groups[0].count == 1. This
ensures both analyze() calls are validated for proper isolation and prevents any
regression where the first batch's error group count could be incorrectly
modified.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c7ef08a-7aaa-42cc-b926-52880ff8d1ff

📥 Commits

Reviewing files that changed from the base of the PR and between e93757f and f817ad4.

📒 Files selected for processing (3)
  • src/log_analyzer_cli/analyzer.py
  • src/log_analyzer_cli/parsers/syslog.py
  • tests/test_analyzer.py

if groups.get("pid"):
metadata["pid"] = groups["pid"]

level = detect_log_level(groups.get("message", ""))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify CLI level filtering behavior with the parser change

# Check how _parse_file uses level filtering before parsing
ast-grep --pattern $'def _parse_file($$$):
  $$$
  if include_levels:
    $$$
  $$$
  parsed = parser.parse($$$)
  $$$'

# Check if there are any tests that validate level filtering consistency
rg -n -A5 -B5 'include_levels' --glob '*test*.py'

Repository: HrachShah/log-analyzer-cli

Length of output: 52


🏁 Script executed:

# Check CLI code structure around level filtering
fd 'cli.py' -x cat -n {} | head -250

# Also check syslog parser around line 100
fd 'syslog.py' -x cat -n {} | grep -A 10 -B 10 'level = detect_log_level'

Repository: HrachShah/log-analyzer-cli

Length of output: 10564


🏁 Script executed:

# Find the _parse_file function in cli.py
rg -n '_parse_file' --glob '*.py' -A 30

# Also check if detect_log_level is being used elsewhere in cli.py
rg -n 'detect_log_level' --glob '*.py'

Repository: HrachShah/log-analyzer-cli

Length of output: 4722


🏁 Script executed:

# Get the complete _parse_file function to understand the flow
sed -n '186,250p' src/log_analyzer_cli/cli.py

Repository: HrachShah/log-analyzer-cli

Length of output: 1341


🏁 Script executed:

# Check for tests that validate the level filtering behavior
rg -n 'error-db|hostname.*error|include_levels' tests/ --glob '*.py' -A 3 -B 3

# Also check what detect_log_level does
sed -n '121,150p' src/log_analyzer_cli/utils.py

Repository: HrachShah/log-analyzer-cli

Length of output: 863


🏁 Script executed:

# Check if there are any tests showing the hostname/error issue being fixed
rg -n 'error-db|hostname' tests/ --glob '*.py' -A 5 -B 5

# Check if UNKNOWN levels are converted to INFO or other defaults anywhere
rg -n 'UNKNOWN' src/ --glob '*.py' -B 2 -A 2

# List test files to understand test coverage
find tests/ -name '*.py' -type f | head -20

Repository: HrachShah/log-analyzer-cli

Length of output: 3642


🏁 Script executed:

# Check for tests that specifically validate the level filtering consistency
rg -n 'def test.*level' tests/ --glob '*.py' -A 10

# Check the syslog test to see if it validates the message-only level detection
sed -n '34,60p' tests/test_parsers.py

Repository: HrachShah/log-analyzer-cli

Length of output: 3822


The parser and CLI-level filtering have mismatched level detection logic that breaks filtering consistency.

The CLI filters entries using detect_log_level(line) on the full line before parsing (cli.py:209), but the syslog parser detects level using detect_log_level(groups.get("message", "")) on only the message field (syslog.py:100). This creates a critical mismatch:

A line like "error-db-01: connection established" will:

  • Pass the CLI's --level ERROR filter (because "error" in "error-db-01" matches \berror\b)
  • But be stored with level="UNKNOWN" after parsing (because the message contains no level keywords)

This contradicts the PR's goal to prevent false positives from hostnames/process names and breaks the semantic contract of the level filter—users expect filtered results to have the filtered level. The CLI should filter using the parsed entry's level, not the raw line.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/log_analyzer_cli/parsers/syslog.py` at line 100, The CLI filters log
entries using detect_log_level on the raw line before parsing, while the syslog
parser detects level using detect_log_level only on the message field, creating
a mismatch. The CLI filtering logic (in cli.py around line 209) should be
refactored to filter based on the parsed entry's level field instead of calling
detect_log_level on the raw line. This ensures consistency between what the
CLI-level filter accepts and what level is actually stored in the parsed entry
by the syslog parser's level detection at line 100.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant