Skip to content

filter --levels on parsed entry level, not raw line#13

Open
HrachShah wants to merge 1 commit into
mainfrom
fix/cli-filter-on-parsed-level
Open

filter --levels on parsed entry level, not raw line#13
HrachShah wants to merge 1 commit into
mainfrom
fix/cli-filter-on-parsed-level

Conversation

@HrachShah

@HrachShah HrachShah commented Jun 20, 2026

Copy link
Copy Markdown
Owner

Two changes that fix a class of false-positive level classification when a log line's hostname, process name, or message body contains words like 'error' or 'warn' that the heuristic matches with \b boundaries:

  1. parsers/generic.py — detect_log_level now runs on the extracted message (text after the timestamp) instead of the full line. Mirrors the change that was already made to parsers/syslog.py.

  2. cli.py — _parse_file no longer pre-scans the raw line with detect_log_level for the --levels filter; it parses first and filters on parsed.level. The parser and the filter now agree on every line, and we skip a redundant per-line scan that the parser was about to redo anyway.

Repro: a JSON line like {"level": "WARNING", "message": "error in the system"} with --levels=WARNING used to print 'No log entries found matching criteria'. It now keeps the line.

Test added: test_analyze_level_filter_uses_parsed_level_not_raw_line. It fails on the old code and passes with this change.

Summary by Sourcery

Align log level detection and filtering with the parsed message content to avoid misclassification based on raw line text.

Bug Fixes:

  • Fix log level misclassification when words like 'error' appear in host/process/message text by detecting levels from the parsed message instead of the full line.
  • Ensure --levels filtering uses the parser’s computed level so entries with matching structured levels are not incorrectly filtered out.

Tests:

  • Add a CLI test verifying that --levels filtering respects the parsed level and does not treat message text mentions of other levels as actual log levels.

Summary by CodeRabbit

Bug Fixes

  • Fixed log-level filtering to correctly analyze structured log level information instead of scanning raw content, preventing false matches when level keywords appear in timestamps, hostnames, or other metadata.

Tests

  • Added test coverage for log-level filtering behavior with structured fields.

The CLI's _parse_file scanned the raw line with detect_log_level to
decide whether to keep a row before calling parser.parse(). For
generic-format logs that have no structured level field, the generic
parser also calls detect_log_level on the full line, so the same
false positives affected both the filter check and the output:
the parser misclassified WARNING lines that mention 'error' in the
hostname or message as ERROR. Worse, the cli filter ran first
and rejected lines that the parser would later have accepted —
the user got 'No log entries found matching criteria' even when the
file contained real WARNING-level rows.

Two changes:

1. generic.py: run detect_log_level on the message portion (the text
   after the timestamp) rather than the whole line. The previous
   syslog-parser commit fixed the same issue there.

2. cli.py: filter on parsed.level after parsing instead of running
   detect_log_level separately on the raw line. Now the parser and
   the filter agree on every line — the filter only sees levels the
   report would have printed. The pre-parse scan also did extra
   string work per line that the parser was about to redo.

Add a regression test using a JSON line whose message contains
'error' but whose level field is WARNING; the old code rejected it
under --levels=WARNING, the new code keeps it.
@sourcery-ai

sourcery-ai Bot commented Jun 20, 2026

Copy link
Copy Markdown

Reviewer's Guide

Adjusts log level detection and CLI filtering to operate on the parsed message/entry level instead of the raw line, fixing false-positive level matches and adding a regression test for the new behavior.

Sequence diagram for CLI level filtering using parsed entry level

sequenceDiagram
    participant CLI as _parse_file
    participant LogFile as read_log_file
    participant Parser as parser.parse
    participant UtilsTS as parse_timestamp
    participant UtilsLevel as detect_log_level

    loop for each line
        CLI->>LogFile: read_log_file(file_path)
        LogFile-->>CLI: line
        CLI->>UtilsTS: parse_timestamp(line)
        UtilsTS-->>CLI: timestamp
        CLI->>Parser: parse(line)
        Parser->>Parser: [extract timestamp match]
        Parser->>Parser: message = line[message_start:]
        Parser->>UtilsLevel: detect_log_level(message)
        UtilsLevel-->>Parser: level
        Parser-->>CLI: ParsedEntry(level, message)
        CLI->>CLI: [parsed exists]
        CLI->>CLI: [parsed.level in include_levels]
        CLI->>CLI: entries.append(parsed)
    end
Loading

File-Level Changes

Change Details Files
Filter log entries by the parser-resolved level field instead of rescanning the raw input line.
  • Removed pre-parse detect_log_level call in the CLI file parsing loop and associated import.
  • Ensured each line is parsed first and skipped if parsing fails before any level-based filtering is applied.
  • Applied level-filtering using parsed.level against include_levels, keeping only entries whose parsed level matches the requested levels.
src/log_analyzer_cli/cli.py
Detect log level from the extracted message portion in the generic parser to avoid misclassification from timestamp or metadata text.
  • Changed generic parser to compute message substring immediately after the timestamp match and strip it.
  • Replaced level detection on the full raw line with detect_log_level(message) so only the message body influences level classification.
  • Preserved the existing ParsedEntry structure while sourcing level from the message-only detection.
src/log_analyzer_cli/parsers/generic.py
Add a regression test ensuring --levels filtering uses the parsed level rather than raw-line heuristics.
  • Introduced a test that writes a JSON log line with level WARNING but message text containing the word 'error'.
  • Invoked the CLI with --levels=WARNING -v and asserted that the WARNING entry is kept and no ERROR level row is reported in the output.
  • Documented the scenario in comments to explain the previously incorrect behavior and the expected corrected behavior.
tests/test_cli.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

GenericParser.parse() now calls detect_log_level on the extracted message substring instead of the full raw line. _parse_file in the CLI is reordered to parse each line first, skip unparsable lines, and then filter by parsed.level against include_levels. A regression test covering the fix is added.

Changes

Level Filtering Fix

Layer / File(s) Summary
Parser and CLI level-filtering logic
src/log_analyzer_cli/parsers/generic.py, src/log_analyzer_cli/cli.py
GenericParser.parse() passes the extracted message substring (not the full raw line) to detect_log_level, preventing misclassification from timestamp/hostname tokens. _parse_file is reworked to call parser.parse first, skip failures, then compare parsed.level against include_levels; imports updated accordingly.
Regression test
tests/test_cli.py
Adds test_analyze_level_filter_uses_parsed_level_not_raw_line: a temp JSON log line has "level": "WARNING" but the message text contains "error"; the test runs main analyze -l WARNING -v and asserts no ERROR rows appear in the output.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 A bunny once parsed a suspicious line,
Where "error" hid in a message's shine.
But the level field said "WARNING" — quite clear!
So we read the real level and kept the right tier.
No false alarms now, just parsed truth so fine! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: shifting log level filtering from raw line inspection to parsed entry level, which directly addresses the core issue described in the PR objectives.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cli-filter-on-parsed-level

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • In test_analyze_level_filter_uses_parsed_level_not_raw_line the inline comment starting with # The pre- is truncated; either complete or remove it to avoid confusing future readers.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `test_analyze_level_filter_uses_parsed_level_not_raw_line` the inline comment starting with `# The pre-` is truncated; either complete or remove it to avoid confusing future readers.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/log_analyzer_cli/cli.py`:
- Around line 211-215: The timestamp filtering at lines 211-215 uses
parse_timestamp(line) directly for the start_time and end_time checks, which can
disagree with the parser-specific timestamp extraction from parser.parse(line),
allowing entries with supported timestamp formats to bypass the time-window
filters. After calling parser.parse(line) to get the parsed object, use
parsed.timestamp instead of parse_timestamp(line) for the timestamp comparisons
against start_time and end_time throughout the filtering block (lines 211-215
and also lines 217-231). This ensures both the filtering and subsequent
reporting use the same timestamp extraction logic from the parser.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 36ac41f6-19be-45cd-8a58-9c9aa690389c

📥 Commits

Reviewing files that changed from the base of the PR and between e93757f and c79b31b.

📒 Files selected for processing (3)
  • src/log_analyzer_cli/cli.py
  • src/log_analyzer_cli/parsers/generic.py
  • tests/test_cli.py

Comment on lines 211 to 215
timestamp = parse_timestamp(line)
if start_time and timestamp and timestamp < start_time:
continue
if end_time and timestamp and timestamp > end_time:
continue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use parsed.timestamp for time-window filtering to avoid parser/filter drift.

At Line 211-Line 215, timestamp gating uses parse_timestamp(line), which can disagree with parser-specific timestamp extraction (e.g., generic parser supports additional formats). That lets entries bypass --start-time/--end-time even though they parse successfully. Filter on parsed.timestamp after parser.parse(line) so filtering and reporting use the same parsed contract.

Proposed fix
-    from log_analyzer_cli.utils import parse_timestamp
+    from log_analyzer_cli.utils import parse_timestamp
     import re

     compiled_pattern = re.compile(search_pattern) if search_pattern else None

     for line in read_log_file(file_path):
@@
-        timestamp = parse_timestamp(line)
-        if start_time and timestamp and timestamp < start_time:
-            continue
-        if end_time and timestamp and timestamp > end_time:
-            continue
-
         parsed = parser.parse(line)
         if not parsed:
             continue
+
+        timestamp = parsed.timestamp
+        if start_time and timestamp and timestamp < start_time:
+            continue
+        if end_time and timestamp and timestamp > end_time:
+            continue

Also applies to: 217-231

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/log_analyzer_cli/cli.py` around lines 211 - 215, The timestamp filtering
at lines 211-215 uses parse_timestamp(line) directly for the start_time and
end_time checks, which can disagree with the parser-specific timestamp
extraction from parser.parse(line), allowing entries with supported timestamp
formats to bypass the time-window filters. After calling parser.parse(line) to
get the parsed object, use parsed.timestamp instead of parse_timestamp(line) for
the timestamp comparisons against start_time and end_time throughout the
filtering block (lines 211-215 and also lines 217-231). This ensures both the
filtering and subsequent reporting use the same timestamp extraction logic from
the parser.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant