
report parse_errors and skipped_filtered lines in the analyze summary #7

Open
HrachShah wants to merge 11 commits into main from patch-2

Conversation

@HrachShah

@HrachShah HrachShah commented Jun 10, 2026

Owner

Reports parse_errors and skipped_filtered lines in the analyze command's output. _parse_file used to return only the entries list, so a 5-line JSON log with 3 valid lines and 2 garbage lines came back as 'Total Lines: 3, Parsed Entries: 3'. A user had no way to see that the parser was silently dropping 40% of their input, or that the level/pattern/time filters had dropped lines before the parser ever saw them. _parse_file now returns an (entries, stats) tuple, where stats carries total_lines, parse_errors, and skipped_filtered. analyze() writes them into AnalysisResult so both the text and JSON formatters surface them, and adds a 'Skipped N line(s)' warning when filters drop input. 50/50 tests pass.
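
For readers who want to see the accounting concretely, here is a minimal sketch of the idea, not the PR's actual code. The `parse_lines` helper, the `keep` predicate standing in for the level/pattern/time filters, and the JSON-only `parse_json_line` parser are all assumptions made for illustration; only the three stat names come from the description. Following the commit notes, `total_lines` counts non-empty lines that survive the filters.

```python
import json
from typing import Callable, Iterable, Optional


def parse_json_line(line: str) -> Optional[dict]:
    """Return the decoded object, or None when the line is not a JSON object."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def parse_lines(lines: Iterable[str], keep: Callable[[str], bool] = lambda _line: True):
    """Hypothetical stand-in for _parse_file: returns (entries, stats)."""
    entries = []
    stats = {"total_lines": 0, "parse_errors": 0, "skipped_filtered": 0}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are not counted anywhere
        if not keep(line):
            stats["skipped_filtered"] += 1  # dropped by a filter before parsing
            continue
        stats["total_lines"] += 1  # survived the filters
        entry = parse_json_line(line)
        if entry is None:
            stats["parse_errors"] += 1  # survived the filters but failed to parse
        else:
            entries.append(entry)
    return entries, stats


log = ['{"level": "INFO"}', "garbage", '{"level": "ERROR"}', "more garbage", '{"level": "INFO"}']
entries, stats = parse_lines(log)
print(len(entries), stats)
```

With the five-line sample, the three valid lines become entries and the two garbage lines are counted under `parse_errors` rather than disappearing. Passing a `keep` predicate moves rejected lines into `skipped_filtered` instead.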

Summary by Sourcery

Track and surface parsing and filtering statistics in the analyze command while tightening log parsing and documentation.

New Features:

  • Expose total_lines, parse_errors, and skipped_filtered statistics from log parsing in the analyze command output, including a warning when filters skip lines.

Bug Fixes:

  • Correct Apache access log parsing by fixing the user field pattern and ensuring timestamps are parsed as UTC datetimes.
  • Ensure syslog and Apache timestamps are consistently timezone-aware to match CLI time filtering.
  • Improve error pattern normalization so file paths and embedded ports are handled without corrupting existing : markers.

Enhancements:

  • Refine the CLI analyze workflow to propagate parsing statistics into AnalysisResult for both text and JSON formatters.
  • Clarify and restructure the README with proper headings, usage examples, and project layout.
  • Add an Apache log example file to illustrate supported formats.

Tests:

  • Add CLI tests to validate total_lines, parsed_entries, parse_errors accounting, filtered-line warnings, and behavior on empty input files.

Zo Agent and others added 11 commits April 21, 2026 01:27
The _parse_timestamp method only applied the current-year correction
to the "%b %d %H:%M:%S" (RFC 3164) format, but the same issue
affects "%Y-%m-%d %H:%M:%S" which also omits the year.
Now all formats that lack an explicit year default to the current year.
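
As an illustration of the current-year correction for the RFC 3164 format, here is a small sketch. It is not the PR's `_parse_timestamp`; the `parse_rfc3164` name and its `year` parameter are hypothetical, and defaulting to the current year is the assumption the commit message describes.

```python
from datetime import datetime
from typing import Optional


def parse_rfc3164(ts: str, year: Optional[int] = None) -> datetime:
    """Parse a timestamp like 'Oct 11 22:14:15', supplying the year the format omits."""
    if year is None:
        year = datetime.now().year  # assumption: the entry was logged this year
    # Prepending the year avoids strptime's 1900 default and lets
    # 'Feb 29' validate against the intended year.
    return datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")


print(parse_rfc3164("Oct 11 22:14:15", year=2026))
```

Parsing the year as part of the string, rather than calling `replace(year=...)` afterwards, is a design choice in this sketch so that leap days are checked against the right year.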
The COMBINED_PATTERN used (?P<user>\s+) which only matches whitespace,
failing to capture the actual user value like '-' or 'frank'.
Changed to (?P<user>\S+) to correctly capture the user identifier.

Also removed unnecessary .*$ at the end of the combined pattern.
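
The effect of the `\s+` versus `\S+` change can be checked in isolation. The two patterns below are simplified stand-ins that cover only the host, ident, and user prefix of a log line; the full COMBINED_PATTERN is not shown on this page, so treat them as assumptions.

```python
import re

# Simplified prefix patterns (hypothetical); only the user group differs.
BROKEN = re.compile(r"(?P<host>\S+) (?P<ident>\S+) (?P<user>\s+) \[")
FIXED = re.compile(r"(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[")

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'

broken_match = BROKEN.match(line)  # \s+ cannot match 'frank', so the line is rejected
fixed_match = FIXED.match(line)    # \S+ captures the user token

print(broken_match)
print(fixed_match.group("user"))
```

The broken pattern fails on the whole line, which is why entries with a real user value were not parsed before the fix.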
_parse_file used to return only the entries list, so the analyze command
showed 'Total Lines: 3' for a 5-line log that mixed 3 valid JSON lines
with 2 garbage lines, and it offered no way to tell that the level or
pattern filter had silently dropped half the input.

Returns a (entries, stats) tuple from _parse_file. stats carries
total_lines (lines that survived the include-level/pattern/time filters),
parse_errors (lines the parser could not turn into a ParsedEntry), and
skipped_filtered (lines dropped by the filters). analyze() now writes
total_lines, parse_errors, and a 'Skipped N line(s)' warning into the
AnalysisResult so both the text and JSON formatters surface the data-
quality issue that used to be invisible. Three new tests cover a
mixed-quality file, a level-filtered file, and an empty file.

All 50 tests pass.
@sourcery-ai

sourcery-ai Bot commented Jun 10, 2026


Reviewer's Guide

Adds per-line accounting for total, parse-error, and filtered log lines to the analyze command, wires those stats through to the analysis result and JSON output, adjusts timestamp handling and parsing in CLI and parsers, refines error-pattern normalization to better treat paths and ports, and updates documentation and tests accordingly.

Sequence diagram for analyze command with parse stats propagation

sequenceDiagram
    actor User
    participant CLI as cli_analyze
    participant Parser as LogParser
    participant Analyzer as analyze_log_entries
    participant Formatter as format_output

    User->>CLI: analyze(file, levels, pattern, start_time, end_time)
    CLI->>CLI: _parse_file(parser, file, level_filter, pattern, start_dt, end_dt)
    CLI->>Parser: parse(line) [inside _parse_file loop]
    Parser-->>CLI: ParsedEntry or None
    CLI-->>CLI: return entries, stats
    CLI->>Analyzer: analyze_log_entries(entries, group_errors)
    Analyzer-->>CLI: AnalysisResult result
    CLI->>CLI: set result.total_lines = stats[total_lines]
    CLI->>CLI: set result.parse_errors = stats[parse_errors]
    CLI->>CLI: append warning to result.warnings when stats[skipped_filtered] > 0
    CLI->>Formatter: format_output(result, output, verbose)
    Formatter-->>CLI: output_str
    CLI-->>User: print output_str
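
The last steps of the diagram (copying stats onto the result and appending a warning) could look roughly like the sketch below. The trimmed-down `AnalysisResult` and the `apply_parse_stats` helper are hypothetical; the real class has more fields, and the warning text beyond 'Skipped N line(s)' is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnalysisResult:
    """Hypothetical, reduced result object used only for this illustration."""
    parsed_entries: int = 0
    total_lines: int = 0
    parse_errors: int = 0
    warnings: List[str] = field(default_factory=list)


def apply_parse_stats(result: AnalysisResult, stats: Dict[str, int]) -> AnalysisResult:
    """Copy parse stats onto the result and warn when filters dropped lines."""
    result.total_lines = stats["total_lines"]
    result.parse_errors = stats["parse_errors"]
    skipped = stats["skipped_filtered"]
    if skipped > 0:
        # Wording after 'Skipped N line(s)' is assumed, not taken from the PR.
        result.warnings.append(f"Skipped {skipped} line(s) excluded by filters")
    return result


result = apply_parse_stats(
    AnalysisResult(parsed_entries=1),
    {"total_lines": 1, "parse_errors": 0, "skipped_filtered": 4},
)
print(result.warnings)
```

Because the stats live on the result object, any formatter that serializes the result can surface them without needing to know about the parsing step.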

File-Level Changes

Change: Track total_lines, parse_errors, and skipped_filtered in file parsing and surface them in analyze results and CLI output.
Details:
  • Change _parse_file to return a (entries, stats) tuple that counts non-empty lines after filters, parse failures, and lines dropped by filters.
  • Increment skipped_filtered for lines excluded by level, pattern, or time filters, and track parse_errors when parser.parse fails.
  • In analyze(), consume the new stats tuple, set total_lines and parse_errors on the AnalysisResult, and append a warning when any lines were skipped by filters.
Files:
  • src/log_analyzer_cli/cli.py

Change: Make timestamp handling consistent and timezone-aware for CLI time filters and Apache timestamps, and slightly simplify syslog timestamp parsing.
Details:
  • In CLI, parse start-time and end-time as naive datetimes and then set tzinfo=UTC before passing to analysis.
  • In ApacheParser, fix the COMBINED log regex to capture the user field correctly and set parsed timestamps to UTC-aware datetimes.
  • In SyslogParser, refactor _parse_timestamp to avoid double-parsing and ensure year injection only for the month-day-time format.
Files:
  • src/log_analyzer_cli/cli.py
  • src/log_analyzer_cli/parsers/apache.py
  • src/log_analyzer_cli/parsers/syslog.py

Change: Refine normalize_error_pattern so that file paths are normalized before port replacement and ports embedded in paths are preserved correctly.
Details:
  • Change path normalization to match only strings starting with '/' using a narrower regex.
  • Introduce a temporary '<PROTECTED_PORT>' marker to avoid double-processing ports that were already normalized as part of a path, then restore them after generic port replacement.
  • Remove the previous broad '/[^\s]+' path replacement in favor of the new ordering and matching.
Files:
  • src/log_analyzer_cli/utils.py

Change: Add regression tests around parse error accounting, filtered line warnings, and empty-file behavior in the CLI JSON output.
Details:
  • Introduce TestParseErrors in test_cli.py with helper to create temporary log files.
  • Add test to assert total_lines, parsed_entries, and parse_errors for mixed-valid/invalid JSON logs in JSON output.
  • Add test to verify that filtered lines trigger a "Skipped N line(s)" warning and that empty files produce the "No log entries" message without crashing.
Files:
  • tests/test_cli.py

Change: Clean up and expand README to present proper project documentation structure without placeholder markers.
Details:
  • Replace placeholder 'README: updated' markers with proper Markdown headings (e.g., '# Log Analyzer CLI', '## Features', '### Commands').
  • Fix example command descriptions and comments (e.g., label JSON output, warnings-only, pattern matching, time-based filtering).
  • Tidy the project structure tree to remove inline comments and show accurate file layout.
Files:
  • README.md


@coderabbitai

coderabbitai Bot commented Jun 10, 2026


Warning

Review limit reached

@HrachShah, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 56 minutes and 51 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 30b01124-ff7c-4945-a0f8-c78954ae37b1

📥 Commits

Reviewing files that changed from the base of the PR and between e93757f and 2c5a9ed.

⛔ Files ignored due to path filters (1)
  • examples/apache-sample.log is excluded by !**/*.log
📒 Files selected for processing (7)
  • .gitignore
  • README.md
  • src/log_analyzer_cli/cli.py
  • src/log_analyzer_cli/parsers/apache.py
  • src/log_analyzer_cli/parsers/syslog.py
  • src/log_analyzer_cli/utils.py
  • tests/test_cli.py

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • The change to make start_dt/end_dt timezone-aware (timezone.utc) may conflict with whatever parse_timestamp returns (often naive datetimes), which can lead to TypeError: can't compare offset-naive and offset-aware datetimes; consider normalizing all parsed timestamps to a consistent timezone or keeping the CLI filters naive for now.
  • The new stats dictionary returned from _parse_file (total_lines, parse_errors, skipped_filtered) is accessed via string keys throughout; introducing a small dataclass or TypedDict for these stats would make the interface clearer and reduce the chance of typos or key mismatches.
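
One possible shape for the suggested stats container, as a sketch only: the `ParseStats` name is hypothetical, and the three fields mirror the keys named in the comment above.

```python
from dataclasses import asdict, dataclass


@dataclass
class ParseStats:
    """Hypothetical replacement for the string-keyed stats dictionary."""
    total_lines: int = 0
    parse_errors: int = 0
    skipped_filtered: int = 0


stats = ParseStats()
stats.total_lines += 5
stats.parse_errors += 2
print(asdict(stats))
```

Attribute access lets editors and type checkers flag a misspelled field name, and `asdict` still gives a plain dictionary where JSON output needs one.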

## Individual Comments

### Comment 1
<location path="src/log_analyzer_cli/parsers/apache.py" line_range="118-119" />
<code_context>
         try:
             ts_str_naive = ts_str.split()[0]
-            return datetime.strptime(ts_str_naive, "%d/%b/%Y:%H:%M:%S")
+            dt = datetime.strptime(ts_str_naive, "%d/%b/%Y:%H:%M:%S")
+            return dt.replace(tzinfo=timezone.utc)
         except ValueError:
             pass
</code_context>
<issue_to_address>
**issue (bug_risk):** Apache timestamps are now labeled as UTC but the original offset component is still discarded, which may misalign time-based filtering.

The code now parses only the `10/Oct/2000:13:55:36` part and then sets `tzinfo=timezone.utc`, effectively reinterpreting a local time as UTC and shifting it by the original offset. That can break UTC-based time-range filtering. Either parse the full `%d/%b/%Y:%H:%M:%S %z` format and preserve the real offset, or keep these as naive local timestamps and ensure any filters remain naive for this parser.
</issue_to_address>
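
To make the first option concrete, here is a sketch of parsing the full timestamp with `%z` so the original offset is preserved, then converting to UTC. The `parse_apache_timestamp` name is hypothetical and this is not the parser's actual code. It also shows the naive-versus-aware comparison error mentioned in the overall comments.

```python
from datetime import datetime, timezone


def parse_apache_timestamp(ts_str: str) -> datetime:
    """Parse '10/Oct/2000:13:55:36 -0700' with its offset and return a UTC datetime."""
    dt = datetime.strptime(ts_str, "%d/%b/%Y:%H:%M:%S %z")
    return dt.astimezone(timezone.utc)


ts = parse_apache_timestamp("10/Oct/2000:13:55:36 -0700")
print(ts.isoformat())  # 2000-10-10T20:55:36+00:00

# A UTC-aware filter bound now compares against the true instant.
start = datetime(2000, 10, 10, 20, 0, tzinfo=timezone.utc)
print(start <= ts)  # True

# Mixing a naive timestamp with an aware bound raises TypeError.
naive = datetime(2000, 10, 10, 13, 55, 36)
try:
    naive < start
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False
```

With `replace(tzinfo=timezone.utc)` on the truncated string, the same line would be treated as 13:55:36 UTC, seven hours away from the instant the log actually recorded.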


