report parse_errors and skipped_filtered lines in the analyze summary by HrachShah · Pull Request #7 · HrachShah/log-analyzer-cli

HrachShah · 2026-06-10T13:32:31Z

Reports parse_errors and skipped_filtered lines in the analyze command's output. _parse_file used to return only the entries list, so a 5-line JSON log with 3 valid lines and 2 garbage lines came back as 'Total Lines: 3, Parsed Entries: 3'. A user had no way to see that the parser was silently dropping 40% of their input, or that the level/pattern/time filters had quietly dropped lines before the parser ever saw them. _parse_file now returns an (entries, stats) tuple, where stats carries total_lines, parse_errors, and skipped_filtered. analyze() writes them into AnalysisResult so both the text and JSON formatters surface them, and adds a 'Skipped N line(s)' warning when filters drop input. 50/50 tests pass.

Summary by Sourcery

Track and surface parsing and filtering statistics in the analyze command while tightening log parsing and documentation.

New Features:

Expose total_lines, parse_errors, and skipped_filtered statistics from log parsing in the analyze command output, including a warning when filters skip lines.

Bug Fixes:

Correct Apache access log parsing by fixing the user field pattern and ensuring timestamps are parsed as UTC datetimes.
Ensure syslog and Apache timestamps are consistently timezone-aware to match CLI time filtering.
Improve error pattern normalization so file paths and embedded ports are handled without corrupting existing : markers.

Enhancements:

Refine the CLI analyze workflow to propagate parsing statistics into AnalysisResult for both text and JSON formatters.
Clarify and restructure the README with proper headings, usage examples, and project layout.
Add an Apache log example file to illustrate supported formats.

Tests:

Add CLI tests to validate total_lines, parsed_entries, parse_errors accounting, filtered-line warnings, and behavior on empty input files.


The _parse_timestamp method only applied the current-year correction to the "%b %d %H:%M:%S" (RFC 3164) format, but the same issue affects "%Y-%m-%d %H:%M:%S" which also omits the year. Now all formats that lack an explicit year default to the current year.
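A minimal sketch of one way to default a year-less timestamp to the current year. The helper name `parse_yearless` and its `now` parameter are hypothetical and are not taken from the project's `_parse_timestamp`; the sketch prepends the year to the string before parsing so that `strptime` never has to guess it.

```python
from datetime import datetime
from typing import Optional


def parse_yearless(ts: str, now: Optional[datetime] = None) -> datetime:
    """Parse an RFC 3164 style stamp such as 'Oct 10 13:55:36'.

    Hypothetical helper: the year is taken from `now` (defaults to the
    local clock) because the log line itself does not carry one.
    """
    now = now or datetime.now()
    # Prepend the assumed year, then parse with an explicit %Y directive.
    return datetime.strptime(f"{now.year} {ts}", "%Y %b %d %H:%M:%S")


# Pin `now` so the example is deterministic.
parsed = parse_yearless("Oct 10 13:55:36", now=datetime(2026, 6, 10))
print(parsed.isoformat())
```

Passing `now` explicitly keeps the helper testable; a production version would also need a policy for log lines written just before a year rollover.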

The COMBINED_PATTERN used (?P<user>\s+) which only matches whitespace, failing to capture the actual user value like '-' or 'frank'. Changed to (?P<user>\S+) to correctly capture the user identifier. Also removed unnecessary .*$ at the end of the combined pattern.
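The difference between `\s+` and `\S+` can be illustrated with a cut-down pattern. The two regexes below are simplified stand-ins covering only the host, ident, and user fields; they are assumptions for illustration and not the project's full COMBINED_PATTERN.

```python
import re

# Simplified stand-ins for the start of a combined-log pattern.
# BROKEN requires the user field to be whitespace; FIXED accepts any
# run of non-whitespace characters such as '-' or 'frank'.
BROKEN = re.compile(r"(?P<host>\S+) (?P<ident>\S+) (?P<user>\s+) \[")
FIXED = re.compile(r"(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[")

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326'

broken_match = BROKEN.match(line)  # no match: 'frank' is not whitespace
fixed_match = FIXED.match(line)    # matches and captures the user

print(broken_match)
print(fixed_match.group("user"))
```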

_parse_file used to return only the entries list, so the analyze command showed 'Total Lines: 3' for a 5-line log that mixed 3 valid JSON lines with 2 garbage lines, and it offered no way to tell that the level or pattern filter had silently dropped half the input. _parse_file now returns an (entries, stats) tuple. stats carries total_lines (lines that survived the include-level/pattern/time filters), parse_errors (lines the parser could not turn into a ParsedEntry), and skipped_filtered (lines dropped by the filters). analyze() now writes total_lines, parse_errors, and a 'Skipped N line(s)' warning into the AnalysisResult so both the text and JSON formatters surface the data-quality issue that used to be invisible. Three new tests cover a mixed-quality file, a level-filtered file, and an empty file. All 50 tests pass.
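The accounting described above can be sketched roughly as follows. This is not the project's `_parse_file`: the function names `parse_lines` and `parse_json_line`, the `keep` predicate, and the decision to ignore blank lines are assumptions made for illustration. Only the three stat names come from the commit message.

```python
import json


def parse_json_line(line):
    """Hypothetical parser: return a dict for a JSON object, else None."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def parse_lines(lines, parse, keep=lambda line: True):
    """Return (entries, stats) for an iterable of raw log lines."""
    entries = []
    stats = {"total_lines": 0, "parse_errors": 0, "skipped_filtered": 0}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are not counted at all
        if not keep(line):
            stats["skipped_filtered"] += 1  # dropped before parsing
            continue
        stats["total_lines"] += 1  # survived the filters
        entry = parse(line)
        if entry is None:
            stats["parse_errors"] += 1
        else:
            entries.append(entry)
    return entries, stats


sample = [
    '{"level": "INFO", "message": "a"}',
    "garbage",
    '{"level": "ERROR", "message": "b"}',
    "not json",
    '{"level": "INFO", "message": "c"}',
]
entries, stats = parse_lines(sample, parse_json_line)
print(len(entries), stats)
```

For the 5-line example from the description, this yields 3 entries with total_lines at 5 and parse_errors at 2, which is the gap the old output hid.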

sourcery-ai · 2026-06-10T13:32:39Z

Reviewer's Guide

Adds per-line accounting for total, parse-error, and filtered log lines to the analyze command, wires those stats through to the analysis result and JSON output, adjusts timestamp handling and parsing in CLI and parsers, refines error-pattern normalization to better treat paths and ports, and updates documentation and tests accordingly.

Sequence diagram for analyze command with parse stats propagation

sequenceDiagram
    actor User
    participant CLI as cli_analyze
    participant Parser as LogParser
    participant Analyzer as analyze_log_entries
    participant Formatter as format_output

    User->>CLI: analyze(file, levels, pattern, start_time, end_time)
    CLI->>CLI: _parse_file(parser, file, level_filter, pattern, start_dt, end_dt)
    CLI->>Parser: parse(line) [inside _parse_file loop]
    Parser-->>CLI: ParsedEntry or None
    CLI-->>CLI: return entries, stats
    CLI->>Analyzer: analyze_log_entries(entries, group_errors)
    Analyzer-->>CLI: AnalysisResult result
    CLI->>CLI: set result.total_lines = stats[total_lines]
    CLI->>CLI: set result.parse_errors = stats[parse_errors]
    CLI->>CLI: append warning to result.warnings when stats[skipped_filtered] > 0
    CLI->>Formatter: format_output(result, output, verbose)
    Formatter-->>CLI: output_str
    CLI-->>User: print output_str
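The last steps of the diagram, copying the stats onto the result and appending a warning, might look like the sketch below. The `AnalysisResult` dataclass here is a hypothetical stand-in with only the fields named in this PR, `apply_stats` is an invented helper name, and the exact warning wording beyond 'Skipped N line(s)' is not known from the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnalysisResult:
    """Hypothetical stand-in; the real class likely has more fields."""
    parsed_entries: int = 0
    total_lines: int = 0
    parse_errors: int = 0
    warnings: List[str] = field(default_factory=list)


def apply_stats(result: AnalysisResult, stats: Dict[str, int]) -> AnalysisResult:
    """Copy parse statistics onto the result and warn about filtered lines."""
    result.total_lines = stats["total_lines"]
    result.parse_errors = stats["parse_errors"]
    skipped = stats["skipped_filtered"]
    if skipped > 0:
        result.warnings.append(f"Skipped {skipped} line(s)")
    return result


result = apply_stats(
    AnalysisResult(parsed_entries=3),
    {"total_lines": 5, "parse_errors": 2, "skipped_filtered": 4},
)
print(result)
```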

File-Level Changes

Change	Details	Files
Track total_lines, parse_errors, and skipped_filtered in file parsing and surface them in analyze results and CLI output.	Change _parse_file to return a (entries, stats) tuple that counts non-empty lines after filters, parse failures, and lines dropped by filters. Increment skipped_filtered for lines excluded by level, pattern, or time filters, and track parse_errors when parser.parse fails. In analyze(), consume the new stats tuple, set total_lines and parse_errors on the AnalysisResult, and append a warning when any lines were skipped by filters.	`src/log_analyzer_cli/cli.py`
Make timestamp handling consistent and timezone-aware for CLI time filters and Apache timestamps, and slightly simplify syslog timestamp parsing.	In CLI, parse start-time and end-time as naive datetimes and then set tzinfo=UTC before passing to analysis. In ApacheParser, fix the COMBINED log regex to capture the user field correctly and set parsed timestamps to UTC-aware datetimes. In SyslogParser, refactor _parse_timestamp to avoid double-parsing and ensure year injection only for the month-day-time format.	`src/log_analyzer_cli/cli.py` `src/log_analyzer_cli/parsers/apache.py` `src/log_analyzer_cli/parsers/syslog.py`
Refine normalize_error_pattern so that file paths are normalized before port replacement and ports embedded in paths are preserved correctly.	Change path normalization to match only strings starting with '/' using a narrower regex. Introduce a temporary '<PROTECTED_PORT>' marker to avoid double-processing ports that were already normalized as part of a path, then restore them after generic port replacement. Remove the previous broad '/[^\s]+' path replacement in favor of the new ordering and matching.	`src/log_analyzer_cli/utils.py`
Add regression tests around parse error accounting, filtered line warnings, and empty-file behavior in the CLI JSON output.	Introduce TestParseErrors in test_cli.py with helper to create temporary log files. Add test to assert total_lines, parsed_entries, and parse_errors for mixed-valid/invalid JSON logs in JSON output. Add test to verify that filtered lines trigger a "Skipped N line(s)" warning and that empty files produce the "No log entries" message without crashing.	`tests/test_cli.py`
Clean up and expand README to present proper project documentation structure without placeholder markers.	Replace placeholder 'README: updated' markers with proper Markdown headings (e.g., '# Log Analyzer CLI', '## Features', '### Commands'). Fix example command descriptions and comments (e.g., label JSON output, warnings-only, pattern matching, time-based filtering). Tidy the project structure tree to remove inline comments and show accurate file layout.	`README.md`


coderabbitai · 2026-06-10T13:32:41Z

Warning

Review limit reached

@HrachShah, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 56 minutes and 51 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 30b01124-ff7c-4945-a0f8-c78954ae37b1

📥 Commits

Reviewing files that changed from the base of the PR and between e93757f and 2c5a9ed.

⛔ Files ignored due to path filters (1)

examples/apache-sample.log is excluded by !**/*.log

📒 Files selected for processing (7)

.gitignore
README.md
src/log_analyzer_cli/cli.py
src/log_analyzer_cli/parsers/apache.py
src/log_analyzer_cli/parsers/syslog.py
src/log_analyzer_cli/utils.py
tests/test_cli.py


sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

The change to make start_dt/end_dt timezone-aware (timezone.utc) may conflict with whatever parse_timestamp returns (often naive datetimes), which can lead to TypeError: can't compare offset-naive and offset-aware datetimes; consider normalizing all parsed timestamps to a consistent timezone or keeping the CLI filters naive for now.
The new stats dictionary returned from _parse_file (total_lines, parse_errors, skipped_filtered) is accessed via string keys throughout; introducing a small dataclass or TypedDict for these stats would make the interface clearer and reduce the chance of typos or key mismatches.
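One possible shape for the suggested stats type, as a sketch only. The class name `ParseStats` is hypothetical; the three field names are the keys the PR already uses.

```python
from dataclasses import dataclass


@dataclass
class ParseStats:
    """Hypothetical replacement for the string-keyed stats dict."""
    total_lines: int = 0
    parse_errors: int = 0
    skipped_filtered: int = 0


stats = ParseStats()
stats.total_lines += 5
stats.parse_errors += 2
print(stats)
```

With attribute access, editors and type checkers can flag a misspelled field name before the code runs, which a string key cannot offer.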

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The change to make `start_dt`/`end_dt` timezone-aware (`timezone.utc`) may conflict with whatever `parse_timestamp` returns (often naive datetimes), which can lead to `TypeError: can't compare offset-naive and offset-aware datetimes`; consider normalizing all parsed timestamps to a consistent timezone or keeping the CLI filters naive for now.
- The new stats dictionary returned from `_parse_file` (`total_lines`, `parse_errors`, `skipped_filtered`) is accessed via string keys throughout; introducing a small dataclass or TypedDict for these stats would make the interface clearer and reduce the chance of typos or key mismatches.

## Individual Comments

### Comment 1
<location path="src/log_analyzer_cli/parsers/apache.py" line_range="118-119" />
<code_context>
         try:
             ts_str_naive = ts_str.split()[0]
-            return datetime.strptime(ts_str_naive, "%d/%b/%Y:%H:%M:%S")
+            dt = datetime.strptime(ts_str_naive, "%d/%b/%Y:%H:%M:%S")
+            return dt.replace(tzinfo=timezone.utc)
         except ValueError:
             pass
</code_context>
<issue_to_address>
**issue (bug_risk):** Apache timestamps are now labeled as UTC but the original offset component is still discarded, which may misalign time-based filtering.

The code now parses only the `10/Oct/2000:13:55:36` part and then sets `tzinfo=timezone.utc`, effectively reinterpreting a local time as UTC and shifting it by the original offset. That can break UTC-based time-range filtering. Either parse the full `%d/%b/%Y:%H:%M:%S %z` format and preserve the real offset, or keep these as naive local timestamps and ensure any filters remain naive for this parser.
</issue_to_address>


sourcery-ai · 2026-06-10T13:35:03Z

+            dt = datetime.strptime(ts_str_naive, "%d/%b/%Y:%H:%M:%S")
+            return dt.replace(tzinfo=timezone.utc)


issue (bug_risk): Apache timestamps are now labeled as UTC but the original offset component is still discarded, which may misalign time-based filtering.

The code now parses only the 10/Oct/2000:13:55:36 part and then sets tzinfo=timezone.utc, effectively reinterpreting a local time as UTC and shifting it by the original offset. That can break UTC-based time-range filtering. Either parse the full %d/%b/%Y:%H:%M:%S %z format and preserve the real offset, or keep these as naive local timestamps and ensure any filters remain naive for this parser.
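The drift described in this comment can be reproduced with the standard library alone. The sketch below compares the relabelling approach from the diff with a full `%z` parse; the variable names are illustrative and the sample timestamp is the one quoted in the comment with an assumed `-0700` offset.

```python
from datetime import datetime, timedelta, timezone

raw = "10/Oct/2000:13:55:36 -0700"

# Approach from the diff: drop the offset, then label the wall time as UTC.
relabelled = datetime.strptime(
    raw.split()[0], "%d/%b/%Y:%H:%M:%S"
).replace(tzinfo=timezone.utc)

# Alternative raised in the review: parse the offset with %z, then convert.
aware = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
as_utc = aware.astimezone(timezone.utc)

# The two results differ by exactly the discarded offset.
drift = as_utc - relabelled
print(relabelled.isoformat(), as_utc.isoformat(), drift)
```

For this sample the relabelled value is seven hours earlier than the true UTC instant, so a UTC time-range filter could include or exclude the line incorrectly.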

Zo Agent and others added 11 commits April 21, 2026 01:27

Fix .gitignore: remove local_settings.py/db.sqlite3 lines, keep examples rule

8e015b3

docs: clean up README formatting artifacts

606a09d

normalize: capture path-embedded ports before generic port replacement

d96077f

add timezone info to time filter CLI args

6ebdaba

attach UTC timezone to parsed timestamps for consistency

5809022

tighten path placeholder regex to avoid consuming port numbers

cabbbad

ensure naive apache timestamps are tz-aware for accurate comparisons

ec1bf32

normalize parameter naming in _parse_file and add missing re import

05c73e0

HrachShah wants to merge 11 commits into main from patch-2
