filter log entries from parsed levels and timezone-normalise CLI filters#17
filter log entries from parsed levels and timezone-normalise CLI filters#17HrachShah wants to merge 1 commit into
Conversation
The CLI's `_parse_file` filtered by `detect_log_level(line)` (raw text scan) and `parse_timestamp(line)` (extracted from the raw line) before calling `parser.parse(line)`. Both disagree with the values the parser actually assigns, so the filters and the analyzer were working from different inputs. Level filter: an Apache 200 with "ERROR" in the User-Agent was INFO per the parser (200 status code) but ERROR per the raw-text scan, so `--levels INFO` wrongly dropped it; an Apache 500 with no level keyword in the line was ERROR per the parser but UNKNOWN per the raw-text scan, so `--levels ERROR` wrongly dropped it. Fix: parse the line first, then filter by `parsed.level`. Time filter: the JSON parser returns timezone-aware datetimes (the Z suffix is converted to +00:00 in `_parse_timestamp_string`); the CLI parsed `--start-time` / `--end-time` as naive datetimes and then compared them, raising "can't compare offset-naive and offset-aware datetimes" inside the outer try/except in `analyze()`. Fix: attach `tzinfo=timezone.utc` to the parsed filter values so a naive `YYYY-MM-DD HH:MM:SS` argument is treated as UTC and the comparison works against both naive and aware log timestamps. Includes two regression tests: - `test_analyze_time_filter_naive_start_with_aware_log` passes a naive `--start-time` against a JSON log with `Z` timestamps and asserts the late entry is kept and the early one is dropped - `test_analyze_level_filter_uses_parser_level_not_raw_text` writes an Apache log with a 200 + "ERROR" UA line and a 500 line, and checks that `--levels ERROR` keeps only the 500 and `--levels INFO` keeps only the 200 All 49 tests in the suite pass.
Reviewer's GuideAdjusts the CLI log analysis pipeline so level and time filters operate on parser-derived values instead of raw-line heuristics, and normalizes CLI time filters to UTC, with tests covering timezone-aware logs and parser-based level filtering. Flow diagram for updated CLI log filtering pipelineflowchart TD
A[analyze] --> B[parse start_time and end_time]
B --> C[replace tzinfo with timezone.utc]
C --> D[_parse_file]
subgraph _parse_file
D --> E[read_log_file]
E --> F[strip empty lines]
F --> G{search_pattern matches?}
G -- no --> E
G -- yes --> H[parser.parse]
H --> I{parsed is not None?}
I -- no --> E
I -- yes --> J{include_levels set?}
J -- yes --> K{parsed.level in include_levels?}
K -- no --> E
K -- yes --> L
J -- no --> L
L{within start_time/end_time?} -->|no| E
L -->|yes| M[append parsed to entries]
M --> E
end
D --> N[return entries to analyze]
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
📝 WalkthroughWalkthrough
ChangesPost-parse filtering pipeline and UTC coercion
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- In
_parse_file, the imports ofdetect_log_levelandparse_timestampare now unused after switching to filtering onparsedentries; consider removing these imports to avoid confusion and keep the function focused. - Tagging
start_dtandend_dtwithtimezone.utcunconditionally fixes the naive-vs-aware comparison error but may be semantically wrong for non-UTC logs; consider deriving the timezone from the parsed log timestamps or making the CLI accept an explicit timezone for these filters.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_parse_file`, the imports of `detect_log_level` and `parse_timestamp` are now unused after switching to filtering on `parsed` entries; consider removing these imports to avoid confusion and keep the function focused.
- Tagging `start_dt` and `end_dt` with `timezone.utc` unconditionally fixes the naive-vs-aware comparison error but may be semantically wrong for non-UTC logs; consider deriving the timezone from the parsed log timestamps or making the CLI accept an explicit timezone for these filters.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/log_analyzer_cli/cli.py`:
- Around line 232-235: The datetime comparison logic that checks
parsed.timestamp against start_time and end_time will fail with a TypeError when
the timestamp is naive (lacks timezone info), because Python cannot compare
aware and naive datetime objects. Before performing the comparison checks in the
filtering block (where start_time and end_time are compared), normalize any
naive timestamps by detecting if parsed.timestamp lacks tzinfo and converting it
to a UTC-aware datetime object. This ensures all datetime comparisons are
between objects in the same timezone state, preventing the TypeError from
occurring.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7bff2899-06a6-415b-9f23-40cdd622ef71
📒 Files selected for processing (2)
src/log_analyzer_cli/cli.pytests/test_cli.py
| if start_time and parsed.timestamp and parsed.timestamp < start_time: | ||
| continue | ||
| if end_time and parsed.timestamp and parsed.timestamp > end_time: | ||
| continue |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Compare points introduced in CLI =="
rg -n -C2 --type=py 'parsed\.timestamp < start_time|parsed\.timestamp > end_time' src/log_analyzer_cli/cli.py
echo
echo "== Parser timestamp construction (look for naive datetime creation) =="
rg -n -C3 --type=py 'def parse\(|ParsedEntry\(|timestamp\s*=|strptime\(|fromisoformat\(|tzinfo|timezone|astimezone' src/log_analyzer_cli/parsers
echo
echo "== Any timestamp parsing helpers outside parsers =="
rg -n -C3 --type=py 'parse_timestamp|_parse_timestamp|strptime\(|fromisoformat\(' src/log_analyzer_cliRepository: HrachShah/log-analyzer-cli
Length of output: 22486
🏁 Script executed:
# Check lines around 108 and 117 for UTC coercion
head -130 src/log_analyzer_cli/cli.py | tail -40Repository: HrachShah/log-analyzer-cli
Length of output: 1535
🏁 Script executed:
# Check lines 225-245 to see full filter context and variable names
sed -n '225,245p' src/log_analyzer_cli/cli.pyRepository: HrachShah/log-analyzer-cli
Length of output: 525
Datetime comparison will fail when parsers return naive timestamps.
Lines 232–234 compare parsed.timestamp against UTC-aware datetimes (start_time and end_time passed from lines 108 and 117). Python raises TypeError when comparing aware and naive datetime objects. Multiple parsers emit naive datetimes:
- syslog.py (lines 128–131): Returns naive datetime from
strptime() - generic.py (lines 87–92): Returns naive datetime from
strptime() - apache.py (line 119): Returns naive datetime when logs lack timezone info
- json_log.py (line 81): Returns naive datetime for certain timestamp values
When a log entry has a naive timestamp, the comparison at lines 232–234 will crash with TypeError: '<' not supported between instances of 'datetime.datetime' and 'datetime.datetime' instead of being caught by the outer exception handler (lines 96–99, which does not list TypeError).
Normalize naive timestamps before comparison, or convert all timestamps to a consistent timezone state.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/log_analyzer_cli/cli.py` around lines 232 - 235, The datetime comparison
logic that checks parsed.timestamp against start_time and end_time will fail
with a TypeError when the timestamp is naive (lacks timezone info), because
Python cannot compare aware and naive datetime objects. Before performing the
comparison checks in the filtering block (where start_time and end_time are
compared), normalize any naive timestamps by detecting if parsed.timestamp lacks
tzinfo and converting it to a UTC-aware datetime object. This ensures all
datetime comparisons are between objects in the same timezone state, preventing
the TypeError from occurring.
Summary
The CLI's
_parse_filefiltered bydetect_log_level(line)(raw text scan) andparse_timestamp(line)(extracted from the raw line) before callingparser.parse(line). Both disagree with the values the parser actually assigns, so the filters and the analyzer were working from different inputs.Level filter mismatch:
INFOper the parser (200 status code) butERRORper the raw-text scan, so--levels INFOwrongly dropped it.ERRORper the parser butUNKNOWNper the raw-text scan, so--levels ERRORwrongly dropped it.Time filter mismatch:
The CLI parsed
--start-time "2025-03-20 10:00:00"as a naive datetime while the JSON log carries timezone-aware timestamps (e.g.2025-03-20T10:15:32.123Z). The comparisontimestamp < start_timeraisedTypeError: can't compare offset-naive and offset-aware datetimesinside the outertry/except (OSError, ValueError, RuntimeError)inanalyze(), where it was caught and reported as a generic "Error: ..." exit code 1. The CLI's documented exit code 0 for a successful filtered run was unreachable for the common case of a JSON log with TZ-aware timestamps.Fix:
In
src/log_analyzer_cli/cli.py:_parse_filenow callsparser.parse(line)first and filters on the parsed entry'slevelandtimestamp. The pattern filter still runs against the raw line (it has no parsed equivalent and the user wrote a regex against the line text on purpose).analyzenow tags the parsed--start-time/--end-timewithtzinfo=timezone.utcso the comparison against a TZ-aware log timestamp works.Tests added (in
tests/test_cli.py):test_analyze_time_filter_naive_start_with_aware_log— writes a JSON log with2025-03-20T10:15:32.123Zand2025-03-20T10:17:00.000Zentries, runs the CLI with--start-time 2025-03-20 10:16:00, asserts exit code 0, asserts the 10:15 entry is excluded and the 10:17 entry is included.test_analyze_level_filter_uses_parser_level_not_raw_text— writes an Apache log with a 200 whose User-Agent contains "ERROR" and a 500 with no level keyword, asserts that--levels ERRORreturns exactly the 500 and--levels INFOreturns exactly the 200.Both tests fail on
mainand pass after the fix.pytest tests/is 49 passed, 0 failed (up from 47 passed, 2 failed on main). The remaining on-disk failure onmain(test_analyze_apache_file) is a pre-existing fixture issue — the test referencesexamples/apache-sample.logwhich is gitignored and not committed; the fix here is unrelated.Summary by Sourcery
Align CLI log filtering with parser output and normalise time filters to work with timezone-aware log timestamps.
Bug Fixes:
Tests:
Summary by CodeRabbit
Bug Fixes
Tests