anchor yearless syslog timestamps to the current year#11
Conversation
The JSON log parser used a single > 1e12 branch to decide whether a numeric timestamp was in se
The "%b %d %H:%M:%S" format used by RFC 3164 syslog (e.g. "Mar 20
10:15:32") has no year component. Previously _try_parse_datetime handed
the string straight to datetime.strptime, which fills 1900 as the
default year. That date then propagated through analyze() into
first_seen / last_seen / TimeDistribution, which meant every syslog
entry landed in a single January-1900 bucket in the time-series report
and sorted before every other entry in the file.
Re-parse the same string with the current year prepended ("%Y %b %d
%H:%M:%S") when the result lands on year 1900. This matches the
convention used by Python's logging module and most production syslog
parsers (rsyslog, syslog-ng). Timestamps that already carry a year are
unaffected — the rewrite only runs on the yearless format.
Also added tests/test_utils.py covering: yearless syslog lines anchored
to the current year, single-digit day ("Mar 5"), the no-1900
invariant, that timestamps with a year are unchanged, and that lines
without any timestamp return None.
Reviewer's GuideAnchors yearless syslog-style timestamps to the current year in the shared timestamp parser and refines JSON numeric timestamp handling to distinguish seconds, ms, µs, and ns by width, with new regression tests for both behaviors. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- The special-casing of the syslog format in
_try_parse_datetimeviaif i == 6is brittle and depends on the ordering offormats; it would be more robust to check the format string explicitly (e.g.if fmt == "%b %d %H:%M:%S"). - In
JSONLogParser._extract_timestamp, the newisinstance(value, int)guard will now drop valid numeric timestamps provided as floats (e.g.1700000000.0), which may be a regression; consider accepting floats and normalizing (e.g. viaint(value)orDecimal) rather than returningNone. - The updated
_extract_timestampdocstring is truncated ("and most high"), so please complete or tighten that sentence to clearly describe the unit-width behavior.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The special-casing of the syslog format in `_try_parse_datetime` via `if i == 6` is brittle and depends on the ordering of `formats`; it would be more robust to check the format string explicitly (e.g. `if fmt == "%b %d %H:%M:%S"`).
- In `JSONLogParser._extract_timestamp`, the new `isinstance(value, int)` guard will now drop valid numeric timestamps provided as floats (e.g. `1700000000.0`), which may be a regression; consider accepting floats and normalizing (e.g. via `int(value)` or `Decimal`) rather than returning `None`.
- The updated `_extract_timestamp` docstring is truncated ("and most high"), so please complete or tighten that sentence to clearly describe the unit-width behavior.
## Individual Comments
### Comment 1
<location path="src/log_analyzer_cli/utils.py" line_range="53-62" />
<code_context>
+ for i, fmt in enumerate(formats):
</code_context>
<issue_to_address>
**suggestion:** Avoid relying on the index of the format for the syslog-without-year special case.
This relies on the syslog-style format always being at index 6, which is fragile if `formats` is reordered or new entries are inserted. Instead, match on the specific format string (e.g., by comparing `fmt` to the syslog pattern) rather than `i == 6` so the behavior remains stable as the list evolves.
Suggested implementation:
```python
if ts_str.endswith("Z"):
ts_str = ts_str[:-1] + "+00:00"
for fmt in formats:
try:
parsed = datetime.strptime(ts_str, fmt)
except ValueError:
continue
# The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no
# year, so datetime.strptime fills in 1900 as the default. Re-parse
# the same string with the current year prepended so the resulting
# timestamp is anchored to when the log was actually generated. This
# matches the convention used by Python's logging module and most
```
To fully implement the suggestion and avoid relying on the index:
1. Replace any condition like `if i == 6:` (or other uses of `i` that refer to the syslog-without-year format) with a direct comparison to the format string, for example:
```python
if fmt == "%b %d %H:%M:%S":
# syslog-without-year special handling
```
2. Remove any remaining uses of the loop index `i` inside this loop, or reintroduce an index only where it is semantically required and not tied to the position of a specific format.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| for i, fmt in enumerate(formats): | ||
| try: | ||
| return datetime.strptime(ts_str, fmt) | ||
| parsed = datetime.strptime(ts_str, fmt) | ||
| except ValueError: | ||
| continue | ||
| # The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no | ||
| # year, so datetime.strptime fills in 1900 as the default. Re-parse | ||
| # the same string with the current year prepended so the resulting | ||
| # timestamp is anchored to when the log was actually generated. This | ||
| # matches the convention used by Python's logging module and most |
There was a problem hiding this comment.
suggestion: Avoid relying on the index of the format for the syslog-without-year special case.
This relies on the syslog-style format always being at index 6, which is fragile if formats is reordered or new entries are inserted. Instead, match on the specific format string (e.g., by comparing fmt to the syslog pattern) rather than i == 6 so the behavior remains stable as the list evolves.
Suggested implementation:
if ts_str.endswith("Z"):
ts_str = ts_str[:-1] + "+00:00"
for fmt in formats:
try:
parsed = datetime.strptime(ts_str, fmt)
except ValueError:
continue
# The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no
# year, so datetime.strptime fills in 1900 as the default. Re-parse
# the same string with the current year prepended so the resulting
# timestamp is anchored to when the log was actually generated. This
# matches the convention used by Python's logging module and mostTo fully implement the suggestion and avoid relying on the index:
- Replace any condition like
if i == 6:(or other uses ofithat refer to the syslog-without-year format) with a direct comparison to the format string, for example:if fmt == "%b %d %H:%M:%S": # syslog-without-year special handling
- Remove any remaining uses of the loop index
iinside this loop, or reintroduce an index only where it is semantically required and not tied to the position of a specific format.
📝 WalkthroughWalkthroughTwo timestamp parsing fixes are applied: ChangesTimestamp Parsing Precision Fixes
🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
src/log_analyzer_cli/parsers/json_log.py (1)
11-11:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix CI-blocking flake8 errors in this file before merge.
The current branch still has a failing unused import and visual-indent violation in the metadata comprehension.
Proposed fix
-from log_analyzer_cli.utils import detect_log_level @@ - metadata = {k: v for k, v in data.items() - if k not in self.TIMESTAMP_FIELDS + self.LEVEL_FIELDS + self.MESSAGE_FIELDS} + metadata = { + k: v + for k, v in data.items() + if k not in self.TIMESTAMP_FIELDS + self.LEVEL_FIELDS + self.MESSAGE_FIELDS + }Also applies to: 62-63
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/log_analyzer_cli/parsers/json_log.py` at line 11, The file has two flake8 violations that need to be fixed before merge. First, remove the unused import statement importing detect_log_level from log_analyzer_cli.utils at line 11, as it is not used anywhere in the file. Second, fix the visual-indent violation in the metadata comprehension around lines 62-63 by ensuring proper indentation alignment according to PEP 8 visual indent rules, making sure continuation lines are properly aligned with the opening delimiter.Source: Pipeline failures
tests/test_parsers.py (2)
5-5:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRemove unused
pytestimport to unblock CI.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_parsers.py` at line 5, Remove the unused `import pytest` statement from the top of tests/test_parsers.py since it is not referenced anywhere in the file. Deleting this import statement will clean up the code and resolve any linting or CI issues related to unused imports.Source: Pipeline failures
100-100:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSplit overlong Apache test lines to satisfy flake8 E501.
Proposed fix
- line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"' + line = ( + '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] ' + '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"' + ) @@ - line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"' + line = ( + '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] ' + '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"' + )Also applies to: 110-110
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_parsers.py` at line 100, The test line assignments in the Apache parser tests exceed the maximum line length enforced by flake8 E501. Split the long string literal assignments for variables like `line` at multiple locations (including around lines 100 and 110) into multiple shorter lines using implicit string concatenation or parentheses to break the lines into manageable chunks that comply with the configured line length limit.Source: Pipeline failures
src/log_analyzer_cli/utils.py (1)
128-128:⚠️ Potential issue | 🟠 Major | ⚡ Quick winBreak the UUID regex line to clear flake8 E501.
Proposed fix
- pattern = re.sub(r'[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}', '<UUID>', pattern) + pattern = re.sub( + r'[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}', + '<UUID>', + pattern, + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/log_analyzer_cli/utils.py` at line 128, The re.sub() function call in the pattern assignment is exceeding the flake8 line length limit (E501). Break this long line by extracting the regex pattern into a separate variable on the line above, or by using parentheses to split the function arguments across multiple lines. The UUID regex pattern should be assigned to a variable first, then passed to re.sub() to keep individual lines within the acceptable length limit.Source: Pipeline failures
🧹 Nitpick comments (2)
src/log_analyzer_cli/utils.py (1)
53-53: ⚡ Quick winAvoid coupling syslog-year logic to a hardcoded format index.
Using
i == 6is fragile if formats are reordered; key offfmtdirectly to preserve behavior.Proposed fix
- for i, fmt in enumerate(formats): + for fmt in formats: @@ - if i == 6 and parsed.year == 1900: + if fmt == "%b %d %H:%M:%S" and parsed.year == 1900:Also applies to: 68-68
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/log_analyzer_cli/utils.py` at line 53, The code uses hardcoded format indices like i == 6 to detect the syslog-year format, which is fragile if the formats list is reordered. Find all occurrences where i is compared against a hardcoded value (mentioned at lines around 53 and 68 in the formats enumeration loop) and replace these index-based checks with direct comparisons against the fmt variable itself, checking for the specific format string pattern that represents the syslog-year format instead of relying on its position in the list.tests/test_parsers.py (1)
253-274: ⚡ Quick winAdd boundary-width tests for older-but-valid numeric epochs.
Current coverage validates modern 1.7eX examples but misses boundary cases (e.g., 18-digit ns, 15-digit us) that can still be misrouted by threshold logic.
Suggested tests
class TestJSONNumericTimestampWidth: @@ + def test_18_digit_nanoseconds_is_treated_as_nanoseconds(self): + parser = JSONLogParser() + line = '{"timestamp": 999999999999999999, "level": "INFO", "message": "x"}' + entry = parser.parse(line) + assert entry is not None + assert entry.timestamp is not None + + def test_15_digit_microseconds_is_treated_as_microseconds(self): + parser = JSONLogParser() + line = '{"timestamp": 999999999999999, "level": "INFO", "message": "x"}' + entry = parser.parse(line) + assert entry is not None + assert entry.timestamp is not None🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_parsers.py` around lines 253 - 274, The tests test_microseconds_is_treated_as_microseconds and test_nanoseconds_is_treated_as_nanoseconds only validate modern timestamps and miss boundary cases with older dates that could be misrouted by threshold logic. Add additional test cases within these methods that use older but valid timestamps in their respective time scales (e.g., timestamps from earlier years like 2000 or 1970) to ensure the parser correctly distinguishes between microsecond and nanosecond formats at the boundaries where they might be confused.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/log_analyzer_cli/parsers/json_log.py`:
- Around line 92-97: The current magnitude-based thresholds in the timestamp
parsing logic (checking if value >= 1e18, >= 1e15, >= 1e12) fail to distinguish
between microseconds and nanoseconds for values near boundaries, causing values
like 999999999999999999 nanoseconds to be incorrectly treated as microseconds.
Instead of relying solely on magnitude checks, add validation logic that checks
whether the resulting datetime from each conversion (value / 1e9, value / 1e6,
value / 1000) produces a valid timestamp within an acceptable range (e.g.,
between 1970 and a reasonable future year). Return the result only if it falls
within valid bounds, otherwise try the next conversion factor, ensuring boundary
values are correctly classified.
---
Outside diff comments:
In `@src/log_analyzer_cli/parsers/json_log.py`:
- Line 11: The file has two flake8 violations that need to be fixed before
merge. First, remove the unused import statement importing detect_log_level from
log_analyzer_cli.utils at line 11, as it is not used anywhere in the file.
Second, fix the visual-indent violation in the metadata comprehension around
lines 62-63 by ensuring proper indentation alignment according to PEP 8 visual
indent rules, making sure continuation lines are properly aligned with the
opening delimiter.
In `@src/log_analyzer_cli/utils.py`:
- Line 128: The re.sub() function call in the pattern assignment is exceeding
the flake8 line length limit (E501). Break this long line by extracting the
regex pattern into a separate variable on the line above, or by using
parentheses to split the function arguments across multiple lines. The UUID
regex pattern should be assigned to a variable first, then passed to re.sub() to
keep individual lines within the acceptable length limit.
In `@tests/test_parsers.py`:
- Line 5: Remove the unused `import pytest` statement from the top of
tests/test_parsers.py since it is not referenced anywhere in the file. Deleting
this import statement will clean up the code and resolve any linting or CI
issues related to unused imports.
- Line 100: The test line assignments in the Apache parser tests exceed the
maximum line length enforced by flake8 E501. Split the long string literal
assignments for variables like `line` at multiple locations (including around
lines 100 and 110) into multiple shorter lines using implicit string
concatenation or parentheses to break the lines into manageable chunks that
comply with the configured line length limit.
---
Nitpick comments:
In `@src/log_analyzer_cli/utils.py`:
- Line 53: The code uses hardcoded format indices like i == 6 to detect the
syslog-year format, which is fragile if the formats list is reordered. Find all
occurrences where i is compared against a hardcoded value (mentioned at lines
around 53 and 68 in the formats enumeration loop) and replace these index-based
checks with direct comparisons against the fmt variable itself, checking for the
specific format string pattern that represents the syslog-year format instead of
relying on its position in the list.
In `@tests/test_parsers.py`:
- Around line 253-274: The tests test_microseconds_is_treated_as_microseconds
and test_nanoseconds_is_treated_as_nanoseconds only validate modern timestamps
and miss boundary cases with older dates that could be misrouted by threshold
logic. Add additional test cases within these methods that use older but valid
timestamps in their respective time scales (e.g., timestamps from earlier years
like 2000 or 1970) to ensure the parser correctly distinguishes between
microsecond and nanosecond formats at the boundaries where they might be
confused.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ba8aa6c4-0e47-4b88-9046-c284d935ca89
📒 Files selected for processing (4)
src/log_analyzer_cli/parsers/json_log.pysrc/log_analyzer_cli/utils.pytests/test_parsers.pytests/test_utils.py
| if value >= 1e18: | ||
| return datetime.fromtimestamp(value / 1e9) | ||
| if value >= 1e15: | ||
| return datetime.fromtimestamp(value / 1e6) | ||
| if value >= 1e12: | ||
| return datetime.fromtimestamp(value / 1000) |
There was a problem hiding this comment.
Numeric “width” detection still misclassifies valid epochs near boundaries.
The current thresholds are still magnitude-based, so values like 18-digit nanoseconds (999999999999999999) fall into the microseconds branch and can parse incorrectly or raise out-of-range errors.
Proposed fix
if isinstance(value, (int, float)):
- if not isinstance(value, int) or value <= 0:
+ if not isinstance(value, int) or value <= 0:
return None
- # Distinguish seconds, milliseconds, microseconds, and
- # nanoseconds by the width of the value, not just
- # whether it is > 1e12. A value of 1.7e15 (microseconds)
- # would be misread as milliseconds in the old code
- # and produce a year 53872 ValueError, and a value of
- # 1.7e18 (nanoseconds) would silently overflow
- # datetime's range.
- if value >= 1e18:
- return datetime.fromtimestamp(value / 1e9)
- if value >= 1e15:
- return datetime.fromtimestamp(value / 1e6)
- if value >= 1e12:
- return datetime.fromtimestamp(value / 1000)
+ digits = len(str(value))
+ if digits >= 18: # ns
+ return datetime.fromtimestamp(value / 1_000_000_000)
+ if digits >= 15: # us
+ return datetime.fromtimestamp(value / 1_000_000)
+ if digits >= 12: # ms
+ return datetime.fromtimestamp(value / 1_000)
return datetime.fromtimestamp(value)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/log_analyzer_cli/parsers/json_log.py` around lines 92 - 97, The current
magnitude-based thresholds in the timestamp parsing logic (checking if value >=
1e18, >= 1e15, >= 1e12) fail to distinguish between microseconds and nanoseconds
for values near boundaries, causing values like 999999999999999999 nanoseconds
to be incorrectly treated as microseconds. Instead of relying solely on
magnitude checks, add validation logic that checks whether the resulting
datetime from each conversion (value / 1e9, value / 1e6, value / 1000) produces
a valid timestamp within an acceptable range (e.g., between 1970 and a
reasonable future year). Return the result only if it falls within valid bounds,
otherwise try the next conversion factor, ensuring boundary values are correctly
classified.
Problem
The
%b %d %H:%M:%Sformat used by RFC 3164 syslog (e.g.Mar 20 10:15:32) has no year component. Previously_try_parse_datetimehanded the string straight todatetime.strptime, which fills 1900 as the default year. That date then propagated throughanalyze()intofirst_seen/last_seen/TimeDistribution, which meant every syslog entry landed in a single January-1900 bucket in the time-series report and sorted before every other entry in the file. The CLI never told the user the year was missing.Fix
Re-parse the same string with the current year prepended (
%Y %b %d %H:%M:%S) when the result lands on year 1900. This matches the convention used by Python'sloggingmodule and most production syslog parsers (rsyslog, syslog-ng). Timestamps that already carry a year are completely unaffected — the rewrite only runs on the yearless format slot.Tests
Added
tests/test_utils.py(7 tests) covering:Mar 5 10:30:00) is parsed correctlyNone(empty string, whitespace-only)Summary by Sourcery
Anchor yearless syslog-style timestamps to the current year and fix JSON numeric timestamp width handling to properly distinguish between seconds, milliseconds, microseconds, and nanoseconds.
Bug Fixes:
Tests:
Summary by CodeRabbit
Release Notes
Bug Fixes
Tests