Skip to content

anchor yearless syslog timestamps to the current year#11

Open
HrachShah wants to merge 2 commits into
mainfrom
fix/utils-syslog-year-fill
Open

anchor yearless syslog timestamps to the current year#11
HrachShah wants to merge 2 commits into
mainfrom
fix/utils-syslog-year-fill

Conversation

@HrachShah

@HrachShah HrachShah commented Jun 17, 2026

Copy link
Copy Markdown
Owner

Problem

The %b %d %H:%M:%S format used by RFC 3164 syslog (e.g. Mar 20 10:15:32) has no year component. Previously _try_parse_datetime handed the string straight to datetime.strptime, which fills 1900 as the default year. That date then propagated through analyze() into first_seen / last_seen / TimeDistribution, which meant every syslog entry landed in a single January-1900 bucket in the time-series report and sorted before every other entry in the file. The CLI never told the user the year was missing.

Fix

Re-parse the same string with the current year prepended (%Y %b %d %H:%M:%S) when the result lands on year 1900. This matches the convention used by Python's logging module and most production syslog parsers (rsyslog, syslog-ng). Timestamps that already carry a year are completely unaffected — the rewrite only runs on the yearless format slot.

Tests

Added tests/test_utils.py (7 tests) covering:

  • yearless syslog lines are anchored to the current year
  • the result is never year 1900
  • ISO timestamps with an explicit year pass through unchanged
  • single-digit day (Mar 5 10:30:00) is parsed correctly
  • lines without any timestamp return None (empty string, whitespace-only)

Summary by Sourcery

Anchor yearless syslog-style timestamps to the current year and fix JSON numeric timestamp width handling to properly distinguish between seconds, milliseconds, microseconds, and nanoseconds.

Bug Fixes:

  • Ensure syslog-style timestamps without a year are anchored to the current year instead of defaulting to 1900, preventing mis-grouped time-series reports.
  • Correct JSON numeric timestamp parsing to detect timestamp units by width, avoiding overflow and out-of-range year errors for microsecond and nanosecond values.
  • Treat non-positive and non-integer numeric JSON timestamps as missing, returning no timestamp instead of invalid datetimes.

Tests:

  • Add regression tests for syslog-style yearless timestamps, ISO timestamps, and lines without timestamps in the shared utils module.
  • Add tests covering JSON timestamp parsing across string and numeric forms, including seconds, milliseconds, microseconds, nanoseconds, and invalid numeric values.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved numeric timestamp validation and parsing for enhanced reliability
    • Fixed syslog-style timestamp parsing to correctly use the current year instead of an incorrect default year
  • Tests

    • Added comprehensive test coverage for timestamp parsing across multiple formats and edge cases

Zo Bot added 2 commits June 17, 2026 01:51
The JSON log parser used a single > 1e12 branch to decide whether a numeric timestamp was in se
The "%b %d %H:%M:%S" format used by RFC 3164 syslog (e.g. "Mar 20
10:15:32") has no year component. Previously _try_parse_datetime handed
the string straight to datetime.strptime, which fills 1900 as the
default year. That date then propagated through analyze() into
first_seen / last_seen / TimeDistribution, which meant every syslog
entry landed in a single January-1900 bucket in the time-series report
and sorted before every other entry in the file.

Re-parse the same string with the current year prepended ("%Y %b %d
%H:%M:%S") when the result lands on year 1900. This matches the
convention used by Python's logging module and most production syslog
parsers (rsyslog, syslog-ng). Timestamps that already carry a year are
unaffected — the rewrite only runs on the yearless format.

Also added tests/test_utils.py covering: yearless syslog lines anchored
to the current year, single-digit day ("Mar  5"), the no-1900
invariant, that timestamps with a year are unchanged, and that lines
without any timestamp return None.
@sourcery-ai

sourcery-ai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Reviewer's Guide

Anchors yearless syslog-style timestamps to the current year in the shared timestamp parser and refines JSON numeric timestamp handling to distinguish seconds, ms, µs, and ns by width, with new regression tests for both behaviors.

File-Level Changes

Change Details Files
Anchor RFC 3164-style yearless syslog timestamps to the current year in the shared timestamp parsing utility.
  • Modify the datetime parsing loop to capture the parsed value and its format index instead of returning immediately
  • Detect when the syslog-style "%b %d %H:%M:%S" format (specific slot in the formats list) produces year 1900 and re-parse the string with the current year prepended using "%Y %b %d %H:%M:%S"
  • Fallback to the original parsed value if the re-parse fails
  • Keep all other formats unchanged and continue returning None when no format matches
src/log_analyzer_cli/utils.py
Add focused regression tests for shared timestamp parsing, especially yearless syslog formats and no-timestamp lines.
  • Introduce tests ensuring syslog-style timestamps are anchored to datetime.now().year and never 1900
  • Verify ISO timestamps with explicit years are not rewritten by the year-fixing logic
  • Assert that lines with no detectable timestamp (plain text, empty, whitespace-only) return None
  • Cover RFC 3164 single-digit day variants (double-space between month and day) as valid inputs
tests/test_utils.py
Refine JSON parser timestamp extraction to use width-based detection for numeric timestamps and to handle invalid or non-positive numeric values safely.
  • Extend the JSONLogParser._extract_timestamp docstring to describe numeric timestamp handling
  • Change numeric timestamp handling to treat only positive integers as valid, returning None for floats or non-positive values
  • Use thresholds based on magnitude to distinguish seconds, milliseconds, microseconds, and nanoseconds (>=1e18 as ns, >=1e15 as µs, >=1e12 as ms, else seconds) and convert appropriately via datetime.fromtimestamp
  • Preserve existing string timestamp parsing logic
src/log_analyzer_cli/parsers/json_log.py
Add regression tests for JSON timestamp parsing, both string and numeric, including width-based numeric timestamp handling.
  • Add tests verifying JSONLogParser.can_parse accepts ISO timestamps with seconds, milliseconds, and microseconds
  • Add tests verifying parse() correctly handles numeric timestamps expressed in seconds and milliseconds and produces non-None entries
  • Introduce a dedicated test class that checks width-based handling for seconds, ms, µs, and ns numeric timestamps with expected calendar dates
  • Add tests asserting that zero and negative numeric timestamps result in entries whose timestamp field is None
tests/test_parsers.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • The special-casing of the syslog format in _try_parse_datetime via if i == 6 is brittle and depends on the ordering of formats; it would be more robust to check the format string explicitly (e.g. if fmt == "%b %d %H:%M:%S").
  • In JSONLogParser._extract_timestamp, the new isinstance(value, int) guard will now drop valid numeric timestamps provided as floats (e.g. 1700000000.0), which may be a regression; consider accepting floats and normalizing (e.g. via int(value) or Decimal) rather than returning None.
  • The updated _extract_timestamp docstring is truncated ("and most high"), so please complete or tighten that sentence to clearly describe the unit-width behavior.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The special-casing of the syslog format in `_try_parse_datetime` via `if i == 6` is brittle and depends on the ordering of `formats`; it would be more robust to check the format string explicitly (e.g. `if fmt == "%b %d %H:%M:%S"`).
- In `JSONLogParser._extract_timestamp`, the new `isinstance(value, int)` guard will now drop valid numeric timestamps provided as floats (e.g. `1700000000.0`), which may be a regression; consider accepting floats and normalizing (e.g. via `int(value)` or `Decimal`) rather than returning `None`.
- The updated `_extract_timestamp` docstring is truncated ("and most high"), so please complete or tighten that sentence to clearly describe the unit-width behavior.

## Individual Comments

### Comment 1
<location path="src/log_analyzer_cli/utils.py" line_range="53-62" />
<code_context>
+    for i, fmt in enumerate(formats):
</code_context>
<issue_to_address>
**suggestion:** Avoid relying on the index of the format for the syslog-without-year special case.

This relies on the syslog-style format always being at index 6, which is fragile if `formats` is reordered or new entries are inserted. Instead, match on the specific format string (e.g., by comparing `fmt` to the syslog pattern) rather than `i == 6` so the behavior remains stable as the list evolves.

Suggested implementation:

```python
    if ts_str.endswith("Z"):
        ts_str = ts_str[:-1] + "+00:00"

    for fmt in formats:
        try:
            parsed = datetime.strptime(ts_str, fmt)
        except ValueError:
            continue
        # The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no
        # year, so datetime.strptime fills in 1900 as the default. Re-parse
        # the same string with the current year prepended so the resulting
        # timestamp is anchored to when the log was actually generated. This
        # matches the convention used by Python's logging module and most

```

To fully implement the suggestion and avoid relying on the index:

1. Replace any condition like `if i == 6:` (or other uses of `i` that refer to the syslog-without-year format) with a direct comparison to the format string, for example:
   ```python
   if fmt == "%b %d %H:%M:%S":
       # syslog-without-year special handling
   ```
2. Remove any remaining uses of the loop index `i` inside this loop, or reintroduce an index only where it is semantically required and not tied to the position of a specific format.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +53 to +62
for i, fmt in enumerate(formats):
try:
return datetime.strptime(ts_str, fmt)
parsed = datetime.strptime(ts_str, fmt)
except ValueError:
continue
# The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no
# year, so datetime.strptime fills in 1900 as the default. Re-parse
# the same string with the current year prepended so the resulting
# timestamp is anchored to when the log was actually generated. This
# matches the convention used by Python's logging module and most

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Avoid relying on the index of the format for the syslog-without-year special case.

This relies on the syslog-style format always being at index 6, which is fragile if formats is reordered or new entries are inserted. Instead, match on the specific format string (e.g., by comparing fmt to the syslog pattern) rather than i == 6 so the behavior remains stable as the list evolves.

Suggested implementation:

    if ts_str.endswith("Z"):
        ts_str = ts_str[:-1] + "+00:00"

    for fmt in formats:
        try:
            parsed = datetime.strptime(ts_str, fmt)
        except ValueError:
            continue
        # The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no
        # year, so datetime.strptime fills in 1900 as the default. Re-parse
        # the same string with the current year prepended so the resulting
        # timestamp is anchored to when the log was actually generated. This
        # matches the convention used by Python's logging module and most

To fully implement the suggestion and avoid relying on the index:

  1. Replace any condition like if i == 6: (or other uses of i that refer to the syslog-without-year format) with a direct comparison to the format string, for example:
    if fmt == "%b %d %H:%M:%S":
        # syslog-without-year special handling
  2. Remove any remaining uses of the loop index i inside this loop, or reintroduce an index only where it is semantically required and not tied to the position of a specific format.

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Two timestamp parsing fixes are applied: JSONLogParser._extract_timestamp now validates that numeric timestamps are positive integers and selects the epoch unit (nanoseconds, microseconds, milliseconds, or seconds) by digit width. _try_parse_datetime in utils.py now re-anchors syslog-style yearless timestamps to the current year instead of defaulting to 1900.

Changes

Timestamp Parsing Precision Fixes

Layer / File(s) Summary
JSON numeric timestamp unit detection
src/log_analyzer_cli/parsers/json_log.py, tests/test_parsers.py
_extract_timestamp rejects non-integer and non-positive numeric values, then selects the divisor (1e9, 1e6, 1e3, or 1) by digit width to handle nanosecond through second epoch timestamps. TestJSONTimestampWidth and TestJSONNumericTimestampWidth cover all four numeric magnitudes, fractional-second string formats, and edge cases where zero or negative values should produce None.
Syslog yearless timestamp year anchoring
src/log_analyzer_cli/utils.py, tests/test_utils.py
_try_parse_datetime uses enumerate to detect the "%b %d %H:%M:%S" syslog format; when it parses with year 1900, the timestamp is re-parsed with the current year prepended. New test_utils.py tests cover current-year anchoring, ISO year preservation, None returns for empty/missing input, and single-digit day parsing.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🐇 Two clocks were ticking wrong before,
One forgot the year, one mixed up the score.
Nanoseconds, microseconds, all in a row,
Syslog now knows which year logs flow.
The rabbit hops on — timestamps glow! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 13.64% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'anchor yearless syslog timestamps to the current year' directly and accurately describes the main objective of the PR—fixing RFC 3164 syslog timestamp parsing by anchoring yearless timestamps to the current year instead of defaulting to 1900.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/utils-syslog-year-fill

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
src/log_analyzer_cli/parsers/json_log.py (1)

11-11: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix CI-blocking flake8 errors in this file before merge.

The current branch still has a failing unused import and visual-indent violation in the metadata comprehension.

Proposed fix
-from log_analyzer_cli.utils import detect_log_level
@@
-        metadata = {k: v for k, v in data.items() 
-                   if k not in self.TIMESTAMP_FIELDS + self.LEVEL_FIELDS + self.MESSAGE_FIELDS}
+        metadata = {
+            k: v
+            for k, v in data.items()
+            if k not in self.TIMESTAMP_FIELDS + self.LEVEL_FIELDS + self.MESSAGE_FIELDS
+        }

Also applies to: 62-63

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/log_analyzer_cli/parsers/json_log.py` at line 11, The file has two flake8
violations that need to be fixed before merge. First, remove the unused import
statement importing detect_log_level from log_analyzer_cli.utils at line 11, as
it is not used anywhere in the file. Second, fix the visual-indent violation in
the metadata comprehension around lines 62-63 by ensuring proper indentation
alignment according to PEP 8 visual indent rules, making sure continuation lines
are properly aligned with the opening delimiter.

Source: Pipeline failures

tests/test_parsers.py (2)

5-5: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove unused pytest import to unblock CI.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_parsers.py` at line 5, Remove the unused `import pytest` statement
from the top of tests/test_parsers.py since it is not referenced anywhere in the
file. Deleting this import statement will clean up the code and resolve any
linting or CI issues related to unused imports.

Source: Pipeline failures


100-100: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Split overlong Apache test lines to satisfy flake8 E501.

Proposed fix
-        line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
+        line = (
+            '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] '
+            '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
+        )
@@
-        line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
+        line = (
+            '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] '
+            '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
+        )

Also applies to: 110-110

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_parsers.py` at line 100, The test line assignments in the Apache
parser tests exceed the maximum line length enforced by flake8 E501. Split the
long string literal assignments for variables like `line` at multiple locations
(including around lines 100 and 110) into multiple shorter lines using implicit
string concatenation or parentheses to break the lines into manageable chunks
that comply with the configured line length limit.

Source: Pipeline failures

src/log_analyzer_cli/utils.py (1)

128-128: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Break the UUID regex line to clear flake8 E501.

Proposed fix
-    pattern = re.sub(r'[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}', '<UUID>', pattern)
+    pattern = re.sub(
+        r'[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}',
+        '<UUID>',
+        pattern,
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/log_analyzer_cli/utils.py` at line 128, The re.sub() function call in the
pattern assignment is exceeding the flake8 line length limit (E501). Break this
long line by extracting the regex pattern into a separate variable on the line
above, or by using parentheses to split the function arguments across multiple
lines. The UUID regex pattern should be assigned to a variable first, then
passed to re.sub() to keep individual lines within the acceptable length limit.

Source: Pipeline failures

🧹 Nitpick comments (2)
src/log_analyzer_cli/utils.py (1)

53-53: ⚡ Quick win

Avoid coupling syslog-year logic to a hardcoded format index.

Using i == 6 is fragile if formats are reordered; key off fmt directly to preserve behavior.

Proposed fix
-    for i, fmt in enumerate(formats):
+    for fmt in formats:
@@
-        if i == 6 and parsed.year == 1900:
+        if fmt == "%b %d %H:%M:%S" and parsed.year == 1900:

Also applies to: 68-68

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/log_analyzer_cli/utils.py` at line 53, The code uses hardcoded format
indices like i == 6 to detect the syslog-year format, which is fragile if the
formats list is reordered. Find all occurrences where i is compared against a
hardcoded value (mentioned at lines around 53 and 68 in the formats enumeration
loop) and replace these index-based checks with direct comparisons against the
fmt variable itself, checking for the specific format string pattern that
represents the syslog-year format instead of relying on its position in the
list.
tests/test_parsers.py (1)

253-274: ⚡ Quick win

Add boundary-width tests for older-but-valid numeric epochs.

Current coverage validates modern 1.7eX examples but misses boundary cases (e.g., 18-digit ns, 15-digit us) that can still be misrouted by threshold logic.

Suggested tests
 class TestJSONNumericTimestampWidth:
@@
+    def test_18_digit_nanoseconds_is_treated_as_nanoseconds(self):
+        parser = JSONLogParser()
+        line = '{"timestamp": 999999999999999999, "level": "INFO", "message": "x"}'
+        entry = parser.parse(line)
+        assert entry is not None
+        assert entry.timestamp is not None
+
+    def test_15_digit_microseconds_is_treated_as_microseconds(self):
+        parser = JSONLogParser()
+        line = '{"timestamp": 999999999999999, "level": "INFO", "message": "x"}'
+        entry = parser.parse(line)
+        assert entry is not None
+        assert entry.timestamp is not None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_parsers.py` around lines 253 - 274, The tests
test_microseconds_is_treated_as_microseconds and
test_nanoseconds_is_treated_as_nanoseconds only validate modern timestamps and
miss boundary cases with older dates that could be misrouted by threshold logic.
Add additional test cases within these methods that use older but valid
timestamps in their respective time scales (e.g., timestamps from earlier years
like 2000 or 1970) to ensure the parser correctly distinguishes between
microsecond and nanosecond formats at the boundaries where they might be
confused.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/log_analyzer_cli/parsers/json_log.py`:
- Around line 92-97: The current magnitude-based thresholds in the timestamp
parsing logic (checking if value >= 1e18, >= 1e15, >= 1e12) fail to distinguish
between microseconds and nanoseconds for values near boundaries, causing values
like 999999999999999999 nanoseconds to be incorrectly treated as microseconds.
Instead of relying solely on magnitude checks, add validation logic that checks
whether the resulting datetime from each conversion (value / 1e9, value / 1e6,
value / 1000) produces a valid timestamp within an acceptable range (e.g.,
between 1970 and a reasonable future year). Return the result only if it falls
within valid bounds, otherwise try the next conversion factor, ensuring boundary
values are correctly classified.

---

Outside diff comments:
In `@src/log_analyzer_cli/parsers/json_log.py`:
- Line 11: The file has two flake8 violations that need to be fixed before
merge. First, remove the unused import statement importing detect_log_level from
log_analyzer_cli.utils at line 11, as it is not used anywhere in the file.
Second, fix the visual-indent violation in the metadata comprehension around
lines 62-63 by ensuring proper indentation alignment according to PEP 8 visual
indent rules, making sure continuation lines are properly aligned with the
opening delimiter.

In `@src/log_analyzer_cli/utils.py`:
- Line 128: The re.sub() function call in the pattern assignment is exceeding
the flake8 line length limit (E501). Break this long line by extracting the
regex pattern into a separate variable on the line above, or by using
parentheses to split the function arguments across multiple lines. The UUID
regex pattern should be assigned to a variable first, then passed to re.sub() to
keep individual lines within the acceptable length limit.

In `@tests/test_parsers.py`:
- Line 5: Remove the unused `import pytest` statement from the top of
tests/test_parsers.py since it is not referenced anywhere in the file. Deleting
this import statement will clean up the code and resolve any linting or CI
issues related to unused imports.
- Line 100: The test line assignments in the Apache parser tests exceed the
maximum line length enforced by flake8 E501. Split the long string literal
assignments for variables like `line` at multiple locations (including around
lines 100 and 110) into multiple shorter lines using implicit string
concatenation or parentheses to break the lines into manageable chunks that
comply with the configured line length limit.

---

Nitpick comments:
In `@src/log_analyzer_cli/utils.py`:
- Line 53: The code uses hardcoded format indices like i == 6 to detect the
syslog-year format, which is fragile if the formats list is reordered. Find all
occurrences where i is compared against a hardcoded value (mentioned at lines
around 53 and 68 in the formats enumeration loop) and replace these index-based
checks with direct comparisons against the fmt variable itself, checking for the
specific format string pattern that represents the syslog-year format instead of
relying on its position in the list.

In `@tests/test_parsers.py`:
- Around line 253-274: The tests test_microseconds_is_treated_as_microseconds
and test_nanoseconds_is_treated_as_nanoseconds only validate modern timestamps
and miss boundary cases with older dates that could be misrouted by threshold
logic. Add additional test cases within these methods that use older but valid
timestamps in their respective time scales (e.g., timestamps from earlier years
like 2000 or 1970) to ensure the parser correctly distinguishes between
microsecond and nanosecond formats at the boundaries where they might be
confused.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ba8aa6c4-0e47-4b88-9046-c284d935ca89

📥 Commits

Reviewing files that changed from the base of the PR and between e93757f and 824006c.

📒 Files selected for processing (4)
  • src/log_analyzer_cli/parsers/json_log.py
  • src/log_analyzer_cli/utils.py
  • tests/test_parsers.py
  • tests/test_utils.py

Comment on lines +92 to 97
if value >= 1e18:
return datetime.fromtimestamp(value / 1e9)
if value >= 1e15:
return datetime.fromtimestamp(value / 1e6)
if value >= 1e12:
return datetime.fromtimestamp(value / 1000)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Numeric “width” detection still misclassifies valid epochs near boundaries.

The current thresholds are still magnitude-based, so values like 18-digit nanoseconds (999999999999999999) fall into the microseconds branch and can parse incorrectly or raise out-of-range errors.

Proposed fix
                 if isinstance(value, (int, float)):
-                    if not isinstance(value, int) or value <= 0:
+                    if not isinstance(value, int) or value <= 0:
                         return None
-                    # Distinguish seconds, milliseconds, microseconds, and
-                    # nanoseconds by the width of the value, not just
-                    # whether it is > 1e12. A value of 1.7e15 (microseconds)
-                    # would be misread as milliseconds in the old code
-                    # and produce a year 53872 ValueError, and a value of
-                    # 1.7e18 (nanoseconds) would silently overflow
-                    # datetime's range.
-                    if value >= 1e18:
-                        return datetime.fromtimestamp(value / 1e9)
-                    if value >= 1e15:
-                        return datetime.fromtimestamp(value / 1e6)
-                    if value >= 1e12:
-                        return datetime.fromtimestamp(value / 1000)
+                    digits = len(str(value))
+                    if digits >= 18:  # ns
+                        return datetime.fromtimestamp(value / 1_000_000_000)
+                    if digits >= 15:  # us
+                        return datetime.fromtimestamp(value / 1_000_000)
+                    if digits >= 12:  # ms
+                        return datetime.fromtimestamp(value / 1_000)
                     return datetime.fromtimestamp(value)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/log_analyzer_cli/parsers/json_log.py` around lines 92 - 97, The current
magnitude-based thresholds in the timestamp parsing logic (checking if value >=
1e18, >= 1e15, >= 1e12) fail to distinguish between microseconds and nanoseconds
for values near boundaries, causing values like 999999999999999999 nanoseconds
to be incorrectly treated as microseconds. Instead of relying solely on
magnitude checks, add validation logic that checks whether the resulting
datetime from each conversion (value / 1e9, value / 1e6, value / 1000) produces
a valid timestamp within an acceptable range (e.g., between 1970 and a
reasonable future year). Return the result only if it falls within valid bounds,
otherwise try the next conversion factor, ensuring boundary values are correctly
classified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant