Skip to content

fix apache parser: match real user field in COMBINED_PATTERN#12

Open
HrachShah wants to merge 3 commits into
mainfrom
patch-apache-combined-user
Open

fix apache parser: match real user field in COMBINED_PATTERN#12
HrachShah wants to merge 3 commits into
mainfrom
patch-apache-combined-user

Conversation

@HrachShah

@HrachShah HrachShah commented Jun 19, 2026

Copy link
Copy Markdown
Owner

The COMBINED_PATTERN regex in parsers/apache.py was using \s+ to match the user field, which only matches whitespace. Real combined-format log lines have a literal - for unauthenticated requests or a username, so the pattern never matched a true combined-format line. parse() then fell back to COMMON_PATTERN, dropping the referer and user_agent metadata that combined format is supposed to expose.

This change uses \S+ (with a trailing \s+ to keep field separation) so the user field matches the same way it does in COMMON_PATTERN. Combined-format lines now get referer and user_agent populated in the entry metadata; common-format lines (which don't have those fields) still match via the optional non-capturing group.

Repro before the fix:

from log_analyzer_cli.parsers import ApacheParser
p = ApacheParser()
line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "https://example.com" "Mozilla/5.0"'
entry = p.parse(line)
# 'referer' and 'user_agent' are missing from entry.metadata

After the fix, entry.metadata has referer and user_agent populated.

Summary by Sourcery

Fix timestamp handling and Apache combined log parsing to correctly extract metadata and produce accurate time-series analyses.

Bug Fixes:

  • Correct Apache combined log pattern so the user field matches real log values and combined-format entries retain referer and user agent metadata.
  • Fix JSON log parser numeric timestamp handling to distinguish seconds, milliseconds, microseconds, and nanoseconds by magnitude and avoid overflows or invalid dates.
  • Anchor yearless syslog-style timestamps to the current year instead of 1900 so time-based reports and aggregations reflect real log times.
  • Ensure invalid or non-positive numeric JSON timestamps do not produce bogus datetimes but instead yield a missing timestamp.

Tests:

  • Add tests for JSON timestamp parsing that cover string and numeric timestamps at second, millisecond, microsecond, and nanosecond precision.
  • Add focused tests for shared timestamp parsing utilities, including syslog-style yearless formats, missing-timestamp scenarios, and edge cases like single-digit days.

Summary by CodeRabbit

Bug Fixes

  • Improved Apache/Nginx log parsing accuracy for user field extraction
  • Enhanced JSON timestamp handling across multiple unit widths (seconds, milliseconds, microseconds, nanoseconds)
  • Fixed syslog-style timestamp year interpretation to use current year instead of 1900

Tests

  • Added comprehensive test coverage for timestamp parsing validation

Zo Bot added 3 commits June 17, 2026 01:51
The JSON log parser used a single > 1e12 branch to decide whether a numeric timestamp was in se
The "%b %d %H:%M:%S" format used by RFC 3164 syslog (e.g. "Mar 20
10:15:32") has no year component. Previously _try_parse_datetime handed
the string straight to datetime.strptime, which fills 1900 as the
default year. That date then propagated through analyze() into
first_seen / last_seen / TimeDistribution, which meant every syslog
entry landed in a single January-1900 bucket in the time-series report
and sorted before every other entry in the file.

Re-parse the same string with the current year prepended ("%Y %b %d
%H:%M:%S") when the result lands on year 1900. This matches the
convention used by Python's logging module and most production syslog
parsers (rsyslog, syslog-ng). Timestamps that already carry a year are
unaffected — the rewrite only runs on the yearless format.

Also added tests/test_utils.py covering: yearless syslog lines anchored
to the current year, single-digit day ("Mar  5"), the no-1900
invariant, that timestamps with a year are unchanged, and that lines
without any timestamp return None.
…r_agent are captured

The user group in COMBINED_PATTERN was the whitespace class \\s+, which only
matches spaces and never the real user field (typically a literal '-' for
unauthenticated requests, or an actual username). The regex therefore never
matched a real combined-format line and parse() silently fell through to
COMMON_PATTERN, losing the referer and user_agent metadata that combined
format is specifically meant to surface.

Match the user field with \\S+ like COMMON_PATTERN does, and keep the
trailing \\s+ so the field stays separated from the bracketed timestamp.
Combined-format lines now produce entries with referer and user_agent
populated, while common-format lines (which have no referer/user_agent to
capture) still match correctly via the optional non-capturing group.
@sourcery-ai

sourcery-ai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Reviewer's Guide

Adjusts timestamp parsing and detection across utilities and the JSON log parser, fixes Apache combined-log user-field matching, and adds targeted tests to prevent regressions in yearless and numeric timestamp handling.

File-Level Changes

Change Details Files
Fix Apache combined log parsing so the user field matches real username / '-' values and enables referer and user_agent extraction for combined-format lines.
  • Change COMBINED_PATTERN user group from a whitespace matcher to a non-whitespace token followed by whitespace, aligning it with COMMON_PATTERN semantics.
  • Ensure combined-format Apache log lines are properly recognized so referer and user-agent fields are populated instead of falling back to the common format.
src/log_analyzer_cli/parsers/apache.py
Anchor yearless (syslog-style) timestamps to the current year instead of 1900 while preserving behavior for timestamps that already include a year.
  • Iterate over known datetime formats with an index so the syslog-style "%b %d %H:%M:%S" format can be detected.
  • When that format matches and yields year 1900, re-parse the timestamp with the current year prepended, falling back to the original parsed value if reparse fails.
  • Add focused tests for parse_timestamp covering yearless syslog-style inputs, non-timestamp lines, and edge cases like single-digit days and whitespace-only lines.
src/log_analyzer_cli/utils.py
tests/test_utils.py
Make JSON numeric timestamp extraction width-aware so seconds, milliseconds, microseconds, and nanoseconds are handled correctly and invalid/degenerate numeric values are ignored.
  • Extend _extract_timestamp to treat non-positive or non-int numeric timestamp values as invalid, returning None instead of raising or mis-parsing.
  • Replace the old >1e12 heuristic with width-based thresholds that map to seconds, milliseconds, microseconds, or nanoseconds and scale values accordingly before calling datetime.fromtimestamp.
  • Add regression tests to confirm JSONLogParser can_parse behavior for various string timestamp precisions and accurate parsing of numeric timestamps at second, millisecond, microsecond, and nanosecond resolutions, including zero/negative timestamps yielding None.
src/log_analyzer_cli/parsers/json_log.py
tests/test_parsers.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Three independent parser bug fixes: the Apache combined-log regex corrects the user capture group from \s+ to \S+; the JSON log parser gains multi-threshold numeric timestamp unit-width detection (nanoseconds/microseconds/milliseconds/seconds) with non-positive value rejection; and _try_parse_datetime now re-anchors syslog-style yearless timestamps from 1900 to the current year. Tests are added for all three fixes.

Changes

Parser and Timestamp Bug Fixes

Layer / File(s) Summary
Apache combined-log user field regex fix
src/log_analyzer_cli/parsers/apache.py
Changes the user capture group in COMBINED_PATTERN from \s+ to \S+, correcting how the user field is extracted from Apache/Nginx combined log lines.
JSON numeric timestamp unit-width detection and tests
src/log_analyzer_cli/parsers/json_log.py, tests/test_parsers.py
Replaces single-threshold numeric timestamp conversion with explicit magnitude checks for nanoseconds, microseconds, milliseconds, and seconds; rejects non-positive values. Tests cover string formats, all four numeric unit widths, and zero/negative edge cases returning None.
Syslog yearless timestamp year anchoring and tests
src/log_analyzer_cli/utils.py, tests/test_utils.py
_try_parse_datetime re-parses syslog-style timestamps that produce year 1900 by prepending the current year. Tests assert current-year anchoring, no 1900 default, ISO year preservation, None for unparseable input, and correct handling of single-digit days.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 A hop through the logs, a fix here and there,
The user field caught by \S+ with care.
Timestamps in nanos? In millis? In years?
We anchor to now and banish old fears.
No more 1900 — the rabbit jumps free! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 13.64% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly and clearly summarizes the main change: fixing the Apache parser's COMBINED_PATTERN regex to correctly match the user field.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch patch-apache-combined-user

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • In _try_parse_datetime, the special handling for the syslog-style format relies on i == 6, which is fragile if the formats list changes ordering; it would be more robust to key this behavior off the specific format string (e.g. fmt == "%b %d %H:%M:%S").
  • The updated _extract_timestamp in JSONLogParser now ignores float epoch timestamps (isinstance(value, float) previously worked, but now any non-int numeric returns None); consider preserving support for float seconds while still doing width-based unit detection for large integers.
  • The tests asserting ts.year == datetime.now().year in test_utils.py can be flaky around the year boundary (e.g., log line timestamp just before midnight and datetime.now() just after); consider injecting a fixed reference year or mocking the current time instead of calling datetime.now() directly.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `_try_parse_datetime`, the special handling for the syslog-style format relies on `i == 6`, which is fragile if the `formats` list changes ordering; it would be more robust to key this behavior off the specific format string (e.g. `fmt == "%b %d %H:%M:%S"`).
- The updated `_extract_timestamp` in `JSONLogParser` now ignores float epoch timestamps (`isinstance(value, float)` previously worked, but now any non-`int` numeric returns `None`); consider preserving support for float seconds while still doing width-based unit detection for large integers.
- The tests asserting `ts.year == datetime.now().year` in `test_utils.py` can be flaky around the year boundary (e.g., log line timestamp just before midnight and `datetime.now()` just after); consider injecting a fixed reference year or mocking the current time instead of calling `datetime.now()` directly.

## Individual Comments

### Comment 1
<location path="src/log_analyzer_cli/utils.py" line_range="53-62" />
<code_context>
+    for i, fmt in enumerate(formats):
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Avoid depending on the index of the format list when applying the syslog year fix.

Relying on `if i == 6` couples this behavior to the current `formats` ordering. If formats are added or reordered, the condition may silently stop targeting the syslog format (or target the wrong one). Instead, key this logic off the `fmt` value itself (e.g., compare to the `%b %d %H:%M:%S` pattern) or a named constant representing the syslog format.

Suggested implementation:

```python
    for fmt in formats:

```

```python
        # The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no
        # year, so datetime.strptime fills in 1900 as the default. Re-parse
        # the same string with the current year prepended so the resulting
        # timestamp is anchored to when the log was actually generated. This
        # matches the convention used by Python's logging module and most

```

```python
        if fmt == "%b %d %H:%M:%S":

```

If your `formats` list is defined elsewhere and you’d like to avoid hardcoding the syslog pattern in multiple places, consider introducing a module-level constant like `SYSLOG_FORMAT = "%b %d %H:%M:%S"` and using that both in the `formats` list and in this conditional (`if fmt == SYSLOG_FORMAT:`).
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +53 to +62
for i, fmt in enumerate(formats):
try:
return datetime.strptime(ts_str, fmt)
parsed = datetime.strptime(ts_str, fmt)
except ValueError:
continue
# The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no
# year, so datetime.strptime fills in 1900 as the default. Re-parse
# the same string with the current year prepended so the resulting
# timestamp is anchored to when the log was actually generated. This
# matches the convention used by Python's logging module and most

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Avoid depending on the index of the format list when applying the syslog year fix.

Relying on if i == 6 couples this behavior to the current formats ordering. If formats are added or reordered, the condition may silently stop targeting the syslog format (or target the wrong one). Instead, key this logic off the fmt value itself (e.g., compare to the %b %d %H:%M:%S pattern) or a named constant representing the syslog format.

Suggested implementation:

    for fmt in formats:
        # The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no
        # year, so datetime.strptime fills in 1900 as the default. Re-parse
        # the same string with the current year prepended so the resulting
        # timestamp is anchored to when the log was actually generated. This
        # matches the convention used by Python's logging module and most
        if fmt == "%b %d %H:%M:%S":

If your formats list is defined elsewhere and you’d like to avoid hardcoding the syslog pattern in multiple places, consider introducing a module-level constant like SYSLOG_FORMAT = "%b %d %H:%M:%S" and using that both in the formats list and in this conditional (if fmt == SYSLOG_FORMAT:).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/log_analyzer_cli/utils.py (1)

53-78: ⚡ Quick win

Use the format string instead of index to trigger year anchoring.

Line 68 depends on i == 6, which can silently break if formats order changes. Match on fmt == "%b %d %H:%M:%S" instead.

Proposed refactor
-    for i, fmt in enumerate(formats):
+    for fmt in formats:
         try:
             parsed = datetime.strptime(ts_str, fmt)
         except ValueError:
             continue
@@
-        if i == 6 and parsed.year == 1900:
+        if fmt == "%b %d %H:%M:%S" and parsed.year == 1900:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/log_analyzer_cli/utils.py` around lines 53 - 78, The condition checking i
== 6 to trigger year anchoring for syslog-style timestamps is fragile because it
depends on the index position in the formats list. If the formats list order
changes, the special handling will silently fail to trigger. Replace the
index-based check with a direct format string comparison by changing the
condition from i == 6 and parsed.year == 1900 to fmt == "%b %d %H:%M:%S" and
parsed.year == 1900. This makes the code robust to changes in the formats list
order.
tests/test_parsers.py (1)

187-200: ⚡ Quick win

Assert parsed timestamps, not only JSON validity.

can_parse() only checks that the line is syntactically JSON, so these tests would still pass if timestamp parsing for Z, milliseconds, or microseconds regressed. Parse the entry and assert timestamp / microsecond instead.

🧪 Proposed test strengthening
     def test_seconds(self):
         parser = JSONLogParser()
         line = '{"timestamp": "2025-03-20T10:15:32Z", "level": "INFO", "message": "Started"}'
-        assert parser.can_parse(line) is True
+        entry = parser.parse(line)
+        assert entry is not None
+        assert entry.timestamp is not None
+        assert entry.timestamp.microsecond == 0
 
     def test_milliseconds(self):
         parser = JSONLogParser()
         line = '{"timestamp": "2025-03-20T10:15:32.123Z", "level": "INFO", "message": "Started"}'
-        assert parser.can_parse(line) is True
+        entry = parser.parse(line)
+        assert entry is not None
+        assert entry.timestamp is not None
+        assert entry.timestamp.microsecond == 123000
 
     def test_microseconds(self):
         parser = JSONLogParser()
         line = '{"timestamp": "2025-03-20T10:15:32.123456Z", "level": "INFO", "message": "Started"}'
-        assert parser.can_parse(line) is True
+        entry = parser.parse(line)
+        assert entry is not None
+        assert entry.timestamp is not None
+        assert entry.timestamp.microsecond == 123456
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_parsers.py` around lines 187 - 200, The test methods test_seconds,
test_milliseconds, and test_microseconds in the JSONLogParser test class are
only validating JSON syntax through can_parse() but not actually asserting
correct timestamp parsing. For each test method, replace the assertion on
can_parse() with a call to the parser's parse method on the JSON line, then
assert that the parsed entry's timestamp or microsecond value matches the
expected precision level (seconds, milliseconds with 3 decimal places, or
microseconds with 6 decimal places respectively) to ensure timestamp parsing
doesn't regress.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/log_analyzer_cli/parsers/json_log.py`:
- Around line 82-98: The timestamp parsing logic needs to reject boolean values
explicitly and handle conversion errors gracefully. First, add an additional
isinstance check for bool before the current int/float type check to prevent
True/False from being treated as 1/0. Second, wrap each datetime.fromtimestamp
call (for the 1e18, 1e15, 1e12, and direct value cases) in a try/except block to
catch ValueError exceptions from year out of range or overflow conditions,
returning None when parsing fails so invalid timestamps do not abort the entire
parse operation.

In `@tests/test_utils.py`:
- Line 82: The file tests/test_utils.py is missing a trailing newline at the end
of the file after the assertion statement assert ts.year == datetime.now().year.
Add a newline character at the end of the file to comply with flake8 W292
requirements. This is a common style convention where all Python files should
end with a single newline character.

---

Nitpick comments:
In `@src/log_analyzer_cli/utils.py`:
- Around line 53-78: The condition checking i == 6 to trigger year anchoring for
syslog-style timestamps is fragile because it depends on the index position in
the formats list. If the formats list order changes, the special handling will
silently fail to trigger. Replace the index-based check with a direct format
string comparison by changing the condition from i == 6 and parsed.year == 1900
to fmt == "%b %d %H:%M:%S" and parsed.year == 1900. This makes the code robust
to changes in the formats list order.

In `@tests/test_parsers.py`:
- Around line 187-200: The test methods test_seconds, test_milliseconds, and
test_microseconds in the JSONLogParser test class are only validating JSON
syntax through can_parse() but not actually asserting correct timestamp parsing.
For each test method, replace the assertion on can_parse() with a call to the
parser's parse method on the JSON line, then assert that the parsed entry's
timestamp or microsecond value matches the expected precision level (seconds,
milliseconds with 3 decimal places, or microseconds with 6 decimal places
respectively) to ensure timestamp parsing doesn't regress.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b68509dc-28c3-4337-9dcf-5af3544ff667

📥 Commits

Reviewing files that changed from the base of the PR and between e93757f and 3c5de8f.

📒 Files selected for processing (5)
  • src/log_analyzer_cli/parsers/apache.py
  • src/log_analyzer_cli/parsers/json_log.py
  • src/log_analyzer_cli/utils.py
  • tests/test_parsers.py
  • tests/test_utils.py

Comment on lines 82 to 98
if isinstance(value, (int, float)):
if value > 1e12:
if not isinstance(value, int) or value <= 0:
return None
# Distinguish seconds, milliseconds, microseconds, and
# nanoseconds by the width of the value, not just
# whether it is > 1e12. A value of 1.7e15 (microseconds)
# would be misread as milliseconds in the old code
# and produce a year 53872 ValueError, and a value of
# 1.7e18 (nanoseconds) would silently overflow
# datetime's range.
if value >= 1e18:
return datetime.fromtimestamp(value / 1e9)
if value >= 1e15:
return datetime.fromtimestamp(value / 1e6)
if value >= 1e12:
return datetime.fromtimestamp(value / 1000)
return datetime.fromtimestamp(value)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Demonstrate bool handling and pre-2001 millisecond misclassification risk.

python - <<'PY'
from datetime import datetime

values = {
    "bool_true": True,
    "pre_2001_milliseconds": 946684800000,
    "huge_nanoseconds": 999999999999999999999999,
}

for name, value in values.items():
    print(f"\n{name} = {value!r}")
    print("isinstance(value, int):", isinstance(value, int))
    try:
        print("as seconds:", datetime.fromtimestamp(value))
    except Exception as exc:
        print("as seconds raised:", type(exc).__name__, exc)
    if isinstance(value, int):
        try:
            print("as milliseconds:", datetime.fromtimestamp(value / 1000))
        except Exception as exc:
            print("as milliseconds raised:", type(exc).__name__, exc)
PY

Repository: HrachShah/log-analyzer-cli

Length of output: 585


🏁 Script executed:

cd /tmp/repo && find . -name "json_log.py" -type f 2>/dev/null | head -5

Repository: HrachShah/log-analyzer-cli

Length of output: 130


🏁 Script executed:

cd /tmp/repo && wc -l src/log_analyzer_cli/parsers/json_log.py 2>/dev/null || echo "File not found"

Repository: HrachShah/log-analyzer-cli

Length of output: 163


🏁 Script executed:

cd /tmp/repo && sed -n '75,105p' src/log_analyzer_cli/parsers/json_log.py 2>/dev/null

Repository: HrachShah/log-analyzer-cli

Length of output: 130


🏁 Script executed:

find . -name "json_log.py" -type f

Repository: HrachShah/log-analyzer-cli

Length of output: 113


🏁 Script executed:

wc -l src/log_analyzer_cli/parsers/json_log.py

Repository: HrachShah/log-analyzer-cli

Length of output: 115


🏁 Script executed:

sed -n '75,105p' src/log_analyzer_cli/parsers/json_log.py

Repository: HrachShah/log-analyzer-cli

Length of output: 1577


🏁 Script executed:

sed -n '65,110p' src/log_analyzer_cli/parsers/json_log.py

Repository: HrachShah/log-analyzer-cli

Length of output: 2051


🏁 Script executed:

rg -A 15 "def _parse_timestamp" src/log_analyzer_cli/parsers/json_log.py

Repository: HrachShah/log-analyzer-cli

Length of output: 589


🏁 Script executed:

sed -n '50,75p' src/log_analyzer_cli/parsers/json_log.py

Repository: HrachShah/log-analyzer-cli

Length of output: 896


🏁 Script executed:

rg "try:|except:" -B 2 -A 5 src/log_analyzer_cli/parsers/json_log.py | head -40

Repository: HrachShah/log-analyzer-cli

Length of output: 962


🏁 Script executed:

python3 - <<'PY'
# Verify the specific issues with current code behavior
from datetime import datetime

# Issue 1: Boolean passthrough
print("=== Issue 1: Boolean passthrough ===")
value = True
print(f"value = {value!r}")
print(f"isinstance(value, (int, float)): {isinstance(value, (int, float))}")
print(f"isinstance(value, int): {isinstance(value, int)}")
print(f"value <= 0: {value <= 0}")
print(f"Would pass through check: {isinstance(value, (int, float)) and isinstance(value, int) and value > 0}")
if isinstance(value, (int, float)) and isinstance(value, int) and value > 0:
    try:
        result = datetime.fromtimestamp(value)
        print(f"Result: {result}")
    except Exception as e:
        print(f"Error: {e}")

# Issue 2: Pre-2001 milliseconds misclassification
print("\n=== Issue 2: Pre-2001 milliseconds ===")
value = 946684800000
print(f"value = {value!r} (milliseconds for 2000-01-01)")
print(f"isinstance(value, (int, float)): {isinstance(value, (int, float))}")
print(f"isinstance(value, int): {isinstance(value, int)}")
print(f"value <= 0: {value <= 0}")
print(f"value >= 1e12: {value >= 1e12}")
if isinstance(value, (int, float)) and isinstance(value, int) and value > 0 and value >= 1e12:
    try:
        result = datetime.fromtimestamp(value / 1000)
        print(f"As milliseconds (correct): {result}")
    except Exception as e:
        print(f"Error: {e}")
    try:
        result = datetime.fromtimestamp(value)
        print(f"As seconds (would fail): {result}")
    except Exception as e:
        print(f"Error as seconds: {type(e).__name__}: {e}")

# Issue 3: Overflow risk
print("\n=== Issue 3: Overflow on huge nanoseconds ===")
value = 999999999999999999999999
print(f"value = {value!r}")
print(f"value >= 1e18: {value >= 1e18}")
if isinstance(value, (int, float)) and isinstance(value, int) and value > 0 and value >= 1e18:
    try:
        result = datetime.fromtimestamp(value / 1e9)
        print(f"Result: {result}")
    except Exception as e:
        print(f"Error: {type(e).__name__}: {e}")
PY

Repository: HrachShah/log-analyzer-cli

Length of output: 610


Handle booleans and add error recovery for timestamp parsing.

The current logic has three issues:

  1. Boolean passthrough: isinstance(True, int) returns True in Python, so JSON true values pass through and become 1970-01-01 00:00:01.

  2. Pre-2001 milliseconds misclassified: Values < 1e12 (e.g., 946684800000 for 2000-01-01 in milliseconds) are treated as seconds and raise ValueError: year out of range.

  3. Overflow on large values: Extremely large nanosecond values (≥ 1e18) cause ValueError: year out of range when divided by 1e9.

Additionally, failed datetime.fromtimestamp() calls are not caught, causing one bad timestamp to abort the entire parse operation.

Add explicit boolean rejection with isinstance(value, bool) and wrap conversion attempts in try/except to gracefully skip invalid timestamps:

Proposed approach
                if isinstance(value, (int, float)):
-                    if not isinstance(value, int) or value <= 0:
+                    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                         return None
-                    # Distinguish seconds, milliseconds, microseconds, and
-                    # nanoseconds by the width of the value, not just
-                    # whether it is > 1e12. A value of 1.7e15 (microseconds)
-                    # would be misread as milliseconds in the old code
-                    # and produce a year 53872 ValueError, and a value of
-                    # 1.7e18 (nanoseconds) would silently overflow
-                    # datetime's range.
-                    if value >= 1e18:
-                        return datetime.fromtimestamp(value / 1e9)
-                    if value >= 1e15:
-                        return datetime.fromtimestamp(value / 1e6)
-                    if value >= 1e12:
-                        return datetime.fromtimestamp(value / 1000)
-                    return datetime.fromtimestamp(value)
+
+                    scales = []
+                    if value >= 1_000_000_000_000_000_000:
+                        scales = [1_000_000_000]
+                    elif value >= 1_000_000_000_000_000:
+                        scales = [1_000_000, 1_000_000_000]
+                    elif value >= 1_000_000_000_000:
+                        scales = [1_000, 1_000_000]
+                    else:
+                        scales = [1, 1_000]
+
+                    for scale in scales:
+                        try:
+                            return datetime.fromtimestamp(value / scale)
+                        except (OSError, OverflowError, ValueError):
+                            continue
+                    return None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/log_analyzer_cli/parsers/json_log.py` around lines 82 - 98, The timestamp
parsing logic needs to reject boolean values explicitly and handle conversion
errors gracefully. First, add an additional isinstance check for bool before the
current int/float type check to prevent True/False from being treated as 1/0.
Second, wrap each datetime.fromtimestamp call (for the 1e18, 1e15, 1e12, and
direct value cases) in a try/except block to catch ValueError exceptions from
year out of range or overflow conditions, returning None when parsing fails so
invalid timestamps do not abort the entire parse operation.

Comment thread tests/test_utils.py
ts = parse_timestamp(line)
assert ts is not None
assert ts.day == 5
assert ts.year == datetime.now().year No newline at end of file

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a trailing newline at EOF to clear flake8 W292.

CI reports W292 on Line 82. Please add a newline at end of file.

🧰 Tools
🪛 GitHub Actions: CI / 1_test (3.10).txt

[warning] 82-82: flake8 (W292): no newline at end of file

🪛 GitHub Actions: CI / test (3.10)

[warning] 82-82: flake8 (W292): no newline at end of file

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_utils.py` at line 82, The file tests/test_utils.py is missing a
trailing newline at the end of the file after the assertion statement assert
ts.year == datetime.now().year. Add a newline character at the end of the file
to comply with flake8 W292 requirements. This is a common style convention where
all Python files should end with a single newline character.

Source: Pipeline failures

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant