fix apache parser: match real user field in COMBINED_PATTERN by HrachShah · Pull Request #12 · HrachShah/log-analyzer-cli

HrachShah · 2026-06-19T09:46:22Z

The COMBINED_PATTERN regex in parsers/apache.py was using \s+ to match the user field, which only matches whitespace. Real combined-format log lines have a literal - for unauthenticated requests or a username, so the pattern never matched a true combined-format line. parse() then fell back to COMMON_PATTERN, dropping the referer and user_agent metadata that combined format is supposed to expose.

This change uses \S+ (with a trailing \s+ to keep field separation) so the user field matches the same way it does in COMMON_PATTERN. Combined-format lines now get referer and user_agent populated in the entry metadata; common-format lines (which don't have those fields) still match via the optional non-capturing group.

Repro before the fix:

from log_analyzer_cli.parsers import ApacheParser
p = ApacheParser()
line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "https://example.com" "Mozilla/5.0"'
entry = p.parse(line)
# 'referer' and 'user_agent' are missing from entry.metadata

After the fix, entry.metadata has referer and user_agent populated.

Summary by Sourcery

Fix timestamp handling and Apache combined log parsing to correctly extract metadata and produce accurate time-series analyses.

Bug Fixes:

Correct Apache combined log pattern so the user field matches real log values and combined-format entries retain referer and user agent metadata.
Fix JSON log parser numeric timestamp handling to distinguish seconds, milliseconds, microseconds, and nanoseconds by magnitude and avoid overflows or invalid dates.
Anchor yearless syslog-style timestamps to the current year instead of 1900 so time-based reports and aggregations reflect real log times.
Ensure invalid or non-positive numeric JSON timestamps do not produce bogus datetimes but instead yield a missing timestamp.

Tests:

Add tests for JSON timestamp parsing that cover string and numeric timestamps at second, millisecond, microsecond, and nanosecond precision.
Add focused tests for shared timestamp parsing utilities, including syslog-style yearless formats, missing-timestamp scenarios, and edge cases like single-digit days.

Summary by CodeRabbit

Bug Fixes

Improved Apache/Nginx log parsing accuracy for user field extraction
Enhanced JSON timestamp handling across multiple unit widths (seconds, milliseconds, microseconds, nanoseconds)
Fixed syslog-style timestamp year interpretation to use current year instead of 1900

Tests

Added comprehensive test coverage for timestamp parsing validation

The JSON log parser used a single > 1e12 branch to decide whether a numeric timestamp was in se

The "%b %d %H:%M:%S" format used by RFC 3164 syslog (e.g. "Mar 20 10:15:32") has no year component. Previously _try_parse_datetime handed the string straight to datetime.strptime, which fills 1900 as the default year. That date then propagated through analyze() into first_seen / last_seen / TimeDistribution, which meant every syslog entry landed in a single January-1900 bucket in the time-series report and sorted before every other entry in the file. Re-parse the same string with the current year prepended ("%Y %b %d %H:%M:%S") when the result lands on year 1900. This matches the convention used by Python's logging module and most production syslog parsers (rsyslog, syslog-ng). Timestamps that already carry a year are unaffected — the rewrite only runs on the yearless format. Also added tests/test_utils.py covering: yearless syslog lines anchored to the current year, single-digit day ("Mar 5"), the no-1900 invariant, that timestamps with a year are unchanged, and that lines without any timestamp return None.

…r_agent are captured The user group in COMBINED_PATTERN was the whitespace class \\s+, which only matches spaces and never the real user field (typically a literal '-' for unauthenticated requests, or an actual username). The regex therefore never matched a real combined-format line and parse() silently fell through to COMMON_PATTERN, losing the referer and user_agent metadata that combined format is specifically meant to surface. Match the user field with \\S+ like COMMON_PATTERN does, and keep the trailing \\s+ so the field stays separated from the bracketed timestamp. Combined-format lines now produce entries with referer and user_agent populated, while common-format lines (which have no referer/user_agent to capture) still match correctly via the optional non-capturing group.

sourcery-ai · 2026-06-19T09:46:29Z

Reviewer's Guide

Adjusts timestamp parsing and detection across utilities and the JSON log parser, fixes Apache combined-log user-field matching, and adds targeted tests to prevent regressions in yearless and numeric timestamp handling.

File-Level Changes

Change	Details	Files
Fix Apache combined log parsing so the user field matches real username / '-' values and enables referer and user_agent extraction for combined-format lines.	Change COMBINED_PATTERN user group from a whitespace matcher to a non-whitespace token followed by whitespace, aligning it with COMMON_PATTERN semantics. Ensure combined-format Apache log lines are properly recognized so referer and user-agent fields are populated instead of falling back to the common format.	`src/log_analyzer_cli/parsers/apache.py`
Anchor yearless (syslog-style) timestamps to the current year instead of 1900 while preserving behavior for timestamps that already include a year.	Iterate over known datetime formats with an index so the syslog-style "%b %d %H:%M:%S" format can be detected. When that format matches and yields year 1900, re-parse the timestamp with the current year prepended, falling back to the original parsed value if reparse fails. Add focused tests for parse_timestamp covering yearless syslog-style inputs, non-timestamp lines, and edge cases like single-digit days and whitespace-only lines.	`src/log_analyzer_cli/utils.py` `tests/test_utils.py`
Make JSON numeric timestamp extraction width-aware so seconds, milliseconds, microseconds, and nanoseconds are handled correctly and invalid/degenerate numeric values are ignored.	Extend _extract_timestamp to treat non-positive or non-int numeric timestamp values as invalid, returning None instead of raising or mis-parsing. Replace the old >1e12 heuristic with width-based thresholds that map to seconds, milliseconds, microseconds, or nanoseconds and scale values accordingly before calling datetime.fromtimestamp. Add regression tests to confirm JSONLogParser can_parse behavior for various string timestamp precisions and accurate parsing of numeric timestamps at second, millisecond, microsecond, and nanosecond resolutions, including zero/negative timestamps yielding None.	`src/log_analyzer_cli/parsers/json_log.py` `tests/test_parsers.py`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

coderabbitai · 2026-06-19T09:46:35Z

📝 Walkthrough

Walkthrough

Three independent parser bug fixes: the Apache combined-log regex corrects the user capture group from \s+ to \S+; the JSON log parser gains multi-threshold numeric timestamp unit-width detection (nanoseconds/microseconds/milliseconds/seconds) with non-positive value rejection; and _try_parse_datetime now re-anchors syslog-style yearless timestamps from 1900 to the current year. Tests are added for all three fixes.

Changes

Parser and Timestamp Bug Fixes

Layer / File(s)	Summary
Apache combined-log user field regex fix `src/log_analyzer_cli/parsers/apache.py`	Changes the `user` capture group in `COMBINED_PATTERN` from `\s+` to `\S+`, correcting how the user field is extracted from Apache/Nginx combined log lines.
JSON numeric timestamp unit-width detection and tests `src/log_analyzer_cli/parsers/json_log.py`, `tests/test_parsers.py`	Replaces single-threshold numeric timestamp conversion with explicit magnitude checks for nanoseconds, microseconds, milliseconds, and seconds; rejects non-positive values. Tests cover string formats, all four numeric unit widths, and zero/negative edge cases returning `None`.
Syslog yearless timestamp year anchoring and tests `src/log_analyzer_cli/utils.py`, `tests/test_utils.py`	`_try_parse_datetime` re-parses syslog-style timestamps that produce year 1900 by prepending the current year. Tests assert current-year anchoring, no 1900 default, ISO year preservation, `None` for unparseable input, and correct handling of single-digit days.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 A hop through the logs, a fix here and there,
The user field caught by \S+ with care.
Timestamps in nanos? In millis? In years?
We anchor to now and banish old fears.
No more 1900 — the rabbit jumps free! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 13.64% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title directly and clearly summarizes the main change: fixing the Apache parser's COMBINED_PATTERN regex to correctly match the user field.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch patch-apache-combined-user

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

In _try_parse_datetime, the special handling for the syslog-style format relies on i == 6, which is fragile if the formats list changes ordering; it would be more robust to key this behavior off the specific format string (e.g. fmt == "%b %d %H:%M:%S").
The updated _extract_timestamp in JSONLogParser now ignores float epoch timestamps (isinstance(value, float) previously worked, but now any non-int numeric returns None); consider preserving support for float seconds while still doing width-based unit detection for large integers.
The tests asserting ts.year == datetime.now().year in test_utils.py can be flaky around the year boundary (e.g., log line timestamp just before midnight and datetime.now() just after); consider injecting a fixed reference year or mocking the current time instead of calling datetime.now() directly.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In `_try_parse_datetime`, the special handling for the syslog-style format relies on `i == 6`, which is fragile if the `formats` list changes ordering; it would be more robust to key this behavior off the specific format string (e.g. `fmt == "%b %d %H:%M:%S"`).
- The updated `_extract_timestamp` in `JSONLogParser` now ignores float epoch timestamps (`isinstance(value, float)` previously worked, but now any non-`int` numeric returns `None`); consider preserving support for float seconds while still doing width-based unit detection for large integers.
- The tests asserting `ts.year == datetime.now().year` in `test_utils.py` can be flaky around the year boundary (e.g., log line timestamp just before midnight and `datetime.now()` just after); consider injecting a fixed reference year or mocking the current time instead of calling `datetime.now()` directly.

## Individual Comments

### Comment 1
<location path="src/log_analyzer_cli/utils.py" line_range="53-62" />
<code_context>
+    for i, fmt in enumerate(formats):
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Avoid depending on the index of the format list when applying the syslog year fix.

Relying on `if i == 6` couples this behavior to the current `formats` ordering. If formats are added or reordered, the condition may silently stop targeting the syslog format (or target the wrong one). Instead, key this logic off the `fmt` value itself (e.g., compare to the `%b %d %H:%M:%S` pattern) or a named constant representing the syslog format.

Suggested implementation:

```python
    for fmt in formats:

```

```python
        # The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no
        # year, so datetime.strptime fills in 1900 as the default. Re-parse
        # the same string with the current year prepended so the resulting
        # timestamp is anchored to when the log was actually generated. This
        # matches the convention used by Python's logging module and most

```

```python
        if fmt == "%b %d %H:%M:%S":

```

If your `formats` list is defined elsewhere and you’d like to avoid hardcoding the syslog pattern in multiple places, consider introducing a module-level constant like `SYSLOG_FORMAT = "%b %d %H:%M:%S"` and using that both in the `formats` list and in this conditional (`if fmt == SYSLOG_FORMAT:`).
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2026-06-19T09:47:30Z

+    for i, fmt in enumerate(formats):
        try:
-            return datetime.strptime(ts_str, fmt)
+            parsed = datetime.strptime(ts_str, fmt)
        except ValueError:
            continue
+        # The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no
+        # year, so datetime.strptime fills in 1900 as the default. Re-parse
+        # the same string with the current year prepended so the resulting
+        # timestamp is anchored to when the log was actually generated. This
+        # matches the convention used by Python's logging module and most


suggestion (bug_risk): Avoid depending on the index of the format list when applying the syslog year fix.

Relying on if i == 6 couples this behavior to the current formats ordering. If formats are added or reordered, the condition may silently stop targeting the syslog format (or target the wrong one). Instead, key this logic off the fmt value itself (e.g., compare to the %b %d %H:%M:%S pattern) or a named constant representing the syslog format.

Suggested implementation:

for fmt in formats:

# The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no # year, so datetime.strptime fills in 1900 as the default. Re-parse # the same string with the current year prepended so the resulting # timestamp is anchored to when the log was actually generated. This # matches the convention used by Python's logging module and most

if fmt == "%b %d %H:%M:%S":

If your formats list is defined elsewhere and you’d like to avoid hardcoding the syslog pattern in multiple places, consider introducing a module-level constant like SYSLOG_FORMAT = "%b %d %H:%M:%S" and using that both in the formats list and in this conditional (if fmt == SYSLOG_FORMAT:).

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

src/log_analyzer_cli/utils.py (1)

53-78: ⚡ Quick win

Use the format string instead of index to trigger year anchoring.

Line 68 depends on i == 6, which can silently break if formats order changes. Match on fmt == "%b %d %H:%M:%S" instead.

Proposed refactor

-    for i, fmt in enumerate(formats):
+    for fmt in formats:
         try:
             parsed = datetime.strptime(ts_str, fmt)
         except ValueError:
             continue
@@
-        if i == 6 and parsed.year == 1900:
+        if fmt == "%b %d %H:%M:%S" and parsed.year == 1900:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/log_analyzer_cli/utils.py` around lines 53 - 78, The condition checking i
== 6 to trigger year anchoring for syslog-style timestamps is fragile because it
depends on the index position in the formats list. If the formats list order
changes, the special handling will silently fail to trigger. Replace the
index-based check with a direct format string comparison by changing the
condition from i == 6 and parsed.year == 1900 to fmt == "%b %d %H:%M:%S" and
parsed.year == 1900. This makes the code robust to changes in the formats list
order.

tests/test_parsers.py (1)

187-200: ⚡ Quick win

Assert parsed timestamps, not only JSON validity.

can_parse() only checks that the line is syntactically JSON, so these tests would still pass if timestamp parsing for Z, milliseconds, or microseconds regressed. Parse the entry and assert timestamp / microsecond instead.

🧪 Proposed test strengthening

     def test_seconds(self):
         parser = JSONLogParser()
         line = '{"timestamp": "2025-03-20T10:15:32Z", "level": "INFO", "message": "Started"}'
-        assert parser.can_parse(line) is True
+        entry = parser.parse(line)
+        assert entry is not None
+        assert entry.timestamp is not None
+        assert entry.timestamp.microsecond == 0
 
     def test_milliseconds(self):
         parser = JSONLogParser()
         line = '{"timestamp": "2025-03-20T10:15:32.123Z", "level": "INFO", "message": "Started"}'
-        assert parser.can_parse(line) is True
+        entry = parser.parse(line)
+        assert entry is not None
+        assert entry.timestamp is not None
+        assert entry.timestamp.microsecond == 123000
 
     def test_microseconds(self):
         parser = JSONLogParser()
         line = '{"timestamp": "2025-03-20T10:15:32.123456Z", "level": "INFO", "message": "Started"}'
-        assert parser.can_parse(line) is True
+        entry = parser.parse(line)
+        assert entry is not None
+        assert entry.timestamp is not None
+        assert entry.timestamp.microsecond == 123456

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_parsers.py` around lines 187 - 200, The test methods test_seconds,
test_milliseconds, and test_microseconds in the JSONLogParser test class are
only validating JSON syntax through can_parse() but not actually asserting
correct timestamp parsing. For each test method, replace the assertion on
can_parse() with a call to the parser's parse method on the JSON line, then
assert that the parsed entry's timestamp or microsecond value matches the
expected precision level (seconds, milliseconds with 3 decimal places, or
microseconds with 6 decimal places respectively) to ensure timestamp parsing
doesn't regress.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/log_analyzer_cli/parsers/json_log.py`:
- Around line 82-98: The timestamp parsing logic needs to reject boolean values
explicitly and handle conversion errors gracefully. First, add an additional
isinstance check for bool before the current int/float type check to prevent
True/False from being treated as 1/0. Second, wrap each datetime.fromtimestamp
call (for the 1e18, 1e15, 1e12, and direct value cases) in a try/except block to
catch ValueError exceptions from year out of range or overflow conditions,
returning None when parsing fails so invalid timestamps do not abort the entire
parse operation.

In `@tests/test_utils.py`:
- Line 82: The file tests/test_utils.py is missing a trailing newline at the end
of the file after the assertion statement assert ts.year == datetime.now().year.
Add a newline character at the end of the file to comply with flake8 W292
requirements. This is a common style convention where all Python files should
end with a single newline character.

---

Nitpick comments:
In `@src/log_analyzer_cli/utils.py`:
- Around line 53-78: The condition checking i == 6 to trigger year anchoring for
syslog-style timestamps is fragile because it depends on the index position in
the formats list. If the formats list order changes, the special handling will
silently fail to trigger. Replace the index-based check with a direct format
string comparison by changing the condition from i == 6 and parsed.year == 1900
to fmt == "%b %d %H:%M:%S" and parsed.year == 1900. This makes the code robust
to changes in the formats list order.

In `@tests/test_parsers.py`:
- Around line 187-200: The test methods test_seconds, test_milliseconds, and
test_microseconds in the JSONLogParser test class are only validating JSON
syntax through can_parse() but not actually asserting correct timestamp parsing.
For each test method, replace the assertion on can_parse() with a call to the
parser's parse method on the JSON line, then assert that the parsed entry's
timestamp or microsecond value matches the expected precision level (seconds,
milliseconds with 3 decimal places, or microseconds with 6 decimal places
respectively) to ensure timestamp parsing doesn't regress.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b68509dc-28c3-4337-9dcf-5af3544ff667

📥 Commits

Reviewing files that changed from the base of the PR and between e93757f and 3c5de8f.

📒 Files selected for processing (5)

src/log_analyzer_cli/parsers/apache.py
src/log_analyzer_cli/parsers/json_log.py
src/log_analyzer_cli/utils.py
tests/test_parsers.py
tests/test_utils.py

coderabbitai · 2026-06-19T09:53:10Z

                if isinstance(value, (int, float)):
-                    if value > 1e12:
+                    if not isinstance(value, int) or value <= 0:
+                        return None
+                    # Distinguish seconds, milliseconds, microseconds, and
+                    # nanoseconds by the width of the value, not just
+                    # whether it is > 1e12. A value of 1.7e15 (microseconds)
+                    # would be misread as milliseconds in the old code
+                    # and produce a year 53872 ValueError, and a value of
+                    # 1.7e18 (nanoseconds) would silently overflow
+                    # datetime's range.
+                    if value >= 1e18:
+                        return datetime.fromtimestamp(value / 1e9)
+                    if value >= 1e15:
+                        return datetime.fromtimestamp(value / 1e6)
+                    if value >= 1e12:
                        return datetime.fromtimestamp(value / 1000)
                    return datetime.fromtimestamp(value)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Description: Demonstrate bool handling and pre-2001 millisecond misclassification risk. python - <<'PY' from datetime import datetime values = { "bool_true": True, "pre_2001_milliseconds": 946684800000, "huge_nanoseconds": 999999999999999999999999, } for name, value in values.items(): print(f"\n{name} = {value!r}") print("isinstance(value, int):", isinstance(value, int)) try: print("as seconds:", datetime.fromtimestamp(value)) except Exception as exc: print("as seconds raised:", type(exc).__name__, exc) if isinstance(value, int): try: print("as milliseconds:", datetime.fromtimestamp(value / 1000)) except Exception as exc: print("as milliseconds raised:", type(exc).__name__, exc) PY

Repository: HrachShah/log-analyzer-cli

Length of output: 585

🏁 Script executed:

cd /tmp/repo && find . -name "json_log.py" -type f 2>/dev/null | head -5

Repository: HrachShah/log-analyzer-cli

Length of output: 130

🏁 Script executed:

cd /tmp/repo && wc -l src/log_analyzer_cli/parsers/json_log.py 2>/dev/null || echo "File not found"

Repository: HrachShah/log-analyzer-cli

Length of output: 163

🏁 Script executed:

cd /tmp/repo && sed -n '75,105p' src/log_analyzer_cli/parsers/json_log.py 2>/dev/null

Repository: HrachShah/log-analyzer-cli

Length of output: 130

🏁 Script executed:

find . -name "json_log.py" -type f

Repository: HrachShah/log-analyzer-cli

Length of output: 113

🏁 Script executed:

wc -l src/log_analyzer_cli/parsers/json_log.py

Repository: HrachShah/log-analyzer-cli

Length of output: 115

🏁 Script executed:

sed -n '75,105p' src/log_analyzer_cli/parsers/json_log.py

Repository: HrachShah/log-analyzer-cli

Length of output: 1577

🏁 Script executed:

sed -n '65,110p' src/log_analyzer_cli/parsers/json_log.py

Repository: HrachShah/log-analyzer-cli

Length of output: 2051

🏁 Script executed:

rg -A 15 "def _parse_timestamp" src/log_analyzer_cli/parsers/json_log.py

Repository: HrachShah/log-analyzer-cli

Length of output: 589

🏁 Script executed:

sed -n '50,75p' src/log_analyzer_cli/parsers/json_log.py

Repository: HrachShah/log-analyzer-cli

Length of output: 896

🏁 Script executed:

rg "try:|except:" -B 2 -A 5 src/log_analyzer_cli/parsers/json_log.py | head -40

Repository: HrachShah/log-analyzer-cli

Length of output: 962

🏁 Script executed:

python3 - <<'PY' # Verify the specific issues with current code behavior from datetime import datetime # Issue 1: Boolean passthrough print("=== Issue 1: Boolean passthrough ===") value = True print(f"value = {value!r}") print(f"isinstance(value, (int, float)): {isinstance(value, (int, float))}") print(f"isinstance(value, int): {isinstance(value, int)}") print(f"value <= 0: {value <= 0}") print(f"Would pass through check: {isinstance(value, (int, float)) and isinstance(value, int) and value > 0}") if isinstance(value, (int, float)) and isinstance(value, int) and value > 0: try: result = datetime.fromtimestamp(value) print(f"Result: {result}") except Exception as e: print(f"Error: {e}") # Issue 2: Pre-2001 milliseconds misclassification print("\n=== Issue 2: Pre-2001 milliseconds ===") value = 946684800000 print(f"value = {value!r} (milliseconds for 2000-01-01)") print(f"isinstance(value, (int, float)): {isinstance(value, (int, float))}") print(f"isinstance(value, int): {isinstance(value, int)}") print(f"value <= 0: {value <= 0}") print(f"value >= 1e12: {value >= 1e12}") if isinstance(value, (int, float)) and isinstance(value, int) and value > 0 and value >= 1e12: try: result = datetime.fromtimestamp(value / 1000) print(f"As milliseconds (correct): {result}") except Exception as e: print(f"Error: {e}") try: result = datetime.fromtimestamp(value) print(f"As seconds (would fail): {result}") except Exception as e: print(f"Error as seconds: {type(e).__name__}: {e}") # Issue 3: Overflow risk print("\n=== Issue 3: Overflow on huge nanoseconds ===") value = 999999999999999999999999 print(f"value = {value!r}") print(f"value >= 1e18: {value >= 1e18}") if isinstance(value, (int, float)) and isinstance(value, int) and value > 0 and value >= 1e18: try: result = datetime.fromtimestamp(value / 1e9) print(f"Result: {result}") except Exception as e: print(f"Error: {type(e).__name__}: {e}") PY

Repository: HrachShah/log-analyzer-cli

Length of output: 610

Handle booleans and add error recovery for timestamp parsing.

The current logic has three issues:

Boolean passthrough: isinstance(True, int) returns True in Python, so JSON true values pass through and become 1970-01-01 00:00:01.

Pre-2001 milliseconds misclassified: Values < 1e12 (e.g., 946684800000 for 2000-01-01 in milliseconds) are treated as seconds and raise ValueError: year out of range.

Overflow on large values: Extremely large nanosecond values (≥ 1e18) cause ValueError: year out of range when divided by 1e9.

Additionally, failed datetime.fromtimestamp() calls are not caught, causing one bad timestamp to abort the entire parse operation.

Add explicit boolean rejection with isinstance(value, bool) and wrap conversion attempts in try/except to gracefully skip invalid timestamps:

Proposed approach

if isinstance(value, (int, float)): - if not isinstance(value, int) or value <= 0: + if isinstance(value, bool) or not isinstance(value, int) or value <= 0: return None - # Distinguish seconds, milliseconds, microseconds, and - # nanoseconds by the width of the value, not just - # whether it is > 1e12. A value of 1.7e15 (microseconds) - # would be misread as milliseconds in the old code - # and produce a year 53872 ValueError, and a value of - # 1.7e18 (nanoseconds) would silently overflow - # datetime's range. - if value >= 1e18: - return datetime.fromtimestamp(value / 1e9) - if value >= 1e15: - return datetime.fromtimestamp(value / 1e6) - if value >= 1e12: - return datetime.fromtimestamp(value / 1000) - return datetime.fromtimestamp(value) + + scales = [] + if value >= 1_000_000_000_000_000_000: + scales = [1_000_000_000] + elif value >= 1_000_000_000_000_000: + scales = [1_000_000, 1_000_000_000] + elif value >= 1_000_000_000_000: + scales = [1_000, 1_000_000] + else: + scales = [1, 1_000] + + for scale in scales: + try: + return datetime.fromtimestamp(value / scale) + except (OSError, OverflowError, ValueError): + continue + return None

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/log_analyzer_cli/parsers/json_log.py` around lines 82 - 98, The timestamp parsing logic needs to reject boolean values explicitly and handle conversion errors gracefully. First, add an additional isinstance check for bool before the current int/float type check to prevent True/False from being treated as 1/0. Second, wrap each datetime.fromtimestamp call (for the 1e18, 1e15, 1e12, and direct value cases) in a try/except block to catch ValueError exceptions from year out of range or overflow conditions, returning None when parsing fails so invalid timestamps do not abort the entire parse operation.

coderabbitai · 2026-06-19T09:53:10Z

+        ts = parse_timestamp(line)
+        assert ts is not None
+        assert ts.day == 5
+        assert ts.year == datetime.now().year


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a trailing newline at EOF to clear flake8 W292.

CI reports W292 on Line 82. Please add a newline at end of file.

🧰 Tools

🪛 GitHub Actions: CI / 1_test (3.10).txt

[warning] 82-82: flake8 (W292): no newline at end of file

🪛 GitHub Actions: CI / test (3.10)

[warning] 82-82: flake8 (W292): no newline at end of file

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_utils.py` at line 82, The file tests/test_utils.py is missing a trailing newline at the end of the file after the assertion statement assert ts.year == datetime.now().year. Add a newline character at the end of the file to comply with flake8 W292 requirements. This is a common style convention where all Python files should end with a single newline character.

Source: Pipeline failures

Zo Bot added 3 commits June 17, 2026 01:51

classify numeric JSON timestamps by width in seconds, ms, us, or ns

c933ec6

The JSON log parser used a single > 1e12 branch to decide whether a numeric timestamp was in se

sourcery-ai Bot reviewed Jun 19, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix apache parser: match real user field in COMBINED_PATTERN#12

fix apache parser: match real user field in COMBINED_PATTERN#12
HrachShah wants to merge 3 commits into
mainfrom
patch-apache-combined-user

HrachShah commented Jun 19, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

sourcery-ai Bot commented Jun 19, 2026 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

sourcery-ai Bot Jun 19, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 19, 2026

Uh oh!

coderabbitai Bot Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

HrachShah commented Jun 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Summary by CodeRabbit

Bug Fixes

Tests

Uh oh!

sourcery-ai Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

coderabbitai Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

HrachShah commented Jun 19, 2026 •

edited by coderabbitai Bot

Loading

sourcery-ai Bot commented Jun 19, 2026 •

edited

Loading

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading