fix apache parser: match real user field in COMBINED_PATTERN#12
fix apache parser: match real user field in COMBINED_PATTERN#12HrachShah wants to merge 3 commits into
Conversation
The JSON log parser used a single > 1e12 branch to decide whether a numeric timestamp was in se
The "%b %d %H:%M:%S" format used by RFC 3164 syslog (e.g. "Mar 20
10:15:32") has no year component. Previously _try_parse_datetime handed
the string straight to datetime.strptime, which fills 1900 as the
default year. That date then propagated through analyze() into
first_seen / last_seen / TimeDistribution, which meant every syslog
entry landed in a single January-1900 bucket in the time-series report
and sorted before every other entry in the file.
Re-parse the same string with the current year prepended ("%Y %b %d
%H:%M:%S") when the result lands on year 1900. This matches the
convention used by Python's logging module and most production syslog
parsers (rsyslog, syslog-ng). Timestamps that already carry a year are
unaffected — the rewrite only runs on the yearless format.
Also added tests/test_utils.py covering: yearless syslog lines anchored
to the current year, single-digit day ("Mar 5"), the no-1900
invariant, that timestamps with a year are unchanged, and that lines
without any timestamp return None.
…r_agent are captured The user group in COMBINED_PATTERN was the whitespace class \\s+, which only matches spaces and never the real user field (typically a literal '-' for unauthenticated requests, or an actual username). The regex therefore never matched a real combined-format line and parse() silently fell through to COMMON_PATTERN, losing the referer and user_agent metadata that combined format is specifically meant to surface. Match the user field with \\S+ like COMMON_PATTERN does, and keep the trailing \\s+ so the field stays separated from the bracketed timestamp. Combined-format lines now produce entries with referer and user_agent populated, while common-format lines (which have no referer/user_agent to capture) still match correctly via the optional non-capturing group.
Reviewer's GuideAdjusts timestamp parsing and detection across utilities and the JSON log parser, fixes Apache combined-log user-field matching, and adds targeted tests to prevent regressions in yearless and numeric timestamp handling. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
📝 WalkthroughWalkthroughThree independent parser bug fixes: the Apache combined-log regex corrects the ChangesParser and Timestamp Bug Fixes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- In
_try_parse_datetime, the special handling for the syslog-style format relies oni == 6, which is fragile if theformatslist changes ordering; it would be more robust to key this behavior off the specific format string (e.g.fmt == "%b %d %H:%M:%S"). - The updated
_extract_timestampinJSONLogParsernow ignores float epoch timestamps (isinstance(value, float)previously worked, but now any non-intnumeric returnsNone); consider preserving support for float seconds while still doing width-based unit detection for large integers. - The tests asserting
ts.year == datetime.now().yearintest_utils.pycan be flaky around the year boundary (e.g., log line timestamp just before midnight anddatetime.now()just after); consider injecting a fixed reference year or mocking the current time instead of callingdatetime.now()directly.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_try_parse_datetime`, the special handling for the syslog-style format relies on `i == 6`, which is fragile if the `formats` list changes ordering; it would be more robust to key this behavior off the specific format string (e.g. `fmt == "%b %d %H:%M:%S"`).
- The updated `_extract_timestamp` in `JSONLogParser` now ignores float epoch timestamps (`isinstance(value, float)` previously worked, but now any non-`int` numeric returns `None`); consider preserving support for float seconds while still doing width-based unit detection for large integers.
- The tests asserting `ts.year == datetime.now().year` in `test_utils.py` can be flaky around the year boundary (e.g., log line timestamp just before midnight and `datetime.now()` just after); consider injecting a fixed reference year or mocking the current time instead of calling `datetime.now()` directly.
## Individual Comments
### Comment 1
<location path="src/log_analyzer_cli/utils.py" line_range="53-62" />
<code_context>
+ for i, fmt in enumerate(formats):
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Avoid depending on the index of the format list when applying the syslog year fix.
Relying on `if i == 6` couples this behavior to the current `formats` ordering. If formats are added or reordered, the condition may silently stop targeting the syslog format (or target the wrong one). Instead, key this logic off the `fmt` value itself (e.g., compare to the `%b %d %H:%M:%S` pattern) or a named constant representing the syslog format.
Suggested implementation:
```python
for fmt in formats:
```
```python
# The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no
# year, so datetime.strptime fills in 1900 as the default. Re-parse
# the same string with the current year prepended so the resulting
# timestamp is anchored to when the log was actually generated. This
# matches the convention used by Python's logging module and most
```
```python
if fmt == "%b %d %H:%M:%S":
```
If your `formats` list is defined elsewhere and you’d like to avoid hardcoding the syslog pattern in multiple places, consider introducing a module-level constant like `SYSLOG_FORMAT = "%b %d %H:%M:%S"` and using that both in the `formats` list and in this conditional (`if fmt == SYSLOG_FORMAT:`).
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| for i, fmt in enumerate(formats): | ||
| try: | ||
| return datetime.strptime(ts_str, fmt) | ||
| parsed = datetime.strptime(ts_str, fmt) | ||
| except ValueError: | ||
| continue | ||
| # The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no | ||
| # year, so datetime.strptime fills in 1900 as the default. Re-parse | ||
| # the same string with the current year prepended so the resulting | ||
| # timestamp is anchored to when the log was actually generated. This | ||
| # matches the convention used by Python's logging module and most |
There was a problem hiding this comment.
suggestion (bug_risk): Avoid depending on the index of the format list when applying the syslog year fix.
Relying on if i == 6 couples this behavior to the current formats ordering. If formats are added or reordered, the condition may silently stop targeting the syslog format (or target the wrong one). Instead, key this logic off the fmt value itself (e.g., compare to the %b %d %H:%M:%S pattern) or a named constant representing the syslog format.
Suggested implementation:
for fmt in formats: # The "%b %d %H:%M:%S" format (syslog-style "Jan 15 10:30:45") has no
# year, so datetime.strptime fills in 1900 as the default. Re-parse
# the same string with the current year prepended so the resulting
# timestamp is anchored to when the log was actually generated. This
# matches the convention used by Python's logging module and most if fmt == "%b %d %H:%M:%S":If your formats list is defined elsewhere and you’d like to avoid hardcoding the syslog pattern in multiple places, consider introducing a module-level constant like SYSLOG_FORMAT = "%b %d %H:%M:%S" and using that both in the formats list and in this conditional (if fmt == SYSLOG_FORMAT:).
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/log_analyzer_cli/utils.py (1)
53-78: ⚡ Quick winUse the format string instead of index to trigger year anchoring.
Line 68 depends on
i == 6, which can silently break ifformatsorder changes. Match onfmt == "%b %d %H:%M:%S"instead.Proposed refactor
- for i, fmt in enumerate(formats): + for fmt in formats: try: parsed = datetime.strptime(ts_str, fmt) except ValueError: continue @@ - if i == 6 and parsed.year == 1900: + if fmt == "%b %d %H:%M:%S" and parsed.year == 1900:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/log_analyzer_cli/utils.py` around lines 53 - 78, The condition checking i == 6 to trigger year anchoring for syslog-style timestamps is fragile because it depends on the index position in the formats list. If the formats list order changes, the special handling will silently fail to trigger. Replace the index-based check with a direct format string comparison by changing the condition from i == 6 and parsed.year == 1900 to fmt == "%b %d %H:%M:%S" and parsed.year == 1900. This makes the code robust to changes in the formats list order.tests/test_parsers.py (1)
187-200: ⚡ Quick winAssert parsed timestamps, not only JSON validity.
can_parse()only checks that the line is syntactically JSON, so these tests would still pass if timestamp parsing forZ, milliseconds, or microseconds regressed. Parse the entry and asserttimestamp/microsecondinstead.🧪 Proposed test strengthening
def test_seconds(self): parser = JSONLogParser() line = '{"timestamp": "2025-03-20T10:15:32Z", "level": "INFO", "message": "Started"}' - assert parser.can_parse(line) is True + entry = parser.parse(line) + assert entry is not None + assert entry.timestamp is not None + assert entry.timestamp.microsecond == 0 def test_milliseconds(self): parser = JSONLogParser() line = '{"timestamp": "2025-03-20T10:15:32.123Z", "level": "INFO", "message": "Started"}' - assert parser.can_parse(line) is True + entry = parser.parse(line) + assert entry is not None + assert entry.timestamp is not None + assert entry.timestamp.microsecond == 123000 def test_microseconds(self): parser = JSONLogParser() line = '{"timestamp": "2025-03-20T10:15:32.123456Z", "level": "INFO", "message": "Started"}' - assert parser.can_parse(line) is True + entry = parser.parse(line) + assert entry is not None + assert entry.timestamp is not None + assert entry.timestamp.microsecond == 123456🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_parsers.py` around lines 187 - 200, The test methods test_seconds, test_milliseconds, and test_microseconds in the JSONLogParser test class are only validating JSON syntax through can_parse() but not actually asserting correct timestamp parsing. For each test method, replace the assertion on can_parse() with a call to the parser's parse method on the JSON line, then assert that the parsed entry's timestamp or microsecond value matches the expected precision level (seconds, milliseconds with 3 decimal places, or microseconds with 6 decimal places respectively) to ensure timestamp parsing doesn't regress.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/log_analyzer_cli/parsers/json_log.py`:
- Around line 82-98: The timestamp parsing logic needs to reject boolean values
explicitly and handle conversion errors gracefully. First, add an additional
isinstance check for bool before the current int/float type check to prevent
True/False from being treated as 1/0. Second, wrap each datetime.fromtimestamp
call (for the 1e18, 1e15, 1e12, and direct value cases) in a try/except block to
catch ValueError exceptions from year out of range or overflow conditions,
returning None when parsing fails so invalid timestamps do not abort the entire
parse operation.
In `@tests/test_utils.py`:
- Line 82: The file tests/test_utils.py is missing a trailing newline at the end
of the file after the assertion statement assert ts.year == datetime.now().year.
Add a newline character at the end of the file to comply with flake8 W292
requirements. This is a common style convention where all Python files should
end with a single newline character.
---
Nitpick comments:
In `@src/log_analyzer_cli/utils.py`:
- Around line 53-78: The condition checking i == 6 to trigger year anchoring for
syslog-style timestamps is fragile because it depends on the index position in
the formats list. If the formats list order changes, the special handling will
silently fail to trigger. Replace the index-based check with a direct format
string comparison by changing the condition from i == 6 and parsed.year == 1900
to fmt == "%b %d %H:%M:%S" and parsed.year == 1900. This makes the code robust
to changes in the formats list order.
In `@tests/test_parsers.py`:
- Around line 187-200: The test methods test_seconds, test_milliseconds, and
test_microseconds in the JSONLogParser test class are only validating JSON
syntax through can_parse() but not actually asserting correct timestamp parsing.
For each test method, replace the assertion on can_parse() with a call to the
parser's parse method on the JSON line, then assert that the parsed entry's
timestamp or microsecond value matches the expected precision level (seconds,
milliseconds with 3 decimal places, or microseconds with 6 decimal places
respectively) to ensure timestamp parsing doesn't regress.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b68509dc-28c3-4337-9dcf-5af3544ff667
📒 Files selected for processing (5)
src/log_analyzer_cli/parsers/apache.pysrc/log_analyzer_cli/parsers/json_log.pysrc/log_analyzer_cli/utils.pytests/test_parsers.pytests/test_utils.py
| if isinstance(value, (int, float)): | ||
| if value > 1e12: | ||
| if not isinstance(value, int) or value <= 0: | ||
| return None | ||
| # Distinguish seconds, milliseconds, microseconds, and | ||
| # nanoseconds by the width of the value, not just | ||
| # whether it is > 1e12. A value of 1.7e15 (microseconds) | ||
| # would be misread as milliseconds in the old code | ||
| # and produce a year 53872 ValueError, and a value of | ||
| # 1.7e18 (nanoseconds) would silently overflow | ||
| # datetime's range. | ||
| if value >= 1e18: | ||
| return datetime.fromtimestamp(value / 1e9) | ||
| if value >= 1e15: | ||
| return datetime.fromtimestamp(value / 1e6) | ||
| if value >= 1e12: | ||
| return datetime.fromtimestamp(value / 1000) | ||
| return datetime.fromtimestamp(value) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Demonstrate bool handling and pre-2001 millisecond misclassification risk.
python - <<'PY'
from datetime import datetime
values = {
"bool_true": True,
"pre_2001_milliseconds": 946684800000,
"huge_nanoseconds": 999999999999999999999999,
}
for name, value in values.items():
print(f"\n{name} = {value!r}")
print("isinstance(value, int):", isinstance(value, int))
try:
print("as seconds:", datetime.fromtimestamp(value))
except Exception as exc:
print("as seconds raised:", type(exc).__name__, exc)
if isinstance(value, int):
try:
print("as milliseconds:", datetime.fromtimestamp(value / 1000))
except Exception as exc:
print("as milliseconds raised:", type(exc).__name__, exc)
PYRepository: HrachShah/log-analyzer-cli
Length of output: 585
🏁 Script executed:
cd /tmp/repo && find . -name "json_log.py" -type f 2>/dev/null | head -5Repository: HrachShah/log-analyzer-cli
Length of output: 130
🏁 Script executed:
cd /tmp/repo && wc -l src/log_analyzer_cli/parsers/json_log.py 2>/dev/null || echo "File not found"Repository: HrachShah/log-analyzer-cli
Length of output: 163
🏁 Script executed:
cd /tmp/repo && sed -n '75,105p' src/log_analyzer_cli/parsers/json_log.py 2>/dev/nullRepository: HrachShah/log-analyzer-cli
Length of output: 130
🏁 Script executed:
find . -name "json_log.py" -type fRepository: HrachShah/log-analyzer-cli
Length of output: 113
🏁 Script executed:
wc -l src/log_analyzer_cli/parsers/json_log.pyRepository: HrachShah/log-analyzer-cli
Length of output: 115
🏁 Script executed:
sed -n '75,105p' src/log_analyzer_cli/parsers/json_log.pyRepository: HrachShah/log-analyzer-cli
Length of output: 1577
🏁 Script executed:
sed -n '65,110p' src/log_analyzer_cli/parsers/json_log.pyRepository: HrachShah/log-analyzer-cli
Length of output: 2051
🏁 Script executed:
rg -A 15 "def _parse_timestamp" src/log_analyzer_cli/parsers/json_log.pyRepository: HrachShah/log-analyzer-cli
Length of output: 589
🏁 Script executed:
sed -n '50,75p' src/log_analyzer_cli/parsers/json_log.pyRepository: HrachShah/log-analyzer-cli
Length of output: 896
🏁 Script executed:
rg "try:|except:" -B 2 -A 5 src/log_analyzer_cli/parsers/json_log.py | head -40Repository: HrachShah/log-analyzer-cli
Length of output: 962
🏁 Script executed:
python3 - <<'PY'
# Verify the specific issues with current code behavior
from datetime import datetime
# Issue 1: Boolean passthrough
print("=== Issue 1: Boolean passthrough ===")
value = True
print(f"value = {value!r}")
print(f"isinstance(value, (int, float)): {isinstance(value, (int, float))}")
print(f"isinstance(value, int): {isinstance(value, int)}")
print(f"value <= 0: {value <= 0}")
print(f"Would pass through check: {isinstance(value, (int, float)) and isinstance(value, int) and value > 0}")
if isinstance(value, (int, float)) and isinstance(value, int) and value > 0:
try:
result = datetime.fromtimestamp(value)
print(f"Result: {result}")
except Exception as e:
print(f"Error: {e}")
# Issue 2: Pre-2001 milliseconds misclassification
print("\n=== Issue 2: Pre-2001 milliseconds ===")
value = 946684800000
print(f"value = {value!r} (milliseconds for 2000-01-01)")
print(f"isinstance(value, (int, float)): {isinstance(value, (int, float))}")
print(f"isinstance(value, int): {isinstance(value, int)}")
print(f"value <= 0: {value <= 0}")
print(f"value >= 1e12: {value >= 1e12}")
if isinstance(value, (int, float)) and isinstance(value, int) and value > 0 and value >= 1e12:
try:
result = datetime.fromtimestamp(value / 1000)
print(f"As milliseconds (correct): {result}")
except Exception as e:
print(f"Error: {e}")
try:
result = datetime.fromtimestamp(value)
print(f"As seconds (would fail): {result}")
except Exception as e:
print(f"Error as seconds: {type(e).__name__}: {e}")
# Issue 3: Overflow risk
print("\n=== Issue 3: Overflow on huge nanoseconds ===")
value = 999999999999999999999999
print(f"value = {value!r}")
print(f"value >= 1e18: {value >= 1e18}")
if isinstance(value, (int, float)) and isinstance(value, int) and value > 0 and value >= 1e18:
try:
result = datetime.fromtimestamp(value / 1e9)
print(f"Result: {result}")
except Exception as e:
print(f"Error: {type(e).__name__}: {e}")
PYRepository: HrachShah/log-analyzer-cli
Length of output: 610
Handle booleans and add error recovery for timestamp parsing.
The current logic has three issues:
-
Boolean passthrough:
isinstance(True, int)returnsTruein Python, so JSONtruevalues pass through and become1970-01-01 00:00:01. -
Pre-2001 milliseconds misclassified: Values < 1e12 (e.g.,
946684800000for 2000-01-01 in milliseconds) are treated as seconds and raiseValueError: year out of range. -
Overflow on large values: Extremely large nanosecond values (≥ 1e18) cause
ValueError: year out of rangewhen divided by 1e9.
Additionally, failed datetime.fromtimestamp() calls are not caught, causing one bad timestamp to abort the entire parse operation.
Add explicit boolean rejection with isinstance(value, bool) and wrap conversion attempts in try/except to gracefully skip invalid timestamps:
Proposed approach
if isinstance(value, (int, float)):
- if not isinstance(value, int) or value <= 0:
+ if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
return None
- # Distinguish seconds, milliseconds, microseconds, and
- # nanoseconds by the width of the value, not just
- # whether it is > 1e12. A value of 1.7e15 (microseconds)
- # would be misread as milliseconds in the old code
- # and produce a year 53872 ValueError, and a value of
- # 1.7e18 (nanoseconds) would silently overflow
- # datetime's range.
- if value >= 1e18:
- return datetime.fromtimestamp(value / 1e9)
- if value >= 1e15:
- return datetime.fromtimestamp(value / 1e6)
- if value >= 1e12:
- return datetime.fromtimestamp(value / 1000)
- return datetime.fromtimestamp(value)
+
+ scales = []
+ if value >= 1_000_000_000_000_000_000:
+ scales = [1_000_000_000]
+ elif value >= 1_000_000_000_000_000:
+ scales = [1_000_000, 1_000_000_000]
+ elif value >= 1_000_000_000_000:
+ scales = [1_000, 1_000_000]
+ else:
+ scales = [1, 1_000]
+
+ for scale in scales:
+ try:
+ return datetime.fromtimestamp(value / scale)
+ except (OSError, OverflowError, ValueError):
+ continue
+ return None🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/log_analyzer_cli/parsers/json_log.py` around lines 82 - 98, The timestamp
parsing logic needs to reject boolean values explicitly and handle conversion
errors gracefully. First, add an additional isinstance check for bool before the
current int/float type check to prevent True/False from being treated as 1/0.
Second, wrap each datetime.fromtimestamp call (for the 1e18, 1e15, 1e12, and
direct value cases) in a try/except block to catch ValueError exceptions from
year out of range or overflow conditions, returning None when parsing fails so
invalid timestamps do not abort the entire parse operation.
| ts = parse_timestamp(line) | ||
| assert ts is not None | ||
| assert ts.day == 5 | ||
| assert ts.year == datetime.now().year No newline at end of file |
There was a problem hiding this comment.
Add a trailing newline at EOF to clear flake8 W292.
CI reports W292 on Line 82. Please add a newline at end of file.
🧰 Tools
🪛 GitHub Actions: CI / 1_test (3.10).txt
[warning] 82-82: flake8 (W292): no newline at end of file
🪛 GitHub Actions: CI / test (3.10)
[warning] 82-82: flake8 (W292): no newline at end of file
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_utils.py` at line 82, The file tests/test_utils.py is missing a
trailing newline at the end of the file after the assertion statement assert
ts.year == datetime.now().year. Add a newline character at the end of the file
to comply with flake8 W292 requirements. This is a common style convention where
all Python files should end with a single newline character.
Source: Pipeline failures
The COMBINED_PATTERN regex in parsers/apache.py was using
\s+to match the user field, which only matches whitespace. Real combined-format log lines have a literal-for unauthenticated requests or a username, so the pattern never matched a true combined-format line. parse() then fell back to COMMON_PATTERN, dropping the referer and user_agent metadata that combined format is supposed to expose.This change uses
\S+(with a trailing\s+to keep field separation) so the user field matches the same way it does in COMMON_PATTERN. Combined-format lines now get referer and user_agent populated in the entry metadata; common-format lines (which don't have those fields) still match via the optional non-capturing group.Repro before the fix:
After the fix, entry.metadata has
refereranduser_agentpopulated.Summary by Sourcery
Fix timestamp handling and Apache combined log parsing to correctly extract metadata and produce accurate time-series analyses.
Bug Fixes:
Tests:
Summary by CodeRabbit
Bug Fixes
Tests