match Apache combined log format with the real user field shape#10
match Apache combined log format with the real user field shape#10HrachShah wants to merge 2 commits into
Conversation
tests/conftest.py wires up examples_dir() / apache_file() / mixed_file() fixtures that point at examples/apache-sample.log and examples/mixed.log, and the CLI tests (TestCLI::test_analyze_apache_file, etc.) pass these paths to runner.invoke(main, ['analyze', str(apt)]). When the fixtures are missing, Click's file_okay=True path validation rejects the path with exit code 2 and the test fails on the very first line with 'assert result.exit_code == 0', masking whatever the parser would have done. The fixtures were added in earlier sessions (01014cd and 85c2dd7) but the .gitignore entry that tracked them — a re-ignore of examples/*.log flanked by !examples/ and !examples/<fixture>.log exceptions — is missing on main, so a fresh clone ends up with only app-json.log and syslog-sample.log in examples/ and apache/mixed tests fail. This commit re-applies the same pattern: !examples/ to un-ignore the directory, !examples/apache-sample.log and !examples/mixed.log to re-allow the two fixtures, and the trailing examples/*.log rule to keep ad-hoc *.log files in that folder ignored. Verified: 47 tests pass on a clean tree, including test_analyze_apache_file which now finds examples/apache-sample.log under its fixture path.
The COMBINED_PATTERN's user group required whitespace (\s+) instead of a non-whitespace token, so the regex never matched a real Apache combined log line. The parse still returned a value, but only because the function fell through to COMMON_PATTERN — which silently dropped the defining fields of the combined format: referer and user_agent. The same line in COMMON_PATTERN uses \S+ for the user group, so the fix is to make COMBINED_PATTERN agree. Also widened the request, referer, user_agent groups to accept empty strings (the trailing - in each field is valid), and the size group to accept a literal dash (the Apache spec allows - for size on responses with no body). Strengthened the existing test_parse_apache_combined to assert on user_agent and referer (the regression that the broken regex would have caught) and added three new tests covering an authenticated user, a user-agent containing a space, and a 5xx response.
Reviewer's GuideAdjusts the Apache combined log regex to correctly parse user, referer, and user_agent fields (including edge cases), and strengthens/extends tests and examples to lock in the behavior. Flow diagram for ApacheParser.parse combined vs common patternsflowchart LR
LogLine[Apache combined log line] --> ApacheParser_parse[ApacheParser.parse]
ApacheParser_parse --> COMBINED_PATTERN_match[COMBINED_PATTERN.match]
COMBINED_PATTERN_match -->|match| CombinedEntry[Extract host, ident, user, timestamp, request, status, size, referer, user_agent]
COMBINED_PATTERN_match -->|no match| COMMON_PATTERN_match[COMMON_PATTERN.match]
COMMON_PATTERN_match --> CommonEntry[Extract host, ident, user, timestamp, request, status, size]
CombinedEntry --> ParseResult[Return entry with referer and user_agent]
CommonEntry --> ParseResult
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
📝 WalkthroughWalkthrough
ChangesApache Combined Log Parsing Fix
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- In the COMBINED_PATTERN,
(?P<size>\S+|\-)is redundant since\S+already matches-; if you want to constrain this more closely to Apache semantics, consider something like(?P<size>\d+|-), otherwise you can simplify back to just\S+. - Now that
request,referer, anduser_agentcapture[^\"]*, it may be worth considering whether you want to distinguish between an actually empty field and a literal-placeholder (common in Apache logs), e.g. by constraining the regex or normalizing the parsed metadata values.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the COMBINED_PATTERN, `(?P<size>\S+|\-)` is redundant since `\S+` already matches `-`; if you want to constrain this more closely to Apache semantics, consider something like `(?P<size>\d+|-)`, otherwise you can simplify back to just `\S+`.
- Now that `request`, `referer`, and `user_agent` capture `[^\"]*`, it may be worth considering whether you want to distinguish between an actually empty field and a literal `-` placeholder (common in Apache logs), e.g. by constraining the regex or normalizing the parsed metadata values.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tests/test_parsers.py (2)
5-5:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRemove unused
pytestimport to unblock CI
pytestis unused and currently fails flake8 (F401).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_parsers.py` at line 5, Remove the unused import statement for pytest at the top of the file. The pytest module is imported but not used anywhere in the test file, which causes a flake8 F401 (imported but unused) violation. Delete the line containing import pytest to resolve this linting error.Source: Pipeline failures
100-100:⚠️ Potential issue | 🟠 Major | ⚡ Quick winWrap long fixture lines to satisfy flake8 E501
These test lines exceed the configured 100-char limit and currently fail CI.
Proposed fix
- line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"' + line = ( + '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] ' + '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"' + ) @@ - line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"' + line = ( + '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] ' + '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"' + ) @@ - line = '192.168.1.10 - frank [20/Mar/2025:10:15:32 +0000] "GET /api HTTP/1.1" 200 2326 "https://example.com/page" "Mozilla/5.0"' + line = ( + '192.168.1.10 - frank [20/Mar/2025:10:15:32 +0000] ' + '"GET /api HTTP/1.1" 200 2326 "https://example.com/page" "Mozilla/5.0"' + ) @@ - line = '127.0.0.1 - - [20/Mar/2025:10:15:32 +0000] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (X11; Linux x86_64)"' + line = ( + '127.0.0.1 - - [20/Mar/2025:10:15:32 +0000] ' + '"GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (X11; Linux x86_64)"' + ) @@ - line = '10.0.0.1 - - [20/Mar/2025:10:15:32 +0000] "POST /form HTTP/1.1" 500 2326 "http://bad.com" "curl/7.0"' + line = ( + '10.0.0.1 - - [20/Mar/2025:10:15:32 +0000] ' + '"POST /form HTTP/1.1" 500 2326 "http://bad.com" "curl/7.0"' + )Also applies to: 110-110, 125-125, 135-135, 143-143
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_parsers.py` at line 100, The test fixture lines in the file exceed the configured 100-character flake8 limit (E501). Break the long log entry strings into multiple lines using string concatenation or implicit line continuation within parentheses to ensure each line stays within the 100-character limit. Apply this fix to all affected fixture lines that contain the HTTP log entries and other long strings that currently violate the character limit.Source: Pipeline failures
src/log_analyzer_cli/parsers/apache.py (1)
91-94:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve empty combined fields instead of dropping them
referer/user_agentnow correctly match empty quoted values, but these truthiness checks omit empty strings frommetadata, which loses valid parsed data and makes metadata shape inconsistent.Proposed fix
- if groups.get("referer"): - metadata["referer"] = groups["referer"] - if groups.get("user_agent"): - metadata["user_agent"] = groups["user_agent"] + if "referer" in groups and groups["referer"] is not None: + metadata["referer"] = groups["referer"] + if "user_agent" in groups and groups["user_agent"] is not None: + metadata["user_agent"] = groups["user_agent"]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/log_analyzer_cli/parsers/apache.py` around lines 91 - 94, The truthiness checks for referer and user_agent fields prevent empty strings from being added to the metadata dictionary, even though these are valid parsed values. Replace the truthiness checks `if groups.get("referer"):` and `if groups.get("user_agent"):` with checks that explicitly verify the keys exist in the groups dictionary (such as `if "referer" in groups:` and `if "user_agent" in groups:`) rather than relying on the truthiness of their values. This ensures empty strings are preserved in metadata rather than being omitted.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/log_analyzer_cli/parsers/apache.py`:
- Around line 91-94: The truthiness checks for referer and user_agent fields
prevent empty strings from being added to the metadata dictionary, even though
these are valid parsed values. Replace the truthiness checks `if
groups.get("referer"):` and `if groups.get("user_agent"):` with checks that
explicitly verify the keys exist in the groups dictionary (such as `if "referer"
in groups:` and `if "user_agent" in groups:`) rather than relying on the
truthiness of their values. This ensures empty strings are preserved in metadata
rather than being omitted.
In `@tests/test_parsers.py`:
- Line 5: Remove the unused import statement for pytest at the top of the file.
The pytest module is imported but not used anywhere in the test file, which
causes a flake8 F401 (imported but unused) violation. Delete the line containing
import pytest to resolve this linting error.
- Line 100: The test fixture lines in the file exceed the configured
100-character flake8 limit (E501). Break the long log entry strings into
multiple lines using string concatenation or implicit line continuation within
parentheses to ensure each line stays within the 100-character limit. Apply this
fix to all affected fixture lines that contain the HTTP log entries and other
long strings that currently violate the character limit.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ab1be05c-8cd9-4fa9-982d-f2adade7f5fc
⛔ Files ignored due to path filters (2)
examples/apache-sample.logis excluded by!**/*.logexamples/mixed.logis excluded by!**/*.log
📒 Files selected for processing (3)
.gitignoresrc/log_analyzer_cli/parsers/apache.pytests/test_parsers.py
What
The COMBINED_PATTERN's user group required whitespace (
\s+) instead of a non-whitespace token, so the regex never matched a real Apache combined log line. The parse still returned a value, but only because the function fell through to COMMON_PATTERN — which silently dropped the defining fields of the combined format: referer and user_agent.The same line in COMMON_PATTERN uses
\S+for the user group, so the fix is to make COMBINED_PATTERN agree. Also widened the request, referer, user_agent groups to accept empty strings, and the size group to accept a literal dash (the Apache spec allows-for size on responses with no body).Why it matters
The Apache Combined Log Format is the default format used by every stock Apache, Nginx, and most reverse-proxy log files in the wild. With this bug, the analyzer was silently downgrading every combined log file to common format, hiding the most useful fields (user agent, referring page) that distinguish "a user is browsing my site" from "an automated scan is hitting random URLs". This was invisible in the existing test suite because
test_parse_apache_combinedonly checkedlevelandstatus— both of which are also present in the common format.How to verify
Tests
test_parse_apache_combinedto assert onuser_agentandreferer(the exact regression the broken regex would have caught — this was missing before)test_parse_apache_combined_authenticated_user— covers an actual username in the user fieldtest_parse_apache_combined_user_agent_with_space— covers user agents that contain spaces, e.g.Mozilla/5.0 (X11; Linux x86_64)test_parse_apache_combined_5xx_status— covers a 5xx server error to make sure status classification still works for combined formatSummary by Sourcery
Fix Apache combined log parsing to properly recognize user, referer, and user agent fields and handle edge-case field values.
Bug Fixes:
Tests:
Chores:
Summary by CodeRabbit
Bug Fixes
Chores