Skip to content

match Apache combined log format with the real user field shape#10

Open
HrachShah wants to merge 2 commits into
mainfrom
fix/apache-combined-format-user-field
Open

match Apache combined log format with the real user field shape#10
HrachShah wants to merge 2 commits into
mainfrom
fix/apache-combined-format-user-field

Conversation

@HrachShah

@HrachShah HrachShah commented Jun 16, 2026

Copy link
Copy Markdown
Owner

What

The COMBINED_PATTERN's user group required whitespace (\s+) instead of a non-whitespace token, so the regex never matched a real Apache combined log line. The parse still returned a value, but only because the function fell through to COMMON_PATTERN — which silently dropped the defining fields of the combined format: referer and user_agent.

The same line in COMMON_PATTERN uses \S+ for the user group, so the fix is to make COMBINED_PATTERN agree. Also widened the request, referer, user_agent groups to accept empty strings, and the size group to accept a literal dash (the Apache spec allows - for size on responses with no body).

Why it matters

The Apache Combined Log Format is the default format used by every stock Apache, Nginx, and most reverse-proxy log files in the wild. With this bug, the analyzer was silently downgrading every combined log file to common format, hiding the most useful fields (user agent, referring page) that distinguish "a user is browsing my site" from "an automated scan is hitting random URLs". This was invisible in the existing test suite because test_parse_apache_combined only checked level and status — both of which are also present in the common format.

How to verify

pip install -e .
python -m pytest tests/                 # 50 passed
from log_analyzer_cli.parsers import ApacheParser
p = ApacheParser()
line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
e = p.parse(line)
print(e.metadata["user_agent"])         # was: missing, now: 'Mozilla/5.0'
print(e.metadata["referer"])            # was: missing, now: '-'

Tests

  • Strengthened test_parse_apache_combined to assert on user_agent and referer (the exact regression the broken regex would have caught — this was missing before)
  • Added test_parse_apache_combined_authenticated_user — covers an actual username in the user field
  • Added test_parse_apache_combined_user_agent_with_space — covers user agents that contain spaces, e.g. Mozilla/5.0 (X11; Linux x86_64)
  • Added test_parse_apache_combined_5xx_status — covers a 5xx server error to make sure status classification still works for combined format

Summary by Sourcery

Fix Apache combined log parsing to properly recognize user, referer, and user agent fields and handle edge-case field values.

Bug Fixes:

  • Correct Apache combined log regex so the user field, referer, and user agent are parsed instead of falling back to the common log pattern.
  • Allow Apache combined log parsing to accept empty request, referer, and user agent fields and a dash as a valid size value.

Tests:

  • Strengthen existing Apache combined log test to assert on user_agent and referer fields.
  • Add tests covering authenticated users, user agents containing spaces, and 5xx error responses in Apache combined log format.

Chores:

  • Add example Apache log files for sample and mixed log inputs.

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced Apache log parser to correctly handle edge cases including empty request fields, missing user-agent values, authenticated users, and server error (5xx) responses
  • Chores

    • Updated repository configuration to preserve test fixture log files in fresh checkouts

Zo Bot added 2 commits June 15, 2026 21:39
tests/conftest.py wires up examples_dir() / apache_file() / mixed_file()
fixtures that point at examples/apache-sample.log and examples/mixed.log,
and the CLI tests (TestCLI::test_analyze_apache_file, etc.) pass these
paths to runner.invoke(main, ['analyze', str(apt)]). When the fixtures
are missing, Click's file_okay=True path validation rejects the path
with exit code 2 and the test fails on the very first line with
'assert result.exit_code == 0', masking whatever the parser would
have done.

The fixtures were added in earlier sessions (01014cd and 85c2dd7) but
the .gitignore entry that tracked them — a re-ignore of examples/*.log
flanked by !examples/ and !examples/<fixture>.log exceptions — is
missing on main, so a fresh clone ends up with only app-json.log and
syslog-sample.log in examples/ and apache/mixed tests fail. This
commit re-applies the same pattern: !examples/ to un-ignore the
directory, !examples/apache-sample.log and !examples/mixed.log to
re-allow the two fixtures, and the trailing examples/*.log rule to
keep ad-hoc *.log files in that folder ignored.

Verified: 47 tests pass on a clean tree, including
test_analyze_apache_file which now finds examples/apache-sample.log
under its fixture path.
The COMBINED_PATTERN's user group required whitespace (\s+) instead of a
non-whitespace token, so the regex never matched a real Apache combined
log line. The parse still returned a value, but only because the
function fell through to COMMON_PATTERN — which silently dropped the
defining fields of the combined format: referer and user_agent.

The same line in COMMON_PATTERN uses \S+ for the user group, so the
fix is to make COMBINED_PATTERN agree. Also widened the request,
referer, user_agent groups to accept empty strings (the trailing - in
each field is valid), and the size group to accept a literal dash (the
Apache spec allows - for size on responses with no body).

Strengthened the existing test_parse_apache_combined to assert on
user_agent and referer (the regression that the broken regex would
have caught) and added three new tests covering an authenticated user,
a user-agent containing a space, and a 5xx response.
@sourcery-ai

sourcery-ai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Reviewer's Guide

Adjusts the Apache combined log regex to correctly parse user, referer, and user_agent fields (including edge cases), and strengthens/extends tests and examples to lock in the behavior.

Flow diagram for ApacheParser.parse combined vs common patterns

flowchart LR
    LogLine[Apache combined log line] --> ApacheParser_parse[ApacheParser.parse]
    ApacheParser_parse --> COMBINED_PATTERN_match[COMBINED_PATTERN.match]

    COMBINED_PATTERN_match -->|match| CombinedEntry[Extract host, ident, user, timestamp, request, status, size, referer, user_agent]
    COMBINED_PATTERN_match -->|no match| COMMON_PATTERN_match[COMMON_PATTERN.match]

    COMMON_PATTERN_match --> CommonEntry[Extract host, ident, user, timestamp, request, status, size]

    CombinedEntry --> ParseResult[Return entry with referer and user_agent]
    CommonEntry --> ParseResult
Loading

File-Level Changes

Change Details Files
Fix Apache combined log regex to correctly capture user, request, size, referer, and user_agent fields and handle edge cases allowed by the Apache spec.
  • Change the user capture group from requiring whitespace to capturing a non-whitespace token followed by whitespace, so real usernames are parsed and the combined pattern actually matches combined-format lines.
  • Relax the request capture group to allow empty strings instead of requiring at least one character, matching valid but empty request fields.
  • Allow the size field to be either a non-whitespace token or a literal dash, to support responses with no body as permitted by Apache.
  • Relax the referer and user_agent capture groups to accept empty strings, so logs with missing or empty values still parse correctly.
  • Keep the trailing pattern flexible so additional content after the user_agent is ignored without breaking parsing.
src/log_analyzer_cli/parsers/apache.py
Strengthen and extend parser tests for Apache combined logs to cover user, referer, user_agent, and error status behavior.
  • Augment the existing Apache combined log test to assert that user_agent and referer metadata is present and correctly parsed, ensuring the combined pattern is used rather than silently falling back to the common pattern.
  • Add a test for combined logs with an authenticated (non-dash) user field to verify user is captured and referer/user_agent are parsed correctly.
  • Add a test for combined logs where the user-agent string contains spaces and parentheses, ensuring the regex captures the full user agent.
  • Add a test for combined logs with a 5xx status code to confirm that error-level classification and combined field parsing work together.
tests/test_parsers.py
Add example Apache log files to demonstrate sample and mixed log inputs.
  • Introduce an apache-sample.log file showing example Apache log entries.
  • Introduce a mixed.log file containing a mix of log formats for demonstration or manual testing.
  • Leave .gitignore unchanged in this diff hunk.
examples/apache-sample.log
examples/mixed.log
.gitignore

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

ApacheParser's COMBINED_PATTERN regex is updated to use \S+ for the user field and [^"]* (zero-or-more) for request, referer, and user-agent captures. Tests are extended with assertions for metadata.user_agent, metadata.referer, authenticated user, space-containing user-agents, and 5xx ERROR level. .gitignore is updated to re-track two fixture log files.

Changes

Apache Combined Log Parsing Fix

Layer / File(s) Summary
COMBINED_PATTERN regex update
src/log_analyzer_cli/parsers/apache.py
Changes user capture to \S+, relaxes request, referer, and user_agent groups to [^"]* (zero-or-more), enabling matches on empty requests and optional referer/user-agent values.
Combined format tests and fixture tracking
tests/test_parsers.py, .gitignore
Adds metadata.user_agent and metadata.referer assertions to the existing combined test; introduces three new test cases covering authenticated user, space-containing user-agent string, and 5xx response mapped to ERROR level. Un-ignores examples/apache-sample.log and examples/mixed.log so fixture files are tracked in git.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 Hop, hop, the regex leaps,
With [^"]* it no longer weeps.
Empty requests? Now let them through!
Referers blank? We match those too.
The fixtures tracked, the tests all green —
The cleanest combined log I've ever seen! 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'match Apache combined log format with the real user field shape' directly and specifically describes the main change—fixing the regex pattern for the user field in COMBINED_PATTERN to properly match Apache combined log format.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/apache-combined-format-user-field

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • In the COMBINED_PATTERN, (?P<size>\S+|\-) is redundant since \S+ already matches -; if you want to constrain this more closely to Apache semantics, consider something like (?P<size>\d+|-), otherwise you can simplify back to just \S+.
  • Now that request, referer, and user_agent capture [^\"]*, it may be worth considering whether you want to distinguish between an actually empty field and a literal - placeholder (common in Apache logs), e.g. by constraining the regex or normalizing the parsed metadata values.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the COMBINED_PATTERN, `(?P<size>\S+|\-)` is redundant since `\S+` already matches `-`; if you want to constrain this more closely to Apache semantics, consider something like `(?P<size>\d+|-)`, otherwise you can simplify back to just `\S+`.
- Now that `request`, `referer`, and `user_agent` capture `[^\"]*`, it may be worth considering whether you want to distinguish between an actually empty field and a literal `-` placeholder (common in Apache logs), e.g. by constraining the regex or normalizing the parsed metadata values.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tests/test_parsers.py (2)

5-5: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove unused pytest import to unblock CI

pytest is unused and currently fails flake8 (F401).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_parsers.py` at line 5, Remove the unused import statement for
pytest at the top of the file. The pytest module is imported but not used
anywhere in the test file, which causes a flake8 F401 (imported but unused)
violation. Delete the line containing import pytest to resolve this linting
error.

Source: Pipeline failures


100-100: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wrap long fixture lines to satisfy flake8 E501

These test lines exceed the configured 100-char limit and currently fail CI.

Proposed fix
-        line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
+        line = (
+            '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] '
+            '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
+        )
@@
-        line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
+        line = (
+            '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] '
+            '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
+        )
@@
-        line = '192.168.1.10 - frank [20/Mar/2025:10:15:32 +0000] "GET /api HTTP/1.1" 200 2326 "https://example.com/page" "Mozilla/5.0"'
+        line = (
+            '192.168.1.10 - frank [20/Mar/2025:10:15:32 +0000] '
+            '"GET /api HTTP/1.1" 200 2326 "https://example.com/page" "Mozilla/5.0"'
+        )
@@
-        line = '127.0.0.1 - - [20/Mar/2025:10:15:32 +0000] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (X11; Linux x86_64)"'
+        line = (
+            '127.0.0.1 - - [20/Mar/2025:10:15:32 +0000] '
+            '"GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (X11; Linux x86_64)"'
+        )
@@
-        line = '10.0.0.1 - - [20/Mar/2025:10:15:32 +0000] "POST /form HTTP/1.1" 500 2326 "http://bad.com" "curl/7.0"'
+        line = (
+            '10.0.0.1 - - [20/Mar/2025:10:15:32 +0000] '
+            '"POST /form HTTP/1.1" 500 2326 "http://bad.com" "curl/7.0"'
+        )

Also applies to: 110-110, 125-125, 135-135, 143-143

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_parsers.py` at line 100, The test fixture lines in the file exceed
the configured 100-character flake8 limit (E501). Break the long log entry
strings into multiple lines using string concatenation or implicit line
continuation within parentheses to ensure each line stays within the
100-character limit. Apply this fix to all affected fixture lines that contain
the HTTP log entries and other long strings that currently violate the character
limit.

Source: Pipeline failures

src/log_analyzer_cli/parsers/apache.py (1)

91-94: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve empty combined fields instead of dropping them

referer/user_agent now correctly match empty quoted values, but these truthiness checks omit empty strings from metadata, which loses valid parsed data and makes metadata shape inconsistent.

Proposed fix
-        if groups.get("referer"):
-            metadata["referer"] = groups["referer"]
-        if groups.get("user_agent"):
-            metadata["user_agent"] = groups["user_agent"]
+        if "referer" in groups and groups["referer"] is not None:
+            metadata["referer"] = groups["referer"]
+        if "user_agent" in groups and groups["user_agent"] is not None:
+            metadata["user_agent"] = groups["user_agent"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/log_analyzer_cli/parsers/apache.py` around lines 91 - 94, The truthiness
checks for referer and user_agent fields prevent empty strings from being added
to the metadata dictionary, even though these are valid parsed values. Replace
the truthiness checks `if groups.get("referer"):` and `if
groups.get("user_agent"):` with checks that explicitly verify the keys exist in
the groups dictionary (such as `if "referer" in groups:` and `if "user_agent" in
groups:`) rather than relying on the truthiness of their values. This ensures
empty strings are preserved in metadata rather than being omitted.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/log_analyzer_cli/parsers/apache.py`:
- Around line 91-94: The truthiness checks for referer and user_agent fields
prevent empty strings from being added to the metadata dictionary, even though
these are valid parsed values. Replace the truthiness checks `if
groups.get("referer"):` and `if groups.get("user_agent"):` with checks that
explicitly verify the keys exist in the groups dictionary (such as `if "referer"
in groups:` and `if "user_agent" in groups:`) rather than relying on the
truthiness of their values. This ensures empty strings are preserved in metadata
rather than being omitted.

In `@tests/test_parsers.py`:
- Line 5: Remove the unused import statement for pytest at the top of the file.
The pytest module is imported but not used anywhere in the test file, which
causes a flake8 F401 (imported but unused) violation. Delete the line containing
import pytest to resolve this linting error.
- Line 100: The test fixture lines in the file exceed the configured
100-character flake8 limit (E501). Break the long log entry strings into
multiple lines using string concatenation or implicit line continuation within
parentheses to ensure each line stays within the 100-character limit. Apply this
fix to all affected fixture lines that contain the HTTP log entries and other
long strings that currently violate the character limit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ab1be05c-8cd9-4fa9-982d-f2adade7f5fc

📥 Commits

Reviewing files that changed from the base of the PR and between e93757f and ac55df0.

⛔ Files ignored due to path filters (2)
  • examples/apache-sample.log is excluded by !**/*.log
  • examples/mixed.log is excluded by !**/*.log
📒 Files selected for processing (3)
  • .gitignore
  • src/log_analyzer_cli/parsers/apache.py
  • tests/test_parsers.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant