Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions src/log_analyzer_cli/analyzer.py
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,13 @@ def analyze(
Returns:
Analysis result.
"""
# Clear any patterns accumulated by a previous call so re-using the
# same LogAnalyzer across multiple batches does not leak earlier
# groups into the new result. Callers that want to keep history
# across calls should manage their own aggregation instead of
# relying on internal instance state.
self._error_patterns = {}

result = AnalysisResult()
result.total_lines = len(entries)
result.parsed_entries = len(entries)
Expand Down
4 changes: 2 additions & 2 deletions src/log_analyzer_cli/parsers/syslog.py
Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,6 @@ def parse(self, line: str) -> Optional[ParsedEntry]:
groups = match.groupdict()

timestamp = self._parse_timestamp(groups.get("timestamp", ""))
level = detect_log_level(line) # Check full line for level

metadata = {}
if groups.get("host"):
Expand All @@ -94,11 +93,12 @@ def parse(self, line: str) -> Optional[ParsedEntry]:
metadata["process"] = groups["process"]
source = groups["process"]
else:
# Use host as source when no process name
source = groups.get("host")
if groups.get("pid"):
metadata["pid"] = groups["pid"]

level = detect_log_level(groups.get("message", ""))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Restricting log-level detection to only the message body may miss levels present in the prefix or structured fields.

Previously detect_log_level received the full raw line, so it could pick up level markers in prefixes (e.g., ERROR, WARN, numeric priorities) outside the message field. With the new call using only groups["message"], those indicators will be ignored in formats where the level isn't inside message, which may regress detection for some syslog variants. Consider calling detect_log_level on message first and falling back to the full line when message is empty or the result is unknown, to retain robustness while favoring the new behavior.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify CLI level filtering behavior with the parser change

# Check how _parse_file uses level filtering before parsing
ast-grep --pattern $'def _parse_file($$$):
  $$$
  if include_levels:
    $$$
  $$$
  parsed = parser.parse($$$)
  $$$'

# Check if there are any tests that validate level filtering consistency
rg -n -A5 -B5 'include_levels' --glob '*test*.py'

Repository: HrachShah/log-analyzer-cli

Length of output: 52


🏁 Script executed:

# Check CLI code structure around level filtering
fd 'cli.py' -x cat -n {} | head -250

# Also check syslog parser around line 100
fd 'syslog.py' -x cat -n {} | grep -A 10 -B 10 'level = detect_log_level'

Repository: HrachShah/log-analyzer-cli

Length of output: 10564


🏁 Script executed:

# Find the _parse_file function in cli.py
rg -n '_parse_file' --glob '*.py' -A 30

# Also check if detect_log_level is being used elsewhere in cli.py
rg -n 'detect_log_level' --glob '*.py'

Repository: HrachShah/log-analyzer-cli

Length of output: 4722


🏁 Script executed:

# Get the complete _parse_file function to understand the flow
sed -n '186,250p' src/log_analyzer_cli/cli.py

Repository: HrachShah/log-analyzer-cli

Length of output: 1341


🏁 Script executed:

# Check for tests that validate the level filtering behavior
rg -n 'error-db|hostname.*error|include_levels' tests/ --glob '*.py' -A 3 -B 3

# Also check what detect_log_level does
sed -n '121,150p' src/log_analyzer_cli/utils.py

Repository: HrachShah/log-analyzer-cli

Length of output: 863


🏁 Script executed:

# Check if there are any tests showing the hostname/error issue being fixed
rg -n 'error-db|hostname' tests/ --glob '*.py' -A 5 -B 5

# Check if UNKNOWN levels are converted to INFO or other defaults anywhere
rg -n 'UNKNOWN' src/ --glob '*.py' -B 2 -A 2

# List test files to understand test coverage
find tests/ -name '*.py' -type f | head -20

Repository: HrachShah/log-analyzer-cli

Length of output: 3642


🏁 Script executed:

# Check for tests that specifically validate the level filtering consistency
rg -n 'def test.*level' tests/ --glob '*.py' -A 10

# Check the syslog test to see if it validates the message-only level detection
sed -n '34,60p' tests/test_parsers.py

Repository: HrachShah/log-analyzer-cli

Length of output: 3822


The parser and CLI-level filtering have mismatched level detection logic that breaks filtering consistency.

The CLI filters entries using detect_log_level(line) on the full line before parsing (cli.py:209), but the syslog parser detects level using detect_log_level(groups.get("message", "")) on only the message field (syslog.py:100). This creates a critical mismatch:

A line like "error-db-01: connection established" will:

  • Pass the CLI's --level ERROR filter (because "error" in "error-db-01" matches \berror\b)
  • But be stored with level="UNKNOWN" after parsing (because the message contains no level keywords)

This contradicts the PR's goal to prevent false positives from hostnames/process names and breaks the semantic contract of the level filter—users expect filtered results to have the filtered level. The CLI should filter using the parsed entry's level, not the raw line.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/log_analyzer_cli/parsers/syslog.py` at line 100, The CLI filters log
entries using detect_log_level on the raw line before parsing, while the syslog
parser detects level using detect_log_level only on the message field, creating
a mismatch. The CLI filtering logic (in cli.py around line 209) should be
refactored to filter based on the parsed entry's level field instead of calling
detect_log_level on the raw line. This ensures consistency between what the
CLI-level filter accepts and what level is actually stored in the parsed entry
by the syslog parser's level detection at line 100.


return ParsedEntry(
raw=line,
timestamp=timestamp,
Expand Down
25 changes: 25 additions & 0 deletions tests/test_analyzer.py
Original file line number Diff line number Diff line change
Expand Up @@ -149,6 +149,31 @@ def test_reset(self):
analyzer.reset()
assert len(analyzer._error_patterns) == 0

def test_repeated_analyze_does_not_leak_error_groups(self):
analyzer = LogAnalyzer()
first_batch = [
ParsedEntry(
raw="Error: timeout after 30s",
level="ERROR",
message="timeout after 30s",
),
]
second_batch = [
ParsedEntry(
raw="Error: db connection lost",
level="ERROR",
message="db connection lost",
),
]

first_result = analyzer.analyze(first_batch)
second_result = analyzer.analyze(second_batch)

assert len(first_result.error_groups) == 1
assert len(second_result.error_groups) == 1
assert second_result.error_groups[0].pattern != first_result.error_groups[0].pattern
assert second_result.error_groups[0].count == 1


class TestAnalyzeLogEntries:
"""Tests for the analyze_log_entries function."""
Expand Down
Loading