
[Bug] LLM API failures silently produce zero findings with no retry or user notification #149

Description

@mimran-khan

Summary

Imagine going to the doctor for a full-body scan. The MRI machine malfunctions halfway through, but instead of telling you "we couldn't complete the scan," the radiologist hands you a clean report that says "No abnormalities found." You leave thinking you're healthy, but half your body was never actually examined.

When SkillSpector's LLM analyzers encounter an API failure (rate limit, timeout, network error, malformed response), they silently return {"findings": []} — zero findings. There is no retry logic, no backoff, no user-visible warning, and no indication in the report that the semantic analysis was incomplete. The user receives a lower risk score and assumes the skill is safe, when in reality the most sophisticated analysis layer simply never ran.

Relation to Existing Issues

This issue is related to a cluster of reports about LLM failure handling:

What this issue adds that the above do not cover:

  1. The user-facing report is indistinguishable from a clean scan. Even if retry and batch isolation are implemented (Fix(llm): retry transient failures and make concurrency configurable #29, fix(llm): isolate Stage 2 batch failures and keep unanalysed findings #32), the user still has no way to know whether the LLM layer ran at all. There is no "analysis_complete": false flag, no "llm_status": "partial" field, and no warning in terminal output.
  2. No auditability for compliance workflows. Organizations using SkillSpector as a security gate need to prove that all analysis layers completed. The current output provides no evidence of this.
  3. Score deflation as a security gap. The aggregate effect: missing LLM findings → lower risk score → skills pass security gates they should fail. This is distinct from the retry/isolation mechanism — it's about what the user sees when failures occur.

The existing issues focus on why the LLM fails (no retry, batch abort, concurrency). This issue focuses on what the user experiences when it fails (silence, indistinguishable output, no recourse).
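
As an illustration of how a downstream security gate could use such fields, here is a minimal sketch. It assumes the report gains the "analysis_complete" and "llm_status" fields suggested above. The "complete" status value, the gate_decision function, and the max_score threshold are hypothetical and not part of SkillSpector today.

```python
def gate_decision(report: dict, max_score: int = 30) -> str:
    """Return 'approve', 'reject', or 'incomplete' for a parsed JSON report.

    Assumption: 'analysis_complete' and 'llm_status' are proposed fields,
    and 'complete' is a hypothetical status value.
    """
    # Fail closed: a report without completeness evidence is never approved.
    if not report.get("analysis_complete", False):
        return "incomplete"
    if report.get("llm_status") != "complete":
        return "incomplete"
    score = report["risk_assessment"]["score"]
    return "approve" if score <= max_score else "reject"
```

The design choice is that a missing field counts as incomplete, so reports produced by older versions cannot pass the gate by omission.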

Why This Matters — Real-World Scenario

Scenario: Security gate with LLM analysis as the critical detection layer

An organization deploys SkillSpector with LLM analysis enabled because static patterns alone miss sophisticated attacks (prompt injection, indirect tool manipulation, semantic data exfiltration). Their threat model specifically relies on LLM-powered semantic analysis to catch what regex cannot.

During a peak usage period, their LLM provider (OpenAI, Anthropic, etc.) returns HTTP 429 (rate limit exceeded) for several minutes. During this window, 12 skills pass through the scanner. Each one:

  • Gets static analysis results (perhaps 2-3 low-severity findings)
  • Gets LLM analysis results: [] (silently failed)
  • Scores 15-20 (LOW risk)
  • Is auto-approved and published

Three of those 12 skills contained sophisticated prompt injection payloads that only the LLM semantic analyzer would have caught. They're now live in production.

The security team discovers this a week later during a routine re-scan (when the API is working again). The skills now score 75+ (HIGH risk). Three malicious skills operated in production for a week because the scanner gave them a false clean bill of health instead of saying "I couldn't complete the analysis."

Reproduction

# Simulate LLM failure by setting an invalid API key
export OPENAI_API_KEY="sk-invalid-key-12345"

skillspector scan ./any-skill/ --format json -o report.json

python -c "
import json
data = json.load(open('report.json'))
issues = data['issues']
llm_findings = [i for i in issues if 'SQP' in i.get('rule_id', '') or 'semantic' in i.get('rule_id', '').lower()]
print(f'Total findings: {len(issues)}')
print(f'LLM findings: {len(llm_findings)}')  # 0 — silently failed
print(f'Score: {data[\"risk_assessment\"][\"score\"]}')
# No warning, no error message, no 'incomplete scan' flag in the output
"

The report looks identical to a scan where LLM analysis ran successfully and found nothing. There is no way to distinguish "LLM found no issues" from "LLM never ran."

Root Cause

In src/skillspector/llm_analyzer_base.py, the run_batches() method (lines 366-393) swallows every exception and moves on to the next batch:

async def run_batches(self, file_contents: dict[str, str]) -> list[dict]:
    results = []
    for batch in self._create_batches(file_contents):
        try:
            response = await self._invoke_llm(batch)
            findings = self._parse_response(response)
            results.extend(findings)
        except Exception:
            # Silent failure — no retry, no logging, no user notification
            continue
    return results

There is no:

  • Retry with backoff: A transient 429 or 503 could succeed on the second attempt
  • Error propagation: The caller never knows the LLM failed
  • Report annotation: The output contains no field indicating incomplete analysis
  • Partial failure tracking: If 3 of 5 batches fail, the user sees reduced findings but no warning
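
The gaps listed above could be addressed together roughly as follows. This is a sketch under stated assumptions, not the project's implementation: BatchOutcome, run_batches_tracked, the invoke callable, max_attempts, and base_delay are all hypothetical names. It also retries on any exception for brevity, where a real fix would likely retry only transient errors such as 429 or 503.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable


@dataclass
class BatchOutcome:
    """Hypothetical result type: findings plus a record of failed batches."""
    findings: list[dict] = field(default_factory=list)
    failed_batches: list[int] = field(default_factory=list)
    total_batches: int = 0

    @property
    def status(self) -> str:
        if not self.failed_batches:
            return "complete"
        if len(self.failed_batches) == self.total_batches:
            return "failed"
        return "partial"


async def run_batches_tracked(
    batches: list[dict[str, str]],
    invoke: Callable[[dict[str, str]], Awaitable[list[dict]]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> BatchOutcome:
    outcome = BatchOutcome(total_batches=len(batches))
    for index, batch in enumerate(batches):
        for attempt in range(max_attempts):
            try:
                outcome.findings.extend(await invoke(batch))
                break  # batch succeeded, move on to the next one
            except Exception:
                if attempt == max_attempts - 1:
                    # Retries exhausted: record the failure instead of hiding it.
                    outcome.failed_batches.append(index)
                else:
                    # Exponential backoff: base_delay, 2x, 4x, ...
                    await asyncio.sleep(base_delay * 2 ** attempt)
    return outcome
```

The caller can then copy outcome.status and outcome.failed_batches into the report, so "3 of 5 batches failed" becomes visible and the failed batches can be re-run later.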

The same pattern appears in individual analyzers like semantic_security_discovery.py (lines 98-102):

except Exception:
    return {"findings": []}  # Identical to "no findings" — indistinguishable
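
One way an individual analyzer could make the two outcomes distinguishable is to return an explicit status alongside the findings. The wrapper below is only a sketch: analyze_with_status and the "status" and "error" keys are hypothetical, not existing SkillSpector fields.

```python
from typing import Callable


def analyze_with_status(analyze: Callable[[], list[dict]]) -> dict:
    """Hypothetical wrapper: run an analyzer and report whether it completed."""
    try:
        findings = analyze()
    except Exception as exc:
        # The failure is recorded explicitly instead of looking like "no findings".
        return {
            "findings": [],
            "status": "error",
            "error": f"{type(exc).__name__}: {exc}",
        }
    return {"findings": findings, "status": "ok"}
```

With this shape, an empty findings list with status "ok" means the LLM ran and found nothing, while status "error" means it never produced a result.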

Impact

  • False sense of security: Users cannot distinguish "LLM found nothing" from "LLM never ran"
  • Silent degradation: Transient API issues produce permanently incomplete scan results with no way to detect them after the fact
  • No auditability: Compliance workflows require knowing whether all analysis layers completed — this information is lost
  • Risk score deflation: Missing LLM findings lower the score, making dangerous skills appear safer
  • No retry opportunity: Users cannot request "re-run failed LLM batches" because they don't know which ones failed

Affected Version

SkillSpector v2.2.3
