Imagine going to the doctor for a full-body scan. The MRI machine malfunctions halfway through, but instead of telling you "we couldn't complete the scan," the radiologist hands you a clean report that says "No abnormalities found." You leave thinking you're healthy, but half your body was never actually examined.
When SkillSpector's LLM analyzers encounter an API failure (rate limit, timeout, network error, malformed response), they silently return {"findings": []} — zero findings. There is no retry logic, no backoff, no user-visible warning, and no indication in the report that the semantic analysis was incomplete. The user receives a lower risk score and assumes the skill is safe, when in reality the most sophisticated analysis layer simply never ran.
Relation to Existing Issues
This issue is related to a cluster of reports about LLM failure handling, including one in which apply_filter drops findings the LLM never analyzed.
What this issue adds that the above do not cover:
There is no "analysis_complete": false flag, no "llm_status": "partial" field, and no warning in terminal output.
No auditability for compliance workflows. Organizations using SkillSpector as a security gate need to prove that all analysis layers completed. The current output provides no evidence of this.
Score deflation as a security gap. The aggregate effect: missing LLM findings → lower risk score → skills pass security gates they should fail. This is distinct from the retry/isolation mechanism: it concerns what the user sees when failures occur.
The existing issues focus on why the LLM fails (no retry, batch abort, concurrency). This issue focuses on what the user experiences when it fails (silence, indistinguishable output, no recourse).
Why This Matters — Real-World Scenario
Scenario: Security gate with LLM analysis as the critical detection layer
An organization deploys SkillSpector with LLM analysis enabled because static patterns alone miss sophisticated attacks (prompt injection, indirect tool manipulation, semantic data exfiltration). Their threat model specifically relies on LLM-powered semantic analysis to catch what regex cannot.
During a peak usage period, their LLM provider (OpenAI, Anthropic, etc.) returns HTTP 429 (rate limit exceeded) for several minutes. During this window, 12 skills pass through the scanner. Each one:
Three of those 12 skills contained sophisticated prompt injection payloads that only the LLM semantic analyzer would have caught. They're now live in production.
The security team discovers this a week later during a routine re-scan (when the API is working again). The skills now score 75+ (HIGH risk). Three malicious skills operated in production for a week because the scanner gave them a false clean bill of health instead of saying "I couldn't complete the analysis."
Reproduction
    # Simulate LLM failure by setting an invalid API key
    export OPENAI_API_KEY="sk-invalid-key-12345"
    skillspector scan ./any-skill/ --format json -o report.json
    python -c "
    import json
    data = json.load(open('report.json'))
    issues = data['issues']
    llm_findings = [i for i in issues if 'SQP' in i.get('rule_id', '') or 'semantic' in i.get('rule_id', '').lower()]
    print(f'Total findings: {len(issues)}')
    print(f'LLM findings: {len(llm_findings)}')  # 0 (silently failed)
    print(f'Score: {data[\"risk_assessment\"][\"score\"]}')
    # No warning, no error message, no 'incomplete scan' flag in the output
    "
The report looks identical to a scan where LLM analysis ran successfully and found nothing. There is no way to distinguish "LLM found no issues" from "LLM never ran."
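A minimal sketch of how a downstream security gate could tell the two cases apart, assuming the report gained an explicit completeness field. The field name `analysis_complete` and the function `gate_decision` are hypothetical; the current report has no such field, which is the gap described here. Only the `risk_assessment.score` path is taken from the reproduction script above.

```python
def gate_decision(report: dict, max_score: float) -> str:
    """Return 'pass', 'fail', or 'incomplete' for a scan report.

    Assumption: `analysis_complete` is a hypothetical boolean field.
    A report that lacks it, or sets it to anything other than True,
    is treated as incomplete so the gate fails closed.
    """
    if report.get("analysis_complete") is not True:
        return "incomplete"
    score = report["risk_assessment"]["score"]
    return "pass" if score <= max_score else "fail"
```

With a gate like this, a report produced during an API outage would be held for re-scan rather than passed on a deflated score.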
Root Cause
In src/skillspector/llm_analyzer_base.py, the run_batches() method (lines 366-393):
    async def run_batches(self, file_contents: dict[str, str]) -> list[dict]:
        results = []
        for batch in self._create_batches(file_contents):
            try:
                response = await self._invoke_llm(batch)
                findings = self._parse_response(response)
                results.extend(findings)
            except Exception:
                # Silent failure: no retry, no logging, no user notification
                continue
        return results
There is no:
Retry with backoff: A transient 429 or 503 could succeed on the second attempt
Error propagation: The caller never knows the LLM failed
Report annotation: The output contains no field indicating incomplete analysis
Partial failure tracking: If 3 of 5 batches fail, the user sees reduced findings but no warning
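The four gaps above could be addressed together. The following is only a sketch under stated assumptions, not SkillSpector's code: `BatchRunResult`, `run_batches_tracked`, and the `invoke` parameter are hypothetical names, and `invoke` stands in for the analyzer's own LLM call plus response parsing. It retries each batch with exponential backoff and records which batches never succeeded instead of dropping them.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class BatchRunResult:
    findings: list = field(default_factory=list)
    # Indices of batches that failed on every attempt (hypothetical field).
    failed_batches: list = field(default_factory=list)

    @property
    def complete(self) -> bool:
        return not self.failed_batches


async def run_batches_tracked(batches, invoke, max_attempts=3, base_delay=0.5):
    """Call `await invoke(batch)` per batch, retrying with exponential backoff.

    Assumption: `invoke` is an async callable returning a list of findings.
    """
    result = BatchRunResult()
    for index, batch in enumerate(batches):
        for attempt in range(max_attempts):
            try:
                result.findings.extend(await invoke(batch))
                break
            except Exception:
                # Broad catch kept for brevity; a real version would likely
                # retry only transient errors such as 429 or 503.
                if attempt + 1 < max_attempts:
                    await asyncio.sleep(base_delay * 2 ** attempt)
        else:
            # Every attempt failed: record the batch rather than hide it.
            result.failed_batches.append(index)
    return result
```

The caller can then check `result.complete` and surface `result.failed_batches` in the report, so a partial run is never mistaken for a clean one.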
The same pattern appears in individual analyzers like semantic_security_discovery.py (lines 98-102):
    except Exception:
        return {"findings": []}  # Identical to "no findings"; indistinguishable
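One possible shape for a distinguishable result, sketched under assumptions: the wrapper `analyze_with_status` and the `status` and `error` keys are hypothetical and do not exist in the current analyzers. The point is only that a failed call and an empty result should not serialize to the same dictionary.

```python
def analyze_with_status(analyze, content: str) -> dict:
    """Wrap an analyzer call so that failure is visible in the result.

    Assumptions: `analyze` is any callable returning {"findings": [...]};
    the "status" and "error" keys are hypothetical additions.
    """
    try:
        result = analyze(content)
    except Exception as exc:
        # Still no findings, but the caller can now see that the call failed.
        return {"findings": [], "status": "error", "error": type(exc).__name__}
    return {"findings": result["findings"], "status": "ok"}
```

A caller that sees `status == "error"` can mark the scan incomplete instead of treating the empty list as a clean result.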
Impact
False sense of security: Users cannot distinguish "LLM found nothing" from "LLM never ran"
Silent degradation: Transient API issues produce permanently incomplete scan results with no way to detect them after the fact
No auditability: Compliance workflows require knowing whether all analysis layers completed — this information is lost
Risk score deflation: Missing LLM findings lower the score, making dangerous skills appear safer
No retry opportunity: Users cannot request "re-run failed LLM batches" because they don't know which ones failed
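To make the last three points concrete, a report could carry a small annotation describing how the LLM layer fared. This is a hedged sketch: `summarize_llm_run` is a hypothetical helper, and the keys `analysis_complete`, `llm_status`, and `failed_batches` are illustrative names rather than fields in the current report format.

```python
def summarize_llm_run(total_batches: int, failed_batches: list[int]) -> dict:
    """Build a hypothetical report annotation from per-batch outcomes."""
    failed = sorted(set(failed_batches))
    if total_batches < 0 or len(failed) > total_batches:
        raise ValueError("failed batch count exceeds total batches")
    if not failed:
        status = "complete"
    elif len(failed) == total_batches:
        status = "failed"
    else:
        status = "partial"
    return {
        "analysis_complete": status == "complete",
        "llm_status": status,
        # Listing the failed batches is what makes a targeted re-run possible.
        "failed_batches": failed,
    }
```

For the case where 3 of 5 batches fail, `summarize_llm_run(5, [1, 3, 4])` reports `"llm_status": "partial"` together with the batch indices to re-run.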
Affected Version
SkillSpector v2.2.3