KVScope is a lightweight LLM inference observability and diagnostics tool. It is vLLM-first and reconstructs operational behavior from live serving telemetry.
KVScope focuses on:
- KV cache pressure
- scheduler pressure and queueing
- prefill pressure
- TTFT spikes and latency instability
- decode throughput collapse
- preemption behavior
- runtime phase transitions
- sustained condition episodes
KVScope is not a benchmark harness, not a Grafana replacement, and not a generic metrics dashboard.
KV cache usage is always stored internally as percent units from 0.0 to 100.0.
Raw vLLM Prometheus metrics may expose vllm:kv_cache_usage_perc as a fraction such as 0.734. KVScope normalizes that to 73.4 at the adapter boundary.
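The normalization step can be illustrated with a minimal sketch. It assumes the raw value arrives as one Prometheus text-format sample line without a trailing timestamp and that it is always a fraction. The helper names `parse_metric_line` and `to_percent` and the `model_name` label value are hypothetical and are not taken from KVScope's adapter code.

```python
def parse_metric_line(line: str) -> tuple[str, float]:
    """Split one Prometheus text-format sample into (metric name, value)."""
    series, value = line.strip().rsplit(" ", 1)
    name = series.split("{", 1)[0]  # drop the label set, if any
    return name, float(value)


def to_percent(fraction: float) -> float:
    """Convert a 0.0-1.0 fraction to 0.0-100.0 percent units, clamped."""
    return min(100.0, max(0.0, fraction * 100.0))


# Illustrative sample line; the label value is made up.
name, raw = parse_metric_line('vllm:kv_cache_usage_perc{model_name="demo"} 0.734')
kv_usage_pct = to_percent(raw)
print(name, round(kv_usage_pct, 1))
```

Converting once at the boundary means every downstream consumer can compare against percent thresholds without checking which unit a sample uses.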
kvscope record --server http://127.0.0.1:8080 --duration 90
kvscope analyze results/sessions/<session-id>
kvscope timeline results/sessions/<session-id> --verbose
kvscope doctor results/sessions/<session-id> --verbose
kvscope dashboard --server http://127.0.0.1:8080 --density compact
The same commands can be run from source:
python -m kvscope.cli record --server http://127.0.0.1:8080 --duration 90
python -m kvscope.cli analyze results/sessions/<session-id>
python -m kvscope.cli timeline results/sessions/<session-id> --verbose
python -m kvscope.cli doctor results/sessions/<session-id> --verbose
python -m kvscope.cli dashboard --server http://127.0.0.1:8080 --density full
Dashboard density modes:
- compact: default mode for normal terminal heights
- full: includes detailed latency and cumulative counter panels
kvscope record writes a session directory under results/sessions/ by default.
Core files:
- metadata.json: session metadata, schema fields, source URL, and recorder settings
- metrics.csv: normalized samples as CSV
- samples.jsonl: normalized KVScopeSample records
- events.jsonl: phase transitions, condition changes, condition episodes, and transient operational events
Analysis files:
- summary.json: aggregate session metrics from kvscope analyze
- doctor.json: ranked diagnostic hypotheses from kvscope doctor
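Because the session files are plain JSON and JSONL, they can be inspected without KVScope itself. The sketch below counts records in events.jsonl by kind. The `type` field name and the example record contents are assumptions made for illustration, since the actual schema is not documented here, and `count_event_types` is a hypothetical helper.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_event_types(session_dir: Path) -> Counter:
    """Count records in events.jsonl by their (assumed) "type" field."""
    counts: Counter = Counter()
    with (session_dir / "events.jsonl").open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # tolerate blank lines
                counts[json.loads(line).get("type", "unknown")] += 1
    return counts


# Demo against a throwaway directory with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    session = Path(tmp)
    (session / "events.jsonl").write_text(
        '{"type": "phase_transition", "to": "SATURATED"}\n'
        '{"type": "condition_episode", "condition": "QUEUE_DELAY"}\n'
        '{"type": "phase_transition", "to": "HEALTHY"}\n',
        encoding="utf-8",
    )
    counts = count_event_types(session)
    print(dict(counts))
```

For a real session, pass the path of a recorded directory under results/sessions/ and adjust the field name to match the records it contains.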
Runtime phases are emitted as phase_transition records in events.jsonl.
Current phases:
- IDLE: no active workload
- HEALTHY: active workload without visible pressure
- QUEUE_PRESSURE: requests are waiting in the scheduler
- KV_PRESSURE_RISING: KV usage is high but not saturated
- SATURATED: KV usage is in the saturation range
Classification priority:
1. IDLE
2. SATURATED
3. QUEUE_PRESSURE
4. KV_PRESSURE_RISING
5. HEALTHY
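The priority order can be read as a first-match rule chain. The following is a minimal sketch of that idea, not KVScope's classifier: the function name, the input signals, and both thresholds are hypothetical, chosen only to make the ordering concrete.

```python
# Hypothetical thresholds in percent units; KVScope's real cutoffs may differ.
KV_HIGH_PCT = 70.0
KV_SATURATED_PCT = 90.0


def classify_phase(running: int, waiting: int, kv_usage_pct: float) -> str:
    """Return the first phase whose rule matches, in priority order."""
    if running == 0 and waiting == 0:
        return "IDLE"
    if kv_usage_pct >= KV_SATURATED_PCT:
        return "SATURATED"
    if waiting > 0:
        return "QUEUE_PRESSURE"
    if kv_usage_pct >= KV_HIGH_PCT:
        return "KV_PRESSURE_RISING"
    return "HEALTHY"


# Queued requests plus saturated KV: SATURATED wins over QUEUE_PRESSURE.
print(classify_phase(running=8, waiting=3, kv_usage_pct=95.0))
# Queued requests plus high KV: QUEUE_PRESSURE wins over KV_PRESSURE_RISING.
print(classify_phase(running=8, waiting=3, kv_usage_pct=75.0))
```

Under these assumptions, a sample that satisfies several rules at once always reports the highest-priority phase, which keeps the primary phase a single unambiguous value.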
Historical sessions may contain PREFILL_PRESSURE as a phase. KVScope readers continue to support those records for backward compatibility.
Runtime conditions are concurrent observations that can overlap with the primary phase.
Current conditions:
- PREFILL_PRESSURE: prefill-side pressure during active serving
- HIGH_PROMPT_INGESTION: high running request count plus high prompt token ingestion
- TTFT_ELEVATED: elevated time to first token
- QUEUE_DELAY: waiting requests or queue-time latency
Example:
State: KV_PRESSURE_RISING
Conditions:
PREFILL_PRESSURE
HIGH_PROMPT_INGESTION
Conditions are grouped into episodes when they are sustained. Episode records include:
- start_time
- end_time
- duration_sec
- peak_evidence
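One way to picture episode grouping is as a pass over time-ordered samples that merges consecutive active samples into a run and keeps only runs that last long enough. The sketch below follows that approach under stated assumptions: the `(timestamp, active, evidence)` tuple shape, the `min_duration_sec` cutoff of 5 seconds, and the use of a single numeric evidence value are all hypothetical. Only the four field names come from the episode record description.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    start_time: float
    end_time: float
    duration_sec: float
    peak_evidence: float


def group_episodes(samples, min_duration_sec=5.0):
    """Group consecutive active samples into sustained episodes.

    samples: iterable of (timestamp, active, evidence) tuples in time order.
    """
    episodes, run = [], []

    def flush():
        # Keep the current run only if it lasted long enough to count as sustained.
        if run and run[-1][0] - run[0][0] >= min_duration_sec:
            start, end = run[0][0], run[-1][0]
            peak = max(evidence for _, evidence in run)
            episodes.append(Episode(start, end, end - start, peak))
        run.clear()

    for ts, active, evidence in samples:
        if active:
            run.append((ts, evidence))
        else:
            flush()
    flush()  # close an episode still open at the end of the session
    return episodes


samples = [(0, False, 0.0), (2, True, 1.2), (4, True, 3.5), (8, True, 2.0),
           (10, False, 0.0), (12, True, 9.9), (14, False, 0.0)]
episodes = group_episodes(samples)
print(episodes)
```

In this example the run from t=2 to t=8 becomes an episode, while the single active sample at t=12 is dropped as too short to be sustained.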
kvscope timeline reads events.jsonl and renders operational behavior over time. It separates phase transitions from transient events such as TTFT_SPIKE, E2E_LATENCY_SPIKE, KV_SATURATION, and DECODE_THROUGHPUT_DROP. It also summarizes sustained condition episodes with duration and peak evidence.
Example:
python -m kvscope.cli timeline results/sessions/<session-id> --verbose
python -m kvscope.cli timeline results/sessions/<session-id> --json
Start a vLLM server:
vllm serve /path/to/model \
--host 127.0.0.1 \
--port 8080 \
--dtype float16 \
--max-model-len 2048 \
--gpu-memory-utilization 0.90
Record telemetry:
python -m kvscope.cli record --server http://127.0.0.1:8080 --duration 90
Run traffic with vllm bench serve or another OpenAI-compatible client.
Inspect the run:
python -m kvscope.cli timeline results/sessions/<session-id> --verbose
python -m kvscope.cli doctor results/sessions/<session-id> --verbose
Workflow summary:
vLLM server -> kvscope dashboard -> kvscope record -> vllm bench serve -> kvscope timeline -> kvscope doctor
Conference-readiness docs:
- docs/architecture.md
- docs/demo_walkthrough.md
- docs/conference_story.md
- docs/evidence_pack.md
- docs/cfp_notes.md
KVScope currently uses stdlib unittest.
python -m unittest discover -v
python -m compileall -q kvscope collectors analyzers visualizers tests