ssaketh-ch/KVscope

KVScope

KVScope is a lightweight LLM inference observability and diagnostics tool. It is vLLM-first and reconstructs operational behavior from live serving telemetry.

KVScope focuses on:

  • KV cache pressure
  • scheduler pressure and queueing
  • prefill pressure
  • TTFT spikes and latency instability
  • decode throughput collapse
  • preemption behavior
  • runtime phase transitions
  • sustained condition episodes

KVScope is not a benchmark harness, not a Grafana replacement, and not a generic metrics dashboard.

Core Invariant

KV cache usage is always stored internally as a percentage in the range 0.0 to 100.0.

Raw vLLM Prometheus metrics may expose vllm:kv_cache_usage_perc as a fraction such as 0.734. KVScope normalizes that to 73.4 at the adapter boundary.
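The normalization described above can be sketched as follows. This is a minimal illustration, not KVScope's adapter code: the helper names parse_metric_line and normalize_kv_usage and the model_name label value are hypothetical, the sample line is assumed to carry no trailing timestamp, and the raw value is assumed to always be a fraction.

```python
def parse_metric_line(line: str) -> tuple[str, float]:
    """Split one Prometheus text-format sample into (metric name, value).

    Assumes the line has no trailing timestamp, so the value is the last
    whitespace-separated token.
    """
    series, value = line.rsplit(None, 1)
    name = series.split("{", 1)[0]  # drop the {label="..."} part, if any
    return name, float(value)


def normalize_kv_usage(raw_fraction: float) -> float:
    """Convert a raw fraction to percent units and clamp to 0.0-100.0."""
    return min(100.0, max(0.0, raw_fraction * 100.0))


# Hypothetical sample line; only the metric name comes from the README.
name, raw = parse_metric_line('vllm:kv_cache_usage_perc{model_name="demo"} 0.734')
kv_percent = normalize_kv_usage(raw)  # approximately 73.4
```

Clamping at the boundary keeps the invariant intact even if a scrape returns a value slightly outside the expected range.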

CLI Commands

kvscope record --server http://127.0.0.1:8080 --duration 90
kvscope analyze results/sessions/<session-id>
kvscope timeline results/sessions/<session-id> --verbose
kvscope doctor results/sessions/<session-id> --verbose
kvscope dashboard --server http://127.0.0.1:8080 --density compact

The same commands can be run from source:

python -m kvscope.cli record --server http://127.0.0.1:8080 --duration 90
python -m kvscope.cli analyze results/sessions/<session-id>
python -m kvscope.cli timeline results/sessions/<session-id> --verbose
python -m kvscope.cli doctor results/sessions/<session-id> --verbose
python -m kvscope.cli dashboard --server http://127.0.0.1:8080 --density full

Dashboard density modes:

  • compact: default mode for normal terminal heights
  • full: includes detailed latency and cumulative counter panels

Session Outputs

kvscope record writes a session directory under results/sessions/ by default.

Core files:

  • metadata.json: session metadata, schema fields, source URL, and recorder settings
  • metrics.csv: normalized samples as CSV
  • samples.jsonl: normalized KVScopeSample records
  • events.jsonl: phase transitions, condition changes, condition episodes, and transient operational events
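
Because samples.jsonl and events.jsonl are JSON Lines files, they can be read with the standard library alone. The sketch below assumes one JSON object per line and tolerates blank lines; load_jsonl is a hypothetical helper, not part of the KVScope API, and the record fields are not specified here.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

For example, load_jsonl("results/sessions/<session-id>/events.jsonl") would return the recorded events in file order.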

Analysis files:

  • summary.json: aggregate session metrics from kvscope analyze
  • doctor.json: ranked diagnostic hypotheses from kvscope doctor

Runtime Phases

Runtime phases are emitted as phase_transition records in events.jsonl.

Current phases:

  • IDLE: no active workload
  • HEALTHY: active workload without visible pressure
  • QUEUE_PRESSURE: requests are waiting in the scheduler
  • KV_PRESSURE_RISING: KV usage is high but not saturated
  • SATURATED: KV usage is in the saturation range

Classification priority:

  1. IDLE
  2. SATURATED
  3. QUEUE_PRESSURE
  4. KV_PRESSURE_RISING
  5. HEALTHY
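
The priority order above can be read as a chain of checks in which the first match wins. The following is a sketch under stated assumptions: the thresholds KV_HIGH_PERCENT and KV_SATURATED_PERCENT, the input names, and the rule that IDLE means no running and no waiting requests are all hypothetical, since the actual values and signals are not given here.

```python
# Hypothetical thresholds in percent units (0.0 to 100.0).
KV_HIGH_PERCENT = 80.0
KV_SATURATED_PERCENT = 95.0


def classify_phase(running: int, waiting: int, kv_percent: float) -> str:
    """Return the primary phase, checking candidates in priority order."""
    if running == 0 and waiting == 0:
        return "IDLE"
    if kv_percent >= KV_SATURATED_PERCENT:
        return "SATURATED"
    if waiting > 0:
        return "QUEUE_PRESSURE"
    if kv_percent >= KV_HIGH_PERCENT:
        return "KV_PRESSURE_RISING"
    return "HEALTHY"
```

With these assumed thresholds, a sample with waiting requests and 97.0 percent KV usage is reported as SATURATED rather than QUEUE_PRESSURE, because SATURATED is checked first.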

Historical sessions may contain PREFILL_PRESSURE as a phase. KVScope readers continue to support those records for backward compatibility.

Runtime Conditions

Runtime conditions are concurrent observations that can overlap with the primary phase.

Current conditions:

  • PREFILL_PRESSURE: prefill-side pressure during active serving
  • HIGH_PROMPT_INGESTION: high running request count plus high prompt token ingestion
  • TTFT_ELEVATED: elevated time to first token
  • QUEUE_DELAY: waiting requests or queue-time latency

Example:

State: KV_PRESSURE_RISING
Conditions:
  PREFILL_PRESSURE
  HIGH_PROMPT_INGESTION

Conditions are grouped into episodes when they are sustained. Episode records include:

  • start_time
  • end_time
  • duration_sec
  • peak_evidence
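
One way to picture episode grouping is to merge consecutive samples where a condition is active and keep only runs that last long enough. This sketch is illustrative only: the (timestamp, active, evidence) input shape, the min_duration_sec cutoff, and treating peak_evidence as a single number are assumptions, not KVScope's actual rules.

```python
def _finish(run, episodes, min_duration_sec):
    """Compute the run's duration and keep it only if it was sustained."""
    run["duration_sec"] = run["end_time"] - run["start_time"]
    if run["duration_sec"] >= min_duration_sec:
        episodes.append(run)


def group_episodes(samples, min_duration_sec=5.0):
    """Group time-ordered (timestamp_sec, active, evidence) tuples into episodes."""
    episodes = []
    current = None  # open run of consecutive active samples
    for timestamp, active, evidence in samples:
        if active:
            if current is None:
                current = {
                    "start_time": timestamp,
                    "end_time": timestamp,
                    "peak_evidence": evidence,
                }
            else:
                current["end_time"] = timestamp
                current["peak_evidence"] = max(current["peak_evidence"], evidence)
        elif current is not None:
            _finish(current, episodes, min_duration_sec)
            current = None
    if current is not None:  # condition still active at end of session
        _finish(current, episodes, min_duration_sec)
    return episodes
```

Short blips that end before min_duration_sec are dropped, which is what distinguishes a sustained episode from a transient event in this sketch.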

Event Timeline

kvscope timeline reads events.jsonl and renders operational behavior over time. It separates phase transitions from transient events such as TTFT_SPIKE, E2E_LATENCY_SPIKE, KV_SATURATION, and DECODE_THROUGHPUT_DROP. It also summarizes sustained condition episodes with duration and peak evidence.

Example:

python -m kvscope.cli timeline results/sessions/<session-id> --verbose
python -m kvscope.cli timeline results/sessions/<session-id> --json
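
The separation of phase transitions from transient events can be sketched as a simple partition over event records. The record keys "type" and "name" are hypothetical, since the events.jsonl schema is not documented here, and the set of transient event names contains only the four examples named above, which may not be exhaustive.

```python
# Transient event names listed in this README; the real set may be larger.
TRANSIENT_EVENTS = {
    "TTFT_SPIKE",
    "E2E_LATENCY_SPIKE",
    "KV_SATURATION",
    "DECODE_THROUGHPUT_DROP",
}


def split_events(records):
    """Partition event dicts into (phase transitions, transient events, other).

    The "type" and "name" keys are assumed for illustration only.
    """
    transitions, transients, other = [], [], []
    for record in records:
        if record.get("type") == "phase_transition":
            transitions.append(record)
        elif record.get("name") in TRANSIENT_EVENTS:
            transients.append(record)
        else:
            other.append(record)
    return transitions, transients, other
```

Records that match neither rule, such as condition episodes, land in the third list so nothing is silently discarded.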

Example Workflow

Start a vLLM server:

vllm serve /path/to/model \
  --host 127.0.0.1 \
  --port 8080 \
  --dtype float16 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.90

Record telemetry:

python -m kvscope.cli record --server http://127.0.0.1:8080 --duration 90

Run traffic with vllm bench serve or another OpenAI-compatible client.

Inspect the run:

python -m kvscope.cli timeline results/sessions/<session-id> --verbose
python -m kvscope.cli doctor results/sessions/<session-id> --verbose

Workflow summary:

vLLM server -> kvscope dashboard -> kvscope record -> vllm bench serve -> kvscope timeline -> kvscope doctor

Conference-readiness docs:

  • docs/architecture.md
  • docs/demo_walkthrough.md
  • docs/conference_story.md
  • docs/evidence_pack.md
  • docs/cfp_notes.md

Tests

KVScope currently uses stdlib unittest.

python -m unittest discover -v
python -m compileall -q kvscope collectors analyzers visualizers tests
