
Commit d826558

report job metrics back to orchestrator.
1 parent 3a0b23e · commit d826558

25 files changed

Lines changed: 1291 additions & 161 deletions

README.md

Lines changed: 7 additions & 0 deletions
```diff
@@ -40,6 +40,7 @@ def generate(ctx: ActionContext, payload: Input) -> Output:
 - **Model injection** - Dependency injection for ML models with caching
 - **Streaming output** - Support for incremental/streaming responses
 - **Progress reporting** - Built-in progress events via `ActionContext`
+- **Perf metrics** - Best-effort per-run metrics emitted to gen-orchestrator (`metrics.*` worker events)
 - **File handling** - Upload/download assets via Cozy hub file API
 - **Model caching** - LRU cache with VRAM/disk management and cache-aware routing
@@ -159,6 +160,12 @@ Local dev / advanced (not injected by orchestrator):
 | `COZY_HUB_TOKEN` | - | Local dev only: Cozy Hub bearer token (only used when `WORKER_ALLOW_COZY_HUB_API_RESOLVE=1`) |
 | `HF_TOKEN` | - | Hugging Face token (for private `hf:` refs) |
 
+## Metrics
+
+The worker can emit best-effort performance/debug metrics to gen-orchestrator via `WorkerEvent` messages.
+
+See `docs/metrics.md`.
+
 ### Hugging Face (`hf:`) download behavior
 
 By default, `hf:` model refs **do not download the full repo**. The worker uses `huggingface_hub.snapshot_download(allow_patterns=...)` to avoid pulling huge legacy weights.
```
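For illustration, a minimal sketch of the selective-download pattern described above, using the public `huggingface_hub` API. The repo id and `allow_patterns` below are illustrative placeholders, not the worker's actual configuration:

```python
# Sketch only: selective snapshot download, as the README describes.
# The exact allow_patterns the worker uses are not shown in this diff;
# these patterns are illustrative placeholders.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="black-forest-labs/FLUX.2-klein-4B",
    allow_patterns=["*.json", "*.txt", "*.safetensors"],  # skip legacy .bin/.ckpt weights
)
print(local_dir)  # local path of the cached snapshot
```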

agents/progress.json

Lines changed: 37 additions & 4 deletions
Large diffs are not rendered by default.

docs/metrics.md

Lines changed: 61 additions & 0 deletions
```diff
@@ -0,0 +1,61 @@
+# Worker-Reported Perf Metrics (v1)
+
+`gen-worker` can optionally report best-effort performance/debug metrics to `gen-orchestrator` via the existing gRPC stream `WorkerSchedulerMessage.worker_event`.
+
+These metrics are:
+
+- Best-effort: metrics emission must never fail a run.
+- Safe: numbers and small strings only. No URLs, secrets, or file paths.
+- Optional: only emit keys when known; omit unknown fields entirely.
+
+## Canonical Events
+
+These event types are designed to be stable and low-cardinality. `gen-orchestrator` can persist them into dedicated columns.
+
+- `metrics.compute.started` payload: `{ "at": "<rfc3339>" }`
+- `metrics.compute.completed` payload: `{ "at": "<rfc3339>" }`
+- `metrics.fetch` payload: `{ "ms": <int> }` (use `0` for warm disk hits)
+- `metrics.gpu_load` payload: `{ "ms": <int> }`
+- `metrics.inference` payload: `{ "ms": <int> }`
+- `metrics.tokens` payload: `{ "output_tokens": <int> }` (only when applicable)
+
+All times are milliseconds as integers.
+
+## Extended Debug Event
+
+Additionally, the worker emits one extended event at the end of each run:
+
+- `metrics.run` payload: JSON object (schema versioned)
+
+### `metrics.run` payload (schema_version=1)
+
+Top-level keys (all optional unless noted):
+
+- `schema_version` (required): `1`
+- `function_name`: string
+- `cache_state`: `hot_vram | warm_disk | cold_remote`
+- `models`: array of objects (best-effort, one per required model)
+- `pipeline_init_ms`: int
+- `gpu_load_ms`: int
+- `warmup_ms`: int (only for the first warmup run; otherwise omit)
+- `inference_ms`: int
+- diffusion extras (optional): `steps`, `iters_per_s`, `width`, `height`, `guidance`
+- post-processing (optional): `png_encode_ms`, `upload_ms`
+- resources (optional): `peak_vram_bytes`, `peak_ram_bytes`
+
+Per-model object keys (all optional unless noted):
+
+- `model_id` (required): canonical model id used by worker/scheduler
+- `variant_label`: string
+- `snapshot_digest`: string
+- `cache_state`: `hot_vram | warm_disk | cold_remote`
+- `bytes_downloaded`: int (0 if none)
+- `download_ms`: int (0 if warm disk hit)
+- `bytes_read_disk`: int
+
+## Notes
+
+- `metrics.fetch` is primarily the time spent ensuring required model blobs are present on disk (remote download vs. warm disk hit).
+- `metrics.gpu_load` is best-effort and currently reflects time spent moving injected model objects to the worker device, when supported.
+- `metrics.inference` is best-effort and currently reflects time spent executing the user function body (not including scheduler queueing).
```

examples/flux2-klein-4b/README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -4,6 +4,7 @@ FLUX.2-klein-4B example using Cozy’s injection pattern.
 
 - The worker function only defines input/output + runs inference.
 - Model selection + downloading is handled by the worker runtime via `[tool.cozy.models]`.
+- This model is treated as a turbo model: the worker forces `num_inference_steps=8`.
 
 Config:
 
```

examples/flux2-klein-4b/pyproject.toml

Lines changed: 2 additions & 1 deletion
```diff
@@ -38,4 +38,5 @@ torch = "2.10.0"
 vram_gb = 12
 
 [tool.cozy.models]
-flux2-klein-4b = "hf:black-forest-labs/FLUX.2-klein-4B"
+# Use Cozy Hub snapshot (backed by R2) instead of pulling from Hugging Face at runtime.
+flux2-klein-4b = "black-forest-labs/flux-2-klein-4b"
```

examples/flux2-klein-4b/src/flux2_klein_4b/main.py

Lines changed: 12 additions & 3 deletions
```diff
@@ -24,7 +24,9 @@
 
 class GenerateInput(msgspec.Struct):
     prompt: str
-    num_inference_steps: int = 4
+    # FLUX.2-klein-4B is a turbo model: we always run at 8 steps.
+    # Keep the field for API compatibility, but ignore it in `generate()`.
+    num_inference_steps: int = 8
     guidance_scale: float = 1.0
     width: int = 1024
     height: int = 1024
@@ -51,7 +53,14 @@ def generate(
     if ctx.is_canceled():
         raise InterruptedError("canceled")
 
-    logger.info("[run_id=%s] flux2-klein-4b prompt=%r", ctx.run_id, payload.prompt)
+    steps = 8  # forced turbo steps
+    logger.info(
+        "[run_id=%s] flux2-klein-4b prompt=%r steps=%s (forced, requested=%s)",
+        ctx.run_id,
+        payload.prompt,
+        steps,
+        payload.num_inference_steps,
+    )
 
     # FLUX.2-klein-4B can exceed 8GB VRAM; use sequential CPU offload by default.
     if torch.cuda.is_available() and _should_enable_seq_offload():
@@ -66,7 +75,7 @@ def generate(
 
     result = pipeline(
         prompt=payload.prompt,
-        num_inference_steps=payload.num_inference_steps,
+        num_inference_steps=steps,
         guidance_scale=payload.guidance_scale,
         width=payload.width,
         height=payload.height,
```

examples/sd15/README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -4,6 +4,7 @@ Stable Diffusion 1.5 example using Cozy’s injection pattern.
 
 - The worker function only defines input/output + runs inference.
 - Model selection + downloading is handled by the worker runtime via `[tool.cozy.models]`.
+- The worker clamps `num_inference_steps` to a minimum of 25 for quality.
 
 Config:
 
```

examples/sd15/pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 [project]
 name = "sd15"
-version = "0.1.0"
+version = "0.1.1"
 description = "Stable Diffusion 1.5 example (inference-only; models via [tool.cozy.models])"
 requires-python = ">=3.12"
 dependencies = [
```

examples/sd15/src/sd15/main.py

Lines changed: 14 additions & 3 deletions
```diff
@@ -23,7 +23,7 @@
 class GenerateInput(msgspec.Struct):
     prompt: str
     negative_prompt: str = ""
-    num_inference_steps: int = 20
+    num_inference_steps: int = 25
     guidance_scale: float = 7.5
     width: int = 512
     height: int = 512
@@ -45,7 +45,18 @@ def generate(
     if ctx.is_canceled():
         raise InterruptedError("canceled")
 
-    logger.info("[run_id=%s] sd15 prompt=%r", ctx.run_id, payload.prompt)
+    requested_steps = payload.num_inference_steps
+    steps = requested_steps
+    if steps < 25:
+        steps = 25
+
+    logger.info(
+        "[run_id=%s] sd15 prompt=%r steps=%s (requested=%s)",
+        ctx.run_id,
+        payload.prompt,
+        steps,
+        requested_steps,
+    )
 
     generator = None
     if payload.seed is not None:
@@ -57,7 +68,7 @@
     result = pipeline(
         prompt=payload.prompt,
         negative_prompt=payload.negative_prompt,
-        num_inference_steps=payload.num_inference_steps,
+        num_inference_steps=steps,
         guidance_scale=payload.guidance_scale,
         width=payload.width,
         height=payload.height,
```

examples/sd15/uv.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.
