diff --git a/language/edge-agentic/README.md b/language/edge-agentic/README.md new file mode 100644 index 0000000000..990bba878d --- /dev/null +++ b/language/edge-agentic/README.md @@ -0,0 +1,123 @@ +# MLPerf Inference Reference Implementation — Edge Agentic (BFCL v4) + +This is the reference implementation for the **Edge Agentic** workload, using +[Berkeley Function Calling Leaderboard v4 (BFCL v4)](https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v4.html) +as the accuracy benchmark. + +The implementation lives in the +[`mlcommons/endpoints`](https://github.com/mlcommons/endpoints) repository. +The runnable example, configuration file, and step-by-step reproducer are at: + +> **[`examples/10_Edge_Agentic_Example/`](https://github.com/mlcommons/endpoints/tree/main/examples/10_Edge_Agentic_Example)** + +The gated accuracy benchmark is **single-turn only** (3 categories, +per-category sampled to a stable ~995-sample point estimate). Multi-turn remains +available as an optional exploratory run but is **not** part of the accuracy +gate. + +--- + +## Model + +| Property | Value | +| --- | --- | +| Model | `Qwen/Qwen3.6-27B` (Q4\_K\_M GGUF quantization validated) | +| Server | Any OpenAI-compatible endpoint (`/v1/chat/completions`) | +| Validated server | [`llama.cpp llama-server`](https://github.com/ggml-org/llama.cpp) (commit `cfff1fc`) built with CUDA on NVIDIA Jetson AGX Thor, `--reasoning off --ctx-size 32768 -np 1` | +| HuggingFace ID | [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B) | + +--- + +## Dataset + +| Property | Value | +| --- | --- | +| Source | [`gorilla-llm/gorilla-eval-set`](https://huggingface.co/datasets/gorilla-llm/gorilla-eval-set) (HuggingFace, public) | +| Download | Automatic at runtime — no separate download step required | +| Gated subsets (single-turn) | `non_live`, `live`, `hallucination` | +| QSL size | ~995 (per-category sampling, see Run Parameters) | +| Multi-turn subsets | `multi_turn_base`, `multi_turn_miss_func`, `multi_turn_miss_param`, `multi_turn_long_context` — **optional, not gated** | + +--- + +## Run Parameters + +| Parameter | Value | +| --- | --- | +| `temperature` | `0` (deterministic, greedy) | +| `top_p` / `top_k` | server default — unused at `temperature 0` | +| `max_tokens` (max\_osl) | `1024` | +| `seed` | `42` | +| `tool_choice` | `auto` | +| ST sampling: `non_live` | 62% (~712 samples) | +| ST sampling: `live` | 10% (subsets ≤ 25 samples taken in full → ~171 samples) | +| ST sampling: `hallucination` | 10% (~112 samples) | +| `subset_floor` | 25 (subsets ≤ 25 taken in full) | + +Total ≈ **995** single-turn samples — large enough for a stable point estimate. + +--- + +## Accuracy Targets + +The pass/fail criterion is a **3% one-sided band** anchored on the validated +Jetson AGX Thor `Qwen3.6-27B-Q4_K_M` reference: a submission passes if its +single-turn score is **≥ 0.97 × reference**, with no upper bound (a higher score +never fails). Accuracy is hardware-independent (deterministic at `temperature 0` ++ fixed seed), so the same thresholds apply on any device. + +| Metric | Reference Score | Pass threshold (0.97 ×) | +| --- | --- | --- | +| Single-turn **overall** (gated) | 86.23% | **≥ 83.64%** | +| Single-turn **non_live-normalized** (gated) | 87.96% | **≥ 85.32%** | +| `non_live` (AST) | 82.59% | not individually gated | +| `live` | 84.12% | not individually gated | +| `hallucination` | 97.16% | not individually gated | + +The two gated metrics are encoded in the ruleset at +[`src/inference_endpoint/config/rulesets/mlcommons/models.py`](https://github.com/mlcommons/endpoints/blob/main/src/inference_endpoint/config/rulesets/mlcommons/models.py) +(`Qwen3_6_27B.accuracy_target_settings`). + +> The optional multi-turn run is **not gated**. For reference, a single run of +> the full 200-entry `multi_turn_base` (no sampling) scored 140/200 = 70.00%, in +> parity with evalscope. + +--- + +## Reproducing the Results + +See the full step-by-step guide at: + +**[`mlcommons/endpoints — examples/10_Edge_Agentic_Example/`](https://github.com/mlcommons/endpoints/tree/main/examples/10_Edge_Agentic_Example)** + +Quick start — accuracy-only (~3 h on an edge device): + +```bash +git clone https://github.com/mlcommons/endpoints.git +cd endpoints +pip install -e ".[bfcl]" +cd examples/10_Edge_Agentic_Example +# Edit model_params.name + endpoint_config.endpoints in the config to match your +# server, then: +inference-endpoint benchmark from-config \ + --config online_edge_full_run.yaml \ + --accuracy-only +``` + +Drop `--accuracy-only` to run the mandated combined benchmark (performance + +accuracy back-to-back, ~5.5 h) from the same config — this prevents performance +and accuracy from being measured under different settings. + +--- + +## Scenario + +The **gated metric is single-turn function-calling accuracy** (offline scenario, +single worker, single connection for deterministic per-sample ordering). + +The same `online_edge_full_run.yaml` config also defines a **performance phase** +(single-stream replay of recorded agentic-coding trajectories, scored by an +inline online checker) that runs back-to-back with the accuracy phase. The two +phases share one config so a submission cannot use different settings for +performance and accuracy. Absolute latency/throughput are hardware-specific; +only accuracy is hardware-independent and gated.