From 7864e48f3c46f9568a9d38f5ea49b2a3cef223d7 Mon Sep 17 00:00:00 2001
From: palanivelg
Date: Mon, 8 Jun 2026 10:38:18 -0700
Subject: [PATCH 1/2] feat(edge-agentic): add BFCL v4 edge-agentic reference implementation folder

---
 language/edge-agentic/README.md | 93 +++++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)
 create mode 100644 language/edge-agentic/README.md

diff --git a/language/edge-agentic/README.md b/language/edge-agentic/README.md
new file mode 100644
index 0000000000..64952f08d8
--- /dev/null
+++ b/language/edge-agentic/README.md
@@ -0,0 +1,93 @@
+# MLPerf Inference Reference Implementation — Edge Agentic (BFCL v4)
+
+This is the reference implementation for the **Edge Agentic** workload, using
+[Berkeley Function Calling Leaderboard v4 (BFCL v4)](https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v4.html)
+as the accuracy benchmark.
+
+The implementation lives in the
+[`mlcommons/endpoints`](https://github.com/mlcommons/endpoints) repository.
+The runnable example, configuration files, and a one-script reproducer are at:
+
+> **[`examples/10_Edge_Agentic_Example/`](https://github.com/mlcommons/endpoints/tree/main/examples/10_Edge_Agentic_Example)**
+
+---
+
+## Model
+
+| Property | Value |
+| --- | --- |
+| Model | `Qwen/Qwen3.6-27B` (Q4\_K\_M GGUF quantization validated) |
+| Server | Any OpenAI-compatible endpoint (`/v1/chat/completions`) |
+| Validated server | [`llama.cpp llama-server`](https://github.com/ggml-org/llama.cpp) built with CUDA on NVIDIA Jetson Thor |
+| HuggingFace ID | [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B) |
+
+---
+
+## Dataset
+
+| Property | Value |
+| --- | --- |
+| Source | [`gorilla-llm/gorilla-eval-set`](https://huggingface.co/datasets/gorilla-llm/gorilla-eval-set) (HuggingFace, public) |
+| Download | Automatic at runtime — no separate download step required |
+| Single-turn subsets | `non_live`, `live`, `hallucination` |
+| Multi-turn subsets | `multi_turn_base`, `multi_turn_miss_func`, `multi_turn_miss_param`, `multi_turn_long_context` |
+
+---
+
+## Run Parameters
+
+| Parameter | Value |
+| --- | --- |
+| `temperature` | `0` (deterministic) |
+| `top_p` | TBD |
+| `top_k` | TBD |
+| `max_tokens` (max\_osl) | TBD |
+| `seed` | `42` |
+| `tool_choice` | `auto` |
+| ST sampling: `non_live` | 20% |
+| ST sampling: `live` | 10% (subsets ≤ 25 samples taken in full) |
+| ST sampling: `hallucination` | 5% |
+| MT sampling | 3% per subset |
+
+---
+
+## Accuracy Targets
+
+| Metric | Reference Score | Threshold (99%) |
+| --- | --- | --- |
+| Single-turn overall | 87.50% | TBD |
+| `non_live` (AST) | 86.98% | TBD |
+| `live` | 84.12% | TBD |
+| `hallucination` | 94.32% | TBD |
+| Multi-turn overall (3% sample) | 45.84% | TBD |
+| `multi_turn_base` (full, 200 entries) | 70.00% | TBD |
+
+> Accuracy thresholds (99% of reference score) are TBD pending MLCommons
+> working group review.
+
+---
+
+## Reproducing the Results
+
+See the full step-by-step guide and the `run_accuracy.sh` one-script reproducer at:
+
+**[`mlcommons/endpoints — examples/10_Edge_Agentic_Example/`](https://github.com/mlcommons/endpoints/tree/main/examples/10_Edge_Agentic_Example)**
+
+Quick start (≈ 2.5 h on an edge device):
+
+```bash
+git clone https://github.com/mlcommons/endpoints.git
+cd endpoints
+pip install -e ".[bfcl]"
+cd examples/10_Edge_Agentic_Example
+MODEL=Qwen3.6-27B-Q4_K_M ENDPOINT=http://localhost:8080 bash run_accuracy.sh
+```
+
+---
+
+## Scenario
+
+This workload runs in **accuracy-only** mode (offline scenario, single worker,
+single connection for deterministic per-sample ordering). There is no
+performance (throughput) target for this workload — the benchmark measures
+function-calling accuracy, not QPS.
From 7af6237aad32003d76d7a85831ac5dac2b24c0ec Mon Sep 17 00:00:00 2001
From: Palanivelg
Date: Tue, 23 Jun 2026 17:05:23 -0700
Subject: [PATCH 2/2] docs(edge-agentic): finalize single-turn gate, sampling, combined-config reproducer

---
 language/edge-agentic/README.md | 86 ++++++++++++++++++++++-----------
 1 file changed, 58 insertions(+), 28 deletions(-)

diff --git a/language/edge-agentic/README.md b/language/edge-agentic/README.md
index 64952f08d8..990bba878d 100644
--- a/language/edge-agentic/README.md
+++ b/language/edge-agentic/README.md
@@ -6,10 +6,15 @@ as the accuracy benchmark.
 
 The implementation lives in the
 [`mlcommons/endpoints`](https://github.com/mlcommons/endpoints) repository.
-The runnable example, configuration files, and a one-script reproducer are at:
+The runnable example, configuration file, and step-by-step reproducer are at:
 
 > **[`examples/10_Edge_Agentic_Example/`](https://github.com/mlcommons/endpoints/tree/main/examples/10_Edge_Agentic_Example)**
 
+The gated accuracy benchmark is **single-turn only** (3 categories,
+per-category sampled to a stable ~995-sample point estimate). Multi-turn remains
+available as an optional exploratory run but is **not** part of the accuracy
+gate.
+
 ---
 
 ## Model
@@ -18,7 +23,7 @@ The runnable example, configuration files, and a one-script reproducer are at:
 | --- | --- |
 | Model | `Qwen/Qwen3.6-27B` (Q4\_K\_M GGUF quantization validated) |
 | Server | Any OpenAI-compatible endpoint (`/v1/chat/completions`) |
-| Validated server | [`llama.cpp llama-server`](https://github.com/ggml-org/llama.cpp) built with CUDA on NVIDIA Jetson Thor |
+| Validated server | [`llama.cpp llama-server`](https://github.com/ggml-org/llama.cpp) (commit `cfff1fc`) built with CUDA on NVIDIA Jetson AGX Thor, `--reasoning off --ctx-size 32768 -np 1` |
 | HuggingFace ID | [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B) |
 
 ---
@@ -29,8 +34,9 @@ The runnable example, configuration files, and a one-script reproducer are at:
 | --- | --- |
 | Source | [`gorilla-llm/gorilla-eval-set`](https://huggingface.co/datasets/gorilla-llm/gorilla-eval-set) (HuggingFace, public) |
 | Download | Automatic at runtime — no separate download step required |
-| Single-turn subsets | `non_live`, `live`, `hallucination` |
-| Multi-turn subsets | `multi_turn_base`, `multi_turn_miss_func`, `multi_turn_miss_param`, `multi_turn_long_context` |
+| Gated subsets (single-turn) | `non_live`, `live`, `hallucination` |
+| QSL size | ~995 (per-category sampling, see Run Parameters) |
+| Multi-turn subsets | `multi_turn_base`, `multi_turn_miss_func`, `multi_turn_miss_param`, `multi_turn_long_context` — **optional, not gated** |
 
 ---
 
@@ -38,56 +44,80 @@ The runnable example, configuration files, and a one-script reproducer are at:
 
 | Parameter | Value |
 | --- | --- |
-| `temperature` | `0` (deterministic) |
-| `top_p` | TBD |
-| `top_k` | TBD |
-| `max_tokens` (max\_osl) | TBD |
+| `temperature` | `0` (deterministic, greedy) |
+| `top_p` / `top_k` | server default — unused at `temperature 0` |
+| `max_tokens` (max\_osl) | `1024` |
 | `seed` | `42` |
 | `tool_choice` | `auto` |
-| ST sampling: `non_live` | 20% |
-| ST sampling: `live` | 10% (subsets ≤ 25 samples taken in full) |
-| ST sampling: `hallucination` | 5% |
-| MT sampling | 3% per subset |
+| ST sampling: `non_live` | 62% (~712 samples) |
+| ST sampling: `live` | 10% (subsets ≤ 25 samples taken in full → ~171 samples) |
+| ST sampling: `hallucination` | 10% (~112 samples) |
+| `subset_floor` | 25 (subsets ≤ 25 taken in full) |
+
+Total ≈ **995** single-turn samples — large enough for a stable point estimate.
 
 ---
 
 ## Accuracy Targets
 
-| Metric | Reference Score | Threshold (99%) |
+The pass/fail criterion is a **3% one-sided band** anchored on the validated
+Jetson AGX Thor `Qwen3.6-27B-Q4_K_M` reference: a submission passes if its
+single-turn score is **≥ 0.97 × reference**, with no upper bound (a higher score
+never fails). Accuracy is hardware-independent (deterministic at `temperature 0`
++ fixed seed), so the same thresholds apply on any device.
+
+| Metric | Reference Score | Pass threshold (0.97 ×) |
 | --- | --- | --- |
-| Single-turn overall | 87.50% | TBD |
-| `non_live` (AST) | 86.98% | TBD |
-| `live` | 84.12% | TBD |
-| `hallucination` | 94.32% | TBD |
-| Multi-turn overall (3% sample) | 45.84% | TBD |
-| `multi_turn_base` (full, 200 entries) | 70.00% | TBD |
+| Single-turn **overall** (gated) | 86.23% | **≥ 83.64%** |
+| Single-turn **non_live-normalized** (gated) | 87.96% | **≥ 85.32%** |
+| `non_live` (AST) | 82.59% | not individually gated |
+| `live` | 84.12% | not individually gated |
+| `hallucination` | 97.16% | not individually gated |
 
-> Accuracy thresholds (99% of reference score) are TBD pending MLCommons
-> working group review.
+The two gated metrics are encoded in the ruleset at
+[`src/inference_endpoint/config/rulesets/mlcommons/models.py`](https://github.com/mlcommons/endpoints/blob/main/src/inference_endpoint/config/rulesets/mlcommons/models.py)
+(`Qwen3_6_27B.accuracy_target_settings`).
+
+> The optional multi-turn run is **not gated**. For reference, a single run of
+> the full 200-entry `multi_turn_base` (no sampling) scored 140/200 = 70.00%, in
+> parity with evalscope.
 
 ---
 
 ## Reproducing the Results
 
-See the full step-by-step guide and the `run_accuracy.sh` one-script reproducer at:
+See the full step-by-step guide at:
 
 **[`mlcommons/endpoints — examples/10_Edge_Agentic_Example/`](https://github.com/mlcommons/endpoints/tree/main/examples/10_Edge_Agentic_Example)**
 
-Quick start (≈ 2.5 h on an edge device):
+Quick start — accuracy-only (~3 h on an edge device):
 
 ```bash
 git clone https://github.com/mlcommons/endpoints.git
 cd endpoints
 pip install -e ".[bfcl]"
 cd examples/10_Edge_Agentic_Example
-MODEL=Qwen3.6-27B-Q4_K_M ENDPOINT=http://localhost:8080 bash run_accuracy.sh
+# Edit model_params.name + endpoint_config.endpoints in the config to match your
+# server, then:
+inference-endpoint benchmark from-config \
+  --config online_edge_full_run.yaml \
+  --accuracy-only
 ```
 
+Drop `--accuracy-only` to run the mandated combined benchmark (performance +
+accuracy back-to-back, ~5.5 h) from the same config — this prevents performance
+and accuracy from being measured under different settings.
+
 ---
 
 ## Scenario
 
-This workload runs in **accuracy-only** mode (offline scenario, single worker,
-single connection for deterministic per-sample ordering). There is no
-performance (throughput) target for this workload — the benchmark measures
-function-calling accuracy, not QPS.
+The **gated metric is single-turn function-calling accuracy** (offline scenario,
+single worker, single connection for deterministic per-sample ordering).
+
+The same `online_edge_full_run.yaml` config also defines a **performance phase**
+(single-stream replay of recorded agentic-coding trajectories, scored by an
+inline online checker) that runs back-to-back with the accuracy phase. The two
+phases share one config so a submission cannot use different settings for
+performance and accuracy. Absolute latency/throughput are hardware-specific;
+only accuracy is hardware-independent and gated.
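
Reviewer note: the pass thresholds and the sample total stated in PATCH 2/2 can be re-derived from the README's own numbers. The sketch below is only a sanity check under stated assumptions. The names `BAND`, `REFERENCE`, `SAMPLES`, `pass_threshold`, and `passes` are hypothetical and are not taken from `mlcommons/endpoints`. The actual gate is the ruleset referenced in the patch, which this does not reproduce, and rounding the threshold to two decimals before comparing is an assumption.

```python
# Illustrative sanity check; every name here is hypothetical, not the ruleset API.
BAND = 0.97  # one-sided 3% band described under "Accuracy Targets"

# Reference scores (percent) for the two gated metrics, copied from the README table.
REFERENCE = {
    "single_turn_overall": 86.23,
    "single_turn_non_live_normalized": 87.96,
}

# Approximate per-category sample counts from the "Run Parameters" table.
SAMPLES = {"non_live": 712, "live": 171, "hallucination": 112}


def pass_threshold(reference_pct: float) -> float:
    """Lower bound for a passing score, rounded to two decimals (assumed)."""
    return round(BAND * reference_pct, 2)


def passes(scores: dict[str, float]) -> bool:
    """True when every gated metric meets its threshold; there is no upper bound."""
    return all(scores[name] >= pass_threshold(ref) for name, ref in REFERENCE.items())


for name, ref in REFERENCE.items():
    print(f"{name}: reference {ref:.2f}% -> pass at >= {pass_threshold(ref):.2f}%")
print(f"total single-turn samples: {sum(SAMPLES.values())}")
```

This yields 83.64% and 85.32% for the two gated metrics and a total of 995 samples, matching the values in the patch.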