# Structure Evaluation Datasets

Organise evaluation data so each dataset file tests one concern and each metric is declared exactly once — in `agent.yaml`, not in the JSONL.

## The rule: one dataset, one concern

Metrics are declared at the dataset level in the manifest, not per sample in the JSONL file.
A dataset file contains only data — inputs, expected outputs, and optional context.

This means: **if you need different metrics, use different dataset files.**

```yaml
spec:
  evaluation:
    framework: ragas
    datasets:
      - name: rag-quality
        path: $file:evals/rag.jsonl
        metrics: [faithfulness, context_recall, answer_relevancy]

      - name: safety
        path: $file:evals/safety.jsonl
        metrics: [toxicity, bias]

      - name: accuracy
        path: $file:evals/accuracy.jsonl
        metrics: [answer_similarity, hallucination]

    thresholds:
      faithfulness: 0.80
      context_recall: 0.75
      answer_relevancy: 0.75
      toxicity: 0.90
      bias: 0.85
      answer_similarity: 0.80
      hallucination: 0.05
    ciGate: true
```

## JSONL sample format

Each line in a dataset file is a JSON object. All fields except `input` and `expected` are optional.

```jsonl
{"input": "What is RAG?", "expected": "Retrieval Augmented Generation", "context": ["RAG combines a retrieval step..."], "tags": ["basics"]}
{"input": "How does vector search work?", "expected": "By comparing embedding distances", "context": ["Vectors are high-dimensional..."], "reference_contexts": ["Embeddings encode semantic meaning..."], "tags": ["rag", "advanced"], "metadata": {"difficulty": "medium"}}
```

| Field | Required | Description |
|---|---|---|
| `input` | yes | User query sent to the agent |
| `expected` | yes | Expected output — used for `answer_similarity` and `string_match` scoring |
| `context` | for RAG metrics | Retrieved chunks the agent used. Required for `faithfulness`, `context_precision`, `hallucination` |
| `reference_contexts` | for `context_recall` | Ground-truth relevant chunks. Required for `context_recall` |
| `tags` | no | Labels for filtering with `--tag` |
| `metadata` | no | Arbitrary key/value pairs reported in output (e.g. `{"difficulty": "hard", "source": "prod-logs"}`) |

## Which metrics need which fields

| Metric | `context` | `reference_contexts` |
|---|---|---|
| `answer_similarity` | no | no |
| `answer_relevancy` | no | no |
| `hallucination` | yes | no |
| `faithfulness` | yes | no |
| `context_precision` | yes | no |
| `context_recall` | yes | yes |
| `toxicity` | no | no |
| `bias` | no | no |

If a dataset declares a RAG metric but its samples have no `context` field, the evaluation framework will error or return meaningless scores. Splitting by concern prevents this.
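
A quick pre-flight check can catch that mismatch before a run. This is a minimal sketch using `jq` against the `evals/rag.jsonl` path from the manifest above; it is a local sanity check, not an `agentspec` feature:

```bash
# Print rag-quality samples missing the `context` field required by
# faithfulness / context_precision / hallucination (ideally prints nothing).
jq -c 'select(has("context") | not)' evals/rag.jsonl

# context_recall additionally needs the ground-truth chunks.
jq -c 'select(has("reference_contexts") | not)' evals/rag.jsonl
```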

## Running a dataset

```bash
# Run all samples
agentspec evaluate agent.yaml --url http://localhost:4000 --dataset rag-quality

# Run 20 random samples
agentspec evaluate agent.yaml --url http://localhost:4000 --dataset rag-quality --sample-size 20

# Run only samples tagged "advanced"
agentspec evaluate agent.yaml --url http://localhost:4000 --dataset rag-quality --tag advanced

# Machine-readable output
agentspec evaluate agent.yaml --url http://localhost:4000 --dataset safety --json
```

The command exits with code `1` when `ciGate: true` and any metric falls below its threshold.
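
Because of that exit code, the command can gate a pipeline directly. The sketch below is one possible CI step, assuming the agent is already serving at `http://localhost:4000`; the results filenames are illustrative, not something the tool produces on its own:

```bash
#!/usr/bin/env bash
# Stop on the first dataset that misses a threshold (requires ciGate: true).
set -euo pipefail

agentspec evaluate agent.yaml --url http://localhost:4000 --dataset rag-quality --json > rag-results.json
agentspec evaluate agent.yaml --url http://localhost:4000 --dataset safety --json > safety-results.json
```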

## Recommended file layout

```
evals/
  rag.jsonl         # faithfulness, context_recall, answer_relevancy
  safety.jsonl      # toxicity, bias
  accuracy.jsonl    # answer_similarity, hallucination
  regression.jsonl  # string_match on known Q&A pairs (no context needed)
```

One JSONL per concern keeps datasets independently runnable, independently versionable, and easy to extend without touching other test suites.
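
When you do want to exercise every concern in one go, the split files make that a short loop. A sketch under the layout above, assuming each file has a matching dataset entry in `agent.yaml` (the `regression` dataset name is hypothetical here):

```bash
# Run every suite; stop at the first one that fails its thresholds.
for ds in rag-quality safety accuracy regression; do
  agentspec evaluate agent.yaml --url http://localhost:4000 --dataset "$ds" || exit 1
done
```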

## See also

- [`agentspec evaluate` CLI reference](../reference/cli.md#agentspec-evaluate)
- [Probe coverage & evidence tiers](../concepts/probe-coverage.md)
- [CI integration](./ci-integration.md)