53 changes: 53 additions & 0 deletions llmobs/experiment/docs/code-evaluator-exact-match.md
@@ -0,0 +1,53 @@
# Exact Match Evaluator

## What this evaluator does

The Exact Match evaluator answers one question for each dataset record:

**“Did the task output exactly equal the expected output?”**

It emits a binary pass/fail result.

## Inputs it reads

For each record evaluation call, it consumes:

- **Task output**: the value produced by the experiment task for that record.
- **Expected output**: the gold/reference answer stored in the dataset record.

## Evaluation logic (step-by-step)

1. Receive the task output and expected output.
2. Perform strict equality comparison between the two values.
3. Return:
- `true` when values are exactly equal,
- `false` when they differ.
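
A minimal Go sketch of this logic, assuming a plain function shape rather than the SDK's real evaluator interface (`reflect.DeepEqual` stands in here for strict equality):

```go
package evaluators

import "reflect"

// ExactMatch reports whether the task output deeply equals the expected
// output. How the real evaluator receives these values is an assumption
// of this sketch, not the SDK contract.
func ExactMatch(output, expected any) bool {
	return reflect.DeepEqual(output, expected)
}
```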

## Output type and downstream metric mapping

- Evaluator return type: **boolean** (`true`/`false`).
- Experiment metric type generated from this value: **`boolean`**.

In dashboards, this behaves as a pass-rate style signal (for example, percent of records with `true`).

## Error behavior

This evaluator rarely produces an error of its own; one can occur only when the equality comparison cannot be performed on the concrete runtime types.

If an error does occur, experiment behavior depends on run configuration:

- default: error is recorded and execution continues,
- abort mode (`WithAbortOnError(true)`): run stops on evaluator failure.

## When to use it

Use Exact Match when outputs are deterministic and canonicalized, such as:

- classification labels,
- IDs,
- normalized short answers,
- exact expected strings.

## When it is not sufficient

Avoid relying only on Exact Match when acceptable answers can vary in wording, format, ordering, or punctuation. In those cases pair it with a fuzzy score evaluator.
46 changes: 46 additions & 0 deletions llmobs/experiment/docs/code-evaluator-fake-llm-as-a-judge.md
@@ -0,0 +1,46 @@
# Fake LLM-as-a-Judge Evaluator (Categorical Label Pattern)

## What this evaluator does

This evaluator demonstrates a **categorical judgment** pattern by always returning the same quality label.

Example label: `excellent`.

It exists to illustrate how judge-style outcomes can be represented in experiment metrics.

## Inputs it reads

Conceptually, judge evaluators look at:

- the dataset record context,
- the task output,
- (optionally) rubric or policy criteria.

In this simplified pattern, the returned label is constant and does not vary by input.

## Evaluation logic (step-by-step)

1. Receive record/output context.
2. Assign a qualitative label.
3. Return that label as the evaluation value.
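
A sketch of the constant-label pattern; the function name and parameter are illustrative, and the parameter exists only to mirror the shape of a real judge:

```go
package evaluators

// FakeJudge ignores its input and always returns the same categorical
// label, which is exactly the point of this demonstration pattern.
func FakeJudge(output any) string {
	return "excellent"
}
```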

## Output type and downstream metric mapping

- Evaluator return type: **string label**.
- Experiment metric type generated from this value: **`categorical`**.

Typical dashboards then slice counts/ratios by label.

## Why this evaluator exists

It demonstrates the integration contract for LLM-as-a-judge style evaluators without requiring a real model call.

## How to use this pattern in production

Replace the constant label with model- or rubric-driven labeling, for example:

- `excellent` / `good` / `fair` / `poor`,
- `safe` / `unsafe`,
- `grounded` / `hallucinated`.

For reliability, define clear rubric criteria and keep labels stable over time.
56 changes: 56 additions & 0 deletions llmobs/experiment/docs/code-evaluator-overlap.md
@@ -0,0 +1,56 @@
# Overlap Evaluator (Jaccard Similarity on Character Sets)

## What this evaluator does

The Overlap evaluator measures **how much the produced answer and the expected answer have in common**, using set similarity.

It returns a numeric score in the range **0.0 to 1.0**.

## Inputs it reads

For each record evaluation call, it consumes:

- **Task output** (expected to be text),
- **Expected output** (expected to be text).

If either value is not textual, the evaluator reports an error.

## Evaluation logic (step-by-step)

1. Convert both answers into sets of unique characters (runes).
2. Compute:
- **intersection size**: number of characters present in both sets,
- **union size**: number of unique characters present in either set.
3. Return Jaccard similarity:
- `intersection / union` when union is non-zero,
- `1.0` when both sets are empty.

Interpretation:

- `1.0` means identical character-set coverage,
- `0.0` means no shared characters,
- intermediate values represent partial overlap.
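
A self-contained Go sketch of this computation (function names are assumptions of this page, not the SDK's API):

```go
package evaluators

// runeSet collects the unique runes of s.
func runeSet(s string) map[rune]struct{} {
	set := make(map[rune]struct{}, len(s))
	for _, r := range s {
		set[r] = struct{}{}
	}
	return set
}

// Overlap returns the Jaccard similarity of the two strings' rune sets:
// |intersection| / |union|, or 1.0 when both strings are empty.
func Overlap(output, expected string) float64 {
	a, b := runeSet(output), runeSet(expected)
	intersection := 0
	for r := range a {
		if _, ok := b[r]; ok {
			intersection++
		}
	}
	union := len(a) + len(b) - intersection
	if union == 0 {
		return 1.0
	}
	return float64(intersection) / float64(union)
}
```

Because only unique characters count, `Overlap("aab", "ab")` is `1.0`: the repeated `a` adds nothing to either set.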

## Output type and downstream metric mapping

- Evaluator return type: **floating-point score**.
- Experiment metric type generated from this value: **`score`**.

## Error behavior

The evaluator can fail if output shapes are incompatible with text comparison (for example, non-string values). Error handling then follows experiment run settings (continue vs abort).

## Characteristics to keep in mind

- Set-based: repeated characters do not increase score.
- Character-level: it does not understand words, syntax, or semantics.
- Case-sensitive unless inputs are normalized before evaluation.
- Fast and deterministic.

## When to use it

Use this as a lightweight fuzzy signal when exact equality is too strict and you want a cheap, deterministic similarity score.

## When it is not sufficient

For semantic correctness (meaning-level similarity), use stronger evaluators (embedding similarity, rubric-based judge, or domain-specific scoring).
49 changes: 49 additions & 0 deletions llmobs/experiment/docs/code-evaluator-similarity.md
@@ -0,0 +1,49 @@
# Similarity Evaluator (Heuristic Score Pattern)

## What this evaluator does

This evaluator demonstrates a **minimal numeric scoring pattern** for experiments.

It emits one of two scores:

- `1.0` if output exactly matches expected output,
- `0.5` if it does not.

The goal is to show the score-evaluator contract, not to provide a production-quality similarity method.

## Inputs it reads

For each record evaluation call, it consumes:

- **Task output**,
- **Expected output** from the dataset record.

## Evaluation logic (step-by-step)

1. Compare produced and expected values.
2. Return full score (`1.0`) for exact match.
3. Return fallback partial score (`0.5`) otherwise.
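
The whole pattern fits in a few lines of Go (the signature is illustrative):

```go
package evaluators

// Similarity gives full credit for an exact match and a fixed partial
// score otherwise; it is a placeholder, not a real similarity measure.
func Similarity(output, expected string) float64 {
	if output == expected {
		return 1.0
	}
	return 0.5
}
```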

## Output type and downstream metric mapping

- Evaluator return type: **floating-point score**.
- Experiment metric type generated from this value: **`score`**.

This makes the result aggregatable as an average quality signal across records.

## Why this evaluator exists

It is intentionally simple so that test/example flows can validate score handling without depending on heavy NLP/ML logic.

## When to use it

Use this pattern as a scaffold when building your own scoring evaluator: it lets you wire up experiment metrics before investing in real scoring logic.

## How to evolve it for production

Replace the fixed fallback value with domain-aware scoring, such as:

- edit-distance normalization,
- token overlap,
- embedding cosine similarity,
- rubric-based judge scoring.
54 changes: 54 additions & 0 deletions llmobs/experiment/docs/code-evaluators-index.md
@@ -0,0 +1,54 @@
# LLMObs Experiment SDK: Code Evaluator Documentation

This directory explains the evaluator patterns currently used by the Go LLMObs experiment SDK examples/tests.

The pages are written to be standalone references so readers can understand evaluator behavior without opening source files.

## Evaluator pages

1. [Exact Match evaluator](./code-evaluator-exact-match.md)
2. [Overlap evaluator (Jaccard on character sets)](./code-evaluator-overlap.md)
3. [Similarity evaluator (heuristic score pattern)](./code-evaluator-similarity.md)
4. [Fake LLM-as-a-Judge evaluator (categorical label pattern)](./code-evaluator-fake-llm-as-a-judge.md)

## How experiment evaluations run (end-to-end)

An experiment run processes evaluations in five phases:

1. **Task execution phase**
- Each dataset record is executed by the configured task.
- The run captures output, timestamps, and tracing metadata per record.

2. **Per-record evaluator phase**
- Each configured evaluator runs against each record output.
- Each evaluator produces an `Evaluation` containing name, value, and optional error.

3. **Summary evaluator phase (optional)**
- Aggregate evaluators can run after per-record processing.
- These operate over the full set of record results.

4. **Metric normalization phase**
   - Evaluation values are mapped to Datadog experiment metric types (a sketch follows this list):
- booleans → `boolean`,
- numbers → `score`,
- other values (for example strings) → `categorical`.

5. **Publish phase**
- Normalized evaluation metric events are sent to the backend for analysis and visualization.
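
A hedged Go sketch of the phase-4 mapping; the switch mirrors the rules above, but the SDK's actual normalization code and constant names may differ:

```go
package experiment

// metricType maps an evaluation value to the experiment metric type it
// would produce. Names are illustrative, not SDK constants.
func metricType(v any) string {
	switch v.(type) {
	case bool:
		return "boolean"
	case int, int8, int16, int32, int64, float32, float64:
		return "score"
	default:
		return "categorical"
	}
}
```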

## Error-handling model

Evaluator failures are captured on the corresponding evaluation item.

- In default mode, runs continue after evaluator errors.
- In abort mode (`WithAbortOnError(true)`), evaluator errors stop the run.

## Design guidance for custom evaluators

When creating custom evaluators, choose return types intentionally, because the return type controls the metric semantics:

- return **bool** for pass/fail,
- return **number** for quality scores,
- return **string/enum** for category labels.
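
For example, the same underlying comparison can be exposed three ways, and only the return type decides the metric type (names are illustrative):

```go
package evaluators

// PassFail yields a boolean metric.
func PassFail(out, expected string) bool {
	return out == expected
}

// Quality yields a score metric.
func Quality(out, expected string) float64 {
	if out == expected {
		return 1.0
	}
	return 0.0
}

// Verdict yields a categorical metric.
func Verdict(out, expected string) string {
	if out == expected {
		return "match"
	}
	return "mismatch"
}
```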

This keeps downstream experiment analytics interpretable and consistent.