53 changes: 53 additions & 0 deletions llmobs/experiment/docs/code-evaluator-exact-match.md
@@ -0,0 +1,53 @@
# Exact Match Evaluator

## What this evaluator does

The Exact Match evaluator answers one question for each dataset record:

**“Did the task output exactly equal the expected output?”**

It emits a binary pass/fail result.

## Inputs it reads

For each record evaluation call, it consumes:

- **Task output**: the value produced by the experiment task for that record.
- **Expected output**: the gold/reference answer stored in the dataset record.

## Evaluation logic (step-by-step)

1. Receive the task output and expected output.
2. Perform strict equality comparison between the two values.
3. Return:
- `true` when values are exactly equal,
- `false` when they differ.
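
A minimal Go sketch of this logic, assuming a plain function shape rather than the SDK's real evaluator interface (`reflect.DeepEqual` stands in here for strict equality):

```go
package evaluators

import "reflect"

// ExactMatch reports whether the task output deeply equals the expected
// output. How the real evaluator receives these values is an assumption
// of this sketch, not the SDK contract.
func ExactMatch(output, expected any) bool {
	return reflect.DeepEqual(output, expected)
}
```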

## Output type and downstream metric mapping

- Evaluator return type: **boolean** (`true`/`false`).
- Experiment metric type generated from this value: **`boolean`**.

In dashboards, this behaves as a pass-rate style signal (for example, percent of records with `true`).

## Error behavior

This evaluator rarely produces an error of its own; one can occur only when the equality comparison cannot be performed on the concrete runtime types.

If an error does occur, experiment behavior depends on run configuration:

- default: error is recorded and execution continues,
- abort mode (`WithAbortOnError(true)`): run stops on evaluator failure.

## When to use it

Use Exact Match when outputs are deterministic and canonicalized, such as:

- classification labels,
- IDs,
- normalized short answers,
- exact expected strings.

## When it is not sufficient

Avoid relying only on Exact Match when acceptable answers can vary in wording, format, ordering, or punctuation. In those cases pair it with a fuzzy score evaluator.
46 changes: 46 additions & 0 deletions llmobs/experiment/docs/code-evaluator-fake-llm-as-a-judge.md
@@ -0,0 +1,46 @@
# Fake LLM-as-a-Judge Evaluator (Categorical Label Pattern)

## What this evaluator does

This evaluator demonstrates a **categorical judgment** pattern by always returning the same quality label.

Example label: `excellent`.

It exists to illustrate how judge-style outcomes can be represented in experiment metrics.

## Inputs it reads

Conceptually, judge evaluators look at:

- the dataset record context,
- the task output,
- (optionally) rubric or policy criteria.

In this simplified pattern, the returned label is constant and does not vary by input.

## Evaluation logic (step-by-step)

1. Receive record/output context.
2. Assign a qualitative label.
3. Return that label as the evaluation value.
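
A sketch of the constant-label pattern; the function name and parameter are illustrative, and the parameter exists only to mirror the shape of a real judge:

```go
package evaluators

// FakeJudge ignores its input and always returns the same categorical
// label, which is exactly the point of this demonstration pattern.
func FakeJudge(output any) string {
	return "excellent"
}
```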

## Output type and downstream metric mapping

- Evaluator return type: **string label**.
- Experiment metric type generated from this value: **`categorical`**.

Typical dashboards then slice counts/ratios by label.

## Why this evaluator exists

It demonstrates the integration contract for LLM-as-a-judge style evaluators without requiring a real model call.

## How to use this pattern in production

Replace the constant label with model- or rubric-driven labeling, for example:

- `excellent` / `good` / `fair` / `poor`,
- `safe` / `unsafe`,
- `grounded` / `hallucinated`.

For reliability, define clear rubric criteria and keep labels stable over time.
56 changes: 56 additions & 0 deletions llmobs/experiment/docs/code-evaluator-overlap.md
@@ -0,0 +1,56 @@
# Overlap Evaluator (Jaccard Similarity on Character Sets)

## What this evaluator does

The Overlap evaluator measures **how much the produced answer and the expected answer have in common**, using set similarity.

It returns a numeric score in the range **0.0 to 1.0**.

## Inputs it reads

For each record evaluation call, it consumes:

- **Task output** (expected to be text),
- **Expected output** (expected to be text).

If either value is not textual, the evaluator reports an error.

## Evaluation logic (step-by-step)

1. Convert both answers into sets of unique characters (runes).
2. Compute:
- **intersection size**: number of characters present in both sets,
- **union size**: number of unique characters present in either set.
3. Return Jaccard similarity:
- `intersection / union` when union is non-zero,
- `1.0` when both sets are empty.

Interpretation:

- `1.0` means identical character-set coverage,
- `0.0` means no shared characters,
- intermediate values represent partial overlap.
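
A self-contained Go sketch of this computation (function names are assumptions of this page, not the SDK's API):

```go
package evaluators

// runeSet collects the unique runes of s.
func runeSet(s string) map[rune]struct{} {
	set := make(map[rune]struct{}, len(s))
	for _, r := range s {
		set[r] = struct{}{}
	}
	return set
}

// Overlap returns the Jaccard similarity of the two strings' rune sets:
// |intersection| / |union|, or 1.0 when both strings are empty.
func Overlap(output, expected string) float64 {
	a, b := runeSet(output), runeSet(expected)
	intersection := 0
	for r := range a {
		if _, ok := b[r]; ok {
			intersection++
		}
	}
	union := len(a) + len(b) - intersection
	if union == 0 {
		return 1.0
	}
	return float64(intersection) / float64(union)
}
```

Because only unique characters count, `Overlap("aab", "ab")` is `1.0`: the repeated `a` adds nothing to either set.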

## Output type and downstream metric mapping

- Evaluator return type: **floating-point score**.
- Experiment metric type generated from this value: **`score`**.

## Error behavior

The evaluator can fail if output shapes are incompatible with text comparison (for example, non-string values). Error handling then follows experiment run settings (continue vs abort).

## Characteristics to keep in mind

- Set-based: repeated characters do not increase score.
- Character-level: it does not understand words, syntax, or semantics.
- Case-sensitive unless inputs are normalized before evaluation.
- Fast and deterministic.

## When to use it

Use this as a lightweight fuzzy signal when exact equality is too strict and you want a cheap, deterministic similarity score.

## When it is not sufficient

For semantic correctness (meaning-level similarity), use stronger evaluators (embedding similarity, rubric-based judge, or domain-specific scoring).
49 changes: 49 additions & 0 deletions llmobs/experiment/docs/code-evaluator-similarity.md
@@ -0,0 +1,49 @@
# Similarity Evaluator (Heuristic Score Pattern)

## What this evaluator does

This evaluator demonstrates a **minimal numeric scoring pattern** for experiments.

It emits one of two scores:

- `1.0` if output exactly matches expected output,
- `0.5` if it does not.

The goal is to show the score-evaluator contract, not to provide a production-quality similarity method.

## Inputs it reads

For each record evaluation call, it consumes:

- **Task output**,
- **Expected output** from the dataset record.

## Evaluation logic (step-by-step)

1. Compare produced and expected values.
2. Return full score (`1.0`) for exact match.
3. Return fallback partial score (`0.5`) otherwise.
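
The whole pattern fits in a few lines of Go (the signature is illustrative):

```go
package evaluators

// Similarity gives full credit for an exact match and a fixed partial
// score otherwise; it is a placeholder, not a real similarity measure.
func Similarity(output, expected string) float64 {
	if output == expected {
		return 1.0
	}
	return 0.5
}
```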

## Output type and downstream metric mapping

- Evaluator return type: **floating-point score**.
- Experiment metric type generated from this value: **`score`**.

This makes the result aggregatable as an average quality signal across records.

## Why this evaluator exists

It is intentionally simple so that test/example flows can validate score handling without depending on heavy NLP/ML logic.

## When to use it

Use this pattern as a scaffold when building your own scoring evaluator: it lets you wire up experiment metrics before investing in real scoring logic.

## How to evolve it for production

Replace the fixed fallback value with domain-aware scoring, such as:

- edit-distance normalization,
- token overlap,
- embedding cosine similarity,
- rubric-based judge scoring.
54 changes: 54 additions & 0 deletions llmobs/experiment/docs/code-evaluators-index.md
@@ -0,0 +1,54 @@
# LLMObs Experiment SDK: Code Evaluator Documentation

This directory explains the evaluator patterns currently used by the Go LLMObs experiment SDK examples/tests.

The pages are written to be standalone references so readers can understand evaluator behavior without opening source files.

## Evaluator pages

1. [Exact Match evaluator](./code-evaluator-exact-match.md)
2. [Overlap evaluator (Jaccard on character sets)](./code-evaluator-overlap.md)
3. [Similarity evaluator (heuristic score pattern)](./code-evaluator-similarity.md)
4. [Fake LLM-as-a-Judge evaluator (categorical label pattern)](./code-evaluator-fake-llm-as-a-judge.md)

## How experiment evaluations run (end-to-end)

An experiment run processes evaluations in five phases:

1. **Task execution phase**
- Each dataset record is executed by the configured task.
- The run captures output, timestamps, and tracing metadata per record.

2. **Per-record evaluator phase**
- Each configured evaluator runs against each record output.
- Each evaluator produces an `Evaluation` containing name, value, and optional error.

3. **Summary evaluator phase (optional)**
- Aggregate evaluators can run after per-record processing.
- These operate over the full set of record results.

4. **Metric normalization phase**
   - Evaluation values are mapped to Datadog experiment metric types (a sketch follows this list):
- booleans → `boolean`,
- numbers → `score`,
- other values (for example strings) → `categorical`.

5. **Publish phase**
- Normalized evaluation metric events are sent to the backend for analysis and visualization.
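
A hedged Go sketch of the phase-4 mapping; the switch mirrors the rules above, but the SDK's actual normalization code and constant names may differ:

```go
package experiment

// metricType maps an evaluation value to the experiment metric type it
// would produce. Names are illustrative, not SDK constants.
func metricType(v any) string {
	switch v.(type) {
	case bool:
		return "boolean"
	case int, int8, int16, int32, int64, float32, float64:
		return "score"
	default:
		return "categorical"
	}
}
```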

## Error-handling model

Evaluator failures are captured on the corresponding evaluation item.

- In default mode, runs continue after evaluator errors.
- In abort mode (`WithAbortOnError(true)`), evaluator errors stop the run.

## Design guidance for custom evaluators

When creating custom evaluators, choose return types intentionally, because the return type controls the metric semantics:

- return **bool** for pass/fail,
- return **number** for quality scores,
- return **string/enum** for category labels.
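
For example, the same underlying comparison can be exposed three ways, and only the return type decides the metric type (names are illustrative):

```go
package evaluators

// PassFail yields a boolean metric.
func PassFail(out, expected string) bool {
	return out == expected
}

// Quality yields a score metric.
func Quality(out, expected string) float64 {
	if out == expected {
		return 1.0
	}
	return 0.0
}

// Verdict yields a categorical metric.
func Verdict(out, expected string) string {
	if out == expected {
		return "match"
	}
	return "mismatch"
}
```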

This keeps downstream experiment analytics interpretable and consistent.