
Add documentation for code evaluators and index #4585

Draft

yahya-mouman wants to merge 1 commit into main from codex/document-built-in-code-evaluators-in-llmobs

Conversation

@yahya-mouman

Motivation

  • Provide human-readable documentation for the evaluator patterns used by the LLMObs experiment SDK so implementers can understand evaluator contracts without reading the source.
  • Clarify how evaluator return types map to experiment metric types and how evaluator errors are handled (including WithAbortOnError(true)); a sketch of the mapping follows this list.
  • Offer simple, illustrative evaluator examples (exact match, overlap, similarity, and a fake judge) to accelerate testing and onboarding.
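
As a rough illustration of the return-type normalization described above, here is a minimal Go sketch. The helper name, the use of `any`, and the metric-type labels are assumptions for illustration only, not the SDK's actual API:

```go
package main

import "fmt"

// metricType is an illustrative helper (not part of the SDK): it shows the
// normalization idea that the Go type an evaluator returns determines the
// experiment metric type.
func metricType(result any) string {
	switch result.(type) {
	case bool:
		return "boolean" // e.g. an exact-match evaluator
	case int, int64, float32, float64:
		return "score" // e.g. a 0.0-1.0 similarity evaluator
	case string:
		return "categorical" // e.g. an LLM-as-judge label
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(metricType(true), metricType(0.8), metricType("pass")) // boolean score categorical
}
```

On errors: judging from its name, WithAbortOnError(true) presumably stops the experiment on the first evaluator error instead of recording it and continuing; the index document added here is the authoritative description of that behavior.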

Description

  • Add code-evaluators-index.md that documents evaluator lifecycle phases, metric normalization rules, error-handling model, and guidance for choosing return types.
  • Add code-evaluator-exact-match.md describing a boolean exact-equality evaluator for deterministic outputs (illustrative sketches of all four example evaluators follow this list).
  • Add code-evaluator-overlap.md describing a character-set Jaccard similarity evaluator that returns a 0.0–1.0 score.
  • Add code-evaluator-similarity.md describing a minimal heuristic score evaluator returning 1.0 or 0.5.
  • Add code-evaluator-fake-llm-as-a-judge.md describing a constant-label categorical evaluator as a pattern for LLM-as-judge style outputs.
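
To make the four example patterns concrete before reading the individual pages, here is a minimal Go sketch. The string-in/value-out signatures and the case-insensitive 1.0/0.5 rule shown for Similarity are assumptions for illustration; the documented SDK contracts take precedence:

```go
package evaluators

import "strings"

// ExactMatch: boolean metric. True only on exact equality, suited to
// deterministic outputs.
func ExactMatch(output, expected string) bool {
	return output == expected
}

// Overlap: score metric. Jaccard similarity over the character sets of the
// two strings, always in [0.0, 1.0].
func Overlap(output, expected string) float64 {
	a, b := charSet(output), charSet(expected)
	if len(a) == 0 && len(b) == 0 {
		return 1.0 // two empty strings are identical
	}
	inter, union := 0, len(b)
	for r := range a {
		if b[r] {
			inter++ // character in both sets
		} else {
			union++ // character only in the output
		}
	}
	return float64(inter) / float64(union)
}

// charSet collects the distinct runes of s.
func charSet(s string) map[rune]bool {
	set := make(map[rune]bool, len(s))
	for _, r := range s {
		set[r] = true
	}
	return set
}

// Similarity: score metric. A deliberately minimal heuristic returning 1.0
// or 0.5; the case-insensitive rule here is a stand-in for whatever
// heuristic the doc actually specifies.
func Similarity(output, expected string) float64 {
	if strings.EqualFold(output, expected) {
		return 1.0
	}
	return 0.5
}

// FakeJudge: categorical metric. A constant label standing in for an
// LLM-as-judge verdict, handy for wiring up experiments in tests.
func FakeJudge(output, expected string) string {
	return "pass"
}
```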

Testing

  • No automated tests were run for this documentation-only change.

Codex Task

@datadog-official

datadog-official bot commented Mar 23, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 59.78% (+3.78%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: a124c92 | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback!

@pr-commenter

pr-commenter bot commented Mar 23, 2026

Benchmarks

Benchmark execution time: 2026-03-23 14:36:50

Comparing candidate commit f3970b8 in PR branch codex/document-built-in-code-evaluators-in-llmobs with baseline commit f3970b8 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 216 metrics, 8 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.
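
In code form, the rule reads roughly as follows; this is an illustrative sketch, not the benchmarking platform's actual implementation, and it assumes the CI bounds and the threshold are expressed as relative differences around 0%:

```go
// isSignificant reports whether the whole CI lies outside the
// +/-threshold band, i.e. the change is considered significant.
func isSignificant(ciLower, ciUpper, threshold float64) bool {
	return ciLower > threshold || // entirely above: e.g. significantly slower
		ciUpper < -threshold // entirely below: e.g. significantly faster
}
```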

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because most changes are small:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'
