
Add documentation for code evaluators and index #4585

Draft

yahya-mouman wants to merge 1 commit into main from codex/document-built-in-code-evaluators-in-llmobs

Conversation

@yahya-mouman

Motivation

  • Provide human-readable documentation for the evaluator patterns used by the LLMObs experiment SDK so implementers can understand evaluator contracts without reading the source.
  • Clarify how evaluator return types map to experiment metric types and how evaluator errors are handled (including WithAbortOnError(true)); a sketch of the mapping follows this list.
  • Offer simple, illustrative evaluator examples (exact match, overlap, similarity, and a fake judge) to accelerate testing and onboarding.
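
As a rough illustration of the return-type normalization described above, here is a minimal Go sketch. The helper name, the use of `any`, and the metric-type labels are assumptions for illustration only, not the SDK's actual API:

```go
package main

import "fmt"

// metricType is an illustrative helper (not part of the SDK): it shows the
// normalization idea that the Go type an evaluator returns determines the
// experiment metric type.
func metricType(result any) string {
	switch result.(type) {
	case bool:
		return "boolean" // e.g. an exact-match evaluator
	case int, int64, float32, float64:
		return "score" // e.g. a 0.0-1.0 similarity evaluator
	case string:
		return "categorical" // e.g. an LLM-as-judge label
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(metricType(true), metricType(0.8), metricType("pass")) // boolean score categorical
}
```

On errors: judging from its name, WithAbortOnError(true) presumably stops the experiment on the first evaluator error instead of recording it and continuing; the index document added here is the authoritative description of that behavior.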

Description

  • Add code-evaluators-index.md that documents evaluator lifecycle phases, metric normalization rules, error-handling model, and guidance for choosing return types.
  • Add code-evaluator-exact-match.md describing a boolean exact-equality evaluator for deterministic outputs (illustrative sketches of all four example evaluators follow this list).
  • Add code-evaluator-overlap.md describing a character-set Jaccard similarity evaluator that returns a 0.0–1.0 score.
  • Add code-evaluator-similarity.md describing a minimal heuristic score evaluator returning 1.0 or 0.5.
  • Add code-evaluator-fake-llm-as-a-judge.md describing a constant-label categorical evaluator as a pattern for LLM-as-judge style outputs.
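
To make the four example patterns concrete before reading the individual pages, here is a minimal Go sketch. The string-in/value-out signatures and the case-insensitive 1.0/0.5 rule shown for Similarity are assumptions for illustration; the documented SDK contracts take precedence:

```go
package evaluators

import "strings"

// ExactMatch: boolean metric. True only on exact equality, suited to
// deterministic outputs.
func ExactMatch(output, expected string) bool {
	return output == expected
}

// Overlap: score metric. Jaccard similarity over the character sets of the
// two strings, always in [0.0, 1.0].
func Overlap(output, expected string) float64 {
	a, b := charSet(output), charSet(expected)
	if len(a) == 0 && len(b) == 0 {
		return 1.0 // two empty strings are identical
	}
	inter, union := 0, len(b)
	for r := range a {
		if b[r] {
			inter++ // character in both sets
		} else {
			union++ // character only in the output
		}
	}
	return float64(inter) / float64(union)
}

// charSet collects the distinct runes of s.
func charSet(s string) map[rune]bool {
	set := make(map[rune]bool, len(s))
	for _, r := range s {
		set[r] = true
	}
	return set
}

// Similarity: score metric. A deliberately minimal heuristic returning 1.0
// or 0.5; the case-insensitive rule here is a stand-in for whatever
// heuristic the doc actually specifies.
func Similarity(output, expected string) float64 {
	if strings.EqualFold(output, expected) {
		return 1.0
	}
	return 0.5
}

// FakeJudge: categorical metric. A constant label standing in for an
// LLM-as-judge verdict, handy for wiring up experiments in tests.
func FakeJudge(output, expected string) string {
	return "pass"
}
```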

Testing

  • No automated tests were run for this documentation-only change.

Codex Task

@datadog-official

datadog-official bot commented Mar 23, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 59.78% (+3.78%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: a124c92 | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback!

@pr-commenter

pr-commenter bot commented Mar 23, 2026

Benchmarks

Benchmark execution time: 2026-03-23 14:36:50

Comparing candidate commit f3970b8 in PR branch codex/document-built-in-code-evaluators-in-llmobs with baseline commit f3970b8 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 216 metrics, 8 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.
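
In code form, the rule reads roughly as follows; this is an illustrative sketch, not the benchmarking platform's actual implementation, and it assumes the CI bounds and the threshold are expressed as relative differences around 0%:

```go
// isSignificant reports whether the whole CI lies outside the
// +/-threshold band, i.e. the change is considered significant.
func isSignificant(ciLower, ciUpper, threshold float64) bool {
	return ciLower > threshold || // entirely above: e.g. significantly slower
		ciUpper < -threshold // entirely below: e.g. significantly faster
}
```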

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because most changes are small:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'
