
fix(openfeature): improve FFE eval metrics cross-tracer consistency#4590

Open
sameerank wants to merge 3 commits into main from sameerank/FFL-1972/fix-flag-evaluation-metrics-consistency-issues

Conversation

@sameerank (Contributor) commented Mar 24, 2026

What does this PR do?

Fixes consistency issues in FFE flag evaluation metrics to align with OpenFeature telemetry conventions and other SDK implementations:

  1. Map errNoConfiguration to PROVIDER_NOT_READY error code

    • Previously returned GENERAL, now returns provider_not_ready
    • Aligns with the Python, Ruby, Java, .NET, and JS implementations; verified by Test_FFE_Eval_No_Config_Loaded in system-tests
  2. Return TYPE_MISMATCH for NUMERIC→INTEGER conversion

    • Previously returned parse_error when evaluating a NUMERIC flag as INTEGER
    • Now returns type_mismatch to align with libdatadog FFE behavior
    • libdatadog treats NUMERIC and INTEGER as incompatible types (bitwise type check)
  3. Add "unknown" fallback for empty reason values

    • Matches the OpenFeature SDK telemetry convention for missing reasons

  4. Use raw lowercase error codes directly

    • Removed errorCodeToTag() helper function
    • OpenFeature ErrorCode values are already snake_case (e.g., FLAG_NOT_FOUND)
    • Just lowercase them directly for the metric tag, similar to dd-trace-py
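The tag-derivation rules in items 3 and 4 can be sketched as below. This is an illustrative sketch only; `errorCodeTag` and `reasonTag` are hypothetical names, not the actual dd-trace-go identifiers.

```go
package main

import (
	"fmt"
	"strings"
)

// errorCodeTag lowercases an OpenFeature ErrorCode for use as a metric tag.
// ErrorCode values are already snake_case (e.g. "FLAG_NOT_FOUND"), so no
// mapping helper is needed.
func errorCodeTag(code string) string {
	return strings.ToLower(code)
}

// reasonTag falls back to "unknown" when the evaluation reason is empty,
// per the OpenFeature telemetry convention for missing reasons.
func reasonTag(reason string) string {
	if reason == "" {
		return "unknown"
	}
	return strings.ToLower(reason)
}

func main() {
	fmt.Println(errorCodeTag("PROVIDER_NOT_READY")) // provider_not_ready
	fmt.Println(reasonTag(""))                      // unknown
}
```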

Motivation

FFL-1972 - Cross-tracer consistency for FFE eval metrics

Changes in dd-trace-py diverged from dd-trace-go, prompting this consistency fix: DataDog/dd-trace-py#17029

System-tests PR that enforces that result.reason and error.type are consistent: DataDog/system-tests#6545

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • New code is free of linting errors.
  • New code doesn't break existing tests.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • All generated files are up to date.
  • Non-trivial go.mod changes reviewed by @DataDog/dd-trace-go-guild.

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 60.67%. Comparing base (7fbda1c) to head (7712585).

Files with missing lines   Patch %   Lines
openfeature/provider.go    50.00%    1 Missing ⚠️

Additional details and impacted files

Files with missing lines          Coverage Δ
openfeature/flageval_metrics.go   85.36% <100.00%> (-8.12%) ⬇️
openfeature/provider.go           74.24% <50.00%> (-0.76%) ⬇️

... and 265 files with indirect coverage changes


datadog-official bot commented Mar 24, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 88.89%
Overall Coverage: 60.03% (-0.06%)

🔗 Commit SHA: 7712585

pr-commenter bot commented Mar 24, 2026

Benchmarks

Benchmark execution time: 2026-03-28 03:21:46

Comparing candidate commit 7712585 in PR branch sameerank/FFL-1972/fix-flag-evaluation-metrics-consistency-issues with baseline commit 7fbda1c in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 217 metrics, 7 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'
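The significance rule described above (a change counts as significant only when the whole CI lies outside the threshold) can be expressed as a small predicate. This is a minimal sketch of that rule, not the benchmarking platform's actual code; the function name and signature are assumptions.

```go
package main

import "fmt"

// significant reports whether a confidence interval [ciLower, ciUpper] over
// the relative difference of means lies entirely outside ±threshold
// (e.g. threshold = 0.01 for a 1% SIGNIFICANT_IMPACT_THRESHOLD).
func significant(ciLower, ciUpper, threshold float64) bool {
	return ciLower > threshold || ciUpper < -threshold
}

func main() {
	// First diagram: CI (-0.6%, +1.2%) straddles the threshold, so not significant.
	fmt.Println(significant(-0.006, 0.012, 0.01)) // false
	// Second diagram: CI (1.3%, 3.1%) is entirely above +1%, so significantly worse.
	fmt.Println(significant(0.013, 0.031, 0.01)) // true
}
```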

Comment on lines +361 to +363
// NUMERIC and INTEGER are distinct types; reject float-to-int conversion
// to align with libdatadog FFE which treats them as incompatible types.
conversionErr = fmt.Errorf("%w: flag %q is NUMERIC but INTEGER was requested", errTypeMismatch, flagKey)
Member

This should normally never happen. The fact that it does means we've ignored the type field and are now trying to cast arbitrary values to integer. NUMERIC → INTEGER is one conversion you just fixed, but there's also JSON, which may contain arbitrary values including integers (probably worth adding to the test cases).

Return TYPE_MISMATCH for NUMERIC→INTEGER conversion

  • Previously returned parse_error when evaluating a NUMERIC flag as INTEGER
  • Now returns type_mismatch to align with libdatadog FFE behavior

We should return parse_error when configuration's value does not match the type declared in configuration (e.g. when variant type is set to integer but the value is string). Do we still handle this case?

This commit fixes three consistency issues in FFE flag evaluation metrics
to align with the OpenFeature telemetry conventions and other SDK implementations:

1. Map errNoConfiguration to PROVIDER_NOT_READY error code
   Previously returned GENERAL, now returns provider_not_ready to match
   Python, Ruby, Java, .NET, and JS implementations.

2. Add "unknown" fallback for empty reason values
   Matches the OpenFeature SDK telemetry convention for missing reasons.

3. Use raw lowercase error codes directly
   Remove errorCodeToTag() helper function since OpenFeature ErrorCode
   values are already snake_case (e.g., FLAG_NOT_FOUND). Just lowercase
   them directly for the metric tag.

FFL-1972 #close
NUMERIC and INTEGER are distinct flag types. Attempting to evaluate a NUMERIC
flag as INTEGER should return TypeMismatch (not ParseError) to align with
libdatadog FFE which treats them as incompatible types.
@sameerank force-pushed the sameerank/FFL-1972/fix-flag-evaluation-metrics-consistency-issues branch from 6ebda74 to 7712585 on March 28, 2026 02:55