fix(llmobs): flush span events buffer when it reaches certain size (#4524)
✅ Tests: all green, no new flaky tests detected (commit SHA: 8174c0a).
Benchmarks: execution time 2026-04-14 13:23:25, comparing candidate commit 8174c0a in the PR branch. Found 0 performance improvements and 0 performance regressions; performance is the same for 215 metrics, 9 unstable metrics.
### What does this PR do?

This PR fixes the behavior of the llmobs SDK for big payloads:

- Fix span events flush logic: before, span events were flushed on a fixed 2-second interval regardless of payload size. Now the buffer tracks its cumulative size and flushes automatically when it reaches the 5MB limit enforced by the backend.
- Fix `Dataset.Push` for large payloads: before this PR, the logic was to fall back to bulk CSV upload for large changes, but the backend rejects large multipart requests. This switches to chunking inserts across multiple `batch_update` calls instead.
- Remove the global flush timeout: the previous code applied a single 2-second deadline shared across all retries when sending span events, causing later retries to fail immediately if the first attempt was slow. Each transport retry now gets its own independent per-request timeout.

### Motivation

### Reviewer's Checklist

- [x] Changed code has unit tests for its functionality at or near 100% coverage.
- [ ] [System-Tests](https://github.com/DataDog/system-tests/) covering this feature have been added and enabled with the va.b.c-dev version tag.
- [ ] There is a benchmark for any new code, or changes to existing code.
- [ ] If this interacts with the agent in a new way, a system test has been added.
- [ ] New code is free of linting errors. You can check this by running `make lint` locally.
- [ ] New code doesn't break existing tests. You can check this by running `make test` locally.
- [ ] Add an appropriate team label so this PR gets put in the right place for the release notes.
- [ ] All generated files are up to date. You can check this by running `make generate` locally.
- [ ] Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild. Make sure all nested modules are up to date by running `make fix-modules` locally.

Unsure? Have a question? Request a review!

Co-authored-by: rodrigo.arguello <rodrigo.arguello@datadoghq.com>
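The size-tracked flush described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual internals: the `spanEventBuffer` type and its methods are hypothetical names, and only the 5MB threshold comes from the PR description.

```go
package main

import "fmt"

// maxBufferBytes mirrors the 5MB payload limit described in the PR.
const maxBufferBytes = 5 * 1024 * 1024

// spanEventBuffer accumulates encoded span events and tracks their
// cumulative size, so the caller can flush on size rather than only
// on a fixed timer.
type spanEventBuffer struct {
	events [][]byte
	size   int
}

// add appends an encoded event and returns true when the buffer should
// be flushed because its cumulative size reached the limit.
func (b *spanEventBuffer) add(ev []byte) (flush bool) {
	b.events = append(b.events, ev)
	b.size += len(ev)
	return b.size >= maxBufferBytes
}

// drain returns the buffered events and resets the buffer.
func (b *spanEventBuffer) drain() [][]byte {
	evs := b.events
	b.events, b.size = nil, 0
	return evs
}

func main() {
	var buf spanEventBuffer
	ev := make([]byte, 1024*1024) // 1MB encoded event
	for i := 0; i < 4; i++ {
		fmt.Println(buf.add(ev)) // false: still under 5MB
	}
	fmt.Println(buf.add(ev))      // true: cumulative size hit 5MB
	fmt.Println(len(buf.drain())) // all five buffered events
}
```

In the real SDK the timer-based flush presumably still exists; the size check just adds an earlier trigger so a single payload never exceeds what the backend accepts.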
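The `Dataset.Push` fix — chunking inserts across multiple `batch_update` calls instead of one oversized request — can be illustrated with a sketch. `chunkRecords` and the byte limit are hypothetical names, not the actual dd-trace-go implementation; the example uses a tiny 16-byte limit so the splits are easy to follow.

```go
package main

import "fmt"

// chunkRecords splits encoded dataset records into chunks whose
// cumulative size stays under maxBytes, so each chunk can be sent as
// its own batch_update request instead of one request the backend
// would reject for being too large. A single record larger than
// maxBytes still becomes its own chunk.
func chunkRecords(records [][]byte, maxBytes int) [][][]byte {
	var chunks [][][]byte
	var cur [][]byte
	size := 0
	for _, r := range records {
		// Start a new chunk when adding this record would exceed the
		// limit (but never emit an empty chunk).
		if size+len(r) > maxBytes && len(cur) > 0 {
			chunks = append(chunks, cur)
			cur, size = nil, 0
		}
		cur = append(cur, r)
		size += len(r)
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur)
	}
	return chunks
}

func main() {
	records := [][]byte{
		[]byte("0123456789"), // 10 bytes
		[]byte("0123456789"), // 10 bytes: would exceed 16, new chunk
		[]byte("0123"),       // 4 bytes: fits with the previous record
	}
	for _, c := range chunkRecords(records, 16) {
		fmt.Println(len(c)) // prints 1, then 2
	}
}
```

The same shape works for any "payload too large" boundary: size-bound the batches client-side rather than relying on a single bulk endpoint to accept arbitrary sizes.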