fix(jfr): prevent duplicate method-pool ids (PROF-15130)#602
Conversation
resolveMethod() derived the method-pool id from _method_map->size()+1. cleanupUnreferencedMethods() erases entries, so size()+1 later reissues an id still owned by a live method, producing duplicate jdk.types.Method constant-pool ids in a chunk. First-wins strict parsers (JMC, Datadog backend) then render a phantom method. Replace size()+1 with a free-list id allocator (allocId/freeId) that recycles an id only after the owning method is erased, so no two live methods ever share an id. Also drop the dead _base_id field and FlightRecorder::flush(), simplify switchChunk to the always-valid-fd path, and add an open()-failure guard in dump(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@codex review |
CI Test ResultsRun: #27767532406 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled Summary: Total: 32 | Passed: 32 | Failed: 0 Updated: 2026-06-18 15:06:34 UTC |
There was a problem hiding this comment.
Pull request overview
This PR fixes PROF-15130 by replacing the JFR method constant-pool id assignment scheme (size()+1) with a free-list allocator that prevents id reuse collisions among live methods within a chunk, eliminating “phantom” method symbolization in strict JFR parsers.
Changes:
- Introduce
MethodMap::allocId()/freeId()and wire it into method resolution + cleanup to ensure live-id uniqueness. - Remove the now-dead
_base_idoffset scheme and delete the unusedFlightRecorder::flush(), simplifyingswitchChunk()to the always-valid-fd path. - Add regression coverage via a new gtest and a raw-chunk JUnit oracle; also add an
open()failure guard inFlightRecorder::dump().
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| ddprof-test/src/test/java/com/datadoghq/profiler/metadata/MethodIdReuseTest.java | Adds a regression test that parses raw JFR chunks to detect duplicate jdk.types.Method ids. |
| ddprof-lib/src/test/cpp/methodMapId_ut.cpp | Adds a gtest suite validating the allocator invariants and demonstrating the old size()+1 collision. |
| ddprof-lib/src/main/cpp/flightRecorder.h | Adds the free-list id allocator to MethodMap and removes _base_id / flush() declarations. |
| ddprof-lib/src/main/cpp/flightRecorder.cpp | Uses allocId() for method ids, recycles ids on erase, removes _base_id usage, simplifies chunk switching, and guards dump file open() failure. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7e6d55d6ac
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Reliability/chaos tests download the branch SNAPSHOT from Maven, which is published by deploy-artifact. Without the dependency the child pipeline could start before the snapshot existed (cold branch) or pull a stale one. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- generate a distinct Runnable class per target (was one lambda call site -> one method id) so phase 1 builds the high-id population the oracle depends on - drop the unreachable 9th-byte branch in readVarLong - remove unused imports (ByteOrder, Assertions.fail) - delete temp JFR dumps in after() unless ddprof_test.keep_jfrs Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rkennke
left a comment
There was a problem hiding this comment.
Looks good to me. There seem to be some minor unrelated changes, your call to keep them or propose in a separate change.
| // if the filename to dump the recording to is specified move the current | ||
| // working file there | ||
| int copy_fd = open(filename, O_CREAT | O_RDWR | O_TRUNC, 0644); | ||
| if (copy_fd == -1) { |
There was a problem hiding this comment.
This looks unrelated, but still a reasonable change. Any particular reasoning behind it? (Safe to keep, IMO, just curious).
There was a problem hiding this comment.
Yes. I took a stab at finally removing dead code - actually, the work started with the assumption that there were some issues in this part of code so I started elimination dead branches and then it looked too pretty to revert :)
|
@rkennke thanks for the prompt review! |
What does this PR do?:
Replaces the JFR method-pool id scheme
mi->_key = _method_map->size() + 1with a free-list id allocator (MethodMap::allocId()/freeId()) that recycles an id only after the owning method is erased. This guarantees no two live methods ever share ajdk.types.Methodconstant-pool id within a chunk.This PR also drops the now-dead
_base_idfield (and all| _base_idreferences) and the unusedFlightRecorder::flush(), simplifiesswitchChunk()to the always-valid-fd path, and adds a missingopen()-failure guard inFlightRecorder::dump().Motivation:
Root cause (PROF-15130):
resolveMethod()assignedmi->_key = _method_map->size() + 1.cleanupUnreferencedMethods()erases unreferenced entries and shrinks the map, sosize()+1later reissues an id that is still owned by a surviving method. Two live methods then share one id in a single chunk's method constant pool. First-wins strict parsers (JMC, jafar, the Datadog backend) render the wrong (phantom) method for that id — phantom symbolization. Last-wins OpenJDKjfrhappened to be unaffected, which is why this went unnoticed.The free-list allocator recycles a freed id only after its method is erased (the erased method is never written again), bounding the id range to the peak number of live methods while guaranteeing no live-id collisions.
Additional Notes:
The
_base_idremoval is required by the allocator change: with recycled ids the<< 0x1000000chunk-base offset is no longer meaningful and the field is dead.Unrelated CI fix (
ci: gate reliability tests on deploy-artifact):Not part of the PROF-15130 fix, but needed to make this PR's CI go green. The reliability/chaos child pipeline downloads
com.datadoghq:ddprof:<branch>-SNAPSHOTfrom Maven snapshots, an artifact published bydeploy-artifact.run-reliability-testsdeclared explicitneeds:(so GitLab's DAG ignored stage order) but did not listdeploy-artifact, so on a cold branch the tests ran before the snapshot was published and failed with "no profiler artifact available" — and on warm branches silently tested the previous push's stale snapshot. Addsdeploy-artifact(optional: true, so release branches where it never runs stay satisfiable) to the gate.How to test the change?:
Two regression tests are added:
methodMapId_ut— drives the allocator directly: models thesize()+1collision in the exact erase+reuse cycle (pinning the bug so the test is not vacuous) and proves the allocator never hands out a still-live id.MethodIdReuseTest— a dependency-free raw-chunk oracle that parses every chunk's constant pools and counts anyjdk.types.Methodid mapping to two distinct method definitions. A deterministic erase+reuse driver (phase-1 lambdas age out pastAGE_THRESHOLDwhile a persistent set survives) fails without the fix and reports 0 duplicates with it.Local results (debug build, macOS):
:ddprof-lib:buildDebug -Pskip-gtest→ BUILD SUCCESSFUL:ddprof-lib:gtestDebug_methodMapId_ut→[ PASSED ] 5 tests:ddprof-test:testDebug --tests MethodIdReuseTest→ tests=1, failures=0; every dump reportsduplicate method-pool ids: 0For Datadog employees:
If this PR touches code that signs or publishes builds or packages, or handles
credentials of any kind, I've requested a security review (run the
dd:platform-security-reviewskill, or file a request via the PSEC review form).
bewairealso runs automatically on every PR.This PR doesn't touch any of that.
JIRA: PROF-15130
🤖 Generated with Claude Code