Skip to content

fix(jfr): prevent duplicate method-pool ids (PROF-15130)#602

Merged
jbachorik merged 4 commits into
mainfrom
jb/prof-15130-method-pool-ids
Jun 18, 2026
Merged

fix(jfr): prevent duplicate method-pool ids (PROF-15130)#602
jbachorik merged 4 commits into
mainfrom
jb/prof-15130-method-pool-ids

Conversation

@jbachorik

@jbachorik jbachorik commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do?:

Replaces the JFR method-pool id scheme mi->_key = _method_map->size() + 1 with a free-list id allocator (MethodMap::allocId() / freeId()) that recycles an id only after the owning method is erased. This guarantees no two live methods ever share a jdk.types.Method constant-pool id within a chunk.

This PR also drops the now-dead _base_id field (and all | _base_id references) and the unused FlightRecorder::flush(), simplifies switchChunk() to the always-valid-fd path, and adds a missing open()-failure guard in FlightRecorder::dump().

Motivation:

Root cause (PROF-15130): resolveMethod() assigned mi->_key = _method_map->size() + 1. cleanupUnreferencedMethods() erases unreferenced entries and shrinks the map, so size()+1 later reissues an id that is still owned by a surviving method. Two live methods then share one id in a single chunk's method constant pool. First-wins strict parsers (JMC, jafar, the Datadog backend) render the wrong (phantom) method for that id — phantom symbolization. Last-wins OpenJDK jfr happened to be unaffected, which is why this went unnoticed.

The free-list allocator recycles a freed id only after its method is erased (the erased method is never written again), bounding the id range to the peak number of live methods while guaranteeing no live-id collisions.

Additional Notes:

The _base_id removal is required by the allocator change: with recycled ids the << 0x1000000 chunk-base offset is no longer meaningful and the field is dead.

Unrelated CI fix (ci: gate reliability tests on deploy-artifact):

Not part of the PROF-15130 fix, but needed to make this PR's CI go green. The reliability/chaos child pipeline downloads com.datadoghq:ddprof:<branch>-SNAPSHOT from Maven snapshots, an artifact published by deploy-artifact. run-reliability-tests declared explicit needs: (so GitLab's DAG ignored stage order) but did not list deploy-artifact, so on a cold branch the tests ran before the snapshot was published and failed with "no profiler artifact available" — and on warm branches silently tested the previous push's stale snapshot. Adds deploy-artifact (optional: true, so release branches where it never runs stay satisfiable) to the gate.

How to test the change?:

Two regression tests are added:

  • gtest methodMapId_ut — drives the allocator directly: models the size()+1 collision in the exact erase+reuse cycle (pinning the bug so the test is not vacuous) and proves the allocator never hands out a still-live id.
  • JUnit MethodIdReuseTest — a dependency-free raw-chunk oracle that parses every chunk's constant pools and counts any jdk.types.Method id mapping to two distinct method definitions. A deterministic erase+reuse driver (phase-1 lambdas age out past AGE_THRESHOLD while a persistent set survives) fails without the fix and reports 0 duplicates with it.

Local results (debug build, macOS):

  • :ddprof-lib:buildDebug -Pskip-gtest → BUILD SUCCESSFUL
  • :ddprof-lib:gtestDebug_methodMapId_ut[ PASSED ] 5 tests
  • :ddprof-test:testDebug --tests MethodIdReuseTest → tests=1, failures=0; every dump reports duplicate method-pool ids: 0

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a security review (run the dd:platform-security-review
    skill, or file a request via the PSEC review form).
    bewaire also runs automatically on every PR.

  • This PR doesn't touch any of that.

  • JIRA: PROF-15130

🤖 Generated with Claude Code

resolveMethod() derived the method-pool id from _method_map->size()+1.
cleanupUnreferencedMethods() erases entries, so size()+1 later reissues an
id still owned by a live method, producing duplicate jdk.types.Method
constant-pool ids in a chunk. First-wins strict parsers (JMC, Datadog
backend) then render a phantom method.

Replace size()+1 with a free-list id allocator (allocId/freeId) that
recycles an id only after the owning method is erased, so no two live
methods ever share an id. Also drop the dead _base_id field and
FlightRecorder::flush(), simplify switchChunk to the always-valid-fd path,
and add an open()-failure guard in dump().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik

Copy link
Copy Markdown
Collaborator Author

@codex review

@dd-octo-sts

dd-octo-sts Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

CI Test Results

Run: #27767532406 | Commit: 5ee3fb4 | Duration: 19m 11s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-06-18 15:06:34 UTC

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes PROF-15130 by replacing the JFR method constant-pool id assignment scheme (size()+1) with a free-list allocator that prevents id reuse collisions among live methods within a chunk, eliminating “phantom” method symbolization in strict JFR parsers.

Changes:

  • Introduce MethodMap::allocId() / freeId() and wire it into method resolution + cleanup to ensure live-id uniqueness.
  • Remove the now-dead _base_id offset scheme and delete the unused FlightRecorder::flush(), simplifying switchChunk() to the always-valid-fd path.
  • Add regression coverage via a new gtest and a raw-chunk JUnit oracle; also add an open() failure guard in FlightRecorder::dump().

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
ddprof-test/src/test/java/com/datadoghq/profiler/metadata/MethodIdReuseTest.java Adds a regression test that parses raw JFR chunks to detect duplicate jdk.types.Method ids.
ddprof-lib/src/test/cpp/methodMapId_ut.cpp Adds a gtest suite validating the allocator invariants and demonstrating the old size()+1 collision.
ddprof-lib/src/main/cpp/flightRecorder.h Adds the free-list id allocator to MethodMap and removes _base_id / flush() declarations.
ddprof-lib/src/main/cpp/flightRecorder.cpp Uses allocId() for method ids, recycles ids on erase, removes _base_id usage, simplifies chunk switching, and guards dump file open() failure.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ddprof-test/src/test/java/com/datadoghq/profiler/metadata/MethodIdReuseTest.java Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7e6d55d6ac

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ddprof-test/src/test/java/com/datadoghq/profiler/metadata/MethodIdReuseTest.java Outdated
@datadog-datadog-prod-us1-2

datadog-datadog-prod-us1-2 Bot commented Jun 18, 2026

Copy link
Copy Markdown

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 14 Pipeline jobs failed

DataDog/java-profiler | benchmarks-candidate-aarch64: [alloc]   View in Datadog   GitLab

DataDog/java-profiler | benchmarks-candidate-aarch64: [cpu,wall,alloc,memleak]   View in Datadog   GitLab

DataDog/java-profiler | benchmarks-candidate-aarch64: [cpu,wall]   View in Datadog   GitLab

View all 14 failed jobs.

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: b4c615e | Docs | Datadog PR Page | Give us feedback!

jbachorik and others added 2 commits June 18, 2026 11:40
Reliability/chaos tests download the branch SNAPSHOT from Maven, which is
published by deploy-artifact. Without the dependency the child pipeline could
start before the snapshot existed (cold branch) or pull a stale one.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- generate a distinct Runnable class per target (was one lambda call
  site -> one method id) so phase 1 builds the high-id population the
  oracle depends on
- drop the unreachable 9th-byte branch in readVarLong
- remove unused imports (ByteOrder, Assertions.fail)
- delete temp JFR dumps in after() unless ddprof_test.keep_jfrs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jbachorik jbachorik marked this pull request as ready for review June 18, 2026 14:43
@jbachorik jbachorik requested a review from a team as a code owner June 18, 2026 14:43
@jbachorik jbachorik requested review from rkennke and zhengyu123 June 18, 2026 16:35

@rkennke rkennke left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. There seem to be some minor unrelated changes, your call to keep them or propose in a separate change.

// if the filename to dump the recording to is specified move the current
// working file there
int copy_fd = open(filename, O_CREAT | O_RDWR | O_TRUNC, 0644);
if (copy_fd == -1) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks unrelated, but still a reasonable change. Any particular reasoning behind it? (Safe to keep, IMO, just curious).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I took a stab at finally removing dead code - actually, the work started with the assumption that there were some issues in this part of code so I started elimination dead branches and then it looked too pretty to revert :)

Comment thread .gitlab-ci.yml
@jbachorik

Copy link
Copy Markdown
Collaborator Author

@rkennke thanks for the prompt review!

@jbachorik jbachorik merged commit 7a0af67 into main Jun 18, 2026
105 checks passed
@jbachorik jbachorik deleted the jb/prof-15130-method-pool-ids branch June 18, 2026 17:05
@github-actions github-actions Bot added this to the 1.45.0 milestone Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants