Add support for emitting cgroup IDs from perf events #297
Open
leafcompost wants to merge 3 commits into
This patch makes it possible to record the cgroup ID that perf reports on each tracepoint sample. To do so, we add a builder, `with_cgroup_data()`, that sets `PERF_SAMPLE_CGROUP` on the session (if we confirm that it has kernel support via probing). The rest of the changes are plumbing that threads the cgroup ID to the record-trace side.

`PERF_SAMPLE_CGROUP` has been part of perf_event_open(2) since Linux 5.7. When it is set, the kernel records the cgroup ID of the sampled task on every sample. one-collect already defines the `PERF_SAMPLE_CGROUP` constant in its ABI definitions, but it never set the bit or read the value.

For our use cases, we need to attribute each event back to the service that caused it to fire. Today we can do that attribution on the PID from the event, but PIDs get reused and the PIDs of forked processes/children have to be looked up again. With this change, each event carries the cgroup ID of the originating task, which lets us attribute the connection to the cgroup (of its systemd service) easily.
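To illustrate the attribution step described above, here is a minimal sketch of turning a recorded cgroup ID back into a cgroup directory. It is not part of this patch. It assumes a cgroup v2 hierarchy (typically mounted at `/sys/fs/cgroup`) and that, on a 64-bit kernel, the cgroup ID equals the inode number of the cgroup's directory. The function name `find_cgroup_dir` is hypothetical.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::{Path, PathBuf};

/// Depth-first search under `root` for a directory whose inode number equals
/// `cgroup_id`. Assumption: the cgroup ID matches the directory inode number.
fn find_cgroup_dir(root: &Path, cgroup_id: u64) -> io::Result<Option<PathBuf>> {
    if fs::metadata(root)?.ino() == cgroup_id {
        return Ok(Some(root.to_path_buf()));
    }
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        // file_type() does not follow symlinks, so only real directories recurse.
        if entry.file_type()?.is_dir() {
            if let Some(found) = find_cgroup_dir(&entry.path(), cgroup_id)? {
                return Ok(Some(found));
            }
        }
    }
    Ok(None)
}
```

A caller would pass the mount point and the ID taken from a sample, for example `find_cgroup_dir(Path::new("/sys/fs/cgroup"), id)`. A cgroup that is removed while the walk is in progress surfaces as an `io::Error`, so a real consumer would probably cache the mapping or tolerate that error.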
Closes #174.
Usage
This behavior is enabled by default. However, unlike the other optional sample fields, `with_cgroup_data()` can fail outright on kernels before 5.7, so we request it only after a kernel-support probe succeeds. Before turning it on for a session, we open a throwaway perf event with just that sample bit set. If the kernel accepts it, we obtain the cgroup ID on every sample.

A consumer can read it off the sample with `context.cgroup_id()`. Or, for a usual record-trace script like so:

Running record-trace in the background gives you something like this:
Performance implications
The change adds very little to the per-sample hot path. When cgroup capture is on, each tracepoint sample does one extra 8-byte read at a known offset in the sample record.
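As a rough picture of what that extra read amounts to, the sketch below pulls a native-endian `u64` out of a raw sample record at a caller-supplied offset. The helper name `read_u64_at` and the way the offset is obtained are assumptions for illustration, not one-collect's actual code.

```rust
/// Read a native-endian u64 at `offset`, or return None if the record is too
/// short. Perf sample fields are written in the host's byte order.
fn read_u64_at(record: &[u8], offset: usize) -> Option<u64> {
    // checked_add guards against overflow for very large offsets.
    let end = offset.checked_add(8)?;
    let bytes = record.get(offset..end)?;
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    Some(u64::from_ne_bytes(buf))
}
```

The bounds check and the 8-byte copy are the only per-sample work in this sketch, which is consistent with the small hot-path cost described above.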
To measure how much CPU time this would add to record-trace, I start with a small program to generate some syscall traffic (here I do a million `write(2)`s):

Then, I use a script like the following to repeatedly invoke trials of record-trace, against either the main branch build or my feature branch build. The time that it measures is taken from record-trace's own PID:
After 30 trials each, I get:
So there is no meaningful difference in CPU time between the two versions beyond noise (Δ = after - main = 0.0015s (-0.86%)).
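For reference, the Δ above is a difference of per-branch mean CPU times, expressed both in seconds and as a percentage of the main-branch mean. A minimal sketch of that arithmetic follows. The helper names are hypothetical, and the inputs would be the per-trial CPU times in seconds for each build.

```rust
/// Arithmetic mean of the trial times. Returns NaN for an empty slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Returns (delta in seconds, delta as a percentage of the main-branch mean),
/// both computed as after - main, so the two values always share a sign.
fn cpu_time_delta(main: &[f64], after: &[f64]) -> (f64, f64) {
    let base = mean(main);
    let delta = mean(after) - base;
    (delta, 100.0 * delta / base)
}
```

With made-up inputs (not the measurements from this PR) of `[1.0, 2.0, 3.0]` for main and `[2.0, 3.0, 4.0]` for the feature branch, the means are 2.0 and 3.0, giving a delta of 1.0 s, or 50% of the main-branch mean.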