Add support for emitting cgroup IDs from perf events #297

Open: leafcompost wants to merge 3 commits into microsoft:main from leafcompost:record-cgid

leafcompost (Member) commented Jun 17, 2026

This patch makes it possible to record the cgroup ID that perf reports on each tracepoint sample. To do so, we add a builder, with_cgroup_data(), that sets PERF_SAMPLE_CGROUP on the session (if we confirm that it has kernel support via probing). The rest of the changes are plumbing that threads the cgroup ID to the record-trace side.

PERF_SAMPLE_CGROUP has been part of perf_event_open(2) since Linux 5.7. When it is set, the kernel records the cgroup ID of the sampled task on every sample. one-collect already defines the PERF_SAMPLE_CGROUP constant in its ABI definitions, but it never set the bit or read the value.

For our use cases, we need to attribute each event back to the service that caused it to fire. Today we attribute by the PID on the event, but PIDs get reused, and the PIDs of forked children have to be looked up again. With this change, each event carries the cgroup ID of the originating task, so we can attribute it directly to the cgroup of its systemd service.
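As a rough illustration of the attribution step, the sketch below resolves a cgroup ID to a cgroup path. It assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup on a 64-bit kernel, where the ID that perf reports generally matches the inode number of the cgroup's directory. The helper names build_cgroup_index and service_for are hypothetical and are not part of one-collect.

```python
import os


def build_cgroup_index(root="/sys/fs/cgroup"):
    """Map directory inode number -> cgroup path relative to root.

    Assumption: on cgroup v2 the cgroup ID equals the directory's inode number.
    """
    index = {}
    for dirpath, _dirnames, _filenames in os.walk(root):
        try:
            ino = os.stat(dirpath).st_ino
        except OSError:
            continue  # the cgroup was removed while we were walking
        if dirpath == root:
            index[ino] = "/"
        else:
            index[ino] = "/" + os.path.relpath(dirpath, root)
    return index


def service_for(cgroup_id, index):
    """Return the cgroup path for an ID, or None if it is not in the index."""
    return index.get(cgroup_id)
```

For a systemd service the resolved path would typically end in the unit name, for example a directory under system.slice. A long-running consumer would need to refresh the index, since cgroups come and go.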

Closes #174.

Usage

This behavior is enabled by default. However, unlike the other optional sample fields, requesting PERF_SAMPLE_CGROUP fails outright on kernels before 5.7, so with_cgroup_data() sets the bit only after a kernel-support probe succeeds. Before turning it on for a session, we open a throwaway perf event with just that sample bit set. If the kernel accepts it, the session requests the cgroup ID on every sample.
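The probing idea can be sketched outside of one-collect as follows. This is an illustrative Python version using ctypes, not the Rust code in this patch. It assumes Linux on x86_64 or aarch64 (the only syscall numbers filled in), uses a software CPU-clock event as the throwaway event, and treats EINVAL as "not supported". Any other error, such as a permission failure, is reported as undetermined. The function name probe_cgroup_sampling is hypothetical.

```python
import ctypes
import errno
import os
import platform
import struct
import sys

PERF_TYPE_SOFTWARE = 1
PERF_COUNT_SW_CPU_CLOCK = 0
PERF_SAMPLE_CGROUP = 1 << 21
PERF_ATTR_SIZE_VER0 = 64  # smallest perf_event_attr size the kernel accepts
# Bits in the attr flag word: disabled, exclude_kernel, exclude_hv.
ATTR_FLAGS = (1 << 0) | (1 << 5) | (1 << 6)
# perf_event_open syscall numbers (assumption: only these two arches handled).
SYSCALL_NR = {"x86_64": 298, "aarch64": 241}


def probe_cgroup_sampling():
    """Return True/False on a clear answer, None when it cannot be determined."""
    nr = SYSCALL_NR.get(platform.machine())
    if not sys.platform.startswith("linux") or nr is None:
        return None
    attr = ctypes.create_string_buffer(PERF_ATTR_SIZE_VER0)
    # Layout: type u32, size u32, config u64, sample_period u64,
    # sample_type u64, read_format u64, flag bits u64.
    struct.pack_into("=IIQQQQQ", attr, 0,
                     PERF_TYPE_SOFTWARE, PERF_ATTR_SIZE_VER0,
                     PERF_COUNT_SW_CPU_CLOCK, 0,
                     PERF_SAMPLE_CGROUP, 0, ATTR_FLAGS)
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # pid=0 (this process), cpu=-1 (any CPU), group_fd=-1, flags=0.
    fd = libc.syscall(ctypes.c_long(nr), attr, ctypes.c_int(0),
                      ctypes.c_int(-1), ctypes.c_int(-1), ctypes.c_ulong(0))
    if fd >= 0:
        os.close(fd)  # the event was only a probe
        return True
    if ctypes.get_errno() == errno.EINVAL:
        return False  # sample bit rejected by this kernel
    return None  # e.g. EACCES or EPERM: no conclusion about support
```

Note that EINVAL can also come from a kernel that is new enough but was built without perf cgroup support, so a negative result means "do not request the bit" rather than "kernel older than 5.7".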

A consumer can read it off the sample with context.cgroup_id(). For example, with a typical record-trace script such as:

let ev = event_from_tracefs("syscalls", "sys_enter_write");
record_event(ev);

Running record-trace in the background gives you something like this:

+1.4937: syscalls/sys_enter_write(cgwriter, PID=1881653): CGroup=606750 1 Count
+1.4937: syscalls/sys_enter_write(cgwriter, PID=1881654): CGroup=606768 1 Count

Performance implications

The change adds very little to the per-sample hot path. When cgroup capture is on, each tracepoint sample does one extra 8-byte read at a known offset in the sample record.
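To make the "one extra 8-byte read" concrete, here is a minimal Python sketch of locating the cgroup field in a raw PERF_RECORD_SAMPLE. It is not the one-collect parser. It assumes the session's sample_type is limited to TID, TIME, CPU, RAW, and CGROUP, and it follows the field order documented in perf_event_open(2). The names cgroup_offset and read_cgroup_id are hypothetical.

```python
import struct

PERF_SAMPLE_TID = 1 << 1
PERF_SAMPLE_TIME = 1 << 2
PERF_SAMPLE_CPU = 1 << 7
PERF_SAMPLE_RAW = 1 << 10
PERF_SAMPLE_CGROUP = 1 << 21
HEADER_SIZE = 8  # perf_event_header: u32 type, u16 misc, u16 size

SUPPORTED = (PERF_SAMPLE_TID | PERF_SAMPLE_TIME | PERF_SAMPLE_CPU
             | PERF_SAMPLE_RAW | PERF_SAMPLE_CGROUP)


def cgroup_offset(record, sample_type):
    """Byte offset of the cgroup field, or None if CGROUP is not sampled."""
    if sample_type & ~SUPPORTED:
        raise ValueError("sketch only handles TID, TIME, CPU, RAW, CGROUP")
    if not sample_type & PERF_SAMPLE_CGROUP:
        return None
    off = HEADER_SIZE
    if sample_type & PERF_SAMPLE_TID:
        off += 8  # u32 pid, u32 tid
    if sample_type & PERF_SAMPLE_TIME:
        off += 8  # u64 time
    if sample_type & PERF_SAMPLE_CPU:
        off += 8  # u32 cpu, u32 reserved
    if sample_type & PERF_SAMPLE_RAW:
        (raw_size,) = struct.unpack_from("=I", record, off)
        off += 4 + raw_size  # u32 size followed by the tracepoint payload
    return off


def read_cgroup_id(record, sample_type):
    """Read the u64 cgroup ID from a sample record in native byte order."""
    off = cgroup_offset(record, sample_type)
    if off is None:
        return None
    return struct.unpack_from("=Q", record, off)[0]
```

Everything before the final unpack is work a parser already does to walk the record, which is why the added per-sample cost is a single read.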

To measure how much CPU time this adds to record-trace, I start with a small program that generates syscall traffic (by default, a million write(2) calls after a one-second startup delay):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    long writes = (argc > 1) ? atol(argv[1]) : 1000000L;
    double delay = (argc > 2) ? atof(argv[2]) : 1.0;

    usleep((useconds_t)(delay * 1e6));

    int fd = open("/dev/null", O_WRONLY);
    char b = 'x';
    for (long i = 0; i < writes; i++) {
        if (write(fd, &b, 1) < 0) {
            return 1;
        }
    }
    return 0;
}

Then, I use a script like the following to run repeated trials of record-trace, built from either the main branch or my feature branch. The time it reports is the CPU time (user plus system) of the record-trace process itself:

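import os
import signal
import subprocess
import time

def trial(binary):
    # Start the generator; it sleeps 1.0s before writing so record-trace can attach.
    gen = subprocess.Popen(["/tmp/perf_gen", "1000000", "1.0"])
    time.sleep(0.2)
    rt = subprocess.Popen(
        [binary, "--script-file", "/tmp/perf.script", "--pid", str(gen.pid),
         "--out", "/tmp/rtout", "--log-mode", "disabled"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    gen.wait()
    time.sleep(0.3)
    # Stop record-trace, then reap it with wait4 to collect its resource usage.
    os.kill(rt.pid, signal.SIGINT)
    _, _, usage = os.wait4(rt.pid, 0)
    # User plus system CPU seconds consumed by record-trace itself.
    return usage.ru_utime + usage.ru_stime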

After 30 trials each, I get:

Version   n    mean (s)   sd (s)   median (s)
main      30   0.1785     0.0160   0.1762
after     30   0.1770     0.0052   0.1768

So there is no meaningful difference in CPU time between the two versions beyond noise (Δ = after - main = -0.0015s, or -0.86%).
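For anyone reproducing the comparison, the summary figures can be computed from the per-trial CPU times along these lines. This is a sketch: summarize and relative_delta are hypothetical helper names, and it uses the sample standard deviation, which may or may not be the estimator used for the numbers above.

```python
import statistics


def summarize(samples):
    """Summary statistics for a list of per-trial CPU times in seconds."""
    return {
        "n": len(samples),
        "mean": statistics.mean(samples),
        "sd": statistics.stdev(samples),  # sample (n - 1) standard deviation
        "median": statistics.median(samples),
    }


def relative_delta(main, after):
    """Return (after - main) of the means, in seconds and as a percent of main."""
    base = statistics.mean(main)
    delta = statistics.mean(after) - base
    return delta, 100.0 * delta / base
```

Feeding it two lists built from repeated trial() calls, one per binary, yields the rows of the table and the Δ line. A negative delta means the feature branch used less CPU time than main.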

@leafcompost leafcompost marked this pull request as ready for review June 17, 2026 21:13

Development

Successfully merging this pull request may close these issues.

Need to bubble up cgroup id from events