[SVLS-8351] Add CPU Enhanced Metrics in Linux Azure Functions (#77)
kathiehuang wants to merge 43 commits into main
Conversation
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c6a55dc810
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dfe28a3a43
```rust
self.last_usage_ns = current_usage_ns;

// Divide nanoseconds delta by collection interval to get usage rate in nanocores
let usage_rate_nc = delta_ns / self.collection_interval_secs as f64;
```
Use real elapsed time for CPU usage rate
`CpuMetricsCollector::collect_and_submit` computes `usage_rate_nc` by dividing the CPU delta by a fixed configured interval, but collection is driven from a `tokio::select!` loop whose flush branch awaits network I/O (`metrics_flusher.flush().await`). Real sampling gaps can therefore be longer or shorter than 3s under timeouts, retries, or scheduler delay, which skews the reported nanocore rate and can produce false spikes or drops in production dashboards.
Addressed this in c6a55dc - tracked the last collection time and the current time to get a more precise time interval to divide by!
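A minimal sketch of that fix using `std::time::Instant`; the struct, field, and method names here are illustrative, not the PR's exact API:

```rust
use std::time::Instant;

// Illustrative collector state; field names are assumptions, not the PR's.
struct CpuRate {
    last_usage_ns: f64,
    last_collect: Instant,
}

impl CpuRate {
    // Divide the CPU-time delta by the *measured* wall-clock gap rather than a
    // fixed configured interval, so flush stalls or scheduler delays don't
    // skew the nanocore rate.
    fn rate_nanocores(&mut self, current_usage_ns: f64, now: Instant) -> f64 {
        let elapsed_secs = now.duration_since(self.last_collect).as_secs_f64();
        let delta_ns = current_usage_ns - self.last_usage_ns;
        self.last_usage_ns = current_usage_ns;
        self.last_collect = now;
        delta_ns / elapsed_secs
    }
}
```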
```rust
    .await;
if let Some(ref windows_pipe_name) = dd_dogstatsd_windows_pipe_name {
    info!("dogstatsd-pipe: starting to listen on pipe {windows_pipe_name}");
let needs_aggregator = dd_use_dogstatsd || dd_enhanced_metrics;
```
Gate aggregator startup to the Azure enhanced-metrics path
The new `needs_aggregator` condition enables the aggregator whenever `DD_ENHANCED_METRICS_ENABLED` is true (the default), even in non-Azure environments where `cpu_collector` is never created. Deployments with `DD_USE_DOGSTATSD=false` therefore still start and flush an unused aggregator, and can log `DD_API_KEY not set` errors even though no enhanced metrics are emitted. This should be gated on `env_type == EnvironmentType::AzureFunction`.
Fixed by only setting DD_ENHANCED_METRICS to default true if we're in an Azure Function
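The gating can be sketched as follows; `EnvironmentType` mirrors the enum used in the PR, while the parsing helper itself is purely illustrative:

```rust
#[derive(PartialEq)]
enum EnvironmentType {
    AzureFunction,
    Other,
}

// Default DD_ENHANCED_METRICS_ENABLED to true only inside Azure Functions, so
// a non-Azure deployment never starts an aggregator just for enhanced metrics.
fn enhanced_metrics_enabled(raw: Option<&str>, env_type: &EnvironmentType) -> bool {
    match raw {
        Some(v) => v.eq_ignore_ascii_case("true"),
        None => *env_type == EnvironmentType::AzureFunction,
    }
}
```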
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 058ef533f8
```rust
let mut cpu_collector = if dd_enhanced_metrics && env_type == EnvironmentType::AzureFunction {
    aggregator_handle.as_ref().map(|handle| {
        let tags = build_cpu_metrics_tags();
        CpuMetricsCollector::new(handle.clone(), tags)
```
Skip CPU collector when flusher setup has failed
This block creates `CpuMetricsCollector` whenever Azure enhanced metrics are enabled, even if `start_aggregator` failed to build a flusher (`DD_API_KEY` missing, bad site/proxy) and returned `metrics_flusher = None`. In that scenario there is no flush path, but the collector still inserts two new timestamped distributions every 3s, so contexts accumulate until overflow and metrics are dropped while consuming resources. Enhanced collection should be gated on a usable flusher (or disabled when flushing is unavailable).
Addressed in c626c03 to only create `CpuMetricsCollector` if the metrics flusher was successfully created
```rust
if let Some(metrics_flusher) = metrics_flusher.as_ref() {
    debug!("Flushing dogstatsd metrics");
    metrics_flusher.flush().await;
```
Decouple CPU sampling from metrics flush awaits
The CPU collection timer and `metrics_flusher.flush().await` share the same `tokio::select!` loop, so a slow or timing-out flush blocks the task and prevents CPU collection from running at the intended 3-second cadence. When flushes stall (for example during network degradation), sampling becomes delayed and irregular, which directly reduces CPU usage metric fidelity. CPU collection should run in its own task or otherwise avoid being blocked by flush I/O.
Addressed in 071d2f0 to spawn the metrics flusher as an independent task from the collector so that a slow flush will not block CPU collection.
DOGSTATSD_TIMEOUT_DURATION is 5 seconds
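The decoupling can be illustrated with plain threads (the PR uses tokio tasks; the intervals are scaled down and all names are illustrative): a deliberately slow "flush" never delays the collector's ticks.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Run sampling and flushing on independent schedules so a stalled flush can
// only delay the next flush, never the next CPU sample.
fn run_decoupled() -> usize {
    let ticks = Arc::new(AtomicUsize::new(0));
    let t = Arc::clone(&ticks);
    let collector = thread::spawn(move || {
        for _ in 0..5 {
            t.fetch_add(1, Ordering::SeqCst); // stand-in for a CPU sample
            thread::sleep(Duration::from_millis(20));
        }
    });
    let flusher = thread::spawn(|| {
        thread::sleep(Duration::from_millis(200)); // simulated slow flush
    });
    collector.join().unwrap();
    flusher.join().unwrap();
    ticks.load(Ordering::SeqCst)
}
```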
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 071d2f0488
```rust
tokio::spawn(async move {
    metrics_flusher.flush().await;
});
```
Serialize metrics flushes to prevent overlapping upload tasks
This change spawns a new `flush()` task on every interval tick, so if a flush takes longer than 10s (for example during network slowness or outages), multiple flushes run concurrently and keep accumulating over time. Because `Flusher::flush` performs network I/O with retries, this can happen in production and causes unnecessary task, socket, and memory pressure. The previous logic awaited each flush and guaranteed only one in-flight flush at a time.
The client has a 5-second timeout with RetryStrategy::LinearBackoff(3, 1), so this would require three consecutive full timeouts. I'm happy to add this, but I'm not sure if it's needed
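If overlap protection is wanted later, one hedged option is a try-lock guard that skips a tick instead of stacking flush tasks. This sketch uses a `std::sync::Mutex` as a stand-in for an async lock, and `do_flush` is a hypothetical stand-in for `metrics_flusher.flush()`:

```rust
use std::sync::Mutex;

// Returns true if a flush ran; false if one was already in flight and this
// tick was skipped instead of spawning a concurrent flush.
fn flush_if_idle(guard: &Mutex<()>, do_flush: impl FnOnce()) -> bool {
    match guard.try_lock() {
        Ok(_held) => {
            do_flush();
            true
        }
        Err(_) => false, // previous flush still running; skip this tick
    }
}
```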
duncanista left a comment:
I'd suggest using features for OS-specific business logic.
Also suggest checking how ADP is doing agent checks in Rust - this sounds like an agent check for a very specific use case.
```rust
use dogstatsd::metric::{SortedTags, EMPTY_TAGS};
use tokio_util::sync::CancellationToken;

const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;
```
Suggested change:
```diff
-const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;
+const CPU_METRICS_COLLECTION_INTERVAL_SECONDS: u64 = 3;
```
```rust
);

if let Err(e) = self.aggregator.insert_batch(vec![usage_metric]) {
    error!("Failed to insert CPU usage metric: {}", e);
```
In what situations would we see this error? Would we hit this repeatedly or can the aggregator recover from errors quickly? (Also applies to line 111)
`insert_batch` calls `tx.send` on an unbounded channel with infinite capacity. An error only happens if the receive half of the channel is closed or dropped, which means the aggregator service isn't working anymore and every subsequent call should also fail. Metrics would stop sending, with error logs on every attempted insert. It seems the only way to recover would be for the customer to stop and start their function app to restart the agent.
Error logging but continuing is what the lambda extension does
If we're worried about log spam, I could change this to return early on the CPU usage metric insert failure - this would halve the error logs
Or maybe a better solution would be to have collect_and_submit return a Result, and main.rs could set cpu_collector=None on error?
I personally like the return a Result option but am also curious why the lambda extension would send a repeating error log. Also, do you think this bit is unit testable?
Digging into it more, it seems like dogstatsd does the same thing
serverless-components/crates/dogstatsd/src/dogstatsd.rs
Lines 567 to 569 in 05e5c26
I think I should be able to create a handle with a dead receiver to unit test this if we want to do this! I talked with Shreya, and it doesn't seem like customers have been running into this - this pattern came from the existing enhanced metrics.
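The dead-receiver test idea can be shown with std channels; the real test would build a dogstatsd aggregator handle whose receiver has been dropped, and `insert` here is a hypothetical stand-in for `insert_batch`:

```rust
use std::sync::mpsc;

// Once the receive half is dropped, every send fails -- the same failure mode
// the error log above covers for the aggregator's unbounded channel.
fn insert(tx: &mpsc::Sender<u64>, metric: u64) -> Result<(), String> {
    tx.send(metric).map_err(|e| e.to_string())
}
```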
So, ~6 debug logs every 3 seconds? Do these go to Datadog & cost money?
duncanpharvey left a comment:
Excellent work! I added a few suggestions to consider
```toml
dogstatsd = { path = "../dogstatsd", default-features = true }
num_cpus = "1.16"
```
Are these dependencies needed in datadog-trace-agent?
Oh good catch - this was accidentally left over from before I moved the metrics collector into its own crate. Fixed in 2ad5f24!
```rust
let (metrics_flusher, aggregator_handle) = if needs_aggregator {
    debug!("Creating metrics flusher and aggregator");

    let (flusher, handle) =
```
I think a comment here to note why the aggregator is started separately from the dogstatsd listener would be helpful - just enough to be clear that there are different configuration options that require this (dogstatsd enabled/disabled, enhanced metrics enabled/disabled).
Maybe a unit test as well to assert that all of these combinations are covered?
Good point! I added a comment in 60cdecf
It seems like it'll be a little hard to write a meaningful unit test, since the aggregator/dogstatsd startup logic has side effects that make it hard to test in isolation. Maybe I could refactor the startup decision into a struct that describes what to start, separating the decision from execution?
```rust
struct AgentConfig {
    start_aggregator: bool,
    start_dogstatsd: bool,
    start_enhanced_metrics: bool,
}

fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
    AgentConfig {
        start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
        start_dogstatsd: dd_use_dogstatsd,
        start_enhanced_metrics: dd_enhanced_metrics,
    }
}
```
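A table-style unit test over that proposed resolver could cover all four flag combinations; the struct and function are repeated here so the sketch is self-contained:

```rust
struct AgentConfig {
    start_aggregator: bool,
    start_dogstatsd: bool,
    start_enhanced_metrics: bool,
}

fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
    AgentConfig {
        start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
        start_dogstatsd: dd_use_dogstatsd,
        start_enhanced_metrics: dd_enhanced_metrics,
    }
}

// The aggregator must start whenever either consumer needs it; the other two
// flags pass straight through.
fn check_all_combinations() -> bool {
    [(false, false), (false, true), (true, false), (true, true)]
        .iter()
        .all(|&(dogstatsd, enhanced)| {
            let c = resolve_agent_config(dogstatsd, enhanced);
            c.start_aggregator == (dogstatsd || enhanced)
                && c.start_dogstatsd == dogstatsd
                && c.start_enhanced_metrics == enhanced
        })
}
```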
```rust
    Ok(builder.build()?)
}

fn build_cpu_metrics_tags() -> Option<SortedTags> {
```
Would this method make more sense to live in the datadog-metrics-collector crate and be used internally within the crate?
I looked into how ADP does agent checks in Rust - it looks like it's still experimental and may be too high-level for this use case. I will make a Jira ticket for the backlog though so we can come back to this in the future and see if anything in the way they do checks is applicable!
What does this PR do?
Adds CPU limit and usage enhanced metrics for Linux Azure Functions.
datadog-metrics-collectorcrate that reads CPU metrics every second and submits them to the Datadog backend every 10 seconds whenDD_ENHANCED_METRICS_ENABLED=true(default on)CpuMetricsCollectorstruct andCpuStatsReadertrait. Currently this only collects CPU metrics in Linux. CPU metrics in Windows will be completed in a future PRazure.functions.enhanced.cpu.usage- container-level CPU consumption rate in nanocores, sourced fromcpuacct.usageazure.functions.enhanced.cpu.limit- CPU limit in nanocores, computed asmin(cpuset.cpus, cfs_quota/cfs_period), falling back to host CPU count if no cgroup limit is setAdditional Notes
libdd-common:resource_groupsubscription_idnameregionplan_tierserviceenvversionserverless_compat_versionazure.functions.*metrics asServerlessEnhancedorigin in the dogstatsd origin classifier so that they show up as Enhanced rather than Custom metrics in Datadog Metrics Summarystart_dogstatsdinto two functionsstart_aggregator, which starts the aggregator service and metrics flusherstart_dogstatsd_listener, which enables custom metrics to be received from user codeDD_USE_DOGSTATSDis offMotivation
https://datadoghq.atlassian.net/browse/SVLS-8351
Describe how to test/QA your changes
Build with serverless-compat-self-monitoring.
Added debug logs to verify calculations:
- `cpu.cfs_quota_us` is not set, so it falls back to the host CPU count from the `num_cpus` crate

This was deployed with the serverless-compat-self-monitoring pipeline across all runtimes and hosting plans. All hosting plans on Linux were tested to verify metrics submit correctly, and a Windows function was tested to verify that enhanced metrics are disabled in Windows environments.
Testing to compare against Azure Monitor is documented in an internal doc in Enhanced Metrics in the Serverless Compatibility Layer. Detailed calculations are explained in Calculating CPU Enhanced Metrics in Windows and Linux Azure Functions.
- `DD_USE_DOGSTATSD` is off
- `DD_ENHANCED_METRICS_ENABLED` is off

References: datadog-agent cgroup collection and calculation logic
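The referenced limit calculation (min of the cpuset CPU count and the CFS quota/period ratio, falling back to the host CPU count) might be sketched like this; the function and its inputs are illustrative stand-ins for values parsed from the cgroup files, not the crate's actual API:

```rust
const NANOCORES_PER_CORE: f64 = 1_000_000_000.0;

// cpuset_cpus: CPU count from cpuset.cpus; cfs_quota_us: cpu.cfs_quota_us
// (-1 means unlimited); cfs_period_us: cpu.cfs_period_us; host_cpus: fallback.
fn cpu_limit_nanocores(
    cpuset_cpus: Option<u64>,
    cfs_quota_us: i64,
    cfs_period_us: u64,
    host_cpus: u64,
) -> f64 {
    let quota_cores = if cfs_quota_us > 0 && cfs_period_us > 0 {
        Some(cfs_quota_us as f64 / cfs_period_us as f64)
    } else {
        None // quota of -1 (or missing period) means no CFS limit
    };
    let cores = match (cpuset_cpus.map(|n| n as f64), quota_cores) {
        (Some(a), Some(b)) => a.min(b), // min(cpuset.cpus, cfs_quota/cfs_period)
        (Some(a), None) => a,
        (None, Some(b)) => b,
        (None, None) => host_cpus as f64, // no cgroup limit set at all
    };
    cores * NANOCORES_PER_CORE
}
```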