
[SVLS-8351] Add CPU Enhanced Metrics in Linux Azure Functions#77

Open
kathiehuang wants to merge 43 commits into main from kathie.huang/add-cpu-enhanced-metrics

Conversation

@kathiehuang
Contributor

@kathiehuang kathiehuang commented Feb 13, 2026

What does this PR do?

Adds CPU limit and usage enhanced metrics for Linux Azure Functions.

  • Adds a new datadog-metrics-collector crate that reads CPU metrics every second and submits them to the Datadog backend every 10 seconds when DD_ENHANCED_METRICS_ENABLED=true (default on)
    • This creates an OS-agnostic CpuMetricsCollector struct and CpuStatsReader trait. Currently this only collects CPU metrics in Linux. CPU metrics in Windows will be completed in a future PR
  • Emits two new metrics:
    • azure.functions.enhanced.cpu.usage - container-level CPU consumption rate in nanocores, sourced from cpuacct.usage
    • azure.functions.enhanced.cpu.limit - CPU limit in nanocores, computed as min(cpuset.cpus, cfs_quota/cfs_period), falling back to host CPU count if no cgroup limit is set
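
As a sketch of the limit computation described above (illustrative names and signature, not the PR's actual code), the min(cpuset.cpus, cfs_quota/cfs_period) rule with a host-CPU fallback could look like:

```rust
// Hypothetical helper mirroring the described limit logic; a cfs_quota of -1
// (or an unreadable file) means "no CFS limit". Returns (limit in nanocores,
// whether we defaulted to the host CPU count).
fn cpu_limit_nanocores(
    cpuset_cpus: Option<u64>,  // CPU count parsed from cpuset.cpus
    cfs_quota_us: Option<i64>, // cpu.cfs_quota_us; -1 means unlimited
    cfs_period_us: u64,        // cpu.cfs_period_us
    host_cpu_count: u64,       // fallback when no cgroup limit is set
) -> (f64, bool) {
    const NANOCORES_PER_CPU: f64 = 1_000_000_000.0;
    let cpuset_limit = cpuset_cpus.map(|n| n as f64);
    let quota_limit = match cfs_quota_us {
        Some(q) if q > 0 => Some(q as f64 / cfs_period_us as f64),
        _ => None, // -1 or unreadable: no CFS limit
    };
    let limit_cpus = match (cpuset_limit, quota_limit) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (a, b) => a.or(b),
    };
    match limit_cpus {
        Some(cpus) => (cpus * NANOCORES_PER_CPU, false),
        None => (host_cpu_count as f64 * NANOCORES_PER_CPU, true),
    }
}
```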

Additional Notes

  • Tags attached to all CPU metrics:
    • Azure resource metadata from libdd-common:
    • resource_group
    • subscription_id
    • name
    • Metadata from other environment variables:
      • region
      • plan_tier
      • service
      • env
      • version
      • serverless_compat_version
  • Categorizes azure.functions.* metrics as ServerlessEnhanced origin in the dogstatsd origin classifier so that they show up as Enhanced rather than Custom metrics in Datadog Metrics Summary
  • Sets up:
    • CgroupStats struct for reading statistics from cgroup v1 files
      • This normalizes the stats to nanoseconds
    • CpuStats struct to store the computed CPU total and limit metrics
      • Converts u64 values to f64
      • Calculates CPU limit percentage
  • Separates start_dogstatsd into two functions
    • start_aggregator, which starts the aggregator service and metrics flusher
    • start_dogstatsd_listener, which enables custom metrics to be received from user code
      • This separation enables enhanced metrics to be submitted to the aggregator service and flushed even when DD_USE_DOGSTATSD is off
  • Metrics are submitted as distribution metrics because not all metrics have tags with a unique identifier from the instance they are sent from
  • If the collector cannot read the cgroup files successfully, it will not submit enhanced metrics for that interval and log accordingly
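
For illustration, counting CPUs from a cpuset.cpus list such as "0-1" (as the cgroup reading above must do) could be sketched like this; this is a hypothetical helper, not the PR's code:

```rust
// Parse a cpuset.cpus list ("0-1", "0-3,6,8-9", ...) into a CPU count.
// Returns None on malformed input so the caller can fall back gracefully.
fn parse_cpuset_cpus(contents: &str) -> Option<u64> {
    let mut count = 0u64;
    for part in contents.trim().split(',') {
        if part.is_empty() {
            return None;
        }
        match part.split_once('-') {
            Some((start, end)) => {
                let (s, e): (u64, u64) = (start.parse().ok()?, end.parse().ok()?);
                if e < s {
                    return None;
                }
                count += e - s + 1; // inclusive range: "0-1" is 2 CPUs
            }
            None => {
                let _: u64 = part.parse().ok()?; // single CPU index
                count += 1;
            }
        }
    }
    Some(count)
}
```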

Motivation

https://datadoghq.atlassian.net/browse/SVLS-8351

Describe how to test/QA your changes

Build with serverless-compat-self-monitoring.

Added debug logs to verify calculations:

DEBUG datadog_trace_agent::metrics_collector: Contents of /sys/fs/cgroup/cpuset/cpuset.cpus: 0-1
DEBUG datadog_trace_agent::metrics_collector: Range: ["0", "1"]
DEBUG datadog_trace_agent::metrics_collector: Total CPU count: 2
DEBUG datadog_trace_agent::metrics_collector: CFS scheduler quota is -1, setting to None
DEBUG datadog_trace_agent::metrics_collector: Could not read scheduler quota from /sys/fs/cgroup/cpu/cpu.cfs_quota_us
DEBUG datadog_trace_agent::metrics_collector: No CPU limit found, defaulting to host CPU count: 2 CPUs
DEBUG datadog_trace_agent::metrics_collector: Collected cpu stats!
DEBUG datadog_trace_agent::metrics_collector: CPU usage: 9871234519
DEBUG datadog_trace_agent::metrics_collector: CPU limit: 200%, defaulted: true
DEBUG datadog_trace_agent::metrics_collector: Submitting CPU metrics!
  • Oftentimes the scheduler quota cannot be read from cpu.cfs_quota_us, so the collector falls back to the host CPU count from the num_cpus crate

This was deployed with the serverless-compat-self-monitoring pipeline across all runtimes and hosting plans. All Linux hosting plans were tested to verify that metrics submit correctly, and a Windows function was tested to verify that enhanced metrics are disabled in Windows environments.

Testing to compare against Azure Monitor is documented in an internal doc in Enhanced Metrics in the Serverless Compatibility Layer. Detailed calculations are explained in Calculating CPU Enhanced Metrics in Windows and Linux Azure Functions.

  • Verified CPU usage increases with a CPU-intensive function
  • Verified CPU limit increases when the hosting plan is scaled up
  • Verified enhanced metrics still send when DD_USE_DOGSTATSD is off
  • Verified custom metrics still send when DD_ENHANCED_METRICS_ENABLED is off

References: datadog-agent cgroup collection and calculation logic

@kathiehuang kathiehuang changed the title [SVLS-8351] Add CPU Enhanced Metrics [SVLS-8351] Add CPU Enhanced Metrics in Linux Azure Functions Mar 6, 2026
@kathiehuang kathiehuang force-pushed the kathie.huang/add-cpu-enhanced-metrics branch 2 times, most recently from 7010e35 to 5953d68 Compare March 6, 2026 20:38
@DataDog DataDog deleted a comment from chatgpt-codex-connector bot Mar 7, 2026
@kathiehuang
Contributor Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6a55dc810

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@kathiehuang
Contributor Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment



Reviewed commit: dfe28a3a43


self.last_usage_ns = current_usage_ns;

// Divide nanoseconds delta by collection interval to get usage rate in nanocores
let usage_rate_nc = delta_ns / self.collection_interval_secs as f64;


P1 Badge Use real elapsed time for CPU usage rate

CpuMetricsCollector::collect_and_submit computes usage_rate_nc by dividing the CPU delta by a fixed configured interval, but collection is driven from a tokio::select! loop where the flush branch awaits network I/O (metrics_flusher.flush().await), so real sampling gaps can be longer/shorter than 3s under timeout/retry or scheduler delay; this skews the reported nanocore rate and can produce false spikes/drops in production dashboards.


Contributor Author


Addressed this in c6a55dc - tracked the last collection time and the current time to get a more precise time interval to divide by!
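
A minimal std-only sketch of that elapsed-time approach (field and method names are illustrative, not the PR's actual code):

```rust
use std::time::Instant;

// Track both the last cumulative CPU usage and the wall-clock time of the
// last sample, so the rate divides by real elapsed time rather than a fixed
// configured interval.
struct UsageSampler {
    last_usage_ns: u64,
    last_sample_at: Instant,
}

impl UsageSampler {
    /// Returns the usage rate in nanocores (CPU-ns per wall-clock second),
    /// or None if the counter went backwards or no time has elapsed.
    fn sample(&mut self, current_usage_ns: u64) -> Option<f64> {
        let now = Instant::now();
        let elapsed_secs = now.duration_since(self.last_sample_at).as_secs_f64();
        let delta_ns = current_usage_ns.checked_sub(self.last_usage_ns)? as f64;
        self.last_usage_ns = current_usage_ns;
        self.last_sample_at = now;
        if elapsed_secs <= 0.0 {
            return None;
        }
        Some(delta_ns / elapsed_secs)
    }
}
```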

.await;
if let Some(ref windows_pipe_name) = dd_dogstatsd_windows_pipe_name {
info!("dogstatsd-pipe: starting to listen on pipe {windows_pipe_name}");
let needs_aggregator = dd_use_dogstatsd || dd_enhanced_metrics;


P2 Badge Gate aggregator startup to Azure enhanced-metrics path

The new needs_aggregator condition enables the aggregator whenever DD_ENHANCED_METRICS_ENABLED is true (default), even in non-Azure environments where cpu_collector is never created, so deployments with DD_USE_DOGSTATSD=false still start/flush an unused aggregator and can log DD_API_KEY not set errors despite no enhanced metrics being emitted; this should be gated by env_type == EnvironmentType::AzureFunction.


Contributor Author


Fixed by only setting DD_ENHANCED_METRICS to default true if we're in an Azure Function
36bba17

@kathiehuang
Contributor Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment



Reviewed commit: 058ef533f8


Comment on lines +214 to +217
let mut cpu_collector = if dd_enhanced_metrics && env_type == EnvironmentType::AzureFunction {
aggregator_handle.as_ref().map(|handle| {
let tags = build_cpu_metrics_tags();
CpuMetricsCollector::new(handle.clone(), tags)


P2 Badge Skip CPU collector when flusher setup has failed

This block creates CpuMetricsCollector whenever Azure enhanced metrics are enabled, even if start_aggregator failed to build a flusher (DD_API_KEY missing, bad site/proxy) and returned metrics_flusher = None. In that scenario there is no flush path, but the collector still inserts two new timestamped distributions every 3s, so contexts accumulate until overflow and metrics are dropped while consuming resources; enhanced collection should be gated on a usable flusher (or disabled when flushing is unavailable).


Contributor Author


Addressed in c626c03 to only create CpuMetricsCollector if the metrics flusher was successfully created

Comment on lines +233 to +235
if let Some(metrics_flusher) = metrics_flusher.as_ref() {
debug!("Flushing dogstatsd metrics");
metrics_flusher.flush().await;


P2 Badge Decouple CPU sampling from metrics flush awaits

The CPU collection timer and metrics_flusher.flush().await share the same tokio::select! loop, so a slow/timeouting flush blocks the task and prevents CPU collection from running at the intended 3-second cadence. When flushes stall (for example during network degradation), sampling becomes delayed and irregular, which directly reduces CPU usage metric fidelity; CPU collection should run in its own task or otherwise avoid being blocked by flush I/O.


Contributor Author


Addressed in 071d2f0 to spawn the metrics flusher as an independent task from the collector so that a slow flush will not block CPU collection.

DOGSTATSD_TIMEOUT_DURATION is 5 seconds

@kathiehuang
Contributor Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment



Reviewed commit: 071d2f0488


Comment on lines +240 to +242
tokio::spawn(async move {
metrics_flusher.flush().await;
});


P1 Badge Serialize metrics flushes to prevent overlapping upload tasks

This change spawns a new flush() task on every interval tick, so if a flush takes longer than 10s (for example during network slowness or outages), multiple flushes run concurrently and keep accumulating over time. Because Flusher::flush performs network I/O with retries, this can happen in production and causes unnecessary task/socket/memory pressure; the previous logic awaited each flush and guaranteed only one in-flight flush at a time.


Contributor Author


The client has a 5-second timeout with RetryStrategy::LinearBackoff(3, 1), so this would require three consecutive full timeouts. I'm happy to add this, but I'm not sure if it's needed

@kathiehuang kathiehuang marked this pull request as ready for review March 9, 2026 17:26
@kathiehuang kathiehuang requested review from a team as code owners March 9, 2026 17:26
@kathiehuang kathiehuang requested review from Lewis-E, duncanpharvey and lym953 and removed request for a team March 9, 2026 17:26
Contributor

@duncanista duncanista left a comment


I'd suggest using features for OS specific business logic

Also suggest checking how ADP is doing agent checks in rust, this sounds like an agent check for a very specific use case

use dogstatsd::metric::{SortedTags, EMPTY_TAGS};
use tokio_util::sync::CancellationToken;

const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;
Contributor


Suggested change
const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;
const CPU_METRICS_COLLECTION_INTERVAL_SECONDS: u64 = 3;

Contributor Author


Updated in aae174a

);

if let Err(e) = self.aggregator.insert_batch(vec![usage_metric]) {
error!("Failed to insert CPU usage metric: {}", e);
Contributor

@Lewis-E Lewis-E Mar 9, 2026


In what situations would we see this error? Would we hit this repeatedly or can the aggregator recover from errors quickly? (Also applies to line 111)

Contributor Author


insert_batch calls tx.send, which is on an unbounded channel that has infinite capacity. An error will only happen if the receive half of the channel is closed or dropped, which means the aggregator service isn't working anymore and every subsequent call should also fail. This means that metrics would stop sending, with error logs on every attempted insert. It seems the only way to recover would be for the customer to stop and start their function app to restart the agent

Error logging but continuing is what the lambda extension does

If we're worried about log spam, I could change this to return early on the CPU usage metric insert failure - this would halve the error logs

Or maybe a better solution would be to have collect_and_submit return a Result, and main.rs could set cpu_collector=None on error?

Contributor

@Lewis-E Lewis-E Mar 19, 2026


I personally like the return a Result option but am also curious why the lambda extension would send a repeating error log. Also, do you think this bit is unit testable?

Contributor Author


Digging into it more, it seems like dogstatsd does the same thing

if let Err(e) = aggregator.insert_batch(all_valid_metrics) {
error!("Failed to send metrics to aggregator: {}", e);
}

I think I should be able to create a handle with a dead receiver to unit test this if we want to do this! I talked with Shreya and it doesn't seem like customers have been running into this - this pattern came from the existing enhanced metrics
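
A std-only sketch of that dead-receiver test idea, using a plain mpsc channel as a stand-in for the aggregator's unbounded channel (names are illustrative):

```rust
use std::sync::mpsc;

// Stand-in for the aggregator handle's insert_batch: sending fails only once
// the receive half (the aggregator service) has been dropped.
fn insert_batch(tx: &mpsc::Sender<Vec<u64>>, batch: Vec<u64>) -> Result<(), String> {
    tx.send(batch)
        .map_err(|e| format!("aggregator channel closed: {e}"))
}

// Returns (send succeeded before drop, send succeeded after drop).
fn dead_receiver_demo() -> (bool, bool) {
    let (tx, rx) = mpsc::channel();
    let ok_before = insert_batch(&tx, vec![1, 2, 3]).is_ok();
    drop(rx); // simulate the aggregator service going away
    let ok_after = insert_batch(&tx, vec![4]).is_ok();
    (ok_before, ok_after)
}
```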

@Lewis-E
Contributor

Lewis-E commented Mar 9, 2026

So, ~6 debug logs every 3 seconds? Do these go to Datadog & cost money?

Collaborator

@duncanpharvey duncanpharvey left a comment


Excellent work! I added a few suggestions to consider

Comment on lines +36 to +37
dogstatsd = { path = "../dogstatsd", default-features = true }
num_cpus = "1.16"
Collaborator


Are these dependencies needed in datadog-trace-agent?

Contributor Author


Oh good catch - this was accidentally left over from before I moved the metrics collector into its own crate. Fixed in 2ad5f24!

let (metrics_flusher, aggregator_handle) = if needs_aggregator {
debug!("Creating metrics flusher and aggregator");

let (flusher, handle) =
Collaborator


I think a comment here to note why the aggregator is started separately from the dogstatsd listener would be helpful - just enough to be clear that there are different configuration options that require this (dogstatsd enabled/disabled, enhanced metrics enabled/disabled).

Maybe a unit test as well to assert that all of these combinations are covered?

Contributor Author


Good point! I added a comment in 60cdecf

It seems like it'll be a little hard to make a meaningful unit test since the aggregator/dogstatsd startup logic has side effects that would make it hard to test in isolation? Maybe I could refactor the startup decision into a struct that describes what to start to separate the decision from execution?

struct AgentConfig {
    start_aggregator: bool,
    start_dogstatsd: bool,
    start_enhanced_metrics: bool,
}

fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
    AgentConfig {
        start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
        start_dogstatsd: dd_use_dogstatsd,
        start_enhanced_metrics: dd_enhanced_metrics,
    }
}
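
A self-contained sketch of the combination test this would enable; the struct and resolver mirror the snippet above, and the assertions cover the flag combinations mentioned in the review:

```rust
// Mirrors the proposed decision struct so the startup logic can be asserted
// without side effects.
struct AgentConfig {
    start_aggregator: bool,
    start_dogstatsd: bool,
    start_enhanced_metrics: bool,
}

fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
    AgentConfig {
        start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
        start_dogstatsd: dd_use_dogstatsd,
        start_enhanced_metrics: dd_enhanced_metrics,
    }
}
```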

Ok(builder.build()?)
}

fn build_cpu_metrics_tags() -> Option<SortedTags> {
Collaborator


Would this method make more sense to live in the datadog-metrics-collector crate and be used internally within the crate?

Contributor Author


That makes sense! I moved it in f867f6f

@kathiehuang
Contributor Author

kathiehuang commented Mar 11, 2026

Also suggest checking how ADP is doing agent checks in rust, this sounds like an agent check for a very specific use case

I looked into how ADP does agent checks in Rust - it looks like it's still experimental and may be too high-level for this use case. I will make a Jira ticket for the backlog though so we can come back to this in the future and see if anything in the way they do checks is applicable!

https://datadoghq.atlassian.net/browse/SVLS-8699

…llector from initializing and logging that metrics collection is being skipped
@kathiehuang kathiehuang force-pushed the kathie.huang/add-cpu-enhanced-metrics branch from c5206f1 to 76cd53e Compare April 1, 2026 21:05