[SVLS-8351] Add CPU Enhanced Metrics in Linux Azure Functions (#77)
kathiehuang wants to merge 43 commits into main
Conversation
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c6a55dc810
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dfe28a3a43
```rust
self.last_usage_ns = current_usage_ns;

// Divide nanoseconds delta by collection interval to get usage rate in nanocores
let usage_rate_nc = delta_ns / self.collection_interval_secs as f64;
```
Use real elapsed time for CPU usage rate
`CpuMetricsCollector::collect_and_submit` computes `usage_rate_nc` by dividing the CPU delta by a fixed configured interval, but collection is driven from a `tokio::select!` loop whose flush branch awaits network I/O (`metrics_flusher.flush().await`). Real sampling gaps can therefore be longer or shorter than 3s under timeouts, retries, or scheduler delay, which skews the reported nanocore rate and can produce false spikes or drops in production dashboards.
Addressed this in c6a55dc - tracked the last collection time and the current time to get a more precise time interval to divide by!
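A minimal sketch of that fix using `std::time::Instant`; the struct, field, and method names here are illustrative, not the PR's exact API:

```rust
use std::time::Instant;

// Illustrative collector state; field names are assumptions, not the PR's.
struct CpuRate {
    last_usage_ns: f64,
    last_collect: Instant,
}

impl CpuRate {
    // Divide the CPU-time delta by the *measured* wall-clock gap rather than a
    // fixed configured interval, so flush stalls or scheduler delays don't
    // skew the nanocore rate.
    fn rate_nanocores(&mut self, current_usage_ns: f64, now: Instant) -> f64 {
        let elapsed_secs = now.duration_since(self.last_collect).as_secs_f64();
        let delta_ns = current_usage_ns - self.last_usage_ns;
        self.last_usage_ns = current_usage_ns;
        self.last_collect = now;
        delta_ns / elapsed_secs
    }
}
```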
```rust
    .await;
if let Some(ref windows_pipe_name) = dd_dogstatsd_windows_pipe_name {
    info!("dogstatsd-pipe: starting to listen on pipe {windows_pipe_name}");
let needs_aggregator = dd_use_dogstatsd || dd_enhanced_metrics;
```
Gate aggregator startup to the Azure enhanced-metrics path
The new `needs_aggregator` condition enables the aggregator whenever `DD_ENHANCED_METRICS_ENABLED` is true (the default), even in non-Azure environments where `cpu_collector` is never created. Deployments with `DD_USE_DOGSTATSD=false` therefore still start and flush an unused aggregator, and can log `DD_API_KEY not set` errors even though no enhanced metrics are emitted. This should be gated on `env_type == EnvironmentType::AzureFunction`.
Fixed by only setting DD_ENHANCED_METRICS to default true if we're in an Azure Function
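The gating can be sketched as follows; `EnvironmentType` mirrors the enum used in the PR, while the parsing helper itself is purely illustrative:

```rust
#[derive(PartialEq)]
enum EnvironmentType {
    AzureFunction,
    Other,
}

// Default DD_ENHANCED_METRICS_ENABLED to true only inside Azure Functions, so
// a non-Azure deployment never starts an aggregator just for enhanced metrics.
fn enhanced_metrics_enabled(raw: Option<&str>, env_type: &EnvironmentType) -> bool {
    match raw {
        Some(v) => v.eq_ignore_ascii_case("true"),
        None => *env_type == EnvironmentType::AzureFunction,
    }
}
```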
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 058ef533f8
```rust
let mut cpu_collector = if dd_enhanced_metrics && env_type == EnvironmentType::AzureFunction {
    aggregator_handle.as_ref().map(|handle| {
        let tags = build_cpu_metrics_tags();
        CpuMetricsCollector::new(handle.clone(), tags)
```
Skip CPU collector when flusher setup has failed
This block creates `CpuMetricsCollector` whenever Azure enhanced metrics are enabled, even if `start_aggregator` failed to build a flusher (`DD_API_KEY` missing, bad site/proxy) and returned `metrics_flusher = None`. In that scenario there is no flush path, but the collector still inserts two new timestamped distributions every 3s, so contexts accumulate until overflow and metrics are dropped while consuming resources. Enhanced collection should be gated on a usable flusher (or disabled when flushing is unavailable).
Addressed in c626c03 to only create `CpuMetricsCollector` if the metrics flusher was successfully created
```rust
if let Some(metrics_flusher) = metrics_flusher.as_ref() {
    debug!("Flushing dogstatsd metrics");
    metrics_flusher.flush().await;
```
Decouple CPU sampling from metrics flush awaits
The CPU collection timer and `metrics_flusher.flush().await` share the same `tokio::select!` loop, so a slow or timing-out flush blocks the task and prevents CPU collection from running at the intended 3-second cadence. When flushes stall (for example during network degradation), sampling becomes delayed and irregular, which directly reduces CPU usage metric fidelity. CPU collection should run in its own task or otherwise avoid being blocked by flush I/O.
Addressed in 071d2f0 to spawn the metrics flusher as an independent task from the collector so that a slow flush will not block CPU collection.
DOGSTATSD_TIMEOUT_DURATION is 5 seconds
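The decoupling can be illustrated with plain threads (the PR uses tokio tasks; the intervals are scaled down and all names are illustrative): a deliberately slow "flush" never delays the collector's ticks.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Run sampling and flushing on independent schedules so a stalled flush can
// only delay the next flush, never the next CPU sample.
fn run_decoupled() -> usize {
    let ticks = Arc::new(AtomicUsize::new(0));
    let t = Arc::clone(&ticks);
    let collector = thread::spawn(move || {
        for _ in 0..5 {
            t.fetch_add(1, Ordering::SeqCst); // stand-in for a CPU sample
            thread::sleep(Duration::from_millis(20));
        }
    });
    let flusher = thread::spawn(|| {
        thread::sleep(Duration::from_millis(200)); // simulated slow flush
    });
    collector.join().unwrap();
    flusher.join().unwrap();
    ticks.load(Ordering::SeqCst)
}
```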
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 071d2f0488
```rust
tokio::spawn(async move {
    metrics_flusher.flush().await;
});
```
Serialize metrics flushes to prevent overlapping upload tasks
This change spawns a new `flush()` task on every interval tick, so if a flush takes longer than 10s (for example during network slowness or outages), multiple flushes run concurrently and keep accumulating over time. Because `Flusher::flush` performs network I/O with retries, this can happen in production and causes unnecessary task, socket, and memory pressure. The previous logic awaited each flush and guaranteed only one in-flight flush at a time.
The client has a 5-second timeout with RetryStrategy::LinearBackoff(3, 1), so this would require three consecutive full timeouts. I'm happy to add this, but I'm not sure if it's needed
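If overlap protection is wanted later, one hedged option is a try-lock guard that skips a tick instead of stacking flush tasks. This sketch uses a `std::sync::Mutex` as a stand-in for an async lock, and `do_flush` is a hypothetical stand-in for `metrics_flusher.flush()`:

```rust
use std::sync::Mutex;

// Returns true if a flush ran; false if one was already in flight and this
// tick was skipped instead of spawning a concurrent flush.
fn flush_if_idle(guard: &Mutex<()>, do_flush: impl FnOnce()) -> bool {
    match guard.try_lock() {
        Ok(_held) => {
            do_flush();
            true
        }
        Err(_) => false, // previous flush still running; skip this tick
    }
}
```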
duncanista left a comment:
I'd suggest using features for OS-specific business logic.
Also suggest checking how ADP is doing agent checks in Rust - this sounds like an agent check for a very specific use case.
```rust
use dogstatsd::metric::{SortedTags, EMPTY_TAGS};
use tokio_util::sync::CancellationToken;

const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;
```
Suggested change:
```diff
-const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;
+const CPU_METRICS_COLLECTION_INTERVAL_SECONDS: u64 = 3;
```
```rust
);

if let Err(e) = self.aggregator.insert_batch(vec![usage_metric]) {
    error!("Failed to insert CPU usage metric: {}", e);
```
In what situations would we see this error? Would we hit this repeatedly or can the aggregator recover from errors quickly? (Also applies to line 111)
`insert_batch` calls `tx.send` on an unbounded channel with infinite capacity. An error only happens if the receive half of the channel is closed or dropped, which means the aggregator service isn't working anymore and every subsequent call should also fail. Metrics would stop sending, with error logs on every attempted insert. It seems the only way to recover would be for the customer to stop and start their function app to restart the agent.
Error logging but continuing is what the lambda extension does
If we're worried about log spam, I could change this to return early on the CPU usage metric insert failure - this would halve the error logs
Or maybe a better solution would be to have collect_and_submit return a Result, and main.rs could set cpu_collector=None on error?
I personally like the return a Result option but am also curious why the lambda extension would send a repeating error log. Also, do you think this bit is unit testable?
Digging into it more, it seems like dogstatsd does the same thing
serverless-components/crates/dogstatsd/src/dogstatsd.rs
Lines 567 to 569 in 05e5c26
I think I should be able to create a handle with a dead receiver to unit test this if we want to do this! I talked with Shreya, and it doesn't seem like customers have been running into this - this pattern came from the existing enhanced metrics.
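The dead-receiver test idea can be shown with std channels; the real test would build a dogstatsd aggregator handle whose receiver has been dropped, and `insert` here is a hypothetical stand-in for `insert_batch`:

```rust
use std::sync::mpsc;

// Once the receive half is dropped, every send fails -- the same failure mode
// the error log above covers for the aggregator's unbounded channel.
fn insert(tx: &mpsc::Sender<u64>, metric: u64) -> Result<(), String> {
    tx.send(metric).map_err(|e| e.to_string())
}
```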
So, ~6 debug logs every 3 seconds? Do these go to Datadog & cost money?
duncanpharvey left a comment:
Excellent work! I added a few suggestions to consider
```toml
dogstatsd = { path = "../dogstatsd", default-features = true }
num_cpus = "1.16"
```
Are these dependencies needed in datadog-trace-agent?
Oh good catch - this was accidentally left over from before I moved the metrics collector into its own crate. Fixed in 2ad5f24!
```rust
let (metrics_flusher, aggregator_handle) = if needs_aggregator {
    debug!("Creating metrics flusher and aggregator");

    let (flusher, handle) =
```
I think a comment here to note why the aggregator is started separately from the dogstatsd listener would be helpful - just enough to be clear that there are different configuration options that require this (dogstatsd enabled/disabled, enhanced metrics enabled/disabled).
Maybe a unit test as well to assert that all of these combinations are covered?
Good point! I added a comment in 60cdecf
It seems like it'll be a little hard to write a meaningful unit test, since the aggregator/dogstatsd startup logic has side effects that make it hard to test in isolation. Maybe I could refactor the startup decision into a struct that describes what to start, separating the decision from execution?
```rust
struct AgentConfig {
    start_aggregator: bool,
    start_dogstatsd: bool,
    start_enhanced_metrics: bool,
}

fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
    AgentConfig {
        start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
        start_dogstatsd: dd_use_dogstatsd,
        start_enhanced_metrics: dd_enhanced_metrics,
    }
}
```
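A table-style unit test over that proposed resolver could cover all four flag combinations; the struct and function are repeated here so the sketch is self-contained:

```rust
struct AgentConfig {
    start_aggregator: bool,
    start_dogstatsd: bool,
    start_enhanced_metrics: bool,
}

fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
    AgentConfig {
        start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
        start_dogstatsd: dd_use_dogstatsd,
        start_enhanced_metrics: dd_enhanced_metrics,
    }
}

// The aggregator must start whenever either consumer needs it; the other two
// flags pass straight through.
fn check_all_combinations() -> bool {
    [(false, false), (false, true), (true, false), (true, true)]
        .iter()
        .all(|&(dogstatsd, enhanced)| {
            let c = resolve_agent_config(dogstatsd, enhanced);
            c.start_aggregator == (dogstatsd || enhanced)
                && c.start_dogstatsd == dogstatsd
                && c.start_enhanced_metrics == enhanced
        })
}
```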
```rust
    Ok(builder.build()?)
}

fn build_cpu_metrics_tags() -> Option<SortedTags> {
```
Would this method make more sense to live in the datadog-metrics-collector crate and be used internally within the crate?
I looked into how ADP does agent checks in Rust - it looks like it's still experimental and may be too high-level for this use case. I will make a Jira ticket for the backlog though so we can come back to this in the future and see if anything in the way they do checks is applicable!
What does this PR do?
Adds CPU limit and usage enhanced metrics for Linux Azure Functions.
datadog-metrics-collectorcrate that reads CPU metrics every second and submits them to the Datadog backend every 10 seconds whenDD_ENHANCED_METRICS_ENABLED=true(default on)CpuMetricsCollectorstruct andCpuStatsReadertrait. Currently this only collects CPU metrics in Linux. CPU metrics in Windows will be completed in a future PRazure.functions.enhanced.cpu.usage- container-level CPU consumption rate in nanocores, sourced fromcpuacct.usageazure.functions.enhanced.cpu.limit- CPU limit in nanocores, computed asmin(cpuset.cpus, cfs_quota/cfs_period), falling back to host CPU count if no cgroup limit is setAdditional Notes
libdd-common:resource_groupsubscription_idnameregionplan_tierserviceenvversionserverless_compat_versionazure.functions.*metrics asServerlessEnhancedorigin in the dogstatsd origin classifier so that they show up as Enhanced rather than Custom metrics in Datadog Metrics Summarystart_dogstatsdinto two functionsstart_aggregator, which starts the aggregator service and metrics flusherstart_dogstatsd_listener, which enables custom metrics to be received from user codeDD_USE_DOGSTATSDis offMotivation
https://datadoghq.atlassian.net/browse/SVLS-8351
Describe how to test/QA your changes
Build with serverless-compat-self-monitoring.
Added debug logs to verify calculations:
- `cpu.cfs_quota_us` is not set, so it falls back to the host CPU count from the `num_cpus` crate

This was deployed with the serverless-compat-self-monitoring pipeline across all runtimes and hosting plans. All hosting plans on Linux were tested to verify metrics submit correctly, and a Windows function was tested to verify that enhanced metrics are disabled in Windows environments.
Testing to compare against Azure Monitor is documented in an internal doc in Enhanced Metrics in the Serverless Compatibility Layer. Detailed calculations are explained in Calculating CPU Enhanced Metrics in Windows and Linux Azure Functions.
- `DD_USE_DOGSTATSD` is off
- `DD_ENHANCED_METRICS_ENABLED` is off

References: datadog-agent cgroup collection and calculation logic
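The referenced limit calculation (min of the cpuset CPU count and the CFS quota/period ratio, falling back to the host CPU count) might be sketched like this; the function and its inputs are illustrative stand-ins for values parsed from the cgroup files, not the crate's actual API:

```rust
const NANOCORES_PER_CORE: f64 = 1_000_000_000.0;

// cpuset_cpus: CPU count from cpuset.cpus; cfs_quota_us: cpu.cfs_quota_us
// (-1 means unlimited); cfs_period_us: cpu.cfs_period_us; host_cpus: fallback.
fn cpu_limit_nanocores(
    cpuset_cpus: Option<u64>,
    cfs_quota_us: i64,
    cfs_period_us: u64,
    host_cpus: u64,
) -> f64 {
    let quota_cores = if cfs_quota_us > 0 && cfs_period_us > 0 {
        Some(cfs_quota_us as f64 / cfs_period_us as f64)
    } else {
        None // quota of -1 (or missing period) means no CFS limit
    };
    let cores = match (cpuset_cpus.map(|n| n as f64), quota_cores) {
        (Some(a), Some(b)) => a.min(b), // min(cpuset.cpus, cfs_quota/cfs_period)
        (Some(a), None) => a,
        (None, Some(b)) => b,
        (None, None) => host_cpus as f64, // no cgroup limit set at all
    };
    cores * NANOCORES_PER_CORE
}
```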