feat(observability): propagate W3C trace context across nico-api's network boundaries#2700
feat(observability): propagate W3C trace context across nico-api's network boundaries#2700adnandnv wants to merge 1 commit into
Conversation
…twork boundaries nico-api now accepts and produces W3C Trace Context headers (traceparent, tracestate) at its network boundaries, so a request already traced upstream stays one trace as it flows through nico-api and on to the services it calls. - Install the standard W3C TraceContextPropagator at startup (the OTel default is no-op). - Ingress (REST + gRPC): one shared per-request layer extracts the inbound context and parents the request span to it; with no valid inbound context the span is a fresh root. - Egress: gRPC clients are wrapped in a tower TraceInjectService; reqwest clients are built through reqwest-tracing's middleware; the two OAuth2 token providers inject manually, since the oauth2 crate builds their requests. - The local tracing-enabled flag stays the master switch: an inbound sampled flag never forces recording here. Shared extract/inject glue and the tower middleware live in a new trace-propagation crate, with unit tests and an ingress->egress round-trip. Per-client details and maintainer guidance are in docs/observability/tracing.md. Closes NVIDIA#2438. Signed-off-by: Adnan Dosani <258082943+adnandnv@users.noreply.github.com>
Summary by CodeRabbit
WalkthroughIntroduces a new ChangesW3C Trace Context Propagation
Sequence Diagram(s)sequenceDiagram
rect rgba(173, 216, 230, 0.5)
note over ExternalCaller,LogService: Ingress: inbound request enters nico-api
ExternalCaller->>LogService: HTTP request (traceparent/tracestate headers)
LogService->>set_span_parent_from_headers: request_span + HeaderMap
set_span_parent_from_headers->>extract_context: global propagator extracts remote Context
extract_context-->>set_span_parent_from_headers: remote SpanContext
set_span_parent_from_headers->>request_span: set_parent(remote Context) if valid
end
rect rgba(144, 238, 144, 0.5)
note over CarbideSpanSampler,OTLPExporter: Sampling decision
request_span->>CarbideSpanSampler: should_sample(parent_context, ...)
CarbideSpanSampler-->>request_span: Record/Drop + propagate tracestate
end
rect rgba(255, 200, 100, 0.5)
note over request_span,DownstreamService: Egress: outbound calls carry trace context
request_span->>TraceInjectService: gRPC call via build_channel
TraceInjectService->>inject_current_context: inject into request headers
inject_current_context->>DownstreamService: gRPC request (traceparent/tracestate)
request_span->>TracingMiddleware: HTTP call via reqwest-middleware
TracingMiddleware->>inject_current_context: inject into request headers
inject_current_context->>DownstreamService: HTTP request (traceparent/tracestate)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
🌿 Preview your docs: https://nvidia-preview-pull-request-2700.docs.buildwithfern.com/infra-controller |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/nras/src/client.rs`:
- Around line 52-57: The HTTP client created in the new_with_config method lacks
timeout configuration, which can cause indefinite blocking during network issues
or NRAS service outages. Add a timeout field to the Config struct to make it
configurable, then apply this timeout to the reqwest client builder by calling
the .timeout() method on the ClientBuilder chain before calling .build(). Ensure
the timeout value is propagated from Config and applied either at the client
level in new_with_config or on individual requests made through methods like
attest_gpu to enforce bounded wait times for external service calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 931e99d9-7c3a-48f6-9587-76664d251c90
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (32)
Cargo.tomlcrates/api-core/Cargo.tomlcrates/api-core/src/handlers/redfish.rscrates/api-core/src/logging/api_logs.rscrates/api-core/src/logging/setup.rscrates/api-core/src/machine_identity/token_exchange.rscrates/api-web/Cargo.tomlcrates/api-web/src/auth.rscrates/component-manager/Cargo.tomlcrates/component-manager/src/nsm.rscrates/component-manager/src/psm.rscrates/component-manager/src/tls.rscrates/firmware/Cargo.tomlcrates/firmware/src/downloader.rscrates/libmlx/Cargo.tomlcrates/libmlx/src/firmware/source.rscrates/libnmxc/Cargo.tomlcrates/libnmxc/src/lib.rscrates/libnmxc/src/nmxc_api.rscrates/libnmxm/Cargo.tomlcrates/libnmxm/src/lib.rscrates/mqttea/Cargo.tomlcrates/mqttea/src/auth/oauth2_provider.rscrates/nras/Cargo.tomlcrates/nras/src/client.rscrates/nras/src/lib.rscrates/rpc/Cargo.tomlcrates/rpc/src/forge_tls_client.rscrates/trace-propagation/Cargo.tomlcrates/trace-propagation/src/lib.rscrates/trace-propagation/tests/no_global_propagator.rsdocs/observability/tracing.md
| pub fn new_with_config(config: &crate::Config) -> NrasVerifierClient { | ||
| NrasVerifierClient { | ||
| config: config.clone(), | ||
| http_client: reqwest::Client::new(), | ||
| http_client: reqwest_middleware::ClientBuilder::new(reqwest::Client::new()) | ||
| .with(reqwest_tracing::TracingMiddleware::default()) | ||
| .build(), |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify timeout configuration on the NRAS egress path.
rg -n "NrasVerifierClient|Client::new\\(|Client::builder\\(|connect_timeout\\(|\\.timeout\\(" crates/nras/src/client.rs -C3Repository: NVIDIA/infra-controller
Length of output: 1072
🏁 Script executed:
#!/bin/bash
# Search for Config definition and any timeout fields
rg -n "struct Config|timeout|Timeout" crates/nras/src/ -A2 -B2 | head -100Repository: NVIDIA/infra-controller
Length of output: 329
🏁 Script executed:
#!/bin/bash
# Search for attest_gpu implementation and how the http_client is used
rg -n "async fn attest_gpu" crates/nras/src/client.rs -A 30Repository: NVIDIA/infra-controller
Length of output: 2719
🏁 Script executed:
#!/bin/bash
# Check if reqwest::Client has any builder methods or default timeout behavior
fd -e rs -e toml . crates/nras/ | xargs rg -l "reqwest" | head -5Repository: NVIDIA/infra-controller
Length of output: 167
🏁 Script executed:
#!/bin/bash
# Get the full Config struct definition
rg -n "pub struct Config" crates/nras/src/lib.rs -A 20Repository: NVIDIA/infra-controller
Length of output: 699
🏁 Script executed:
#!/bin/bash
# Search across all crates for any timeout configuration patterns for reqwest clients
rg -n "\.timeout\(|connect_timeout\(" --type rs crates/ | grep -i "nras\|http_client" | head -20Repository: NVIDIA/infra-controller
Length of output: 94
🏁 Script executed:
#!/bin/bash
# Check Cargo.toml for reqwest version and any timeout-related dependencies
cat crates/nras/Cargo.tomlRepository: NVIDIA/infra-controller
Length of output: 1427
🏁 Script executed:
#!/bin/bash
# Verify reqwest default timeout behavior and check if any global timeout patterns exist elsewhere
rg -n "timeout\(" crates/nras/ -iRepository: NVIDIA/infra-controller
Length of output: 49
Configure explicit timeout bounds on the NRAS HTTP client.
The reqwest::Client::new() at line 55 lacks any timeout configuration, permitting indefinite blocking during NRAS outages or network degradation. The Config struct contains no timeout fields, and individual requests in attest_gpu (line 81) are not wrapped with timeout constraints. Propagate a configurable timeout through Config and apply it via .timeout() on the client or per-request to ensure bounded wait times during external service calls.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/nras/src/client.rs` around lines 52 - 57, The HTTP client created in
the new_with_config method lacks timeout configuration, which can cause
indefinite blocking during network issues or NRAS service outages. Add a timeout
field to the Config struct to make it configurable, then apply this timeout to
the reqwest client builder by calling the .timeout() method on the ClientBuilder
chain before calling .build(). Ensure the timeout value is propagated from
Config and applied either at the client level in new_with_config or on
individual requests made through methods like attest_gpu to enforce bounded wait
times for external service calls.
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Implements W3C Trace Context propagation for nico-api so a request traced upstream keeps one trace as
it passes through nico-api and into the services it calls.
TraceContextPropagatorat startup (the OTel default is no-op).traceparent/tracestateand parents the request span to it; no valid inbound context → fresh root.
TraceInjectService; reqwest clients go throughreqwest-tracing's middleware; the two OAuth2 token providers inject manually (theoauth2cratebuilds their requests).
tracing-enabledswitch stays the master control — an inboundsampledflag can't force local recording.Per-client details and how to add a new client are in
docs/observability/tracing.md.Related issues
#2438
Type of Change
Breaking Changes
Testing
Additional Notes
bmc-proxyandhardware-healthare intentionally not instrumented:they build no span exporter, so injecting there would be dead code.
docs/observability/tracing.md§1.7documents how to add propagation to a new client if that changes.
trace-propagationholds the shared glue + tower middleware.tracing.mdmarker references are a doc-staleness fix, not part of this feature. The doc stilldescribed the sampler's old
code.namespacemarker; fix: Fixes for OTLP tracing #2268 replaced it withcarbide.trace_rootwithout updating the doc. The
code.namespace→carbide.trace_rootdoc edits just correct that.(The doc's
ParentBasedmention is also updated, since this PR unwraps the sampler.)opentelemetry-http0.31,reqwest-middleware0.5,reqwest-tracing0.7.