Skip to content

feat(observability): propagate W3C trace context across nico-api's network boundaries#2700

Open
adnandnv wants to merge 1 commit into
NVIDIA:mainfrom
adnandnv:feature/2438-w3c-trace-propagation
Open

feat(observability): propagate W3C trace context across nico-api's network boundaries#2700
adnandnv wants to merge 1 commit into
NVIDIA:mainfrom
adnandnv:feature/2438-w3c-trace-propagation

Conversation

@adnandnv

Copy link
Copy Markdown
Contributor

Implements W3C Trace Context propagation for nico-api so a request traced upstream keeps one trace as
it passes through nico-api and into the services it calls.

  • Propagator: install the standard W3C TraceContextPropagator at startup (the OTel default is no-op).
  • Ingress (REST + gRPC): one shared per-request layer extracts the inbound traceparent/tracestate
    and parents the request span to it; no valid inbound context → fresh root.
  • Egress: gRPC clients are wrapped in a tower TraceInjectService; reqwest clients go through
    reqwest-tracing's middleware; the two OAuth2 token providers inject manually (the oauth2 crate
    builds their requests).
  • Enable flag unchanged: the local tracing-enabled switch stays the master control — an inbound
    sampled flag can't force local recording.

Per-client details and how to add a new client are in docs/observability/tracing.md.

Related issues

#2438

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

  • Scope is the nico-api process. bmc-proxy and hardware-health are intentionally not instrumented:
    they build no span exporter, so injecting there would be dead code. docs/observability/tracing.md §1.7
    documents how to add propagation to a new client if that changes.
  • New crate trace-propagation holds the shared glue + tower middleware.
  • tracing.md marker references are a doc-staleness fix, not part of this feature. The doc still
    described the sampler's old code.namespace marker; fix: Fixes for OTLP tracing #2268 replaced it with carbide.trace_root
    without updating the doc. The code.namespacecarbide.trace_root doc edits just correct that.
    (The doc's ParentBased mention is also updated, since this PR unwraps the sampler.)
  • Dependencies added: opentelemetry-http 0.31, reqwest-middleware 0.5, reqwest-tracing 0.7.

…twork boundaries

nico-api now accepts and produces W3C Trace Context headers (traceparent,
tracestate) at its network boundaries, so a request already traced upstream
stays one trace as it flows through nico-api and on to the services it calls.

- Install the standard W3C TraceContextPropagator at startup (the OTel default
  is no-op).
- Ingress (REST + gRPC): one shared per-request layer extracts the inbound
  context and parents the request span to it; with no valid inbound context the
  span is a fresh root.
- Egress: gRPC clients are wrapped in a tower TraceInjectService; reqwest
  clients are built through reqwest-tracing's middleware; the two OAuth2 token
  providers inject manually, since the oauth2 crate builds their requests.
- The local tracing-enabled flag stays the master switch: an inbound sampled
  flag never forces recording here.

Shared extract/inject glue and the tower middleware live in a new
trace-propagation crate, with unit tests and an ingress->egress round-trip.
Per-client details and maintainer guidance are in docs/observability/tracing.md.

Closes NVIDIA#2438.

Signed-off-by: Adnan Dosani <258082943+adnandnv@users.noreply.github.com>
@adnandnv adnandnv requested review from a team and Coco-Ben as code owners June 19, 2026 02:00
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Summary by CodeRabbit

  • New Features

    • Distributed tracing now propagates W3C trace context across REST and gRPC service boundaries, enabling end-to-end request tracing across multiple services without affecting overall logging behavior.
  • Documentation

    • Updated tracing documentation to clarify trace context propagation behavior, sampling decisions, and troubleshooting for distributed traces.

Walkthrough

Introduces a new trace-propagation crate providing W3C trace context ingress/egress helpers and a TraceInjectService Tower wrapper. Installs a global TraceContextPropagator, rewrites CarbideSpanSampler to honor in-process parent context and propagate tracestate, and integrates context injection at every gRPC channel and HTTP client boundary across the codebase.

Changes

W3C Trace Context Propagation

Layer / File(s) Summary
New trace-propagation crate: primitives, TraceInjectService, and tests
crates/trace-propagation/Cargo.toml, crates/trace-propagation/src/lib.rs, crates/trace-propagation/tests/no_global_propagator.rs
Creates the trace-propagation crate with extract_context, set_span_parent_from_headers, inject_context, inject_current_context, and TraceInjectService<S> (Tower wrapper). Tests cover ingress/egress round-trips, tracestate preservation, in-memory exporter end-to-end linkage, and safe no-op behavior when no global propagator is installed.
Global TraceContextPropagator installation and CarbideSpanSampler rewrite
Cargo.toml, crates/api-core/src/logging/setup.rs
setup_logging installs TraceContextPropagator globally. CarbideSpanSampler::should_sample is rewritten to drop when disabled, inherit from a valid non-remote in-process parent, record only carbide.trace_root-marked spans, and propagate parent tracestate into SamplingResult. Removes into_sampler. Extends test suite for all new decision paths.
Ingress: span parent adoption from inbound W3C headers
crates/api-core/Cargo.toml, crates/api-core/src/logging/api_logs.rs
LogService::call calls set_span_parent_from_headers on the inbound request immediately after creating the request span, adopting the remote caller's trace as parent when a valid traceparent is present.
gRPC egress: TraceInjectService wiring across component-manager, libnmxc, and rpc
crates/component-manager/Cargo.toml, crates/component-manager/src/tls.rs, crates/component-manager/src/nsm.rs, crates/component-manager/src/psm.rs, crates/libnmxc/Cargo.toml, crates/libnmxc/src/lib.rs, crates/libnmxc/src/nmxc_api.rs, crates/rpc/Cargo.toml, crates/rpc/src/forge_tls_client.rs
build_channel returns TraceInjectService<Channel>; the type propagates through NsmSwitchBackend, PsmPowerShelfBackend, their helpers, NmxControllerClient, NmxcApi, and both ForgeTlsClient::build paths. Every gRPC channel now injects the current trace context before forwarding requests.
HTTP egress: reqwest-middleware + TracingMiddleware across all HTTP clients
crates/firmware/..., crates/libmlx/..., crates/libnmxm/..., crates/nras/..., crates/api-core/src/handlers/redfish.rs, crates/api-core/src/machine_identity/token_exchange.rs
All reqwest::Client instances are wrapped with reqwest_middleware::ClientBuilder + TracingMiddleware::default(). Return types, field types, and From error conversions are updated to reqwest_middleware types; tests are updated to construct middleware-enabled clients.
OAuth2 manual trace header injection
crates/api-web/Cargo.toml, crates/api-web/src/auth.rs, crates/mqttea/Cargo.toml, crates/mqttea/src/auth/oauth2_provider.rs
OAuth2 HTTP call implementations take the request as mutable and call inject_current_context before the HTTP exchange, since the OAuth2 client abstraction does not support middleware.
Observability documentation
docs/observability/tracing.md
Expands tracing documentation to cover CarbideSpanSampler root-recording semantics, W3C trace context ingress/egress behavior, forwarding vs. recording distinction, the trace-breaking limitation without an OTLP exporter, and guidance for integrating new network clients.

Sequence Diagram(s)

sequenceDiagram
  rect rgba(173, 216, 230, 0.5)
    note over ExternalCaller,LogService: Ingress: inbound request enters nico-api
    ExternalCaller->>LogService: HTTP request (traceparent/tracestate headers)
    LogService->>set_span_parent_from_headers: request_span + HeaderMap
    set_span_parent_from_headers->>extract_context: global propagator extracts remote Context
    extract_context-->>set_span_parent_from_headers: remote SpanContext
    set_span_parent_from_headers->>request_span: set_parent(remote Context) if valid
  end

  rect rgba(144, 238, 144, 0.5)
    note over CarbideSpanSampler,OTLPExporter: Sampling decision
    request_span->>CarbideSpanSampler: should_sample(parent_context, ...)
    CarbideSpanSampler-->>request_span: Record/Drop + propagate tracestate
  end

  rect rgba(255, 200, 100, 0.5)
    note over request_span,DownstreamService: Egress: outbound calls carry trace context
    request_span->>TraceInjectService: gRPC call via build_channel
    TraceInjectService->>inject_current_context: inject into request headers
    inject_current_context->>DownstreamService: gRPC request (traceparent/tracestate)

    request_span->>TracingMiddleware: HTTP call via reqwest-middleware
    TracingMiddleware->>inject_current_context: inject into request headers
    inject_current_context->>DownstreamService: HTTP request (traceparent/tracestate)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the primary change: implementing W3C trace context propagation across nico-api's network boundaries, which is the central feature of this comprehensive instrumentation effort.
Description check ✅ Passed The description thoroughly explains the propagation strategy (ingress/egress mechanics, per-client details), clarifies scope and intent, acknowledges breaking changes correctly, and documents both the feature and unrelated documentation corrections.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/nras/src/client.rs`:
- Around line 52-57: The HTTP client created in the new_with_config method lacks
timeout configuration, which can cause indefinite blocking during network issues
or NRAS service outages. Add a timeout field to the Config struct to make it
configurable, then apply this timeout to the reqwest client builder by calling
the .timeout() method on the ClientBuilder chain before calling .build(). Ensure
the timeout value is propagated from Config and applied either at the client
level in new_with_config or on individual requests made through methods like
attest_gpu to enforce bounded wait times for external service calls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 931e99d9-7c3a-48f6-9587-76664d251c90

📥 Commits

Reviewing files that changed from the base of the PR and between 47e42bc and 426cd76.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (32)
  • Cargo.toml
  • crates/api-core/Cargo.toml
  • crates/api-core/src/handlers/redfish.rs
  • crates/api-core/src/logging/api_logs.rs
  • crates/api-core/src/logging/setup.rs
  • crates/api-core/src/machine_identity/token_exchange.rs
  • crates/api-web/Cargo.toml
  • crates/api-web/src/auth.rs
  • crates/component-manager/Cargo.toml
  • crates/component-manager/src/nsm.rs
  • crates/component-manager/src/psm.rs
  • crates/component-manager/src/tls.rs
  • crates/firmware/Cargo.toml
  • crates/firmware/src/downloader.rs
  • crates/libmlx/Cargo.toml
  • crates/libmlx/src/firmware/source.rs
  • crates/libnmxc/Cargo.toml
  • crates/libnmxc/src/lib.rs
  • crates/libnmxc/src/nmxc_api.rs
  • crates/libnmxm/Cargo.toml
  • crates/libnmxm/src/lib.rs
  • crates/mqttea/Cargo.toml
  • crates/mqttea/src/auth/oauth2_provider.rs
  • crates/nras/Cargo.toml
  • crates/nras/src/client.rs
  • crates/nras/src/lib.rs
  • crates/rpc/Cargo.toml
  • crates/rpc/src/forge_tls_client.rs
  • crates/trace-propagation/Cargo.toml
  • crates/trace-propagation/src/lib.rs
  • crates/trace-propagation/tests/no_global_propagator.rs
  • docs/observability/tracing.md

Comment thread crates/nras/src/client.rs
Comment on lines 52 to +57
pub fn new_with_config(config: &crate::Config) -> NrasVerifierClient {
NrasVerifierClient {
config: config.clone(),
http_client: reqwest::Client::new(),
http_client: reqwest_middleware::ClientBuilder::new(reqwest::Client::new())
.with(reqwest_tracing::TracingMiddleware::default())
.build(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify timeout configuration on the NRAS egress path.
rg -n "NrasVerifierClient|Client::new\\(|Client::builder\\(|connect_timeout\\(|\\.timeout\\(" crates/nras/src/client.rs -C3

Repository: NVIDIA/infra-controller

Length of output: 1072


🏁 Script executed:

#!/bin/bash
# Search for Config definition and any timeout fields
rg -n "struct Config|timeout|Timeout" crates/nras/src/ -A2 -B2 | head -100

Repository: NVIDIA/infra-controller

Length of output: 329


🏁 Script executed:

#!/bin/bash
# Search for attest_gpu implementation and how the http_client is used
rg -n "async fn attest_gpu" crates/nras/src/client.rs -A 30

Repository: NVIDIA/infra-controller

Length of output: 2719


🏁 Script executed:

#!/bin/bash
# Check if reqwest::Client has any builder methods or default timeout behavior
fd -e rs -e toml . crates/nras/ | xargs rg -l "reqwest" | head -5

Repository: NVIDIA/infra-controller

Length of output: 167


🏁 Script executed:

#!/bin/bash
# Get the full Config struct definition
rg -n "pub struct Config" crates/nras/src/lib.rs -A 20

Repository: NVIDIA/infra-controller

Length of output: 699


🏁 Script executed:

#!/bin/bash
# Search across all crates for any timeout configuration patterns for reqwest clients
rg -n "\.timeout\(|connect_timeout\(" --type rs crates/ | grep -i "nras\|http_client" | head -20

Repository: NVIDIA/infra-controller

Length of output: 94


🏁 Script executed:

#!/bin/bash
# Check Cargo.toml for reqwest version and any timeout-related dependencies
cat crates/nras/Cargo.toml

Repository: NVIDIA/infra-controller

Length of output: 1427


🏁 Script executed:

#!/bin/bash
# Verify reqwest default timeout behavior and check if any global timeout patterns exist elsewhere
rg -n "timeout\(" crates/nras/ -i

Repository: NVIDIA/infra-controller

Length of output: 49


Configure explicit timeout bounds on the NRAS HTTP client.

The reqwest::Client::new() at line 55 lacks any timeout configuration, permitting indefinite blocking during NRAS outages or network degradation. The Config struct contains no timeout fields, and individual requests in attest_gpu (line 81) are not wrapped with timeout constraints. Propagate a configurable timeout through Config and apply it via .timeout() on the client or per-request to ensure bounded wait times during external service calls.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/nras/src/client.rs` around lines 52 - 57, The HTTP client created in
the new_with_config method lacks timeout configuration, which can cause
indefinite blocking during network issues or NRAS service outages. Add a timeout
field to the Config struct to make it configurable, then apply this timeout to
the reqwest client builder by calling the .timeout() method on the ClientBuilder
chain before calling .build(). Ensure the timeout value is propagated from
Config and applied either at the client level in new_with_config or on
individual requests made through methods like attest_gpu to enforce bounded wait
times for external service calls.

@github-actions

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 261 5 22 103 6 125
machine-validation-runner 691 31 182 257 35 186
machine_validation 691 31 182 257 35 186
nvmetal-carbide 691 31 182 257 35 186
TOTAL 2340 98 568 880 111 683

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant