feat(trainer): add telemetry module for optional OpenTelemetry instrumentation#401
prabindersinghh wants to merge 1 commit into kubeflow:main
…entation

Adds kubeflow/common/telemetry.py as the shared OTel instrumentation layer:
- get_tracer(): zero-overhead NoOp when opentelemetry-api not installed
- SpanNames/SpanAttributes: canonical constants for all SDK clients
- configure(): one-call setup for Jaeger/OTLP/Console exporters
- _NoOpTracer/_NoOpSpan: safe fallback with empty method bodies

Instruments KubernetesBackend.train() and wait_for_job_status() as reference implementation. Polling loop produces per-iteration child spans with status_check span events.

Related to kubeflow#164

Signed-off-by: Prabinder Singh <prabindersinghh@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED
Pull request overview
Adds a shared OpenTelemetry (OTel) instrumentation layer under kubeflow/common/ and wires initial tracing into the Trainer Kubernetes backend, aiming to provide optional, low-overhead observability for SDK operations.
Changes:
- Introduces `kubeflow/common/telemetry.py` with `get_tracer()`, `configure()`, canonical span/attribute constants, and NoOp fallbacks.
- Adds unit tests validating the NoOp behavior and naming/attribute conventions.
- Instruments `KubernetesBackend.train()` and `wait_for_job_status()` with root/child spans and span events.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `kubeflow/common/telemetry.py` | Adds shared tracing helpers, exporter configuration, and canonical naming/attribute constants. |
| `kubeflow/common/telemetry_test.py` | Adds tests for NoOp fallback, env-var disabling, and naming/attribute conventions. |
| `kubeflow/trainer/backends/kubernetes/backend.py` | Wraps training + polling in spans and emits polling events/attributes. |
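The naming/attribute convention tests described for `telemetry_test.py` could look roughly like the sketch below. The string values are taken from the span hierarchy in the PR description; the constant identifiers (`TRAINER_TRAIN`, `POLL_ITERATION`, and so on), the two regular expressions, and the `public_values`/`violations` helpers are assumptions made for illustration and are not confirmed by the diff.

```python
import re


class SpanNames:
    # Values come from the PR description; the attribute names are hypothetical.
    TRAINER_TRAIN = "kubeflow.sdk.trainer.train"
    TRAINER_GET_RUNTIME = "kubeflow.sdk.trainer.get_runtime"
    TRAINER_CREATE_TRAINJOB = "kubeflow.sdk.trainer.create_trainjob"
    TRAINER_POLL_STATUS = "kubeflow.sdk.trainer.poll_status"


class SpanAttributes:
    # Values come from the PR description; the attribute names are hypothetical.
    POLL_ITERATION = "poll.iteration"
    TRAINJOB_STATUS = "trainjob.status"


# Assumed conventions: span names are kubeflow.sdk.<client>.<operation>,
# attribute keys are lowercase dotted identifiers.
SPAN_NAME_PATTERN = re.compile(r"^kubeflow\.sdk\.[a-z_]+\.[a-z_]+$")
ATTRIBUTE_PATTERN = re.compile(r"^[a-z_]+(\.[a-z_]+)+$")


def public_values(cls):
    """Return the string constants defined on a constants class."""
    return [
        value
        for key, value in vars(cls).items()
        if not key.startswith("_") and isinstance(value, str)
    ]


def violations(values, pattern):
    """Return every value that does not match the given convention."""
    return [value for value in values if not pattern.match(value)]
```

In a pytest suite, each class could be fed through `pytest.mark.parametrize`, asserting that `violations(...)` returns an empty list.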
Went through the full diff against the Polling span creation (
Exporter naming (
Test coverage gap
Summary
Adds `kubeflow/common/telemetry.py` as the shared OTel instrumentation layer for the Kubeflow Python SDK, implementing the architecture proposed in #164.
What this PR adds
`kubeflow/common/telemetry.py`
- `get_tracer()`: real OTel tracer, or zero-overhead `_NoOpTracer` when `opentelemetry-api` is not installed or `KUBEFLOW_TRACING_DISABLED=1`
- `SpanNames`: canonical span name constants for all SDK clients
- `SpanAttributes`: canonical attribute key constants
- `configure()`: one-call setup for Jaeger/OTLP/Console exporters
- `_NoOpTracer`/`_NoOpSpan`: safe fallback with empty method bodies

`kubeflow/common/telemetry_test.py`
- … behavior, and naming convention validation
- `TestCase` + `pytest.mark.parametrize` pattern

`kubeflow/trainer/backends/kubernetes/backend.py`
- `KubernetesBackend.train()` wrapped with root span
- `wait_for_job_status()` polling loop: per-iteration child spans with `status_check` span events

Span hierarchy produced
kubeflow.sdk.trainer.train [root]
├── kubeflow.sdk.trainer.get_runtime
├── kubeflow.sdk.trainer.create_trainjob
│ event: trainjob_submitted
└── kubeflow.sdk.trainer.poll_status [× N iterations]
attributes: poll.iteration, trainjob.status
events: status_check, job_reached_expected_status
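The hierarchy above can be reproduced with a short, dependency-free sketch. `RecordingTracer` and `RecordingSpan` are hypothetical in-memory stand-ins (not part of this PR or of OpenTelemetry) that mimic the `start_as_current_span`, `set_attribute`, and `add_event` calls of the OTel API. The `poll_until` helper and the status strings are likewise assumptions for illustration.

```python
from contextlib import contextmanager


class RecordingSpan:
    """Hypothetical in-memory span exposing the OTel Span methods used here."""

    def __init__(self, name, parent):
        self.name = name
        self.parent = parent  # name of the enclosing span, or None for the root
        self.attributes = {}
        self.events = []

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def add_event(self, name, attributes=None):
        self.events.append(name)


class RecordingTracer:
    """Hypothetical tracer that stores spans in creation order."""

    def __init__(self):
        self.spans = []
        self._stack = []

    @contextmanager
    def start_as_current_span(self, name):
        parent = self._stack[-1].name if self._stack else None
        span = RecordingSpan(name, parent)
        self.spans.append(span)
        self._stack.append(span)
        try:
            yield span
        finally:
            self._stack.pop()


def poll_until(tracer, statuses, expected="Complete"):
    """Open one child span per polling iteration and record status events."""
    for iteration, status in enumerate(statuses, start=1):
        with tracer.start_as_current_span("kubeflow.sdk.trainer.poll_status") as span:
            span.set_attribute("poll.iteration", iteration)
            span.set_attribute("trainjob.status", status)
            span.add_event("status_check")
            if status == expected:
                span.add_event("job_reached_expected_status")
                return status
    return None


tracer = RecordingTracer()
with tracer.start_as_current_span("kubeflow.sdk.trainer.train"):
    with tracer.start_as_current_span("kubeflow.sdk.trainer.get_runtime"):
        pass
    with tracer.start_as_current_span("kubeflow.sdk.trainer.create_trainjob") as span:
        span.add_event("trainjob_submitted")
    # Simulated status sequence standing in for real Kubernetes API responses.
    final_status = poll_until(tracer, ["Created", "Running", "Complete"])
```

Because every iteration opens its own span, a long-running job yields N `poll_status` children under the root, which is the trade-off the polling design accepts in exchange for visibility into each state transition.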
Design decisions
- Centralized in `kubeflow/common/`: single module imported by all SDK clients, enforcing consistent naming via `SpanNames` and `SpanAttributes` constants.
- Zero overhead: `_is_otel_available()` checks once and caches. `_NoOpTracer` has empty method bodies. No allocation when disabled.
- Optional dependency: `opentelemetry-api` not added to core requirements. Users opt in via `pip install 'kubeflow[telemetry]'`.
- Polling loop design: per-iteration child spans give full visibility into state transitions for long-running jobs.
Validation
make verify → ruff check + format: PASSED
telemetry_test.py → 15/15 PASSED
backend_test.py → 47/47 PASSED (zero regressions)
Related
🚧 Draft — opening for early feedback as part of GSoC 2026 proposal work