feat(trainer): add telemetry module for optional OpenTelemetry instrumentation by prabindersinghh · Pull Request #401 · kubeflow/sdk

prabindersinghh · 2026-03-19T09:32:58Z

Summary

Adds kubeflow/common/telemetry.py as the shared OTel instrumentation
layer for the Kubeflow Python SDK, implementing the architecture proposed
in #164.

What this PR adds

kubeflow/common/telemetry.py

get_tracer() — real OTel tracer or zero-overhead _NoOpTracer
when opentelemetry-api not installed or KUBEFLOW_TRACING_DISABLED=1
SpanNames — canonical span name constants for all SDK clients
SpanAttributes — canonical attribute key constants
configure() — one-call setup for Jaeger/OTLP/Console exporters
_NoOpTracer / _NoOpSpan — safe fallback with empty method bodies

kubeflow/common/telemetry_test.py

15 tests covering NoOp path, env var disabling, context manager
behavior, and naming convention validation
Follows existing TestCase + pytest.mark.parametrize pattern

kubeflow/trainer/backends/kubernetes/backend.py

KubernetesBackend.train() wrapped with root span
wait_for_job_status() polling loop: per-iteration child spans
with status_check span events
All 47 existing backend tests pass with zero regressions

Span hierarchy produced

kubeflow.sdk.trainer.train [root]
├── kubeflow.sdk.trainer.get_runtime
├── kubeflow.sdk.trainer.create_trainjob
│ event: trainjob_submitted
└── kubeflow.sdk.trainer.poll_status [× N iterations]
attributes: poll.iteration, trainjob.status
events: status_check, job_reached_expected_status

Design decisions

Centralized in kubeflow/common/ — single module imported by all
SDK clients, enforcing consistent naming via SpanNames and
SpanAttributes constants.

Zero overhead — _is_otel_available() checks once and caches.
_NoOpTracer has empty method bodies. No allocation when disabled.

Optional dependency — opentelemetry-api not added to core
requirements. Users opt in via pip install 'kubeflow[telemetry]'.

Polling loop design — per-iteration child spans give full
visibility into state transitions for long-running jobs.

Validation

make verify → ruff check + format: PASSED
telemetry_test.py → 15/15 PASSED
backend_test.py → 47/47 PASSED (zero regressions)

…entation Adds kubeflow/common/telemetry.py as the shared OTel instrumentation layer: - get_tracer(): zero-overhead NoOp when opentelemetry-api not installed - SpanNames/SpanAttributes: canonical constants for all SDK clients - configure(): one-call setup for Jaeger/OTLP/Console exporters - _NoOpTracer/_NoOpSpan: safe fallback with empty method bodies Instruments KubernetesBackend.train() and wait_for_job_status() as reference implementation. Polling loop produces per-iteration child spans with status_check span events. Related to kubeflow#164 Signed-off-by: Prabinder Singh <prabindersinghh@gmail.com>

google-oss-prow · 2026-03-19T09:33:05Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

Adds a shared OpenTelemetry (OTel) instrumentation layer under kubeflow/common/ and wires initial tracing into the Trainer Kubernetes backend, aiming to provide optional, low-overhead observability for SDK operations.

Changes:

Introduces kubeflow/common/telemetry.py with get_tracer(), configure(), canonical span/attribute constants, and NoOp fallbacks.
Adds unit tests validating the NoOp behavior and naming/attribute conventions.
Instruments KubernetesBackend.train() and wait_for_job_status() with root/child spans and span events.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
`kubeflow/common/telemetry.py`	Adds shared tracing helpers, exporter configuration, and canonical naming/attribute constants.
`kubeflow/common/telemetry_test.py`	Adds tests for NoOp fallback, env-var disabling, and naming/attribute conventions.
`kubeflow/trainer/backends/kubernetes/backend.py`	Wraps training + polling in spans and emits polling events/attributes.

kubeflow/common/telemetry.py

kubeflow/trainer/backends/kubernetes/backend.py

kubeflow/common/telemetry.py

Rajneesh180 · 2026-03-31T11:00:12Z

Went through the full diff against the KubernetesBackend source. Solid foundation — the NoOp fallback design is the right call for an optional dependency. Some things I'd flag from working on the OTel Collector side:

Polling span creation (wait_for_job_status)
start_as_current_span(SpanNames.TRAINER_POLL_STATUS) is inside the for loop, creating a new child span per iteration. With timeout=600 and polling_interval=2, that's up to 300 spans per TrainJob. Wrapping the entire for loop in a single TRAINER_POLL_STATUS span and using span.add_event("status_check", {SpanAttributes.TRAINJOB_STATUS: trainjob.status}) per iteration would give identical Jaeger timeline visibility without the storage overhead. Also avoids POLL_TIMEOUT and POLL_INTERVAL attributes being duplicated 300 times — set them once on the parent span.

_NoOpSpan interface gap
Missing is_recording() -> bool and get_span_context(). If any downstream code or OTel instrumentation lib checks span.is_recording(), this raises AttributeError. Since this is a duck-type stand-in for opentelemetry.trace.Span, adding def is_recording(self) -> bool: return False would make it safer.

Exporter naming (_build_exporter)
exporter="jaeger" maps to OTLPSpanExporter — same path as "otlp". Since opentelemetry-exporter-jaeger was removed from the Python SDK, keeping "jaeger" as an option creates the impression of a different backend. I'd simplify to "otlp" and "console", noting in the docstring that Jaeger accepts OTLP natively on 4317.

_NoOpTracer allocation
get_tracer() returns _NoOpTracer() — new instance each call. In KubernetesBackend.__init__ it's fine since it's called once, but a module-level _NOOP_TRACER = _NoOpTracer() sentinel avoids unnecessary allocations if other code calls get_tracer() per-request.

Test coverage gap
The 15 tests validate the telemetry module in isolation (NoOp path, constants, env disabling). What's missing is an integration-level test that mocks KubernetesBackend, calls .train(), and verifies the span hierarchy/attributes with an InMemorySpanExporter. The existing 47 backend tests pass, but they don't assert on the instrumentation behavior.

Copilot AI review requested due to automatic review settings March 19, 2026 09:32

google-oss-prow bot requested review from andreyvelich, astefanutti and szaher March 19, 2026 09:33

google-oss-prow bot added the size/XL label Mar 19, 2026

Copilot AI reviewed Mar 19, 2026

View reviewed changes

prabindersinghh changed the title ~~feat(common): add telemetry module for optional OpenTelemetry instrumentation~~ feat(trainer): add telemetry module for optional OpenTelemetry instrumentation Mar 19, 2026

prabindersinghh mentioned this pull request Mar 19, 2026

Integrate Kubeflow SDK with OpenTelemetry #164

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(trainer): add telemetry module for optional OpenTelemetry instrumentation#401

feat(trainer): add telemetry module for optional OpenTelemetry instrumentation#401
prabindersinghh wants to merge 1 commit intokubeflow:mainfrom
prabindersinghh:feat/otel-telemetry-module

prabindersinghh commented Mar 19, 2026

Uh oh!

google-oss-prow bot commented Mar 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Rajneesh180 commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

prabindersinghh commented Mar 19, 2026

Summary

What this PR adds

Span hierarchy produced

Design decisions

Validation

Related

Uh oh!

google-oss-prow bot commented Mar 19, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Rajneesh180 commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants