Skip to content

feat(trainer): add telemetry module for optional OpenTelemetry instrumentation#401

Open
prabindersinghh wants to merge 1 commit intokubeflow:mainfrom
prabindersinghh:feat/otel-telemetry-module
Open

feat(trainer): add telemetry module for optional OpenTelemetry instrumentation#401
prabindersinghh wants to merge 1 commit intokubeflow:mainfrom
prabindersinghh:feat/otel-telemetry-module

Conversation

@prabindersinghh
Copy link
Copy Markdown

Summary

Adds kubeflow/common/telemetry.py as the shared OTel instrumentation
layer for the Kubeflow Python SDK, implementing the architecture proposed
in #164.

What this PR adds

kubeflow/common/telemetry.py

  • get_tracer() — real OTel tracer or zero-overhead _NoOpTracer
    when opentelemetry-api not installed or KUBEFLOW_TRACING_DISABLED=1
  • SpanNames — canonical span name constants for all SDK clients
  • SpanAttributes — canonical attribute key constants
  • configure() — one-call setup for Jaeger/OTLP/Console exporters
  • _NoOpTracer / _NoOpSpan — safe fallback with empty method bodies

kubeflow/common/telemetry_test.py

  • 15 tests covering NoOp path, env var disabling, context manager
    behavior, and naming convention validation
  • Follows existing TestCase + pytest.mark.parametrize pattern

kubeflow/trainer/backends/kubernetes/backend.py

  • KubernetesBackend.train() wrapped with root span
  • wait_for_job_status() polling loop: per-iteration child spans
    with status_check span events
  • All 47 existing backend tests pass with zero regressions

Span hierarchy produced

kubeflow.sdk.trainer.train [root]
├── kubeflow.sdk.trainer.get_runtime
├── kubeflow.sdk.trainer.create_trainjob
│ event: trainjob_submitted
└── kubeflow.sdk.trainer.poll_status [× N iterations]
attributes: poll.iteration, trainjob.status
events: status_check, job_reached_expected_status

Design decisions

Centralized in kubeflow/common/ — single module imported by all
SDK clients, enforcing consistent naming via SpanNames and
SpanAttributes constants.

Zero overhead_is_otel_available() checks once and caches.
_NoOpTracer has empty method bodies. No allocation when disabled.

Optional dependencyopentelemetry-api not added to core
requirements. Users opt in via pip install 'kubeflow[telemetry]'.

Polling loop design — per-iteration child spans give full
visibility into state transitions for long-running jobs.

Validation

make verify → ruff check + format: PASSED
telemetry_test.py → 15/15 PASSED
backend_test.py → 47/47 PASSED (zero regressions)

Related

🚧 Draft — opening for early feedback as part of GSoC 2026 proposal work

…entation

Adds kubeflow/common/telemetry.py as the shared OTel instrumentation layer:
- get_tracer(): zero-overhead NoOp when opentelemetry-api not installed
- SpanNames/SpanAttributes: canonical constants for all SDK clients
- configure(): one-call setup for Jaeger/OTLP/Console exporters
- _NoOpTracer/_NoOpSpan: safe fallback with empty method bodies

Instruments KubernetesBackend.train() and wait_for_job_status() as
reference implementation. Polling loop produces per-iteration child
spans with status_check span events.

Related to kubeflow#164

Signed-off-by: Prabinder Singh <prabindersinghh@gmail.com>
Copilot AI review requested due to automatic review settings March 19, 2026 09:32
@google-oss-prow
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a shared OpenTelemetry (OTel) instrumentation layer under kubeflow/common/ and wires initial tracing into the Trainer Kubernetes backend, aiming to provide optional, low-overhead observability for SDK operations.

Changes:

  • Introduces kubeflow/common/telemetry.py with get_tracer(), configure(), canonical span/attribute constants, and NoOp fallbacks.
  • Adds unit tests validating the NoOp behavior and naming/attribute conventions.
  • Instruments KubernetesBackend.train() and wait_for_job_status() with root/child spans and span events.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
kubeflow/common/telemetry.py Adds shared tracing helpers, exporter configuration, and canonical naming/attribute constants.
kubeflow/common/telemetry_test.py Adds tests for NoOp fallback, env-var disabling, and naming/attribute conventions.
kubeflow/trainer/backends/kubernetes/backend.py Wraps training + polling in spans and emits polling events/attributes.

@prabindersinghh prabindersinghh changed the title feat(common): add telemetry module for optional OpenTelemetry instrumentation feat(trainer): add telemetry module for optional OpenTelemetry instrumentation Mar 19, 2026
@Rajneesh180
Copy link
Copy Markdown

Went through the full diff against the KubernetesBackend source. Solid foundation — the NoOp fallback design is the right call for an optional dependency. Some things I'd flag from working on the OTel Collector side:

Polling span creation (wait_for_job_status)
start_as_current_span(SpanNames.TRAINER_POLL_STATUS) is inside the for loop, creating a new child span per iteration. With timeout=600 and polling_interval=2, that's up to 300 spans per TrainJob. Wrapping the entire for loop in a single TRAINER_POLL_STATUS span and using span.add_event("status_check", {SpanAttributes.TRAINJOB_STATUS: trainjob.status}) per iteration would give identical Jaeger timeline visibility without the storage overhead. Also avoids POLL_TIMEOUT and POLL_INTERVAL attributes being duplicated 300 times — set them once on the parent span.

_NoOpSpan interface gap
Missing is_recording() -> bool and get_span_context(). If any downstream code or OTel instrumentation lib checks span.is_recording(), this raises AttributeError. Since this is a duck-type stand-in for opentelemetry.trace.Span, adding def is_recording(self) -> bool: return False would make it safer.

Exporter naming (_build_exporter)
exporter="jaeger" maps to OTLPSpanExporter — same path as "otlp". Since opentelemetry-exporter-jaeger was removed from the Python SDK, keeping "jaeger" as an option creates the impression of a different backend. I'd simplify to "otlp" and "console", noting in the docstring that Jaeger accepts OTLP natively on 4317.

_NoOpTracer allocation
get_tracer() returns _NoOpTracer() — new instance each call. In KubernetesBackend.__init__ it's fine since it's called once, but a module-level _NOOP_TRACER = _NoOpTracer() sentinel avoids unnecessary allocations if other code calls get_tracer() per-request.

Test coverage gap
The 15 tests validate the telemetry module in isolation (NoOp path, constants, env disabling). What's missing is an integration-level test that mocks KubernetesBackend, calls .train(), and verifies the span hierarchy/attributes with an InMemorySpanExporter. The existing 47 backend tests pass, but they don't assert on the instrumentation behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(common): add shared telemetry module for optional OpenTelemetry instrumentation

3 participants