kubernetes: surface log-unavailable reason and observability link in placeholder #280

Closed
morgan-wowk wants to merge 1 commit into fix/swallow-log-acquisition-errors from fix/log-acquisition-error-search-link

@morgan-wowk morgan-wowk commented Jun 18, 2026

Stacked on #279.

When log acquisition fails (broken UTF-8, truncated response, broken JSON from the Kubernetes API), instead of returning an empty string the launcher stores a human-readable placeholder containing the pod name, namespace, and — when TANGLE_LOG_SEARCH_URL_TEMPLATE is configured — a direct link to the pod's logs in the deployment's observability platform.

The placeholder is stored in GCS via upload_log and returned verbatim by the log-read API, so it surfaces wherever logs are displayed without any frontend or schema changes.
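The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the launcher's actual code: `read_log_or_placeholder`, `fetch_log`, and the chosen exception types are hypothetical stand-ins for whatever the Kubernetes client raises in each failure mode.

```python
import json
from http.client import IncompleteRead
from typing import Callable

# Assumption: these stdlib exceptions stand in for the three failure modes
# (broken UTF-8, broken JSON, truncated response). The real client may raise
# different types.
ACQUISITION_ERRORS = (UnicodeDecodeError, json.JSONDecodeError, IncompleteRead)


def read_log_or_placeholder(
    fetch_log: Callable[[], str], pod_name: str, namespace: str
) -> str:
    """Return the log text, or a readable placeholder if acquisition fails."""
    try:
        return fetch_log()
    except ACQUISITION_ERRORS:
        # Instead of an empty string, return a message that identifies the pod.
        return (
            "[Log unavailable: Kubernetes API returned a malformed response. "
            f"Pod: {pod_name}, Namespace: {namespace}.]"
        )
```

Because the placeholder is an ordinary string, it can travel through the same upload and read path as a real log.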

Time range — absolute ISO 8601 timestamps

The link uses a fixed, absolute time window so it stays accurate regardless of when the link is clicked (a relative now-Xm window would drift with the click time and eventually stop covering the run).

Both started_at and ended_at are already in memory — no DB queries:

| Launcher class | started_at source | ended_at source |
| --- | --- | --- |
| LaunchedKubernetesContainer | self._debug_pod.status container state | self._debug_pod.status terminated state |
| LaunchedKubernetesJob | self._debug_job.status.start_time | job completion condition last_transition_time |

The window is started_at − 5 min to ended_at + 5 min, matching the padding used by the tangle-ui overlay schema. Falls back to now − 24 h to now when timestamps are unavailable (pod still pending, or status not yet populated).
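The window computation can be sketched as below. The function name `search_window` and its signature are hypothetical. The sketch also assumes each end of the window falls back independently when its timestamp is missing, which the description does not state explicitly.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

PADDING = timedelta(minutes=5)           # padding on both sides of the run
FALLBACK_LOOKBACK = timedelta(hours=24)  # used when started_at is unknown


def search_window(
    started_at: Optional[datetime],
    ended_at: Optional[datetime],
    now: Optional[datetime] = None,
) -> Tuple[datetime, datetime]:
    """Return (start, end) of the absolute log-search window."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Assumption: each end falls back on its own when its timestamp is missing.
    start = started_at - PADDING if started_at else now - FALLBACK_LOOKBACK
    end = ended_at + PADDING if ended_at else now
    return start, end
```

For a run from 20:29:11 to 22:31:44 UTC this yields 20:24:11 to 22:36:44, and with both timestamps missing it yields the last 24 hours.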

OSS design

TANGLE_LOG_SEARCH_URL_TEMPLATE is a generic env var with three str.replace placeholders:

| Placeholder | Substituted with |
| --- | --- |
| {pod_name} | Kubernetes pod name |
| {start_iso8601_ms} | started_at − 5 min as 2026-06-17T20:24:11.000Z, or now − 24 h as fallback |
| {end_iso8601_ms} | ended_at + 5 min as 2026-06-17T22:36:44.000Z, or now as fallback |

No observability-platform-specific naming or logic in the OSS code. Deployments that set the env var get a direct link; deployments that omit it get the pod name and namespace only.
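A minimal sketch of the substitution, assuming timezone-aware datetimes. The helper names `to_iso8601_ms` and `render_search_url` and the example.com template are illustrative, not taken from the PR.

```python
from datetime import datetime, timezone


def to_iso8601_ms(moment: datetime) -> str:
    """Format an aware datetime as UTC with millisecond precision and a Z suffix."""
    utc = moment.astimezone(timezone.utc)
    millis = utc.microsecond // 1000
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


def render_search_url(
    template: str, pod_name: str, start: datetime, end: datetime
) -> str:
    """Fill the three placeholders using plain str.replace."""
    return (
        template.replace("{pod_name}", pod_name)
        .replace("{start_iso8601_ms}", to_iso8601_ms(start))
        .replace("{end_iso8601_ms}", to_iso8601_ms(end))
    )


# Hypothetical template; a real deployment would supply its own URL.
EXAMPLE_TEMPLATE = (
    "https://logs.example.com/q?pod={pod_name}"
    '&r={"from":"{start_iso8601_ms}","to":"{end_iso8601_ms}"}'
)
```

One practical effect of str.replace over str.format is that literal braces elsewhere in the template, such as a JSON-valued query parameter, pass through untouched.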

Example output

Without TANGLE_LOG_SEARCH_URL_TEMPLATE set (any OSS deployment):

[Log unavailable: Kubernetes API returned a malformed response. Pod: task-019ed73257f0c30a068a-0-xk9pq, Namespace: ml-infra-nebius.]

With TANGLE_LOG_SEARCH_URL_TEMPLATE set (e.g. Shopify's Observe deployment):

[Log unavailable: Kubernetes API returned a malformed response. Pod: task-019ed73257f0c30a068a-0-xk9pq, Namespace: ml-infra-nebius. Search: https://observe.shopify.io/a/observe/investigate/query?mls=...&q=...kube_pod+contains+task-019ed73257f0c30a068a-0-xk9pq...&r={"from":"2026-06-17T20:24:11.000Z","to":"2026-06-17T22:36:44.000Z"}&category=logging]

The link opens Observe pre-filtered to that pod, on the window given in the r parameter, about 2 h 13 min in this example (the job's actual runtime ± 5 min padding).
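The two example outputs above differ only in the optional Search suffix. A sketch of a formatter that produces both shapes, with the hypothetical name `format_placeholder`:

```python
from typing import Optional


def format_placeholder(
    pod_name: str, namespace: str, search_url: Optional[str] = None
) -> str:
    """Build the bracketed placeholder, appending a search link when one exists."""
    message = (
        "[Log unavailable: Kubernetes API returned a malformed response. "
        f"Pod: {pod_name}, Namespace: {namespace}."
    )
    if search_url:
        # Only deployments that configure a URL template get this suffix.
        message += f" Search: {search_url}"
    return message + "]"
```

Calling it without `search_url` corresponds to a deployment that leaves TANGLE_LOG_SEARCH_URL_TEMPLATE unset.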

Deployment config

The Observe URL template is set for production and staging in infrastructure/applications/oasis-backend/{production,staging}/app.yaml — see Shopify/infrastructure#52749.

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

…placeholder

When log acquisition fails, replace the empty return value with a human-
readable message that includes the pod name, namespace, and — when
TANGLE_LOG_SEARCH_URL_TEMPLATE is set — a direct link to the pod's logs in
the configured observability platform.

The URL template supports two placeholders substituted at runtime:
  {pod_name}   — Kubernetes pod name
  {start_time} — relative start derived from started_at (e.g. "now-125m",
                 adding 5 min of padding); falls back to "now-1440m" (24 h)
                 if the start time is not available in memory.

Both started_at values (LaunchedKubernetesContainer from pod container state,
LaunchedKubernetesJob from job status) are in-memory reads — no additional
database queries are required to compute the time range.

The placeholder is stored in GCS via upload_log and returned verbatim by the
log-read API, so it surfaces wherever logs are displayed without any frontend
or schema changes.
@morgan-wowk

We chose a different solution, as indicated on issue #281.
