scheduled_task scopes do not expose HTTP traffic via ALB, so the ALB capacity and target group capacity validations from the base k8s workflows are unnecessary. Override them with `action: skip` in the scheduled_task overlays and add structural tests that lock the contract with upstream step names — if a base step is renamed, the test fails instead of silently re-enabling the validation. Also adds .vscode/ to .gitignore. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…heduled-tasks Skip ALB capacity validations in scheduled_task workflows
…t need it Deployment actions like switch-traffic, kill-instances, and diagnose-deployment are purely operational and don't require application parameters. All scope actions (create, update, delete, etc.) deal with infrastructure, not app config. The flag is only added when the CLI supports it, preserving backward compatibility.
…nd_throubleshoot Features/kubectl read command for troubleshooting
AWS ELBs expose DNS hostnames (type=Hostname), not IPs (type=IPAddress). The manage_route script now falls back through four strategies:
1. Gateway IPAddress → A record
2. Gateway Hostname → CNAME record
3. Service LB IP → A record
4. Service LB hostname → CNAME record
The dns-endpoint.yaml.tpl now uses dynamic record_type (A or CNAME) instead of hardcoded A, so DNSEndpoints are created correctly on AWS.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When DNS_TYPE is external_dns, verify_networking_reconciliation was skipping reconciliation entirely. Now it calls manage_route to resolve the gateway address, applies the DNSEndpoint to the cluster, and verifies HTTPRoute reconciliation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nslookup for external_dns nslookup against 8.8.8.8 fails for private Route53 zones and domains without public NS delegation. external-dns sets status.observedGeneration=1 once it processes the DNSEndpoint, which is a reliable signal that the Route53 record was created. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rns cluster-internal address Istio gateways report their status address as the ClusterIP service name (gateway-public-istio.gateways.svc.cluster.local), not the external ALB hostname. Added a fallback that reads the hostname from the ALB Ingress (gateway-alb-public / gateway-alb-private) when a .svc.cluster.local address is detected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ddress The previous approach resolved the gateway address first and only checked the ALB Ingress when the result was cluster-internal. Reversed the priority: ALB Ingress (gateway-alb-<suffix>) is checked first since it's the AWS-specific override. If not present, falls back to the standard gateway address resolution chain (IPAddress → Hostname → Service IP → Service hostname), which is the common case for environments with a real external gateway. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
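The reversed priority order can be sketched as a pure function. This is a minimal illustration and not the manage_route script itself: `pick_dns_target` and its five positional arguments are hypothetical, and the caller is assumed to have already looked up each candidate (an empty string means the candidate is absent).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<record_type> <target>" for the first non-empty
# candidate, in the priority order described in the commit message.
# Assumption: the caller resolved each candidate beforehand.
pick_dns_target() {
  local alb_hostname="$1" gw_ip="$2" gw_hostname="$3" svc_ip="$4" svc_hostname="$5"
  # 1. AWS-specific override: hostname taken from the ALB Ingress.
  if [ -n "$alb_hostname" ]; then echo "CNAME $alb_hostname"; return 0; fi
  # 2-5. Standard gateway chain: IPs become A records, hostnames become CNAMEs.
  if [ -n "$gw_ip" ]; then echo "A $gw_ip"; return 0; fi
  if [ -n "$gw_hostname" ]; then echo "CNAME $gw_hostname"; return 0; fi
  if [ -n "$svc_ip" ]; then echo "A $svc_ip"; return 0; fi
  if [ -n "$svc_hostname" ]; then echo "CNAME $svc_hostname"; return 0; fi
  # Nothing resolvable: let the caller decide how to report it.
  return 1
}
```

Keeping the selection separate from the kubectl lookups makes the ordering easy to cover in bats without a cluster.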
…test with correct ALB target The previous nslookup against 8.8.8.8 was failing because the CNAME pointed to a cluster-internal gateway address. Now that manage_route resolves the real ALB hostname first, testing whether public DNS resolution works correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… line Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ne record creation Both external-dns-public and external-dns-private controllers process all DNSEndpoints, causing public scope records to appear in the private hosted zone and vice versa. Add a dns/zone-type label (public|private) derived from SCOPE_VISIBILITY so each controller can filter only the records it owns via --label-filter. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ead of nslookup Private scopes use the internal Route53 hosted zone which is not resolvable via public DNS (8.8.8.8). Poll status.observedGeneration on the DNSEndpoint instead — set to >=1 by external-dns when the record is processed. Public scopes keep the existing nslookup-based check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ns scopes Replace nslookup-based DNS resolution check with DNSEndpoint observedGeneration polling for all scopes (public and private). nslookup was unreliable due to high cluster DNS cache TTL. observedGeneration is set by external-dns when it processes the record — faster and works consistently regardless of zone visibility. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
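The polling approach might look roughly like the sketch below. The function names, the attempt count, and the delay are assumptions made for illustration; only the `status.observedGeneration` signal comes from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around kubectl so the lookup can be stubbed in tests.
get_observed_generation() {
  kubectl get dnsendpoint "$1" -n "$2" \
    -o jsonpath='{.status.observedGeneration}' 2>/dev/null
}

# Sketch: wait until external-dns has processed the DNSEndpoint.
# Arguments: name, namespace, max attempts (assumed default 30),
# delay in seconds between attempts (assumed default 2).
wait_for_dns_endpoint() {
  local name="$1" namespace="$2" attempts="${3:-30}" delay="${4:-2}"
  local i=1 gen
  while [ "$i" -le "$attempts" ]; do
    gen="$(get_observed_generation "$name" "$namespace" || true)"
    # An unset status yields an empty string: treat it as "not processed yet".
    if [ -n "$gen" ] && [ "$gen" -ge 1 ] 2>/dev/null; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

Unlike an nslookup against a public resolver, this check does not depend on zone visibility or on DNS cache TTLs.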
…erify step and improve endpoint naming
- Remove manage_route call and kubectl apply from verify_networking_reconciliation;
DNS record creation belongs to scope creation flow, not deployment verification
- Include application.slug in DNSEndpoint name (k8s-{app}-{scope}-{id}-dns) to
distinguish scopes with the same name across different apps
- Truncate app/scope slugs to 20 chars each to respect K8s name length limits
- Update dns-endpoint.yaml.tpl to use new naming via gomplate strings.Trunc
- Fix wait_on_balancer.bats: rewrite tests to match observedGeneration logic
(previous tests referenced removed nslookup checks)
- Fix manage_route.bats: correct wrong log message assertions and update
expected DNSEndpoint name to new format
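For illustration, the naming rule can be reproduced in bash. The template itself uses gomplate's strings.Trunc, so this is only an equivalent sketch, and `dns_endpoint_name` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build k8s-{app}-{scope}-{id}-dns, truncating the app
# and scope slugs to 20 characters each so the name stays within K8s limits.
dns_endpoint_name() {
  local app_slug="$1" scope_slug="$2" scope_id="$3"
  echo "k8s-${app_slug:0:20}-${scope_slug:0:20}-${scope_id}-dns"
}

dns_endpoint_name "shop" "api" "42"   # prints k8s-shop-api-42-dns
```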
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ption Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(dns): support external-dns with DNSEndpoint CRD and zone-type label filtering
…onal_port service
…M_PORT pointing to main_http_port
* Capture deployments, replicasets, pod logs and describe in diagnose snapshot
Extends build_context to capture the resources needed for a complete post-mortem:
- deployments.json and replicasets.json scoped by deployment_id (so we see rollout
state even when no pods got created)
- For every pod identified as problematic (CrashLoop / OOM / ImagePullBackOff /
Terminated / restartCount>0 / not-Ready / terminating), capture:
- kubectl describe pod -> data/pod_describe/<pod>.txt
- kubectl logs (current + --previous) for every container, including init
containers, into data/pod_logs/<pod>.<container>[.previous].log
Tail size is configurable via POD_LOG_TAIL_LINES (default 500). All new files live
under data/, so notify_results continues to exclude them from the backend payload.
The data is consumed by downstream checks and (in a follow-up) embedded into
evidence for AI consumption.
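A minimal sketch of the per-container capture, assuming the file layout described above. `capture_pod_logs` and its argument order are hypothetical; the kubectl flags used (`-c`, `-n`, `--tail`, `--previous`) are standard.

```shell
#!/usr/bin/env bash
# Hypothetical helper: save current and previous logs for one container.
# Files follow the layout data/pod_logs/<pod>.<container>[.previous].log.
capture_pod_logs() {
  local pod="$1" container="$2" namespace="$3"
  local out_dir="${4:-data/pod_logs}"
  local tail_lines="${POD_LOG_TAIL_LINES:-500}"
  local current="$out_dir/$pod.$container.log"
  local previous="$out_dir/$pod.$container.previous.log"
  mkdir -p "$out_dir"
  # Drop the file when kubectl fails so "missing" keeps meaning "no logs".
  kubectl logs "$pod" -c "$container" -n "$namespace" --tail="$tail_lines" \
    > "$current" 2>/dev/null || rm -f "$current"
  kubectl logs "$pod" -c "$container" -n "$namespace" --tail="$tail_lines" \
    --previous > "$previous" 2>/dev/null || rm -f "$previous"
}
```

A container that never restarted has no previous instance, so the second call is expected to fail in the common case.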
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Emit structured evidence in all diagnose checks
Until now every check emitted update_check_result with an empty {} evidence
payload, leaving only printf'd stdout for downstream consumers (UI / AI). With
20 checks all producing colored ANSI text, neither a frontend nor an LLM could
reliably extract counts, names, exit codes, or remediation steps.
This change defines a canonical evidence schema in diagnose_utils:
{
summary: "one-line human summary",
severity: "critical" | "warning" | "info",
affected: ["resource-names"],
details: { check-specific structured data },
suggested_actions: ["actionable guidance"]
}
Helpers:
- evidence_json(summary, severity, affected, details, actions): builds the
schema with safe defaults
- exit_code_meaning(code): maps 0/1/137/139/143 → human-readable, reused
across crash, OOM, and termination checks
- require_resources updated so the "skipped" path also emits schema evidence
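As a rough idea of what the exit-code helper could look like: the wording of each message below is an assumption, not the text used in diagnose_utils. Only the set of mapped codes comes from the commit message.

```shell
#!/usr/bin/env bash
# Sketch of exit_code_meaning. Codes above 128 are 128 + signal number.
# The message strings are illustrative assumptions.
exit_code_meaning() {
  case "$1" in
    0)   echo "Exited successfully" ;;
    1)   echo "General application error" ;;
    137) echo "Killed by SIGKILL (128+9), often an OOM kill" ;;
    139) echo "Segmentation fault, SIGSEGV (128+11)" ;;
    143) echo "Terminated by SIGTERM (128+15)" ;;
    *)   echo "Unknown exit code $1" ;;
  esac
}
```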
All 20 checks migrated. Each preserves its existing stdout output (so no
regressions for users tailing logs) and additionally builds details with the
data already extracted: pod names, container names, exit codes, restart counts,
endpoint counts, ingress backends, certificate ARNs, etc. Severity is mapped
from status (failed→critical, warning→warning, success/skipped→info), allowing
the AI summarizer to prioritize what matters.
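The status-to-severity mapping is small enough to sketch directly. `severity_for_status` is a hypothetical name, and the fallback for unrecognized statuses is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a check status to an evidence severity.
severity_for_status() {
  case "$1" in
    failed)          echo "critical" ;;
    warning)         echo "warning" ;;
    success|skipped) echo "info" ;;
    *)               echo "info" ;;   # assumption: unknown statuses are informational
  esac
}
```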
Side effects:
- Fixes a pre-existing bug in ingress_tls_configuration that read tls.crt
from .metadata.annotations | keys[] (which never contains them, and where
build_context strips .data anyway). Now relies on Secret type validation.
- Adds tests/evidence_schema.bats: cross-cutting validation that every check
in scope/, service/, and networking/ emits a schema-conformant payload on
skipped, failed, and success paths.
- Updates existing test files where they previously asserted on legacy
flat evidence fields (.evidence.tested, .evidence.ready) to point at the
new nested location (.evidence.details.*).
Suite: 280 tests, 0 failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Embed pod logs in failed-check evidence for AI post-mortem
Without logs in the evidence payload, the AI summarizer would have to fetch
them separately for every diagnose run. By the time the summary is requested,
the cluster state has already moved on (rollback fired, pods churned), so live
logs would be misleading. Instead, embed the relevant log slice from the
build_context snapshot directly into the failing check's evidence, so the AI
gets a self-contained post-mortem in a single payload.
Helper: read_log_tail(pod, container, "current"|"previous", [lines]) reads
from data/pod_logs/ and returns a JSON array of lines. Returns [] when the
file is missing (most common case: no previous log because container never
crashed). Truncation is configurable via EVIDENCE_LOG_TAIL_LINES (default 50,
intentionally smaller than the 500-line build_context capture so the payload
stays bounded).
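A minimal sketch of such a helper, assuming the data/pod_logs/ layout from the earlier commit. The `POD_LOGS_DIR` override is hypothetical and exists only to make the sketch testable; the real helper may differ.

```shell
#!/usr/bin/env bash
# Sketch of read_log_tail: print the last N log lines as a JSON array.
# Arguments: pod, container, "current" or "previous", optional line count.
read_log_tail() {
  local pod="$1" container="$2" which="${3:-current}"
  local lines="${4:-${EVIDENCE_LOG_TAIL_LINES:-50}}"
  local dir="${POD_LOGS_DIR:-data/pod_logs}"   # POD_LOGS_DIR is hypothetical
  local file="$dir/$pod.$container.log"
  if [ "$which" = "previous" ]; then
    file="$dir/$pod.$container.previous.log"
  fi
  # Missing file (for example, no previous instance): empty array.
  if [ ! -f "$file" ]; then
    echo "[]"
    return 0
  fi
  # -R -s reads the whole input as one string; split it and drop empty lines.
  tail -n "$lines" "$file" | jq -c -R -s 'split("\n") | map(select(length > 0))'
}
```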
Five checks now embed logs on their failure paths:
- container_crash_detection: previous_logs (CrashLoopBackOff, high-restart)
and current_logs + previous_logs (terminated). Previous is where the crash
output lives — current is empty during the restart loop.
- memory_limits_check: previous_logs on OOMKilled. The kubelet restarts the
container after the kill, so OOM-relevant output is in the previous instance.
- health_probe_endpoints: container_logs (current) on every probe failure
(4xx, 5xx, connection refused). Pairs the probe verdict with what the app
was printing.
- container_port_health: container_logs (current) on port_not_listening
issues. Container is running but not bound — current logs typically show
why (binding error, config mismatch).
- pod_readiness: current_logs of the first container for stuck (not_ready)
pods, but NOT for normally-starting pods (avoids noise during rollouts).
Deliberate exclusions:
- Success paths never embed logs (keeps payload light for healthy scopes).
- image_pull_status doesn't embed: if the image couldn't be pulled, there
is no container and no logs.
- networking/ and service/ checks don't embed: their failures are
configuration issues, not application issues.
Tests: +8 covering the helper, the embedding behavior, and a regression test
asserting the success path stays log-free. Suite: 288 tests, 0 failures (10
environmental skips on macOS dev hosts where nc/timeout from coreutils aren't
in PATH).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Dedup mark_affected and replace jq-loop accumulators with bash arrays
The first pass landed evidence enrichment quickly but at the cost of two
duplications visible across all 17 non-existence checks:
- mark_affected was redefined locally in every check (18 copies of the same
3-line jq-dedup function, differing only by the array variable name).
- Each check accumulated facts via the same per-iteration jq round-trip
pattern: FACTS=$(echo "$FACTS" | jq --argjson f "$x" '. + [$f]'). This
is O(N²) (the JSON array is reparsed and reserialized on every push) and
forks one jq process per iteration. ~60 such call sites; for a failing
scope with 10 problematic pods, that's hundreds of jq forks per check.
This commit moves both into diagnose_utils:
- mark_affected <set_name> <value> — adds to a space-separated set
stored in a bash variable, dedup
on add (no jq).
- set_to_json_array <set_name> — converts the set to a JSON array
in a single jq call.
- add_fact <array_name> <json_string> — bash array append, no jq.
- facts_to_json_array <array_name> — converts the array to a JSON array
in a single jq -s call at the end
of accumulation.
- lines_to_json_array — extracted shared filter for the
tail|jq -R -s 'split("\n")...'
pipeline that update_check_result
and read_log_tail both used.
All 17 affected checks were migrated. The 18 local mark_affected copies are
gone; check-level accumulator code shrunk from "jq-merge per iteration" to
"bash append per iteration, jq once at end".
Bash 3.2 compatibility: helpers use eval-based pass-by-name rather than
declare -n / declare -A (which require bash 4.3+ / 4.0+). Production runtime
on Alpine has bash 5.x, but local dev tests on macOS run /bin/bash 3.2.
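The eval-based pass-by-name pattern could be sketched as follows. This is an illustration of the technique under Bash 3.2 constraints, not the code in diagnose_utils; it assumes the variable names passed in are trusted constants and do not collide with the helpers' local names.

```shell
#!/usr/bin/env bash
# Bash 3.2 compatible: no declare -n, no associative arrays.

# Add a value to a space-separated set stored in the named variable,
# skipping values that are already present. No jq involved.
mark_affected() {
  local set_name="$1" value="$2" current
  eval "current=\"\${$set_name:-}\""
  case " $current " in
    *" $value "*) return 0 ;;   # already in the set
  esac
  eval "$set_name=\"\${current:+\$current }\$value\""
}

# Append a JSON string to the named bash array. No jq involved.
add_fact() {
  local array_name="$1" json="$2"
  eval "$array_name+=(\"\$json\")"
}

AFFECTED=""
mark_affected AFFECTED pod-a
mark_affected AFFECTED pod-a   # duplicate, ignored
echo "$AFFECTED"               # prints pod-a
```

Each push is now a plain bash assignment, so the conversion to JSON can happen once at the end instead of once per iteration.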
Suite: 288 tests, 0 failures, 10 environmental skips. No behavior change —
both the human stdout and the evidence JSON shape are byte-identical to the
pre-simplify baseline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add Application Logs diagnose category for AI post-mortem
New diagnose category that publishes the application's own log output as
structured evidence, contextualized with the pod and container state at
fail-time. Unlike the scope/ checks that detect specific failure modes and
embed logs as secondary evidence, this check is log-first: a single,
self-contained category the AI summarizer can read to say "here is the
issue, look at this" without cross-referencing other checks.
- k8s/diagnose/logs/workflow.yml: declares the category "Application Logs"
with a single step that runs application_log_evidence.
- k8s/diagnose/logs/application_log_evidence: iterates problematic pods
from the build_context snapshot (no live kubectl), reads current and
previous logs per container (init + regular) via read_log_tail, and
emits a fact per container with the schema:
{ pod, pod_phase, pod_reason,
container, init_container,
container_state, restart_count,
current_state_reason, last_termination_reason,
last_exit_code, last_exit_code_meaning,
current_logs, previous_logs }
Status is always success/skipped (info severity). The check never fails:
absence of logs is itself meaningful information ("image never started").
Reuses exit_code_meaning from diagnose_utils for the meaning string.
- k8s/scope/workflows/diagnose.yaml and
k8s/deployment/workflows/diagnose.yaml: register the new folder in the
executor so the category appears automatically alongside Scope/Service/
Networking. notify_results groups by category, no backend changes
required.
- k8s/diagnose/tests/logs/application_log_evidence.bats: 10 tests covering
skipped path, empty problematic list, current logs only, previous logs,
init container flag, no-logs-available, multi-pod aggregation, empty
log files, CrashLoopBackOff context, and pod_reason from Ready condition.
- k8s/diagnose/tests/evidence_schema.bats: +1 cross-cutting test asserting
the check emits a schema-conformant evidence object on the skipped path.
Full suite: 357/357 green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Focus application_log_evidence on the 'application' container only
Narrow the Application Logs check to its essential job: publish the user-owned
container's logs for AI post-mortem. The previous shape duplicated metadata
already emitted by the scope/ checks (pod_phase, container_state, restart
counts, exit codes, etc.) and iterated every container in every problematic
pod — including sidecars like 'http' nginx whose logs already appear in
Health Probe Endpoints and Container Port Health.
- Filters by container name 'application' (the literal name set in
k8s/deployment/templates/deployment.yaml.tpl). Sidecars and init containers
are out of scope; this check is not a per-container audit.
- Per-pod payload shrinks from 12 fields to 2: { pod, logs }.
- current and previous logs are merged in chronological order (previous
first, current second) and truncated to the last EVIDENCE_LOG_TAIL_LINES
(default 50). One flat array — the AI does not need to know which container
instance produced which line; the user wanted the tail of the application
output, period.
- Tests updated: 9 cases covering skipped/empty paths, application-only
filtering (asserts sidecar logs do not leak), previous+current merge in
order, the 50-line cap, multi-pod aggregation, and a schema-pinning test
that asserts the pod entry exposes exactly {pod, logs}.
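The merge-and-cap step can be sketched with plain coreutils. `merged_log_tail` is a hypothetical name, and the sketch assumes each log file ends with a newline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit previous logs, then current logs, and keep only
# the last N lines of the combined stream (default EVIDENCE_LOG_TAIL_LINES or 50).
merged_log_tail() {
  local previous_file="$1" current_file="$2"
  local lines="${3:-${EVIDENCE_LOG_TAIL_LINES:-50}}"
  {
    if [ -f "$previous_file" ]; then cat "$previous_file"; fi
    if [ -f "$current_file" ]; then cat "$current_file"; fi
  } | tail -n "$lines"
}
```

Concatenating before truncating means a short current log is padded with the end of the previous instance's output, which is where crash output tends to live.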
jq gotcha worth noting: `.[-n:]` with `n` as a variable does not compile
("n/0 is not defined") because jq parses `-n` as expression-minus-function.
The correct slice is `.[-$n:]` with the `$` prefix.
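A small reproduction of the working form, assuming the count is passed with `--argjson`. `last_n` is a hypothetical wrapper name.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: keep the last N elements of a JSON array read on stdin.
# The count is bound as the jq variable $n, so the slice is .[-$n:].
last_n() {
  jq -c --argjson n "$1" '.[-$n:]'
}

echo '[1,2,3,4,5]' | last_n 2   # prints [4,5]
```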
Full diagnose suite: 356/356 green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Move application logs out of evidence, keep them only in check stdout
Previously the application log text was duplicated: it lived inside
evidence.details.pods[].logs (canonical for AI) and was also echoed to
stdout (so the UI's check.logs[] tail could show it). For a single-pod
scope that meant the same ~45 lines appearing twice in the result payload.
Consolidate to a single source: the check.logs[] tail. evidence.details
now carries only counters (pods_with_logs, problematic_pod_count) and the
list of pods that produced logs is published via evidence.affected. No
log text in evidence at all.
The trade-off is the existing 20-line cap inside update_check_result —
the UI sees the last 20 non-empty stdout lines of the check, which means
roughly the last 17-18 log lines plus the check's own diagnostic prints.
Sufficient for the typical single-pod scope; if that proves too tight,
we can revisit the cap in diagnose_utils.
Tests reshaped: 9 cases covering skipped/empty paths, sidecar exclusion
in stdout, evidence.details exposing exactly {pods_with_logs,
problematic_pod_count} (anchor against log text leaking back in),
chronological merge of previous before current, the 50-line cap on the
echoed tail, and multi-pod aggregation via affected[].
Full diagnose suite: 356/356 green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Allow checks to override the 20-line cap on captured stdout
The Application Log Evidence check echoes the application log tail to
stdout so the diagnose UI can show it in check.logs[]. But the existing
20-line cap inside update_check_result chops most of the payload off —
for a typical single-pod scope with 50 log lines plus a few diagnostic
prints, only ~17 log lines survive in the UI.
Add an opt-in --log-tail-lines flag on update_check_result. Default
stays at 20 (no impact on the other 19 checks). The logs check passes
--log-tail-lines 200, which fits a few pods worth of output plus the
check's own orchestrator/info lines.
- diagnose_utils: parse --log-tail-lines, use it in the tail call;
preserve the positional and --status/--evidence APIs unchanged.
- logs/application_log_evidence: pass --log-tail-lines 200 to every
update_check_result invocation on a path that emits log text. The
skipped path keeps the default 20.
- diagnose_utils.bats: rename existing test to "by default" and add two
new cases: an 80-line override over 100 input lines, and a 5-line cap
preserving the most recent lines.
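The opt-in flag could be parsed along these lines. This is a reduced sketch: `capture_stdout_tail` is a hypothetical stand-in that handles only `--log-tail-lines`, whereas the real update_check_result also accepts positional and --status/--evidence arguments.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the tail logic inside update_check_result.
# Reads check stdout on stdin and keeps the last N non-empty lines (default 20).
capture_stdout_tail() {
  local tail_lines=20
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --log-tail-lines)
        [ "$#" -ge 2 ] || { echo "--log-tail-lines needs a value" >&2; return 2; }
        tail_lines="$2"
        shift 2
        ;;
      *)
        shift   # other arguments are ignored in this sketch
        ;;
    esac
  done
  grep -v '^[[:space:]]*$' | tail -n "$tail_lines"
}
```

Because the default stays at 20, callers that never pass the flag see no change in behavior.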
Full diagnose suite: 358/358 green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(changelog): note structured evidence and Application Logs in k8s/diagnose
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(changelog): drop "AI post-mortem" framing from diagnose entry
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…u-limits feat: add configurable cpu/ram limits to k8s scope (CLIEN-781)
This reverts commit dc679d2.
fix(k8s,scheduled_task): file-type parameter no longer leaks binary as env var
Update release date in changelog
ignacioboud
approved these changes
Jun 8, 2026
No description provided.