
Release 1.12.0 #189

Merged
fedemaleh merged 64 commits into main from beta
Jun 9, 2026

Conversation

@fedemaleh

Collaborator

No description provided.

geisbruch and others added 30 commits April 16, 2026 19:44
scheduled_task scopes do not expose HTTP traffic via ALB, so the ALB
capacity and target group capacity validations from the base k8s
workflows are unnecessary. Override them with `action: skip` in the
scheduled_task overlays and add structural tests that lock the contract
with upstream step names — if a base step is renamed, the test fails
instead of silently re-enabling the validation.

Also adds .vscode/ to .gitignore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…heduled-tasks

Skip ALB capacity validations in scheduled_task workflows
…t need it

Deployment actions like switch-traffic, kill-instances, and diagnose-deployment
are purely operational and don't require application parameters. All scope actions
(create, update, delete, etc.) deal with infrastructure, not app config.
The flag is only added when the CLI supports it, preserving backward compatibility.
…nd_throubleshoot

Features/kubectl read command for troubleshooting
AWS ELBs expose DNS hostnames (type=Hostname), not IPs (type=IPAddress).
The manage_route script now falls back through four strategies:
1. Gateway IPAddress → A record
2. Gateway Hostname  → CNAME record
3. Service LB IP     → A record
4. Service LB hostname → CNAME record

The dns-endpoint.yaml.tpl now uses dynamic record_type (A or CNAME)
instead of hardcoded A, so DNSEndpoints are created correctly on AWS.
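
As a rough illustration of the fallback order above, the sketch below picks the first non-empty address and derives the record type from its kind. The function name `resolve_record` and its four positional arguments are hypothetical; the real manage_route script reads these values from the cluster and may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the actual manage_route code).
# Arguments, in priority order:
#   $1 gateway IPAddress, $2 gateway Hostname,
#   $3 Service LB IP,     $4 Service LB hostname
# Prints "<record_type> <target>" for the first non-empty candidate.
resolve_record() {
  local gateway_ip="$1" gateway_hostname="$2"
  local service_ip="$3" service_hostname="$4"

  if [ -n "$gateway_ip" ]; then
    echo "A $gateway_ip"
  elif [ -n "$gateway_hostname" ]; then
    echo "CNAME $gateway_hostname"
  elif [ -n "$service_ip" ]; then
    echo "A $service_ip"
  elif [ -n "$service_hostname" ]; then
    echo "CNAME $service_hostname"
  else
    # Nothing resolvable yet: let the caller decide whether to retry.
    return 1
  fi
}
```

On AWS, where only a hostname is exposed, `resolve_record "" "my-elb.example.com" "" ""` would yield a CNAME, which is the value the template's dynamic record_type needs.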

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When DNS_TYPE is external_dns, verify_networking_reconciliation was
skipping reconciliation entirely. Now it calls manage_route to resolve
the gateway address, applies the DNSEndpoint to the cluster, and
verifies HTTPRoute reconciliation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nslookup for external_dns

nslookup against 8.8.8.8 fails for private Route53 zones and domains
without public NS delegation. external-dns sets status.observedGeneration=1
once it processes the DNSEndpoint, which is a reliable signal that the
Route53 record was created.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rns cluster-internal address

Istio gateways report their status address as the ClusterIP service name
(gateway-public-istio.gateways.svc.cluster.local), not the external ALB
hostname. Added a fallback that reads the hostname from the ALB Ingress
(gateway-alb-public / gateway-alb-private) when a .svc.cluster.local
address is detected.
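
A minimal sketch of the detection step, assuming a plain suffix match is enough. The helper name `is_cluster_internal_address` is hypothetical and is not taken from the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the address is a cluster-internal
# Service DNS name, which cannot be used as a public DNS target.
is_cluster_internal_address() {
  case "$1" in
    *.svc.cluster.local) return 0 ;;
    *) return 1 ;;
  esac
}
```

A caller could use it as `if is_cluster_internal_address "$address"; then ...read the ALB Ingress hostname instead...; fi`.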

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ddress

The previous approach resolved the gateway address first and only checked
the ALB Ingress when the result was cluster-internal. Reversed the priority:
ALB Ingress (gateway-alb-<suffix>) is checked first since it's the
AWS-specific override. If not present, falls back to the standard gateway
address resolution chain (IPAddress → Hostname → Service IP → Service hostname),
which is the common case for environments with a real external gateway.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…test with correct ALB target

The previous nslookup against 8.8.8.8 was failing because the CNAME pointed
to a cluster-internal gateway address. Now that manage_route resolves the
real ALB hostname first, this tests whether public DNS resolution works correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… line

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ne record creation

Both external-dns-public and external-dns-private controllers process all
DNSEndpoints, causing public scope records to appear in the private hosted
zone and vice versa. Add a dns/zone-type label (public|private) derived from
SCOPE_VISIBILITY so each controller can filter only the records it owns via
--label-filter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ead of nslookup

Private scopes use the internal Route53 hosted zone which is not resolvable
via public DNS (8.8.8.8). Poll status.observedGeneration on the DNSEndpoint
instead — set to >=1 by external-dns when the record is processed.
Public scopes keep the existing nslookup-based check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ns scopes

Replace nslookup-based DNS resolution check with DNSEndpoint observedGeneration
polling for all scopes (public and private). nslookup was unreliable due to high
cluster DNS cache TTL. observedGeneration is set by external-dns when it processes
the record — faster and works consistently regardless of zone visibility.
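
The polling idea can be sketched as below. This is an illustration only: `wait_for_observed_generation` and its arguments are hypothetical, and the value reader is passed in as a command name so the loop can be exercised without a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical polling loop (not the actual wait_on_balancer code).
#   $1 name of a command/function that prints status.observedGeneration
#   $2 maximum attempts (default 30)
#   $3 seconds between attempts (default 2)
# In a cluster, the reader might wrap something like:
#   kubectl get dnsendpoint "$name" -n "$namespace" \
#     -o jsonpath='{.status.observedGeneration}'
wait_for_observed_generation() {
  local reader="$1" max_attempts="${2:-30}" interval="${3:-2}"
  local attempt=1 generation

  while [ "$attempt" -le "$max_attempts" ]; do
    generation="$("$reader" 2>/dev/null || true)"
    case "$generation" in
      ''|*[!0-9]*)
        ;; # empty or non-numeric: not processed yet
      *)
        if [ "$generation" -ge 1 ]; then
          return 0
        fi
        ;;
    esac
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

Because the check reads Kubernetes object status rather than resolving a name, it does not depend on DNS caches or on whether the hosted zone is public.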

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…erify step and improve endpoint naming

- Remove manage_route call and kubectl apply from verify_networking_reconciliation;
  DNS record creation belongs to scope creation flow, not deployment verification
- Include application.slug in DNSEndpoint name (k8s-{app}-{scope}-{id}-dns) to
  distinguish scopes with the same name across different apps
- Truncate app/scope slugs to 20 chars each to respect K8s name length limits
- Update dns-endpoint.yaml.tpl to use new naming via gomplate strings.Trunc
- Fix wait_on_balancer.bats: rewrite tests to match observedGeneration logic
  (previous tests referenced removed nslookup checks)
- Fix manage_route.bats: correct wrong log message assertions and update
  expected DNSEndpoint name to new format
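
For reference, the naming rule can be approximated in bash as follows. The commit does the truncation in the template via gomplate's strings.Trunc; this sketch swaps in bash substring expansion, and `dns_endpoint_name` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical bash equivalent of the template naming rule:
#   k8s-{app}-{scope}-{id}-dns, with app and scope slugs cut to 20 chars.
dns_endpoint_name() {
  local app_slug="$1" scope_slug="$2" scope_id="$3"
  echo "k8s-${app_slug:0:20}-${scope_slug:0:20}-${scope_id}-dns"
}
```

With both slugs capped at 20 characters, the fixed parts and the id are the only other contributors to the final name length.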

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ption

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(dns): support external-dns with DNSEndpoint CRD and zone-type label filtering
fedemaleh and others added 28 commits May 21, 2026 13:15
* Capture deployments, replicasets, pod logs and describe in diagnose snapshot

Extends build_context to capture the resources needed for a complete post-mortem:
- deployments.json and replicasets.json scoped by deployment_id (so we see rollout
  state even when no pods got created)
- For every pod identified as problematic (CrashLoop / OOM / ImagePullBackOff /
  Terminated / restartCount>0 / not-Ready / terminating), capture:
    - kubectl describe pod -> data/pod_describe/<pod>.txt
    - kubectl logs (current + --previous) for every container, including init
      containers, into data/pod_logs/<pod>.<container>[.previous].log

Tail size is configurable via POD_LOG_TAIL_LINES (default 500). All new files live
under data/, so notify_results continues to exclude them from the backend payload.
The data is consumed by downstream checks and (in a follow-up) embedded into
evidence for AI consumption.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Emit structured evidence in all diagnose checks

Until now every check emitted update_check_result with an empty {} evidence
payload, leaving only printf'd stdout for downstream consumers (UI / AI). With
20 checks all producing colored ANSI text, neither a frontend nor an LLM could
reliably extract counts, names, exit codes, or remediation steps.

This change defines a canonical evidence schema in diagnose_utils:
    {
      summary:           "one-line human summary",
      severity:          "critical" | "warning" | "info",
      affected:          ["resource-names"],
      details:           { check-specific structured data },
      suggested_actions: ["actionable guidance"]
    }

Helpers:
  - evidence_json(summary, severity, affected, details, actions): builds the
    schema with safe defaults
  - exit_code_meaning(code): maps 0/1/137/139/143 → human-readable, reused
    across crash, OOM, and termination checks
  - require_resources updated so the "skipped" path also emits schema evidence
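
A possible shape for the exit-code helper is sketched below. The codes are the ones listed above; the message strings are assumptions and may differ from what diagnose_utils actually prints.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of exit_code_meaning; message wording is assumed.
# Codes above 128 follow the shell convention 128 + signal number.
exit_code_meaning() {
  case "$1" in
    0)   echo "exited successfully" ;;
    1)   echo "general application error" ;;
    137) echo "killed by SIGKILL (128 + 9), commonly an OOM kill" ;;
    139) echo "segmentation fault, SIGSEGV (128 + 11)" ;;
    143) echo "terminated by SIGTERM (128 + 15)" ;;
    *)   echo "unknown exit code $1" ;;
  esac
}
```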

All 20 checks migrated. Each preserves its existing stdout output (so no
regressions for users tailing logs) and additionally builds details with the
data already extracted: pod names, container names, exit codes, restart counts,
endpoint counts, ingress backends, certificate ARNs, etc. Severity is mapped
from status (failed→critical, warning→warning, success/skipped→info), allowing
the AI summarizer to prioritize what matters.
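
The status-to-severity mapping is small enough to show as a sketch. `severity_for_status` is a hypothetical name, and falling back to "info" for an unrecognized status is an assumption, not something the commit states.

```shell
#!/usr/bin/env bash
# Hypothetical mapping helper: failed -> critical, warning -> warning,
# success/skipped -> info. Unknown statuses default to info (assumption).
severity_for_status() {
  case "$1" in
    failed)          echo "critical" ;;
    warning)         echo "warning" ;;
    success|skipped) echo "info" ;;
    *)               echo "info" ;;
  esac
}
```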

Side effects:
  - Fixes a pre-existing bug in ingress_tls_configuration that read tls.crt
    from .metadata.annotations | keys[] (which never contains them, and where
    build_context strips .data anyway). Now relies on Secret type validation.
  - Adds tests/evidence_schema.bats: cross-cutting validation that every check
    in scope/, service/, and networking/ emits a schema-conformant payload on
    skipped, failed, and success paths.
  - Updates existing test files where they previously asserted on legacy
    flat evidence fields (.evidence.tested, .evidence.ready) to point at the
    new nested location (.evidence.details.*).

Suite: 280 tests, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Embed pod logs in failed-check evidence for AI post-mortem

Without logs in the evidence payload, the AI summarizer would have to fetch
them separately for every diagnose run. By the time the summary is requested,
the cluster state has already moved on (rollback fired, pods churned), so live
logs would be misleading. Instead, embed the relevant log slice from the
build_context snapshot directly into the failing check's evidence, so the AI
gets a self-contained post-mortem in a single payload.

Helper: read_log_tail(pod, container, "current"|"previous", [lines]) reads
from data/pod_logs/ and returns a JSON array of lines. Returns [] when the
file is missing (most common case: no previous log because container never
crashed). Truncation is configurable via EVIDENCE_LOG_TAIL_LINES (default 50,
intentionally smaller than the 500-line build_context capture so the payload
stays bounded).

Five checks now embed logs on their failure paths:
  - container_crash_detection: previous_logs (CrashLoopBackOff, high-restart)
    and current_logs + previous_logs (terminated). Previous is where the crash
    output lives — current is empty during the restart loop.
  - memory_limits_check: previous_logs on OOMKilled. The kubelet restarts the
    container after the kill, so OOM-relevant output is in the previous instance.
  - health_probe_endpoints: container_logs (current) on every probe failure
    (4xx, 5xx, connection refused). Pairs the probe verdict with what the app
    was printing.
  - container_port_health: container_logs (current) on port_not_listening
    issues. Container is running but not bound — current logs typically show
    why (binding error, config mismatch).
  - pod_readiness: current_logs of the first container for stuck (not_ready)
    pods, but NOT for normally-starting pods (avoids noise during rollouts).

Deliberate exclusions:
  - Success paths never embed logs (keeps payload light for healthy scopes).
  - image_pull_status doesn't embed: if the image couldn't be pulled, there
    is no container and no logs.
  - networking/ and service/ checks don't embed: their failures are
    configuration issues, not application issues.

Tests: +8 covering the helper, the embedding behavior, and a regression test
asserting the success path stays log-free. Suite: 288 tests, 0 failures (10
environmental skips on macOS dev hosts where nc/timeout from coreutils aren't
in PATH).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Dedup mark_affected and replace jq-loop accumulators with bash arrays

The first pass landed evidence enrichment quickly but at the cost of two
duplications visible across all 17 non-existence checks:

  - mark_affected was redefined locally in every check (18 copies of the same
    3-line jq-dedup function, differing only by the array variable name).
  - Each check accumulated facts via the same per-iteration jq round-trip
    pattern: FACTS=$(echo "$FACTS" | jq --argjson f "$x" '. + [$f]'). This
    is O(N²) (the JSON array is reparsed and reserialized on every push) and
    forks one jq process per iteration. ~60 such call sites; for a failing
    scope with 10 problematic pods, that's hundreds of jq forks per check.

This commit moves both into diagnose_utils:

  - mark_affected <set_name> <value>     — adds to a space-separated set
                                            stored in a bash variable, dedup
                                            on add (no jq).
  - set_to_json_array <set_name>         — converts the set to a JSON array
                                            in a single jq call.
  - add_fact <array_name> <json_string>  — bash array append, no jq.
  - facts_to_json_array <array_name>     — converts the array to a JSON array
                                            in a single jq -s call at the end
                                            of accumulation.
  - lines_to_json_array                  — extracted shared filter for the
                                            tail|jq -R -s 'split("\n")...'
                                            pipeline that update_check_result
                                            and read_log_tail both used.

All 17 affected checks were migrated. The 18 local mark_affected copies are
gone; check-level accumulator code shrunk from "jq-merge per iteration" to
"bash append per iteration, jq once at end".

Bash 3.2 compatibility: helpers use eval-based pass-by-name rather than
declare -n / declare -A (which require bash 4.3+ / 4.0+). Production runtime
on Alpine has bash 5.x, but local dev tests on macOS run /bin/bash 3.2.
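
To illustrate the eval-based pass-by-name approach, here is a minimal sketch of two of the helpers. The bodies are assumptions written to match the descriptions above, not copies of diagnose_utils; they avoid declare -n and declare -A so they should also run on bash 3.2.

```shell
#!/usr/bin/env bash
# Hypothetical sketches. Caveats: values must not contain spaces, and the
# caller's variable must not be named like one of the locals below.

# mark_affected <set_name> <value>
# Adds <value> to the space-separated set stored in the variable named
# <set_name>, skipping values that are already present.
mark_affected() {
  local set_name="$1" value="$2" current
  eval "current=\"\${$set_name:-}\""
  case " $current " in
    *" $value "*) return 0 ;; # already in the set
  esac
  if [ -z "$current" ]; then
    eval "$set_name=\"\$value\""
  else
    eval "$set_name=\"\$current \$value\""
  fi
}

# add_fact <array_name> <json_string>
# Appends one JSON string to the bash array named <array_name>; no jq fork.
add_fact() {
  local array_name="$1" fact="$2"
  eval "$array_name+=(\"\$fact\")"
}
```

A check would call these once per iteration, for example `mark_affected AFFECTED "$pod"` and `add_fact FACTS "$fact_json"`, and convert to JSON a single time after the loop.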

Suite: 288 tests, 0 failures, 10 environmental skips. No behavior change —
both the human stdout and the evidence JSON shape are byte-identical to the
pre-simplify baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add Application Logs diagnose category for AI post-mortem

New diagnose category that publishes the application's own log output as
structured evidence, contextualized with the pod and container state at
fail-time. Unlike the scope/ checks that detect specific failure modes and
embed logs as secondary evidence, this check is log-first: a single,
self-contained category the AI summarizer can read to say "here is the
issue, look at this" without cross-referencing other checks.

- k8s/diagnose/logs/workflow.yml: declares the category "Application Logs"
  with a single step that runs application_log_evidence.
- k8s/diagnose/logs/application_log_evidence: iterates problematic pods
  from the build_context snapshot (no live kubectl), reads current and
  previous logs per container (init + regular) via read_log_tail, and
  emits a fact per container with the schema:
    { pod, pod_phase, pod_reason,
      container, init_container,
      container_state, restart_count,
      current_state_reason, last_termination_reason,
      last_exit_code, last_exit_code_meaning,
      current_logs, previous_logs }
  Status is always success/skipped (info severity). The check never fails:
  absence of logs is itself meaningful information ("image never started").
  Reuses exit_code_meaning from diagnose_utils for the meaning string.
- k8s/scope/workflows/diagnose.yaml and
  k8s/deployment/workflows/diagnose.yaml: register the new folder in the
  executor so the category appears automatically alongside Scope/Service/
  Networking. notify_results groups by category, no backend changes
  required.
- k8s/diagnose/tests/logs/application_log_evidence.bats: 10 tests covering
  skipped path, empty problematic list, current logs only, previous logs,
  init container flag, no-logs-available, multi-pod aggregation, empty
  log files, CrashLoopBackOff context, and pod_reason from Ready condition.
- k8s/diagnose/tests/evidence_schema.bats: +1 cross-cutting test asserting
  the check emits a schema-conformant evidence object on the skipped path.

Full suite: 357/357 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Focus application_log_evidence on the 'application' container only

Narrow the Application Logs check to its essential job: publish the user-owned
container's logs for AI post-mortem. The previous shape duplicated metadata
already emitted by the scope/ checks (pod_phase, container_state, restart
counts, exit codes, etc.) and iterated every container in every problematic
pod — including sidecars like 'http' nginx whose logs already appear in
Health Probe Endpoints and Container Port Health.

- Filters by container name 'application' (the literal name set in
  k8s/deployment/templates/deployment.yaml.tpl). Sidecars and init containers
  are out of scope; this check is not a per-container audit.
- Per-pod payload shrinks from 12 fields to 2: { pod, logs }.
- current and previous logs are merged in chronological order (previous
  first, current second) and truncated to the last EVIDENCE_LOG_TAIL_LINES
  (default 50). One flat array — the AI does not need to know which container
  instance produced which line; the user wanted the tail of the application
  output, period.
- Tests updated: 9 cases covering skipped/empty paths, application-only
  filtering (asserts sidecar logs do not leak), previous+current merge in
  order, the 50-line cap, multi-pod aggregation, and a schema-pinning test
  that asserts the pod entry exposes exactly {pod, logs}.
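
The merge-then-truncate step can be sketched as below, assuming the logs are plain text files on disk. `merge_log_tail` is a hypothetical name; the real check goes through read_log_tail and produces JSON.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: concatenate previous then current logs (chronological
# order) and keep only the last N lines.
#   $1 previous log file, $2 current log file,
#   $3 line cap (default: EVIDENCE_LOG_TAIL_LINES, else 50)
merge_log_tail() {
  local previous_file="$1" current_file="$2"
  local lines="${3:-${EVIDENCE_LOG_TAIL_LINES:-50}}"
  {
    if [ -f "$previous_file" ]; then cat "$previous_file"; fi
    if [ -f "$current_file" ]; then cat "$current_file"; fi
  } | tail -n "$lines"
}
```

Because the cap is applied after the merge, a long current log can push the previous instance's lines out of the tail entirely.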

jq gotcha worth noting: `.[-n:]` with `n` as a variable does not compile
("n/0 is not defined") because jq parses `-n` as expression-minus-function.
The correct slice is `.[-$n:]` with the `$` prefix.

Full diagnose suite: 356/356 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Move application logs out of evidence, keep them only in check stdout

Previously the application log text was duplicated: it lived inside
evidence.details.pods[].logs (canonical for AI) and was also echoed to
stdout (so the UI's check.logs[] tail could show it). For a single-pod
scope that meant the same ~45 lines appearing twice in the result payload.

Consolidate to a single source: the check.logs[] tail. evidence.details
now carries only counters (pods_with_logs, problematic_pod_count) and the
list of pods that produced logs is published via evidence.affected. No
log text in evidence at all.

The trade-off is the existing 20-line cap inside update_check_result —
the UI sees the last 20 non-empty stdout lines of the check, which means
roughly the last 17-18 log lines plus the check's own diagnostic prints.
Sufficient for the typical single-pod scope; if that proves too tight,
we can revisit the cap in diagnose_utils.

Tests reshaped: 9 cases covering skipped/empty paths, sidecar exclusion
in stdout, evidence.details exposing exactly {pods_with_logs,
problematic_pod_count} (anchor against log text leaking back in),
chronological merge of previous before current, the 50-line cap on the
echoed tail, and multi-pod aggregation via affected[].

Full diagnose suite: 356/356 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Allow checks to override the 20-line cap on captured stdout

The Application Log Evidence check echoes the application log tail to
stdout so the diagnose UI can show it in check.logs[]. But the existing
20-line cap inside update_check_result chops most of the payload off —
for a typical single-pod scope with 50 log lines plus a few diagnostic
prints, only ~17 log lines survive in the UI.

Add an opt-in --log-tail-lines flag on update_check_result. Default
stays at 20 (no impact on the other 19 checks). The logs check passes
--log-tail-lines 200, which fits a few pods worth of output plus the
check's own orchestrator/info lines.

- diagnose_utils: parse --log-tail-lines, use it in the tail call;
  preserve the positional and --status/--evidence APIs unchanged.
- logs/application_log_evidence: pass --log-tail-lines 200 to every
  update_check_result invocation on a path that emits log text. The
  skipped path keeps the default 20.
- diagnose_utils.bats: rename existing test to "by default" and add two
  new cases: an 80-line override over 100 input lines, and a 5-line cap
  preserving the most recent lines.
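
A stripped-down sketch of the opt-in flag, covering only the stdout-tail portion. `tail_check_output` is a hypothetical stand-in for that part of update_check_result, which also handles status and evidence.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: read captured stdout on stdin, drop blank lines, and
# keep the last N lines. N defaults to 20 unless --log-tail-lines is given.
tail_check_output() {
  local log_tail_lines=20
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --log-tail-lines)
        if [ "$#" -lt 2 ]; then
          echo "--log-tail-lines requires a value" >&2
          return 2
        fi
        log_tail_lines="$2"
        shift 2
        ;;
      *)
        shift # other arguments are ignored in this sketch
        ;;
    esac
  done
  grep -v '^[[:space:]]*$' | tail -n "$log_tail_lines"
}
```

Callers that never pass the flag keep the 20-line behaviour, which is why the other checks are unaffected.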

Full diagnose suite: 358/358 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(changelog): note structured evidence and Application Logs in k8s/diagnose

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(changelog): drop "AI post-mortem" framing from diagnose entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…u-limits

feat: add configurable cpu/ram limits to k8s scope (CLIEN-781)
fix(k8s,scheduled_task): file-type parameter no longer leaks binary as env var
@fedemaleh fedemaleh merged commit 6e4077d into main Jun 9, 2026
3 checks passed
7 participants