Release 1.12.0 by fedemaleh · Pull Request #189 · nullplatform/scopes

fedemaleh · 2026-06-08T19:18:09Z

No description provided.

scheduled_task scopes do not expose HTTP traffic via ALB, so the ALB capacity and target group capacity validations from the base k8s workflows are unnecessary. Override them with `action: skip` in the scheduled_task overlays and add structural tests that lock the contract with upstream step names: if a base step is renamed, the test fails instead of silently re-enabling the validation.

Also adds .vscode/ to .gitignore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
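The structural test described above could look roughly like the following. This is a minimal sketch in Python rather than the repository's own test tooling, and the step names and the `unmatched_skips` helper are hypothetical, chosen only to show how a renamed base step would be caught.

```python
def unmatched_skips(base_steps, overlay_steps):
    """Return names of overlay `action: skip` overrides with no matching base step."""
    base_names = {step["name"] for step in base_steps}
    return [
        step["name"]
        for step in overlay_steps
        if step.get("action") == "skip" and step["name"] not in base_names
    ]


# Hypothetical step names; the real ones live in the base k8s workflows.
base = [
    {"name": "validate alb capacity"},
    {"name": "validate target group capacity"},
    {"name": "build context"},
]
overlay = [
    {"name": "validate alb capacity", "action": "skip"},
    {"name": "validate target group capacity", "action": "skip"},
]

# Simulate an upstream rename of the first base step.
renamed_base = [{"name": "check alb capacity"}] + base[1:]

print(unmatched_skips(base, overlay))
print(unmatched_skips(renamed_base, overlay))
```

A test that asserts the returned list is empty fails as soon as an override stops matching, which is the failure mode the commit wants to surface.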


…t need it

Deployment actions like switch-traffic, kill-instances, and diagnose-deployment are purely operational and don't require application parameters. All scope actions (create, update, delete, etc.) deal with infrastructure, not app config. The flag is only added when the CLI supports it, preserving backward compatibility.


AWS ELBs expose DNS hostnames (type=Hostname), not IPs (type=IPAddress). The manage_route script now falls back through four strategies:

1. Gateway IPAddress → A record
2. Gateway Hostname → CNAME record
3. Service LB IP → A record
4. Service LB hostname → CNAME record

The dns-endpoint.yaml.tpl now uses dynamic record_type (A or CNAME) instead of hardcoded A, so DNSEndpoints are created correctly on AWS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
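The four-step fallback can be sketched as follows. This is an illustrative Python rendering, not the manage_route script itself; the function name and the input shapes (a list of Gateway status addresses and a list of Service load balancer ingress entries) are assumptions.

```python
from typing import Optional, Tuple


def resolve_dns_target(gateway_addresses, service_ingress) -> Optional[Tuple[str, str]]:
    """Return (record_type, target) from the first strategy that matches, else None."""
    # Strategies 1 and 2: addresses reported in the Gateway status.
    for wanted, record_type in (("IPAddress", "A"), ("Hostname", "CNAME")):
        for address in gateway_addresses:
            if address.get("type") == wanted and address.get("value"):
                return record_type, address["value"]
    # Strategies 3 and 4: Service load balancer ingress entries.
    for key, record_type in (("ip", "A"), ("hostname", "CNAME")):
        for entry in service_ingress:
            if entry.get(key):
                return record_type, entry[key]
    return None


# Example inputs (placeholder hostnames and documentation-range IPs).
aws_gateway = [{"type": "Hostname", "value": "my-elb.example.com"}]
ip_gateway = [{"type": "IPAddress", "value": "203.0.113.10"}]
service_only = [{"hostname": "svc-lb.example.com"}]

print(resolve_dns_target(aws_gateway, []))
print(resolve_dns_target(ip_gateway, []))
print(resolve_dns_target([], service_only))
```

The returned record type is what a template with a dynamic record_type would consume in place of a hardcoded A.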

When DNS_TYPE is external_dns, verify_networking_reconciliation was skipping reconciliation entirely. Now it calls manage_route to resolve the gateway address, applies the DNSEndpoint to the cluster, and verifies HTTPRoute reconciliation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nslookup for external_dns

nslookup against 8.8.8.8 fails for private Route53 zones and domains without public NS delegation. external-dns sets status.observedGeneration=1 once it processes the DNSEndpoint, which is a reliable signal that the Route53 record was created.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rns cluster-internal address

Istio gateways report their status address as the ClusterIP service name (gateway-public-istio.gateways.svc.cluster.local), not the external ALB hostname. Added a fallback that reads the hostname from the ALB Ingress (gateway-alb-public / gateway-alb-private) when a .svc.cluster.local address is detected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ddress

The previous approach resolved the gateway address first and only checked the ALB Ingress when the result was cluster-internal. Reversed the priority: ALB Ingress (gateway-alb-<suffix>) is checked first since it's the AWS-specific override. If not present, falls back to the standard gateway address resolution chain (IPAddress → Hostname → Service IP → Service hostname), which is the common case for environments with a real external gateway.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
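The reversed priority can be sketched as a small selector. This is a hedged Python illustration: `choose_route_target` and its inputs are hypothetical, and it assumes the ALB Ingress hostname maps to a CNAME record, in line with the Hostname → CNAME rule of the fallback chain.

```python
def choose_route_target(alb_ingress_hostname, fallback_strategies):
    """Prefer the ALB Ingress hostname; otherwise walk the standard chain in order."""
    if alb_ingress_hostname:
        # AWS-specific override: the ALB hostname wins when the Ingress exists.
        return ("CNAME", alb_ingress_hostname)
    for strategy in fallback_strategies:
        result = strategy()
        if result is not None:
            return result
    return None


# With an ALB Ingress present, the cluster-internal gateway address is never consulted.
aws = choose_route_target(
    "internal-alb.example.com",
    [lambda: ("CNAME", "gateway-public-istio.gateways.svc.cluster.local")],
)

# Without one, the first strategy that yields a result is used.
other = choose_route_target(None, [lambda: None, lambda: ("A", "203.0.113.10")])

print(aws)
print(other)
```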

…test with correct ALB target

The previous nslookup against 8.8.8.8 was failing because the CNAME pointed to a cluster-internal gateway address. Now that manage_route resolves the real ALB hostname first, this change tests whether public DNS resolution works correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


…ne record creation

Both external-dns-public and external-dns-private controllers process all DNSEndpoints, causing public scope records to appear in the private hosted zone and vice versa. Add a dns/zone-type label (public|private) derived from SCOPE_VISIBILITY so each controller can filter only the records it owns via --label-filter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
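The labeling and its effect on each controller can be illustrated as below. This is a sketch only: the function names are hypothetical, the endpoint names are made up, and treating every SCOPE_VISIBILITY value other than "public" as private is an assumption rather than something the commit states.

```python
def zone_type_label(scope_visibility):
    """Derive the dns/zone-type label from the scope visibility."""
    # Assumption: any value other than "public" is treated as private.
    zone_type = "public" if scope_visibility == "public" else "private"
    return {"dns/zone-type": zone_type}


def endpoints_for(controller_zone_type, endpoints):
    """Names of the DNSEndpoints a controller filtering on this label would process."""
    return [
        endpoint["name"]
        for endpoint in endpoints
        if endpoint["labels"].get("dns/zone-type") == controller_zone_type
    ]


endpoints = [
    {"name": "k8s-shop-api-1-dns", "labels": zone_type_label("public")},
    {"name": "k8s-shop-admin-2-dns", "labels": zone_type_label("private")},
]

print(endpoints_for("public", endpoints))
print(endpoints_for("private", endpoints))
```

With the label in place, each record is owned by exactly one controller, so public records no longer appear in the private hosted zone or the reverse.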

…ead of nslookup

Private scopes use the internal Route53 hosted zone, which is not resolvable via public DNS (8.8.8.8). Poll status.observedGeneration on the DNSEndpoint instead; external-dns sets it to >=1 when the record is processed. Public scopes keep the existing nslookup-based check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ns scopes

Replace nslookup-based DNS resolution check with DNSEndpoint observedGeneration polling for all scopes (public and private). nslookup was unreliable due to high cluster DNS cache TTL. observedGeneration is set by external-dns when it processes the record, which is faster and works consistently regardless of zone visibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
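The polling approach can be sketched as a bounded retry loop. This Python version is an illustration under assumptions: `wait_for_dns_endpoint`, its defaults, and the injected `fetch_status` callable (standing in for whatever reads the DNSEndpoint status from the cluster) are hypothetical.

```python
import time


def wait_for_dns_endpoint(fetch_status, attempts=30, delay_seconds=2.0, sleep=time.sleep):
    """Poll until status.observedGeneration >= 1 or the attempts run out."""
    for attempt in range(attempts):
        status = fetch_status() or {}
        if (status.get("observedGeneration") or 0) >= 1:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


# Simulated status reads: missing object, empty status, then processed.
responses = iter([None, {}, {"observedGeneration": 1}])
processed = wait_for_dns_endpoint(lambda: next(responses), sleep=lambda _: None)

# A record that is never processed exhausts the attempts.
never = wait_for_dns_endpoint(lambda: {}, attempts=3, sleep=lambda _: None)

print(processed, never)
```

Because the signal comes from the DNSEndpoint object rather than a resolver, the same loop serves public and private zones.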

…erify step and improve endpoint naming

- Remove manage_route call and kubectl apply from verify_networking_reconciliation; DNS record creation belongs to scope creation flow, not deployment verification
- Include application.slug in DNSEndpoint name (k8s-{app}-{scope}-{id}-dns) to distinguish scopes with the same name across different apps
- Truncate app/scope slugs to 20 chars each to respect K8s name length limits
- Update dns-endpoint.yaml.tpl to use new naming via gomplate strings.Trunc
- Fix wait_on_balancer.bats: rewrite tests to match observedGeneration logic (previous tests referenced removed nslookup checks)
- Fix manage_route.bats: correct wrong log message assertions and update expected DNSEndpoint name to new format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
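The naming rule from the list above can be expressed in a few lines. This is a Python sketch of the pattern, not the gomplate template; the function name is hypothetical and `scope_id` is assumed to be the `{id}` placeholder.

```python
def dns_endpoint_name(app_slug, scope_slug, scope_id, max_slug_len=20):
    """Build k8s-{app}-{scope}-{id}-dns with each slug truncated to max_slug_len chars."""
    return f"k8s-{app_slug[:max_slug_len]}-{scope_slug[:max_slug_len]}-{scope_id}-dns"


# Example slugs are made up; the first is longer than 20 characters.
name = dns_endpoint_name("payments-service-backend-api", "production", 12345)
print(name)
```

Two scopes named `production` under different applications now yield different DNSEndpoint names, as long as the first 20 characters of the application slugs differ.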



* Capture deployments, replicasets, pod logs and describe in diagnose snapshot

  Extends build_context to capture the resources needed for a complete post-mortem:

  - deployments.json and replicasets.json scoped by deployment_id (so we see rollout state even when no pods got created)
  - For every pod identified as problematic (CrashLoop / OOM / ImagePullBackOff / Terminated / restartCount>0 / not-Ready / terminating), capture:
    - kubectl describe pod -> data/pod_describe/<pod>.txt
    - kubectl logs (current + --previous) for every container, including init containers, into data/pod_logs/<pod>.<container>[.previous].log

  Tail size is configurable via POD_LOG_TAIL_LINES (default 500).

  All new files live under data/, so notify_results continues to exclude them from the backend payload. The data is consumed by downstream checks and (in a follow-up) embedded into evidence for AI consumption.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Emit structured evidence in all diagnose checks

  Until now every check emitted update_check_result with an empty {} evidence payload, leaving only printf'd stdout for downstream consumers (UI / AI). With 20 checks all producing colored ANSI text, neither a frontend nor an LLM could reliably extract counts, names, exit codes, or remediation steps.

  This change defines a canonical evidence schema in diagnose_utils:

    {
      summary: "one-line human summary",
      severity: "critical" | "warning" | "info",
      affected: ["resource-names"],
      details: { check-specific structured data },
      suggested_actions: ["actionable guidance"]
    }

  Helpers:

  - evidence_json(summary, severity, affected, details, actions): builds the schema with safe defaults
  - exit_code_meaning(code): maps 0/1/137/139/143 → human-readable, reused across crash, OOM, and termination checks
  - require_resources updated so the "skipped" path also emits schema evidence

  All 20 checks migrated.
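As a rough illustration of the schema and the two helpers named above, here is a Python sketch. The real helpers live in diagnose_utils as bash functions; the default values, the fallback for unknown severities, and the wording of the exit-code descriptions below are assumptions, while the status-to-severity mapping (failed→critical, warning→warning, success/skipped→info) is the one the commit message describes.

```python
ALLOWED_SEVERITIES = ("critical", "warning", "info")

# Wording of these descriptions is illustrative, not the repository's.
EXIT_CODE_MEANINGS = {
    0: "exited normally",
    1: "general application error",
    137: "killed by SIGKILL (128 + 9), commonly an OOM kill",
    139: "segmentation fault, SIGSEGV (128 + 11)",
    143: "terminated by SIGTERM (128 + 15)",
}

STATUS_TO_SEVERITY = {
    "failed": "critical",
    "warning": "warning",
    "success": "info",
    "skipped": "info",
}


def exit_code_meaning(code):
    """Map a container exit code to a human-readable description."""
    return EXIT_CODE_MEANINGS.get(code, f"unrecognized exit code {code}")


def evidence_json(summary="", severity="info", affected=None, details=None,
                  suggested_actions=None):
    """Build the evidence object, filling safe defaults for anything omitted."""
    if severity not in ALLOWED_SEVERITIES:
        severity = "info"  # assumed fallback for unknown values
    return {
        "summary": summary,
        "severity": severity,
        "affected": list(affected or []),
        "details": dict(details or {}),
        "suggested_actions": list(suggested_actions or []),
    }


evidence = evidence_json(
    summary="1 container was OOMKilled",
    severity=STATUS_TO_SEVERITY["failed"],
    affected=["api-1"],
    details={"exit_code": 137, "exit_code_meaning": exit_code_meaning(137)},
    suggested_actions=["Raise the memory limit or reduce memory usage"],
)
print(evidence)
```

Keeping every key present even when empty lets a consumer read `affected` or `details` without checking for missing fields.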
  Each check preserves its existing stdout output (so no regressions for users tailing logs) and additionally builds details with the data already extracted: pod names, container names, exit codes, restart counts, endpoint counts, ingress backends, certificate ARNs, etc. Severity is mapped from status (failed→critical, warning→warning, success/skipped→info), allowing the AI summarizer to prioritize what matters.

  Side effects:

  - Fixes a pre-existing bug in ingress_tls_configuration that read tls.crt from .metadata.annotations | keys[] (which never contains them, and where build_context strips .data anyway). Now relies on Secret type validation.
  - Adds tests/evidence_schema.bats: cross-cutting validation that every check in scope/, service/, and networking/ emits a schema-conformant payload on skipped, failed, and success paths.
  - Updates existing test files where they previously asserted on legacy flat evidence fields (.evidence.tested, .evidence.ready) to point at the new nested location (.evidence.details.*).

  Suite: 280 tests, 0 failures.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Embed pod logs in failed-check evidence for AI post-mortem

  Without logs in the evidence payload, the AI summarizer would have to fetch them separately for every diagnose run. By the time the summary is requested, the cluster state has already moved on (rollback fired, pods churned), so live logs would be misleading. Instead, embed the relevant log slice from the build_context snapshot directly into the failing check's evidence, so the AI gets a self-contained post-mortem in a single payload.

  Helper: read_log_tail(pod, container, "current"|"previous", [lines]) reads from data/pod_logs/ and returns a JSON array of lines. Returns [] when the file is missing (most common case: no previous log because container never crashed).
  Truncation is configurable via EVIDENCE_LOG_TAIL_LINES (default 50, intentionally smaller than the 500-line build_context capture so the payload stays bounded).

  Five checks now embed logs on their failure paths:

  - container_crash_detection: previous_logs (CrashLoopBackOff, high-restart) and current_logs + previous_logs (terminated). Previous is where the crash output lives; current is empty during the restart loop.
  - memory_limits_check: previous_logs on OOMKilled. The kubelet restarts the container after the kill, so OOM-relevant output is in the previous instance.
  - health_probe_endpoints: container_logs (current) on every probe failure (4xx, 5xx, connection refused). Pairs the probe verdict with what the app was printing.
  - container_port_health: container_logs (current) on port_not_listening issues. Container is running but not bound; current logs typically show why (binding error, config mismatch).
  - pod_readiness: current_logs of the first container for stuck (not_ready) pods, but NOT for normally-starting pods (avoids noise during rollouts).

  Distinctions made deliberately:

  - Success paths never embed logs (keeps payload light for healthy scopes).
  - image_pull_status doesn't embed: if the image couldn't be pulled, there is no container and no logs.
  - networking/ and service/ checks don't embed: their failures are configuration issues, not application issues.

  Tests: +8 covering the helper, the embedding behavior, and a regression test asserting the success path stays log-free.

  Suite: 288 tests, 0 failures (10 environmental skips on macOS dev hosts where nc/timeout from coreutils aren't in PATH).
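The behavior described for read_log_tail (read a captured log file, return its last lines, return an empty array when the file is missing) can be sketched in Python as below. The real helper is a bash function; the explicit `log_dir` parameter is added here for the demonstration, and whether the real helper also drops blank lines is not something this sketch assumes.

```python
import tempfile
from pathlib import Path


def read_log_tail(log_dir, pod, container, which="current", lines=50):
    """Return the last `lines` lines of a captured pod log, or [] if it is missing."""
    suffix = ".previous" if which == "previous" else ""
    path = Path(log_dir) / f"{pod}.{container}{suffix}.log"
    if lines <= 0 or not path.is_file():
        return []
    return path.read_text(errors="replace").splitlines()[-lines:]


# Demonstration against a temporary directory standing in for data/pod_logs/.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "web-1.application.previous.log").write_text(
        "".join(f"line {i}\n" for i in range(1, 101))
    )
    previous_tail = read_log_tail(tmp, "web-1", "application", "previous", lines=3)
    missing = read_log_tail(tmp, "web-1", "application", "current")

print(previous_tail)
print(missing)
```

Returning an empty list for a missing file keeps callers free of existence checks, which matters because the absent previous log is the common case.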
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Dedup mark_affected and replace jq-loop accumulators with bash arrays

  The first pass landed evidence enrichment quickly but at the cost of two duplications visible across all 17 non-existence checks:

  - mark_affected was redefined locally in every check (18 copies of the same 3-line jq-dedup function, differing only by the array variable name).
  - Each check accumulated facts via the same per-iteration jq round-trip pattern: FACTS=$(echo "$FACTS" | jq --argjson f "$x" '. + [$f]'). This is O(N²) (the JSON array is reparsed and reserialized on every push) and forks one jq process per iteration. ~60 such call sites; for a failing scope with 10 problematic pods, that's hundreds of jq forks per check.

  This commit moves both into diagnose_utils:

  - mark_affected <set_name> <value>: adds to a space-separated set stored in a bash variable, dedup on add (no jq).
  - set_to_json_array <set_name>: converts the set to a JSON array in a single jq call.
  - add_fact <array_name> <json_string>: bash array append, no jq.
  - facts_to_json_array <array_name>: converts the array to a JSON array in a single jq -s call at the end of accumulation.
  - lines_to_json_array: extracted shared filter for the tail|jq -R -s 'split("\n")...' pipeline that update_check_result and read_log_tail both used.

  All 17 affected checks were migrated. The 18 local mark_affected copies are gone; check-level accumulator code shrunk from "jq-merge per iteration" to "bash append per iteration, jq once at end".

  Bash 3.2 compatibility: helpers use eval-based pass-by-name rather than declare -n / declare -A (which require bash 4.3+ / 4.0+). Production runtime on Alpine has bash 5.x, but local dev tests on macOS run /bin/bash 3.2.

  Suite: 288 tests, 0 failures, 10 environmental skips. No behavior change: both the human stdout and the evidence JSON shape are byte-identical to the pre-simplify baseline.
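The accumulator change can be illustrated outside bash. The following Python sketch mirrors the helper names from the commit, but it is an analogy rather than the actual implementation: it contrasts reparsing the whole array on every push with appending cheaply and serializing once at the end.

```python
import json


def mark_affected(affected, value):
    """Add value to the affected list unless it is already present."""
    if value not in affected:
        affected.append(value)


def add_fact(facts, fact_json):
    """Append one serialized fact; nothing is parsed at this point."""
    facts.append(fact_json)


def facts_to_json_array(facts):
    """Parse every accumulated fact and serialize the whole array once."""
    return json.dumps([json.loads(fact) for fact in facts])


def push_with_reparse(array_json, fact_json):
    """The replaced pattern: reparse and reserialize the array on every push."""
    return json.dumps(json.loads(array_json) + [json.loads(fact_json)])


affected, facts, slow = [], [], "[]"
for pod in ["api-1", "api-2", "api-1"]:
    fact = json.dumps({"pod": pod, "restart_count": 3})
    mark_affected(affected, pod)
    add_fact(facts, fact)
    slow = push_with_reparse(slow, fact)

fast = facts_to_json_array(facts)
print(affected)
print(fast == slow)
```

Both paths produce the same array, which matches the commit's claim of no behavior change; only the amount of repeated parsing differs.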
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add Application Logs diagnose category for AI post-mortem

  New diagnose category that publishes the application's own log output as structured evidence, contextualized with the pod and container state at fail-time. Unlike the scope/ checks that detect specific failure modes and embed logs as secondary evidence, this check is log-first: a single, self-contained category the AI summarizer can read to say "here is the issue, look at this" without cross-referencing other checks.

  - k8s/diagnose/logs/workflow.yml: declares the category "Application Logs" with a single step that runs application_log_evidence.
  - k8s/diagnose/logs/application_log_evidence: iterates problematic pods from the build_context snapshot (no live kubectl), reads current and previous logs per container (init + regular) via read_log_tail, and emits a fact per container with the schema:

      { pod, pod_phase, pod_reason, container, init_container, container_state, restart_count, current_state_reason, last_termination_reason, last_exit_code, last_exit_code_meaning, current_logs, previous_logs }

    Status is always success/skipped (info severity). The check never fails: absence of logs is itself meaningful information ("image never started"). Reuses exit_code_meaning from diagnose_utils for the meaning string.
  - k8s/scope/workflows/diagnose.yaml and k8s/deployment/workflows/diagnose.yaml: register the new folder in the executor so the category appears automatically alongside Scope/Service/Networking. notify_results groups by category, no backend changes required.
  - k8s/diagnose/tests/logs/application_log_evidence.bats: 10 tests covering skipped path, empty problematic list, current logs only, previous logs, init container flag, no-logs-available, multi-pod aggregation, empty log files, CrashLoopBackOff context, and pod_reason from Ready condition.
  - k8s/diagnose/tests/evidence_schema.bats: +1 cross-cutting test asserting the check emits a schema-conformant evidence object on the skipped path.

  Full suite: 357/357 green.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Focus application_log_evidence on the 'application' container only

  Narrow the Application Logs check to its essential job: publish the user-owned container's logs for AI post-mortem. The previous shape duplicated metadata already emitted by the scope/ checks (pod_phase, container_state, restart counts, exit codes, etc.) and iterated every container in every problematic pod, including sidecars like 'http' nginx whose logs already appear in Health Probe Endpoints and Container Port Health.

  - Filters by container name 'application' (the literal name set in k8s/deployment/templates/deployment.yaml.tpl). Sidecars and init containers are out of scope; this check is not a per-container audit.
  - Per-pod payload shrinks from 12 fields to 2: { pod, logs }.
  - current and previous logs are merged in chronological order (previous first, current second) and truncated to the last EVIDENCE_LOG_TAIL_LINES (default 50). One flat array: the AI does not need to know which container instance produced which line; the user wanted the tail of the application output, period.
  - Tests updated: 9 cases covering skipped/empty paths, application-only filtering (asserts sidecar logs do not leak), previous+current merge in order, the 50-line cap, multi-pod aggregation, and a schema-pinning test that asserts the pod entry exposes exactly {pod, logs}.

  jq gotcha worth noting: `.[-n:]` with `n` as a variable does not compile ("n/0 is not defined") because jq parses `-n` as expression-minus-function. The correct slice is `.[-$n:]` with the `$` prefix.

  Full diagnose suite: 356/356 green.
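The filter-merge-truncate behavior described in this commit can be sketched as follows. This is a Python illustration with a hypothetical `application_tail` helper and made-up log lines; the container name 'application' and the previous-before-current ordering come from the commit message.

```python
def application_tail(containers, max_lines=50):
    """Merge previous then current logs of the 'application' container, keep the tail."""
    for container in containers:
        if container["name"] == "application":
            merged = container.get("previous", []) + container.get("current", [])
            return merged[-max_lines:] if max_lines > 0 else []
    return []


containers = [
    # Sidecar logs are ignored by the filter.
    {"name": "http", "current": ["nginx started"]},
    {
        "name": "application",
        "previous": ["boot", "crash: missing DB_URL"],
        "current": ["boot", "listening"],
    },
]

# Per-pod payload in the two-field shape { pod, logs }.
pod_entry = {"pod": "api-1", "logs": application_tail(containers, max_lines=3)}
print(pod_entry)
```

The Python slice `merged[-max_lines:]` plays the role of the jq slice `.[-$n:]` mentioned above.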
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Move application logs out of evidence, keep them only in check stdout

  Previously the application log text was duplicated: it lived inside evidence.details.pods[].logs (canonical for AI) and was also echoed to stdout (so the UI's check.logs[] tail could show it). For a single-pod scope that meant the same ~45 lines appearing twice in the result payload.

  Consolidate to a single source: the check.logs[] tail. evidence.details now carries only counters (pods_with_logs, problematic_pod_count) and the list of pods that produced logs is published via evidence.affected. No log text in evidence at all.

  The trade-off is the existing 20-line cap inside update_check_result: the UI sees the last 20 non-empty stdout lines of the check, which means roughly the last 17-18 log lines plus the check's own diagnostic prints. Sufficient for the typical single-pod scope; if that proves too tight, we can revisit the cap in diagnose_utils.

  Tests reshaped: 9 cases covering skipped/empty paths, sidecar exclusion in stdout, evidence.details exposing exactly {pods_with_logs, problematic_pod_count} (anchor against log text leaking back in), chronological merge of previous before current, the 50-line cap on the echoed tail, and multi-pod aggregation via affected[].

  Full diagnose suite: 356/356 green.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Allow checks to override the 20-line cap on captured stdout

  The Application Log Evidence check echoes the application log tail to stdout so the diagnose UI can show it in check.logs[]. But the existing 20-line cap inside update_check_result chops most of the payload off: for a typical single-pod scope with 50 log lines plus a few diagnostic prints, only ~17 log lines survive in the UI.

  Add an opt-in --log-tail-lines flag on update_check_result. Default stays at 20 (no impact on the other 19 checks).
  The logs check passes --log-tail-lines 200, which fits a few pods worth of output plus the check's own orchestrator/info lines.

  - diagnose_utils: parse --log-tail-lines, use it in the tail call; preserve the positional and --status/--evidence APIs unchanged.
  - logs/application_log_evidence: pass --log-tail-lines 200 to every update_check_result invocation on a path that emits log text. The skipped path keeps the default 20.
  - diagnose_utils.bats: rename existing test to "by default" and add two new cases: an 80-line override over 100 input lines, and a 5-line cap preserving the most recent lines.

  Full diagnose suite: 358/358 green.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(changelog): note structured evidence and Application Logs in k8s/diagnose

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(changelog): drop "AI post-mortem" framing from diagnose entry

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
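The effect of the opt-in cap on captured stdout can be sketched as a small function. This Python version is an illustration only: `captured_log_tail` is a hypothetical stand-in for the tail step inside update_check_result, with the default of 20 non-empty lines taken from the commit messages above.

```python
def captured_log_tail(stdout_text, log_tail_lines=20):
    """Keep the last `log_tail_lines` non-empty lines of a check's stdout."""
    non_empty = [line for line in stdout_text.splitlines() if line.strip()]
    return non_empty[-log_tail_lines:] if log_tail_lines > 0 else []


# 100 input lines, as in the bats cases described above.
stdout_text = "\n".join(f"log {i}" for i in range(1, 101))

default_tail = captured_log_tail(stdout_text)
wide_tail = captured_log_tail(stdout_text, log_tail_lines=80)
narrow_tail = captured_log_tail(stdout_text, log_tail_lines=5)

print(len(default_tail), default_tail[0])
print(len(wide_tail), wide_tail[0])
print(narrow_tail)
```

Keeping the default at 20 means existing callers see no change, while a log-heavy check can ask for a wider window.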


geisbruch and others added 30 commits April 16, 2026 19:44

Add kubectl helper command

172bc47

Add kubectl logs

5704885

Merge pull request #178 from nullplatform/fix/avoid-alb-validation-scheduled-tasks

55b88a8

Skip ALB capacity validations in scheduled_task workflows

Merge pull request #176 from nullplatform/features/kubectl_read_command_throubleshoot

4edd9d7

Features/kubectl read command for throubleshooting

fix(wait_on_balancer): fix nslookup IP parsing to skip server address line

ae40e10

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore(changelog): add 1.12.0 entry for external-dns DNSEndpoint support

354b47c

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore(changelog): condense 1.11.1 entry into single functional description

d6c24ad

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore(changelog): reframe 1.11.1 entry around user benefit

fa6b7ba

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge pull request #182 from nullplatform/fix/dnsendpoint

90288ca

fix(dns): support external-dns with DNSEndpoint CRD and zone-type label filtering

feat(k8s): configurable main_http_port and HTTP support for additional_ports

c0df99c

Additional port default is http

7bd2a57

fix(k8s/templates): use root context for k8s_modifiers in HTTP additional_port service

420c39c

fix(k8s/templates): remove HTTP sidecar; HTTP additional ports are app-bound directly

080827f

fix(k8s/templates): restore HTTP additional-port sidecar with UPSTREAM_PORT pointing to main_http_port

492841a

fix(k8s/templates): HTTP sidecar binds port+10000 so app can bind port directly

39d4856

fix(k8s/verify_ingress): dedupe weights for multi-ingress listeners + tests

9e6b43d

feat(k8s/templates): HTTP additional ports listen on dedicated HTTPS port

1a52f3c

Fix tests

604a1fe

fedemaleh and others added 28 commits May 21, 2026 13:15

feat: add cpu_millicores_limit and ram_memory_limit properties to k8s scope spec

3916ce8

feat: rename Processor tab to Resources and surface CPU/RAM limit controls

957debc

feat: normalize cpu/ram limit capabilities to request value when unset

8bc5dac

feat: render application container limits from normalized capability fields

f50b59c

refactor: tighten normalize_capability_limits jq + bats here-string idioms

a40f54a

fix: mark cpu_millicores_limit and ram_memory_limit as required for UI visibility

3eff675

refactor: make cpu_millicores_limit a dropdown with 'Same as request' option

6856c9c

docs: align ram_memory_limit description with cpu_millicores_limit phrasing

d08ca79

chore: move design spec and plan to .claude (untracked working notes)

88f65ee

docs: add changelog entry for configurable CPU and memory limits

26aae22

Merge branch 'beta' into feature/clien-781-memory-cpu-limits

d9cf93f

refactor: drop ticket id and noise from normalize_capability_limits comment

81726e1

test: exercise normalize via full build_context instead of private function

0a3ab0f

test: remove deployment template shape tests in favor of integration coverage

811b607

feat: clamp limit to request when below it as defense-in-depth

44776c7

Merge pull request #185 from nullplatform/feature/clien-781-memory-cpu-limits

e9505e4

feat: add configurable cpu/ram limits to k8s scope (CLIEN-781)

fix(k8s,scheduled_task): file-type parameter no longer leaks binary as env var

fd39127

docs(changelog): tighten file-parameter fix entry

f6118ce

fix(k8s,scheduled_task): isolate file binary in a dedicated Secret

4c57d1e

refactor(k8s,scheduled_task): derive file-param identifiers from .name

8a093eb

fix(k8s,scheduled_task): omit env: block when no file params

dc679d2

Revert "fix(k8s,scheduled_task): omit env: block when no file params"

cd89c5f

This reverts commit dc679d2.

fix(k8s,scheduled_task): quote destination_path in YAML to escape flow chars

b994dfa

test(scheduled_task): add build_deployment render test for file params

e43f0f3

Merge pull request #186 from nullplatform/fix/file-param-env-nul-byte

df0a510

fix(k8s,scheduled_task): file-type parameter no longer leaks binary as env var

Update CHANGELOG.md

6c6284f

Merge pull request #188 from nullplatform/fedemaleh-patch-1

2836cfb

Update release date in changelog

ignacioboud approved these changes Jun 8, 2026


fedemaleh merged commit 6e4077d into main Jun 9, 2026
3 checks passed

fedemaleh merged 64 commits into main from beta


7 participants
