feat: opt-in ALB autocreation when pool capacity is exhausted #187
Merged
15 commits
025de7e feat: opt-in ALB autocreation when pool capacity is exhausted (fedemaleh)
59821c3 fix: harden autocreate input validation and merge AWS polling call (fedemaleh)
56152e5 refactor: address PR review feedback on autocreate flow (fedemaleh)
b0b1cc1 docs: clarify get_alb_rule_count stdout contract in race-case branch (fedemaleh)
f5026d2 refactor: reuse $DOMAIN from build_context instead of recomputing (fedemaleh)
81b33a3 Merge remote-tracking branch 'origin/beta' into feature/clien-807-aut… (fedemaleh)
940276f test: assert full log messages with emojis for autocreate networking … (fedemaleh)
33da501 feat: suggest enabling ALB_AUTOCREATE_ENABLED in capacity error hints (fedemaleh)
7f9435f fix(autocreate): trigger on single-ALB setups and emit step-by-step d… (fedemaleh)
dff4e05 fix: render gomplate context as .json and emit heartbeat every 30s wh… (fedemaleh)
8a4c12f fix: generate dummy ingress host via domain-generate so it matches th… (fedemaleh)
a9dbc6c fix: stable 30s heartbeat cadence and trap-safe context cleanup under… (fedemaleh)
257a7bc fix: gate wait_for_alb on route53, validate inputs, dedupe AWS call, … (fedemaleh)
a078c51 docs(changelog): shorten autocreate entry to one-liner client summary (fedemaleh)
47c98c2 feat(wait_for_alb): name the missing IAM permission in tag-failure warn (fedemaleh)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,76 @@ | ||
| # ALB Autocreation | ||
|
|
||
| The k8s scope can provision new Application Load Balancers (ALBs) on demand when the declared pool of ALBs is exhausted. The behavior is opt-in and only triggers during scope creation; existing scopes are never moved to autocreated ALBs automatically. | ||
|
|
||
| ## When the autocreate path runs | ||
|
|
||
| The flow only triggers when **all** of the following are true: | ||
|
|
||
| - `ALB_AUTOCREATE_ENABLED=true` in `values.yaml` or in the `container-orchestration` provider. | ||
| - `DNS_TYPE=route53` (autocreation requires the same load-balancing path used by Route53 scopes). | ||
| - Every candidate ALB in the pool (base + additional balancers declared in the provider) reports a rule count `>= ALB_MAX_CAPACITY`. | ||
| - The scope being created does not already have a Route53 record (a scope being recreated reuses its existing ALB and does not trigger autocreation). | ||
|
|
||
| If any candidate is below the threshold, the scope creation uses that candidate and the autocreate path is not taken. | ||
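The four conditions above can be read as a single predicate. The following is a minimal sketch, not the PR's implementation: `should_autocreate` and its positional arguments are hypothetical stand-ins for values the real scripts derive from configuration, AWS, and Route53.

```bash
# Hypothetical helper: returns 0 when the autocreate path should run.
should_autocreate() {
  local enabled="$1"             # value of ALB_AUTOCREATE_ENABLED
  local dns_type="$2"            # value of DNS_TYPE
  local min_rule_count="$3"      # rule count of the least-loaded candidate
  local max_capacity="$4"        # value of ALB_MAX_CAPACITY
  local has_route53_record="$5"  # "true" if the scope already has a record

  [ "$enabled" = "true" ] || return 1
  [ "$dns_type" = "route53" ] || return 1
  [ "$has_route53_record" != "true" ] || return 1
  # If the least-loaded candidate is full, every candidate is full.
  [ "$min_rule_count" -ge "$max_capacity" ]
}
```

For example, `should_autocreate true route53 100 100 false` succeeds, while the same call with a rule count of `99` does not, because that candidate still has room.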
|
|
||
| ## Configuration | ||
|
|
||
| | Key | Default | Description | | ||
| |---|---|---| | ||
| | `ALB_AUTOCREATE_ENABLED` | `false` | Master switch. When `false`, behavior is identical to previous releases. | | ||
| | `ALB_AUTOCREATE_NAME_PREFIX` | `nullplatform-auto-` | Prefix for autocreated ALB names. Final name format: `<prefix><public|private>-<6 hex chars>`. Must match `^[a-z0-9-]+$` and be ≤18 chars: the visibility and hex suffix adds up to 14 chars, so the rendered name stays within the AWS 32-char ALB name limit. | | ||
| | `ALB_AUTOCREATE_TIMEOUT_SECONDS` | `300` | How long `wait_for_alb` polls AWS for the new ALB to reach `state=active` before failing the scope creation. The AWS Load Balancer Controller usually takes 2–4 minutes. Must be a positive integer. | | ||
|
|
||
| All three keys are also readable from `providers.container-orchestration.balancer.{autocreate_enabled, autocreate_name_prefix, autocreate_timeout_seconds}`. | ||
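The prefix constraints and the length arithmetic can be checked in isolation. This is a sketch under the documented rules only; `validate_prefix` is a hypothetical name, and it uses a `case` pattern rather than the `[[ =~ ]]` test found in the script.

```bash
# Hypothetical helper mirroring the documented prefix constraints.
validate_prefix() {
  local prefix="$1"
  local LC_ALL=C   # keep a-z from matching uppercase in some locales
  case "$prefix" in
    '' | *[!a-z0-9-]*) return 1 ;;   # empty, or a char outside [a-z0-9-]
  esac
  [ "${#prefix}" -le 18 ]
}

# Longest rendered name with the default prefix:
# 18 (prefix) + 8 ("private-") + 6 (hex suffix) = 32 characters.
longest="nullplatform-auto-private-abcdef"
echo "${#longest}"   # prints 32
```

The default prefix is exactly 18 characters, so a private ALB created with the defaults already sits at the 32-character limit.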
|
|
||
| ## How it works | ||
|
|
||
| 1. `resolve_balancer` evaluates the candidate pool — the base ALB plus the `additional_public_names` / `additional_private_names` list declared in the `container-orchestration` provider — and picks the least-loaded one. | ||
| 2. If that candidate's rule count is at or above `ALB_MAX_CAPACITY` and `ALB_AUTOCREATE_ENABLED=true`, `resolve_balancer` sources `autocreate_alb`. | ||
| 3. `autocreate_alb` generates a unique ALB name (`<prefix><public|private>-<6 hex>`) and **patches the container-orchestration provider via `np provider patch`** to append the new name to `additional_public_names` or `additional_private_names` (visibility-dependent). The provider is the authoritative registry of ALBs the platform uses. | ||
| 4. `autocreate_alb` renders `scope/templates/ingress-dummy.yaml.tpl` into `$OUTPUT_DIR/ingress-dummy-<alb-name>.yaml`. The dummy Ingress carries `alb.ingress.kubernetes.io/group.name=<new-name>` and `alb.ingress.kubernetes.io/load-balancer-name=<new-name>`, which is what makes the AWS Load Balancer Controller materialize the ALB once the file is applied. | ||
| 5. The workflow step `apply autocreated ingress` (in `k8s/scope/workflows/create.yaml`) applies whatever templates are in `$OUTPUT_DIR` via the standard `apply_templates` script. Its `post: wait for alb` runs `wait_for_alb`, which polls `aws elbv2 describe-load-balancers` every 10 seconds until the ALB reports `State.Code=active` (or `failed`/timeout, in which case the scope creation fails). An info-level heartbeat is emitted every ~30s so the operator can see progress. | ||
| 6. Once active, `wait_for_alb` tags the ALB with `nullplatform:managed-by=autocreate`, `nullplatform:visibility=internet-facing|internal`, and `nullplatform:created-by-scope-id=<scope-id>`. **These tags are audit metadata only**, surfacing the lineage of which scope provisioned which ALB. Discovery does NOT depend on these tags. | ||
| 7. The rest of the scope creation proceeds with `ALB_NAME` set to the new ALB. | ||
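The polling described in step 5 can be sketched roughly as follows. This is an illustration, not the contents of `wait_for_alb`: the function names and the `POLL_INTERVAL` / `HEARTBEAT_INTERVAL` variables are assumptions, and the real step may query AWS differently.

```bash
# Hypothetical wrapper around the AWS CLI call named in step 5.
fetch_alb_state() {
  aws elbv2 describe-load-balancers --names "$1" \
    --query 'LoadBalancers[0].State.Code' --output text 2>/dev/null
}

# Polls until the ALB is active, reports failed, or the timeout elapses.
wait_until_active() {
  local name="$1" timeout="$2"
  local interval="${POLL_INTERVAL:-10}" heartbeat="${HEARTBEAT_INTERVAL:-30}"
  local elapsed=0 state
  while [ "$elapsed" -lt "$timeout" ]; do
    # A failed lookup is treated as "not there yet" in this sketch.
    state=$(fetch_alb_state "$name") || state="provisioning"
    case "$state" in
      active) return 0 ;;
      failed) echo "ALB '$name' reported state=failed" >&2; return 1 ;;
    esac
    if [ "$elapsed" -gt 0 ] && [ $((elapsed % heartbeat)) -eq 0 ]; then
      echo "Still waiting for ALB '$name' ($state, ~${elapsed}s elapsed)" >&2
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "Timed out after ${timeout}s waiting for ALB '$name'" >&2
  return 1
}
```

Keeping the AWS call behind a small function makes the loop testable without credentials: a test can redefine `fetch_alb_state` to return a fixed state.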
|
|
||
| ## How concurrent scope creations behave | ||
|
|
||
| When scope A triggers autocreate, the provider is patched **before** the ALB is active. A scope B that starts during this window reads the provider list, sees the new ALB name, and treats it as a normal candidate. AWS returns `LoadBalancerNotFound` for the in-flight ALB during the few seconds before it shows up in the API; `resolve_balancer` interprets that specific error as "0 rules", so the in-flight ALB wins least-loaded selection in scope B and no second autocreate fires. Scope B then waits on the same ALB via its own `wait_for_alb` step. | ||
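The "0 rules" interpretation can be illustrated with a small mapping function. This is a sketch of the idea only: `rule_count_from_result` is a hypothetical name, and it takes the exit status and captured output of a rule-listing call as plain arguments instead of calling AWS.

```bash
# Hypothetical helper: turns the result of a rule-listing call into a count.
#   $1 = exit status of the call, $2 = its captured output
rule_count_from_result() {
  local status="$1" output="$2"
  if [ "$status" -eq 0 ]; then
    echo "$output"
  elif printf '%s' "$output" | grep -q 'LoadBalancerNotFound'; then
    echo 0      # in-flight ALB: registered in the provider, not yet in AWS
  else
    return 1    # any other error stays an error
  fi
}
```

Only the `LoadBalancerNotFound` case is mapped to zero; treating every failure as an empty ALB would hide permission or throttling problems.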
|
|
||
| ## Required permissions | ||
|
|
||
| In addition to the permissions already required for capacity validation, the autocreate path needs: | ||
|
|
||
| **Nullplatform API credentials.** The script calls `np provider list` and `np provider patch`, so the workflow environment must provide either `NP_TOKEN` or `NULLPLATFORM_API_KEY` with write access to the container-orchestration provider for the relevant NRN. Without these, the patch step fails with `❌ Failed to patch container-orchestration provider with new ALB`. | ||
|
|
||
| **AWS IAM (agent role).** | ||
|
|
||
| - `elasticloadbalancing:AddTags` — for the audit tags `wait_for_alb` applies once the ALB is active. Failure here is non-fatal (logged as a warning, the scope creation proceeds). | ||
|
|
||
| No new Kubernetes permissions are needed beyond those the agent already has for scope resources. | ||
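To see what the `AddTags` permission is used for, the tagging call can be sketched with the AWS CLI's `elbv2 add-tags` command. The helper names below are hypothetical and the real `wait_for_alb` may build the call differently; the tag keys are the ones listed in "How it works".

```bash
# Hypothetical helper: prints one --tags argument per line.
build_audit_tags() {
  local visibility="$1" scope_id="$2"
  # printf reuses the format string for each key/value pair.
  printf 'Key=%s,Value=%s\n' \
    "nullplatform:managed-by" "autocreate" \
    "nullplatform:visibility" "$visibility" \
    "nullplatform:created-by-scope-id" "$scope_id"
}

# Hypothetical helper: tags the ALB and never fails the caller.
tag_alb() {
  local arn="$1" visibility="$2" scope_id="$3"
  # Word splitting is intentional: each output line is one tag argument.
  # shellcheck disable=SC2046
  aws elbv2 add-tags --resource-arns "$arn" \
    --tags $(build_audit_tags "$visibility" "$scope_id") \
    || echo "Could not tag ALB (audit only)" >&2
}
```

The trailing `|| echo` keeps a tagging failure non-fatal, matching the documented behavior that the scope creation proceeds with a warning.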
|
|
||
| ## Operational notes | ||
|
|
||
| - Scope creations that trigger autocreation are slower (typically 2–4 minutes extra). This is the expected behavior, not a regression. The platform logs `🔧 Best candidate ALB '...' is at or above capacity (X/Y); triggering autocreate` when it happens, followed by `⏳ Still waiting for ALB '...' to become active (provisioning, ~30s elapsed)` heartbeats while the controller provisions. | ||
| - The dummy Ingress (`nullplatform-autocreate-<alb-name>`) is created in the scope's namespace. It exposes no real traffic — the rule points to a fixed `404` response via the standard `alb.ingress.kubernetes.io/actions.response-404` annotation — and exists only to keep the ALB alive in the eyes of the AWS Load Balancer Controller. Deleting the dummy Ingress will cause the controller to delete the ALB. | ||
| - The ALB is registered in the nullplatform provider (not in the customer's IaC). Two consequences: | ||
| 1. The provider becomes the source of truth for the ALB pool; subsequent scope creations read it directly. | ||
| 2. The cloud's IaC (Terraform, OpenTofu, CloudFormation) is **not** updated automatically. If your IaC is the source of truth for ALB inventory, you should reconcile autocreated ALBs into it through your own process. | ||
|
|
||
| ## Failure modes | ||
|
|
||
| | Failure | Outcome | | ||
| |---|---| | ||
| | `ALB_AUTOCREATE_NAME_PREFIX` invalid (bad charset or >18 chars) | Scope creation exits 1 with the validation error before any AWS or provider call. | | ||
| | `np provider list` cannot find a container-orchestration provider for the NRN | Scope creation exits 1 with `❌ No container-orchestration provider found for NRN '<nrn>'`. | | ||
| | `np provider patch` fails (no API token / no write access) | Scope creation exits 1 with `❌ Failed to patch container-orchestration provider with new ALB` + hint about `NP_TOKEN` / `NULLPLATFORM_API_KEY`. | | ||
| | `gomplate` render of the dummy Ingress fails | Scope creation exits 1 with `❌ Failed to render ingress-dummy template`. | | ||
| | ALB never reaches `active` within `ALB_AUTOCREATE_TIMEOUT_SECONDS` | Scope creation exits 1; check controller logs and AWS quota for ALBs in the region. | | ||
| | AWS reports the ALB state as `failed` | Scope creation exits 1 immediately. | | ||
| | `AddTags` call fails (no IAM permission) | Logged as `⚠️ Could not tag ALB '<name>' (audit only — provider registration already succeeded)`. The scope creation continues; the tags are documentation only. | | ||
|
|
||
| ## What is out of scope | ||
|
|
||
| - Migration of existing scopes to autocreated ALBs. Use the `Recreate scope` action if needed. | ||
| - Automatic cleanup of unused autocreated ALBs (no scopes referencing them). | ||
| - Updating the cloud IaC (Terraform / OpenTofu / CloudFormation) with the new ALB. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,217 @@ | ||
| #!/bin/bash | ||
|
|
||
| # Provisions a new ALB on demand when the existing pool is exhausted. | ||
| # | ||
| # Workflow contract: | ||
| # 1. Generates a unique ALB name. | ||
| # 2. Patches the container-orchestration provider via `np provider patch` so | ||
| # the new ALB is recorded as an additional balancer. Subsequent scope | ||
| # creations that read the provider see the new ALB and re-use it instead | ||
| # of triggering another autocreate. | ||
| # 3. Renders the dummy-ingress template to $OUTPUT_DIR. The next workflow | ||
| # step (apply_templates) applies it; that is what triggers the AWS Load | ||
| # Balancer Controller to actually provision the ALB. | ||
| # 4. Exports ALB_NAME (the new name) and ALB_AUTOCREATED=true so the wait | ||
| # step downstream knows to poll for active state and tag the ALB. | ||
| # | ||
| # Inputs (env vars): | ||
| # CONTEXT - Scope CONTEXT JSON | ||
| # INGRESS_VISIBILITY - "internet-facing" or "internal" | ||
| # K8S_NAMESPACE - Namespace for the dummy Ingress | ||
| # OUTPUT_DIR - Where the rendered YAML is written | ||
| # ALB_AUTOCREATE_NAME_PREFIX - Optional prefix for the new ALB | ||
| # | ||
| # Outputs (env vars): | ||
| # ALB_NAME - Replaced with the new ALB name | ||
| # ALB_AUTOCREATED - Set to "true" | ||
|
|
||
| _AUTOCREATE_ALB_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" | ||
|
|
||
| generate_alb_name() { | ||
| local prefix="$1" | ||
| local visibility="$2" | ||
|
|
||
| local vis_short | ||
| if [ "$visibility" = "internet-facing" ]; then | ||
| vis_short="public" | ||
| else | ||
| vis_short="private" | ||
| fi | ||
|
|
||
| local suffix | ||
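| # The pipeline's exit status is unreliable here: without pipefail it is | ||
| # head's (always 0), and with pipefail tr's SIGPIPE fails it even on | ||
| # success. Ignore the status and check the suffix length instead. | ||
| suffix=$(LC_ALL=C tr -dc 'a-f0-9' < /dev/urandom 2>/dev/null | head -c 6) || true | ||
| if [ "${#suffix}" -ne 6 ]; then | ||
| suffix=$(printf '%06x' $((RANDOM * RANDOM % 16777216))) | ||
| fi | ||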
|
|
||
| echo "${prefix}${vis_short}-${suffix}" | ||
|
|
||
| } | ||
|
|
||
| # Reads the container-orchestration provider id and current additional balancer | ||
| # list for the requested visibility, then patches the provider to append the new | ||
| # ALB name. Surfaces the existing balancer.* attributes so the patch is a | ||
| # merge of the full balancer object, not a destructive replacement. | ||
| register_alb_in_provider() { | ||
| local new_alb_name="$1" | ||
| local visibility="$2" | ||
| local nrn | ||
| nrn=$(echo "$CONTEXT" | jq -r '.scope.nrn // empty') | ||
| if [ -z "$nrn" ]; then | ||
| log error "❌ Could not read scope NRN from CONTEXT — cannot patch provider" | ||
| exit 1 | ||
| fi | ||
|
|
||
| local provider_json | ||
| provider_json=$(np provider list --categories container-orchestration --nrn "$nrn" --format json 2>/dev/null) || { | ||
| log error "❌ Failed to list container-orchestration provider for NRN '$nrn'" | ||
| exit 1 | ||
| } | ||
|
|
||
| local provider_id | ||
| provider_id=$(echo "$provider_json" | jq -r '.results[0].id // empty') | ||
| if [ -z "$provider_id" ]; then | ||
| log error "❌ No container-orchestration provider found for NRN '$nrn'" | ||
| exit 1 | ||
| fi | ||
|
|
||
| local field | ||
| if [ "$visibility" = "internet-facing" ]; then | ||
| field="additional_public_names" | ||
| else | ||
| field="additional_private_names" | ||
| fi | ||
|
|
||
| # Merge: current balancer.* + appended ALB in the right field. | ||
| local patch_body | ||
| patch_body=$(echo "$provider_json" | jq -c \ | ||
| --arg field "$field" \ | ||
| --arg new_alb "$new_alb_name" \ | ||
| '{ | ||
| attributes: { | ||
| balancer: ( | ||
| (.results[0].attributes.balancer // {}) as $bal | | ||
| $bal + { ($field): (($bal[$field] // []) + [$new_alb] | unique) } | ||
| ) | ||
| } | ||
| }') | ||
|
|
||
| log info "📝 Registering ALB '$new_alb_name' in container-orchestration provider ($field)" | ||
| if ! np provider patch --id "$provider_id" --body "$patch_body" --no-output 2>/dev/null; then | ||
| log error "❌ Failed to patch container-orchestration provider with new ALB" | ||
| log error "💡 Possible causes: agent lacks write permission on the provider, or NP_TOKEN/NULLPLATFORM_API_KEY is missing" | ||
| exit 1 | ||
| fi | ||
| } | ||
|
|
||
| # Builds the dummy ingress host via the same domain-generate binary the | ||
| # platform uses for scope domains. Substituting scopeSlug with the ALB name | ||
| # keeps the host inside whatever wildcard cert/DNS pattern the platform | ||
| # already maintains. | ||
| generate_dummy_host() { | ||
| local alb_name="$1" | ||
|
|
||
| local account_slug namespace_slug application_slug | ||
| account_slug=$(echo "$CONTEXT" | jq -r '.account.slug') | ||
| namespace_slug=$(echo "$CONTEXT" | jq -r '.namespace.slug') | ||
| application_slug=$(echo "$CONTEXT" | jq -r '.application.slug') | ||
|
|
||
| local host | ||
| host=$("$SERVICE_PATH/scope/networking/dns/domain/domain-generate" \ | ||
| --accountSlug="$account_slug" \ | ||
| --namespaceSlug="$namespace_slug" \ | ||
| --applicationSlug="$application_slug" \ | ||
| --scopeSlug="$alb_name" \ | ||
| --domain="$DOMAIN" \ | ||
| --useAccountSlug="${USE_ACCOUNT_SLUG:-false}") || { | ||
| log error "❌ Failed to generate dummy ingress host via domain-generate" | ||
| log error "💡 Possible causes:" | ||
| log error " The domain-generate binary returned an error" | ||
| log error "🔧 How to fix:" | ||
| log error " • Check the domain-generate binary exists: ls -la $SERVICE_PATH/scope/networking/dns/domain/domain-generate" | ||
| log error " • Verify the input slugs are valid" | ||
| exit 1 | ||
| } | ||
|
|
||
| echo "$host" | ||
| } | ||
|
|
||
| render_dummy_ingress() { | ||
| local alb_name="$1" | ||
| local visibility="$2" | ||
| local namespace="$3" | ||
|
|
||
| if [ -z "${OUTPUT_DIR:-}" ]; then | ||
| log error "❌ OUTPUT_DIR is not set — autocreate_alb must run after OUTPUT_DIR is exported" | ||
| exit 1 | ||
| fi | ||
| mkdir -p "$OUTPUT_DIR" | ||
|
|
||
| local dummy_host | ||
| dummy_host=$(generate_dummy_host "$alb_name") | ||
| log debug "📋 Dummy ingress host: $dummy_host" | ||
|
|
||
| # The context file MUST have a .json extension. gomplate uses the extension | ||
| # to pick the parser; a plain mktemp path is treated as an opaque string | ||
| # and the template fails with "can't evaluate field X in type string". | ||
| local context_path | ||
| context_path="$OUTPUT_DIR/ingress-dummy-${alb_name}-context.json" | ||
| # Use double quotes so $context_path is baked into the trap string now. | ||
| # Single quotes would defer expansion until the trap fires on RETURN, by | ||
| # which point the local variable is out of scope and `set -u` would trip | ||
| # with "context_path: unbound variable". | ||
| trap "rm -f '$context_path'" RETURN | ||
|
|
||
| echo "$CONTEXT" | jq \ | ||
| --arg alb_name "$alb_name" \ | ||
| --arg ingress_visibility "$visibility" \ | ||
| --arg k8s_namespace "$namespace" \ | ||
| --arg dummy_host "$dummy_host" \ | ||
| '. + {alb_name: $alb_name, ingress_visibility: $ingress_visibility, k8s_namespace: $k8s_namespace, dummy_host: $dummy_host}' \ | ||
| > "$context_path" | ||
|
|
||
| local template_path="${INGRESS_DUMMY_TEMPLATE:-$SERVICE_PATH/scope/templates/ingress-dummy.yaml.tpl}" | ||
| local out_path="$OUTPUT_DIR/ingress-dummy-${alb_name}.yaml" | ||
|
|
||
| if ! gomplate -c .="$context_path" --file "$template_path" --out "$out_path"; then | ||
| log error "❌ Failed to render ingress-dummy template" | ||
| log error "📋 Template: $template_path" | ||
| exit 1 | ||
| fi | ||
|
|
||
| log debug "📝 Rendered dummy ingress to $out_path" | ||
| } | ||
|
|
||
| # ============================================================================= | ||
| # Main | ||
| # ============================================================================= | ||
|
|
||
| NAME_PREFIX=$(get_config_value \ | ||
| --env ALB_AUTOCREATE_NAME_PREFIX \ | ||
| --provider '.providers["container-orchestration"].balancer.autocreate_name_prefix' \ | ||
| --default "nullplatform-auto-" | ||
| ) | ||
|
|
||
| # Final ALB name is "<prefix><public|private>-<6 hex>". AWS rejects names that | ||
| # exceed 32 chars or contain anything outside [a-zA-Z0-9-]. If an invalid name | ||
| # slips through, the AWS Load Balancer Controller silently refuses to create | ||
| # the ALB and wait_for_alb hangs to timeout with an opaque error. Catch it | ||
| # here with a clear message instead. | ||
| if ! [[ "$NAME_PREFIX" =~ ^[a-z0-9-]+$ ]]; then | ||
| log error "❌ ALB_AUTOCREATE_NAME_PREFIX must match ^[a-z0-9-]+$, got: '$NAME_PREFIX'" | ||
| exit 1 | ||
| fi | ||
|
|
||
| # 14 = len("private-") + 6 hex chars; total must stay ≤32 (AWS ALB name limit) | ||
| if [ "${#NAME_PREFIX}" -gt 18 ]; then | ||
| log error "❌ ALB_AUTOCREATE_NAME_PREFIX must be ≤18 chars (AWS caps ALB names at 32, the visibility+hex suffix uses 14); got ${#NAME_PREFIX}" | ||
| exit 1 | ||
| fi | ||
|
|
||
| NEW_ALB_NAME=$(generate_alb_name "$NAME_PREFIX" "$INGRESS_VISIBILITY") | ||
| log info "🔧 Autocreating ALB '$NEW_ALB_NAME' (visibility=$INGRESS_VISIBILITY)" | ||
|
|
||
| register_alb_in_provider "$NEW_ALB_NAME" "$INGRESS_VISIBILITY" | ||
|
|
||
| render_dummy_ingress "$NEW_ALB_NAME" "$INGRESS_VISIBILITY" "$K8S_NAMESPACE" | ||
|
|
||
| export ALB_NAME="$NEW_ALB_NAME" | ||
| export ALB_AUTOCREATED="true" | ||
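The merge performed by `register_alb_in_provider` can be tried locally without the `np` CLI. The sketch below wraps the same jq filter in a hypothetical `build_patch_body` function and feeds it a made-up provider payload; the `public_name` key and the ALB names are illustrative assumptions, not fields confirmed by the PR.

```bash
# Hypothetical wrapper around the jq filter used by register_alb_in_provider.
build_patch_body() {
  local provider_json="$1" field="$2" new_alb="$3"
  printf '%s' "$provider_json" | jq -c \
    --arg field "$field" \
    --arg new_alb "$new_alb" \
    '{
      attributes: {
        balancer: (
          (.results[0].attributes.balancer // {}) as $bal |
          $bal + { ($field): (($bal[$field] // []) + [$new_alb] | unique) }
        )
      }
    }'
}

# Sample payload; every value here is made up for illustration.
sample='{"results":[{"id":"p1","attributes":{"balancer":{"public_name":"main","additional_public_names":["alb-b"]}}}]}'
```

Running `build_patch_body "$sample" additional_public_names alb-a` keeps `public_name` and yields `["alb-a","alb-b"]` for the list. Because jq's `unique` sorts its input, the patched list is sorted and de-duplicated rather than kept in insertion order.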