feat: dynamic assume-role support, configurable placeholder image & install tofu consolidation#25

Draft
davidf-null wants to merge 24 commits into main from feature/assume-role-support

davidf-null commented Jun 3, 2026

Summary

Adds dynamic assume-role support and a configurable placeholder image to the
AWS Lambda scope, introduces a requirements module with the IAM policies the scope
needs, and bundles the deploy/state/IAM fixes found while testing — plus a
security/docs hardening pass to keep the published scope account-agnostic.

Changes

Dynamic assume role

  • Resolve the assume-role ARN from the scope-configurations provider
    (assume_role.arn) with ASSUME_ROLE_ARN_DEFAULT as an env-var fallback; when set,
    AWS operations run under the assumed role, otherwise the agent's pod credentials are
    used.
  • Surface sts:AssumeRole errors to stdout so they are visible in NP logs.

Installation tofu (lambda/setup/) + IAM policies

  • IAM policies required for Lambda scope operations (Lambda core, IAM roles,
    networking, storage/observability).
  • Prefix the Lambda execution role with np-lambda- to match the policy resource
    constraint.
  • Add the modern CloudWatch Logs tagging actions to the policy.
  • Consolidation (reviewer feedback): all installation-time tofu now lives under
    lambda/setup/ — the scope-registration module (formerly lambda/specs/tofu/) and
    the IAM policies (formerly the standalone lambda/requirements/ module) are merged
    there. A single tofu apply in lambda/setup/ registers the scope type and
    provisions the IAM policies. name is now a required setup variable; attaching the
    policies stays optional via create_role / role_name.

Configurable placeholder image

  • PLACEHOLDER_IMAGE_URI_DEFAULT env-var fallback for the placeholder image, with
    precedence: scope-config deployment.placeholder_image_uri >
    PLACEHOLDER_IMAGE_URI_DEFAULT > the script's public default.
  • Use the exact PLACEHOLDER_IMAGE_URI when explicitly set and stop appending an
    automatic architecture suffix (publish -amd64 / -arm64 tags instead).
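
The precedence above can be sketched with nested parameter expansion. This is a minimal illustration, not the script's actual code: the function name `resolve_placeholder_image` and all URIs are hypothetical, and the real public default is not shown in this PR.

```shell
#!/bin/bash
# Hypothetical helper: print the first non-empty value among
#   $1 = scope-config deployment.placeholder_image_uri
#   $2 = PLACEHOLDER_IMAGE_URI_DEFAULT (env var)
#   $3 = the script's public default
# The chosen URI is printed as-is: no -amd64 / -arm64 suffix is appended.
resolve_placeholder_image() {
  local from_scope_config="$1" from_env="$2" public_default="$3"
  printf '%s\n' "${from_scope_config:-${from_env:-$public_default}}"
}

# Illustrative values only (123456789012 is the dummy AWS account ID).
resolve_placeholder_image "" \
  "123456789012.dkr.ecr.us-east-1.amazonaws.com/placeholder:latest-arm64" \
  "public.example/placeholder:latest"
```

Because the tag is never rewritten, an operator who needs an architecture-specific image has to put the full tag in whichever level of the chain they configure.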

Deploy / state / tofu fixes

  • Ensure the Lambda pull policy on the image's ECR repo before update.
  • Add the missing diagnose.yaml workflow for the diagnose-deployment action.
  • Read TOFU_STATE_BUCKET from .provider.aws_state_bucket as a fallback.
  • Surface tofu apply stderr to stdout for visibility in NP logs.
  • Correct the nullplatform provider version constraint in specs/tofu.

Security & docs hardening

  • Remove the account-specific defaults committed for testing
    (ASSUME_ROLE_ARN_DEFAULT / PLACEHOLDER_IMAGE_URI_DEFAULT carried a real account
    ARN/URI) so the product repo stays account-agnostic.
  • Re-add PLACEHOLDER_IMAGE_URI_DEFAULT to values.yaml as a commented,
    account-agnostic template so operators can pick their own image without a
    hardcoded value.
  • Normalize a stray real-looking account ID in a publish comment to the dummy
    123456789012.
  • README: new Placeholder Image (Scope Bootstrap) section explaining why
    Image-based scopes need a private-ECR placeholder, the resolution precedence, how to
    publish one, and a troubleshooting entry.

Note: the real account IDs (235494813897, 688720756067) appear in the branch
history (52cba87 and earlier commits); they are removed from the working tree but
not scrubbed from history (account IDs are low-sensitivity).

Test plan

  • Scope create/update/delete with no assume role → pod credentials used end-to-end
  • Scope ops with assume_role.arn set → run under the assumed role; errors surface
    in NP logs on failure
  • create_role requirements module applies with all IAM policies attached
  • Zip-package scope create → uses the embedded placeholder, no image config needed
  • Image-package scope create → uses PLACEHOLDER_IMAGE_URI_DEFAULT from a private
    ECR (single-arch tag matching the scope architecture)
  • diagnose-deployment action runs via the new diagnose.yaml workflow

🤖 Generated with Claude Code

David Fernandez and others added 17 commits June 1, 2026 15:28
When assume_role.arn is set in the scope-configurations provider, the agent's
base credentials (IRSA) are used only to call sts:AssumeRole; all subsequent
AWS calls (CLI + Tofu) run under the target role. Falls back to ASSUME_ROLE_ARN_DEFAULT
in values.yaml if the provider key is absent. When neither is set, behavior is
unchanged — pod credentials (IRSA) are used directly.

- New utils/assume_role: sourceable helper that exports temporary credentials
- fetch_scope_configuration: reads assume_role.arn from scope-configurations
  provider and applies the role immediately after config is fetched
- diagnose/build_context: explicit assume_role sourcing (only build_context
  that bypasses fetch_scope_configuration)
- values.yaml: documents ASSUME_ROLE_ARN_DEFAULT as fallback config option
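
A sourceable helper of this kind could look roughly like the sketch below. It is an assumption-laden illustration rather than the contents of utils/assume_role: the function names, the default session name, and the error wording are all hypothetical. The only external call is the standard `aws sts assume-role` command with `--query` and `--output text`.

```shell
#!/bin/bash
# Sketch of a sourceable assume-role helper (all names are illustrative).

# Parse one line of "<AccessKeyId> <SecretAccessKey> <SessionToken>"
# (tab- or space-separated) and export it as the standard AWS env vars.
export_assumed_credentials() {
  local key secret token
  read -r key secret token <<< "$1"
  [ -n "$key" ] && [ -n "$secret" ] && [ -n "$token" ] || return 1
  export AWS_ACCESS_KEY_ID="$key"
  export AWS_SECRET_ACCESS_KEY="$secret"
  export AWS_SESSION_TOKEN="$token"
}

# Call sts:AssumeRole with the base (IRSA) credentials, then switch to the
# temporary ones. Requires the AWS CLI; nothing here runs it when sourced.
assume_role() {
  local role_arn="$1" session_name="${2:-np-lambda-scope}" creds err_file
  err_file=$(mktemp)
  if ! creds=$(aws sts assume-role \
      --role-arn "$role_arn" \
      --role-session-name "$session_name" \
      --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
      --output text 2>"$err_file"); then
    # Print the STS error on stdout so a stdout-only log collector sees it.
    echo "ERROR: sts:AssumeRole failed for $role_arn: $(cat "$err_file")"
    rm -f "$err_file"
    return 1
  fi
  rm -f "$err_file"
  export_assumed_credentials "$creds"
}
```

Sourcing the file and calling `assume_role "$ASSUME_ROLE_ARN"` would leave the temporary credentials in the current shell, which is why such a helper must be sourced rather than executed.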

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ations

Creates 4 IAM policies covering all AWS operations needed by the lambda scope:
- lambda_policy: Lambda CRUD, versions, aliases, concurrency
- lambda_iam_policy: execution role management (nullplatform-* and np-lambda-*)
- lambda_networking_policy: API Gateway, ALB, Route53
- lambda_storage_policy: ECR, Secrets Manager, CloudWatch, S3 tfstate

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…uffix

When PLACEHOLDER_IMAGE_URI is set in values.yaml the operator has already
chosen the exact tag — no architecture suffix should be appended.
Sets the default to :latest (no arch suffix) for this deployment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The public ECR image only exists as :latest without architecture-specific
tags. Remove the -arm64/-amd64 append logic from the default path.
Users who publish arch-specific images can set PLACEHOLDER_IMAGE_URI
explicitly to the full tag they need.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The existing scope-configurations provider in this account uses a different
schema (.provider.aws_state_bucket) than our Lambda spec (.state.tofu_state_bucket).
Add fallback to support both schemas without requiring a new provider instance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rements policy

The scope execution role was named "${function}-role", which didn't match the
iam:CreateRole/PassRole Resource constraint (arn:aws:iam::*:role/np-lambda-*) in
lambda/requirements, causing AccessDenied at tofu apply. Prefixing aligns the
role name with the policy the assumed role already grants.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
OpenTofu writes its "Error:" block to stderr, but the NP workflow executor only
captures stdout — so the real failure reason (e.g. AWS AccessDenied) never showed
in the logs, leaving only a generic "scope creation failed". Redirect stderr to
stdout on the apply and stop sending the script's own error message to stderr.
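
The effect of the redirect can be reproduced without OpenTofu. In this small sketch `fake_tofu_apply` is a hypothetical stand-in that writes progress to stdout and its error block to stderr, and command substitution plays the role of an executor that captures only stdout.

```shell
#!/bin/bash
# Stand-in for `tofu apply`: progress goes to stdout, the error block to stderr.
fake_tofu_apply() {
  echo "aws_lambda_function.this: Creating..."
  echo "Error: AccessDenied (simulated)" >&2
  return 1
}

# Command substitution captures stdout only, like the workflow executor.
stdout_only=$(fake_tofu_apply 2>/dev/null) || true
merged=$(fake_tofu_apply 2>&1) || true

echo "without redirect: $stdout_only"
echo "with 2>&1:        $merged"
```

Only the second capture contains the `Error:` line, which is the failure reason that was missing from the logs.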

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ements policy

The AWS provider (v5) reads log group tags via logs:ListTagsForResource and
manages them via logs:TagResource/UntagResource — the generic resource-tagging
API — but the policy only granted the deprecated logs:TagLogGroup. Creating a
scope's aws_cloudwatch_log_group failed with AccessDenied on ListTagsForResource.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…R_IMAGE_URI_DEFAULT

Adds an env-var fallback for the Lambda placeholder image, mirroring the existing
ASSUME_ROLE_ARN_DEFAULT pattern. Precedence: scope-config
deployment.placeholder_image_uri > PLACEHOLDER_IMAGE_URI_DEFAULT (values.yaml) >
script's hardcoded default. Lets operators point the placeholder at a private ECR
mirror per account without a scope-configuration value or code changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… update

Container-image Lambdas require the source ECR repo to grant lambda.amazonaws.com
pull access; without it update-function-code fails with "Lambda does not have
permission to access the ECR image". update_function_code now sets the standard
LambdaECRImageRetrievalPolicy on the image's repo (idempotent, best-effort), and
the requirements role gains ecr:Get/SetRepositoryPolicy. Removes the need to set
the policy by hand per application repo.
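
As a rough sketch of the pieces involved, the helpers below split a private ECR image URI into region and repository name and define a pull policy for the Lambda service. The function names are hypothetical and the parsing assumes the usual `<account>.dkr.ecr.<region>.amazonaws.com/<repo>[:tag|@digest]` shape; the actual update_function_code implementation may differ.

```shell
#!/bin/bash
# Hypothetical helpers for a private ECR image URI of the form
#   <account>.dkr.ecr.<region>.amazonaws.com/<repo>[:tag|@digest]

is_private_ecr_uri() {
  [[ "$1" == *.dkr.ecr.*.amazonaws.com/* ]]
}

ecr_region_from_uri() {
  local host="${1%%/*}"            # <account>.dkr.ecr.<region>.amazonaws.com
  local rest="${host#*.dkr.ecr.}"  # <region>.amazonaws.com
  printf '%s\n' "${rest%%.*}"
}

ecr_repo_from_uri() {
  local path="${1#*/}"         # drop the registry host
  path="${path%%@*}"           # drop an @sha256:... digest, if any
  printf '%s\n' "${path%%:*}"  # drop a :tag, if any
}

# Repository policy that lets the Lambda service pull images from the repo.
lambda_pull_policy='{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "LambdaECRImageRetrievalPolicy",
    "Effect": "Allow",
    "Principal": {"Service": "lambda.amazonaws.com"},
    "Action": ["ecr:BatchGetImage", "ecr:GetDownloadUrlForLayer"]
  }]
}'

image="123456789012.dkr.ecr.us-east-1.amazonaws.com/aws-lambda/app:latest"
if is_private_ecr_uri "$image"; then
  echo "repo=$(ecr_repo_from_uri "$image") region=$(ecr_region_from_uri "$image")"
fi
```

One thing worth keeping in mind with this approach: `aws ecr set-repository-policy` replaces the repository's policy text, so a repo that already carries other statements would need them merged into the document first.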

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nt action

The diagnose-deployment action mapped to deployment/workflows/diagnose.yaml,
which did not exist, so every auto-diagnose after a failed deployment errored
with "failed to read workflow file". Adds the workflow mirroring the scope
diagnose flow: lean diagnose/build_context + executor over diagnose/checks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ASSUME_ROLE_ARN_DEFAULT and PLACEHOLDER_IMAGE_URI_DEFAULT carried a real AWS
account ARN/URI committed for testing. The product repo must stay account-agnostic:
both are now documented as account-specific and provided per-installation via the
scope-configurations provider or the agent's extra_envs (Helm), not hardcoded here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…URI_DEFAULT knob

Document why Image-based scopes need a private-ECR placeholder and how the
URI is resolved (provider key > PLACEHOLDER_IMAGE_URI_DEFAULT > public default),
including how to publish one and a troubleshooting entry.

Also re-add PLACEHOLDER_IMAGE_URI_DEFAULT to values.yaml as a commented,
account-agnostic template so operators can pick their own image, and normalize
a stray real-looking account ID in a publish comment to the dummy 123456789012.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e requirements

Reviewer feedback: the standalone requirements/ folder should not sit at the
lambda/ root — all installation-time tofu should live together under a setup module.

- Move lambda/specs/tofu/ -> lambda/setup/ (the operator-applied install module).
- Merge lambda/requirements/ into lambda/setup/ (requirements.tf + outputs.tf, and
  its variables folded into setup/variables.tf); remove the requirements/ folder.
- A single 'tofu apply' in lambda/setup now registers the scope type AND provisions
  the IAM policies. The 4 policies are always created; attaching them stays optional
  via create_role / role_name.
- Add the aws provider (~> 5.0) + provider block to setup/provider.tf and a nullable
  aws_region var (IAM is global). 'name' is now a required setup variable.
- Update backend key to lambda/setup/terraform.tfstate.
- Refresh references: installation.md (cd path + IAM vars table), prerequisites.md
  (setup/main.tf), and the iam/setup comment.

Verified with 'tofu validate' (Success).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@davidf-null davidf-null changed the title feat: dynamic assume-role support, configurable placeholder image & requirements module feat: dynamic assume-role support, configurable placeholder image & install tofu consolidation Jun 4, 2026
# update-function-code fails with "Lambda does not have permission to access
# the ECR image". Idempotent and best-effort (cross-account repos may not be
# writable from here — Lambda would then need the policy set on the source side).
if [[ "$IMAGE_URI" == *.dkr.ecr.*.amazonaws.com/* ]]; then

Contributor

I don't understand why we need this.

What happens if the URI is not an amazonaws.com one?

Author

Lambda with Docker only works with ECR images: https://docs.aws.amazon.com/es_es/lambda/latest/dg/images-create.html

if aws ecr set-repository-policy --repository-name "$ecr_repo" --region "$ecr_region" --policy-text "$lambda_pull_policy" >/dev/null 2>&1; then
  log debug " ✅ ensured Lambda pull policy on ECR repo $ecr_repo"
else
  log warn " ⚠️ could not set Lambda pull policy on ECR repo $ecr_repo (continuing; pull may fail if not already allowed)"

Contributor

Is it confirmed that this can still work? If it is certain to fail later, I would throw an error.

Author

It is a warning for this reason:
The case it protects is cross-account. If the image lives in an ECR repo in another account, the assumed role may not have permission to write the
policy there; but if that policy is already set on the repo owner's side, Lambda can still pull and the deployment works. Today,
set-repository-policy fails → warning → update-function-code still succeeds. If I make it fail hard, that cross-account deployment (which
would otherwise work) would break unnecessarily.

Comment on lines +43 to +45
# Use the image URI as-is. If PLACEHOLDER_IMAGE_URI is not set, the default
# :latest tag is used without any architecture suffix — publish arch-specific
# tags and set PLACEHOLDER_IMAGE_URI explicitly if needed.

Contributor

Let's remove this comment.

Comment thread lambda/scope/tofu/iam/setup Outdated
Comment on lines +7 to +9
iam_role_name="${LAMBDA_FUNCTION_NAME}-role"
# Prefix with "np-lambda-" so the role name matches the iam:CreateRole/PassRole
# Resource constraint in lambda/setup (arn:aws:iam::*:role/np-lambda-*).
iam_role_name="np-lambda-${LAMBDA_FUNCTION_NAME}-role"

Contributor

This is a breaking change. Also, why are we forcing the role name to end in -role? I would use only what comes from the variable and have that convention respected.

Author

Are you saying we should validate the variable's value? In any case, this goes away with the provider change.

Comment thread lambda/scope/tofu/do_tofu Outdated
Comment thread lambda/specs/tofu/outputs.tf
Comment thread lambda/utils/assume_role
@@ -0,0 +1,41 @@
#!/bin/bash
# Sourceable helper — do NOT execute directly.

fedemaleh commented Jun 5, 2026

Contributor

This comment should only say what the script does; let's remove the part about it being sourceable and the requirements.

Comment thread lambda/utils/assume_role
fi
}

if [ -n "${ASSUME_ROLE_ARN:-}" ]; then

Contributor

I would use a more specific name such as "SCOPE_LAMBDA_ASSUME_ROLE_ARN".

In general, a single agent runs different scopes and services. If you use generic variable names (which can be set as agent env vars), collisions are likely.

Comment thread lambda/utils/assume_role
# Expects: ASSUME_ROLE_ARN (exported by fetch_scope_configuration or values.yaml)
# SCOPE_ID (optional, used for the session name)

_ar_log() {

Contributor

I would remove this and assume log exists, as in the rest of the scripts. We build the workflows, so we can make sure it is exported.

Comment thread lambda/utils/assume_role
Comment on lines +27 to +30
_ar_log info "ERROR: sts:AssumeRole failed for $ASSUME_ROLE_ARN"
_ar_log info "$(cat "$_ar_sts_error")"
rm -f "$_ar_sts_error"
return 1

Contributor

This sounds like it should be an exit so that it fails. For the logs, let's use the error or warn level.


# From scope-configurations category
TOFU_STATE_BUCKET=$(echo "$SCOPE_CONFIG" | jq -r '.state.tofu_state_bucket // empty')
TOFU_STATE_BUCKET=$(echo "$SCOPE_CONFIG" | jq -r '.state.tofu_state_bucket // .provider.aws_state_bucket // empty')

Contributor

Where did .provider.aws_state_bucket come from?

I downloaded the payload of a real config and it looks like this:

{
  "attributes": {
    "deployment": {
      "placeholder_image_uri": "855647970243.dkr.ecr.us-east-1.amazonaws.com/aws-lambda/nullplatform-lambda-placeholder:latest"
    },
    "state": {
      "tofu_state_bucket": "gal3-scopes-tfstate-galicia-3-68bb45dd"
    }
  },
  "created_at": "2026-05-19T16:44:54.180Z",
  "dimensions": {},
  "groups": [],
  "id": "70a0a1fa-dea0-4db1-9d6c-fe71b1843186",
  "nrn": "organization=1636958496:account=1807223679",
  "specification_id": "80fc7026-7164-4c09-8a4f-424dc3b6aa50",
  "tags": [],
  "updated_at": "2026-05-19T16:44:54.180Z"
}


PLACEHOLDER_IMAGE_URI=$(echo "$SCOPE_CONFIG" | jq -r '.deployment.placeholder_image_uri // empty')
log debug " ✅ placeholder_image_uri=$PLACEHOLDER_IMAGE_URI"
# Fallback to env var set in values.yaml when the provider does not supply it.

Contributor

Remove this comment, and the ones on lines 81 and 100 too.

Comment thread lambda/utils/fetch_scope_configuration Outdated
Comment on lines +82 to +83
ASSUME_ROLE_ARN=$(echo "$SCOPE_CONFIG" | jq -r '.assume_role.arn // empty')
ASSUME_ROLE_ARN="${ASSUME_ROLE_ARN:-${ASSUME_ROLE_ARN_DEFAULT:-}}"

Contributor

This isn't going to come from scope config; it will come from the new provider, right?

David Fernandez and others added 7 commits June 8, 2026 11:59
… selector

The agent resolves the IAM role to assume from the "AWS IAM" provider
(category Identity & Access Control, spec aws-iam-configuration) declared in
nullplatform, matching its arns list by the "lambda" selector. Precedence:
ASSUME_ROLE_ARN env -> IAM provider -> scope-configurations assume_role.arn
-> ASSUME_ROLE_ARN_DEFAULT -> pod IRSA.
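
The chain above amounts to "first non-empty value wins". A minimal sketch, assuming the two provider lookups have already been done and stored in the hypothetical variables `IAM_PROVIDER_ARN` and `SCOPE_CONFIG_ARN` (the real lookups go through the np CLI and are not reproduced here):

```shell
#!/bin/bash
# Hypothetical helper: print the first non-empty argument; fail if all are empty.
first_non_empty() {
  local candidate
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Candidates in precedence order. IAM_PROVIDER_ARN and SCOPE_CONFIG_ARN are
# illustrative variables standing in for the two provider lookups.
resolve_role_arn() {
  first_non_empty \
    "${ASSUME_ROLE_ARN:-}" \
    "${IAM_PROVIDER_ARN:-}" \
    "${SCOPE_CONFIG_ARN:-}" \
    "${ASSUME_ROLE_ARN_DEFAULT:-}"
}

if role_arn=$(resolve_role_arn); then
  echo "assuming $role_arn"
else
  echo "no role configured; using pod (IRSA) credentials"
fi
```

A non-zero return from the resolver is what maps to the final "pod IRSA" step: nothing is assumed and the base credentials stay in place.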

- assume_role_lib (new): pure arn_for_selector_from_json +
  provider_arn_for_selector (np provider list -> read, since list omits deep
  attributes). Mirrors the services-s3 mechanism.
- fetch_scope_configuration: insert the provider-by-selector lookup as
  priority 2, deriving the account NRN from the scope NRN (strip :namespace=).
- diagnose/build_context: same resolution before sourcing assume_role (it
  previously sourced assume_role without ever resolving an ARN).
- values.yaml: document the precedence and the ASSUME_ROLE_SELECTOR override.
- tests: BATS unit tests for both lib functions using the mock_np harness.
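
Deriving the account NRN by stripping `:namespace=` onward can be done with a single parameter expansion. In this sketch the function name is hypothetical, and the segment layout of a scope NRN (namespace, application, scope after the account) is an assumption for illustration.

```shell
#!/bin/bash
# Hypothetical helper: derive the account NRN by dropping everything from
# ":namespace=" onward. An NRN without a namespace segment is printed as-is.
account_nrn_from_scope_nrn() {
  printf '%s\n' "${1%%:namespace=*}"
}

# The segment values below are made up for illustration.
account_nrn_from_scope_nrn "organization=1:account=2:namespace=3:application=4:scope=5"
# prints: organization=1:account=2
```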

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Commit the account's private ECR placeholder image as the default, overridable
per scope via the scope-config provider or per agent via Helm extra_envs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rkflow

The assumed credentials were never reaching the steps that need them: assume
role ran inside build_context, but the workflow engine drops a step's exported
vars unless declared as output:environment — and no workflow declared the AWS
credentials. So tofu/aws steps ran with the pod's IRSA identity and failed on
permissions.

Fix: a dedicated `assume_role` step runs first in every AWS-touching workflow,
resolves the role and assumes it, and exports AWS_ACCESS_KEY_ID/SECRET/
SESSION_TOKEN as output:environment so all later steps inherit them.

- utils/assume_role_step (new): resolves NRN from CONTEXT, assumes, exports creds.
- utils/assume_role_lib: add resolve_assume_role_arn (env -> IAM provider by
  selector -> scope-config -> DEFAULT) and scope_config_assume_role_arn.
- fetch_scope_configuration, diagnose/build_context: remove the now-centralized
  assume-role resolution (single source of truth; avoids self-assume).
- 18 workflow yamls: prepend the assume_role step with the 3 credential outputs.
- assume_role_lib.bats: tests for the precedence chain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The execution role name was hardcoded as np-lambda-<function>-role. Resolve the
prefix via get_config_value (scope-config provider lambda.execution_role_prefix
> LAMBDA_EXECUTION_ROLE_PREFIX env > default "np-lambda-"), keeping the previous
name as the default so existing scopes are unaffected.

Warn (non-blocking) when the prefix falls outside the assume role's IAM policy
constraint (np-lambda-* / nullplatform-*), since CreateRole/PassRole would
otherwise be denied unless that policy is widened.
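
The prefix handling and the non-blocking warning could be sketched as follows. Both function names are hypothetical, and the prefix is passed in directly instead of going through get_config_value, whose interface is not shown in this PR.

```shell
#!/bin/bash
# Hypothetical check: succeed when a role-name prefix falls inside the
# np-lambda-* / nullplatform-* patterns allowed by the IAM policy.
prefix_within_policy() {
  case "$1" in
    np-lambda-*|nullplatform-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Build "<prefix><function>-role", defaulting the prefix to "np-lambda-".
build_execution_role_name() {
  local prefix="${1:-np-lambda-}" function_name="$2"
  if ! prefix_within_policy "$prefix"; then
    # Non-blocking: warn on stderr and continue.
    echo "WARN: prefix '$prefix' is outside np-lambda-* / nullplatform-*; CreateRole/PassRole may be denied" >&2
  fi
  printf '%s\n' "${prefix}${function_name}-role"
}

build_execution_role_name "" "checkout"
# prints: np-lambda-checkout-role
```

Keeping the check advisory means an installation that has widened its IAM policy can use any prefix without changing the script.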

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the comment block above the tofu run; the 2>&1 redirect itself is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tofu

Relocate the standalone install module (IAM role+policies and the NP
scope_definition registration) so the .tf lives next to the .json.tpl specs it
consumes. No behavior change; still a standalone root module.

- git mv lambda/setup -> lambda/specs/tofu (history preserved)
- backend.tf key: lambda/setup/... -> lambda/specs/tofu/...
- installation.md / prerequisites.md: update the paths

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The specs/tofu module now also creates the Lambda IAM requirements
(aws resources), so its provider pin must be compatible with consumers
running the AWS provider 6.x line (EKS/agent stack). ~> 5.0 made the
provider graph unresolvable when composed with those modules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>