This document tracks findings from a comprehensive audit of the Netlix Platform project. The goal is to make this an enterprise-grade HashiCorp reference architecture suitable for customer demos and internal enablement.
Architecture: HCP Terraform Stacks (8 components, 2 deployments)
| # | Item | Resolution |
|---|---|---|
| 1 | Secrets committed to git | Removed all .tfvars, state files, .terraform/ dirs |
| 2 | Terraform state files in git | Properly gitignored, HCP Terraform manages state |
| 3 | .terraform/ binaries in git |
Properly gitignored |
| 4 | Architecture split (Stacks vs Workspaces) | Committed to Terraform Stacks |
| 5 | Provider version constraints too loose | Using ~> pessimistic constraints in providers |
| 6 | Missing tags (Sentinel would block) | default_tags in AWS provider config |
| 7 | No deletion_protection on critical resources |
RDS: multi-AZ, deletion protection, final snapshot for non-dev envs |
| 8 | Sentinel test coverage incomplete | All 6 policies now have pass/fail tests with mock data |
| 10 | CI references non-existent Go code | CI validates 8 Terraform components + format check + Sentinel tests |
| 11 | No CODEOWNERS file |
Created .github/CODEOWNERS with team ownership rules |
| 13 | Hardcoded values in K8s manifests | Kustomize overlays for dev/staging with environment patches |
| 14 | Single NAT gateway | Multi-NAT for non-dev envs (single_nat_gateway = environment == "dev") |
| 16 | VPC flow logs not implemented | Networking component has flow logs |
| 18 | EKS endpoint without CIDR restriction | Configurable via cluster_endpoint_public_access_cidrs; staging uses private-only |
| 28 | No terraform fmt CI check |
CI has format check job for components + bootstrap |
| 30 | Inconsistent required_version |
Standardized across all components (>= 1.9) |
| # | Item | Status | Resolution |
|---|---|---|---|
| 31 | Dockerfile runs as root | Done | Switched to distroless/static-debian12:nonroot, USER nonroot:nonroot |
| 32 | No securityContext on pods |
Done | Added pod-level (runAsNonRoot, runAsUser: 65532, seccompProfile: RuntimeDefault) and container-level (allowPrivilegeEscalation: false, readOnlyRootFilesystem: true, drop: ALL) to web + api |
| 33 | No container image scanning in CI/CD | Done | Trivy scan (CRITICAL/HIGH) in CI (image-scan job) and CD (gate before GHCR push) |
| 34 | No Kubernetes NetworkPolicies | Done | Default-deny + explicit allow for web/api (DNS, Consul, Vault, Datadog egress) |
| 35 | No Pod Security Standards on namespace | Done | Namespace manifest: enforce baseline, warn+audit restricted |
| 36 | Bootstrap IAM role uses AdministratorAccess | Done | Scoped inline policy: EC2, EKS, RDS, Route53, ACM, IAM, KMS, CloudWatch, SNS, STS |
| 37 | Vault TFC policy is path "*" with sudo |
Won't fix | HCP Vault requires full access at parent namespace for cross-namespace operations; scoped paths don't propagate to child namespaces |
| # | Item | Status | Resolution |
|---|---|---|---|
| 38 | CD pipeline commits directly to main | Done | Branch-based gitops: CD triggers on dev/staging branches, promotion via PR workflow |
| 39 | No manifest validation in CI | Done | kubeconform validates rendered Kustomize overlays (dev/staging) in CI; skips Vault CRDs |
| 40 | No promotion strategy between envs | Done | Manual promotion workflow (promote.yaml): dev→staging→main with PR review gate |
| 41 | Staging image tag is latest |
Done | SHA-pinned tags per environment (staging-<sha>), set by branch-aware CD |
| # | Item | Status | Resolution |
|---|---|---|---|
| 42 | Datadog configured but agent not deployed | Done | Replaced with AWS CloudWatch Container Insights (EKS addon); IRSA role for agent; removed all Datadog config |
| 43 | No alerting or SLO definitions | Done | 10 CloudWatch alarms (pod CPU/memory/restarts, node CPU/memory, RDS, VPC flow logs); SNS notifications |
| 44 | VPC Flow Logs not analyzed | Done | CloudWatch metric filter on REJECT actions + alarm (>100 rejected/5min); SNS notifications |
| 48 | No observability dashboard | Done | CloudWatch dashboard with 6 rows: cluster, pods, health, RDS, network, alarm status |
| # | Item | Status | Resolution |
|---|---|---|---|
| 47 | No load testing framework | Done | Distributed Locust setup (master + workers) on EKS; GitHub Actions workflow with configurable thresholds; validates HPA autoscaling |
| # | Item | Status | Resolution |
|---|---|---|---|
| 45 | No EKS cluster backup (Velero) | TODO | Add Velero for CRD/ConfigMap backup |
| 46 | No cross-region RDS replica | TODO | Add read replica or cross-region backup copy |
The if block after the main rule uses top-level imperative code. While valid Sentinel,
it could be cleaner as a helper rule with when clause. Cosmetic only.
path "*" with full sudo capabilities in terraform/components/vault-config/auth.tf.
This is intentional for the bootstrap userpass admin account, but should be documented
as bootstrap-only. For production, create a scoped admin policy.
| # | Item | Recommendation |
|---|---|---|
| 12 | No CLAUDE.md project instructions |
Add project conventions file |
| 17 | No Terraform tests (.tftest.hcl) |
Add integration tests for critical components |
| 23 | No ResourceQuotas | Add per-namespace quotas in K8s manifests |
| 24 | No PodDisruptionBudgets | Add PDBs for web/api workloads |
| 25 | No HorizontalPodAutoscaler | Add HPA for production |
| 29 | No pre-commit hooks | Add .pre-commit-config.yaml |
| 34 | No .terraform-docs.yml |
Add for auto-generated module docs |
- Remove all secrets from repo
- Commit to Terraform Stacks architecture
- Create 8 Stack components with dependency wiring
- Configure OIDC workload identity (no static credentials)
- Set up HCP Terraform variable sets
- Bootstrap AWS OIDC trust
- Update
.gitignorefor Stacks - Update CI workflow
- Update README for Stacks
- Fix existing broken Sentinel tests (missing mock files)
- Add tests for
enforce-cost-limit,no-public-s3-buckets,require-vpc-flow-logs - Integrate Sentinel tests into main CI workflow
- Kustomize overlays for K8s manifests (dev/staging)
- Parameterize Datadog monitoring (cluster name, environment tags)
- Add CODEOWNERS
- EKS public endpoint CIDR restriction (configurable per deployment)
- Fix RDS protection logic (was checking for "production", now checks
!= "dev")
- Add pre-commit hooks (
.pre-commit-config.yaml) - Add CLAUDE.md project conventions
- Add PodDisruptionBudgets for web/api
- Add HorizontalPodAutoscalers (CPU 70%, memory 80%, 2-10 replicas)
- Add ResourceQuota for consul namespace
- Add Terraform tests (
.tftest.hcl) for networking, dns, eks, rds - Add
.terraform-docs.ymlfor auto-generated module docs - Fix networking NAT gateway logic (was checking
"production", now"dev")
Checkpoint tag before starting: checkpoint/pre-phase1-vault-rotation-2026-04-30-dev.
Three items from the cross-cutting audit (Phase 5 follow-up), sequenced to minimize migration risk and to give the demo a coherent "least-privilege + live rotation" narrative:
- 6.1 — Per-service Vault policies (least-privilege)
- 6.2 — JWT signing-key rotation (N+1 keys, no pod restart)
- 6.3 — Root token hygiene (non-root admin for ongoing TF, documented rotation)
Each item is committed separately; each is independently revertible. None of them remove
existing roles/policies until the new ones are proven — old netlix-vso and netlix-app
policies stay in place as backstops during migration.
Problem: terraform/components/vault-config/policies.tf:8 — both netlix-vso
and netlix-app grant secret/data/netlix/* (wildcard). Compromise of any service
exposes every secret in secret/netlix/* (db, jwt, grafana admin, feature flags).
Today VSO authenticates once with this wildcard role and syncs everything.
Approach: Keep VSO's centralized-pull model, but split by secret class via per-secret ServiceAccounts that VSO impersonates. Each secret class gets its own Vault role + policy scoped to a single KVv2 / PKI path.
Resources to add:
| Service Account (consul) | VaultAuth CRD (consul) | Vault role (env) | Vault policy | Grants |
|---|---|---|---|---|
vso-shop-db |
vault-auth-shop-db |
netlix-shop-db |
netlix-shop-db-reader |
read secret/data/netlix/db |
vso-shop-jwt |
vault-auth-shop-jwt |
netlix-shop-jwt |
netlix-shop-jwt-reader |
read secret/data/netlix/jwt |
vso-shop-config |
vault-auth-shop-config |
netlix-shop-config |
netlix-shop-config-reader |
read secret/data/netlix/featureflags |
vso-shop-pki |
vault-auth-shop-pki |
netlix-shop-pki |
netlix-shop-pki-issuer |
pki_int/issue/netlix-app (create only) |
Migration steps (each safe to revert):
- Add 4 SAs in app/manifests/base/vso-impersonators.yaml (new file)
- Add 4 Vault auth roles + 4 policies in terraform/components/vault-config/ (new files:
policies-per-service.tf,auth-per-service.tf) - Add 4 VaultAuth CRDs in app/manifests/base/vault-auth-per-secret.yaml (new file)
- Switch app/manifests/shop/vault-secrets.yaml
vaultAuthRef:default→ per-secret name - Same for app/manifests/shop/feature-flags.yaml and app/manifests/shop/pki-secrets.yaml
- Old
netlix-vsopolicy/role kept in place (becomes unused but doesn't break anything)
Reversibility: revert the manifest commit → VaultStaticSecrets fall back to default VaultAuth → netlix-vso wildcard role still works.
Problem: auth/main.go:107-121 reads JWT_SIGNING_KEY once at startup;
orders/main.go:151-171 does the same. Rotating the key in Vault breaks every
pre-rotation token until both services are fully re-rolled.
Approach: Mirror the feature-flag pattern (worked great there):
- Vault KVv2 stores a key-set:
{"keys": [{"id":"v2","key":"...","status":"primary"}, {"id":"v1","key":"...","status":"verifying"}]} - VSO syncs to a K8s Secret with one key
keys.json - Auth + orders mount the Secret as a file (no env var), poll every 30 s
- Auth signs with the
primarykey, setskidJWT header - Orders verifies by
kid(looks up matching key in current set), accepts bothprimaryandverifyingkeys - Rotation procedure:
vault kv put secret/netlix/jwt keys=...— both pods pick up within ~60 s, no restart, in-flight tokens stay valid
Backwards compatibility: Tokens with no kid header (old format) fall through to the primary key, so existing sessions keep working through the upgrade.
File changes:
- Edit terraform/components/vault-config/kv.tf:38-46 — multi-key structure with
lifecycle.ignore_changes = [data_json] - Edit app/manifests/shop/vault-secrets.yaml:69-73 — VSO template writes
keys.jsonfile, not env var - Edit app/manifests/shop/auth.yaml — mount
shop-jwtas a volume, dropenvFrom: shop-jwt - Edit app/manifests/shop/orders.yaml — same volume mount
- New: app/services/auth/jwks.go —
JWKSManagerwith file watcher, sign/verify helpers - New: app/services/orders/jwks.go — verify-only JWKS helper
- Edit app/services/auth/main.go and auth/jwt.go — use JWKS instead of single key
- Edit app/services/orders/main.go — use JWKS verifier
Reversibility: revert source + manifest commit. Existing tokens still verify with the primary key.
Problem: terraform/workspaces/vault-cluster/providers.tf:35-39 uses var.vault_root_token for every TF apply. The token is in HCP TF state forever and never rotated.
Approach (low-risk): Don't try to automate root rotation — that's how you accidentally lock yourself out. Instead:
- Add a
vault_tokenresource that issues a long-lived (1 year), renewable, non-root admin token from the existing admin policy - Output it (sensitive)
- Document the manual rotation procedure in docs/vault-root-rotation.md:
- First apply: bootstraps with
var.vault_root_tokenas today - Operator copies the new admin token from
terraform output -raw tf_admin_tokeninto HCP TF workspace varvault_root_token - Re-apply: provider now uses non-root admin token
- Operator runs
vault operator generate-rootto issue a fresh root token (recovery key quorum) - Operator runs
vault token revoke -accessor <old-root-accessor>to invalidate the original
- First apply: bootstraps with
- Add audit-log query example in the doc — proves the rotation is observable
Reversibility: revert the resource + remove the doc. No actual cluster state changes happen until the operator runs the manual procedure.
- Commit A —
feat(vault): per-service Vault policies (Phase 6.1): adds new resources, no behavior change yet - Commit B —
feat(vault): switch VaultStaticSecrets to per-service auth (Phase 6.1 cutover): flipsvaultAuthRefon each secret one at a time. Validate each in Loki +kubectl get secret. - Commit C —
feat(vault): JWT N+1 key rotation (Phase 6.2): depends on thenetlix-shop-jwt-readerpolicy from A; multi-key structure + JWKS code - Commit D —
feat(vault): non-root admin token + rotation runbook (Phase 6.3): independent, just adds Terraform resource + docs
Each commit lands on dev, gets applied via HCP TF, validated, then we proceed to the next.