Netlix Platform — Enterprise Improvement Plan

This document tracks findings from a comprehensive audit of the Netlix Platform project. The goal is to make this an enterprise-grade HashiCorp reference architecture suitable for customer demos and internal enablement.

Architecture: HCP Terraform Stacks (8 components, 2 deployments)

Completed

#	Item	Resolution
1	Secrets committed to git	Removed all `.tfvars`, state files, `.terraform/` dirs
2	Terraform state files in git	Properly gitignored, HCP Terraform manages state
3	`.terraform/` binaries in git	Properly gitignored
4	Architecture split (Stacks vs Workspaces)	Committed to Terraform Stacks
5	Provider version constraints too loose	Using `~>` pessimistic constraints in providers
6	Missing tags (Sentinel would block)	`default_tags` in AWS provider config
7	No `deletion_protection` on critical resources	RDS: multi-AZ, deletion protection, final snapshot for non-dev envs
8	Sentinel test coverage incomplete	All 6 policies now have pass/fail tests with mock data
10	CI references non-existent Go code	CI validates 8 Terraform components + format check + Sentinel tests
11	No `CODEOWNERS` file	Created `.github/CODEOWNERS` with team ownership rules
13	Hardcoded values in K8s manifests	Kustomize overlays for dev/staging with environment patches
14	Single NAT gateway	Multi-NAT for non-dev envs (`single_nat_gateway = environment == "dev"`)
16	VPC flow logs not implemented	Networking component has flow logs
18	EKS endpoint without CIDR restriction	Configurable via `cluster_endpoint_public_access_cidrs`; staging uses private-only
28	No `terraform fmt` CI check	CI has format check job for components + bootstrap
30	Inconsistent `required_version`	Standardized across all components (`>= 1.9`)

Phase 5 — Enterprise Hardening

Container & Workload Security

#	Item	Status	Resolution
31	Dockerfile runs as root	Done	Switched to `distroless/static-debian12:nonroot`, `USER nonroot:nonroot`
32	No `securityContext` on pods	Done	Added pod-level (`runAsNonRoot`, `runAsUser: 65532`, `seccompProfile: RuntimeDefault`) and container-level (`allowPrivilegeEscalation: false`, `readOnlyRootFilesystem: true`, `drop: ALL`) to web + api
33	No container image scanning in CI/CD	Done	Trivy scan (CRITICAL/HIGH) in CI (`image-scan` job) and CD (gate before GHCR push)
34	No Kubernetes NetworkPolicies	Done	Default-deny + explicit allow for web/api (DNS, Consul, Vault, Datadog egress)
35	No Pod Security Standards on namespace	Done	Namespace manifest: enforce baseline, warn+audit restricted
36	Bootstrap IAM role uses AdministratorAccess	Done	Scoped inline policy: EC2, EKS, RDS, Route53, ACM, IAM, KMS, CloudWatch, SNS, STS
37	Vault TFC policy is `path "*"` with sudo	Won't fix	HCP Vault requires full access at parent namespace for cross-namespace operations; scoped paths don't propagate to child namespaces

CI/CD Maturity

#	Item	Status	Resolution
38	CD pipeline commits directly to main	Done	Branch-based gitops: CD triggers on `dev`/`staging` branches, promotion via PR workflow
39	No manifest validation in CI	Done	`kubeconform` validates rendered Kustomize overlays (dev/staging) in CI; skips Vault CRDs
40	No promotion strategy between envs	Done	Manual promotion workflow (`promote.yaml`): dev→staging→main with PR review gate
41	Staging image tag is `latest`	Done	SHA-pinned tags per environment (`staging-<sha>`), set by branch-aware CD

Observability

#	Item	Status	Resolution
42	Datadog configured but agent not deployed	Done	Replaced with AWS CloudWatch Container Insights (EKS addon); IRSA role for agent; removed all Datadog config
43	No alerting or SLO definitions	Done	10 CloudWatch alarms (pod CPU/memory/restarts, node CPU/memory, RDS, VPC flow logs); SNS notifications
44	VPC Flow Logs not analyzed	Done	CloudWatch metric filter on REJECT actions + alarm (>100 rejected/5min); SNS notifications
48	No observability dashboard	Done	CloudWatch dashboard with 6 rows: cluster, pods, health, RDS, network, alarm status

Performance & Load Testing

#	Item	Status	Resolution
47	No load testing framework	Done	Distributed Locust setup (master + workers) on EKS; GitHub Actions workflow with configurable thresholds; validates HPA autoscaling

Disaster Recovery

#	Item	Status	Resolution
45	No EKS cluster backup (Velero)	TODO	Add Velero for CRD/ConfigMap backup
46	No cross-region RDS replica	TODO	Add read replica or cross-region backup copy

Remaining — Medium Priority

9. Sentinel `enforce-cost-limit` Top-Level Print

The `if` block after the main rule uses top-level imperative code. While valid Sentinel, it would be cleaner as a helper rule with a `when` clause. Cosmetic only.

19. Vault Admin Policy Too Broad

`path "*"` with full `sudo` capabilities is granted in `terraform/components/vault-config/auth.tf`. This is intentional for the bootstrap userpass admin account, but it should be documented as bootstrap-only. For production, create a scoped admin policy.

Remaining — Low Priority (Operational Excellence)

#	Item	Recommendation
12	No `CLAUDE.md` project instructions	Add project conventions file
17	No Terraform tests (`.tftest.hcl`)	Add integration tests for critical components
23	No ResourceQuotas	Add per-namespace quotas in K8s manifests
24	No PodDisruptionBudgets	Add PDBs for web/api workloads
25	No HorizontalPodAutoscaler	Add HPA for production
29	No pre-commit hooks	Add `.pre-commit-config.yaml`
34	No `.terraform-docs.yml`	Add for auto-generated module docs

Implementation Progress

Phase 1 — Security & Architecture (done)

Remove all secrets from repo
Commit to Terraform Stacks architecture
Create 8 Stack components with dependency wiring
Configure OIDC workload identity (no static credentials)
Set up HCP Terraform variable sets
Bootstrap AWS OIDC trust
Update .gitignore for Stacks
Update CI workflow
Update README for Stacks

Phase 2 — Governance (done)

Fix existing broken Sentinel tests (missing mock files)
Add tests for enforce-cost-limit, no-public-s3-buckets, require-vpc-flow-logs
Integrate Sentinel tests into main CI workflow

Phase 3 — Hardening (done)

Kustomize overlays for K8s manifests (dev/staging)
Parameterize Datadog monitoring (cluster name, environment tags)
Add CODEOWNERS
EKS public endpoint CIDR restriction (configurable per deployment)
Fix RDS protection logic (was checking for "production", now checks != "dev")

Phase 4 — Operational Excellence (done)

Add pre-commit hooks (.pre-commit-config.yaml)
Add CLAUDE.md project conventions
Add PodDisruptionBudgets for web/api
Add HorizontalPodAutoscalers (CPU 70%, memory 80%, 2-10 replicas)
Add ResourceQuota for consul namespace
Add Terraform tests (.tftest.hcl) for networking, dns, eks, rds
Add .terraform-docs.yml for auto-generated module docs
Fix networking NAT gateway logic (was checking "production", now "dev")
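
To make the HPA settings listed above concrete (CPU 70%, memory 80%, 2-10 replicas), here is a minimal sketch of the replica calculation the Kubernetes HPA documentation describes: each metric proposes `ceil(currentReplicas * currentValue / targetValue)`, the largest proposal wins, and the result is clamped to the min/max bounds. The function name and parameters are hypothetical, and the sketch ignores the controller's tolerance band and stabilization window.

```python
import math


def desired_replicas(current, cpu_util, mem_util,
                     cpu_target=70.0, mem_target=80.0,
                     min_replicas=2, max_replicas=10):
    """Approximate the HPA replica count for two utilization metrics.

    Utilization values are percentages of the pod resource requests.
    Targets and bounds default to the values listed in Phase 4.
    """
    proposals = [
        math.ceil(current * cpu_util / cpu_target),
        math.ceil(current * mem_util / mem_target),
    ]
    # The largest per-metric proposal wins, then the bounds are applied.
    return max(min_replicas, min(max_replicas, max(proposals)))
```

For example, two replicas running at 140% CPU against the 70% target would be scaled to four, while a very large spike is still capped at ten replicas.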

Phase 6 — Vault Security Hardening (in flight)

Checkpoint tag before starting: `checkpoint/pre-phase1-vault-rotation-2026-04-30-dev`.

Goals

Three items from the cross-cutting audit (Phase 5 follow-up), sequenced to minimize migration risk and to give the demo a coherent "least-privilege + live rotation" narrative:

6.1 — Per-service Vault policies (least-privilege)
6.2 — JWT signing-key rotation (N+1 keys, no pod restart)
6.3 — Root token hygiene (non-root admin for ongoing TF, documented rotation)

Each item is committed separately and is independently revertible. None of them removes existing roles or policies until the new ones are proven: the old `netlix-vso` and `netlix-app` policies stay in place as backstops during migration.

6.1 — Per-service Vault policies (foundation)

Problem: in `terraform/components/vault-config/policies.tf:8`, both the `netlix-vso` and `netlix-app` policies grant `secret/data/netlix/*` (wildcard). Compromise of any service exposes every secret under `secret/netlix/*` (db, jwt, grafana admin, feature flags). Today VSO authenticates once with this wildcard role and syncs everything.

Approach: Keep VSO's centralized-pull model, but split by secret class via per-secret ServiceAccounts that VSO impersonates. Each secret class gets its own Vault role + policy scoped to a single KVv2 / PKI path.

Resources to add:

Service Account (consul)	VaultAuth CRD (consul)	Vault role (env)	Vault policy	Grants
`vso-shop-db`	`vault-auth-shop-db`	`netlix-shop-db`	`netlix-shop-db-reader`	read `secret/data/netlix/db`
`vso-shop-jwt`	`vault-auth-shop-jwt`	`netlix-shop-jwt`	`netlix-shop-jwt-reader`	read `secret/data/netlix/jwt`
`vso-shop-config`	`vault-auth-shop-config`	`netlix-shop-config`	`netlix-shop-config-reader`	read `secret/data/netlix/featureflags`
`vso-shop-pki`	`vault-auth-shop-pki`	`netlix-shop-pki`	`netlix-shop-pki-issuer`	`pki_int/issue/netlix-app` (create only)
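
The effect of the split can be sketched as follows: each policy from the table above is rendered as a single-path Vault policy document, and a simple check shows that one policy no longer reaches another secret class. This is illustrative Python, not the Terraform in `vault-config/`. The helper names are hypothetical, and the capability lists are an assumption read off the Grants column.

```python
# Policy name -> (path, capabilities), copied from the table above.
# The capability lists are an assumption based on the "Grants" column.
GRANTS = {
    "netlix-shop-db-reader": ("secret/data/netlix/db", ["read"]),
    "netlix-shop-jwt-reader": ("secret/data/netlix/jwt", ["read"]),
    "netlix-shop-config-reader": ("secret/data/netlix/featureflags", ["read"]),
    "netlix-shop-pki-issuer": ("pki_int/issue/netlix-app", ["create"]),
}


def render_policy(path, capabilities):
    """Render one path stanza in Vault's HCL policy syntax."""
    caps = ", ".join(f'"{c}"' for c in capabilities)
    return f'path "{path}" {{\n  capabilities = [{caps}]\n}}\n'


def allowed(policy_name, path, capability):
    """Exact-path check; none of the per-service policies use a glob."""
    granted_path, caps = GRANTS[policy_name]
    return path == granted_path and capability in caps


def wildcard_allowed(path):
    """What the old secret/data/netlix/* grant matches (trailing glob = prefix)."""
    return path.startswith("secret/data/netlix/")
```

With these definitions, `allowed("netlix-shop-db-reader", "secret/data/netlix/jwt", "read")` is false, whereas `wildcard_allowed("secret/data/netlix/jwt")` is true, which is the blast-radius difference the migration is after.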

Migration steps (each safe to revert):

Add 4 SAs in `app/manifests/base/vso-impersonators.yaml` (new file)
Add 4 Vault auth roles + 4 policies in `terraform/components/vault-config/` (new files: `policies-per-service.tf`, `auth-per-service.tf`)
Add 4 VaultAuth CRDs in `app/manifests/base/vault-auth-per-secret.yaml` (new file)
Switch `vaultAuthRef` in `app/manifests/shop/vault-secrets.yaml` from `default` to the per-secret name
Do the same in `app/manifests/shop/feature-flags.yaml` and `app/manifests/shop/pki-secrets.yaml`
Keep the old `netlix-vso` policy/role in place (it becomes unused but doesn't break anything)

Reversibility: revert the manifest commit → VaultStaticSecrets fall back to default VaultAuth → netlix-vso wildcard role still works.

6.2 — JWT signing-key rotation (N+1 keys, hot-reload)

Problem: `auth/main.go:107-121` reads `JWT_SIGNING_KEY` once at startup; `orders/main.go:151-171` does the same. Rotating the key in Vault breaks every pre-rotation token until both services are fully re-rolled.

Approach: Mirror the feature-flag pattern (worked great there):

Vault KVv2 stores a key set: `{"keys": [{"id":"v2","key":"...","status":"primary"}, {"id":"v1","key":"...","status":"verifying"}]}`
VSO syncs it to a K8s Secret with a single key, `keys.json`
Auth + orders mount the Secret as a file (no env var) and poll it every 30 s
Auth signs with the primary key and sets the `kid` JWT header
Orders verifies by `kid` (looks up the matching key in the current set) and accepts both primary and verifying keys
Rotation procedure: `vault kv put secret/netlix/jwt keys=...`; both pods pick up the new set within ~60 s, with no restart, and in-flight tokens stay valid

Backwards compatibility: Tokens with no `kid` header (old format) fall through to the primary key, so existing sessions keep working through the upgrade.
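
The sign and verify rules above can be sketched as follows. This is illustrative Python rather than the Go in the auth and orders services. It assumes HMAC-SHA256 (HS256) tokens, which the plan does not state, and all function names are hypothetical. Expiry and other claim checks are omitted.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _signature(key: str, signing_input: str) -> str:
    mac = hmac.new(key.encode(), signing_input.encode(), hashlib.sha256)
    return _b64url(mac.digest())


def primary(keyset):
    """Return the single key marked as primary."""
    return next(k for k in keyset["keys"] if k["status"] == "primary")


def sign(keyset, claims, with_kid=True):
    """Sign with the primary key; with_kid=False imitates the old token format."""
    key = primary(keyset)
    header = {"alg": "HS256", "typ": "JWT"}
    if with_kid:
        header["kid"] = key["id"]
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    return signing_input + "." + _signature(key["key"], signing_input)


def verify(keyset, token):
    """Return the claims if the signature checks out, else None."""
    try:
        header_b64, payload_b64, sig = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
    except ValueError:
        return None
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return None
    kid = header.get("kid")
    if kid is None:
        key = primary(keyset)  # old-format token: fall through to primary
    else:
        key = next((k for k in keyset["keys"] if k["id"] == kid), None)
    if key is None or key["status"] not in ("primary", "verifying"):
        return None
    expected = _signature(key["key"], header_b64 + "." + payload_b64)
    if not hmac.compare_digest(expected.encode(), sig.encode()):
        return None
    try:
        return json.loads(_b64url_decode(payload_b64))
    except ValueError:
        return None
```

Under these assumptions, a token signed with `v1` before rotation still verifies once `v1` is demoted to `verifying`. A token without `kid` only verifies while its signing key is still the primary one, so the first rotation would presumably need to wait until old-format tokens have expired.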

File changes:

Edit `terraform/components/vault-config/kv.tf:38-46`: multi-key structure with `lifecycle.ignore_changes = [data_json]`
Edit `app/manifests/shop/vault-secrets.yaml:69-73`: VSO template writes a `keys.json` file, not an env var
Edit `app/manifests/shop/auth.yaml`: mount `shop-jwt` as a volume, drop `envFrom: shop-jwt`
Edit `app/manifests/shop/orders.yaml`: same volume mount
New `app/services/auth/jwks.go`: `JWKSManager` with file watcher, sign/verify helpers
New `app/services/orders/jwks.go`: verify-only JWKS helper
Edit `app/services/auth/main.go` and `auth/jwt.go`: use JWKS instead of single key
Edit `app/services/orders/main.go`: use JWKS verifier
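
The hot-reload behaviour planned for the `jwks.go` helpers can be sketched as follows, again in illustrative Python rather than Go. The class and method names are hypothetical. The sketch polls the mounted `keys.json`, swaps in a new key set only when it parses and has exactly one primary key, and otherwise keeps serving the last good set.

```python
import json
import threading


class KeySetFile:
    """Holds the most recently validated key set read from a mounted file."""

    def __init__(self, path):
        self._path = path
        self._lock = threading.Lock()
        self._raw = None
        self._keyset = None
        self.reload()  # fail fast at startup if the file is missing or invalid

    def reload(self):
        """Re-read the file; return True if a new key set was swapped in."""
        with open(self._path, "rb") as fh:
            raw = fh.read()
        if raw == self._raw:
            return False
        candidate = json.loads(raw)  # raises ValueError on malformed JSON
        keys = candidate.get("keys", []) if isinstance(candidate, dict) else []
        if sum(1 for k in keys if k.get("status") == "primary") != 1:
            raise ValueError("key set must contain exactly one primary key")
        with self._lock:
            self._raw, self._keyset = raw, candidate
        return True

    def current(self):
        with self._lock:
            return self._keyset

    def poll_forever(self, interval_s=30.0, stop=None):
        """Poll until `stop` is set; a bad file never replaces a good key set."""
        if stop is None:
            stop = threading.Event()
        while not stop.wait(interval_s):
            try:
                self.reload()
            except (OSError, ValueError):
                pass  # keep serving the last good key set
```

A service would typically run `poll_forever` in a daemon thread and call `current()` on each request. Comparing file contents rather than timestamps keeps the check independent of how the Secret volume updates the file.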

Reversibility: revert source + manifest commit. Existing tokens still verify with the primary key.

6.3 — Root token hygiene

Problem: `terraform/workspaces/vault-cluster/providers.tf:35-39` uses `var.vault_root_token` for every TF apply. The token stays in HCP TF state indefinitely and is never rotated.

Approach (low-risk): Don't try to automate root rotation — that's how you accidentally lock yourself out. Instead:

Add a `vault_token` resource that issues a long-lived (1 year), renewable, non-root admin token from the existing admin policy
Output it (sensitive)
Document the manual rotation procedure in `docs/vault-root-rotation.md`:
- First apply: bootstraps with `var.vault_root_token` as today
- Operator copies the new admin token from `terraform output -raw tf_admin_token` into HCP TF workspace var `vault_root_token`
- Re-apply: provider now uses non-root admin token
- Operator runs `vault operator generate-root` to issue a fresh root token (recovery key quorum)
- Operator runs `vault token revoke -accessor <old-root-accessor>` to invalidate the original
Add an audit-log query example to the doc to prove the rotation is observable

Reversibility: revert the resource + remove the doc. No actual cluster state changes happen until the operator runs the manual procedure.

Order of operations

Commit A, `feat(vault): per-service Vault policies (Phase 6.1)`: adds new resources, no behavior change yet
Commit B, `feat(vault): switch VaultStaticSecrets to per-service auth (Phase 6.1 cutover)`: flips `vaultAuthRef` on each secret one at a time. Validate each in Loki and with `kubectl get secret`.
Commit C, `feat(vault): JWT N+1 key rotation (Phase 6.2)`: depends on the `netlix-shop-jwt-reader` policy from A; multi-key structure + JWKS code
Commit D, `feat(vault): non-root admin token + rotation runbook (Phase 6.3)`: independent, just adds Terraform resource + docs

Each commit lands on `dev`, is applied via HCP TF and validated, and only then do we proceed to the next.
