Implement HPA/KEDA scaling and prove with demo runs on synthetic load endpoints, with J6 #266
du-dhartley wants to merge 21 commits into
    code = status.HTTP_200_OK if payload["status"] == "ok" else status.HTTP_503_SERVICE_UNAVAILABLE
    return JSONResponse(payload, status_code=code)
CI: Frontend (one or more checks failed)
CI: Engine (one or more checks failed)
CI: Backend API (one or more checks failed)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4265e01abf
PR in progress, not ready to review yet
Summary
This PR is the scaling proof-of-concept for AutoAudit on AKS — every workload now has an autoscaler, a disruption budget,
dependency-aware health checks, graceful shutdown, and topology constraints, with a Grafana dashboard and a repeatable k6
load harness to prove it works.
Chart changes (`helm/autoaudit/`)
- HPAs on `backend-api`, `frontend`, `opa`, `powershell-service` (CPU-targeted, with custom `behavior` for faster scale-down).
- ScaledObjects on the worker, which is split from a single Deployment into three per-queue Deployments (`default`, `controls.graph`, `controls.powershell`), each with its own queue-depth scaler, PDB, and pool config (`gevent` for the `graph` queue, `prefork` for the others).
- PodDisruptionBudgets (`minAvailable: 1`; `minAvailable: 2` for OPA).
- `terminationGracePeriodSeconds` + `lifecycle.preStop` for graceful shutdown; targeted Celery `shutdown --destination=$queue@$HOSTNAME` so a rollout doesn't broadcast-kill the whole worker fleet.
- `topologySpreadConstraints` + host-level pod anti-affinity.
- `synthetic.enabled` and `monitoring.podMonitor.enabled` toggles on every workload; new `values-scaling.yaml` demo overlay.
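As a rough illustration of what the CPU-targeted HPAs and the queue-depth scalers compute, the sketch below applies the standard Kubernetes HPA replica formula and the equivalent average-value formula that a queue-length metric reduces to. The function names, targets, and replica bounds are hypothetical and are not taken from the chart; the custom `behavior` rules and stabilization windows mentioned above are not modelled.

```python
import math


def hpa_desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                         min_replicas, max_replicas, tolerance=0.1):
    """CPU-utilisation HPA: desired = ceil(current * currentMetric / targetMetric).

    The controller skips scaling when the ratio is within the tolerance
    (0.1 is the Kubernetes default); the bounds here are made-up examples.
    """
    ratio = current_cpu_pct / target_cpu_pct
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))


def queue_desired_replicas(queue_depth, target_per_replica,
                           min_replicas, max_replicas):
    """Queue-depth scaling: one replica per `target_per_replica` queued tasks.

    `target_per_replica` is a hypothetical value, not the chart's setting.
    """
    desired = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(hpa_desired_replicas(2, 90, 60, min_replicas=2, max_replicas=10))
    print(queue_desired_replicas(180, 20, min_replicas=1, max_replicas=6))
```

Because each queue has its own Deployment and scaler, a backlog on one queue only grows that queue's workers.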
App-layer changes
- `backend-api/app/api/health.py`: `/healthz` falls back to `type(exc).__name__` when `str(exc)` is empty (caught a Redis async `TimeoutError` returning `"error": ""`).
- `backend-api/app/api/v1/test.py`: adds `POST /v1/test/synthetic-scan`, the synthetic worker-fanout lever. Double-gated by Helm flag + `APP_ENV != prod`, returns 404 (not 403) when disabled.
- `engine/worker/celery_app.py`: enables `worker_send_task_events` + `task_send_sent_event` so the celery-exporter has data to scrape.
- `engine/worker/tasks.py`: `synthetic_evaluate` task (configurable sleep / CPU-burn / OPA call / powershell-service call).
- `engine/pyproject.toml`: adds `gevent` (the chart sets `pool: gevent` for `controls.graph`; the dep was missing and the pod CrashLooped).
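The health-check fallback and the double gate on the synthetic endpoint can be sketched in plain Python as follows. This is a minimal illustration, not the code in `health.py` or `test.py`: the helper names, the `checks` payload shape, and the `"degraded"` status value are assumptions; only the `str(exc)` fallback, the `"status" == "ok"` test, the 200/503 mapping, and the 404-when-disabled rule come from the description.

```python
from typing import Callable, Dict

DISABLED_STATUS = 404  # per the PR: 404 (not 403) when the endpoint is gated off


def describe_error(exc: BaseException) -> str:
    """Return str(exc), falling back to the class name when it is empty."""
    return str(exc) or type(exc).__name__


def run_health_checks(checks: Dict[str, Callable[[], None]]) -> dict:
    """Run each dependency check; the payload layout here is hypothetical."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = {"ok": True}
        except Exception as exc:
            results[name] = {"ok": False, "error": describe_error(exc)}
    status = "ok" if all(r["ok"] for r in results.values()) else "degraded"
    return {"status": status, "checks": results}


def health_status_code(payload: dict) -> int:
    """200 when every dependency is healthy, otherwise 503."""
    return 200 if payload["status"] == "ok" else 503


def synthetic_scan_allowed(synthetic_enabled: bool, app_env: str) -> bool:
    """Both gates must pass: the Helm flag is on and the env is not prod."""
    return synthetic_enabled and app_env != "prod"


def failing_redis_check() -> None:
    # A bare TimeoutError has an empty str(), which is the case the PR hit.
    raise TimeoutError()


if __name__ == "__main__":
    payload = run_health_checks({"redis": failing_redis_check})
    print(payload, health_status_code(payload))
    print(synthetic_scan_allowed(True, "prod"))
```

Without the fallback, the failing check above would report `"error": ""`, which gives an operator nothing to act on.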
Monitoring (`infrastructure/monitoring/in-cluster/`, dashboards)
- […] Grafana in order.
- […] `keda_scaler_metrics_value` appears as a series.
- […] `scrape_interval: 15s` (the default 30s was too sparse for `rate(...[1m])`).
- […] `job=` (OPA's FastAPI was leaking into the backend-api latency panel).
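The scrape-interval change can be checked with simple arithmetic: `rate()` needs at least two samples inside its range window, so the sketch below counts the worst-case number of samples a window holds. The helper names are hypothetical, and scrape jitter and staleness handling are ignored.

```python
def min_samples_in_window(window_s: int, scrape_interval_s: int) -> int:
    """Worst-case number of scrape samples that fall inside a range window."""
    return window_s // scrape_interval_s


def missed_scrapes_tolerated(window_s: int, scrape_interval_s: int) -> int:
    """How many scrapes can be lost while rate() still has two samples."""
    return max(0, min_samples_in_window(window_s, scrape_interval_s) - 2)


if __name__ == "__main__":
    for interval in (30, 15):
        print(interval,
              min_samples_in_window(60, interval),
              missed_scrapes_tolerated(60, interval))
```

At 30s a `[1m]` window holds only the two samples `rate()` requires, so a single late or missed scrape leaves a gap in the panel; at 15s there is headroom.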
Load harness (`tests/load/`)
- Three k6 scripts: `api-baseline.js` (backend-api HPA), `scan-fanout.js` (worker KEDA + OPA + powershell-service all at once), `sustained.js` (10-minute steady-state for the "after" snapshot).
- `tests/load/in-cluster/`: k6 Job + `run.sh` + `demo.sh` for in-cluster runs that load-balance across all backend-api pods (rather than pinned to one pod, as with `port-forward`).
- […] `Complete` (the breach still prints in the summary).
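To make the load shape concrete, the sketch below computes the target request rate at a given moment for a staged ramp like the one this description quotes for `api-baseline.js` (5 → 200 RPS over 2m ramp / 2m hold / 1m down). It assumes linear interpolation between stage targets and assumes the ramp-down ends at 0 RPS, which the description does not state; `rate_at` and `API_BASELINE_STAGES` are hypothetical names, not part of the k6 scripts.

```python
from typing import List, Tuple

# (duration in seconds, target RPS at the end of the stage).
# The final target of 0 is an assumption; the PR only says "1m down".
API_BASELINE_STAGES: List[Tuple[float, float]] = [(120, 200), (120, 200), (60, 0)]


def rate_at(t_s: float, start_rate: float,
            stages: List[Tuple[float, float]]) -> float:
    """Target rate at time t_s, interpolating linearly within each stage."""
    rate = start_rate
    elapsed = 0.0
    for duration_s, target in stages:
        if duration_s <= 0:
            # Zero-length stage: jump straight to its target.
            rate = target
            continue
        if t_s <= elapsed + duration_s:
            frac = (t_s - elapsed) / duration_s
            return rate + (target - rate) * frac
        elapsed += duration_s
        rate = target
    return rate  # after the last stage, hold the final target


if __name__ == "__main__":
    for t in (0, 60, 120, 200, 270, 300):
        print(t, rate_at(t, 5, API_BASELINE_STAGES))
```

An arrival-rate profile fixes the request rate regardless of response time, which is what lets the autoscaler's reaction show up as latency rather than as reduced throughput.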
Documentation (`docs/plan/`)
- `scaling-after.md`: "what the chart looks like now" snapshot, mirrors the structure of `scaling-baseline.md` workload-for-workload, with a delta table and a 15-row "chart bugs found and fixed during AKS testing" log.
- `scaling-test-runbook.md`: end-to-end "deploy this on a fresh AKS cluster" sequence.

How to run the tests
Endpoints exercised
- `POST /v1/auth/login`: logs in as admin@example.com, captures JWT for the run.
- `GET /v1/test/protected`: used by `api-baseline.js`, `sustained.js`.
- `POST /v1/test/synthetic-scan`: used by `scan-fanout.js`, `sustained.js`. Enqueues synthetic tasks on `controls.graph` + `controls.powershell`. Drives KEDA on the worker queues; each task can also POST to OPA and to `/execute/synthetic`.

Scripts
- `api-baseline.js`: `ramping-arrival-rate`, 5 → 200 RPS over 2m ramp / 2m hold / 1m down, against `/v1/test/protected`.
- `scan-fanout.js`: `ramping-arrival-rate`, 1 → 6 pulses per 10s over 30s warm-up / 2m ramp / 3m hold / 1m down / 2m […]; […] `/v1/test/synthetic-scan` per pulse, each enqueuing `N_GRAPH=120` + `N_POWERSHELL=60` synthetic tasks.
- `sustained.js`: […] `constant-arrival-rate` scenarios for 10 minutes: 80 RPS to `/protected` plus 4 `/synthetic-scan` […].

Configuration
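Required env vars (all scripts):
- `AUTOAUDIT_BASE_URL`: `http://localhost:8000` (local, port-forwarded) or `http://autoaudit-backend-api.autoaudit:8000` (in-cluster).
- `AUTOAUDIT_TOKEN`: JWT obtained from `/v1/auth/login`.

`scan-fanout.js` per-pulse knobs (override with `-e KEY=value`):
- `N_GRAPH` (default `120`): tasks enqueued on `controls.graph`.
- `N_POWERSHELL` (default `60`): tasks enqueued on `controls.powershell`.
- `SLEEP_MS` (default `2000`)
- `CPU_BURN_MS` (default `200`)
- `CALL_POWERSHELL_SERVICE` (default `true`): `controls.powershell` tasks call `/execute/synthetic`.

Thresholds: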
- `api-baseline.js`: `http_req_failed < 2%`, `http_req_duration p(95) < 500ms`, `p(99) < 800ms`
- `scan-fanout.js`: `http_req_failed < 5%`, `synthetic-scan p(95) < 2000ms`
- `sustained.js`: `protected` p(95) < 500ms and <2% errors; `synthetic-scan` <5% errors

Running
Local (port-forwarded backend-api):
In-cluster (load-balances across all backend-api pods):
`run.sh` re-syncs the `k6-scripts` ConfigMap from `tests/load/*.js` on every invocation, so the cluster always runs the version on disk. The Job's init container logs in via `/v1/auth/login` using the admin password from the `autoaudit-admin` Secret and writes the JWT to a shared `emptyDir`; the k6 container reads it and runs against the in-cluster Service URL.
Full fresh-cluster setup (namespace + Secret + KEDA + monitoring + ACR push + `helm install`) is in `docs/scaling/scaling-test-runbook.md`.

Type of Change
Affected Components
- `/backend-api`
- `/frontend`
- `/engine` (collectors / policies)
- `/security`
- `/infrastructure`
- `/.github/workflows`
- `/docs`

Motivation
The primary reason for this work was the HD panel, showing specific technical depth, but as this is a critical feature of production-grade infrastructure, I wanted to share it so future students can use it.
Testing Done
Security Considerations
No security impact to consider.
Breaking Changes
Rollback Plan
Checklist
Screenshots
Screenshots to come.