Implement HPA/KEDA scaling and prove with demo runs on synthetic load endpoints, with J6 by du-dhartley · Pull Request #266 · Hardhat-Enterprises/AutoAudit

du-dhartley · 2026-05-24T06:07:37Z

PR in progress, not ready to review yet

Summary

This PR is the scaling proof-of-concept for AutoAudit on AKS — every workload now has an autoscaler, a disruption budget,
dependency-aware health checks, graceful shutdown, and topology constraints, with a Grafana dashboard and a repeatable k6
load harness to prove it works.

Chart changes (helm/autoaudit/)

HPAs on backend-api, frontend, opa, powershell-service (CPU-targeted, with custom behavior for faster
scale-down).
KEDA ScaledObjects on the worker — split from a single Deployment into three per-queue Deployments (default,
controls.graph, controls.powershell), each with its own queue-depth scaler, PDB, and pool config (gevent for the
graph queue, prefork for the others).
PDBs on every workload (minAvailable: 1; OPA uses minAvailable: 2).
terminationGracePeriodSeconds + lifecycle.preStop for graceful shutdown; targeted Celery shutdown --destination=$queue@$HOSTNAME so a rollout doesn't broadcast-kill the whole worker fleet.
Zone-spread topologySpreadConstraints + host-level pod anti-affinity.
synthetic.enabled and monitoring.podMonitor.enabled toggles on every workload; new values-scaling.yaml demo
overlay.
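
For readers new to these autoscalers, here is a rough sketch of how a replica count falls out of the metrics. It assumes the formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), and treats a queue-depth target as ceil(queueLength / targetPerReplica) clamped to the min/max bounds. The function names and the example numbers (60% CPU target, 20 tasks per replica, bounds 1..10) are illustrative assumptions, not values from this chart.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, tolerance: float = 0.1) -> int:
    """Replica count per the documented HPA formula (utilization-style metric)."""
    ratio = current_metric / target_metric
    # The HPA skips scaling when the ratio is within tolerance of 1.0
    # (0.1 is the documented default tolerance).
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def queue_desired_replicas(queue_length: int, target_per_replica: int,
                           min_replicas: int, max_replicas: int) -> int:
    """Replica count for a queue-depth target, clamped to the configured bounds."""
    wanted = math.ceil(queue_length / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Hypothetical: 2 pods at 90% CPU against a 60% target.
    print(hpa_desired_replicas(2, 90, 60))
    # Hypothetical: 250 queued tasks, 20 per replica, bounds 1..10.
    print(queue_desired_replicas(250, 20, 1, 10))
```

The clamp in the second function is why a large burst on a queue stops adding pods once the configured maximum is reached.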

App-layer changes

backend-api/app/api/health.py — /healthz falls back to type(exc).__name__ when str(exc) is empty (caught a Redis
async TimeoutError returning "error": "").
backend-api/app/api/v1/test.py — adds POST /v1/test/synthetic-scan, the synthetic worker-fanout lever. Double-gated
by Helm flag + APP_ENV != prod, returns 404 (not 403) when disabled.
engine/worker/celery_app.py — enables worker_send_task_events + task_send_sent_event so the celery-exporter has
data to scrape.
engine/worker/tasks.py — synthetic_evaluate task (configurable sleep / CPU-burn / OPA call / powershell-service
call).
engine/pyproject.toml — adds gevent (the chart sets pool: gevent for controls.graph; the dep was missing and the
pod CrashLooped).
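
The /healthz fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in health.py: `describe_exception` and `dependency_status` are hypothetical names, and the payload shape is assumed from the "error": "" example in the description.

```python
def describe_exception(exc: BaseException) -> str:
    """Return a non-empty error description for a health-check payload."""
    # Some exceptions (for example a bare TimeoutError) stringify to "",
    # which would leave the "error" field empty; fall back to the class name.
    return str(exc) or type(exc).__name__


def dependency_status(check) -> dict:
    """Run one dependency check and report ok/error (hypothetical helper)."""
    try:
        check()
        return {"status": "ok"}
    except Exception as exc:  # a health check reports failures, it does not raise
        return {"status": "error", "error": describe_exception(exc)}
```

With this shape, a timeout that carries no message still shows up as "TimeoutError" instead of an empty string.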

Monitoring (infrastructure/monitoring/in-cluster/, dashboards)

README + values for installing KEDA → Prometheus + kube-state-metrics → celery-exporter → dashboard ConfigMap →
Grafana in order.
KEDA install enables operator + metric-server Prometheus endpoints so keda_scaler_metrics_value appears as a series.
Prometheus scrape_interval: 15s (the default 30s was too sparse for rate(...[1m])).
Dashboard query fixes: filter OPA health/metrics probes out of "decisions/sec", scope backend-api panels by job= (OPA's
FastAPI was leaking into the backend-api latency panel).
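
To make the scrape-interval point concrete: `rate()` needs at least two samples inside its window, so a 1m window over 30s scrapes has almost no slack. The sketch below just counts samples per window; the `min_samples=4` default reflects a common rule of thumb (window at least 4× the scrape interval) and is an assumption, not a value from this repository.

```python
def samples_in_window(window_s: int, scrape_interval_s: int) -> int:
    """Approximate number of scrape samples that land inside a range window."""
    return window_s // scrape_interval_s


def window_is_comfortable(window_s: int, scrape_interval_s: int,
                          min_samples: int = 4) -> bool:
    """True when the window holds at least min_samples scrapes.

    min_samples=4 is a rule-of-thumb assumption, not a repository setting.
    """
    return samples_in_window(window_s, scrape_interval_s) >= min_samples


if __name__ == "__main__":
    for interval in (30, 15):
        print(interval, samples_in_window(60, interval),
              window_is_comfortable(60, interval))
```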

Load harness (tests/load/)

Three k6 scripts: api-baseline.js (backend-api HPA), scan-fanout.js (worker KEDA + OPA + powershell-service all at
once), sustained.js (10-minute steady-state for the "after" snapshot).
tests/load/in-cluster/ — k6 Job + run.sh + demo.sh for in-cluster runs that load-balance across all backend-api
pods (not pinned to one like port-forward would).
k6 wrapper catches threshold-breach exit-99 and exits 0 so the Job lands Complete (the breach still prints in the
summary).

Documentation (docs/plan/)

scaling-after.md — "what the chart looks like now" snapshot, mirrors the structure of scaling-baseline.md
workload-for-workload, with a delta table and a 15-row "chart bugs found and fixed during AKS testing" log.
scaling-test-runbook.md — end-to-end "deploy this on a fresh AKS cluster" sequence.

How to run the tests

Endpoints exercised

Endpoint	Used by	Purpose
`POST /v1/auth/login`	k6 Job init container	Logs in as `admin@example.com`, captures JWT for the run
`GET /v1/test/protected`	`api-baseline.js`, `sustained.js`	Auth-gated, no DB: pure backend-api CPU. Drives the backend-api HPA.
`POST /v1/test/synthetic-scan`	`scan-fanout.js`, `sustained.js`	Enqueues N synthetic Celery tasks across `controls.graph` + `controls.powershell`. Drives KEDA on the worker queues; each task can also POST to OPA and to powershell-service `/execute/synthetic`.

Scripts

Script	Load shape	What it exercises
`api-baseline.js`	`ramping-arrival-rate`, 5 → 200 RPS over 2m ramp / 2m hold / 1m down	`/v1/test/protected`: backend-api HPA story
`scan-fanout.js`	`ramping-arrival-rate`, 1 → 6 pulses per 10s over 30s warm-up / 2m ramp / 3m hold / 1m down / 2m observe	One `/v1/test/synthetic-scan` per pulse, each enqueuing `N_GRAPH=120` + `N_POWERSHELL=60` synthetic tasks (~180/pulse): worker KEDA + OPA + powershell-service
`sustained.js`	Two parallel `constant-arrival-rate` scenarios for 10 minutes: 80 RPS to `/protected` plus 4 pulses/10s to `/synthetic-scan`	Steady-state for the "after" snapshot
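
As a quick worked example of what these load shapes mean for the queues, the arithmetic below converts "pulses per 10s" into enqueued tasks per second. It uses the default `N_GRAPH=120` and `N_POWERSHELL=60`; applying the same per-pulse defaults to `sustained.js` is an assumption, since the description does not state them for that script.

```python
def tasks_per_second(pulses_per_window: int, n_graph: int = 120,
                     n_powershell: int = 60, window_s: int = 10) -> float:
    """Enqueue rate implied by a pulse rate and per-pulse task counts."""
    return pulses_per_window * (n_graph + n_powershell) / window_s


if __name__ == "__main__":
    print(tasks_per_second(1))  # scan-fanout.js at the start of the ramp
    print(tasks_per_second(6))  # scan-fanout.js during the hold
    print(tasks_per_second(4))  # sustained.js, assuming the same defaults
```

At the 6-pulse hold this works out to 108 tasks per second across both queues, which is the pressure the queue-depth scalers are reacting to.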

Configuration

Environment variables (all scripts):

Variable	Default	Notes
`AUTOAUDIT_BASE_URL`	`http://localhost:8000`	In-cluster runner sets `http://autoaudit-backend-api.autoaudit:8000`
`AUTOAUDIT_TOKEN`	— (required)	Bearer JWT. The in-cluster Job's init container fetches this from `/v1/auth/login`.

scan-fanout.js per-pulse knobs (override with -e KEY=value):

Variable	Default	Effect
`N_GRAPH`	`120`	Tasks per pulse onto `controls.graph`
`N_POWERSHELL`	`60`	Tasks per pulse onto `controls.powershell`
`SLEEP_MS`	`2000`	Per-task sleep (simulates I/O wait)
`CPU_BURN_MS`	`200`	Per-task tight-loop CPU
`CALL_POWERSHELL_SERVICE`	`true`	If true, `controls.powershell` tasks call `/execute/synthetic` on the powershell-service
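
These knobs also give a back-of-envelope estimate of how much worker concurrency a queue needs to keep up. The sketch below applies Little's law (in-flight tasks = arrival rate × time per task) and treats `SLEEP_MS + CPU_BURN_MS` as a lower bound on task time, ignoring the optional OPA and powershell-service calls. The function names are hypothetical, and the 72 and 36 tasks/s inputs are the per-queue rates at 6 pulses per 10s with the default task counts.

```python
import math


def min_task_seconds(sleep_ms: int = 2000, cpu_burn_ms: int = 200) -> float:
    """Lower bound on one synthetic task's duration, in seconds."""
    return (sleep_ms + cpu_burn_ms) / 1000


def slots_to_keep_up(tasks_per_second: float, task_seconds: float) -> int:
    """Concurrent task slots needed so the queue does not grow (Little's law)."""
    return math.ceil(tasks_per_second * task_seconds)


if __name__ == "__main__":
    per_task = min_task_seconds()
    print(slots_to_keep_up(72.0, per_task))  # controls.graph at peak
    print(slots_to_keep_up(36.0, per_task))  # controls.powershell at peak
```

Dividing the result by the per-pod concurrency gives a rough replica count to compare against what KEDA actually scales to during a run.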

Thresholds:

Script	Thresholds
`api-baseline.js`	`http_req_failed < 2%`, `http_req_duration p(95) < 500ms`, `p(99) < 800ms`
`scan-fanout.js`	`http_req_failed < 5%`, `synthetic-scan p(95) < 2000ms`
`sustained.js`	`protected` p(95) < 500ms and <2% errors; `synthetic-scan` <5% errors
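
For anyone checking results outside k6, the `api-baseline.js` thresholds can be approximated as below. This is only a sketch: it uses a nearest-rank percentile, which may differ slightly from how k6 computes `p(95)`, and the function names are hypothetical.

```python
def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile; k6's own calculation may interpolate differently."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def api_baseline_passes(durations_ms: list[float], failed: int) -> bool:
    """Apply the api-baseline.js thresholds to a list of request durations."""
    total = len(durations_ms)
    return (
        failed / total < 0.02                    # http_req_failed < 2%
        and percentile(durations_ms, 95) < 500   # p(95) < 500ms
        and percentile(durations_ms, 99) < 800   # p(99) < 800ms
    )
```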

Running

Local (port-forwarded backend-api):

export AUTOAUDIT_TOKEN=$(curl -s -X POST http://localhost:8000/v1/auth/login \
  -d 'username=admin@example.com&password=admin' | jq -r .access_token)

k6 run tests/load/api-baseline.js
k6 run tests/load/scan-fanout.js -e N_GRAPH=20 -e N_POWERSHELL=10

In-cluster (load-balances across all backend-api pods):

# One script at a time
bash tests/load/in-cluster/run.sh api-baseline.js
bash tests/load/in-cluster/run.sh scan-fanout.js -e N_GRAPH=150 -e N_POWERSHELL=50

# Full demo (api-baseline -> 60s cooldown -> scan-fanout)
bash tests/load/in-cluster/demo.sh

run.sh re-syncs the k6-scripts ConfigMap from tests/load/*.js on every invocation, so the cluster always runs the
version on disk. The Job's init container logs in via /v1/auth/login using the admin password from the autoaudit-admin
Secret and writes the JWT to a shared emptyDir; the k6 container reads it and runs against the in-cluster Service URL.
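
The token handoff between the init container and the k6 container can be pictured as below. This is a sketch under assumptions: the real init container most likely uses curl and jq as in the local example, the file name `token` is hypothetical, and a temporary directory stands in for the shared emptyDir.

```python
import json
import tempfile
from pathlib import Path


def write_token(login_response_body: str, token_file: Path) -> None:
    """Init-container step: pull access_token out of the login JSON and save it."""
    token = json.loads(login_response_body)["access_token"]
    token_file.write_text(token)


def auth_headers(token_file: Path) -> dict:
    """Runner step: read the saved token and build the Bearer header."""
    return {"Authorization": f"Bearer {token_file.read_text().strip()}"}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:  # stands in for the emptyDir
        path = Path(shared) / "token"  # hypothetical file name
        write_token('{"access_token": "example.jwt.value"}', path)
        print(auth_headers(path))
```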

Full fresh-cluster setup (namespace + Secret + KEDA + monitoring + ACR push + helm install) is in
docs/scaling/scaling-test-runbook.md.

Type of Change

Affected Components

Motivation

The primary reason for this work was the HD panel, to show specific technical depth. But as this is a critical feature of production-grade infrastructure, I wanted to share it so that future students can use it.

Testing Done

Unit tests pass locally
Tested manually — describe how:
No tests required — explain why:

Security Considerations

No security impact to consider.

Breaking Changes

No breaking changes
Yes — describe below:

Rollback Plan

Revert commit is sufficient
Requires additional steps — describe below:

Checklist

Code follows project conventions
No secrets, credentials, or tokens committed
Relevant documentation updated (if applicable)
[] CI/CD workflows pass on this branch
PR is focused on one thing

Screenshots

Screenshots to come.

github-actions · 2026-05-24T06:07:50Z

Preview Environment

A preview environment can be spun up on demand for this PR.

Action	Label	Includes
Spin up preview	`deploy-preview`	Frontend, backend, database, Redis, OPA, worker
Spin up preview with M365	`deploy-preview-m365`	Everything above + PowerShell service for Exchange/Teams scan testing
Tear down preview	`teardown-preview`	Stops the environment early

The environment will also be torn down automatically when the PR is closed or merged.
Preview URLs will appear in a follow-up comment once the deploy completes (~5–8 min).
M365 scans require real tenant credentials added through the frontend UI.


github-actions · 2026-05-24T06:09:04Z

CI: Frontend

Job	Result
Security analysis (CodeQL)	✅ `success`
Lint	✅ `success`
Build and test	❌ `failure`

One or more checks failed.

github-actions · 2026-05-24T06:10:33Z

CI: Engine

Job	Result
Security analysis (CodeQL)	✅ `success`
Lint	❌ `failure`
Tests	✅ `success`

One or more checks failed.

github-actions · 2026-05-24T06:10:40Z

CI: Backend API

Job	Result
Security analysis (CodeQL + Bandit)	❌ `failure`
Lint	❌ `failure`

One or more checks failed.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4265e01abf

du-dhartley added 19 commits April 3, 2026 12:45

Add helm chart

c0a87d9

Update dockerignore

df27214

Document helm decisions

f1f974b

Add production config to frontend

71481c6

Add a prod OPA dockerfile

560d648

POC github action for build and push

5e120c7

Temp changes to test while not on main

659547d

Use root to copy files before switching user

128dfce

Use a date and sha combo

d227a6c

Fix default image tags, migration order and admin pw seeding

aa38ac6

Admin seed updates, OPA dockerfile revamp

d9464ce

Trigger workflow on any push to branch

1c12bbb

Always pull

ee9b4e5

Output manifest as txt rather than JSON so it isn't loaded by OPA

5c602ba

Add a second hostname

a2791b8

Copy the policy files so benchmarks are populated

5a1a29f

Merge branch 'main' into feature/proof-of-concept-deployment-26t1-imp…

3eb502a

…-dha-003

add horizontal scaling, PDBs, and observability for PoC

4265e01

Merge branch 'main' into feature/proof-of-concept-deployment-26t1-imp…

1de7943

…-dha-003

du-dhartley requested review from 6igby and liyunze-coding as code owners May 24, 2026 06:07

github-advanced-security AI found potential problems May 24, 2026

Comment thread backend-api/app/api/health.py

}

code = status.HTTP_200_OK if payload["status"] == "ok" else status.HTTP_503_SERVICE_UNAVAILABLE

return JSONResponse(payload, status_code=code)

du-dhartley changed the title ~~Feature/proof of concept deployment 26t1 imp dha 003~~ Implement HPA/KEDA scaling and prove with demo runs on synthetic load endpoints, with J6 May 24, 2026

chatgpt-codex-connector Bot reviewed May 24, 2026

Comment thread helm/autoaudit/templates/worker/deployment.yaml Outdated

Comment thread helm/autoaudit/templates/worker/deployment.yaml Outdated

du-dhartley added 2 commits May 24, 2026 16:27

Chart and scaling updates, scaling docs commit

5182aa5

Refine scaling and dashboards

6f5473c

du-dhartley wants to merge 21 commits into main from feature/proof-of-concept-deployment-26t1-imp-dha-003
