Implement HPA/KEDA scaling and prove with demo runs on synthetic load endpoints, with J6 #266

Open
du-dhartley wants to merge 21 commits into main from feature/proof-of-concept-deployment-26t1-imp-dha-003

@du-dhartley
Collaborator

PR in progress, not ready to review yet

Summary

This PR is the scaling proof-of-concept for AutoAudit on AKS — every workload now has an autoscaler, a disruption budget,
dependency-aware health checks, graceful shutdown, and topology constraints, with a Grafana dashboard and a repeatable k6
load harness to prove it works.

Chart changes (helm/autoaudit/)

  • HPAs on backend-api, frontend, opa, powershell-service (CPU-targeted, with custom behavior for faster
    scale-down).
  • KEDA ScaledObjects on the worker — split from a single Deployment into three per-queue Deployments (default,
    controls.graph, controls.powershell), each with its own queue-depth scaler, PDB, and pool config (gevent for the
    graph queue, prefork for the others).
  • PDBs on every workload (minAvailable: 1, minAvailable: 2 for OPA).
  • terminationGracePeriodSeconds + lifecycle.preStop for graceful shutdown; targeted Celery shutdown --destination=$queue@$HOSTNAME so a rollout doesn't broadcast-kill the whole worker fleet.
  • Zone-spread topologySpreadConstraints + host-level pod anti-affinity.
  • synthetic.enabled and monitoring.podMonitor.enabled toggles on every workload; new values-scaling.yaml demo
    overlay.
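
For context on how the CPU-targeted HPAs and the queue-depth scalers above arrive at a replica count, here is a minimal Python sketch of the replica formula documented for the Kubernetes HorizontalPodAutoscaler. The function names, the example numbers, and the min/max bounds are illustrative assumptions, not values from this chart; the 10% tolerance is the upstream default. The queue-depth helper is a simplification of an average-value external metric and ignores stabilization windows and scaling policies.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=10, tolerance=0.1):
    """desired = ceil(current * currentMetric / targetMetric), clamped to bounds."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def queue_depth_replicas(queue_depth, target_per_replica,
                         min_replicas=0, max_replicas=10):
    """Simplified queue scaler: one replica per target_per_replica queued tasks."""
    desired = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical numbers: 2 pods at 90% CPU against a 60% target.
print(hpa_desired_replicas(2, 90, 60))
# Hypothetical numbers: 180 queued tasks with a target of 20 per replica.
print(queue_depth_replicas(180, 20))
```

Because the result is rounded up, a small overshoot past the tolerance band always adds at least one replica, which is why scale-down behavior is usually tuned separately.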

App-layer changes

  • backend-api/app/api/health.py: /healthz falls back to type(exc).__name__ when str(exc) is empty (caught a Redis
    async TimeoutError returning "error": "").
  • backend-api/app/api/v1/test.py — adds POST /v1/test/synthetic-scan, the synthetic worker-fanout lever. Double-gated
    by Helm flag + APP_ENV != prod, returns 404 (not 403) when disabled.
  • engine/worker/celery_app.py — enables worker_send_task_events + task_send_sent_event so the celery-exporter has
    data to scrape.
  • engine/worker/tasks.py: adds the synthetic_evaluate task (configurable sleep / CPU-burn / OPA call /
    powershell-service call).
  • engine/pyproject.toml — adds gevent (the chart sets pool: gevent for controls.graph; the dep was missing and the
    pod CrashLooped).
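
Two of the app-layer changes above are small enough to sketch in isolation. The following is a hedged illustration, not the code in health.py or test.py: the helper names describe_error and synthetic_scan_allowed are hypothetical, and the exact way the Helm flag reaches the app is an assumption.

```python
import asyncio

# Status returned when the synthetic endpoint is disabled, per the PR text.
DISABLED_STATUS = 404


def describe_error(exc: BaseException) -> str:
    """Return the exception message, or the class name when the message is empty."""
    return str(exc) or type(exc).__name__


def synthetic_scan_allowed(helm_flag_enabled: bool, app_env: str) -> bool:
    """Double gate: the chart toggle must be on AND the environment must not be prod."""
    return helm_flag_enabled and app_env != "prod"


# A bare timeout carries no message, so str(exc) alone would yield "".
print(describe_error(asyncio.TimeoutError()))
print(synthetic_scan_allowed(True, "prod"))
```

Returning 404 rather than 403 when the gate is closed avoids confirming to a caller that the route exists at all.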

Monitoring (infrastructure/monitoring/in-cluster/, dashboards)

  • README + values for installing KEDA → Prometheus + kube-state-metrics → celery-exporter → dashboard ConfigMap →
    Grafana, in that order.
  • KEDA install enables operator + metric-server Prometheus endpoints so keda_scaler_metrics_value appears as a series.
  • Prometheus scrape_interval: 15s (the default 30s was too sparse for rate(...[1m])).
  • Dashboard query fixes: filter OPA health/metrics probes out of "decisions/sec", scope backend-api panels by job= (OPA's
    FastAPI was leaking into the backend-api latency panel).
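
The scrape-interval change is easier to see with a little arithmetic. This is a rough sketch under the assumption that samples arrive evenly spaced; the helper name is hypothetical. rate() needs at least two samples inside its range window, and a common rule of thumb is a window of about four scrape intervals so that a single missed scrape does not leave a gap.

```python
def samples_in_window(window_s: int, scrape_interval_s: int) -> int:
    """Approximate number of samples a range selector sees, assuming even spacing."""
    return window_s // scrape_interval_s


for interval in (30, 15):
    n = samples_in_window(60, interval)
    print(f"scrape_interval={interval}s -> about {n} samples in a [1m] window")
```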

Load harness (tests/load/)

  • Three k6 scripts: api-baseline.js (backend-api HPA), scan-fanout.js (worker KEDA + OPA + powershell-service all at
    once), sustained.js (10-minute steady-state for the "after" snapshot).
  • tests/load/in-cluster/ — k6 Job + run.sh + demo.sh for in-cluster runs that load-balance across all backend-api
    pods (not pinned to one like port-forward would).
  • k6 wrapper catches threshold-breach exit-99 and exits 0 so the Job lands Complete (the breach still prints in the
    summary).
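
The exit-code handling in the k6 wrapper can be sketched as follows. The wrapper in this PR is likely a shell script; this Python rendering is only an illustration of the same mapping, and the function names are hypothetical. Exit code 99 is the code k6 uses when a threshold is breached.

```python
import subprocess

K6_THRESHOLD_BREACH = 99  # k6 exits with 99 when a threshold fails


def normalize_exit_code(code: int) -> int:
    """Treat a threshold breach as success so the Job completes; pass other codes through."""
    return 0 if code in (0, K6_THRESHOLD_BREACH) else code


def run_k6(args: list[str]) -> int:
    """Run `k6 run <args>` and return the normalized exit code."""
    result = subprocess.run(["k6", "run", *args])
    return normalize_exit_code(result.returncode)


# Example (requires k6 on PATH): raise SystemExit(run_k6(["tests/load/api-baseline.js"]))
```

Only the breach code is remapped, so genuine failures such as a script error still fail the Job.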

Documentation (docs/plan/)

  • scaling-after.md — "what the chart looks like now" snapshot, mirrors the structure of scaling-baseline.md
    workload-for-workload, with a delta table and a 15-row "chart bugs found and fixed during AKS testing" log.
  • scaling-test-runbook.md — end-to-end "deploy this on a fresh AKS cluster" sequence.

How to run the tests

Endpoints exercised

| Endpoint | Used by | Purpose |
| --- | --- | --- |
| POST /v1/auth/login | k6 Job init container | Logs in as admin@example.com, captures JWT for the run |
| GET /v1/test/protected | api-baseline.js, sustained.js | Auth-gated, no DB; pure backend-api CPU. Drives the backend-api HPA. |
| POST /v1/test/synthetic-scan | scan-fanout.js, sustained.js | Enqueues N synthetic Celery tasks across controls.graph + controls.powershell. Drives KEDA on the worker queues; each task can also POST to OPA and to powershell-service /execute/synthetic. |

Scripts

| Script | Load shape | What it exercises |
| --- | --- | --- |
| api-baseline.js | ramping-arrival-rate, 5 → 200 RPS over 2m ramp / 2m hold / 1m down | /v1/test/protected (backend-api HPA story) |
| scan-fanout.js | ramping-arrival-rate, 1 → 6 pulses per 10s over 30s warm-up / 2m ramp / 3m hold / 1m down / 2m observe | One /v1/test/synthetic-scan per pulse, each enqueuing N_GRAPH=120 + N_POWERSHELL=60 synthetic tasks (~180/pulse): worker KEDA + OPA + powershell-service |
| sustained.js | Two parallel constant-arrival-rate scenarios for 10 minutes: 80 RPS to /protected plus 4 pulses/10s to /synthetic-scan | Steady-state for the "after" snapshot |
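
To put the pulse rates in the script descriptions into task terms, here is a small arithmetic sketch. It assumes the default N_GRAPH=120 and N_POWERSHELL=60 apply to every pulse, including the sustained.js pulses, which the PR text does not state explicitly; the function name is hypothetical.

```python
def tasks_per_second(pulses_per_10s: float, n_graph: int = 120,
                     n_powershell: int = 60) -> float:
    """Celery tasks enqueued per second for a given pulse rate."""
    return pulses_per_10s * (n_graph + n_powershell) / 10


print(tasks_per_second(1))  # scan-fanout.js at the start of the ramp
print(tasks_per_second(6))  # scan-fanout.js at the hold
print(tasks_per_second(4))  # sustained.js, under the stated assumption
```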

Configuration

Required env vars (all scripts):

| Variable | Default | Notes |
| --- | --- | --- |
| AUTOAUDIT_BASE_URL | http://localhost:8000 | In-cluster runner sets http://autoaudit-backend-api.autoaudit:8000 |
| AUTOAUDIT_TOKEN | none (required) | Bearer JWT. The in-cluster Job's init container fetches this from /v1/auth/login. |

scan-fanout.js per-pulse knobs (override with -e KEY=value):

| Variable | Default | Effect |
| --- | --- | --- |
| N_GRAPH | 120 | Tasks per pulse onto controls.graph |
| N_POWERSHELL | 60 | Tasks per pulse onto controls.powershell |
| SLEEP_MS | 2000 | Per-task sleep (simulates I/O wait) |
| CPU_BURN_MS | 200 | Per-task tight-loop CPU |
| CALL_POWERSHELL_SERVICE | true | If true, controls.powershell tasks call /execute/synthetic on the powershell-service |
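
The per-task knobs above determine roughly how much work is in flight at a given enqueue rate. The sketch below applies Little's law (in-flight work = arrival rate × time in service) under the assumption that a task takes about SLEEP_MS + CPU_BURN_MS and that the optional OPA and powershell-service calls add nothing; real tasks will take longer. The function name and the example rate are illustrative.

```python
def in_flight_tasks(arrival_rate_per_s: float, sleep_ms: int = 2000,
                    cpu_burn_ms: int = 200) -> float:
    """Little's law estimate of concurrently running tasks at steady state."""
    service_time_s = (sleep_ms + cpu_burn_ms) / 1000
    return arrival_rate_per_s * service_time_s


# Hypothetical rate: 36 tasks/s arriving on one queue.
print(round(in_flight_tasks(36), 1))
```

Dividing that estimate by the per-pod concurrency gives a first guess at how many worker pods the scaler needs to reach before the queue stops growing.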

Thresholds:

| Script | Thresholds |
| --- | --- |
| api-baseline.js | http_req_failed < 2%, http_req_duration p(95) < 500ms, p(99) < 800ms |
| scan-fanout.js | http_req_failed < 5%, synthetic-scan p(95) < 2000ms |
| sustained.js | protected p(95) < 500ms and <2% errors; synthetic-scan <5% errors |

Running

Local (port-forwarded backend-api):

export AUTOAUDIT_TOKEN=$(curl -s -X POST http://localhost:8000/v1/auth/login \
  -d 'username=admin@example.com&password=admin' | jq -r .access_token)

k6 run tests/load/api-baseline.js
k6 run tests/load/scan-fanout.js -e N_GRAPH=20 -e N_POWERSHELL=10

In-cluster (load-balances across all backend-api pods):

# One script at a time
bash tests/load/in-cluster/run.sh api-baseline.js
bash tests/load/in-cluster/run.sh scan-fanout.js -e N_GRAPH=150 -e N_POWERSHELL=50

# Full demo (api-baseline -> 60s cooldown -> scan-fanout)
bash tests/load/in-cluster/demo.sh

run.sh re-syncs the k6-scripts ConfigMap from tests/load/*.js on every invocation, so the cluster always runs the
version on disk. The Job's init container logs in via /v1/auth/login using the admin password from the autoaudit-admin
Secret and writes the JWT to a shared emptyDir; the k6 container reads it and runs against the in-cluster Service URL.
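
The token handoff described above (init container logs in, writes the JWT to a shared volume, k6 container reads it) can be sketched like this. It is an illustration only: the function names and the file name `token` are hypothetical, and the real init container may well be a shell one-liner. The `access_token` field name matches the jq filter used in the local example.

```python
import json
from pathlib import Path


def extract_token(body: bytes) -> str:
    """Pull access_token out of the /v1/auth/login JSON response."""
    token = json.loads(body).get("access_token")
    if not token:
        raise ValueError("login response did not contain access_token")
    return token


def write_token(shared_dir: str, token: str, filename: str = "token") -> Path:
    """Write the JWT where the k6 container can read it (the shared emptyDir mount)."""
    path = Path(shared_dir) / filename
    path.write_text(token)
    return path
```

Failing loudly on a missing token makes the init container exit non-zero, so the k6 container never starts with an empty Authorization header.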

Full fresh-cluster setup (namespace + Secret + KEDA + monitoring + ACR push + helm install) is in
docs/scaling/scaling-test-runbook.md.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Refactor / code cleanup
  • Documentation
  • CI/CD / infrastructure
  • Security

Affected Components

  • /backend-api
  • /frontend
  • /engine (collectors / policies)
  • /security
  • /infrastructure
  • /.github/workflows
  • /docs

Motivation

The primary motivation for this work was the HD panel, where it demonstrates specific technical depth. Because this is also a critical feature of production-grade infrastructure, I wanted to share it so that future students can use it.

Testing Done

  • Unit tests pass locally
  • Tested manually — describe how:
  • No tests required — explain why:

Security Considerations

No security impact to consider.

Breaking Changes

  • No breaking changes
  • Yes — describe below:

Rollback Plan

  • Revert commit is sufficient
  • Requires additional steps — describe below:

Checklist

  • Code follows project conventions
  • No secrets, credentials, or tokens committed
  • Relevant documentation updated (if applicable)
  • CI/CD workflows pass on this branch
  • PR is focused on one thing

Screenshots

Screenshots to come.

@github-actions
Contributor

Preview Environment

A preview environment can be spun up on demand for this PR.

| Action | Label | Includes |
| --- | --- | --- |
| Spin up preview | deploy-preview | Frontend, backend, database, Redis, OPA, worker |
| Spin up preview with M365 | deploy-preview-m365 | Everything above + PowerShell service for Exchange/Teams scan testing |
| Tear down preview | teardown-preview | Stops the environment early |

The environment will also be torn down automatically when the PR is closed or merged.
Preview URLs will appear in a follow-up comment once the deploy completes (~5–8 min).
M365 scans require real tenant credentials added through the frontend UI.

}

code = status.HTTP_200_OK if payload["status"] == "ok" else status.HTTP_503_SERVICE_UNAVAILABLE
return JSONResponse(payload, status_code=code)
@github-actions
Contributor

github-actions Bot commented May 24, 2026

CI: Frontend

| Job | Result |
| --- | --- |
| Security analysis (CodeQL) | success |
| Lint | success |
| Build and test | failure |

One or more checks failed. View logs

@du-dhartley changed the title from "Feature/proof of concept deployment 26t1 imp dha 003" to "Implement HPA/KEDA scaling and prove with demo runs on synthetic load endpoints, with J6" on May 24, 2026
@github-actions
Contributor

github-actions Bot commented May 24, 2026

CI: Engine

| Job | Result |
| --- | --- |
| Security analysis (CodeQL) | success |
| Lint | failure |
| Tests | success |

One or more checks failed. View logs

@github-actions
Contributor

github-actions Bot commented May 24, 2026

CI: Backend API

| Job | Result |
| --- | --- |
| Security analysis (CodeQL + Bandit) | failure |
| Lint | failure |

One or more checks failed. View logs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4265e01abf

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread helm/autoaudit/templates/worker/deployment.yaml Outdated
Comment thread helm/autoaudit/templates/worker/deployment.yaml Outdated