docs(audits): performance, security, and quality audit results #62

Merged: drernie merged 21 commits into main from 23-audits on Mar 25, 2026

@drernie drernie commented Mar 24, 2026

Summary

Full audit results across three dimensions — performance, security, and quality — for the RAJA MVP. See specs/23-audits/04-audit-summary.md for the canonical summary.

Performance Audit

Direct S3 baseline (scale/1k, n=100): P50 0.924 s · P95 1.096 s · P99 7.106 s

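| tier | P50 (s) | P95 (s) | P99 (s) | P99 overhead | errors |
| --- | --- | --- | --- | --- | --- |
| scale/1m | 0.901 | 2.051 | 9.321 | +31 % | none |
| scale/100k | 0.897 | 2.791 | 9.063 | +28 % | none |
| scale/10k | 0.917 | 3.280 | 4.834 | −32 % † | none |
| scale/1k | 0.915 | 2.862 | 5.561 | −22 % † | 21 × 503 |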

† P99 baseline (7.1 s) was inflated by a single outlier; these tiers beat it by chance.
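
The overhead figures can be reproduced from the P99 values above. This is a minimal sketch, assuming overhead is the relative difference between a tier's P99 and the direct S3 baseline P99; the function name `p99_overhead_pct` is illustrative and not taken from the repository.

```python
BASELINE_P99_S = 7.106  # direct S3 baseline, scale/1k, n=100

# P99 latency per tier in seconds, copied from the results above
TIER_P99_S = {
    "scale/1m": 9.321,
    "scale/100k": 9.063,
    "scale/10k": 4.834,
    "scale/1k": 5.561,
}


def p99_overhead_pct(tier_p99_s: float, baseline_p99_s: float = BASELINE_P99_S) -> int:
    """Relative P99 difference versus the baseline, rounded to a whole percent."""
    return round((tier_p99_s - baseline_p99_s) / baseline_p99_s * 100)


for tier, p99 in TIER_P99_S.items():
    # Prints +31, +28, -32, -22 for the four tiers, in that order
    print(f"{tier}: {p99_overhead_pct(p99):+d} %")
```

Because every figure is divided by the same 7.106 s baseline, a single inflated baseline sample shifts all four percentages at once.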

P50 is flat across all tiers at ~0.91 s. The JWT+Lua filter chain adds negligible median cost — S3 fetch dominates.

P99 is noise, not tier-driven. Tail values are driven by ECS task cycling (IMDS credential-refresh bursts), not package size or auth overhead.

503s only on scale/1k. Correlated with IMDS bursts during credential refresh, not a routing problem.

Decision: no optimization required. The 15 % P99 threshold cannot be reliably evaluated from this data set. Recommended follow-up: re-run baseline at n=1 000 to reduce noise; investigate IMDS 503 recurrence as a separate issue.
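
To illustrate why n=100 is too small for a stable P99, the sketch below counts how many samples sit at or above the P99 rank. It assumes the nearest-rank percentile definition, which may differ from how `hey` computes its percentiles, and the latencies are synthetic rather than measured.

```python
def nearest_rank(n: int, pct: int) -> int:
    """1-based rank of the pct-th percentile under the nearest-rank definition."""
    return -(-n * pct // 100)  # integer form of ceil(n * pct / 100)


def samples_at_or_above(n: int, pct: int) -> int:
    """How many of n sorted samples lie at or above the percentile rank."""
    return n - nearest_rank(n, pct) + 1


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a list of latencies."""
    ordered = sorted(samples)
    return ordered[nearest_rank(len(ordered), pct) - 1]


# Synthetic run: 98 ordinary requests plus two slow ones (not measured data)
synthetic = [0.9] * 98 + [5.0, 7.1]

print(samples_at_or_above(100, 99))   # 2 samples decide P99 at n=100
print(samples_at_or_above(1000, 99))  # 11 samples decide P99 at n=1000
print(percentile(synthetic, 99))      # 5.0: set entirely by the two slow requests
```

Under this definition, two slow requests out of 100 are enough to set P99, which is consistent with treating the tail values here as noise.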

Security Audit

| severity | file | issue |
| --- | --- | --- |
| MEDIUM | infra/terraform/main.tf | API Gateway has authorization = "NONE" on all resources; no resource policy, throttling, or access logging |
| MEDIUM | infra/terraform/main.tf | Lambda Function URLs grant principal = "*" constrained only by source_account |
| MEDIUM | infra/terraform/main.tf | IAM grants overly broad: DataZone owner has s3:* over both buckets; control plane can mutate Lambda config and write secrets |
| MEDIUM | infra/terraform/main.tf | JWT signing secret has no Secrets Manager rotation resource |

These are infrastructure hardening gaps, not flaws in the core authorization design.

Quality Audit

| severity | file | issue |
| --- | --- | --- |
| HIGH | src/raja/enforcer.py, src/raja/token.py | Coverage at 69 % and 71 % respectively, below audit targets for core auth logic |
| HIGH | lambda_handlers/rale_authorizer/handler.py | Coverage at 66 %; error branches and external-call paths unverified |
| MEDIUM | .github/workflows/ci.yml | CI does not gate on bandit, pip-audit, vulture, or coverage thresholds |

Artifacts

  • specs/23-audits/01a-code-audit-results.md — quality findings
  • specs/23-audits/02a-security-audit-results.md — security findings
  • specs/23-audits/03f-live-performance-results.md — raw performance numbers
  • specs/23-audits/04-audit-summary.md — consolidated findings and decisions

🤖 Generated with Claude Code

drernie and others added 20 commits March 23, 2026 11:52
…ve stack

- Add prerequisites section defining `hey` (brew install hey, now installed)
- Replace all placeholder endpoints with real values from tf-outputs.json
- Replace token_type:taj with correct token_type:raja; add admin key header
- Use pinned package hashes for all scale tiers (1k/10k/100k/1m)
- Create scale packages via Quilt Packaging Engine (SQS) in data-yaml-spec-tests
- Replace localhost:9901 admin stats with `aws ecs execute-command` approach
- Replace auth-disable vagueness with `terraform apply -var auth_disabled=true`
- Remove docker-compose echo server step (no local testing)
- Remove pytest performance marker approach; hey is the benchmark tool
- Move CI regression gate to Out-of-scope
- Fix security audit results: downgrade docker-compose finding to Low severity

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…peline

- Add scale/1k-1m to seed-config.yaml with pre-built URIs pointing to
  data-yaml-spec-tests; seed_packages.py skips S3/quilt push and only
  creates DataZone listings + subscription grants for packages with a uri field
- Add perf_test_bucket Terraform variable (default: data-yaml-spec-tests);
  grant RALE router Lambda read access and include bucket in
  RAJEE_PUBLIC_PATH_PREFIXES so the auth-disabled baseline can reach it
- Add verify_perf_access.py: reads principal + package URI from
  .rale-seed-state.json, probes /token then Envoy end-to-end; gates deploy
- Wire _verify-perf-access into deploy sequence; expose as ./poe verify-perf
- Fix performance spec URLs from Quilt catalog format (/b/.../packages/...)
  to RALE USL format (/<bucket>/<author>/<name>@<hash>)
- Add lesson-learned note to spec documenting the two gaps (IAM + DataZone)
  that caused all 2026-03-23 benchmark requests to fail
- Remove stale ernest-test from RAJA_USERS (.env); was never an IAM user

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… ECS exec

_extract_principal in the RALE authorizer called json.loads directly on the
x-raja-jwt-payload header. Envoy's forward_payload_header forwards the JWT
payload as base64url, not plain JSON, so the parse silently failed and the
function fell through to the ECS task role ARN — which is not in any DataZone
project — causing every authorized request to return 403.

Fix mirrors the guard already in authorize.lua: check if the value starts with
'{'; if not, base64url-decode it first. Removes the x-raja-principal workaround
from verify_perf_access.py; the benchmark now uses real SigV4-issued tokens.
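
The guard described above can be sketched as follows. This is an illustration, not the code from the RALE authorizer: the helper name `decode_jwt_payload_header` and the `sub` claim in the usage lines are assumptions.

```python
import base64
import json


def decode_jwt_payload_header(value: str) -> dict:
    """Parse a forwarded JWT payload that is either plain JSON or base64url."""
    text = value.strip()
    if not text.startswith("{"):
        # base64url values usually omit '=' padding; restore it before decoding
        padded = text + "=" * (-len(text) % 4)
        text = base64.urlsafe_b64decode(padded).decode("utf-8")
    return json.loads(text)


# Usage with a hypothetical payload: both encodings yield the same claims
claims = {"sub": "arn:aws:iam::123456789012:user/example"}
as_json = json.dumps(claims)
as_b64url = base64.urlsafe_b64encode(as_json.encode("utf-8")).decode("ascii").rstrip("=")
print(decode_jwt_payload_header(as_json) == decode_jwt_payload_header(as_b64url))
```

Raising on a malformed value, instead of falling back to another identity, would keep a failure like the one described above from turning into a blanket 403.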

Also enable ECS execute-command on the RAJEE service so the Envoy admin stats
step in the performance spec can be completed via 'aws ecs execute-command'.

Update 03b-live-performance-results.md with root-cause analysis and resolution
notes for both blockers from the 2026-03-23 run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mirrors the retry logic in tests/integration/test_rale_end_to_end.py.
Transient 503s occur while ECS replaces tasks after a service update.
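
A retry of this kind might look like the sketch below. The attempt count, the delay, and the `fetch_status` callable are assumptions for illustration; the actual logic lives in tests/integration/test_rale_end_to_end.py and may differ.

```python
import time
from typing import Callable


def get_with_retry(
    fetch_status: Callable[[], int],
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch_status until it returns something other than 503.

    Returns the last status seen, so a persistent 503 is still reported.
    """
    status = fetch_status()
    for _ in range(attempts - 1):
        if status != 503:
            break
        sleep(delay_s)  # give ECS time to bring the replacement task up
        status = fetch_status()
    return status


# Usage with a stubbed endpoint that recovers on the third call
responses = iter([503, 503, 200])
print(get_with_retry(lambda: next(responses), sleep=lambda _: None))  # 200
```

Only 503 is retried here, so any other error status surfaces immediately.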

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add ECSExec policy statement (ssmmessages:*) to rajee_task_permissions
  so ECS execute-command works for Envoy admin stats collection
- Fix verify_perf_access.py to request token_type "rajee" (not "raja")
  so the issued JWT passes Envoy's jwt_authn filter validation

All three verify_perf_access.py checks now pass:
  ✓ /token → 200
  ✓ Envoy GET /data-yaml-spec-tests/scale/1k@40ff9e73 → 200
  ✓ ECS execute-command → 200

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…enchmarking

Adds a dedicated per-route bypass of the jwt_authn + lua filters, scoped
to the perf test bucket, so the baseline can be measured without toggling
auth_disabled on the live stack. Terraform passes PERF_DIRECT_BUCKET from
var.perf_test_bucket to the ECS task automatically.

verify_perf_access.py now checks direct (no-token) access first, making
IAM and route config failures immediately visible before the auth checks.
Removes the --exercise-auth-toggle path and all terraform-apply logic.

The performance audit spec is updated to use the direct route for the
baseline instead of the auth_disabled toggle cycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…UCKET unset

The env var is injected into the ECS task by Terraform but is not present
in .env when the verify script runs locally during ./poe deploy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous route matched /{perf_bucket}/ which hijacked the auth-enabled
test URL. Switch to a dedicated /_perf/ prefix with prefix_rewrite "/" so
normal auth paths are unaffected.

Add s3_perf_upstream cluster with aws_request_signing (service_name: s3)
so the direct route can access private buckets using the ECS task role.

Add rajee_task_perf_bucket IAM policy granting the Envoy task role
s3:GetObject / s3:ListBucket on perf_test_bucket (mirrors the existing
rale_router_perf_bucket policy).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… check URL

/_perf/ now routes to s3_perf_upstream (aws_request_signing service=s3),
not rale_router_cluster. No auth filters. prefix_rewrite "/" strips the
/_perf prefix before forwarding to S3.

verify_perf_access.py direct check hits /_perf/{bucket}/ (bucket root).
Accepts 200 or 403 as success — both prove S3 was reached, not Envoy auth
(which would return 401).
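
The two behaviours described above, the prefix rewrite and the pass/fail rule for the direct check, can be modelled in a few lines. This is a sketch in Python for illustration only: the real rewrite is performed by Envoy's route configuration, and both function names are hypothetical.

```python
def rewrite_perf_path(path: str, prefix: str = "/_perf/", rewrite: str = "/") -> str:
    """Model a prefix rewrite: swap the matched prefix for the rewrite value."""
    if not path.startswith(prefix):
        return path  # route does not match, so the path is left untouched
    return rewrite + path[len(prefix):]


def direct_check_passed(status: int) -> bool:
    """200 or 403 means the request reached S3; 401 would come from Envoy auth."""
    return status in (200, 403)


print(rewrite_perf_path("/_perf/data-yaml-spec-tests/"))  # /data-yaml-spec-tests/
print(direct_check_passed(403))  # True
print(direct_check_passed(401))  # False
```

Treating 403 as a pass keeps the check focused on routing: it confirms the request bypassed the auth filters, not that the task role can read the bucket root.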

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix path description: /{perf_bucket}/... → /_perf/{perf_bucket}/...
- Fix token_type in both hey benchmark examples: raja → rajee
- Add live performance results doc
- Update tf-outputs.json from latest deploy

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@drernie drernie linked an issue Mar 24, 2026 that may be closed by this pull request

@claude claude bot left a comment

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review.

Tip: disable this comment in your organization's Code Review settings.

…l host

- Add request_headers_to_remove to envoy.yaml.tmpl so clients cannot
  supply a forged JWT payload header; Lua now only sees the value written
  by jwt_authn after successful verification
- Fix deny response metadata leak in rale_authorizer (manifest_hash,
  package_name, registry no longer exposed)
- Fix hard-coded /tmp in Lambda handlers (use tempfile.gettempdir())
- Fix strict mypy: Lambda handler dirs promoted to packages
- Upgrade dependency lockfile: fastapi, starlette, mangum, boto3, ruff,
  pydantic-core and others brought to current releases
- Update audit summary and changelog per review feedback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@drernie drernie assigned kevinemoore and unassigned kevinemoore Mar 24, 2026
@drernie drernie changed the title from "23-audits: JWT+Lua auth overhead measured — no optimization needed" to "23-audits: performance, security, and quality audit results" Mar 24, 2026
@drernie drernie self-assigned this Mar 24, 2026
@drernie drernie changed the title from "23-audits: performance, security, and quality audit results" to "docs(audits): performance, security, and quality audit results" Mar 24, 2026
@drernie drernie requested a review from kevinemoore March 24, 2026 05:22
@drernie drernie merged commit a3b3de2 into main Mar 25, 2026
6 checks passed
@drernie drernie deleted the 23-audits branch March 25, 2026 17:08
Development

Successfully merging this pull request may close these issues.

Audits

2 participants