docs(audits): performance, security, and quality audit results by drernie · Pull Request #62 · quiltdata/raja

drernie · 2026-03-24T04:54:59Z

Summary

Full audit results across three dimensions — performance, security, and quality — for the RAJA MVP. See specs/23-audits/04-audit-summary.md for the canonical summary.

Performance Audit

Direct S3 baseline (scale/1k, n=100): P50 0.924 s · P95 1.096 s · P99 7.106 s

| tier | P50 (s) | P95 (s) | P99 (s) | overhead P99 | errors |
|---|---|---|---|---|---|
| `scale/1m` | 0.901 | 2.051 | 9.321 | +31 % | none |
| `scale/100k` | 0.897 | 2.791 | 9.063 | +28 % | none |
| `scale/10k` | 0.917 | 3.280 | 4.834 | −32 % † | none |
| `scale/1k` | 0.915 | 2.862 | 5.561 | −22 % † | 21 × 503 |

† P99 baseline (7.1 s) was inflated by a single outlier; these tiers beat it by chance.
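The "overhead P99" column appears to be the relative change of each tier's P99 against the 7.106 s direct-S3 baseline. The sketch below reproduces the column from the published numbers under that assumption; the helper name `overhead_pct` is illustrative and not taken from the repository.

```python
# Reproduce the "overhead P99" column from the figures in the table above.
# Assumption: overhead = (tier P99 - baseline P99) / baseline P99.
BASELINE_P99_S = 7.106  # direct S3 baseline, scale/1k, n=100

TIER_P99_S = {
    "scale/1m": 9.321,
    "scale/100k": 9.063,
    "scale/10k": 4.834,
    "scale/1k": 5.561,
}


def overhead_pct(tier_p99: float, baseline_p99: float = BASELINE_P99_S) -> int:
    """Relative P99 change versus the baseline, rounded to a whole percent."""
    return round((tier_p99 - baseline_p99) / baseline_p99 * 100)


overheads = {tier: overhead_pct(p99) for tier, p99 in TIER_P99_S.items()}
for tier, pct in overheads.items():
    print(f"{tier}: {pct:+d} %")
```

Because every value is divided by the same outlier-inflated baseline, the sign of the result says more about the baseline run than about the tiers themselves.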

P50 is flat across all tiers at ~0.91 s. The JWT+Lua filter chain adds negligible median cost — S3 fetch dominates.

P99 is noise, not tier-driven. Tail values are driven by ECS task cycling (IMDS credential-refresh bursts), not package size or auth overhead.

503s only on scale/1k. Correlated with IMDS bursts during credential refresh, not a routing problem.

Decision: no optimization required. The 15 % P99 threshold cannot be reliably evaluated from this data set. Recommended follow-up: re-run baseline at n=1 000 to reduce noise; investigate IMDS 503 recurrence as a separate issue.

Security Audit

| severity | file | issue |
|---|---|---|
| MEDIUM | `infra/terraform/main.tf` | API Gateway has `authorization = "NONE"` on all resources; no resource policy, throttling, or access logging |
| MEDIUM | `infra/terraform/main.tf` | Lambda Function URLs grant `principal = "*"` constrained only by `source_account` |
| MEDIUM | `infra/terraform/main.tf` | IAM grants overly broad: DataZone owner has `s3:*` over both buckets; control plane can mutate Lambda config and write secrets |
| MEDIUM | `infra/terraform/main.tf` | JWT signing secret has no Secrets Manager rotation resource |

These are infrastructure hardening gaps, not flaws in the core authorization design.

Quality Audit

| severity | file | issue |
|---|---|---|
| HIGH | `src/raja/enforcer.py`, `src/raja/token.py` | Coverage at 69 % and 71 % respectively — below audit targets for core auth logic |
| HIGH | `lambda_handlers/rale_authorizer/handler.py` | Coverage at 66 %; error branches and external-call paths unverified |
| MEDIUM | `.github/workflows/ci.yml` | CI does not gate on `bandit`, `pip-audit`, `vulture`, or coverage thresholds |

Artifacts

specs/23-audits/01a-code-audit-results.md — quality findings
specs/23-audits/02a-security-audit-results.md — security findings
specs/23-audits/03f-live-performance-results.md — raw performance numbers
specs/23-audits/04-audit-summary.md — consolidated findings and decisions

🤖 Generated with Claude Code

…ve stack

- Add prerequisites section defining `hey` (brew install hey, now installed)
- Replace all placeholder endpoints with real values from tf-outputs.json
- Replace token_type:taj with correct token_type:raja; add admin key header
- Use pinned package hashes for all scale tiers (1k/10k/100k/1m)
- Create scale packages via Quilt Packaging Engine (SQS) in data-yaml-spec-tests
- Replace localhost:9901 admin stats with `aws ecs execute-command` approach
- Replace auth-disable vagueness with `terraform apply -var auth_disabled=true`
- Remove docker-compose echo server step (no local testing)
- Remove pytest performance marker approach; hey is the benchmark tool
- Move CI regression gate to Out-of-scope
- Fix security audit results: downgrade docker-compose finding to Low severity

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…peline

- Add scale/1k-1m to seed-config.yaml with pre-built URIs pointing to data-yaml-spec-tests; seed_packages.py skips S3/quilt push and only creates DataZone listings + subscription grants for packages with a uri field
- Add perf_test_bucket Terraform variable (default: data-yaml-spec-tests); grant RALE router Lambda read access and include bucket in RAJEE_PUBLIC_PATH_PREFIXES so the auth-disabled baseline can reach it
- Add verify_perf_access.py: reads principal + package URI from .rale-seed-state.json, probes /token then Envoy end-to-end; gates deploy
- Wire _verify-perf-access into deploy sequence; expose as ./poe verify-perf
- Fix performance spec URLs from Quilt catalog format (/b/.../packages/...) to RALE USL format (/<bucket>/<author>/<name>@<hash>)
- Add lesson-learned note to spec documenting the two gaps (IAM + DataZone) that caused all 2026-03-23 benchmark requests to fail
- Remove stale ernest-test from RAJA_USERS (.env); was never an IAM user

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… ECS exec

_extract_principal in the RALE authorizer called json.loads directly on the x-raja-jwt-payload header. Envoy's forward_payload_header forwards the JWT payload as base64url, not plain JSON, so the parse silently failed and the function fell through to the ECS task role ARN — which is not in any DataZone project — causing every authorized request to return 403.

Fix mirrors the guard already in authorize.lua: check if the value starts with '{'; if not, base64url-decode it first. Removes the x-raja-principal workaround from verify_perf_access.py; the benchmark now uses real SigV4-issued tokens.

Also enable ECS execute-command on the RAJEE service so the Envoy admin stats step in the performance spec can be completed via 'aws ecs execute-command'.

Update 03b-live-performance-results.md with root-cause analysis and resolution notes for both blockers from the 2026-03-23 run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
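The guard described in this commit can be sketched as follows. This is a minimal illustration, not the handler's actual code: the function name `decode_jwt_payload_header` and the `sub` claim in the usage lines are assumptions, and only the "starts with '{', otherwise base64url-decode" rule comes from the commit message.

```python
import base64
import json
from typing import Any


def decode_jwt_payload_header(value: str) -> dict[str, Any]:
    """Parse a forwarded JWT payload that may be plain JSON or base64url."""
    text = value.strip()
    if not text.startswith("{"):
        # base64url values often arrive without '=' padding; restore it first.
        padded = text + "=" * (-len(text) % 4)
        text = base64.urlsafe_b64decode(padded).decode("utf-8")
    payload = json.loads(text)
    if not isinstance(payload, dict):
        raise ValueError("JWT payload is not a JSON object")
    return payload


# Usage: both encodings yield the same claims (the "sub" claim is hypothetical).
claims = {"sub": "arn:aws:iam::123456789012:user/example"}
as_json = json.dumps(claims)
as_b64url = base64.urlsafe_b64encode(as_json.encode("utf-8")).decode("ascii").rstrip("=")
print(decode_jwt_payload_header(as_json) == decode_jwt_payload_header(as_b64url))
```

Raising on a malformed payload, instead of falling through to another identity, avoids the silent fallback to the task role ARN that the commit identifies as the root cause.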

Mirrors the retry logic in tests/integration/test_rale_end_to_end.py. Transient 503s occur while ECS replaces tasks after a service update.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
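A retry of this kind might look like the sketch below. The function name, the attempt count, and the delay are illustrative assumptions; the real logic lives in verify_perf_access.py and the integration test named above.

```python
import time
from typing import Callable


def status_with_retry(
    fetch_status: Callable[[], int],
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch_status until it returns something other than 503.

    Returns the last status seen, which is still 503 if every attempt failed.
    """
    status = fetch_status()
    for _ in range(attempts - 1):
        if status != 503:
            break
        sleep(delay_s)  # give ECS time to finish replacing tasks
        status = fetch_status()
    return status


# Usage with a stubbed probe: two transient 503s, then success.
responses = iter([503, 503, 200])
pauses: list[float] = []
final_status = status_with_retry(lambda: next(responses), sleep=pauses.append)
print(final_status, len(pauses))
```

Injecting `sleep` keeps the helper testable without real waiting; only 503 is retried, so genuine 401 or 403 failures still surface immediately.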

- Add ECSExec policy statement (ssmmessages:*) to rajee_task_permissions so ECS execute-command works for Envoy admin stats collection
- Fix verify_perf_access.py to request token_type "rajee" (not "raja") so the issued JWT passes Envoy's jwt_authn filter validation

All three verify_perf_access.py checks now pass:

✓ /token → 200
✓ Envoy GET /data-yaml-spec-tests/scale/1k@40ff9e73 → 200
✓ ECS execute-command → 200

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…enchmarking

Adds a dedicated per-route bypass of the jwt_authn + lua filters, scoped to the perf test bucket, so the baseline can be measured without toggling auth_disabled on the live stack. Terraform passes PERF_DIRECT_BUCKET from var.perf_test_bucket to the ECS task automatically.

verify_perf_access.py now checks direct (no-token) access first, making IAM and route config failures immediately visible before the auth checks. Removes the --exercise-auth-toggle path and all terraform-apply logic.

The performance audit spec is updated to use the direct route for the baseline instead of the auth_disabled toggle cycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…UCKET unset

The env var is injected into the ECS task by Terraform but is not present in .env when the verify script runs locally during ./poe deploy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
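One way such a fallback could work is sketched below. It assumes the perf URI is a path in the RALE USL shape quoted elsewhere in this PR, `/<bucket>/<author>/<name>@<hash>`; the actual format stored in .rale-seed-state.json is not shown here, and the function name `resolve_perf_bucket` is hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_perf_bucket(perf_uri: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer PERF_DIRECT_BUCKET; otherwise take the first path segment of perf_uri."""
    env = os.environ if env is None else env
    bucket = env.get("PERF_DIRECT_BUCKET", "").strip()
    if bucket:
        return bucket
    # Assumed URI shape: /<bucket>/<author>/<name>@<hash>
    first_segment = perf_uri.lstrip("/").split("/", 1)[0]
    if not first_segment:
        raise ValueError(f"cannot derive a bucket from perf_uri {perf_uri!r}")
    return first_segment


# Usage: an empty environment falls back to the URI.
derived = resolve_perf_bucket("/data-yaml-spec-tests/scale/1k@40ff9e73", env={})
print(derived)
```

Passing the environment as a parameter keeps the local (.env) and in-task (Terraform-injected) cases easy to exercise separately.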

The previous route matched /{perf_bucket}/ which hijacked the auth-enabled test URL. Switch to a dedicated /_perf/ prefix with prefix_rewrite "/" so normal auth paths are unaffected.

Add s3_perf_upstream cluster with aws_request_signing (service_name: s3) so the direct route can access private buckets using the ECS task role.

Add rajee_task_perf_bucket IAM policy granting the Envoy task role s3:GetObject / s3:ListBucket on perf_test_bucket (mirrors the existing rale_router_perf_bucket policy).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… check URL

/_perf/ now routes to s3_perf_upstream (aws_request_signing service=s3), not rale_router_cluster. No auth filters. prefix_rewrite "/" strips the /_perf prefix before forwarding to S3.

verify_perf_access.py direct check hits /_perf/{bucket}/ (bucket root). Accepts 200 or 403 as success — both prove S3 was reached, not Envoy auth (which would return 401).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
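The acceptance rule for the direct check can be written as a small classifier. The function name and the returned labels are illustrative; only the status-code interpretation (200 or 403 means S3 answered, 401 means Envoy's auth filters intercepted the request) comes from the commit message.

```python
def classify_direct_status(status: int) -> str:
    """Interpret the HTTP status returned by the /_perf/{bucket}/ direct check."""
    if status in (200, 403):
        # Either code is produced by S3 itself, so the bypass route works.
        return "reached-s3"
    if status == 401:
        # A 401 means the request was stopped by Envoy's auth filters.
        return "blocked-by-envoy-auth"
    return "unexpected"


for code in (200, 403, 401, 503):
    print(code, classify_direct_status(code))
```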

- Fix path description: /{perf_bucket}/... → /_perf/{perf_bucket}/...
- Fix token_type in both hey benchmark examples: raja → rajee
- Add live performance results doc
- Update tf-outputs.json from latest deploy

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review.

Tip: disable this comment in your organization's Code Review settings.

…l host

- Add request_headers_to_remove to envoy.yaml.tmpl so clients cannot supply a forged JWT payload header; Lua now only sees the value written by jwt_authn after successful verification
- Fix deny response metadata leak in rale_authorizer (manifest_hash, package_name, registry no longer exposed)
- Fix hard-coded /tmp in Lambda handlers (use tempfile.gettempdir())
- Fix strict mypy: Lambda handler dirs promoted to packages
- Upgrade dependency lockfile: fastapi, starlette, mangum, boto3, ruff, pydantic-core and others brought to current releases
- Update audit summary and changelog per review feedback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
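The /tmp fix in the list above can be illustrated with a short sketch. The helper name `scratch_path` and the file name are hypothetical; `tempfile.gettempdir()` is the standard-library call named in the commit. It consults the TMPDIR, TEMP, and TMP environment variables before falling back to platform defaults.

```python
import os
import tempfile


def scratch_path(filename: str) -> str:
    """Build a path under the platform temp directory instead of a literal /tmp."""
    return os.path.join(tempfile.gettempdir(), filename)


# Usage: the directory part always matches whatever the platform reports.
path = scratch_path("manifest.json")
print(os.path.dirname(path) == tempfile.gettempdir())
```

This also tends to satisfy linters such as bandit, which flag hard-coded temp directory literals.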

drernie and others added 20 commits March 23, 2026 11:52

Add audit specs and results

2b7a097

Fix code audit issues

ca1f04d

Add live performance audit results

a9ee115

fix(verify-perf): retry on 503 connection termination (ECS task cycling)

3b87f63

Update perf verification and audit results

b701b2c

fix(verify-perf): derive perf_bucket from perf_uri when PERF_DIRECT_B…

789e4fe

Align perf verifier with benchmark spec

92abe07

Use real file paths for performance audit

3b4fb47

chore(spec): remove superseded 03a-e drafts

2697d41

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(audits): add 04-audit-summary with final perf results

61b9107

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Bump version to 1.3.2

b982afe

docs(changelog): add 1.3.2 entry for performance audit results

2c2200a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

drernie linked an issue Mar 24, 2026 that may be closed by this pull request

Audits #23

Closed

claude bot reviewed Mar 24, 2026

drernie assigned kevinemoore and unassigned kevinemoore Mar 24, 2026

drernie changed the title ~~23-audits: JWT+Lua auth overhead measured — no optimization needed~~ 23-audits: performance, security, and quality audit results Mar 24, 2026

drernie self-assigned this Mar 24, 2026

drernie changed the title ~~23-audits: performance, security, and quality audit results~~ docs(audits): performance, security, and quality audit results Mar 24, 2026

drernie requested a review from kevinemoore March 24, 2026 05:22

drernie merged commit a3b3de2 into main Mar 25, 2026
6 checks passed

drernie deleted the 23-audits branch March 25, 2026 17:08

drernie merged 21 commits into main from 23-audits
