This document describes the three intentional faults in the application and the full demo cycle. These faults exist to trigger the self-healing pipeline (CloudWatch -> Lambda -> RAG -> GitHub Actions -> Deploy).
The system supports a repeatable demo cycle:
- Push faulty code — The split fault files are deployed with their intentional bugs
- Trigger faults — Use the UI or curl to trigger one or more faults
- Dashboard shows errors — Incidents appear as "detected" with severity/type info
- Self-healing triggers — CloudWatch -> Lambda -> Backboard RAG -> Claude -> GitHub push -> CI/CD deploy
- Faults stop triggering — After deploy, the fixed code no longer produces errors
- Reset All — Clears incidents, SSM cooldowns, pauses self-healing, and pushes the original faulty split files back to GitHub
- CI/CD redeploys faulty code — The faults are restored, ready for another demo cycle
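For step 2, the faults can be triggered from the UI, with curl, or with a few lines of Python. A minimal sketch using `requests` against a local instance (the endpoint paths are the ones documented below):

```python
import requests

BASE = "http://localhost:8000"

# Trigger each intentional fault once; all three are POST endpoints.
for path in ("/test-fault/run", "/test-fault/external-api", "/test-fault/db-timeout"):
    try:
        r = requests.post(f"{BASE}{path}", timeout=15)
        print(path, r.status_code)
    except requests.exceptions.RequestException as exc:
        print(path, "request failed:", exc)
```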
The stable route wrappers live in hello/page/_fault_cores.py, while the actual fault
implementations live in hello/page/views_sql.py, hello/page/views_api.py, and
hello/page/views_db.py. The self-healing Lambda must only read and patch those split files.
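The wrapper/implementation split might look roughly like the sketch below (the wrapper name and route registration are illustrative; only `test_fault_run()` is named in this document):

```python
# hello/page/_fault_cores.py -- sketch of the stable wrapper pattern.
from hello.page import views_sql

def run_sql_fault():
    # The wrapper stays stable; the implementation lives in the split file,
    # which is the only file the self-healing Lambda reads and patches.
    return views_sql.test_fault_run()
```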
- `ENABLE_FAULT_INJECTION` must be `True` in `config/settings.py` (default: `True`)
- The application must be running with Docker Compose (`docker compose up`)
- PostgreSQL, Redis, and the mock API service must be available
- For "Reset All" to restore faulty code, set one of:
  - `GITHUB_LAMBDA_NAME` (to invoke the GithubTool Lambda), OR
  - `GITHUB_TOKEN`, `GITHUB_OWNER`, `GITHUB_REPO` (for direct GitHub API)
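A minimal sketch of how these settings might be wired up, assuming they are read from the environment (the exact handling in `config/settings.py` may differ):

```python
# config/settings.py -- sketch only; reading from os.getenv is an assumption.
import os

# Master switch for the intentional faults (documented default: True)
ENABLE_FAULT_INJECTION = os.getenv("ENABLE_FAULT_INJECTION", "true").lower() == "true"

# "Reset All" needs one of the two credential sets below
GITHUB_LAMBDA_NAME = os.getenv("GITHUB_LAMBDA_NAME")  # invoke the GithubTool Lambda, or...
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")              # ...call the GitHub API directly
GITHUB_OWNER = os.getenv("GITHUB_OWNER")
GITHUB_REPO = os.getenv("GITHUB_REPO")
```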
Route: POST /test-fault/run
File: hello/page/views_sql.py -> test_fault_run()
Executes intentionally malformed SQL (SELECT FROM) which always fails with a syntax error.
```bash
curl -X POST http://localhost:8000/test-fault/run
```

- PostgreSQL raises a syntax error on `SELECT FROM`
- The error is caught, rolled back, and logged to stderr
- A live incident is created with error_code `FAULT_SQL_INJECTION_TEST`
- Returns HTTP 500
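Those effects come from a handler shaped roughly like the sketch below. It is a sketch only: `create_live_incident` is a hypothetical stand-in for the project's incident helper, and `db` is assumed to be the app's Flask-SQLAlchemy handle.

```python
# Sketch of hello/page/views_sql.py::test_fault_run() -- error-handling shape only.
import sys
import traceback
from flask import jsonify
from sqlalchemy import text
from sqlalchemy.exc import ProgrammingError

def test_fault_run():
    try:
        db.session.execute(text("SELECT FROM"))  # intentionally invalid SQL
    except ProgrammingError:
        db.session.rollback()                    # roll back the failed transaction
        traceback.print_exc(file=sys.stderr)     # stderr is what ECS ships to CloudWatch
        create_live_incident(error_code="FAULT_SQL_INJECTION_TEST")  # hypothetical helper
        return jsonify({"error": "FAULT_SQL_INJECTION_TEST"}), 500
    return jsonify({"status": "ok"}), 200
```

The faulty statement as it appears in the split file: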
```python
db.session.execute(text("SELECT FROM"))  # invalid SQL on purpose
```

Route: POST /test-fault/external-api
File: hello/page/views_api.py -> test_fault_external_api()
Calls the configured mock external API ($MOCK_API_BASE_URL/data, default
http://mock_api:5001/data locally) with a 3-second timeout.
The mock API (mock_api.py) is configured with API_FAULT_MODE=latency,wrong_data which causes:
- 60% chance of a 3.4-8 second delay (causes timeout because delay > 3s)
- 30% chance of returning malformed data with HTTP 200 (`{"value": "forty-two"}`)
Combined, ~90% of requests fail — either from timeout or bad upstream data.
```bash
curl -X POST http://localhost:8000/test-fault/external-api
```

Run it multiple times — it's probabilistic, not deterministic.
On timeout: Returns HTTP 504, creates incident with reason external_timeout
On wrong data: Returns HTTP 504, creates incident with reason wrong_data
On success (~30%): Returns HTTP 200 with {"value": 42}
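The outcome mapping might look roughly like the sketch below. `create_live_incident` is again a hypothetical stand-in, and the response bodies are assumptions; only the status codes and reasons listed above come from the documented behaviour.

```python
# Sketch of hello/page/views_api.py::test_fault_external_api() -- outcome mapping only.
import os
import requests
from flask import jsonify

def test_fault_external_api():
    base = os.getenv("MOCK_API_BASE_URL", "http://mock_api:5001").rstrip("/")
    try:
        r = requests.get(f"{base}/data", timeout=3)
    except requests.exceptions.Timeout:
        create_live_incident(reason="external_timeout")   # hypothetical helper
        return jsonify({"error": "external_timeout"}), 504
    payload = r.json()
    if not isinstance(payload.get("value"), int):          # e.g. {"value": "forty-two"}
        create_live_incident(reason="wrong_data")
        return jsonify({"error": "wrong_data"}), 504
    return jsonify(payload), 200                            # healthy: {"value": 42}
```

The key lines from views_api.py: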
```python
mock_api_base_url = os.getenv("MOCK_API_BASE_URL", "http://mock_api:5001").rstrip("/")
r = requests.get(f"{mock_api_base_url}/data", timeout=3)  # 3s timeout vs >3s mock delay
```

Route: POST /test-fault/db-timeout
File: hello/page/views_db.py -> test_fault_db_timeout()
Sets a 5.5-second statement timeout then runs SELECT pg_sleep(10). The timeout is shorter
than the sleep, so PostgreSQL waits a bit over 5 seconds and then cancels the query with a
real statement-timeout error.
```bash
curl -X POST http://localhost:8000/test-fault/db-timeout
```

- `SET LOCAL statement_timeout = '5500ms'` sets a ~5.5-second limit
- `SELECT pg_sleep(10)` starts but is cancelled after ~5.5 seconds
- The error is caught, rolled back, and logged to stderr
- A live incident is created with error_code `FAULT_DB_TIMEOUT`
- Returns HTTP 500
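The same behaviour can be reproduced outside the app with plain SQLAlchemy; a standalone sketch (the DSN is illustrative). Note that `SET LOCAL` only applies inside the transaction that issues it, which is why both statements must share one transaction:

```python
from sqlalchemy import create_engine, text

# Illustrative DSN -- point it at the compose Postgres instance.
engine = create_engine("postgresql://app:app@localhost:5432/app")

with engine.begin() as conn:  # a single transaction, so SET LOCAL covers pg_sleep
    conn.execute(text("SET LOCAL statement_timeout = '5500ms'"))
    # Raises OperationalError after ~5.5s: "canceling statement due to statement timeout"
    conn.execute(text("SELECT pg_sleep(10)"))
```

The corresponding lines in hello/page/views_db.py: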
db.session.execute(text("SET LOCAL statement_timeout = '5500ms';"))
db.session.execute(text("SELECT pg_sleep(10);")) # always times out (~10s > 5.5s)Endpoint: POST /developer/incidents/reset
- Deletes all live incidents from PostgreSQL
- Clears AWS SSM fault cooldown parameters (so faults can be processed again immediately)
- Clears the self-healing cooldown markers and records a reset timestamp
- Restores `hello/page/views_sql.py`, `hello/page/views_api.py`, and `hello/page/views_db.py`
- Pushes the original faulty split-file bodies back to GitHub (triggers CI/CD redeploy)
The original faulty split-file content is stored in hello/page/fault_sql.txt,
hello/page/fault_api.txt, and hello/page/fault_db.txt.
On reset, the endpoint pushes this content to GitHub using:
- The GithubTool Lambda (if `GITHUB_LAMBDA_NAME` is set), or
- The GitHub API directly (if `GITHUB_TOKEN` / `GITHUB_OWNER` / `GITHUB_REPO` are set)
This triggers the CI/CD pipeline which redeploys the faulty code.
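A sketch of the push logic the reset endpoint might use. The GithubTool Lambda payload fields are a guess; the direct path uses the standard GitHub contents API (GET for the current file `sha`, then PUT with base64 content):

```python
# Sketch: push one restored file back to the repo. Env var names match the
# prerequisites above; the Lambda payload shape is hypothetical.
import base64
import json
import os
import boto3
import requests

def push_restored_file(path: str, content: str, message: str) -> None:
    lambda_name = os.getenv("GITHUB_LAMBDA_NAME")
    if lambda_name:
        boto3.client("lambda").invoke(
            FunctionName=lambda_name,
            Payload=json.dumps({"path": path, "content": content, "message": message}),
        )
        return

    token = os.getenv("GITHUB_TOKEN")
    owner = os.getenv("GITHUB_OWNER")
    repo = os.getenv("GITHUB_REPO")
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/vnd.github+json"}
    sha = requests.get(url, headers=headers, timeout=10).json().get("sha")  # needed to update an existing file
    requests.put(
        url,
        headers=headers,
        timeout=10,
        json={
            "message": message,
            "content": base64.b64encode(content.encode()).decode(),
            "sha": sha,
        },
    )
```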
- Fault triggers -> error logged to stderr -> shipped to CloudWatch by ECS
- CloudWatch subscription filter -> triggers FaultRouter Lambda
- Lambda -> sends error to Backboard RAG for analysis, then calls Claude API
- Claude reads only the routed split file -> sees the bug -> generates fix -> pushes to GitHub
- GitHub Actions -> builds, tests, deploys the fix to ECS
- Pipeline callback -> updates incident status to "resolved" on dashboard
- After deploy -> triggering the same fault no longer produces an error (code is fixed)
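The CloudWatch-to-Lambda hop uses the standard subscription-filter event format (base64-encoded, gzipped JSON). A sketch of the decode step in the FaultRouter Lambda; the routing and downstream calls (`route_fault`) are only suggested here:

```python
# Sketch: decode a CloudWatch Logs subscription event inside the FaultRouter Lambda.
import base64
import gzip
import json

def handler(event, context):
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
    for log_event in payload["logEvents"]:
        message = log_event["message"]
        # Route by the error code embedded in the stderr line (e.g. FAULT_SQL_INJECTION_TEST),
        # then hand the message to Backboard RAG / Claude for analysis and a fix.
        if "FAULT_" in message:
            route_fault(message)  # hypothetical downstream entry point
```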
- Check `ENABLE_FAULT_INJECTION` is `True` in `config/settings.py`
- Check the split fault file for the route still has the faulty code (not a fixed version)
- Check `git log --oneline -10` for commits like "[FAULT:...]" that may have fixed the faults
- For DB timeout: ensure `SET LOCAL statement_timeout = '5500ms'` precedes `pg_sleep(10)`
- Check PostgreSQL is running: `docker compose logs postgres`
- Check the live store: `curl http://localhost:8000/developer/incidents/api/data`
- Check app logs for "Failed to create incident" errors
- Check GitHub credentials: `GITHUB_TOKEN` / `GITHUB_OWNER` / `GITHUB_REPO` or `GITHUB_LAMBDA_NAME`
- Check the reset response for the `code_reset` field: `curl -X POST http://localhost:8000/developer/incidents/reset`
- Wait for CI/CD to complete after reset (check GitHub Actions)
This was a known bug (now fixed). The _sync_status function only auto-resolves incidents
that have had some remediation action taken (auto_fix_pushed etc.). Newly detected
incidents stay as "detected" until the self-healing loop processes them.
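In code terms, the guard looks something like the sketch below; field and status names beyond those mentioned above are assumptions:

```python
def _sync_status(incident):
    # Only incidents that already have a remediation action recorded are
    # auto-resolved; freshly detected incidents stay "detected" until the
    # self-healing loop acts on them.
    if incident.status == "detected" and incident.last_action == "auto_fix_pushed":
        incident.status = "resolved"
```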