This document defines complete recovery scenarios, data restoration testing, service restoration runbooks, and repeatable testing procedures for TeachLink.
- Purpose: Ensure timely, verifiable recovery from incidents affecting data, indexers, off-chain artifacts, or full environment failures.
- Scope: smart contract state snapshots and manifests, indexer databases, off-chain artifacts referenced by integrity hashes, deployment infrastructure (indexers, API services, observability), critical third-party integrations.
- On-call Recovery Lead: coordinates recovery and communications.
- Infrastructure Engineer: restores infrastructure and storage.
- Indexer Operator: restores indexer DBs, replays events.
- Application Owner: runs smoke tests and validates functionality.
- Compliance/Audit: collects evidence artifacts and signs off.
- Recovery Time Objective (RTO): target by component (e.g., indexer DB 2 hours, API service 4 hours, full environment 8 hours).
- Recovery Point Objective (RPO): target snapshot age (e.g., off-chain artifacts hourly, indexer WAL-based replay to last confirmed block).
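As an illustration, the targets above can be encoded so that a drill or incident review compares measured recovery times against them. This is a minimal sketch that assumes the example RTO values from the list; the component keys and the `rto_breaches` helper are hypothetical names, not part of TeachLink's tooling.

```python
from datetime import timedelta

# Example targets copied from the list above; real values belong in the DR policy.
RTO_TARGETS = {
    "indexer_db": timedelta(hours=2),
    "api_service": timedelta(hours=4),
    "full_environment": timedelta(hours=8),
}


def rto_breaches(durations: dict[str, timedelta]) -> dict[str, timedelta]:
    """Return, per component, how far the measured recovery time exceeded its RTO."""
    breaches = {}
    for component, elapsed in durations.items():
        target = RTO_TARGETS[component]
        if elapsed > target:
            breaches[component] = elapsed - target
    return breaches


# Hypothetical measurements from a drill.
measured = {
    "indexer_db": timedelta(hours=1, minutes=40),
    "api_service": timedelta(hours=4, minutes=30),
}
print(rto_breaches(measured))
```

With these sample measurements only `api_service` is reported, 30 minutes over its target.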
Data Corruption (single-table / artifact)
- Detection: alert from integrity-check or failed verification.
- Immediate action: isolate affected service, promote read-only fallback if available.
- Restore: identify the latest good manifest, restore the artifact from `backups/artifacts/<manifest>`, and verify its integrity hash.
- Validation: run `data_integrity_verify` and application smoke tests.
- Post-recovery: replay missing events if required; record the incident and corrective actions.
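One way to make "identify the latest good manifest" mechanical is to pick the newest manifest that passed verification and predates the estimated start of the corruption. The sketch below assumes a catalogue of manifest records; the `ManifestRecord` fields and the `latest_good_manifest` helper are hypothetical and do not describe the actual manifest format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ManifestRecord:
    # Hypothetical catalogue entry; the real manifest schema may differ.
    manifest_id: str
    created_at: datetime
    verified: bool  # True if the last integrity check of this backup passed


def latest_good_manifest(
    records: list[ManifestRecord], corrupted_at: datetime
) -> Optional[ManifestRecord]:
    """Pick the newest verified manifest created before the corruption began."""
    candidates = [r for r in records if r.verified and r.created_at < corrupted_at]
    return max(candidates, key=lambda r: r.created_at, default=None)
```

Returning `None` when nothing qualifies forces the caller to escalate rather than silently restoring from a suspect backup.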
Partial Data Loss (indexer shards or partial contract state)
- Detection: missing indexer metrics, inconsistent query results.
- Restore: restore indexer DB from latest full backup; replay WAL or event stream from last backup point to current.
- Validation: run indexer reconciliation job and compare counts with golden manifest.
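The count comparison in the validation step could look roughly like the following. It is a sketch under assumptions: the entity names, the relative tolerance of 0.1%, and the `reconcile_counts` helper are illustrative, and the real reconciliation job may compare more than row counts.

```python
def reconcile_counts(
    restored: dict[str, int], golden: dict[str, int], tolerance: float = 0.001
) -> dict[str, tuple[int, int]]:
    """Return {entity: (restored_count, golden_count)} for entities outside tolerance.

    `tolerance` is relative to the golden count (0.001 means 0.1%).
    Entities missing from `restored` are treated as a count of zero.
    """
    mismatches = {}
    for entity, expected in golden.items():
        actual = restored.get(entity, 0)
        if abs(actual - expected) > expected * tolerance:
            mismatches[entity] = (actual, expected)
    return mismatches


# Hypothetical counts taken from a golden manifest and a restored indexer DB.
golden = {"courses": 1_000, "enrollments": 50_000}
restored = {"courses": 1_000, "enrollments": 49_990}
print(reconcile_counts(restored, golden))
```

An empty result means every entity is within tolerance; anything else lists the restored and expected counts side by side for the incident record.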
Full Environment Loss (region or cluster outage)
- Actions:
  - Fail over to the secondary region (if configured), or provision a new cluster following the `infrastructure/runbooks/provision_cluster.md` steps.
  - Restore storage volumes from backups and attach them to instances.
  - Redeploy indexers, APIs, and workers using the tagged release that was in use at backup time.
- Validation: run end-to-end smoke tests and synthetic transactions.
Key/Secrets Compromise
- Actions: rotate compromised secrets, revoke affected credentials, update manifests referencing secrets, redeploy services with new secrets.
- Validation: verify that unauthorized access has stopped, and rotate verification keys where applicable.
Third-Party Service Outage (e.g., cloud storage)
- Actions: switch to the configured secondary provider, or restore artifacts from an alternative replication target.
- Validation: confirm read/write operations against the failover provider.
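A read/write confirmation against the failover provider can be as small as a write-then-read-back probe. The sketch below assumes a minimal storage interface; `ObjectStore`, `InMemoryStore`, and `probe_read_write` are hypothetical names, and the in-memory class only stands in for a real provider client so the example runs on its own.

```python
import uuid
from typing import Protocol


class ObjectStore(Protocol):
    """Minimal interface assumed for any storage provider client."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> None: ...


class InMemoryStore:
    """Stand-in for a real provider client, used only to make the sketch runnable."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)


def probe_read_write(store: ObjectStore, prefix: str = "dr-probe/") -> bool:
    """Write a random object, read it back, and report whether the bytes match."""
    key = prefix + uuid.uuid4().hex
    payload = uuid.uuid4().bytes
    try:
        store.put(key, payload)
        return store.get(key) == payload
    except Exception:
        return False
    finally:
        try:
            store.delete(key)
        except Exception:
            pass  # best-effort cleanup; a failed delete does not change the result
```

Using a random key and payload avoids false positives from stale objects left behind by earlier probes.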
- Pre-reqs: isolated test environment, service account with restore privileges, sample backup manifest id, and a verification key.
Step-by-step restore (example):
- Provision an isolated environment (use VM/container image `teachlink/dr-test`).
- Fetch the backup manifest: `aws s3 cp s3://teachlink-backups/manifests/<manifest>.json ./manifest.json` (or the equivalent provider command).
- Validate manifest integrity: compare the stored `integrity_hash` with the `sha256sum` of each artifact.
- Restore artifacts to test storage: `restore_tool --manifest ./manifest.json --target ./restore`.
- Restore the indexer DB (if included): stop the indexer service, load the DB snapshot, start the indexer, and run `indexer_replay --from <manifest_block>`.
- Run the automated validation suite: `scripts/recovery_test.sh` (Linux/macOS) or `scripts/recovery_test.ps1` (Windows).
- Record the outcome: capture `RecoveryExecutedEvent` if run on-chain, or save `dr_report.json` in `backups/recovery_reports/`.
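The integrity comparison in the steps above can be sketched in a few lines. The manifest layout used here (an `artifacts` array whose entries carry `path` and `integrity_hash`) is an assumption made for illustration; only the `integrity_hash` name comes from this document, and `verify_restore` is a hypothetical helper.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(manifest_path: Path, restore_dir: Path) -> list[str]:
    """Return the manifest paths of artifacts that are missing or fail the hash check."""
    manifest = json.loads(manifest_path.read_text())
    failures = []
    for entry in manifest["artifacts"]:  # assumed manifest layout
        artifact = restore_dir / entry["path"]
        if not artifact.is_file() or sha256_of(artifact) != entry["integrity_hash"]:
            failures.append(entry["path"])
    return failures
```

A call such as `verify_restore(Path("manifest.json"), Path("restore"))` returns an empty list when every restored artifact matches its recorded hash.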
Verification checks:
- Hash match for each restored artifact.
- Application smoke tests pass: health endpoints, a sample read, and sample write (if safe).
- Indexer reconciliation: counts within tolerance vs golden manifest.
Roll-back plan: if validation fails, revert test environment, record failure with logs, and iterate on restore steps.
Triage & Communication
- Notify stakeholders and escalate via on-call rota.
- Create incident ticket with severity, target RTO/RPO, and assigned roles.
Stabilize & Isolate
- Disable incoming traffic to affected services via load balancer/DNS.
- Ensure monitoring continues to capture metrics and logs.
Restore Persistence Layer
- Restore object store from backups.
- Restore databases (indexer DBs) from snapshots and replay event streams.
Restore Core Services in Order
- Indexer services (bring online first so downstream APIs can serve data).
- API/backend services.
- Worker/background jobs.
- Frontend and public endpoints.
Validate
- Execute smoke test suite and synthetic transactions.
- Run integrity verification and reconcile indexer counts.
Scale & Harden
- Scale services to target capacity.
- Apply any hotfixes and mitigations identified during recovery.
Close Incident
- Document timeline, RTO/RPO achieved, root cause analysis, and follow-ups.
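The service ordering in the runbook above (indexers, then APIs, then workers, then frontend) lends itself to a health-gated loop: start a tier, confirm it is healthy, and only then move on. The following is a sketch, not the actual orchestration; the service names, the `start` and `is_healthy` callables, and `restore_in_order` are all hypothetical.

```python
from typing import Callable, Sequence

# Order taken from the runbook: downstream tiers depend on the ones before them.
RESTORE_ORDER = ("indexer", "api", "worker", "frontend")


def restore_in_order(
    start: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    order: Sequence[str] = RESTORE_ORDER,
) -> list[str]:
    """Start services one tier at a time and stop at the first failed health check.

    Returns the services that were started and confirmed healthy.
    """
    online = []
    for service in order:
        start(service)
        if not is_healthy(service):
            break  # do not start tiers that depend on an unhealthy one
        online.append(service)
    return online
```

Stopping at the first unhealthy tier keeps downstream services from serving stale or partial data while the earlier tier is still being repaired.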
Drill types and cadence:
- Backup verification: weekly automated checks.
- Restoration drill (isolated): monthly.
- Full DR scenario (cross-team): quarterly.
- Tabletop exercises (process review): semi-annually.
Drill execution checklist:
- Announce drill window and non-production environment targets.
- Run `scripts/recovery_test.sh` or `scripts/recovery_test.ps1`.
- Validate results and collect `dr_report.json` and logs.
- Post-drill review and action items.
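To keep the drill cadence listed above from slipping, a small check can flag drill types that are overdue. This sketch approximates the calendar cadences with day counts (7, 30, 91, and 182 days), which is an assumption; the drill keys and the `overdue_drills` helper are hypothetical.

```python
from datetime import date, timedelta

# Day counts approximating weekly, monthly, quarterly, and semi-annual cadences.
CADENCE_DAYS = {
    "backup_verification": 7,
    "restoration_drill": 30,
    "full_dr_scenario": 91,
    "tabletop_exercise": 182,
}


def overdue_drills(last_run: dict[str, date], today: date) -> list[str]:
    """Return drill types never run, or last run longer ago than their cadence."""
    overdue = []
    for drill, days in CADENCE_DAYS.items():
        last = last_run.get(drill)
        if last is None or today - last > timedelta(days=days):
            overdue.append(drill)
    return sorted(overdue)
```

A drill type with no recorded run is treated as overdue, so a newly added drill shows up immediately rather than after its first interval elapses.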
See scripts/recovery_test.sh and scripts/recovery_test.ps1 for a small, repeatable validation harness that:
- verifies artifact integrity,
- checks indexer reconciliation endpoints,
- runs smoke tests against restored environment,
- emits a `dr_report.json` with pass/fail and timing metrics.
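The shape of such a harness can be illustrated as a loop that runs named checks, times each one, and collects pass/fail results into a JSON-serializable report. This is only a sketch of the idea, not the contents of the referenced scripts; `run_checks`, the check names, and the report keys are hypothetical.

```python
import json
import time
from typing import Callable


def run_checks(checks: dict[str, Callable[[], bool]]) -> dict:
    """Run each named check and time it; a check that raises counts as a failure."""
    results = []
    for name, check in checks.items():
        started = time.monotonic()
        try:
            passed = bool(check())
        except Exception:
            passed = False
        elapsed = round(time.monotonic() - started, 3)
        results.append({"name": name, "passed": passed, "seconds": elapsed})
    return {"passed": all(r["passed"] for r in results), "checks": results}


# Placeholder checks; real ones would call the integrity, reconciliation,
# and smoke-test steps against the restored environment.
report = run_checks({
    "artifact_integrity": lambda: True,
    "indexer_reconciliation": lambda: True,
    "smoke_tests": lambda: True,
})
print(json.dumps(report, indent=2))
```

Treating an exception as a failed check keeps one broken step from aborting the run before the report is written.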
- Store drill reports in `backups/recovery_reports/<YYYY-MM-DD>-<drill-id>.json`.
- Attach relevant logs, verification traces, and artifact manifests.
- Recovery duration per component (seconds)
- Success/failure boolean
- Data integrity pass rate
- Number of manual interventions required
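The four metrics above, together with the report naming convention, could be assembled as follows. The field names and both helpers are hypothetical; this document does not define the `dr_report.json` schema, so treat the sketch as one possible layout.

```python
from datetime import date


def build_drill_report(
    drill_id: str,
    run_date: date,
    durations_s: dict[str, float],
    artifacts_checked: int,
    artifacts_passed: int,
    manual_interventions: int,
    success: bool,
) -> dict:
    """Assemble the drill metrics into a JSON-serializable dict."""
    pass_rate = artifacts_passed / artifacts_checked if artifacts_checked else 0.0
    return {
        "drill_id": drill_id,
        "date": run_date.isoformat(),
        "recovery_duration_seconds": durations_s,  # per component
        "success": success,
        "data_integrity_pass_rate": round(pass_rate, 4),
        "manual_interventions": manual_interventions,
    }


def report_filename(report: dict) -> str:
    """Build the storage path following the <YYYY-MM-DD>-<drill-id>.json convention."""
    return f"backups/recovery_reports/{report['date']}-{report['drill_id']}.json"
```

Reporting a pass rate of 0.0 when no artifacts were checked makes an empty verification run visible as a failure instead of a division error.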
- Perform RCA within 72 hours, publish action items, and track remediation in the incident ticket.
- Test scripts: scripts/recovery_test.sh
- Windows test script: scripts/recovery_test.ps1
Created/Updated by DR automation on branch dr/comprehensive-procedures.