feat(chaos): add ChaosMonkey fixture and pipeline resilience tests #271

Open: purvask2006-collab wants to merge 1 commit into sahoo-tech:main from purvask2006-collab:feat/chaos-pipeline-resilience


@purvask2006-collab


🔗 Related Issue

Closes #144

📝 Summary of Changes

Implements chaos engineering tests to verify that ExecraPipeline continues operating correctly when individual subsystems fail randomly.

Adds a ChaosMonkey fixture that injects three independent fault types:

| Subsystem | Fault | Probability |
|---|---|---|
| LLM client | TimeoutError on complete() | 20% |
| Redis | None cache miss on get() | 50% |
| Perception queue | Silent frame drop on put() | 30% |
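For readers who have not opened the diff, here is a minimal sketch of how a fault injector with these three behaviours could look. It is not the code in this PR: the method names (`llm_complete`, `redis_get`, `queue_put`), the `counts` dictionary, and the wrapper-style design are assumptions made for illustration. Only the fault types, probabilities, and seed come from the description.

```python
import random


class ChaosMonkey:
    """Hypothetical fault injector; method and attribute names are illustrative."""

    def __init__(self, seed=2026, llm_timeout_p=0.20, redis_miss_p=0.50, frame_drop_p=0.30):
        self._rng = random.Random(seed)  # private RNG so runs are repeatable
        self.llm_timeout_p = llm_timeout_p
        self.redis_miss_p = redis_miss_p
        self.frame_drop_p = frame_drop_p
        self.counts = {"llm_timeout": 0, "redis_miss": 0, "frame_drop": 0}

    def _roll(self, kind, probability):
        """Decide whether to inject a fault and count it if so."""
        hit = self._rng.random() < probability
        if hit:
            self.counts[kind] += 1
        return hit

    def llm_complete(self, real_complete, prompt):
        """Call the real completion function unless a timeout is injected."""
        if self._roll("llm_timeout", self.llm_timeout_p):
            raise TimeoutError("injected LLM timeout")
        return real_complete(prompt)

    def redis_get(self, real_get, key):
        """Return None (a cache miss) when the fault fires."""
        if self._roll("redis_miss", self.redis_miss_p):
            return None
        return real_get(key)

    def queue_put(self, real_put, frame):
        """Silently drop the frame when the fault fires; report whether it was queued."""
        if self._roll("frame_drop", self.frame_drop_p):
            return False
        real_put(frame)
        return True
```

Each fault draws from the same seeded generator, so the three fault streams are independent per call but the overall sequence is repeatable for a given seed.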

The pipeline runs for 60 seconds under all three faults simultaneously. On every fault it falls back gracefully — delivering degraded guidance (confidence < 0.5) rather than crashing. All faults are captured in a structured error_log.
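The fallback behaviour described above can be sketched roughly as follows. This is an illustrative assumption, not the ExecraPipeline implementation: the `Guidance` dataclass, the `guide_with_fallback` helper, the fallback text, and the specific confidence values (0.9 and 0.3) are hypothetical. The `kind == "llm_timeout"` log field and the `confidence < 0.5` threshold are taken from the PR description.

```python
from dataclasses import dataclass


@dataclass
class Guidance:
    text: str
    confidence: float


def guide_with_fallback(complete, prompt, error_log):
    """Return normal guidance, or degraded guidance if the LLM call times out."""
    try:
        return Guidance(text=complete(prompt), confidence=0.9)
    except TimeoutError as exc:
        # Record the fault in a structured log instead of letting it propagate.
        error_log.append({"kind": "llm_timeout", "detail": str(exc)})
        # Degraded guidance: low confidence signals the caller to treat it cautiously.
        return Guidance(text="Proceed with caution.", confidence=0.3)
```

Catching only `TimeoutError` keeps unexpected exceptions visible, which matters for a test whose pass condition is "no unhandled exceptions".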

Also adds docs/resilience.md documenting each failure mode, recovery strategy, and how to extend the suite.

🔍 Type of Change

  • 🧪 Test addition or improvement
  • 📖 Documentation update

🧪 How Was This Tested?

pytest tests/chaos/test_pipeline_resilience.py -v --timeout=120

15 tests covering:

  • 60s chaos run — no unhandled exceptions
  • Guidance delivered on ≥ 75% of ticks under combined chaos
  • Every injected TimeoutError logged with kind == "llm_timeout"
  • Statistical rate checks within ±15% of configured probabilities
  • Fallback guidance carries confidence < 0.5
  • Combined/cascading fault scenarios
  • ChaosMonkey counter accuracy
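As a rough illustration of the statistical rate check, the sketch below compares an observed fault rate against its configured probability. It assumes the ±15% tolerance means 15 absolute percentage points; the PR text does not say whether the bound is absolute or relative, and the helper names and trial count here are hypothetical.

```python
import random


def observed_rate(probability, trials, seed=2026):
    """Fraction of trials in which a fault with the given probability fires."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if rng.random() < probability)
    return hits / trials


def within_tolerance(observed, configured, tolerance=0.15):
    """True if the observed rate is within an absolute tolerance of the configured one."""
    return abs(observed - configured) <= tolerance


for configured in (0.20, 0.50, 0.30):
    rate = observed_rate(configured, trials=10_000)
    assert within_tolerance(rate, configured)
```

With a fixed seed the check is deterministic, so a failure would point to a changed fault path rather than to sampling noise.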

Environment: Windows 11 · Python 3.14

✅ Pre-Submission Checklist

  • Branch is up to date with upstream/main
  • Docstrings and type hints on all new public functions
  • Documentation updated (docs/resilience.md)
  • PR title follows Conventional Commits format
  • No .env files, secrets, or model weights committed
  • PR scope matches linked issue only

💬 Notes for Reviewer

ChaosMonkey uses a random.Random instance seeded with 2026, so the fault sequence is fully reproducible across CI runs.
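A quick way to see why a fixed seed gives reproducible faults: two generators built from the same seed produce the same decisions. The `fault_sequence` helper below is hypothetical and exists only to show the idea.

```python
import random


def fault_sequence(seed, probability, ticks):
    """List of booleans: True where a fault would be injected on that tick."""
    rng = random.Random(seed)  # the seed is the first positional argument
    return [rng.random() < probability for _ in range(ticks)]


first = fault_sequence(2026, 0.20, 100)
second = fault_sequence(2026, 0.20, 100)
assert first == second  # same seed, same fault pattern
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the sequence unaffected by other code that draws from the global generator.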

ExecraPipeline is a thin orchestration wrapper — zero changes to existing production code.

Requires pip install pytest-timeout for the 120s CI timeout guard.

- Add tests/chaos/test_pipeline_resilience.py (15 tests, 60s chaos run)
- Implement ChaosMonkey: LLM 20% timeout, Redis 50% miss, 30% frame drop
- ExecraPipeline with graceful fallback on every fault type
- Add docs/resilience.md documenting failure modes and recovery strategies