Description
Overview
This is a tracking issue for memory optimization work that must be completed before Phase 3.
Users are experiencing OOM errors when running high-load tests. Analysis shows that HDR histograms for percentile tracking are the primary memory consumer, using 2-4MB per unique scenario/step label with no upper bounds.
Example OOM scenario:
NUM_CONCURRENT_TASKS=5000
TARGET_RPS=50000
TEST_DURATION=24h
# Result: OOM killed with 4GB RAM
Root Causes
- Unbounded histograms: 2-4MB per unique label, no limit
- No rotation: Data accumulates for entire test duration
- 100% tracking: Every request recorded at high RPS
- No visibility: Users can't see memory usage until OOM
See MEMORY_OPTIMIZATION.md for detailed analysis.
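As a rough illustration of why unbounded labels matter, the arithmetic behind the 2-4MB-per-label figure can be sketched as follows. The function names are hypothetical and the estimate counts histogram memory only, not the rest of the process.

```rust
/// Per-label histogram cost range quoted in this issue, in MB.
const MIN_MB_PER_LABEL: u64 = 2;
const MAX_MB_PER_LABEL: u64 = 4;

/// Returns the (low, high) histogram memory estimate in MB
/// for a given number of unique scenario/step labels.
pub fn histogram_memory_mb(labels: u64) -> (u64, u64) {
    (labels * MIN_MB_PER_LABEL, labels * MAX_MB_PER_LABEL)
}

/// Number of labels that fit in `budget_mb` even at the worst-case
/// per-label cost. Everything else the process allocates is ignored.
pub fn max_labels_for_budget(budget_mb: u64) -> u64 {
    budget_mb / MAX_MB_PER_LABEL
}
```

Under these assumptions, about a thousand unique labels would be enough to consume a 4GB budget on histograms alone.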
Sub-Issues
Critical (Must complete before Phase 3)
- #66 - Add PERCENTILE_TRACKING_ENABLED configuration flag
- Priority: P0-Critical
- Effort: Medium
- Allows disabling histograms for stress tests
- #68 - Limit maximum unique histogram labels
- Priority: P0-Critical
- Effort: Medium
- Prevents unbounded memory growth
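The two critical items could fit together roughly as sketched below. This is a minimal sketch, not the project's implementation: `Histogram` is a stand-in for a real HDR histogram, `HistogramRegistry` and `parse_flag` are hypothetical names, and the accepted flag spellings are an assumption. Note that it routes samples for labels beyond the cap into a shared overflow bucket rather than using LRU eviction as the proposed `lru` dependency suggests, purely to keep the example dependency-free.

```rust
use std::collections::HashMap;

/// Stand-in for a real HDR histogram; tracks only count and max.
#[derive(Default)]
pub struct Histogram {
    count: u64,
    max: u64,
}

impl Histogram {
    pub fn record(&mut self, value: u64) {
        self.count += 1;
        self.max = self.max.max(value);
    }
}

/// Parses a boolean flag value such as PERCENTILE_TRACKING_ENABLED.
/// Unset or unrecognized values fall back to `default` (assumed behavior).
pub fn parse_flag(raw: Option<&str>, default: bool) -> bool {
    match raw.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
        Some("1") | Some("true") | Some("yes") | Some("on") => true,
        Some("0") | Some("false") | Some("no") | Some("off") => false,
        _ => default,
    }
}

/// Histogram store with an on/off switch and a hard cap on unique labels.
pub struct HistogramRegistry {
    enabled: bool,
    max_labels: usize,
    by_label: HashMap<String, Histogram>,
    overflow: Histogram,
}

impl HistogramRegistry {
    pub fn new(enabled: bool, max_labels: usize) -> Self {
        Self { enabled, max_labels, by_label: HashMap::new(), overflow: Histogram::default() }
    }

    pub fn record(&mut self, label: &str, value: u64) {
        if !self.enabled {
            return; // tracking disabled: allocate nothing
        }
        if self.by_label.contains_key(label) || self.by_label.len() < self.max_labels {
            self.by_label.entry(label.to_string()).or_default().record(value);
        } else {
            // Cap reached: fold new labels into one shared bucket.
            self.overflow.record(value);
        }
    }

    pub fn label_count(&self) -> usize {
        self.by_label.len()
    }

    pub fn overflow_count(&self) -> u64 {
        self.overflow.count
    }
}
```

In a real binary the flag would presumably be read with something like `parse_flag(std::env::var("PERCENTILE_TRACKING_ENABLED").ok().as_deref(), true)`; whether the default should be `true` is an open design choice.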
High Priority (Should complete before Phase 3)
- #67 - Implement periodic histogram reset/rotation
- Priority: P1-High
- Effort: Large
- Enables 24h+ tests without OOM
- #69 - Add process memory usage metrics
- Priority: P1-High
- Effort: Medium
- Provides visibility and early warning
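One possible shape for the reset/rotation item is a time-windowed recorder that discards its data once the window expires. The sketch below is an assumption about how that could look, not a design decision from this issue: `RotatingWindow` is a hypothetical name, a `Vec<u64>` stands in for the histogram, and the clock is passed in explicitly so the behavior can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

/// Holds samples for the current window only; older data is dropped
/// when the window expires, so memory does not grow with test duration.
pub struct RotatingWindow {
    interval: Duration,
    window_start: Instant,
    samples: Vec<u64>, // stand-in for a real histogram
    rotations: u64,
}

impl RotatingWindow {
    pub fn new(interval: Duration, now: Instant) -> Self {
        Self { interval, window_start: now, samples: Vec::new(), rotations: 0 }
    }

    /// Records a value at time `now`, rotating first if the window expired.
    pub fn record_at(&mut self, value: u64, now: Instant) {
        if now.duration_since(self.window_start) >= self.interval {
            self.samples.clear();
            self.samples.shrink_to_fit(); // hand the buffer back to the allocator
            self.window_start = now;
            self.rotations += 1;
        }
        self.samples.push(value);
    }

    pub fn len(&self) -> usize {
        self.samples.len()
    }

    pub fn rotations(&self) -> u64 {
        self.rotations
    }
}
```

A real implementation would likely export or snapshot the percentiles before clearing; this sketch simply drops them.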
Medium Priority (Nice to have)
- #70 - Add percentile sampling for high-RPS tests
- Priority: P2-Medium
- Effort: Medium
- Reduces memory and improves performance at 50k+ RPS
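The sampling idea can be illustrated with a simple 1-in-N counter. This is systematic sampling chosen for brevity, not reservoir sampling and not necessarily the approach #70 will take; `Sampler` and its methods are hypothetical names.

```rust
/// Keeps one out of every `every` requests for percentile tracking.
pub struct Sampler {
    every: u64,
    seen: u64,
}

impl Sampler {
    /// `every` is clamped to at least 1, so 0 means "record everything".
    pub fn new(every: u64) -> Self {
        Self { every: every.max(1), seen: 0 }
    }

    /// Returns true when the current request should be recorded.
    pub fn should_record(&mut self) -> bool {
        let keep = self.seen % self.every == 0;
        self.seen += 1;
        keep
    }
}
```

With `Sampler::new(10)`, a 50k RPS test would record roughly 5k latencies per second. A fixed stride can alias with periodic traffic patterns, which is one reason a randomized scheme may be preferable in practice.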
Acceptance Criteria
Before moving to Phase 3, we must be able to:
- ✅ Run 24h test with 500 concurrent tasks without OOM
- ✅ Support 50k RPS on 8GB RAM
- ✅ Provide memory usage visibility via Prometheus metrics
- ✅ Bound maximum memory usage (no unbounded growth)
- ✅ Document safe configurations for different RAM sizes
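For the memory visibility criterion, a dependency-free sketch of reading resident memory on Linux and rendering it in the Prometheus text format might look like this. It parses `/proc/self/status` by hand instead of using the proposed `procfs` crate, the function names are hypothetical, and it returns `None` on platforms without `/proc`.

```rust
/// Extracts the VmRSS value (in kB) from the text of /proc/<pid>/status.
pub fn parse_vm_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|n| n.parse().ok())
}

/// Reads the current process RSS in kB; None if /proc is unavailable.
pub fn current_rss_kb() -> Option<u64> {
    std::fs::read_to_string("/proc/self/status")
        .ok()
        .as_deref()
        .and_then(parse_vm_rss_kb)
}

/// Renders the value as a gauge in the Prometheus text exposition format.
pub fn render_rss_metric(rss_kb: u64) -> String {
    format!(
        "# TYPE process_resident_memory_bytes gauge\nprocess_resident_memory_bytes {}\n",
        rss_kb * 1024
    )
}
```

The metric name follows the common Prometheus process-metrics convention; the name the project actually exports is not specified in this issue.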
Memory Targets
After optimization:
| RAM | Max Tasks | Max RPS | Max Duration | Status |
|---|---|---|---|---|
| 512MB | 10 | 500 | 5m | ✅ Works today |
| 2GB | 100 | 5,000 | 30m | ✅ Works today |
| 4GB | 500 | 10,000 | 1h | |
| 4GB | 200 | 5,000 | 24h | |
| 8GB | 1,000 | 25,000 | 2h | |
| 8GB | 500 | 10,000 | 24h | |
Documentation
New files created:
- ✅ MEMORY_OPTIMIZATION.md - Detailed analysis and recommendations
- ✅ docker-compose.loadtest-examples.yml - Pre-configured test scenarios
- ✅ LOAD_TEST_SCENARIOS.md - Updated with memory warnings
Files to update:
- README.md - Add memory configuration section
- DOCKER.md - Add memory optimization examples
- .vscode/grafana-dashboard.json - Add memory panels
Testing Strategy
- Unit tests: Each optimization should have unit tests
- Integration tests: Memory stress tests in CI
- Validation tests: Before/after memory comparisons
- Long-running tests: 24h soak test to verify no leaks
Dependencies
Add to Cargo.toml:
[dependencies]
lru = "0.12" # For #68 (label limiting)
procfs = "0.16" # For #69 (memory metrics, Linux only)
Timeline
Suggested order:
- Week 1: Issues #66 + #68 (Critical path)
- Week 2: Issue #69 (Visibility)
- Week 3: Issue #67 (Long-duration support)
- Week 4: Issue #70 (Performance optimization)
Success Metrics
Track these metrics before/after:
- Maximum memory usage (RSS)
- OOM error rate
- Maximum test duration achieved
- Maximum RPS achieved
- Number of histograms created
Related Work
- Current blocker: Users hitting OOM with aggressive configs
- Enables: Phase 3 web app with real-world scale testing
- Future: Consider moving to reservoir sampling (#70)
Questions / Decisions
- Should sampling (#70) be auto-enabled at high RPS?
- What should the default label limit be (100 or 200)?
- Should we add memory alerts to Grafana dashboard?
- Do we need memory profiling in dev mode?
Note: This work is critical for Phase 3 success. The web app will need to handle production-scale tests, and current memory limitations would block that.