[Meta] Memory optimization for high-load scenarios (Phase 2.5) #71

## Overview

This is a **tracking issue** for memory optimization work that must be completed **before Phase 3**.

Users are hitting OOM errors when running high-load tests. Analysis shows that the HDR histograms used for percentile tracking are the primary memory consumer: each unique scenario/step label allocates a 2-4MB histogram, and the number of labels has no upper bound.

**Example OOM scenario:**
```bash
NUM_CONCURRENT_TASKS=5000
TARGET_RPS=50000
TEST_DURATION=24h
# Result: OOM killed with 4GB RAM
```

## Root Causes

1. **Unbounded histograms**: 2-4MB per unique label, no limit
2. **No rotation**: Data accumulates for entire test duration
3. **100% tracking**: Every request recorded at high RPS
4. **No visibility**: Users can't see memory usage until OOM

See `MEMORY_OPTIMIZATION.md` for detailed analysis.
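
To make the first root cause concrete, the sketch below multiplies a label count by the 2-4MB per-histogram range quoted above. The helper name and the label counts are hypothetical, and the estimate ignores every other allocation in the process.

```rust
/// Per-histogram footprint range reported in this issue, in MB.
const HISTOGRAM_MB_LOW: u64 = 2;
const HISTOGRAM_MB_HIGH: u64 = 4;

/// Returns the (low, high) histogram memory estimate in MB for
/// `unique_labels` scenario/step labels.
/// Hypothetical helper for illustration only.
fn estimate_histogram_mb(unique_labels: u64) -> (u64, u64) {
    (
        unique_labels * HISTOGRAM_MB_LOW,
        unique_labels * HISTOGRAM_MB_HIGH,
    )
}
```

Under these assumptions, `estimate_histogram_mb(1000)` gives 2,000 to 4,000 MB, which by itself is enough to exhaust a 4GB host.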

## Sub-Issues

### Critical (Must complete before Phase 3)

- [ ] #66 - Add `PERCENTILE_TRACKING_ENABLED` configuration flag
  - Priority: P0-Critical
  - Effort: Medium
  - Allows disabling histograms for stress tests

- [ ] #68 - Limit maximum unique histogram labels
  - Priority: P0-Critical
  - Effort: Medium
  - Prevents unbounded memory growth
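
A minimal, std-only sketch of how the two critical items could fit together. Only the `PERCENTILE_TRACKING_ENABLED` name comes from this issue; the accepted flag values, the `Histogram` stand-in, and the `BoundedHistograms` type are assumptions for illustration. The `lru` crate proposed for #68 would replace the hand-rolled eviction shown here.

```rust
use std::collections::HashMap;

/// Parses the `PERCENTILE_TRACKING_ENABLED` value (#66).
/// Assumption: unset means enabled; "false", "0", "off", "no" disable it.
fn percentile_tracking_enabled(raw: Option<&str>) -> bool {
    match raw.map(|v| v.trim().to_ascii_lowercase()) {
        Some(v) => !matches!(v.as_str(), "false" | "0" | "off" | "no"),
        None => true,
    }
}

/// Stand-in for a real latency histogram; it only counts samples.
#[derive(Default)]
struct Histogram {
    samples: u64,
}

/// Label-to-histogram map capped at `max_labels` entries (#68).
/// When full, the least recently used label is evicted.
struct BoundedHistograms {
    max_labels: usize,
    clock: u64,
    // label -> (logical time of last use, histogram)
    entries: HashMap<String, (u64, Histogram)>,
}

impl BoundedHistograms {
    fn new(max_labels: usize) -> Self {
        assert!(max_labels > 0, "max_labels must be positive");
        Self { max_labels, clock: 0, entries: HashMap::new() }
    }

    /// Records one sample for `label`, evicting the oldest label if needed.
    fn record(&mut self, label: &str) {
        self.clock += 1;
        if !self.entries.contains_key(label) && self.entries.len() >= self.max_labels {
            // O(n) scan keeps the sketch short; a real LRU does this in O(1).
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, (last_used, _))| *last_used)
                .map(|(k, _)| k.clone());
            if let Some(k) = oldest {
                self.entries.remove(&k);
            }
        }
        let entry = self.entries.entry(label.to_string()).or_default();
        entry.0 = self.clock;
        entry.1.samples += 1;
    }

    fn len(&self) -> usize {
        self.entries.len()
    }

    fn contains(&self, label: &str) -> bool {
        self.entries.contains_key(label)
    }
}
```

Eviction discards the evicted label's data. Folding overflow labels into a single shared bucket is another option and may be preferable if losing a label's percentiles mid-test is unacceptable.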

### High Priority (Should complete before Phase 3)

- [ ] #67 - Implement periodic histogram reset/rotation
  - Priority: P1-High
  - Effort: Large
  - Enables 24h+ tests without OOM

- [ ] #69 - Add process memory usage metrics
  - Priority: P1-High
  - Effort: Medium
  - Provides visibility and early warning
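
One possible shape for the rotation in #67 is a wrapper that clears its histogram once a fixed window has elapsed. Everything below is a hypothetical sketch: the type names, the counting stand-in for the histogram, and the choice to pass `now` explicitly (so the logic can be exercised without sleeping) are assumptions, not the project's design.

```rust
use std::time::{Duration, Instant};

/// Stand-in for a real latency histogram; tracks count and maximum only.
#[derive(Default)]
struct Histogram {
    count: u64,
    max_us: u64,
}

/// Wraps a histogram and resets it once `interval` has elapsed.
struct RotatingHistogram {
    interval: Duration,
    window_start: Instant,
    current: Histogram,
    rotations: u64,
}

impl RotatingHistogram {
    fn new(interval: Duration, now: Instant) -> Self {
        Self {
            interval,
            window_start: now,
            current: Histogram::default(),
            rotations: 0,
        }
    }

    /// Records a latency sample, rotating first if the window has expired.
    fn record(&mut self, latency_us: u64, now: Instant) {
        if now.duration_since(self.window_start) >= self.interval {
            // A real implementation would likely export a summary of the
            // finished window here before dropping it.
            self.current = Histogram::default();
            self.window_start = now;
            self.rotations += 1;
        }
        self.current.count += 1;
        self.current.max_us = self.current.max_us.max(latency_us);
    }
}
```

With this approach, memory held per label stays constant over a 24h run, at the cost of percentiles describing only the current window.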

### Medium Priority (Nice to have)

- [ ] #70 - Add percentile sampling for high-RPS tests
  - Priority: P2-Medium
  - Effort: Medium
  - Reduces memory and improves performance at 50k+ RPS
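
As an illustration of #70, the sketch below uses deterministic 1-in-N sampling, which is a simpler technique than reservoir sampling and is shown only as one option. The `Sampler` type and its method names are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Deterministic 1-in-N sampler: every Nth request is recorded.
struct Sampler {
    every_n: u64,
    counter: AtomicU64,
}

impl Sampler {
    fn new(every_n: u64) -> Self {
        assert!(every_n > 0, "every_n must be positive");
        Self { every_n, counter: AtomicU64::new(0) }
    }

    /// Returns true when the current request should be recorded.
    /// Safe to call from many tasks; the counter is a relaxed atomic.
    fn should_record(&self) -> bool {
        self.counter.fetch_add(1, Ordering::Relaxed) % self.every_n == 0
    }
}
```

A caller would guard the histogram update with `if sampler.should_record() { ... }`. A sampler built with `Sampler::new(10)` lets through 100 of every 1,000 requests, so the number of recorded samples drops by 90%.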

## Acceptance Criteria

Before moving to Phase 3, we must be able to:

1. Run a 24h test with 500 concurrent tasks without OOM
2. Support 50k RPS on 8GB RAM
3. Provide memory usage visibility via Prometheus metrics
4. Bound maximum memory usage (no unbounded growth)
5. Document safe configurations for different RAM sizes

## Memory Targets

Target configurations after optimization (the first two rows already work today):

| RAM   | Max Tasks | Max RPS | Max Duration | Status         |
|-------|-----------|---------|--------------|----------------|
| 512MB | 10        | 500     | 5m           | ✅ Works today |
| 2GB   | 100       | 5,000   | 30m          | ✅ Works today |
| 4GB   | 500       | 10,000  | 1h           | ⚠️ Needs #66+#68 |
| 4GB   | 200       | 5,000   | 24h          | ⚠️ Needs #67 |
| 8GB   | 1,000     | 25,000  | 2h           | ⚠️ Needs #66+#68+#67 |
| 8GB   | 500       | 10,000  | 24h          | ⚠️ Needs all |

## Documentation

New and updated files:
- ✅ `MEMORY_OPTIMIZATION.md` - Detailed analysis and recommendations
- ✅ `docker-compose.loadtest-examples.yml` - Pre-configured test scenarios
- ✅ `LOAD_TEST_SCENARIOS.md` - Updated with memory warnings

Files to update:
- [ ] `README.md` - Add memory configuration section
- [ ] `DOCKER.md` - Add memory optimization examples
- [ ] `.vscode/grafana-dashboard.json` - Add memory panels

## Testing Strategy

1. **Unit tests**: Each optimization should have unit tests
2. **Integration tests**: Memory stress tests in CI
3. **Validation tests**: Before/after memory comparisons
4. **Long-running tests**: 24h soak test to verify no leaks

## Dependencies

Add to `Cargo.toml`:
```toml
[dependencies]
lru = "0.12"          # For #68 (label limiting)
procfs = "0.16"       # For #69 (memory metrics, Linux only)
```
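
For context on #69, this std-only sketch shows the kind of data the `procfs` crate would expose: the resident set size from `/proc/self/status` on Linux. The function names are hypothetical, and wiring the value into a Prometheus gauge is left out.

```rust
use std::fs;

/// Extracts the `VmRSS` value in kB from `/proc/<pid>/status` text.
fn parse_vm_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|kb| kb.parse().ok())
}

/// Reads the current process RSS in kB.
/// Returns None on non-Linux systems or if the file cannot be read.
fn current_rss_kb() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    parse_vm_rss_kb(&status)
}
```

Keeping the parser separate from the file read makes it testable on any platform, while `current_rss_kb` degrades to `None` where `/proc` is unavailable.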

## Timeline

Suggested order:

1. **Week 1**: Issues #66 + #68 (Critical path)
2. **Week 2**: Issue #69 (Visibility)
3. **Week 3**: Issue #67 (Long-duration support)
4. **Week 4**: Issue #70 (Performance optimization)

## Success Metrics

Track these metrics before/after:
- Maximum memory usage (RSS)
- OOM error rate
- Maximum test duration achieved
- Maximum RPS achieved
- Number of histograms created

## Related Work

- Current blocker: Users hitting OOM with aggressive configs
- Enables: Phase 3 web app with real-world scale testing
- Future: Consider moving to reservoir sampling (#70)

## Questions / Decisions

- [ ] Should sampling (#70) be auto-enabled at high RPS?
- [ ] What should the default label limit be (100? 200?)
- [ ] Should we add memory alerts to Grafana dashboard?
- [ ] Do we need memory profiling in dev mode?

---

**Note**: This work is critical for Phase 3 success. The web app will need to handle production-scale tests, and current memory limitations would block that.
