[Meta] Memory optimization for high-load scenarios (Phase 2.5) #71

@cbaugus

Overview

This is a tracking issue for memory optimization work that must be completed before Phase 3.

Users are hitting OOM errors when running high-load tests. Analysis shows that the HDR histograms used for percentile tracking are the primary memory consumer: each unique scenario/step label allocates a histogram of 2-4MB, and there is no upper bound on the number of labels.

Example OOM scenario:

NUM_CONCURRENT_TASKS=5000
TARGET_RPS=50000  
TEST_DURATION=24h
# Result: OOM killed with 4GB RAM
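
As a rough sanity check on the figures above, the histogram footprint can be estimated by multiplying the number of unique labels by the 2-4MB per-label cost. The sketch below assumes that cost is constant per label; the function names are illustrative and not part of the codebase.

```rust
/// Per-histogram footprint reported in this issue: roughly 2-4 MB per unique label.
const MB_PER_HISTOGRAM_LOW: u64 = 2;
const MB_PER_HISTOGRAM_HIGH: u64 = 4;

/// Estimated (low, high) histogram memory in MB for a number of unique labels.
/// Assumes every label costs the same, which is a simplification.
fn histogram_memory_mb(unique_labels: u64) -> (u64, u64) {
    (
        unique_labels * MB_PER_HISTOGRAM_LOW,
        unique_labels * MB_PER_HISTOGRAM_HIGH,
    )
}

/// Number of unique labels at which the high estimate reaches the RAM budget.
fn labels_until_budget(ram_budget_mb: u64) -> u64 {
    ram_budget_mb / MB_PER_HISTOGRAM_HIGH
}
```

Under these assumptions, `labels_until_budget(4096)` gives 1,024 labels for a 4GB budget, before counting any other allocations made by the test.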

Root Causes

  1. Unbounded histograms: 2-4MB per unique label, with no limit on the number of labels
  2. No rotation: Data accumulates for the entire test duration
  3. 100% tracking: Every request is recorded, even at high RPS
  4. No visibility: Users can't see memory usage until the process is OOM-killed
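
One possible way to address root cause 3 is to record only a fraction of requests in the percentile histograms. The following is a minimal sketch of a deterministic 1-in-N sampler using only the standard library; the `Sampler` type and its methods are hypothetical and do not describe the existing code.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical 1-in-N sampler that can be shared across concurrent tasks.
pub struct Sampler {
    every_n: u64,
    counter: AtomicU64,
}

impl Sampler {
    /// `every_n == 1` records everything; 0 is treated as 1.
    pub fn new(every_n: u64) -> Self {
        Sampler {
            every_n: every_n.max(1),
            counter: AtomicU64::new(0),
        }
    }

    /// Returns true for the 1st, (N+1)th, (2N+1)th, ... call.
    pub fn should_record(&self) -> bool {
        self.counter.fetch_add(1, Ordering::Relaxed) % self.every_n == 0
    }
}
```

A caller would check `sampler.should_record()` before recording a latency value. Note that sampling reduces recording work per request, but it does not by itself bound the number of histograms.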

See MEMORY_OPTIMIZATION.md for detailed analysis.

Sub-Issues

Critical (Must complete before Phase 3)

High Priority (Should complete before Phase 3)

Medium Priority (Nice to have)

Acceptance Criteria

Before moving to Phase 3, we must be able to:

  1. ✅ Run 24h test with 500 concurrent tasks without OOM
  2. ✅ Support 50k RPS on 8GB RAM
  3. ✅ Provide memory usage visibility via Prometheus metrics
  4. ✅ Bound maximum memory usage (no unbounded growth)
  5. ✅ Document safe configurations for different RAM sizes
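
For criterion 3, the exported metric needs a source for the process's resident memory. A minimal sketch, assuming Linux and using only the standard library, is to read the `VmRSS` field from `/proc/self/status`. The function names are illustrative, and wiring the value into a Prometheus gauge is left out.

```rust
use std::fs;

/// Extracts the VmRSS value (in kB) from the text of /proc/<pid>/status.
fn parse_vm_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|value| value.parse::<u64>().ok())
}

/// Resident set size of the current process in bytes.
/// Returns None on non-Linux systems or if the file cannot be read.
fn current_rss_bytes() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    parse_vm_rss_kb(&status).map(|kb| kb * 1024)
}
```

Keeping the parser separate from the file read makes it testable on any platform.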

Memory Targets

After optimization:

| RAM | Max Tasks | Max RPS | Max Duration | Status |
|---|---|---|---|---|
| 512MB | 10 | 500 | 5m | ✅ Works today |
| 2GB | 100 | 5,000 | 30m | ✅ Works today |
| 4GB | 500 | 10,000 | 1h | ⚠️ Needs #66+#68 |
| 4GB | 200 | 5,000 | 24h | ⚠️ Needs #67 |
| 8GB | 1,000 | 25,000 | 2h | ⚠️ Needs #66+#68+#67 |
| 8GB | 500 | 10,000 | 24h | ⚠️ Needs all |

Documentation

New files created:

  • MEMORY_OPTIMIZATION.md - Detailed analysis and recommendations
  • docker-compose.loadtest-examples.yml - Pre-configured test scenarios
  • LOAD_TEST_SCENARIOS.md - Updated with memory warnings

Files to update:

  • README.md - Add memory configuration section
  • DOCKER.md - Add memory optimization examples
  • .vscode/grafana-dashboard.json - Add memory panels

Testing Strategy

  1. Unit tests: Each optimization should have unit tests
  2. Integration tests: Memory stress tests in CI
  3. Validation tests: Before/after memory comparisons
  4. Long-running tests: 24h soak test to verify no leaks

Dependencies

Add to Cargo.toml:

[dependencies]
lru = "0.12"          # For #68 (label limiting)
procfs = "0.16"       # For #69 (memory metrics, Linux only)
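
To illustrate the label limiting that the `lru` dependency is intended for, here is a std-only stand-in that caps the number of tracked labels and evicts the least recently used one. `BoundedLabels` is a hypothetical type for illustration; a real implementation would presumably use the `lru` crate's cache and store histograms rather than a generic value.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical bounded registry: keeps at most `max_labels` entries and
/// evicts the least recently used label when full.
pub struct BoundedLabels<V> {
    max_labels: usize,
    entries: HashMap<String, V>,
    recency: VecDeque<String>, // front = least recently used
}

impl<V: Default> BoundedLabels<V> {
    pub fn new(max_labels: usize) -> Self {
        BoundedLabels {
            max_labels: max_labels.max(1),
            entries: HashMap::new(),
            recency: VecDeque::new(),
        }
    }

    /// Returns the value for `label`, creating it when absent and evicting
    /// the least recently used label first if the registry is full.
    pub fn get_or_insert(&mut self, label: &str) -> &mut V {
        if self.entries.contains_key(label) {
            // Linear scan to refresh recency; acceptable for a sketch.
            if let Some(pos) = self.recency.iter().position(|l| l.as_str() == label) {
                self.recency.remove(pos);
            }
        } else if self.entries.len() >= self.max_labels {
            if let Some(oldest) = self.recency.pop_front() {
                self.entries.remove(&oldest);
            }
        }
        self.recency.push_back(label.to_string());
        self.entries.entry(label.to_string()).or_default()
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    pub fn contains(&self, label: &str) -> bool {
        self.entries.contains_key(label)
    }
}
```

With a cap of N labels and 2-4MB per histogram, the histogram footprint stays within roughly N × 4MB regardless of how many distinct labels a test produces. The trade-off is that data for evicted labels is lost.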

Timeline

Suggested order:

  1. Week 1: #66 (Add PERCENTILE_TRACKING_ENABLED configuration flag) and #68 (Limit maximum unique histogram labels). Critical path.
  2. Week 2: #69 (Add process memory usage metrics). Visibility.
  3. Week 3: #67 (Implement periodic histogram reset/rotation). Long-duration support.
  4. Week 4: #70 (Add percentile sampling for high-RPS tests). Performance optimization.
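
The first and third steps could be sketched roughly as follows, using only the standard library. Everything here is an assumption made for illustration: the accepted spellings of the flag, the default of `true`, and the `RotatingSamples` type, which uses a `Vec<u64>` as a stand-in for a real histogram.

```rust
use std::time::{Duration, Instant};

/// Interprets a flag value such as PERCENTILE_TRACKING_ENABLED.
/// The accepted spellings are an assumption, not a documented contract.
fn parse_flag(raw: Option<&str>, default: bool) -> bool {
    match raw.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
        Some("1" | "true" | "yes" | "on") => true,
        Some("0" | "false" | "no" | "off") => false,
        _ => default,
    }
}

/// Reads the flag from the environment; defaults to enabled (assumed).
fn percentile_tracking_enabled() -> bool {
    parse_flag(
        std::env::var("PERCENTILE_TRACKING_ENABLED").ok().as_deref(),
        true,
    )
}

/// Hypothetical rotating recorder: discards its data once per window.
pub struct RotatingSamples {
    window: Duration,
    window_start: Instant,
    samples: Vec<u64>, // stand-in for a real histogram
}

impl RotatingSamples {
    pub fn new(window: Duration, now: Instant) -> Self {
        RotatingSamples { window, window_start: now, samples: Vec::new() }
    }

    /// Records a value, starting a fresh window first if the current one expired.
    /// `now` is passed in so the behaviour can be tested without sleeping.
    pub fn record(&mut self, value: u64, now: Instant) {
        if now.duration_since(self.window_start) >= self.window {
            // Replace the buffer so the old allocation is released.
            self.samples = Vec::new();
            self.window_start = now;
        }
        self.samples.push(value);
    }

    pub fn len(&self) -> usize {
        self.samples.len()
    }
}
```

Rotation bounds how much data a single label accumulates over a long test, at the cost of percentiles reflecting only the current window.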

Success Metrics

Track these metrics before/after:

  • Maximum memory usage (RSS)
  • OOM error rate
  • Maximum test duration achieved
  • Maximum RPS achieved
  • Number of histograms created

Related Work

Questions / Decisions


Note: This work is critical for Phase 3 success. The web app will need to handle production-scale tests, and current memory limitations would block that.
