Skip to content

HIGH: Inefficient Memory Allocation Strategy - Multiple Small Allocations #3

@juliensimon

Description

@juliensimon

Issue Description

Performance issue: Memory allocation strategy uses multiple small allocations instead of efficient bulk allocation, leading to significant performance degradation and memory fragmentation.

Location

File: common/memory_utils.cpp
Lines: 67-89 (memory allocation functions)
Function: calculate_buffer_size() and related allocation logic

Performance Impact

  • Severity: HIGH (P1)
  • Impact: Significant performance degradation in memory-intensive operations
  • Affected Operations: All memory benchmark tests
  • Scalability Issue: Performance degrades with larger working sets

Technical Details

Current Implementation Issue

The current allocation strategy in memory_utils.cpp lines 67-89:

size_t calculate_buffer_size(size_t total_size,
                            size_t num_buffers, 
                            size_t cache_line_size) {
    if (total_size == 0 || num_buffers == 0) {
        return 0;
    }
    
    size_t buffer_size = total_size / num_buffers;
    // This leads to multiple small allocations instead of bulk allocation
    // Missing optimization for bulk memory operations
    
    if (buffer_size < MIN_BUFFER_SIZE || buffer_size < cache_line_size) {
        return 0;
    }
    
    return buffer_size;
}

Performance Problems

  1. Memory fragmentation from multiple small allocations
  2. Increased allocation overhead - each malloc() call has overhead
  3. Cache inefficiency - scattered memory locations
  4. Memory allocator stress - frequent alloc/free cycles
  5. Poor NUMA locality on multi-socket systems

Expected Behavior

  • Use bulk memory allocation for better performance
  • Minimize memory fragmentation
  • Optimize for cache locality
  • Reduce allocation/deallocation overhead
  • Scale efficiently with large working sets

Actual Behavior

  • Multiple small allocations create fragmentation
  • Poor cache performance due to scattered memory
  • High allocation overhead impacts benchmark accuracy
  • Performance degrades significantly with larger working sets

Steps to Reproduce

  1. Run memory benchmarks with large working sets (>1GB)
  2. Monitor memory allocation patterns with tools like:
    # Profile memory allocations
    valgrind --tool=massif ./memory_bandwidth --size 2048
    
    # Check fragmentation
    malloc_stats() output analysis
  3. Compare performance with bulk allocation approach
  4. Observe significant performance difference

Performance Analysis

# Current approach benchmark
./memory_bandwidth --pattern sequential --size 1024 --threads 8

# Expected improvement with bulk allocation: 15-30% faster
# Memory fragmentation reduction: 60-80%
# Cache miss reduction: 20-40%

Suggested Solution

// Enhanced bulk allocation strategy
class BulkMemoryAllocator {
private:
    std::unique_ptr<uint8_t[]> bulk_memory_;
    size_t total_size_;
    size_t allocated_size_;
    
public:
    BulkMemoryAllocator(size_t total_size, size_t alignment) 
        : total_size_(total_size), allocated_size_(0) {
        
        // Allocate one large contiguous block
        size_t aligned_size = align_up(total_size, alignment);
        bulk_memory_ = std::make_unique<uint8_t[]>(aligned_size);
        
        // Ensure the bulk allocation is properly aligned
        if (!is_aligned(bulk_memory_.get(), alignment)) {
            throw MemoryError("Failed to create aligned bulk allocation");
        }
    }
    
    uint8_t* allocate_chunk(size_t size, size_t alignment) {
        size_t aligned_size = align_up(size, alignment);
        
        if (allocated_size_ + aligned_size > total_size_) {
            return nullptr; // Out of bulk memory
        }
        
        uint8_t* chunk = bulk_memory_.get() + allocated_size_;
        allocated_size_ += aligned_size;
        
        return chunk;
    }
};

// Optimized buffer allocation
std::vector<AlignedBuffer> allocate_test_buffers(
    size_t total_size, size_t num_buffers, size_t alignment) {
    
    // Single bulk allocation
    BulkMemoryAllocator allocator(total_size, alignment);
    
    std::vector<AlignedBuffer> buffers;
    buffers.reserve(num_buffers);
    
    size_t buffer_size = total_size / num_buffers;
    
    for (size_t i = 0; i < num_buffers; ++i) {
        uint8_t* chunk = allocator.allocate_chunk(buffer_size, alignment);
        if (chunk) {
            buffers.emplace_back(chunk, buffer_size, alignment);
        }
    }
    
    return buffers;
}

Acceptance Criteria

  • Implement bulk memory allocation strategy
  • Reduce memory allocations by 80%+ for large working sets
  • Improve benchmark performance by 15-30%
  • Maintain cache line alignment requirements
  • Add memory allocation profiling tests
  • Validate NUMA-aware allocation on multi-socket systems
  • Benchmark against current implementation

Performance Testing Requirements

  • Memory allocation profiling with Valgrind/Massif
  • Cache performance analysis with perf
  • NUMA locality testing on multi-socket systems
  • Memory fragmentation analysis
  • Scalability testing with various working set sizes

Benchmarking Commands

# Profile current implementation
perf record -g ./memory_bandwidth --cache-hierarchy
perf report

# Memory allocation tracking
valgrind --tool=massif --massif-out-file=massif.out ./memory_bandwidth
ms_print massif.out

# Cache analysis
perf stat -e cache-misses,cache-references ./memory_bandwidth

Expected Performance Improvements

  • Allocation time: 60-80% reduction
  • Memory fragmentation: 70-90% reduction
  • Cache performance: 20-40% improvement
  • Overall benchmark speed: 15-30% faster
  • Memory efficiency: 10-20% better utilization

References

  • "What Every Programmer Should Know About Memory" by Ulrich Drepper
  • Intel Optimization Manual: Memory Management
  • NUMA-aware allocation best practices

Priority: HIGH - Significant performance impact
Estimated Effort: 3-5 days
Performance Review Required: Yes

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingneeds-investigationRequires investigation before implementationperformancePerformance improvement or issuepriority-highHigh priority - resolve soon

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions