Skip to content

HIGH: Cache Inefficient Memory Access Patterns in Benchmark Runner #5

@juliensimon

Description

@juliensimon

Issue Description

Performance issue: Non-sequential memory access patterns in benchmark runner causing excessive cache misses and poor memory bandwidth utilization.

Location

File: src/benchmark_runner.cpp (referenced file)
Lines: 112-145 (memory access pattern implementation)
Function: Benchmark execution loops and memory traversal algorithms

Performance Impact

  • Severity: HIGH (P1)
  • Impact: 40-70% memory bandwidth loss due to cache inefficiency
  • Affected Operations: All memory benchmark patterns
  • Critical for: Accurate memory performance measurement

Technical Details

Current Implementation Issue

The benchmark runner likely implements memory access patterns similar to:

// Problematic non-sequential access pattern (conceptual)
void run_benchmark(uint8_t* buffer, size_t size) {
    // Random or stride-based access causing cache misses
    for (size_t i = 0; i < iterations; ++i) {
        size_t offset = (i * large_stride) % size;  // Non-sequential
        buffer[offset] = test_pattern;              // Cache miss likely
    }
}

// Another problematic pattern - pointer chasing
void traverse_list(node_t* head) {
    node_t* current = head;
    while (current) {
        // Each access likely causes cache miss
        process_node(current->data);
        current = current->next;  // Unpredictable memory location
    }
}

Cache Performance Problems

  1. Random memory access causing cache misses
  2. Large stride access patterns defeating prefetchers
  3. Pointer chasing with unpredictable memory locations
  4. Non-aligned memory access crossing cache line boundaries
  5. Memory access patterns that don't utilize cache hierarchy effectively

Expected Behavior

  • Sequential memory access patterns for maximum bandwidth
  • Cache-friendly stride patterns that work with prefetchers
  • Memory access patterns aligned to cache line boundaries
  • Efficient utilization of L1, L2, and L3 caches
  • Memory access patterns that achieve >90% of theoretical bandwidth

Actual Behavior

  • Random or large-stride access patterns cause cache misses
  • Memory bandwidth utilization is 30-60% of theoretical maximum
  • Cache miss rates are excessively high (>20% for L1)
  • Memory latency dominates over bandwidth in measurements

Steps to Reproduce

  1. Run memory benchmarks with cache performance monitoring:

    # Monitor cache performance
    perf stat -e cache-misses,cache-references,L1-dcache-misses \
              ./memory_bandwidth --cache-hierarchy
    
    # Expected poor results:
    # L1 cache miss rate: >20% (should be <5%)
    # Memory bandwidth: 30-60% of theoretical maximum
  2. Profile memory access patterns:

    # Profile with Intel VTune (if available)
    vtune -collect memory-access ./memory_bandwidth
    
    # Profile with perf
    perf record -e mem:u ./memory_bandwidth
    perf report
  3. Compare with optimized sequential access implementation

Performance Analysis

# Current cache performance (estimated)
perf stat -e cache-misses,cache-references ./memory_bandwidth --size 1024

# Typical poor results:
# Cache miss rate: 25-50%
# Memory bandwidth: 30-60% of peak
# Instructions per cache miss: Low efficiency

# Expected with optimization:
# Cache miss rate: <5% for sequential, <15% for optimized strided
# Memory bandwidth: 85-95% of theoretical peak  
# Instructions per cache miss: High efficiency

Suggested Solution

// Optimized cache-friendly memory access patterns

class CacheOptimizedBenchmark {
private:
    size_t cache_line_size_;
    size_t l1_cache_size_;
    size_t l2_cache_size_;
    size_t l3_cache_size_;
    
public:
    // Sequential access pattern - optimal for bandwidth
    void sequential_access_pattern(uint8_t* buffer, size_t size) {
        // Process in cache-line sized chunks
        const size_t chunk_size = cache_line_size_;
        
        for (size_t offset = 0; offset < size; offset += chunk_size) {
            // Prefetch next cache line
            __builtin_prefetch(buffer + offset + chunk_size, 1, 3);
            
            // Process current cache line
            for (size_t i = 0; i < chunk_size && (offset + i) < size; ++i) {
                buffer[offset + i] = test_pattern;
            }
        }
    }
    
    // Optimized stride pattern - cache-friendly strides
    void optimized_stride_pattern(uint8_t* buffer, size_t size) {
        // Use strides that work well with hardware prefetchers
        const size_t stride = cache_line_size_;  // Cache line stride
        const size_t num_passes = stride;
        
        for (size_t pass = 0; pass < num_passes; ++pass) {
            for (size_t offset = pass; offset < size; offset += stride) {
                // Prefetch ahead
                if (offset + stride * 8 < size) {
                    __builtin_prefetch(buffer + offset + stride * 8, 1, 3);
                }
                
                buffer[offset] = test_pattern;
            }
        }
    }
    
    // Cache hierarchy aware pattern
    void cache_hierarchy_pattern(uint8_t* buffer, size_t size) {
        // L1 cache sized blocks
        const size_t l1_block = l1_cache_size_ / 2;  // Leave room for other data
        
        for (size_t block_start = 0; block_start < size; block_start += l1_block) {
            size_t block_end = std::min(block_start + l1_block, size);
            
            // Process L1-sized block multiple times (temporal locality)
            for (int pass = 0; pass < 3; ++pass) {
                for (size_t offset = block_start; offset < block_end; offset += cache_line_size_) {
                    // Process cache line
                    for (size_t i = 0; i < cache_line_size_ && offset + i < block_end; ++i) {
                        buffer[offset + i] = test_pattern;
                    }
                }
            }
        }
    }
    
    // NUMA-aware access pattern  
    void numa_optimized_pattern(uint8_t* buffer, size_t size, int numa_node) {
        // Ensure buffer is allocated on correct NUMA node
        numa_tonode_memory(buffer, size, numa_node);
        
        // Use sequential access within NUMA node
        sequential_access_pattern(buffer, size);
    }
};

// Benchmark runner with cache optimization
class OptimizedBenchmarkRunner {
public:
    void run_cache_optimized_benchmark(const BenchmarkConfig& config) {
        CacheOptimizedBenchmark benchmark;
        
        // Detect cache sizes
        detect_cache_hierarchy();
        
        // Select optimal access pattern based on working set size
        if (config.working_set_size <= l1_cache_size_) {
            benchmark.cache_hierarchy_pattern(buffer, size);
        } else if (config.working_set_size <= l3_cache_size_) {
            benchmark.optimized_stride_pattern(buffer, size);
        } else {
            benchmark.sequential_access_pattern(buffer, size);
        }
    }
    
private:
    void detect_cache_hierarchy() {
        // Use platform-specific methods to detect cache sizes
        #ifdef __linux__
        l1_cache_size_ = sysconf(_SC_LEVEL1_DCACHE_SIZE);
        l2_cache_size_ = sysconf(_SC_LEVEL2_CACHE_SIZE);
        l3_cache_size_ = sysconf(_SC_LEVEL3_CACHE_SIZE);
        #elif defined(__APPLE__)
        size_t size = sizeof(l1_cache_size_);
        sysctlbyname("hw.l1dcachesize", &l1_cache_size_, &size, NULL, 0);
        sysctlbyname("hw.l2cachesize", &l2_cache_size_, &size, NULL, 0);
        sysctlbyname("hw.l3cachesize", &l3_cache_size_, &size, NULL, 0);
        #endif
    }
};

Acceptance Criteria

  • Implement cache-optimized memory access patterns
  • Reduce cache miss rate to <5% for sequential patterns
  • Achieve >85% of theoretical memory bandwidth
  • Add cache hierarchy detection and optimization
  • Implement hardware prefetching optimization
  • Add NUMA-aware memory access patterns
  • Benchmark against current implementation showing >40% improvement

Performance Testing Requirements

  • Cache performance analysis with perf/VTune
  • Memory bandwidth measurements on multiple architectures
  • Cache miss rate validation for different access patterns
  • NUMA performance testing on multi-socket systems
  • Comparison with theoretical memory bandwidth limits

Validation Commands

# Cache performance monitoring
perf stat -e cache-misses,cache-references,L1-dcache-misses,L1-dcache-loads \
          ./memory_bandwidth_optimized --cache-hierarchy

# Memory bandwidth measurement
./memory_bandwidth_optimized --pattern sequential --size 2048 --validate-bandwidth

# Compare with current implementation
./benchmark_comparison.sh

# Expected improvements:
# Cache miss rate: 70-80% reduction
# Memory bandwidth: 40-70% improvement  
# Overall benchmark speed: 2-3x faster

Hardware-Specific Optimizations

// Intel-specific optimizations
#ifdef __x86_64__
    // Use Intel intrinsics for prefetching
    _mm_prefetch((const char*)ptr, _MM_HINT_T0);
    
    // Use streaming stores for large datasets
    _mm_stream_si64((long long*)ptr, value);
#endif

// ARM-specific optimizations  
#ifdef __aarch64__
    // Use ARM prefetch intrinsics
    __builtin_prefetch(ptr, 1, 3);
    
    // Use NEON for vectorized operations
    // (implementation details)
#endif

Expected Performance Improvements

  • Cache miss rate: 70-80% reduction (from 25% to <5%)
  • Memory bandwidth utilization: 40-70% improvement
  • Benchmark execution time: 2-3x faster
  • Result accuracy: More representative of actual hardware capability
  • Cross-platform performance: Consistent optimization across architectures

References

  • "What Every Programmer Should Know About Memory" by Ulrich Drepper
  • Intel Optimization Reference Manual
  • ARM Cortex-A77 Software Optimization Guide
  • Cache-Oblivious Algorithms and Data Structures

Priority: HIGH - Critical for benchmark accuracy
Estimated Effort: 3-5 days
Performance Review Required: Yes

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingneeds-investigationRequires investigation before implementationperformancePerformance improvement or issuepriority-highHigh priority - resolve soon

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions