LunarSearch is a specialized string-searching utility designed to push the boundaries of the Intel Lunar Lake architecture. By utilizing hand-tuned AVX2 intrinsics and OpenMP parallelism, this project achieves near-theoretical peak memory bandwidth, successfully hitting the "Memory Wall" of modern mobile silicon.
Benchmarks on a 4 GB contiguous buffer show that LunarSearch effectively saturates the memory controller, reaching 78.78 GB/s, roughly 2.2x the throughput of single-threaded standard library functions.
| Method | Mean Time (4GB) | Throughput | Status |
|---|---|---|---|
| OpenMP + AVX2 | 50.77 ms | 78.78 GB/s | Champion |
| std::find (icpx) | 111.29 ms | 35.94 GB/s | Statistical Tie |
| Manual AVX2 | 112.15 ms | 35.67 GB/s | Statistical Tie |
| Naive Loop (-O3) | 138.88 ms | 28.80 GB/s | ~1.25x slower than std::find |
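
The throughput column can be reproduced from the mean times. The sketch below assumes that "4GB" means 4 × 10^9 bytes and that GB/s uses the same decimal unit (reading both as binary units gives the same numbers). The function name is hypothetical and not part of the repository.

```cpp
#include <cstdint>

// Throughput in GB/s (assuming 1 GB = 1e9 bytes) for a scan over `bytes`
// bytes that took `millis` milliseconds.
constexpr double throughput_gb_per_s(std::uint64_t bytes, double millis) {
    const double gigabytes = static_cast<double>(bytes) / 1e9;
    const double seconds = millis / 1e3;
    return gigabytes / seconds;
}
```

For example, `throughput_gb_per_s(4000000000ULL, 50.77)` evaluates to about 78.79, which matches the OpenMP + AVX2 row up to rounding.
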
Standard auto-vectorization often fails to account for the specific prefetching needs and instruction-level parallelism (ILP) required to saturate high-speed LPDDR5X memory. Written by a developer coming from a 10-month background in Python, this project was a deep dive into "Mechanical Sympathy": writing code that respects the physical constraints of the CPU and RAM.
- Manual AVX2 Unrolling: Bypasses compiler heuristics to process 128 bytes per iteration using four 256-bit YMM registers.
- Branchless Search Logic: Uses bitmask-to-index conversion via `_mm256_movemask_epi8` and `_tzcnt_u32` to keep the execution pipeline fluid.
- Software Prefetching: Utilizes `_mm_prefetch` to mask memory latency, ensuring data is in the L1/L2 cache before the CPU execution ports require it.
- OpenMP Orchestration: Parallelizes the SIMD core to open multiple concurrent pipelines to the memory controller, achieving a 2.2x scaling factor over the single-core limit.
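
The four techniques above can be sketched together as follows. This is a minimal illustration and not the repository's actual kernel. It assumes the search is for a single byte value (as with `std::find` over bytes), and the function names, the 512-byte prefetch distance, and the 1 MiB chunk size are all hypothetical choices. The `target` attribute is a GCC/Clang extension that lets the sketch compile without `-march=native`; it is redundant with the build command in this README.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kNotFound = static_cast<std::size_t>(-1);
constexpr std::size_t kPrefetchAhead = 512;  // assumed distance, needs tuning

// Returns the index of the first byte equal to `needle`, or kNotFound.
__attribute__((target("avx2,bmi")))
std::size_t find_byte_avx2(const std::uint8_t* data, std::size_t n,
                           std::uint8_t needle) {
    const __m256i pattern = _mm256_set1_epi8(static_cast<char>(needle));
    auto load = [](const std::uint8_t* p) __attribute__((target("avx2"))) {
        return _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
    };
    std::size_t i = 0;
    for (; i + 128 <= n; i += 128) {  // 4 x 32 bytes per iteration
        // Hint a block further ahead; clamp so the address stays in range.
        const std::size_t ahead = (i + kPrefetchAhead < n) ? i + kPrefetchAhead : i;
        _mm_prefetch(reinterpret_cast<const char*>(data + ahead), _MM_HINT_T0);

        // Compare each 32-byte lane and collapse it to a 32-bit mask.
        const auto m0 = static_cast<std::uint32_t>(_mm256_movemask_epi8(
            _mm256_cmpeq_epi8(load(data + i), pattern)));
        const auto m1 = static_cast<std::uint32_t>(_mm256_movemask_epi8(
            _mm256_cmpeq_epi8(load(data + i + 32), pattern)));
        const auto m2 = static_cast<std::uint32_t>(_mm256_movemask_epi8(
            _mm256_cmpeq_epi8(load(data + i + 64), pattern)));
        const auto m3 = static_cast<std::uint32_t>(_mm256_movemask_epi8(
            _mm256_cmpeq_epi8(load(data + i + 96), pattern)));

        // One well-predicted branch per 128 bytes; the bit-to-index step
        // itself (tzcnt on the mask) needs no per-byte branching.
        if ((m0 | m1 | m2 | m3) != 0) {
            if (m0) return i + _tzcnt_u32(m0);
            if (m1) return i + 32 + _tzcnt_u32(m1);
            if (m2) return i + 64 + _tzcnt_u32(m2);
            return i + 96 + _tzcnt_u32(m3);
        }
    }
    for (; i < n; ++i) {  // scalar tail for the last < 128 bytes
        if (data[i] == needle) return i;
    }
    return kNotFound;
}

// Splits the buffer into chunks and keeps the smallest match index.
// Without OpenMP enabled the pragma is ignored and the loop runs serially.
std::size_t find_byte_parallel(const std::uint8_t* data, std::size_t n,
                               std::uint8_t needle) {
    constexpr std::size_t kChunk = std::size_t{1} << 20;  // assumed 1 MiB
    const auto chunks = static_cast<std::int64_t>((n + kChunk - 1) / kChunk);
    std::size_t best = kNotFound;
    #pragma omp parallel for schedule(static) reduction(min : best)
    for (std::int64_t c = 0; c < chunks; ++c) {
        const std::size_t begin = static_cast<std::size_t>(c) * kChunk;
        const std::size_t len = (n - begin < kChunk) ? n - begin : kChunk;
        const std::size_t hit = find_byte_avx2(data + begin, len, needle);
        if (hit != kNotFound && begin + hit < best) best = begin + hit;
    }
    return best;
}
```

Note that the parallel version in this sketch has no early exit: every chunk is scanned even after a match is found. That suits a bandwidth benchmark where the needle is absent or at the very end, but a general-purpose search would want cancellation.
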
Performance in systems programming is a science, not a feeling. This repository includes a high-precision benchmarking harness that:
- Performs 100-sample runs for every implementation.
- Calculates Mean and Standard Deviation to quantify run-to-run variation such as OS jitter.
- Applies a 2-Sigma Significance Test to ensure that "Wins" are architecturally real and not just statistical noise.
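
The statistics behind these steps might look like the sketch below. The exact form of the 2-sigma test is not spelled out here, so the comparison shown is one plausible reading (an assumption): two implementations differ only if their means are more than two combined standard deviations apart. The `Stats`, `summarize`, and `differs_by_two_sigma` names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Stats {
    double mean;
    double stddev;
};

// Sample mean and sample standard deviation (n - 1 denominator).
Stats summarize(const std::vector<double>& samples_ms) {
    const std::size_t n = samples_ms.size();
    if (n == 0) return {0.0, 0.0};
    double sum = 0.0;
    for (double s : samples_ms) sum += s;
    const double mean = sum / static_cast<double>(n);
    double squared_dev = 0.0;
    for (double s : samples_ms) squared_dev += (s - mean) * (s - mean);
    const double stddev =
        n > 1 ? std::sqrt(squared_dev / static_cast<double>(n - 1)) : 0.0;
    return {mean, stddev};
}

// Assumed reading of the 2-sigma rule: the gap between the means must
// exceed twice the combined standard deviation of the two sample sets.
bool differs_by_two_sigma(const Stats& a, const Stats& b) {
    const double combined =
        std::sqrt(a.stddev * a.stddev + b.stddev * b.stddev);
    return std::fabs(a.mean - b.mean) > 2.0 * combined;
}
```

Under this reading, two methods whose means sit within two combined standard deviations of each other would be reported as a "Statistical Tie" rather than a win.
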

This project is optimized for the Intel oneAPI DPC++/C++ Compiler (icpx).
- Intel oneAPI Base Toolkit
- A CPU supporting AVX2 (Optimized for Intel Lunar Lake / Lion Cove)
- OpenMP 4.5+
icpx -qopenmp -O3 -march=native string_match_bench.cpp -o string_match_bench