Replace dispatch heuristics with profiling-derived thresholds; fix AVX-512/MSVC build#15
Merged
Merged
Conversation
Root cause: selectStrategyBasedOnCapabilities unconditionally preferred WORK_STEALING for all distributions at batch_size >= work_stealing_min (8000 for AVX-512). Profiling showed WORK_STEALING is 3-4x slower than PARALLEL for regular workloads (Gaussian, Exponential, Uniform, etc.) due to load-balancing overhead on uniform-cost elements. Changes: - WORK_STEALING now limited to distributions with irregular per-element cost (Poisson, Gamma, ChiSquared) where load balancing helps - AVX-512 base parallel_min raised from 500 to 5000 (wider SIMD keeps VECTORIZED competitive to higher batch sizes) - AVX-512 work_stealing_min raised from 8000 to 50000 Impact (pylibstats benchmark, Gaussian N=100k): PDF: 0.2x vs SciPy -> 2.6x CDF: 0.4x -> 3.3x Add gaussian_strategy_profile tool for per-strategy timing investigation. Co-Authored-By: Oz <oz-agent@warp.dev>
- inverse_t_cdf: raise normal-approximation cutoff from df>100 to df>1000 (consistent with t_cdf); Newton-Raphson now refines the estimate for intermediate degrees of freedom (fixes TTableValues) - performance_dispatcher: use 2x base parallel_min for simple distributions (Uniform, Discrete) so threading overhead cannot undercut low per-element cost; extend createForSIMDLevel to all 9 distribution types; clamp per-distribution thresholds after refineWithCapabilities (fixes DistributionSpecificThresholds) - test_gamma_enhanced: use absolute time bound when traditional_time ≤ 2μs instead of ratio check — dispatch overhead dominates at sub-microsecond scalar times (fixes AutoDispatchAssessment) Co-Authored-By: Oz <oz-agent@warp.dev>
- test_performance_dispatcher: use batch_size=3 (below all simd_min thresholds) instead of 5 which is above NEON/SSE2 simd_min of 4 - validators.h: lower parallel validation thresholds for small-medium batch sizes where threading overhead dominates on architectures with efficient vectorization - test_discrete_enhanced: replace hardcoded parallel speedup assertions with architecture-aware adaptive validators consistent with other enhanced test suites NOTE: the dispatch thresholds in performance_dispatcher.cpp have known issues across all architectures — inverted SIMD-efficiency refinement logic and non-empirical base thresholds cause PARALLEL to be selected at batch sizes where VECTORIZED is faster. This needs a dedicated follow-up using gaussian_strategy_profile on each target architecture. Co-Authored-By: Oz <oz-agent@warp.dev>
Add strategy_profile tool that benchmarks forced SCALAR/VECTORIZED/PARALLEL/ WORK_STEALING across all 9 distributions, 3 operations (PDF/LogPDF/CDF), and 16 batch sizes. Produces canonical CSV for dispatcher threshold tuning. Update capture_dispatcher_profile.sh and summarize_dispatcher_profile.py to use the new profiler as the canonical data source. Capture script now copies bundles into tracked data/profiles/dispatcher/ so profiles from all target architectures accumulate in version control. Remove 4 superseded tools: - gaussian_strategy_profile.cpp (strict subset of strategy_profile) - parallel_threshold_benchmark.cpp (strict subset of strategy_profile) - performance_dispatcher_tool.cpp (simulation-based, not measured data) - learning_analyzer.cpp (simulation-based, not measured data) Include NEON profiling bundle from Mac Mini M1 (1728 measurements). Update tool references in CMakeLists.txt, README.md, WARP.md, PROJECT_CONCEPT.md, and tools/README.md. Co-Authored-By: Oz <oz-agent@warp.dev>
Captured on Intel Core i7-7820HQ @ 2.90GHz (darwin-x86_64, AVX2, 4C/8T). 9 distributions × 3 operations × 16 batch sizes = 1,728 measurements. Key crossover findings: - Beta CDF, Gaussian CDF, StudentT CDF, Uniform PDF/LogPDF: VECTORIZED wins at all measured batch sizes (parallel never pays) - Poisson PDF: parallel threshold 2,000; LogPDF: 50,000 - StudentT PDF/LogPDF: parallel threshold 100,000 - Most others (ChiSquared, Exponential, Gamma, Gaussian PDF/LogPDF): parallel crossover at batch size 8-16 Co-Authored-By: Oz <oz-agent@warp.dev>
Remove the Dev (-O1) NEON profile and add a Release (-O3) capture. Release profiles are canonical for threshold tuning since they reflect production optimization levels. Strategy win distribution shifts with -O3: WORK_STEALING gains at PARALLEL's expense as per-element cost decreases and threading overhead becomes relatively more significant. Co-Authored-By: Oz <oz-agent@warp.dev>
Canonical strategy_profile run on Ivy Bridge with Release build (Clang -O3). 9 distributions x 3 operations x 4 strategies x 16 batch sizes. Needs bundling via capture_dispatcher_profile.sh for full metadata. Co-Authored-By: Oz <oz-agent@warp.dev>
Full capture_dispatcher_profile.sh bundle for Ivy Bridge i7-3820QM (SSE2+AVX). Release build, Clang -O3. 9 distributions x 3 ops x 4 strategies x 16 sizes. Includes metadata, summary, crossovers, best strategies, and logs. Co-Authored-By: Oz <oz-agent@warp.dev>
Captured on ASUS TUF A16 with AMD Ryzen 7 7445HS (6P/12T, Zen 4). Release build, MSVC 17 2022, AVX-512 enabled. Completes four-architecture profiling dataset: NEON, AVX, AVX2, AVX-512. Co-Authored-By: Oz <oz-agent@warp.dev>
Beta CDF: hoist lgamma(a+b)-lgamma(a)-lgamma(b) prefix out of the per-element loop in getCumulativeProbabilityBatchUnsafeImpl. Add beta_i(x, a, b, log_prefix) overload to skip redundant lgamma calls. Fix PARALLEL/WS lambdas to acquire cache_mutex_ once instead of per element and use the hoisted prefix with direct beta_i calls. Beta PDF/LogPDF: replace per-element scalar std::log/std::exp in PARALLEL/WS lambdas with chunked (1024-element) delegation to the SIMD batch impl (vector_log/vector_exp). Parallel tasks now get SIMD within each chunk instead of losing vectorization entirely. Also update vector_beta_i to hoist the lgamma prefix. 33/33 correctness tests pass, 54/54 SIMD verification tests pass. Co-Authored-By: Oz <oz-agent@warp.dev>
…able Add dispatch_thresholds.h with per-(SIMDLevel, DistributionType, OperationType) parallel thresholds derived from four-architecture Release profiling data (NEON, AVX, AVX2, AVX-512). Each of the 108 entries traces directly to a profiling bundle in data/profiles/dispatcher/. Add OperationType enum (PDF, LOG_PDF, CDF, BATCH_FIT) and new selectStrategy() method that replaces the old complexity-based dispatch with a three-line table lookup: SCALAR below simd_min, VECTORIZED below parallel threshold, then PARALLEL or WORK_STEALING based on platform. P-vs-WS selection uses platform detection: macOS/GCD+HT prefers WORK_STEALING, Windows/TP prefers PARALLEL, macOS/GCD without HT prefers PARALLEL. Based on four-architecture profiling showing threading backend as the dominant factor (not distribution type). Beta gets SIZE_MAX on all architectures — vectorization is not viable for any Beta operation due to the serial incomplete-beta continued fraction. Update all 24 autoDispatch() call sites across 8 distributions to pass OperationType instead of ComputationComplexity. Update 6 parallelBatchFit call sites to use dispatch_table::BATCH_FIT_MIN directly. Old threshold systems (AdaptiveThresholdCalculator, Thresholds struct with refineWithCapabilities) retained for now as deprecated — removal follows in a separate commit. 33/33 correctness tests pass. 54/54 SIMD verification tests pass. 36/36 parallel correctness tests pass. Co-Authored-By: Oz <oz-agent@warp.dev>
…rategy Update tests, tools, and examples to use selectStrategy() with OperationType instead of selectOptimalStrategy() with ComputationComplexity. No deprecated API calls remain in the codebase. Co-Authored-By: Oz <oz-agent@warp.dev>
Delete parallel_thresholds.h/.cpp (AdaptiveThresholdCalculator), distribution_characteristics.h (empirical complexity constants), and empirical_characteristics_demo.cpp (demo tool for deleted system). Remove deprecated selectOptimalStrategy() and selectStrategyBasedOnCapabilities() from PerformanceDispatcher. Simplify Thresholds struct population to fixed defaults (constexpr lookup table in dispatch_thresholds.h is now the authority). Replace all get_optimal_parallel_threshold() calls with get_min_elements_for_distribution_parallel(). Update docs to reflect changes. Co-Authored-By: Oz <oz-agent@warp.dev>
Mark 'system' as [[maybe_unused]] — the constexpr threshold table replaced the runtime system-capability conditioning. Co-Authored-By: Oz <oz-agent@warp.dev>
Superseded by the bundled profile in data/profiles/dispatcher/. Co-Authored-By: Oz <oz-agent@warp.dev>
- CMake: use /arch:AVX512 globally when SIMDDetection detects AVX-512, instead of hardcoding /arch:AVX2 for all MSVC x64 builds. Ensures __AVX512F__ is defined in non-SIMD source files (validators, tests). Clang-cl path updated symmetrically (-mavx512f). - validators.h: add AVX-512 awareness to adaptive test thresholds. AMD branch gains __AVX512F__ tier (base 2.0, Zen4 double-pumped). Complex-distribution SIMD multiplier reduced to 0.7x on AVX-512 (lgamma/factorial scalar bottlenecks limit wide-pipeline benefit). Parallel thresholds below 100K accept >= 0.1x (forced PARALLEL below the vectorized-to-parallel crossover is expected to underperform). Large-batch SIMD multiplier lowered to 1.05x (amortisation curve flattens earlier on 8-wide processing). - student_t.cpp: add NU_MAX=1000 upper bound and clamp initial moment estimate to 100, preventing Newton-Raphson divergence in the flat tail of the score function when sample excess kurtosis is near zero. - test_student_t_enhanced.cpp: increase MLE sample size from 500 to 2000 for stable convergence across stdlib implementations (MSVC vs libc++ produce different samples from identical mt19937 seeds). - test_system_capabilities.cpp: replace vector<bool> with vector<int> in ThreadSafety test (bit-packing caused concurrent writes to different indices to race on the same byte). Widen threading overhead bound from 100us to 500us (Windows scheduler jitter). Co-Authored-By: Oz <oz-agent@warp.dev>
…r, source list - Correct 'server CPUs' to 'Intel Skylake-X+, AMD Zen4+' for AVX-512 - Add AVX-512 detection output example - Document that Windows global SIMD flag follows SIMDDetection results - Add /arch:AVX512 to MSVC manual flags example - Add simd_avx512.cpp and simd_dispatch.cpp to source file listing Co-Authored-By: Oz <oz-agent@warp.dev>
Leftover from the old complexity-loop in displayDispatcherConfiguration() that was simplified during the dispatch rework. Co-Authored-By: Oz <oz-agent@warp.dev>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replace the heuristic-based dispatch threshold system with a
constexprlookup table derived from empirical profiling data across four SIMD architectures. Fix AVX-512 build and test infrastructure for MSVC/Windows.Closes #14
Dispatch threshold rework
The previous dispatcher had three compounding problems (documented in #14):
refineWithCapabilities()inverted the SIMD-efficiency adjustmentThis PR replaces both
AdaptiveThresholdCalculatorandPerformanceDispatcher::Thresholdswith a singleconstexprlookup table indexed by(SIMDLevel, DistributionType, OperationType). Each entry is derived directly from profiling CSV data collected on four machines:Key design decisions:
SIZE_MAXsentinel on all architectures (no SIMD path — scalar continued-fraction)AVX-512/MSVC fixes
SIMDDetectionresults — uses/arch:AVX512when detected instead of hardcoding/arch:AVX2for all MSVC x64 builds__AVX512F__tier; SIMD and parallel thresholds adjusted for wide-vector characteristics (vectorized-to-parallel crossovers at 50K–100K vs 8–64 on narrower architectures)NU_MAX = 1000upper bound prevents Newton-Raphson divergence; initial moment estimate clamped to 100vector<bool>→vector<int>(bit-packing caused concurrent writes to race on the same byte)Other improvements
lgammaprefix out of the loop via newbeta_i(x, a, b, log_beta_prefix)overloadtools/strategy_profile) replaces ad-hoc benchmarks as the primary dispatcher-threshold evidence sourceBUILD_SYSTEM_GUIDE.mdupdated for AVX-512 (not server-only, MSVC flag behavior, source file listing)Profiling data
Four profile bundles in
data/profiles/dispatcher/, each containing 1728 measurements (9 distributions × 3 operations × 16 batch sizes × 4 strategies).Testing
Warp conversations: session 1, session 2
Co-Authored-By: Oz oz-agent@warp.dev