Merged
28 commits
10b2ac1
docs: require session-start architecture build routing in WARP
OldCrow Apr 28, 2026
a915acf
Implement adaptive FB selector and profiling tools
OldCrow Apr 29, 2026
e89cd35
Add Phase D correctness gates: D1 FB mode parity, D2 BW parity, D3 ph…
OldCrow May 1, 2026
59002f0
Clean up policy header: remove dead FbIsaClass and spurious mutable
OldCrow May 1, 2026
01ddb7b
Remove dead LogSpaceOps infrastructure
OldCrow May 1, 2026
fcd38cb
Simplify FB recurrence policy and adopt transcendental kernels abstra…
OldCrow May 1, 2026
d7115b0
Baum-Welch locality refactor with dense/sparse xi split
OldCrow May 1, 2026
51a1de3
Add bw_hotspot profiling tool
OldCrow May 1, 2026
12b9b66
Add benchmark-analysis scratchpad: focus sweep CSVs, rerun dumps, hel…
OldCrow May 1, 2026
e23cd64
Merge remote-tracking branch 'origin/main' into perf/trainers-calcula…
OldCrow May 2, 2026
e0bc0d8
Merge remote-tracking branch 'origin/main' into perf/trainers-calcula…
OldCrow May 2, 2026
690c567
Implement SIMD backends for transcendental_kernels (AVX-512/AVX/SSE2/…
OldCrow May 2, 2026
0692b0d
Retune FB recurrence crossover: N>=5 -> N>=4 on x86; add fb_crossover…
OldCrow May 2, 2026
9fccd81
benchmarks: fix HMMLib detection to not require Boost
OldCrow May 2, 2026
1c4e536
Promote LogNormal and Pareto to Tier 2: add vector log helper
OldCrow May 2, 2026
aa55cac
Add test_simd_platform: fill Platform Capabilities test group
OldCrow May 2, 2026
010d5af
configure_catalina.sh: add -DCMAKE_BUILD_TYPE=Release
OldCrow May 2, 2026
9bd42a2
Ivy Bridge validation: AVX-1 path confirmed; document Catalina build-…
OldCrow May 2, 2026
931ebcc
benchmark-analysis: add Ivy Bridge / Catalina / AVX-1 crossover sweep…
OldCrow May 2, 2026
484dedd
M1 NEON validation: 37/37 pass; crossover + GHMM/HMMLib benchmarks
OldCrow May 2, 2026
b61cc5a
Kaby Lake / AVX2 validation: 37/37 pass; crossover + HMMLib benchmarks
OldCrow May 3, 2026
fb62215
Kaby Lake / AVX2: add GHMM continuous benchmark results
OldCrow May 3, 2026
ac0e00f
Pre-merge review: fix two stale comments
OldCrow May 3, 2026
55f6f42
CI fixes: trailing whitespace, EOF, clang-format, cppcheck suppressions
OldCrow May 3, 2026
662c172
style: apply clang-format 19.1.7 to all source files; fix cppcheck su…
OldCrow May 3, 2026
39ba7c9
chore: register format commit 662c172 in .git-blame-ignore-revs
OldCrow May 3, 2026
d9bfcb1
Fix CI: correct *.ps1 eol=crlf in .gitattributes; structural cppcheck…
OldCrow May 3, 2026
b9d231d
Release v3.3.0: SIMD transcendental kernels, Tier-2 LogNormal/Pareto
OldCrow May 3, 2026
6 changes: 3 additions & 3 deletions .clang-tidy
@@ -211,7 +211,7 @@ CheckOptions:
value: ''
- key: readability-identifier-naming.NamespaceSuffix
value: ''

# Performance and modernization options
- key: modernize-use-auto.MinTypeNameLength
value: '5'
@@ -223,13 +223,13 @@ CheckOptions:
value: 'true'
- key: performance-unnecessary-value-param.IncludeStyle
value: 'llvm'

# Certificate and security options
- key: cert-dcl16-c.NewSuffixes
value: 'L;LL;LU;LLU'
- key: cert-oop54-cpp.WarnOnlyIfThisHasSuspiciousField
value: 'false'

# Core guidelines options
- key: cppcoreguidelines-special-member-functions.AllowSoleDefaultDtor
value: 'true'
3 changes: 3 additions & 0 deletions .git-blame-ignore-revs
@@ -4,3 +4,6 @@

# style: bulk reformat all source files with clang-format (2026-04-23)
7221753

# style: apply clang-format 19.1.7 to all source files (2026-05-03)
662c172
4 changes: 2 additions & 2 deletions .gitattributes
@@ -30,10 +30,10 @@ CMakeLists.txt text eol=lf
# Scripts — always LF so they run correctly in bash/sh
*.sh text eol=lf

-# Windows-only scripts stay CRLF
+# Windows batch/cmd scripts stay CRLF; PowerShell handles LF on all platforms
*.bat text eol=crlf
*.cmd text eol=crlf
-*.ps1 text eol=crlf
+*.ps1 text eol=lf

# XML (HMM model files)
*.xml text eol=lf
48 changes: 48 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,54 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [3.3.0] - 2026-05-03

SIMD performance phase: explicit vector kernels for transcendental
operations and two additional Tier-2 distributions. 37/37 tests pass.

### Added

- **SIMD transcendental kernels** (`src/performance/transcendental_kernels.cpp`):
five inner-loop kernels used by `ForwardBackwardCalculator` (FB max-reduce
recurrence) and `BaumWelchTrainer` (dense-xi accumulation) now have
AVX-512 / AVX / SSE2 / NEON backends. The vector `exp` helper uses a
13-term Horner polynomial with Cephes `ln2` range reduction and branch-free
underflow masking at `MIN_LOG_PROBABILITY`. AVX path stays AVX-1 compatible
for Ivy Bridge / Catalina. Benchmarks on Zen 4 / AVX-512 (T=1000):
FB max-reduce 5.7× faster at N=32; BW xi accumulation 1.03–1.15×.
- **LogNormal and Pareto promoted to Tier 2** (`src/distributions/`): explicit
SIMD `getBatchLogProbabilities` via a vector `log` helper (IEEE-754 exponent
extraction, 7-term Horner, split-LN2 reconstruction, ≤5 ULP).
- **`simd_kernels_internal.h`**: single source of truth for vector exp/log
primitives shared by all Tier-2 distribution TUs and the transcendental
kernels TU.
- **FB recurrence crossover retuned** (`fb_recurrence_policy.h`): threshold
moved from N≥5 to N≥4 on x86 after profiling post-SIMD (MaxReduce is 1.7×
faster at N=4).
- **New tests** (37 total, up from 33):
- `test_simd_platform`: compile-time ISA hierarchy invariants (`#error`) and
runtime contracts on `simd_platform.h` utility functions.
- `test_transcendental_kernels`: SIMD vs `std::exp` parity for all five
kernels across 11 sizes; 1e-12 rel / 1e-15 abs tolerance.
- `test_fb_mode_parity`: Pairwise vs MaxReduce FB log-likelihood agreement.
- `test_bw_parity`: BW determinism (bit-exact) and EM monotonicity.
- **New tools**: `bw_hotspot` (BW E-step phase breakdown), `hotspot_breakdown`
(FB phase-level timings), `fb_crossover_sweep` (Pairwise vs MaxReduce
timing across N), `fb_contour_sweep` (2-D N×T timing heatmap data).

### Changed

- `fb_recurrence_policy.h` moved from `include/libhmm/calculators/` to
`include/libhmm/performance/` (cross-cutting primitive, not calculator-specific).
- Test group labels in `tests/CMakeLists.txt` changed from numeric Level N
notation to semantic names; Performance Primitives group reordered before
Distributions to reflect dependency order.
- `performance/PERFORMANCE_ARCHITECTURE.md` updated: Tier-2 coverage,
delivered recurrence-kernel SIMD, corrected `LIBHMM_SIMD_SOURCES` list.
- `*.ps1` line-ending rule in `.gitattributes` changed from `eol=crlf` to
`eol=lf` (PowerShell handles LF on all platforms; avoids CI pre-commit
mixed-line-ending failures).

## [3.2.1] - 2026-05-02

CI hygiene fix; no functional changes.
12 changes: 11 additions & 1 deletion CMakeLists.txt
@@ -58,7 +58,7 @@ if(APPLE AND NOT CMAKE_CXX_COMPILER)
endif()

project(libhmm
-VERSION 3.2.1
+VERSION 3.3.0
DESCRIPTION "Modern C++20 Hidden Markov Model Library"
LANGUAGES CXX
)
@@ -479,6 +479,15 @@ set(LIBHMM_SIMD_SOURCES
src/distributions/weibull_distribution.cpp
)

# Additional TUs that include simd_kernels_internal.h or transcendental_kernels.h
# and therefore need LIBHMM_BEST_SIMD_FLAGS to activate the #if LIBHMM_HAS_* cascade.
# (log_normal and pareto are already in LIBHMM_SIMD_SOURCES above.)
list(APPEND LIBHMM_SIMD_SOURCES
src/performance/transcendental_kernels.cpp
src/calculators/forward_backward_calculator.cpp
src/training/baum_welch_trainer.cpp
)

if(LIBHMM_BEST_SIMD_FLAGS)
foreach(simd_src ${LIBHMM_SIMD_SOURCES})
set_source_files_properties(
@@ -499,6 +508,7 @@ set(LIBHMM_SOURCES
src/common/common.cpp
src/common/string_tokenizer.cpp
src/common/numerical_stability.cpp
src/performance/transcendental_kernels.cpp
src/distributions/distribution_base.cpp
src/distributions/discrete_distribution.cpp
src/distributions/gaussian_distribution.cpp
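
The comment in the hunk above says these translation units need `LIBHMM_BEST_SIMD_FLAGS` so that their `#if LIBHMM_HAS_*` cascade selects a vector backend. The sketch below shows the general mechanism using the compilers' own predefined macros, because the library's `LIBHMM_HAS_*` names are not spelled out in this diff. `active_simd_backend` is a hypothetical helper, not part of libhmm.

```cpp
#include <string_view>

// Reports which branch of a compile-time ISA cascade was taken for this
// translation unit. The result depends only on the flags used to compile
// this file, which is why per-file SIMD flags matter.
constexpr std::string_view active_simd_backend() {
#if defined(__AVX512F__)
    return "avx512";
#elif defined(__AVX__)
    return "avx";
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return "sse2";
#elif defined(__ARM_NEON) || defined(__aarch64__)
    return "neon";
#else
    return "scalar";
#endif
}
```

With GCC or Clang on x86-64, compiling this file without extra flags would normally take the `sse2` branch, while adding `-mavx` defines `__AVX__` and takes the `avx` branch. A source file that includes SIMD headers but is compiled without the flags falls back to a narrower branch without any error, which is the situation the `list(APPEND ...)` above is meant to prevent.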
28 changes: 15 additions & 13 deletions WARP.md
@@ -6,8 +6,8 @@ This file provides guidance to Warp (warp.dev) when working in this repository.

## Current Status

-**Version**: v3.2.1 — latest tag and published release on `main`.
-**Tests**: 33/33 passing on all four CI platforms (Linux/GCC, Linux/Clang, macOS/AppleClang, Windows/MSVC).
+**Version**: v3.3.0 — latest tag and published release on `main`.
+**Tests**: 37/37 passing on all four CI platforms (Linux/GCC, Linux/Clang, macOS/AppleClang, Windows/MSVC).
**Active phase**: Complete. All phases through Post-Phase 5 (CI/tooling, benchmarks) are done.

---
@@ -36,7 +36,7 @@ include/libhmm/
│ └── segmental_kmeans_trainer.h # Discrete-state initialisation
└── io/ # XML I/O
src/ # Implementation (mirrors include/)
-tests/ # GTest suite — levels 0–7 (see tests/CMakeLists.txt)
+tests/ # GTest suite — semantic groups (see tests/CMakeLists.txt)
examples/ # 13 usage demonstrations (all canonical API)
tools/ # Standalone diagnostic/benchmarking executables
benchmarks/ # Comparative benchmarks
@@ -70,7 +70,7 @@ Both are always produced regardless of `BUILD_SHARED_LIBS`. Tests link against

2. **Two canonical calculators** — `ForwardBackwardCalculator` (log-space, precomputed log-trans) and `ViterbiCalculator`. Both call `getBatchLogProbabilities()` per state per time step.

-3. **Compile-time SIMD dispatch** — source-distributed; each machine builds for its own CPU. GCC/Clang: `-march=native`. MSVC: `check_cxx_source_runs`-verified `/arch:AVX512`/`AVX2`/`AVX`. All 15 distribution TUs in `LIBHMM_SIMD_SOURCES`. Tier 2 explicit intrinsics: Gaussian + Exponential via `detail::` free functions (extractable to separate TU for future runtime dispatch).
+3. **Compile-time SIMD dispatch** — source-distributed; each machine builds for its own CPU. GCC/Clang: `-march=native`. MSVC: `check_cxx_source_runs`-verified `/arch:AVX512`/`AVX2`/`AVX`. All 15 distribution TUs plus transcendental kernels, FB calculator, and BW trainer in `LIBHMM_SIMD_SOURCES`. Tier 2 explicit intrinsics: Gaussian, Exponential, LogNormal, Pareto via `detail::` free functions; recurrence kernels (FB max-reduce, BW xi) via `TranscendentalKernels` in `src/performance/`. Shared vector exp/log helpers in `include/libhmm/performance/simd_kernels_internal.h`.

4. **Thread-safe cache** — `std::atomic<bool> cacheValid_` in `DistributionBase`. Avoids mutex; safe for concurrent const reads if the library is invoked from multiple threads (calculators and trainers themselves run single-threaded — see `performance/PERFORMANCE_ARCHITECTURE.md`).

@@ -210,24 +210,26 @@ CRLF: `.gitattributes` enforces LF. CRLF warnings on `git add` are normal.

- Always run `./scripts/configure_catalina.sh build` for the first configure.
- The script sanitizes toolchain-related environment variables, pins AppleClang via `xcrun`, and sets `CMAKE_OSX_DEPLOYMENT_TARGET=10.15`.
- **Build type:** the script defaults to `Release` (`-O3`). This is required for correctness: at `-O0`, AppleClang inserts `VZEROUPPER` in the prologue of large-frame AVX functions before saving the `__m256d` argument, silently zeroing `x[2]` and `x[3]`. For debuggable builds use `RelWithDebInfo` (`-O2 -g`) — SIMD helpers inline at `-O2` so the issue cannot occur: `./scripts/configure_catalina.sh build -DCMAKE_BUILD_TYPE=RelWithDebInfo`. Pure `Debug` (`-O0`) is unsafe for any code path that passes `__m256d` through a real call boundary.
- Do not point Catalina builds at Homebrew LLVM/libc++ (`/usr/local/opt/llvm`, `Cellar/llvm*`, libc++ include paths). The root `CMakeLists.txt` guard fails configure when those hints are detected.
- Use `-DLIBHMM_ALLOW_UNSUPPORTED_CATALINA_HOMEBREW_LIBCXX=ON` only for explicit troubleshooting; runtime stability is not guaranteed.

---

## Test Suite Structure

-Tests in `tests/CMakeLists.txt` use `add_hmm_test()` helper organized into 8 levels:
+Tests in `tests/CMakeLists.txt` use `add_hmm_test()` helper organized into semantic groups:

-| Level | Content |
+| Group | Content |
|---|---|
-| 1 | Math & Numerics |
-| 2 | Linear Algebra |
-| 3 | Distributions (all 15 + traits/header/type_safety) |
-| 4 | Core HMM |
-| 5 | Calculators (canonical + continuous + edge cases) |
-| 6 | Trainers (canonical + training + edge cases + BW convergence) |
-| 7 | IO + Integration (stream IO + end-to-end casino) |
+| Platform Capabilities | No tests yet (placeholder) |
+| Math & Numerics | constants, numerical stability, common types |
+| Performance Primitives | transcendental kernels (SIMD parity vs `std::exp`) |
+| Distributions | all 15 + traits/header/type_safety |
+| Core HMM | HMM construction and state management |
+| Calculators | canonical + continuous + edge cases + FB mode parity |
+| Trainers | canonical + training + edge cases + BW convergence + BW parity |
+| IO & Integration | stream IO + end-to-end casino |

Custom targets: `check` (correctness, parallel), `check_timing` (serial).
Note: named `check` not `run_tests` to avoid cmake's built-in `RUN_TESTS` on Windows.
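
The "FB mode parity" entry in the Calculators group compares two ways of evaluating the same log-space sum. The sketch below assumes that "Pairwise" means folding the terms with a two-argument log-add and that "MaxReduce" means subtracting the maximum, summing the exponentials, and taking a single logarithm. The diff does not spell out either definition, so treat both as assumptions. The function names are hypothetical and do not come from libhmm.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// log(exp(a) + exp(b)) without overflow; -inf acts as the additive identity.
inline double log_add(double a, double b) {
    const double hi = std::max(a, b);
    const double lo = std::min(a, b);
    if (hi == -std::numeric_limits<double>::infinity()) return hi;
    return hi + std::log1p(std::exp(lo - hi));
}

// Assumed "Pairwise" form: one exp and one log1p per term.
inline double logsumexp_pairwise(const std::vector<double>& terms) {
    double acc = -std::numeric_limits<double>::infinity();
    for (double t : terms) acc = log_add(acc, t);
    return acc;
}

// Assumed "MaxReduce" form: one max pass, one exp per term, a single log.
inline double logsumexp_max_reduce(const std::vector<double>& terms) {
    double m = -std::numeric_limits<double>::infinity();
    for (double t : terms) m = std::max(m, t);
    if (m == -std::numeric_limits<double>::infinity()) return m;
    double sum = 0.0;
    for (double t : terms) sum += std::exp(t - m);
    return m + std::log(sum);
}
```

Both functions return the same value up to rounding, which is the property a parity test would check. The max-reduce form makes only one `log` call per sum, and its `exp` loop has no dependency between iterations, so it is the natural candidate for a vectorized kernel.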
23 changes: 23 additions & 0 deletions benchmark-analysis/fb_contour_sweep_adaptive_static_v1.csv
@@ -0,0 +1,23 @@
mode,n,t,runs,warmup,recurrence_work,emission_work,transition_ms,obs_copy_ms,emission_ms,alloc_ms,forward_ms,backward_ms,reduction_ms,total_ms
adaptive_static_v1,2,1000,5,2,3996,2000,0.0002,0.0006,0.0006,0.0005,0.0555,0.053,0.0001,0.1109
adaptive_static_v1,2,10000,5,2,39996,20000,0.0007,0.0071,0.0045,0.043,0.3578,0.3551,0,0.7707
adaptive_static_v1,2,100000,5,2,399996,200000,0.0026,0.1488,0.2834,0.508,3.8598,3.6578,0.0003,9.0083
adaptive_static_v1,2,1000000,5,2,3999996,2000000,0.0031,2.0429,3.4685,3.7612,36.9812,36.2041,0.0002,82.1594
adaptive_static_v1,4,1000,5,2,15984,4000,0.001,0.0007,0.0106,0.0154,0.2256,0.2209,0.0001,0.4701
adaptive_static_v1,4,10000,5,2,159984,40000,0.0018,0.0104,0.014,0.0139,1.4938,1.5459,0.0005,3.0504
adaptive_static_v1,4,100000,5,2,1599984,400000,0.0036,0.1141,0.58,0.9126,14.5554,14.3194,0.0007,30.568
adaptive_static_v1,8,1000,5,2,63936,8000,0.0012,0.0024,0.0157,0.0294,0.3975,0.3908,0.0002,0.8399
adaptive_static_v1,8,5000,5,2,319936,40000,0.0006,0.0022,0.007,0.0059,1.9524,1.9707,0.0002,3.9503
adaptive_static_v1,8,10000,5,2,639936,80000,0.002,0.0087,0.019,0.2104,3.9859,4.0981,0.0006,8.434
adaptive_static_v1,16,1000,5,2,255744,16000,0.0024,0.0036,0.0276,0.0427,1.4421,1.4556,0.0005,2.9893
adaptive_static_v1,16,2000,5,2,511744,32000,0.0015,0.0017,0.0057,0.0056,2.8761,2.9113,0.0005,5.7923
adaptive_static_v1,16,5000,5,2,1279744,80000,0.0029,0.005,0.0262,0.1948,7.2773,7.3363,0.0007,14.8745
adaptive_static_v1,32,500,5,2,510976,16000,0.0102,0.0007,0.0276,0.0519,4.0494,4.2193,0.0008,8.3801
adaptive_static_v1,32,1000,5,2,1022976,32000,0.0134,0.0031,0.044,0.0831,8.221,8.6986,0.001,17.1867
adaptive_static_v1,32,2000,5,2,2046976,64000,0.0158,0.0056,0.0887,0.1513,16.2641,16.9673,0.001,33.4698
adaptive_static_v1,64,200,5,2,815104,12800,0.0268,0.0006,0.0238,0.0412,8.7132,8.7867,0.0017,17.5748
adaptive_static_v1,64,500,5,2,2043904,32000,0.0417,0.0027,0.0657,0.1169,36.6388,36.9101,0.0019,74.5554
adaptive_static_v1,64,1000,5,2,4091904,64000,0.0355,0.0045,0.1179,0.1798,45.2402,47.7388,0.0015,93.3553
adaptive_static_v1,128,100,5,2,1622016,12800,0.0678,0.0005,0.0268,0.0428,21.5884,25.9046,0.0023,50.4003
adaptive_static_v1,128,250,5,2,4079616,32000,0.0685,0.001,0.0247,0.0602,54.7442,59.1274,0.0025,111.21
adaptive_static_v1,128,500,5,2,8175616,64000,0.0821,0.0013,0.0333,0.032,115.191,122.896,0.0026,231.18
23 changes: 23 additions & 0 deletions benchmark-analysis/fb_contour_sweep_max_reduce.csv
@@ -0,0 +1,23 @@
mode,n,t,runs,warmup,recurrence_work,emission_work,transition_ms,obs_copy_ms,emission_ms,alloc_ms,forward_ms,backward_ms,reduction_ms,total_ms
max_reduce,2,1000,5,2,3996,2000,0.0001,0.0003,0.0004,0.0003,0.0541,0.0557,0,0.1112
max_reduce,2,10000,5,2,39996,20000,0.0003,0.0033,0.0036,0.0029,0.5451,0.5607,0.0001,1.1176
max_reduce,2,100000,5,2,399996,200000,0.0024,0.1024,0.292,0.5074,5.9164,5.8783,0.0006,12.7317
max_reduce,2,1000000,5,2,3999996,2000000,0.0019,1.5644,3.6518,4.0798,61.6187,65.8737,0.0008,138.632
max_reduce,4,1000,5,2,15984,4000,0.0002,0.0003,0.0072,0.0148,0.1365,0.1401,0.0001,0.3002
max_reduce,4,10000,5,2,159984,40000,0.0005,0.0036,0.0072,0.0061,1.3655,1.4421,0.0002,2.8389
max_reduce,4,100000,5,2,1599984,400000,0.0039,0.1803,0.544,0.8251,14.3255,14.7261,0.0007,30.5996
max_reduce,8,1000,5,2,63936,8000,0.0005,0.0024,0.015,0.0308,0.3906,0.4051,0.0002,0.8435
max_reduce,8,5000,5,2,319936,40000,0.0015,0.0127,0.0492,0.094,1.9496,2.0359,0.0003,4.1927
max_reduce,8,10000,5,2,639936,80000,0.0024,0.0097,0.0191,0.1943,3.9162,4.15,0.0005,8.2942
max_reduce,16,1000,5,2,255744,16000,0.0012,0.0027,0.0325,0.045,1.4214,1.4575,0.0004,2.963
max_reduce,16,2000,5,2,511744,32000,0.0018,0.0063,0.0454,0.0944,2.8557,2.9186,0.0006,6.0147
max_reduce,16,5000,5,2,1279744,80000,0.0036,0.0147,0.1311,0.186,7.0892,7.4272,0.0006,15.147
max_reduce,32,500,5,2,510976,16000,0.0045,0.0023,0.0257,0.0451,4.0341,4.1987,0.0008,8.3059
max_reduce,32,1000,5,2,1022976,32000,0.0064,0.0067,0.0439,0.0748,8.1545,8.4885,0.0008,16.8164
max_reduce,32,2000,5,2,2046976,64000,0.0069,0.0067,0.0793,0.151,16.8425,17.4785,0.0013,35.1039
max_reduce,64,200,5,2,815104,12800,0.0297,0.0025,0.0322,0.0434,9.1157,9.1911,0.0018,18.3756
max_reduce,64,500,5,2,2043904,32000,0.0483,0.0029,0.0804,0.1053,27.1055,28.3244,0.0024,55.0267
max_reduce,64,1000,5,2,4091904,64000,0.0318,0.0042,0.1039,0.1689,62.8022,63.4727,0.0016,120.995
max_reduce,128,100,5,2,1622016,12800,0.071,0.0007,0.0337,0.0426,21.6621,21.5886,0.0024,43.8249
max_reduce,128,250,5,2,4079616,32000,0.0696,0.0008,0.0513,0.0852,77.0032,61.7649,0.0023,137.852
max_reduce,128,500,5,2,8175616,64000,0.0756,0.0031,0.085,0.1356,128.719,119.591,0.0025,243.712
23 changes: 23 additions & 0 deletions benchmark-analysis/fb_contour_sweep_pairwise.csv
@@ -0,0 +1,23 @@
mode,n,t,runs,warmup,recurrence_work,emission_work,transition_ms,obs_copy_ms,emission_ms,alloc_ms,forward_ms,backward_ms,reduction_ms,total_ms
pairwise,2,1000,5,1,3996,2000,0.0001,0.0003,0.0004,0.0003,0.0343,0.0336,0.0001,0.0693
pairwise,2,10000,5,1,39996,20000,0.0001,0.0024,0.0047,0.0023,0.3434,0.3354,0,0.6895
pairwise,2,100000,5,1,399996,200000,0.001,0.1048,0.2501,0.4206,3.461,3.3926,0.0001,7.6391
pairwise,2,1000000,5,1,3999996,2000000,0.0049,1.5373,2.8471,3.7466,34.7657,34.3781,0.0004,78.5542
pairwise,4,1000,5,1,15984,4000,0.0003,0.0004,0.0101,0.0187,0.2189,0.2153,0.0001,0.4634
pairwise,4,10000,5,1,159984,40000,0.0019,0.0122,0.0167,0.0218,3.4942,3.2695,0.0002,6.8535
pairwise,4,100000,5,1,1599984,400000,0.0033,0.1415,0.6652,1.1502,29.2175,26.0248,0.0002,58.7034
pairwise,8,1000,5,1,63936,8000,0.0005,0.0034,0.0159,0.0316,1.166,1.1765,0.0002,2.3957
pairwise,8,5000,5,1,319936,40000,0.0016,0.0156,0.052,0.1019,5.8452,5.8658,0.0002,11.8913
pairwise,8,10000,5,1,639936,80000,0.0022,0.0079,0.0197,0.204,11.6961,11.7406,0.0002,23.715
pairwise,16,1000,5,1,255744,16000,0.0019,0.0042,0.0326,0.0477,5.3054,5.3313,0.0004,10.7288
pairwise,16,2000,5,1,511744,32000,0.0033,0.0073,0.0434,0.0883,10.6612,10.8194,0.0005,21.7072
pairwise,16,5000,5,1,1279744,80000,0.0051,0.0149,0.0966,0.2077,26.5814,26.6937,0.0005,53.6173
pairwise,32,500,5,1,510976,16000,0.0047,0.0028,0.029,0.044,9.7704,9.8929,0.0006,19.7958
pairwise,32,1000,5,1,1022976,32000,0.0058,0.0047,0.0453,0.0761,19.5781,19.7934,0.0007,39.505
pairwise,32,2000,5,1,2046976,64000,0.0064,0.0065,0.0791,0.1424,39.3132,40.2802,0.0008,80.4737
pairwise,64,200,5,1,815104,12800,0.0311,0.0022,0.0302,0.0409,14.4688,14.2692,0.0014,28.7968
pairwise,64,500,5,1,2043904,32000,0.0293,0.002,0.0509,0.0823,37.0369,38.7809,0.0014,76.2688
pairwise,64,1000,5,1,4091904,64000,0.0298,0.0036,0.0765,0.1626,70.9994,71.0655,0.0013,142.836
pairwise,128,100,5,1,1622016,12800,0.0658,0.0008,0.0361,0.044,27.5451,27.7767,0.002,55.5736
pairwise,128,250,5,1,4079616,32000,0.0637,0.0008,0.0164,0.0593,66.9222,67.2184,0.002,134.272
pairwise,128,500,5,1,8175616,64000,0.0677,0.001,0.0482,0.0731,133.704,135.611,0.0023,269.665
@@ -0,0 +1,43 @@
mode,n,t,runs,warmup,fb_total_ms,forward_ms,backward_ms
max_reduce,2,500,5,2,0.3,0.114,0.13
max_reduce,2,1000,5,2,0.637,0.233,0.252
max_reduce,2,2000,5,2,1.217,0.467,0.527
max_reduce,2,5000,5,2,3.092,1.191,1.347
max_reduce,2,10000,5,2,6.021,2.482,2.443
max_reduce,2,100000,5,2,63.802,26.135,26.283
max_reduce,3,500,5,2,0.589,0.234,0.258
max_reduce,3,1000,5,2,1.107,0.455,0.501
max_reduce,3,2000,5,2,2.289,0.94,1.034
max_reduce,3,5000,5,2,5.686,2.326,2.592
max_reduce,3,10000,5,2,12.027,4.796,5.664
max_reduce,3,100000,5,2,120.989,49.523,55.446
max_reduce,4,500,5,2,0.884,0.372,0.416
max_reduce,4,1000,5,2,1.879,0.792,0.877
max_reduce,4,2000,5,2,3.776,1.606,1.767
max_reduce,4,5000,5,2,9.505,4.148,4.381
max_reduce,4,10000,5,2,19.404,8.402,8.949
max_reduce,4,100000,5,2,201.829,84.693,96.849
max_reduce,5,500,5,2,1.317,0.568,0.632
max_reduce,5,1000,5,2,2.775,1.196,1.337
max_reduce,5,2000,5,2,5.672,2.391,2.801
max_reduce,5,5000,5,2,13.83,5.923,6.682
max_reduce,5,10000,5,2,29.043,12.056,14.445
max_reduce,5,100000,5,2,291.988,124.124,142.458
max_reduce,6,500,5,2,1.933,0.836,0.951
max_reduce,6,1000,5,2,4.947,2.178,2.407
max_reduce,6,2000,5,2,8.027,3.517,3.891
max_reduce,6,5000,5,2,19.475,8.439,9.547
max_reduce,6,10000,5,2,39.116,17.027,19.181
max_reduce,6,100000,5,2,410.151,176.87,203.052
max_reduce,7,500,5,2,2.623,1.146,1.304
max_reduce,7,1000,5,2,5.839,2.317,3.179
max_reduce,7,2000,5,2,10.765,4.824,5.204
max_reduce,7,5000,5,2,25.732,11.46,12.566
max_reduce,7,10000,5,2,53.622,23.214,27.048
max_reduce,7,100000,5,2,548.109,240.248,271.739
max_reduce,8,500,5,2,3.935,1.592,2.096
max_reduce,8,1000,5,2,7.416,3.137,3.887
max_reduce,8,2000,5,2,13.338,5.863,6.718
max_reduce,8,5000,5,2,35.927,14.932,19.053
max_reduce,8,10000,5,2,67.716,29.651,34.379
max_reduce,8,100000,5,2,707.026,309.823,357.473