perf: copy optimization #799

Merged
jodavies merged 1 commit into form-dev:master from jodavies:memmove
Mar 10, 2026

@jodavies
Collaborator

Here is an optimisation experiment that replaces all NCOPY/WCOPY macros with memmove. (memcpy is not an option, because we can't be sure that the memory regions never overlap wherever the macros are used.) The replacement alone gives a negligible performance improvement (tentatively 1%?), which is hard to detect within the usual run-to-run variation.
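To illustrate why overlap matters here, the following is a minimal sketch, not FORM's actual code. `WORD_T`, `FORWARD_COPY`, `shift_up_forward` and `shift_up_memmove` are hypothetical names introduced for this example, and the macro is only assumed to resemble an element-wise forward copy. When the destination starts inside the source region, a forward loop re-reads words it has already overwritten, whereas memmove is specified to behave as if the source were copied to a temporary buffer first. memcpy has undefined behaviour for overlapping regions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int WORD_T; /* hypothetical stand-in for a machine-word type */

/* Hypothetical element-wise forward copy (destination, source, count).
   This is an assumption for illustration, not the real NCOPY/WCOPY. */
#define FORWARD_COPY(dst, src, n)            \
    do {                                     \
        WORD_T *d_ = (dst);                  \
        const WORD_T *s_ = (src);            \
        long n_ = (n);                       \
        while (n_-- > 0) *d_++ = *s_++;      \
    } while (0)

/* Move `count` words starting at buf[0] up by `shift` places with the
   forward loop. If shift < count the regions overlap and the loop
   reads words that it has already overwritten. */
void shift_up_forward(WORD_T *buf, long count, long shift)
{
    assert(count >= 0 && shift >= 0);
    FORWARD_COPY(buf + shift, buf, count);
}

/* Same move with memmove, which is defined for overlapping regions. */
void shift_up_memmove(WORD_T *buf, long count, long shift)
{
    assert(count >= 0 && shift >= 0);
    memmove(buf + shift, buf, (size_t)count * sizeof *buf);
}
```

With a six-word buffer holding 1, 2, 3, 4, 0, 0, moving the first four words up by two places gives 1, 2, 1, 2, 3, 4 with memmove, while the forward loop produces 1, 2, 1, 2, 1, 2. A forward loop is still correct when the destination lies below the source, which is one reason existing call sites can work with such a macro.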

The follow-up commits improve some existing copies in the code by using the macros instead and by moving some conditionals outside of the copy loops. I identified the expensive copies with a profiler running the Forcer benchmark.
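As a rough illustration of what moving a conditional outside of a copy means, here is a minimal sketch under assumed names. `copy_words_branchy`, `copy_words_hoisted` and the `negate` flag are hypothetical and do not come from the FORM sources. The flag does not change during the loop, so it can be tested once before the loop instead of once per word, which leaves a plain copy loop on the common path.

```c
#include <assert.h>
#include <stddef.h>

/* Loop-invariant flag tested on every iteration. */
void copy_words_branchy(int *dst, const int *src, size_t n, int negate)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        if (negate) dst[i] = -src[i];
        else        dst[i] = src[i];
    }
}

/* Flag tested once; each branch contains a tight loop without a test
   on the flag. The results are identical to the version above. */
void copy_words_hoisted(int *dst, const int *src, size_t n, int negate)
{
    assert(dst != NULL && src != NULL);
    if (negate) {
        for (size_t i = 0; i < n; i++) dst[i] = -src[i];
    } else {
        for (size_t i = 0; i < n; i++) dst[i] = src[i];
    }
}
```

Compilers can sometimes perform this transformation themselves (loop unswitching), but whether they do depends on the optimisation level and on how the loop is written, so doing it by hand in hot loops found by a profiler can still pay off.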

On my system (Ryzen 7900X, Ubuntu 24.04, GCC 13.3.0, tform -w12), the results for the usual benchmarks are as follows:

| Benchmark | Speedup w.r.t. v5.0.0 |
|---|---|
| chromatic | 1.05 ± 0.01 |
| color | 1.01 ± 0.01 |
| fmft | 1.03 ± 0.01 |
| forcer | 1.07 ± 0.00 |
| forcer-exp | 1.08 ± 0.01 |
| mass-fact | 1.00 ± 0.05 |
| mbox1l | 1.01 ± 0.02 |
| minceex | 1.07 ± 0.02 |
| mincer | 1.00 ± 0.05 |
| sort-disk | 0.98 ± 0.02 |
| sort-large | 0.99 ± 0.01 |
| sort-small | 1.01 ± 0.01 |
| trace | 1.02 ± 0.01 |

@vermaseren
Collaborator

vermaseren commented Feb 15, 2026 via email

@coveralls

coveralls commented Feb 16, 2026

Coverage Status

coverage: 58.029% (-0.01%) from 58.043%
when pulling c5931f9 on jodavies:memmove
into c134010 on form-dev:master.

Hoist conditionals out of some data copying loops or simplify while loop
termination conditions. Use of memmove does not measurably affect performance,
leave a comment about this.
@jodavies
Collaborator Author

I ran the benchmarks with more samples, and I think the use of memmove indeed doesn't lead to a measurable performance difference. I cleaned up the commits to include only the obvious wins and to use the original macros.

This results in a 6-7% improvement for Forcer, 3% for mincer-exact, and 0-1% for everything else.

@jodavies changed the title from "WIP copy optimisation" to "perf: copy optimization" on Feb 26, 2026
@tueda
Collaborator

tueda commented Feb 26, 2026

Coveralls is completely down... See: https://status.coveralls.io/

@jodavies
Collaborator Author

jodavies commented Mar 5, 2026

Here are benchmark numbers for a much older Intel system with 2x Xeon E5-2667 v4, running tform -w16, Ubuntu 24.04, GCC 13.3.

| Benchmark | Speedup w.r.t. v5.0.0 |
|---|---|
| chromatic | 1.03 ± 0.01 |
| color | 1.08 ± 0.01 |
| fmft | 1.04 ± 0.02 |
| forcer-exp | 1.13 ± 0.00 |
| forcer | 1.11 ± 0.01 |
| mass-fact | 1.03 ± 0.01 |
| mbox1l | 1.01 ± 0.02 |
| minceex | 1.08 ± 0.02 |
| mincer | 1.02 ± 0.01 |
| sort-disk | 1.04 ± 0.02 |
| sort-large | 0.98 ± 0.03 |
| sort-small | 1.05 ± 0.03 |
| trace | 1.00 ± 0.05 |

@jodavies jodavies merged commit 2f63692 into form-dev:master Mar 10, 2026
230 of 255 checks passed
@tueda
Collaborator

tueda commented Mar 11, 2026

Benchmark results for my system (Intel Core i9-12900, Ubuntu 20.04, x86_64) with tform -w8. Used /tmp instead of /dev/shm for technical reasons. Clear improvements, especially for forcer, forcer-exp and minceex.

| Benchmark | Speedup | 95% bootstrap CI |
|---|---|---|
| chromatic | 1.05 | [1.05, 1.05] |
| color | 1.08 | [1.07, 1.08] |
| fmft | 1.05 | [1.04, 1.05] |
| forcer | 1.20 | [1.20, 1.21] |
| forcer-exp | 1.26 | [1.26, 1.27] |
| mbox1l | 1.03 | [1.03, 1.04] |
| minceex | 1.12 | [1.11, 1.12] |
| mincer | 1.03 | [1.03, 1.03] |
| sort-disk | 1.00 | [1.00, 1.01] |
| sort-large | 1.00 | [0.99, 1.01] |
| sort-small | 1.01 | [1.01, 1.02] |
| trace | 0.99 | [0.98, 0.99] |
Details

Speedup of B over A (mean) = (mean time of A) / (mean time of B)
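The definition above can be written out as a small helper. This is only a sketch of the stated formula, not the benchmark scripts used here. `mean_time` and `speedup_b_over_a` are hypothetical names, and the bootstrap confidence interval is not reproduced.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n wall-clock times (in seconds). */
double mean_time(const double *t, size_t n)
{
    assert(t != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += t[i];
    return sum / (double)n;
}

/* Speedup of B over A = (mean time of A) / (mean time of B).
   A value above 1 means that B is faster than A. */
double speedup_b_over_a(const double *a, size_t na,
                        const double *b, size_t nb)
{
    double mb = mean_time(b, nb);
    assert(mb > 0.0);
    return mean_time(a, na) / mb;
}
```

For example, made-up timings with a mean of 12 s for A and 10 s for B give a speedup of 1.2.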

A:

TFORM 5.0.0 (Jan 27 2026, v5.0.0)
-backtrace  +flint=3.4.0  +gmp=6.3.0   -mpi    +pthreads  +zlib=1.2.11
-debugging  +float        +mpfr=4.2.2  +posix  -windows   +zstd=1.4.4
Compiler: GCC 10.5.0
Architecture: x86_64

B:

TFORM 5.0.0 (Feb 24 2026, v5.0.0-1-gc5931f9)
-backtrace  +flint=3.4.0  +gmp=6.3.0   -mpi    +pthreads  +zlib=1.2.11
-debugging  +float        +mpfr=4.2.2  +posix  -windows   +zstd=1.4.4
Compiler: GCC 10.5.0
Architecture: x86_64

Paired runs with n = 30 per benchmark with /tmp instead of /dev/shm. Used the scripts from this snapshot. The binaries were built for the x86-64-v1 baseline.

Environment:

OS: Ubuntu 20.04.6 LTS
Kernel: Linux 5.15.0-84-generic
Architecture: x86_64
CPU: Intel Core i9-12900
CPU configuration: 16 cores / 24 threads (8 P-cores + 8 E-cores)
Memory: 62.6 GiB
Storage: WD_BLACK SN770 1TB NVMe SSD
