Conversation
|
I think that most of the gain comes from copies with very many words, while the smaller copies give a loss.
memcpy has to perform some tests, and hence has an overhead. I looked into this with Jan Kuipers, because he liked memcpy,
but we had to track down a number of bad bugs caused by overwriting, and because of that we had to study those routines.
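To illustrate the overwriting problem, here is a minimal sketch. It assumes a plain forward word-by-word loop as a stand-in for the copy macros; the function names are hypothetical and this is not FORM's actual code. When the destination overlaps the source and lies above it, the forward loop reads words it has already overwritten. memmove is specified to behave as if it copied through a temporary buffer, so it gives the expected result (memcpy on overlapping regions is undefined behaviour).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a word-by-word copy loop (not FORM's macro). */
static void copy_words_forward(int *dst, const int *src, size_t n)
{
    while (n-- > 0) *dst++ = *src++;
}

/* Same interface, but well defined for overlapping regions. */
static void copy_words_memmove(int *dst, const int *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    memmove(dst, src, n * sizeof *dst);
}

/* Shift the first 4 words of {1,2,3,4,5} up by one position using the
 * given routine. Returns 1 if the buffer ends up as {1,1,2,3,4}. */
static int shift_up_is_correct(void (*copy)(int *, const int *, size_t))
{
    int buf[5] = {1, 2, 3, 4, 5};
    static const int expected[5] = {1, 1, 2, 3, 4};
    copy(buf + 1, buf, 4);   /* destination overlaps source, one word higher */
    return memcmp(buf, expected, sizeof buf) == 0;
}
```

With the forward loop the buffer ends up as all ones, so `shift_up_is_correct(copy_words_forward)` returns 0, while `shift_up_is_correct(copy_words_memmove)` returns 1.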
… On 14 Feb 2026, at 14:20, jodavies ***@***.***> wrote:
Here is an optimisation experiment, replacing all NCOPY/WCOPY macros with memmove (memcpy is not an option, since we can't be sure that the memory regions never overlap where the macros are used). The replacement alone gives a negligible performance improvement (tentatively 1%?), which is hard to detect within the usual run-to-run variation.
The followup commits improve some existing copies within the code by using the macros instead, and moving some conditionals outside of the copies. I identified the expensive copies with a profiler running the Forcer benchmark.
On my system (Ryzen 7900X, Ubuntu 24.04, GCC 13.3.0, tform -w12), the results for the usual benchmarks are as follows:
| Benchmark | Speedup w.r.t. v5.0.0 |
|---|---|
| chromatic | 1.05 ± 0.01 |
| color | 1.01 ± 0.01 |
| fmft | 1.03 ± 0.01 |
| forcer | 1.07 ± 0.00 |
| forcer-exp | 1.08 ± 0.01 |
| mass-fact | 1.00 ± 0.05 |
| mbox1l | 1.01 ± 0.02 |
| minceex | 1.07 ± 0.02 |
| mincer | 1.00 ± 0.05 |
| sort-disk | 0.98 ± 0.02 |
| sort-large | 0.99 ± 0.01 |
| sort-small | 1.01 ± 0.01 |
| trace | 1.02 ± 0.01 |
You can view, comment on, or merge this pull request online at:
#799
Commit Summary
d2e8ae4 perf: use memmove for all NCOPY/WCOPY macros
2442e58 perf: tform: use NCOPY for an expensive copy in PutToMaster
d93c202 perf: improve copies in InFunction
25d1630 perf: improve copies in PrepPoly
8941fac perf: improve copies in PutBracket, DoIfStatement, Generator, PrepPoly
File Changes (6 files <https://github.com/form-dev/form/pull/799/files>)
M sources/declare.h <https://github.com/form-dev/form/pull/799/files#diff-6f8d49bdc8f58224062fa24277a820c1bea7ab29bb935d4f755d0697912f50b6> (12)
M sources/execute.c <https://github.com/form-dev/form/pull/799/files#diff-f625dc8f12d0c293df79fc18df8fb1340aeaabdbc09efe3e9a06d7c88773bf26> (4)
M sources/if.c <https://github.com/form-dev/form/pull/799/files#diff-5f3f8762053cab97ba35c5105c4cde000ac3a0f93f3d8028da1649f55cf550f1> (2)
M sources/proces.c <https://github.com/form-dev/form/pull/799/files#diff-7d9b915201d0ad01f5242de4aa4a2660d84d7c66ffe4b2a3b44e68628709f931> (62)
M sources/sort.c <https://github.com/form-dev/form/pull/799/files#diff-fe8d53f77e0481de37e05715c6377529b32237680a601ddf3ea1b4530b15d604> (6)
M sources/threads.c <https://github.com/form-dev/form/pull/799/files#diff-bc875f2135c9865c088b27fae715e0b913cef31c269e97c0b79fcf84b4c77f8c> (6)
Patch Links:
https://github.com/form-dev/form/pull/799.patch
https://github.com/form-dev/form/pull/799.diff
|
Hoist conditionals out of some data-copying loops and simplify some while-loop termination conditions. Use of memmove does not measurably affect performance; leave a comment about this.
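As an illustration of what hoisting a conditional out of a copy loop means, here is a minimal sketch. The functions, the `negate` flag and the `word_t` type are hypothetical and are not taken from the FORM sources. The first version tests a loop-invariant flag on every iteration; the second tests it once and then runs a loop with no branch in its body. An optimising compiler may already perform this transformation itself (loop unswitching), so any gain has to be measured.

```c
#include <assert.h>
#include <stddef.h>

typedef int word_t;   /* hypothetical word type, for illustration only */

/* Before: the loop-invariant flag is tested on every iteration. */
static void copy_branch_inside(word_t *dst, const word_t *src,
                               size_t n, int negate)
{
    for (size_t i = 0; i < n; i++) {
        if (negate) dst[i] = -src[i];
        else        dst[i] = src[i];
    }
}

/* After: the flag is tested once; each loop body is a plain copy. */
static void copy_branch_hoisted(word_t *dst, const word_t *src,
                                size_t n, int negate)
{
    assert(dst != NULL && src != NULL);
    if (negate) {
        for (size_t i = 0; i < n; i++) dst[i] = -src[i];
    } else {
        for (size_t i = 0; i < n; i++) dst[i] = src[i];
    }
}
```

Both functions produce the same output for any input; only the position of the test changes.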
|
I ran the benchmarks with more samples, and I think the use of memmove indeed doesn't lead to a measurable performance difference. I cleaned up the commits to include only the obvious wins and to use the original macros. This results in a 6-7% improvement for Forcer, 3% for mincer-exact, and 0-1% for everything else.
|
Coveralls is completely down... See: https://status.coveralls.io/
|
Here are benchmark numbers for a much older Intel system with 2x Xeon E5-2667 v4, running
|
|
Benchmark results for my system (Intel Core i9-12900, Ubuntu 20.04, x86_64) with
Details
Speedup of B over A (mean) = (mean time of A) / (mean time of B)
A:
B:
Paired runs with n = 30 per benchmark with
Environment:
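The speedup definition above can be sketched as follows. The function names and the array-of-timings interface are assumptions for illustration; this is not the script that produced the numbers.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timings (n must be positive). */
static double mean_time(const double *t, size_t n)
{
    assert(t != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += t[i];
    return sum / (double)n;
}

/* Speedup of B over A = (mean time of A) / (mean time of B).
 * A value above 1 means B is faster on average. */
static double speedup_of_b_over_a(const double *a, const double *b, size_t n)
{
    return mean_time(a, n) / mean_time(b, n);
}
```

For example, timings of 2.0 s and 4.0 s for A against 1.0 s and 2.0 s for B give a speedup of 2.0.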
|