# vsggroups flaws as a benchmarking tool #1698

AnyOldName3 started this conversation in General
The vsggroups example is a pretty useful tool for evaluating the impact of changes to the core VSG classes, and as I understand it, was key to evaluating the performance of `vsg::ref_ptr` vs `std::shared_ptr` and the VSG allocator versus just using `new`. However, I've noticed some flaws, so thought they'd better be documented.

### `std::shared_ptr` control block

When created with `std::make_shared`, the control block for `std::shared_ptr` is collocated with the object, so in between instances of the object, there'll be a 64-bit vtable pointer and two 64-bit atomic integer reference counts. My testing suggests that this is about as impactful to cache hit rates and performance during scenegraph traversal as the fact that `std::shared_ptr` is typically twice the size of `vsg::ref_ptr` or a plain C pointer. That means that the VSG could opt out of about half the cost of using `std::shared_ptr` by making `vsg::Inherit::create` construct an instance and let a `shared_ptr` adopt it, rather than having `make_shared` construct the instance. This isn't mentioned in the page on the website explaining the justification for `ref_ptr`, so it possibly wasn't taken into account when the original decision was made.
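To illustrate the difference, here's a minimal sketch using plain `std::shared_ptr` with a stand-in `Node` class (not VSG's actual `Inherit::create` implementation):

```cpp
#include <memory>

struct Node
{
    virtual ~Node() = default;
    // ... node data ...
};

// Collocated: one allocation holding {control block, Node} back to back, so
// the control block's vtable pointer and two reference counts sit between
// consecutive nodes in memory and occupy data cache during traversal.
std::shared_ptr<Node> collocated = std::make_shared<Node>();

// Adopted: the Node is allocated on its own and the control block lives in a
// separate allocation, keeping the nodes themselves more densely packed.
std::shared_ptr<Node> adopted(new Node());
```

The adopted form costs an extra allocation per object, which is the usual argument in favour of `make_shared`; the point above is that for a traversal-bound scenegraph, the cache-layout cost of collocation can outweigh that saving.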
### `std::shared_ptr` allocator

We know the VSG allocator helps cache hit rates, so it would be easier to do a like-for-like comparison against `vsg::ref_ptr` if the `std::shared_ptr` tests used `std::allocate_shared` with the nodes allocator affinity adapter. As expected, doing so improves performance.
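A sketch of what that change might look like, assuming the adapter in question is `vsg::allocator_affinity_nodes` from `vsg/core/Allocator.h` (the std-allocator wrapper VSG uses for node containers; the `SharedPtrGroup` name is invented for illustration):

```cpp
#include <memory>
#include <vsg/core/Allocator.h>

struct SharedPtrGroup
{
    // ... children and other node data ...
};

// std::allocate_shared routes the single {control block + object} allocation
// through the supplied allocator, so the shared_ptr test's nodes end up in
// the same VSG allocator memory blocks as the ref_ptr test's nodes.
auto group = std::allocate_shared<SharedPtrGroup>(
    vsg::allocator_affinity_nodes<SharedPtrGroup>{});
```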
### `std::shared_ptr` quad group

It's not especially obvious from the command line option name that the `shared_ptr` test uses a fixed group size, and is therefore only comparable against the quad group test rather than the default test. As the quad group ends up with the pointers inline with the node instead of living in a separate allocation, this can balance out the performance hit from not using `ref_ptr`, so anyone who didn't notice might interpret their results as saying that `shared_ptr` is no slower. Based on my testing, with a non-fixed-size `shared_ptr`-based group node, the performance gap compared to `ref_ptr` is bigger than it is with the existing quad group classes.
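The layout difference is easy to see from the shape of the types (a simplified sketch; the real `vsg::QuadGroup` and `vsg::Group` carry more than this):

```cpp
#include <array>
#include <memory>
#include <vector>

struct Node { virtual ~Node() = default; };

// Fixed-size group: the four child pointers are stored inline, so visiting a
// group's children touches no extra allocation.
struct FixedQuadGroup : Node
{
    std::array<std::shared_ptr<Node>, 4> children;
};

// Variable-size group: the children live in a separate heap allocation owned
// by the vector, adding a pointer chase (and a likely cache miss) per group.
struct DynamicGroup : Node
{
    std::vector<std::shared_ptr<Node>> children;
};
```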
### Sensitivity to binary layout

As this is a cache-bound benchmark, and all modern CPUs have shared instruction and data caches outside the first level, any L1 instruction cache miss results in a trip to main memory, because nodes and node pointers will have been fetched into any L2 or L3 cache lines that might have held instructions. Ideally, as not very many unique functions get called during a `vsggroups` traversal test, they'd remain in the instruction cache for the whole duration of the run. Based on my testing, though, even when everything should fit and there are no associativity issues from instructions being forced to compete for the same cache set, instructions are getting evicted and need refetching, and it happens often enough to have a measurable impact on performance.

I noticed this after trying to measure the impact of some changes and realising that the performance of MSVC Release versus RelWithDebInfo builds was significantly different. The differences between these build configurations are:
- `/Ob1` versus `/Ob2`, affecting inlining eligibility, but not for any of the functions called during traversal: virtual calls aren't eligible under either flag, and everything else used is eligible under both flags.
- the `/debug` flag, which, as well as controlling the generation of debug symbols, also changes the default values for `/OPT:REF` and `/OPT:ICF` when they're not explicitly set. `/OPT:REF` excludes provably unreachable functions from the generated binary, and `/OPT:ICF` deduplicates functions that compile to identical assembly.
When I tried them individually, both `/OPT:REF` vs `/OPT:NOREF` and `/OPT:ICF` vs `/OPT:NOICF` had significant but inconsistent impacts on the traversal time reported by the vsggroups example, despite not changing the sequence of instructions executed or the memory addresses accessed by those instructions. Ignoring its large magnitude, `/OPT:REF`'s impact is relatively unsurprising: if you exclude the uncalled functions from a binary, it's more likely that a function will end up adjacent to another that it calls, and therefore in the same cache line, so its callee doesn't need fetching separately when it's first called. `/OPT:ICF` is less predictable: if two functions are merged, then when one's in the L1 instruction cache, the other one will be there too, but simple functions are likely to be equivalent to many other simple functions, any of which could end up being the one that's kept, and only some of them will be adjacent to other functions called at the same time.
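A hypothetical example of the kind of merge `/OPT:ICF` performs (the names are invented; any two functions that compile to byte-identical machine code are candidates):

```cpp
#include <cstddef>
#include <vector>

struct Node {};
struct Light {};

// Despite the different element types, these compile to identical machine
// code, so with /OPT:ICF the linker may keep one copy and point both symbols
// at it. Whether the surviving copy lands near the traversal hot loop in the
// binary is effectively arbitrary.
std::size_t countNodes(const std::vector<Node*>& nodes) { return nodes.size(); }
std::size_t countLights(const std::vector<Light*>& lights) { return lights.size(); }
```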
Intuitively, this doesn't seem like something that's worth worrying about, but I ended up with some pretty surprising results. For example, I had a pair of binaries, one Release and one RelWithDebInfo, where traversal was consistently a whole 22% slower with the Release build; when I later used the same build settings to make another Release and RelWithDebInfo pair with a minor change to code not called during traversal, the performance difference disappeared, i.e. both builds were as fast as the older RelWithDebInfo build. 22% is quite a big number to be caused by what is essentially just luck, and with the wrong combination of experimental changes and luck, someone running the benchmark could get very misleading performance results.

I've spent a while trying to investigate this without coming up with a satisfying explanation for why the impact is sometimes suspiciously large and other times less so, but I've not been able to do more than debunk plenty of hypotheses. The reported L1 instruction cache miss rate doesn't end up wildly different between fast and slow builds, but given that so many other things are identical, it's the only thing left that makes sense. Possibly there's just something, e.g. a Spectre mitigation, causing the cache to be cleared without the resulting misses contributing to the performance counters, but I'm in territory where the CPU has stopped being something that can be definitively reasoned about and is just a magic rock with lightning trapped inside that's been tricked into doing maths.
In real-world VSG-based applications, I don't think there's any need to worry about any of this, but it does mean that we can't trust vsggroups traversal timings to be especially correlated with real-world performance.
Based on my results and on reading the manuals for the big three compilers and their linkers, while they're all capable of doing the relevant optimisations, only for MSVC Windows builds will the VSG/vsgExamples CMake ever enable them. It's pretty reasonable to expect that VSG-based applications will do so (e.g. by enabling link-time optimisation), but I wouldn't expect the effect to be so inconsistent and unpredictable with a larger binary.
However, even without these two specific optimisations enabled, there are plenty of ways to subtly affect where functions end up in the final binary. I've not seen double-digit percentage swings without them, but I have seen swings in the region of 1-3% (consistently reproducible when rerunning the same binary) from changing the implementation of functions not called during traversal, which should have no impact at all, e.g. logging something after traversal has finished, or even changing a function that doesn't get called at all but that the linker will include anyway. I'd also suspect that the same impact is possible from things like renaming source files or class names, which have absolutely no business changing runtime performance, but will affect the order the linker consumes things in, and therefore the order things appear in the binary.
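As a concrete (hypothetical) illustration of the first kind of change, code like the following only runs after the timed traversal has finished, so it can't affect what the benchmark measures directly, yet adding or removing it can shift where the traversal functions land in the binary:

```cpp
#include <chrono>
#include <iostream>

// Called strictly after the timed section; its only plausible effect on the
// benchmark is via the code layout of the final binary.
void reportExtraStats(std::chrono::duration<double> traversalTime)
{
    std::cout << "traversal took " << traversalTime.count() << " s\n";
}
```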
The only conclusion to draw from this is that comparisons of vsggroups traversal times aren't reliable at all for MSVC Release builds, and on all other platforms and configurations, aren't reliable when they show only a couple of percentage points of difference. For changes that have a small impact, you can still get a reasonable indication of their real-world performance by building the example several times with small tweaks in each translation unit to add some randomness to the binary layouts, and with more than one compiler, then comparing the results for a collection of builds with the change versus without it, but obviously that's much more time-consuming and laborious than a simple before-and-after.
As with all performance testing, synthetic benchmarks just aren't as indicative of real-world performance as they'd ideally be.
### Outcome
Some of the issues I've discussed will be addressed by a PR I'll submit soon. However, there's nothing that can be done about the unwanted sensitivity to the binary layout. If there are decisions that have been made based mainly on vsggroups timings, it might be worth validating whether they can be reproduced with a more realistic test. I still don't have access to any VSG-based apps that are bound by scenegraph traversal rather than application logic, so I wouldn't expect to get any good data if I did this on my end.