# vsggroups flaws as a benchmarking tool #1698

AnyOldName3 started this conversation in General
The vsggroups example is a pretty useful tool for evaluating the impact of changes to the core VSG classes, and as I understand it, was key to evaluating the performance of `vsg::ref_ptr` vs `std::shared_ptr` and the VSG allocator versus just using `new`. However, I've noticed some flaws, so thought they'd better be documented.

### `std::shared_ptr` control block

When created with `std::make_shared`, the control block for `std::shared_ptr` is collocated with the object, so in between instances of the object, there'll be a 64-bit vtable pointer and two 64-bit atomic integer reference counts. My testing suggests that this is about as impactful to cache hit rates and performance during scenegraph traversal as the fact that `std::shared_ptr` is typically twice the size of `vsg::ref_ptr` or a plain C pointer. That means that the VSG could opt out of about half the cost of using `std::shared_ptr` by making `vsg::Inherit::create` construct an instance and let a `shared_ptr` adopt it, rather than having `make_shared` construct the instance. This isn't mentioned in the page on the website explaining the justification for `ref_ptr`, so it possibly wasn't taken into account when the original decision was made.
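To illustrate the difference, here's a minimal sketch using plain `std::shared_ptr` with a stand-in `Node` class (not VSG's actual `Inherit::create` implementation):

```cpp
#include <memory>

struct Node
{
    virtual ~Node() = default;
    // ... node data ...
};

// Collocated: one allocation holding {control block, Node} back to back, so
// the control block's vtable pointer and two reference counts sit between
// consecutive nodes in memory and occupy data cache during traversal.
std::shared_ptr<Node> collocated = std::make_shared<Node>();

// Adopted: the Node is allocated on its own and the control block lives in a
// separate allocation, keeping the nodes themselves more densely packed.
std::shared_ptr<Node> adopted(new Node());
```

The adopted form costs an extra allocation per object, which is the usual argument in favour of `make_shared`; the point above is that for a traversal-bound scenegraph, the cache-layout cost of collocation can outweigh that saving.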
### `std::shared_ptr` allocator

We know the VSG allocator helps cache hit rates, so it would be easier to do a like-for-like comparison against `vsg::ref_ptr` if the `std::shared_ptr` tests used `std::allocate_shared` with the nodes allocator affinity adapter. As expected, doing so improves performance.
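A sketch of what that change might look like, assuming the adapter in question is `vsg::allocator_affinity_nodes` from `vsg/core/Allocator.h` (the std-allocator wrapper VSG uses for node containers; the `SharedPtrGroup` name is invented for illustration):

```cpp
#include <memory>
#include <vsg/core/Allocator.h>

struct SharedPtrGroup
{
    // ... children and other node data ...
};

// std::allocate_shared routes the single {control block + object} allocation
// through the supplied allocator, so the shared_ptr test's nodes end up in
// the same VSG allocator memory blocks as the ref_ptr test's nodes.
auto group = std::allocate_shared<SharedPtrGroup>(
    vsg::allocator_affinity_nodes<SharedPtrGroup>{});
```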
### `std::shared_ptr` quad group

It's not especially obvious from the command line option name that the `shared_ptr` test uses a fixed group size, and is therefore only comparable against the quad group test rather than the default test. As the quad group ends up with the pointers inline with the node instead of living in a separate allocation, this can balance out the performance hit from not using `ref_ptr`, so anyone who didn't notice might interpret their results as saying that `shared_ptr` is no slower. Based on my testing, with a non-fixed-size `shared_ptr`-based group node, the performance gap compared to `ref_ptr` is bigger than it is with the existing quad group classes.
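The layout difference is easy to see from the shape of the types (a simplified sketch; the real `vsg::QuadGroup` and `vsg::Group` carry more than this):

```cpp
#include <array>
#include <memory>
#include <vector>

struct Node { virtual ~Node() = default; };

// Fixed-size group: the four child pointers are stored inline, so visiting a
// group's children touches no extra allocation.
struct FixedQuadGroup : Node
{
    std::array<std::shared_ptr<Node>, 4> children;
};

// Variable-size group: the children live in a separate heap allocation owned
// by the vector, adding a pointer chase (and a likely cache miss) per group.
struct DynamicGroup : Node
{
    std::vector<std::shared_ptr<Node>> children;
};
```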
### Sensitivity to binary layout

As this is a cache-bound benchmark, and all modern CPUs have shared instruction and data caches outside the first level, any L1 instruction cache miss results in a trip to main memory, because nodes and node pointers will have been fetched into any L2 or L3 cache lines that might have held instructions. Ideally, as not very many unique functions get called during a `vsggroups` traversal test, they'd remain in the instruction cache for the whole duration of the run. Based on my testing, though, even when everything should fit and there are no associativity issues from instructions being forced to compete for the same cache set, instructions are getting evicted and need refetching, and it happens often enough to have a measurable impact on performance.

I noticed this after trying to measure the impact of some changes and realising that the performance of MSVC Release versus RelWithDebInfo builds was significantly different. The differences between these build configurations are:
- `/Ob1` versus `/Ob2`, affecting inlining eligibility, but not for any of the functions called during traversal: virtual calls aren't eligible under either flag, and everything else used is eligible under both flags.
- the `/debug` flag, which, as well as controlling the generation of debug symbols, also changes the default values for `/OPT:REF` and `/OPT:ICF` when they're not explicitly set. `/OPT:REF` excludes provably unreachable functions from the generated binary, and `/OPT:ICF` deduplicates functions that compile to identical assembly.
When I tried them individually, both `/OPT:REF` vs `/OPT:NOREF` and `/OPT:ICF` vs `/OPT:NOICF` had significant but inconsistent impacts on the traversal time reported by the vsggroups example, despite not changing the sequence of instructions executed or the memory addresses accessed by those instructions. Ignoring its large magnitude, `/OPT:REF`'s impact is relatively unsurprising: if you exclude the uncalled functions from a binary, it's more likely that a function will end up adjacent to another that it calls, and therefore in the same cache line, so its callee doesn't need fetching separately when it's first called. `/OPT:ICF` is less predictable: if two functions are merged, then when one's in the L1 instruction cache, the other one will be there too, but simple functions are likely to be equivalent to many other simple functions, any of which could end up being the one that's kept, and only some of them will be adjacent to other functions called at the same time.
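A hypothetical example of the kind of merge `/OPT:ICF` performs (the names are invented; any two functions that compile to byte-identical machine code are candidates):

```cpp
#include <cstddef>
#include <vector>

struct Node {};
struct Light {};

// Despite the different element types, these compile to identical machine
// code, so with /OPT:ICF the linker may keep one copy and point both symbols
// at it. Whether the surviving copy lands near the traversal hot loop in the
// binary is effectively arbitrary.
std::size_t countNodes(const std::vector<Node*>& nodes) { return nodes.size(); }
std::size_t countLights(const std::vector<Light*>& lights) { return lights.size(); }
```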
Intuitively, this doesn't seem like something that's worth worrying about, but I ended up with some pretty surprising results. For example, I had a pair of binaries, one Release and one RelWithDebInfo, where traversal was consistently a whole 22% slower with the Release build; when I later used the same build settings to make another Release and RelWithDebInfo pair with a minor change to code not called during traversal, the performance difference disappeared, i.e. both builds were as fast as the older RelWithDebInfo build. 22% is quite a big number to be caused by what is essentially just luck, and with the wrong combination of experimental changes and luck, someone running the benchmark could get very misleading performance results.

I've spent a while trying to investigate this without coming up with a satisfying explanation for why the impact is sometimes suspiciously large and other times less so, but I've not been able to do more than debunk plenty of hypotheses. The reported L1 instruction cache miss rate doesn't end up wildly different between fast and slow builds, but given that so many other things are identical, it's the only thing left that makes sense. Possibly there's just something, e.g. a Spectre mitigation, causing the cache to be cleared without the resulting misses contributing to the performance counters, but I'm in territory where the CPU has stopped being something that can be definitively reasoned about and is just a magic rock with lightning trapped inside that's been tricked into doing maths.
In real-world VSG-based applications, I don't think there's any need to worry about any of this, but it does mean that we can't trust vsggroups traversal timings to be especially correlated with real-world performance.
Based on my results and on reading the manuals for the big three compilers and their linkers, while they're all capable of doing the relevant optimisations, only for MSVC Windows builds will the VSG/vsgExamples CMake ever enable them. It's pretty reasonable to expect that VSG-based applications will do so (e.g. by enabling link-time optimisation), but I wouldn't expect the effect to be so inconsistent and unpredictable with a larger binary.
However, even without these two specific optimisations enabled, there are plenty of ways to subtly affect where functions end up in the final binary. I've not seen double-digit percentage swings without them, but I have seen swings in the region of 1-3% (consistently reproducible when rerunning the same binary) from changing the implementation of functions not called during traversal, which should have no impact at all, e.g. logging something after traversal has finished, or even changing a function that doesn't get called at all but that the linker will include anyway. I'd also suspect that the same impact is possible from things like renaming source files or class names, which have absolutely no business changing runtime performance, but will affect the order the linker consumes things in, and therefore the order things appear in the binary.
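As a concrete (hypothetical) illustration of the first kind of change, code like the following only runs after the timed traversal has finished, so it can't affect what the benchmark measures directly, yet adding or removing it can shift where the traversal functions land in the binary:

```cpp
#include <chrono>
#include <iostream>

// Called strictly after the timed section; its only plausible effect on the
// benchmark is via the code layout of the final binary.
void reportExtraStats(std::chrono::duration<double> traversalTime)
{
    std::cout << "traversal took " << traversalTime.count() << " s\n";
}
```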
The only conclusion to draw from this is that comparisons of vsggroups traversal times aren't reliable at all for MSVC Release builds, and on all other platforms and configurations, aren't reliable when they show only a couple of percentage points of difference. For changes that have a small impact, you can still get a reasonable indication of their real-world performance by building the example several times with small tweaks in each translation unit to add some randomness to the binary layouts, and with more than one compiler, then comparing the results for a collection of builds with the change versus without it, but obviously that's much more time-consuming and laborious than a simple before-and-after.
As with all performance testing, synthetic benchmarks just aren't as indicative of real-world performance as they'd ideally be.
### Outcome
Some of the issues I've discussed will be addressed by a PR I'll submit soon. However, there's nothing that can be done about the unwanted sensitivity to the binary layout. If there are decisions that have been made based mainly on vsggroups timings, it might be worth validating whether they can be reproduced with a more realistic test. I still don't have access to any VSG-based apps that are bound by scenegraph traversal rather than application logic, so I wouldn't expect to get any good data if I did this on my end.