
[SharedCache] All proposed shared cache improvements #2

Closed

bdash wants to merge 37 commits into dev from dsc-proposed

Conversation

@bdash (Owner) commented Nov 22, 2024

This PR exists to document the dsc-proposed branch. This branch aggregates a number of pending improvements related to the dyld shared cache plug-in from https://github.com/Vector35/binaryninja-api. It is intended to provide a single branch that can be built to take advantage of a number of significant bug fixes and performance improvements that have not yet been merged upstream.

This branch will periodically be rebased and its contents will change as pull requests are merged upstream or new changes are made.

Included on this branch

Bug fixes:

Performance improvements:

Changes for which no pull requests have been created at this time:

Please base further improvements on dev rather than this branch when possible to avoid problems when this branch is rebased. Pull requests should be submitted against the upstream repository unless you're fixing an issue introduced in a change that has not yet been merged.

@WeiN76LQh

Wasn't sure where to put this information, but considering these are the majority of changes I'm testing the DSC plugin with, I thought I'd mention some other areas where I've found performance issues. I'm not an expert on C++ or sampling execution times, so I timed things by calling std::chrono::high_resolution_clock::now() around sections of code to see how long they take. I presume that's good enough to identify bottlenecks when we're talking about seconds of execution time. This is also with a debug build, so things may be running a little slower, but these are obviously still areas of concern.
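For reference, a minimal sketch of the timing approach described above (the function and variable names are illustrative, not from the plugin):

```cpp
#include <chrono>
#include <cstdio>

// Illustrative only: wrap a suspect section of code and print how long it took.
void TimeSuspectSection()
{
	auto start = std::chrono::high_resolution_clock::now();

	// ... section of code being measured, e.g. a call into SharedCache ...

	auto end = std::chrono::high_resolution_clock::now();
	auto elapsedMs = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
	std::printf("section took %lld ms\n", static_cast<long long>(elapsedMs));
}
```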

  • In SharedCache::SaveToDSCView the line auto data = AsMetadata(); is consistently taking multiple seconds. This is called at least once when loading either a section or an image.
  • When using the Python API to load sections by address, in SharedCache::LoadSectionAtAddress the line auto id = m_dscView->BeginUndoActions(); is often taking at least a second and in some cases has taken tens of seconds. Subsequently in SharedCache::LoadSectionAtAddress the line m_dscView->CommitUndoActions(id); is usually taking around 100ms. There appear to be some underlying issues with undo actions when using Binary Ninja via the Python API.
  • By far the biggest performance issue is lock contention. During phase 1 of analysis a huge amount of time is spent blocking analysis threads on the first line in SharedCache::LoadSectionAtAddress:
    std::unique_lock<std::mutex> lock(viewSpecificMutexes[m_dscView->GetFile()->GetSessionId()].viewOperationsThatInfluenceMetadataMutex);
    This lock contention is definitely what's causing DSC analysis to be less parallelised than it could be.
  • The line shortly after, auto vm = GetVMMap(); (in SharedCache::LoadSectionAtAddress), is consistently taking hundreds of milliseconds. That is much less problematic on its own, but SharedCache::LoadSectionAtAddress is called a lot by phase 1 analysis threads, so it adds up.

I think that whilst the lock contention is a major issue, it might be being exacerbated by excessive calls to SharedCache::LoadSectionAtAddress. I'm looking into this because loading one image is causing 71 calls (not likely a consistent number, but a real-world example) to SharedCache::LoadSectionAtAddress, often with addresses that are very close by or the same. It's probable that no lock needs to be taken here and a quick return can be done in the cases where the section is already loaded.

@bdash (Owner, Author) commented Nov 23, 2024

> In SharedCache::SaveToDSCView the line auto data = AsMetadata(); is consistently taking multiple seconds. This is called at least once when loading either a section or an image.

This method serializes the entire in-memory state to a giant JSON string. With the macOS cache I've been testing, this results in around 500MB of JSON data and takes around 400-500ms.

I have been experimenting with replacing the JSON serialization with Flatbuffers as that should be more efficient in terms of the size of the resulting data, the time it takes to generate it, and the temporary memory allocations made during serialization. It cuts the data size and serialization time in half, but fundamentally it's always going to be slow to serialize this much state.

The real solution is not to serialize this data as often as it currently happens. Sadly I don't really understand why it was implemented the way it is, so I'm not sure of the implications of changing that.

> By far the biggest performance issue is lock contention. During phase 1 of analysis a huge amount of time is spent blocking analysis threads on the first line in SharedCache::LoadSectionAtAddress:

I also see that when loading an iOS shared cache. It barely shows up with a macOS shared cache.

I'm not sure whether it's possible to make the locking finer grained to improve parallelism, since there's not much information about what data the locks protect and why.

One thing I will look at is whether the work done with the lock held can be made faster. SharedCache::GetVMMap looks to be the culprit with an iOS shared cache. There's room for improvement there. I was working on improving it yesterday before I got sidetracked with Objective-C selectors misbehaving. It works for macOS shared caches, but not iOS at this point. I'll revisit this shortly.

> Often with addresses that are very close by or the same. It's probable that no lock needs to be taken here and a quick return can be done in the cases where the section is already loaded.

That's a good point. A lot of work is done with the viewOperationsThatInfluenceMetadataMutex lock held prior to even determining whether there's anything to load. If viewOperationsThatInfluenceMetadataMutex only needs to be held while the operations that interact with BinaryView are being performed, a lot of that work could happen in parallel.

@WeiN76LQh commented Nov 23, 2024

> This method serializes the entire in-memory state to a giant JSON string. With the macOS cache I've been testing, this results in around 500MB of JSON data and takes around 400-500ms.
>
> I have been experimenting with replacing the JSON serialization with Flatbuffers as that should be more efficient in terms of the size of the resulting data, the time it takes to generate it, and the temporary memory allocations made during serialization. It cuts the data size and serialization time in half, but fundamentally it's always going to be slow to serialize this much state.
>
> The real solution is not to serialize this data as often as it currently happens. Sadly I don't really understand why it was implemented the way it is, so I'm not sure of the implications of changing that.

Yeah, that's kind of the same conclusion I've come to. I'm guessing it's about saving state in the database using the metadata API. This would also be useful to plugin makers wanting to access information about the DSC from the metadata. To be honest it probably doesn't matter a huge amount (given the other issues), but it does add up if you were to bulk load hundreds of images.

> That's a good point. A lot of work is done with the viewOperationsThatInfluenceMetadataMutex lock held prior to even determining whether there's anything to load. If viewOperationsThatInfluenceMetadataMutex only needs to be held while the operations that interact with BinaryView are being performed, a lot of that work could happen in parallel.

I've just been working on this and have a solution, although it's kind of messy due to my lack of C++ skills. Basically I had to implement a global lock (per DSC view) for each memory region. If a memory region is not loaded then its respective global lock has to be acquired. The global state is updated to the latest copy, and then another check is done to see whether the region was loaded whilst acquiring the lock. If it was, bail; otherwise grab the viewOperationsThatInfluenceMetadataMutex lock and load the region whilst still holding the global region lock. This ensures only one thread attempts to load the region, but also means that once it's loaded, threads spend very little time in SharedCache::LoadSectionAtAddress. It does speed up initial loading of an image, but it seems a lot of phase 1 analysis is still being slowed down elsewhere.
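Roughly the shape of what's being described, as a minimal sketch (the region key, the IsRegionLoaded/LoadRegion helpers, and the struct holding the per-region mutexes are hypothetical names, not the plugin's actual API):

```cpp
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical per-view bookkeeping: one mutex per memory region, plus the
// existing metadata mutex, so only one thread loads a given region.
struct RegionLocks
{
	std::mutex mapMutex;
	std::map<uint64_t, std::mutex> perRegionMutex;  // keyed by region start address
	std::mutex viewOperationsThatInfluenceMetadataMutex;
};

bool IsRegionLoaded(uint64_t regionStart);  // hypothetical: checks current state
void LoadRegion(uint64_t regionStart);      // hypothetical: does the actual load

void LoadSectionForRegion(RegionLocks& locks, uint64_t regionStart)
{
	if (IsRegionLoaded(regionStart))
		return;  // fast path: no lock needed once the region is loaded

	std::mutex* regionMutex;
	{
		std::lock_guard<std::mutex> guard(locks.mapMutex);
		regionMutex = &locks.perRegionMutex[regionStart];
	}

	std::lock_guard<std::mutex> regionGuard(*regionMutex);
	if (IsRegionLoaded(regionStart))
		return;  // another thread loaded it while we were waiting

	std::lock_guard<std::mutex> metadataGuard(locks.viewOperationsThatInfluenceMetadataMutex);
	LoadRegion(regionStart);
}
```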

I'm finding there are a number of SharedCache functions being called frequently in phases 1 and 3 of analysis that may also be problematic. SharedCache::FindSymbolAtAddrAndApplyToAddr seems to be a popular but slow path.

@bdash (Owner, Author) commented Nov 23, 2024

Are you loading a single additional image at a time, or multiple at once via the Python API? If the latter, could you please share a Python snippet so I can also take a look at the behavior in that case?

@WeiN76LQh

> Are you loading a single additional image at a time, or multiple at once via the Python API? If the latter, could you please share a Python snippet so I can also take a look at the behavior in that case?

I'm just loading single images at a time using the UI.

@WeiN76LQh

> I'm finding there are a number of SharedCache functions being called frequently in phases 1 and 3 of analysis that may also be problematic. SharedCache::FindSymbolAtAddrAndApplyToAddr seems to be a popular but slow path.

I'm currently looking at SharedCache::FindSymbolAtAddrAndApplyToAddr. SharedCache::ParseExportTrie is called every time and the results are cached, but the cache (State().exportInfos) is never actually used. I think I can get some improvements here by actually using the cache, amongst some other minor code changes to avoid locks and repetitive code.
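A minimal sketch of the kind of cache consultation being described, assuming a map keyed per image (the types and helper names are illustrative, not the plugin's actual ones):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative types: a cached list of (address, symbol name) pairs per image.
using ExportList = std::vector<std::pair<uint64_t, std::string>>;
std::unordered_map<uint64_t, ExportList> exportInfos;  // keyed by image base

ExportList ParseExportTrie(uint64_t imageBase);  // hypothetical: the slow path

// Return the cached export list for an image, parsing the trie only once.
const ExportList& ExportsForImage(uint64_t imageBase)
{
	auto it = exportInfos.find(imageBase);
	if (it != exportInfos.end())
		return it->second;  // cache hit: skip re-parsing the export trie

	auto inserted = exportInfos.emplace(imageBase, ParseExportTrie(imageBase));
	return inserted.first->second;
}
```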

@bdash (Owner, Author) commented Nov 24, 2024

I'd missed that FindSymbolAtAddrAndApplyToAddr doesn't consult exportInfos. Amazing!

@WeiN76LQh

> I'd missed that FindSymbolAtAddrAndApplyToAddr doesn't consult exportInfos. Amazing!

It actually gets more wild. Using exportInfos only provided around a 10% performance improvement for loading the Foundation library. What actually turned out to be the issue with that function is the call to GetTypeLibrary. Caching the references to the type libraries provided over a 50% performance increase and dramatically improves multi-core utilisation. (A sketch of that kind of cache follows the numbers below.)

For Foundation:

  • Analysis phase 1 went from ~450 seconds (with exportInfos optimisation) -> ~200 seconds
  • Analysis phase 3 went from ~60 seconds (with exportInfos optimisation) -> ~20 seconds
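As mentioned above, a minimal sketch of caching the type library references (the cache and helper function are hypothetical; it assumes GetTypeLibrary on the view returns a Ref<TypeLibrary>, as the plugin uses it):

```cpp
#include "binaryninjaapi.h"
#include <string>
#include <unordered_map>

using namespace BinaryNinja;

// Hypothetical cache: resolve each type library once per view and reuse the
// reference for subsequent symbol applications instead of calling
// GetTypeLibrary every time.
static std::unordered_map<std::string, Ref<TypeLibrary>> typeLibraryCache;

Ref<TypeLibrary> TypeLibraryForName(BinaryView* view, const std::string& name)
{
	auto it = typeLibraryCache.find(name);
	if (it != typeLibraryCache.end())
		return it->second;  // cached reference, skips the expensive lookup

	Ref<TypeLibrary> library = view->GetTypeLibrary(name);
	typeLibraryCache[name] = library;  // cache even when null so we don't retry
	return library;
}
```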

@WeiN76LQh

Do you want me to do pull requests against your dsc-proposed branch? At this point my changes are intertwined with yours, so if I do a pull request against the main dev branch, they aren't going to merge neatly with dsc-proposed without changes.

@bdash (Owner, Author) commented Nov 24, 2024

Feel free to send a pull request. I'd love to see what you've come up with!

I might end up rebasing it so it applies cleanly to dev and then merging it here, using your PR as the end state to see how to resolve the resulting merge conflicts.

I have personally been trying to stick to branches that apply cleanly to dev where there's not an explicit dependency on earlier work. This will hopefully avoid ending up with significant bug fixes or performance improvements being blocked on invasive changes such as the persistent data structures via immer.

It does mean a bunch of extra work, though. I develop the changes against this branch, then have to port them back to dev, and then deal with the merge conflicts when merging everything together here. I honestly don't know if it's worth the effort. Hopefully we'll see some of the PRs against upstream merged soon.

@WeiN76LQh

I did some heap profiling and it seems that when Binary Ninja is consuming 30GB of RAM due to loading 30 libraries, 90% of the RAM usage is coming from allocations made by the core Binary Ninja API. So, like I think you mentioned before, we're at the mercy of Vector35 solving that issue. The biggest chunk of RAM usage from the DSC plugin is the symbol export lists, which will naturally consume 1 or 2 GB because of the number of strings, so I don't think anything worthwhile can be done there.

I also did some more performance testing. During image loading the bulk of the execution time is in SharedCache::LoadSectionAtAddress, with some in SharedCache::LoadImageWithInstallName. I think we've solved SharedCache::FindSymbolAtAddrAndApplyToAddr, which was also slowing loading. A couple of other functions have a small impact: SharedCache::SaveToDSCView, SharedCache::Store and SharedCache::InitializeHeader.

During the analysis phase most of the execution time is in fixObjCCallTypes and to a lesser extent in SharedCacheWorkflow::FixupStubs.

@bdash (Owner, Author) commented Nov 26, 2024

I think I'll personally hold off on trying to optimize anything further until we start seeing PRs merged upstream. As it stands it is now fast enough for my typical usage (opening a shared cache to poke at the couple of libraries I happened to be interested in at that moment, then throwing it all away). Once changes start to get merged and I'm not trying to work across a giant stack of changes like this I might revisit things again, just because making things faster is a lot of fun.

I'll merge the remaining PRs we've each submitted upstream into this branch, and I'll continue to keep an eye out for any other improvements you make so I can merge them here. That will ensure that there's an easy way for anyone that's interested to get a working, fast shared cache plug-in.

@WeiN76LQh

Yes I agree, I'm pretty much in the same boat. I appreciate all the work you've done to make working with the DSC in Binary Ninja as performant as it is with this branch.

@WeiN76LQh

Not sure what happened between that comment and now, but I'm seeing pretty major performance regressions using this branch. Foundation is taking more like 600 seconds to load on the latest commit of this branch. If I jump back to when I made that comment, I'm still getting the numbers I stated.

@bdash (Owner, Author) commented Nov 28, 2024

I've been consistently seeing times of 2-3 minutes for Foundation. The only time I've seen anything under 30 seconds it was due to a bug resulting in a bunch of analysis work being skipped.

@WeiN76LQh commented Nov 29, 2024

I think I know what's going on. Using BeginBulkModifySymbols in SharedCache::FindSymbolAtAddrAndApplyToAddr is proving to be very expensive: removing the BeginBulkModifySymbols and FinishBulkModifySymbols calls results in a 2.5x to 3x speed increase when loading Foundation.

m_dscView->BeginBulkModifySymbols();

What I don't understand is why BeginBulkModifySymbols is being used in that location in the first place. It seems like it's probably a mistake, as only one symbol is likely to be defined in that situation. My expectation is that you use that function when a block of code is adding a number of symbols in one go.

@bdash (Owner, Author) commented Nov 29, 2024

Yeah, skipping BeginBulkModifySymbols when modifying only a single symbol was a decent speedup when I tested it, as is avoiding creating an undo action in cases where the code skips doing work.

@WeiN76LQh

Yes, I cleaned up that function a little bit as well. The if / else if at the start also seems to be duplicated work, so I think the else if can be removed entirely.

WeiN76LQh and others added 4 commits December 23, 2024 11:38
`SharedCache::m_symbolInfos` isn't being serialized but there is an attempt to deserialize it. This commit adds in the code to serialize it.

I slightly modified the format because I didn't really understand how it was expected to be serialized based on the deserialization code. The deserialization code looked wrong to me, but it's likely a misunderstanding on my part. I kept it similar to how `m_exportInfos` is serialized.
…ad v3 symbol list

We should think through a better way of handling upgrades, as v3 can load v2 dbs just fine, but there is no clean way to upgrade the info currently. We need SharedCache ser/des functions in DSCView.cpp
From what I can tell, the lock taken in `SharedCache::~SharedCache` was purely for the decrement of `sharedCacheReferences`; however, it's an atomic, so a lock isn't necessary. The lock is extremely contentious and therefore often slow to acquire. This resulted in a surprising amount of execution time being spent in the `SharedCache` destructor. Nothing hugely significant, but a quick and easy win to remove this single line of code.
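A minimal sketch of the before/after pattern this commit message describes (the counter name mirrors the one mentioned above; the surrounding class is illustrative):

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// Illustrative: sharedCacheReferences mirrors the counter named in the commit message.
static std::atomic<uint64_t> sharedCacheReferences {0};
static std::mutex sharedCacheMutex;  // the contended lock the commit removes

struct SharedCacheSketch
{
	~SharedCacheSketch()
	{
		// Before: the destructor grabbed the contended mutex just to decrement.
		// std::lock_guard<std::mutex> guard(sharedCacheMutex);
		// --sharedCacheReferences;

		// After: the atomic decrement is already thread-safe, no lock needed.
		sharedCacheReferences.fetch_sub(1, std::memory_order_relaxed);
	}
};
```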
CouleeApps and others added 26 commits December 27, 2024 14:41
Does not seem to do anything other than cause a crash!
The existing view-specific state was stored in several global unordered
maps. Many of these were accessed without locking, including
`viewSpecificMutexes`, which is racy in the face of multiple threads.

View-specific state is stored in a new heap-allocated
`ViewSpecificState` struct that is reference counted via
`std::shared_ptr`. A static map holds a `std::weak_ptr` to each
view-specific state, keyed by session id. `SharedCache` retrieves its
view-specific state during its constructor.

Since `ViewSpecificState` is reference counted it will naturally be
deallocated when the last `SharedCache` instance that references it goes
away. Its corresponding entry will remain in the static map, though
since it only holds a `std::weak_ptr` rather than any state it will not
use much memory. The next time view-specific state is retrieved any
expired entries will be removed from the map.
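A minimal sketch of the lookup scheme described in this commit message (the struct contents and function name are illustrative, not the actual implementation):

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

// Illustrative contents; the real struct holds the per-view mutexes and caches.
struct ViewSpecificState
{
	std::mutex viewOperationsThatInfluenceMetadataMutex;
};

// Shared, reference-counted state per session id. The state is deallocated
// automatically when the last SharedCache referencing it goes away.
std::shared_ptr<ViewSpecificState> ViewSpecificStateForId(uint64_t sessionId)
{
	static std::mutex mapMutex;
	static std::unordered_map<uint64_t, std::weak_ptr<ViewSpecificState>> states;

	std::lock_guard<std::mutex> guard(mapMutex);

	if (auto existing = states[sessionId].lock())
		return existing;  // a SharedCache for this session already created it

	auto state = std::make_shared<ViewSpecificState>();
	states[sessionId] = state;

	// Opportunistically drop expired entries so the map stays small.
	for (auto it = states.begin(); it != states.end();)
	{
		if (it->second.expired())
			it = states.erase(it);
		else
			++it;
	}
	return state;
}
```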
They're surprisingly expensive to look up.
`BackingCache` now tracks the `dyld_cache_mapping_info` for its mappings
so it has access to the memory protections for the region. This means it
can avoid marking some regions as containing code when they don't,
reducing the amount of analysis work that has to be done.

Using `dyld_cache_mapping_info` also makes references to mappings easier
to understand due to its named fields vs the nested `std::pair`s that
were previously in use.
Find the relative selector base address in the Objective-C optimization
data pointed to by the shared cache header, rather than via
`__objc_scoffs`. This is only present on iOS, and not for every iOS
version that encodes selectors via direct offsets.

This also includes some related improvements:
1. Direct selectors get their own pointer type so they're rendered
   correctly in the view.
2. Method lists encoded as lists of lists are now handled.
3. The `dyld_cache_header` type added to the view is truncated to the
   length in the loaded cache. This ensures it is applied to the view.
4. A couple of methods that process method IMPs and selectors are
   updated to check whether the address is valid before attempting to
   process them. They would otherwise fail by throwing an exception;
   checking validity up front is quicker and makes exception
   breakpoints usable.
…tion entry

Also fix type clobbering on anonymous class names
1. Use moves where possible to avoid unnecessary copies.
2. Remove redundant work within SymbolTableModel::updateSymbols. It
   calls setFilter which immediately clears then repopulates m_symbols.
3. Use unordered_map rather than map in `VM`. It is faster and the order
   isn't significant.
4. Avoid multiple accesses to the map with `VM` in the common cases.
5. Optimize the common case within SharedCache::HeaderForAddress.
6. Change return type of SharedCache::HeaderForAddress to avoid copying
   SharedCacheMachOHeaders. It is a large type that is expensive to
   copy.
[immer](https://github.com/arximboldi/immer) provides persistent,
immutable data structures such as vectors and maps. These data
structures support passing by value without copying any data and
structural sharing to copy only a subset of data when a data structure
is mutated. immer is published under the Boost Software License which
should be compatible with its use in this context.

Using these data structures eliminates a lot of the unnecessary copying
of the shared cache's state when retrieving it from the view cache and
beginning to mutate it. Instead of all of the vectors and maps contained
within the state being copied, only the portions of the vectors or maps
that are mutated end up being copied.

The downside is that the APIs used when mutating are less ergonomic than
using the native C++ types.

The upside is that this cuts the time taken for the initial load and
analysis of a macOS shared cache to around 45 seconds (from 70 seconds
with the basic CoW implementation in Vector35#6129) and cuts the time taken to
load and analyze AppKit from 14 minutes to around 8.5 minutes.
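A minimal sketch of how these immer containers behave, purely illustrative of the structural-sharing point above (not code from the branch):

```cpp
#include <cstdio>
#include <immer/map.hpp>
#include <immer/vector.hpp>

int main()
{
	immer::vector<int> original;
	original = original.push_back(1).push_back(2);

	// "Copying" the state is cheap; no element data is duplicated.
	immer::vector<int> snapshot = original;

	// Mutation returns a new vector that shares structure with the old one.
	auto mutated = original.push_back(3);

	std::printf("snapshot has %zu elements, mutated has %zu\n",
		snapshot.size(), mutated.size());  // 2 and 3: the snapshot is unaffected

	immer::map<int, const char*> images;
	auto withEntry = images.set(42, "Foundation");  // returns a new map
	if (const char* const* name = withEntry.find(42))
		std::printf("image 42: %s\n", *name);
	return 0;
}
```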
This change is mostly motivated by simplifying the code, but it also
brings minor correctness and performance benefits.

1. The pointer returned by mmap is stored as a uint8_t* rather than
   void* as that is how it is used. This reduces how often it needs to
   be cast to a different type before it is used.
2. Read methods for primitives delegate to a new Read template function
   that in turn delegates to the general-purpose `Read(void* dest, size_t
   address, size_t length)`. This improves the consistency of bounds
   checking and simplifies the code. The compiler is more than willing
   to inline this so we get less repetition with no overhead.
3. ReadNullTermString now uses std::find to find the nul byte and
   directly constructs the string from that range of bytes. This removes
   an unnecessary allocation that was previously forced by the use of
   reserve followed by shrink_to_fit. It also avoids repeated
   reallocation for longer strings as they grew past the reserved size
   while being built up a character at a time.
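A minimal sketch of the delegation pattern items 2 and 3 describe (the class shape and bounds-check details are illustrative, not the plugin's exact code):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

// Illustrative reader over an mmapped buffer; the real class wraps mmap.
class MappedReader
{
	uint8_t* m_data = nullptr;
	size_t m_length = 0;

public:
	MappedReader(uint8_t* data, size_t length) : m_data(data), m_length(length) {}

	// General-purpose read with a single, centralized bounds check.
	void Read(void* dest, size_t address, size_t length)
	{
		if (address > m_length || length > m_length - address)
			throw std::out_of_range("read past end of mapped file");
		memcpy(dest, m_data + address, length);
	}

	// Primitive reads delegate to the general-purpose Read.
	template <typename T>
	T Read(size_t address)
	{
		T value;
		Read(&value, address, sizeof(T));
		return value;
	}

	uint32_t ReadUInt32(size_t address) { return Read<uint32_t>(address); }

	// Find the nul terminator, then construct the string from the whole range
	// at once instead of growing it a character at a time.
	std::string ReadNullTermString(size_t address)
	{
		if (address >= m_length)
			return {};
		uint8_t* begin = m_data + address;
		uint8_t* end = std::find(begin, m_data + m_length, 0);
		return std::string(begin, end);
	}
};
```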
@bdash (Owner, Author) commented Mar 5, 2025

I'm going to close this out. All of the changes proposed here have since been merged upstream, with the exception of the use of persistent data structures. That was dropped in favor of Vector35#6393.

@bdash bdash closed this Mar 5, 2025
@bdash bdash deleted the dsc-proposed branch March 5, 2025 06:11