
[SharedCache] All proposed shared cache improvements #2

Closed

bdash wants to merge 37 commits into dev from dsc-proposed

Conversation

@bdash (Owner) commented Nov 22, 2024

This PR exists to document the dsc-proposed branch. This branch aggregates a number of pending improvements related to the dyld shared cache plug-in from https://github.com/Vector35/binaryninja-api. It is intended to provide a single branch that can be built to take advantage of a number of significant bug fixes and performance improvements that have not yet been merged upstream.

This branch will periodically be rebased and its contents will change as pull requests are merged upstream or new changes are made.

Included on this branch

Bug fixes:

Performance improvements:

Changes for which no pull requests have been created at this time:

Please base further improvements on dev rather than this branch when possible to avoid problems when this branch is rebased. Pull requests should be submitted against the upstream repository unless you're fixing an issue introduced in a change that has not yet been merged.

@WeiN76LQh

Wasn't sure where to put this information, but considering these are the majority of changes I'm testing the DSC plugin with, I thought I'd mention some other areas where I've found performance issues. I'm not an expert on C++ or sampling execution times, so I timed things by calling std::chrono::high_resolution_clock::now() around sections of code to see how long they take. I presume that's good enough to identify bottlenecks when we're talking about seconds of execution time. This is also with a debug build, so things may be running a little slower, but these are obviously still areas of concern.
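For reference, a minimal sketch of the timing approach described above (the function and variable names are illustrative, not from the plugin):

```cpp
#include <chrono>
#include <cstdio>

// Illustrative only: wrap a suspect section of code and print how long it took.
void TimeSuspectSection()
{
	auto start = std::chrono::high_resolution_clock::now();

	// ... section of code being measured, e.g. a call into SharedCache ...

	auto end = std::chrono::high_resolution_clock::now();
	auto elapsedMs = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
	std::printf("section took %lld ms\n", static_cast<long long>(elapsedMs));
}
```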

  • In SharedCache::SaveToDSCView the line auto data = AsMetadata(); is consistently taking multiple seconds. This is called at least once when loading either a section or an image.
  • When using the Python API to load sections by address, in SharedCache::LoadSectionAtAddress the line auto id = m_dscView->BeginUndoActions(); is often taking at least a second and in some cases has taken tens of seconds. Subsequently in SharedCache::LoadSectionAtAddress the line m_dscView->CommitUndoActions(id); is usually taking around 100ms. There appear to be some underlying issues with undo actions when using Binary Ninja via the Python API.
  • By far the biggest performance issue is lock contention. During phase 1 of analysis a huge amount of time is spent blocking analysis threads on the first line in SharedCache::LoadSectionAtAddress:
    std::unique_lock<std::mutex> lock(viewSpecificMutexes[m_dscView->GetFile()->GetSessionId()].viewOperationsThatInfluenceMetadataMutex);
    This lock contention is definitely what's causing DSC analysis to be less parallelised than it could be.
  • The line shortly after, auto vm = GetVMMap(); (in SharedCache::LoadSectionAtAddress), is consistently taking hundreds of milliseconds. That is much less problematic on its own, but SharedCache::LoadSectionAtAddress is called a lot by phase 1 analysis threads, so it adds up.

I think that whilst the lock contention is a major issue, it might be being exacerbated by excessive calls to SharedCache::LoadSectionAtAddress. I'm looking into this because loading one image is causing 71 calls (not likely a consistent number, but a real-world example) to SharedCache::LoadSectionAtAddress, often with addresses that are very close by or the same. It's probable that no lock needs to be taken here and a quick return can be done in the cases where the section is already loaded.

@bdash (Owner, Author) commented Nov 23, 2024

> In SharedCache::SaveToDSCView the line auto data = AsMetadata(); is consistently taking multiple seconds. This is called at least once when loading either a section or an image.

This method serializes the entire in-memory state to a giant JSON string. With the macOS cache I've been testing, this results in around 500MB of JSON data and takes around 400-500ms.

I have been experimenting with replacing the JSON serialization with Flatbuffers as that should be more efficient in terms of the size of the resulting data, the time it takes to generate it, and the temporary memory allocations made during serialization. It cuts the data size and serialization time in half, but fundamentally it's always going to be slow to serialize this much state.

The real solution is not to serialize this data as often as it currently happens. Sadly I don't really understand why it was implemented the way it is, so I'm not sure of the implications of changing that.

> By far the biggest performance issue is lock contention. During phase 1 of analysis a huge amount of time is spent blocking analysis threads on the first line in SharedCache::LoadSectionAtAddress:

I also see that when loading an iOS shared cache. It barely shows up with a macOS shared cache.

I'm not sure whether it's possible to make the locking finer grained to improve parallelism, since there's not much information about what data the locks protect and why.

One thing I will look at is whether the work done with the lock held can be made faster. SharedCache::GetVMMap looks to be the culprit with an iOS shared cache. There's room for improvement there. I was working on improving it yesterday before I got sidetracked with Objective-C selectors misbehaving. It works for macOS shared caches, but not iOS at this point. I'll revisit this shortly.

> Often with addresses that are very close by or the same. It's probable that no lock needs to be taken here and a quick return can be done in the cases where the section is already loaded.

That's a good point. A lot of work is done with the viewOperationsThatInfluenceMetadataMutex lock held prior to even determining whether there's anything to load. If viewOperationsThatInfluenceMetadataMutex only needs to be held while the operations that interact with BinaryView are being performed, a lot of that work could happen in parallel.

@WeiN76LQh commented Nov 23, 2024

> This method serializes the entire in-memory state to a giant JSON string. With the macOS cache I've been testing, this results in around 500MB of JSON data and takes around 400-500ms.
>
> I have been experimenting with replacing the JSON serialization with Flatbuffers as that should be more efficient in terms of the size of the resulting data, the time it takes to generate it, and the temporary memory allocations made during serialization. It cuts the data size and serialization time in half, but fundamentally it's always going to be slow to serialize this much state.
>
> The real solution is not to serialize this data as often as it currently happens. Sadly I don't really understand why it was implemented the way it is, so I'm not sure of the implications of changing that.

Yeah, that's kind of the same conclusion I've come to. I'm guessing it's about saving state in the database using the metadata API. This would also be useful to plugin makers wanting to access information about the DSC from the metadata. To be honest it probably doesn't matter a huge amount (given the other issues), but it does add up if you were to bulk load hundreds of images.

> That's a good point. A lot of work is done with the viewOperationsThatInfluenceMetadataMutex lock held prior to even determining whether there's anything to load. If viewOperationsThatInfluenceMetadataMutex only needs to be held while the operations that interact with BinaryView are being performed, a lot of that work could happen in parallel.

I've just been working on this and have a solution, although it's kind of messy due to my lack of C++ skills. Basically I had to implement a global lock (per DSC view) for each memory region. If a memory region is not loaded then its respective global lock has to be acquired. The global state is updated to the latest copy, and then another check is done to see whether the region was loaded whilst acquiring the lock. If it was, bail; otherwise grab the viewOperationsThatInfluenceMetadataMutex lock and load the region whilst still holding the global region lock. This ensures only one thread attempts to load the region, but also means that once it's loaded, threads spend very little time in SharedCache::LoadSectionAtAddress. It does speed up initial loading of an image, but it seems a lot of phase 1 analysis is still being slowed down elsewhere.
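Roughly the shape of what's being described, as a minimal sketch (the region key, the IsRegionLoaded/LoadRegion helpers, and the struct holding the per-region mutexes are hypothetical names, not the plugin's actual API):

```cpp
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical per-view bookkeeping: one mutex per memory region, plus the
// existing metadata mutex, so only one thread loads a given region.
struct RegionLocks
{
	std::mutex mapMutex;
	std::map<uint64_t, std::mutex> perRegionMutex;  // keyed by region start address
	std::mutex viewOperationsThatInfluenceMetadataMutex;
};

bool IsRegionLoaded(uint64_t regionStart);  // hypothetical: checks current state
void LoadRegion(uint64_t regionStart);      // hypothetical: does the actual load

void LoadSectionForRegion(RegionLocks& locks, uint64_t regionStart)
{
	if (IsRegionLoaded(regionStart))
		return;  // fast path: no lock needed once the region is loaded

	std::mutex* regionMutex;
	{
		std::lock_guard<std::mutex> guard(locks.mapMutex);
		regionMutex = &locks.perRegionMutex[regionStart];
	}

	std::lock_guard<std::mutex> regionGuard(*regionMutex);
	if (IsRegionLoaded(regionStart))
		return;  // another thread loaded it while we were waiting

	std::lock_guard<std::mutex> metadataGuard(locks.viewOperationsThatInfluenceMetadataMutex);
	LoadRegion(regionStart);
}
```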

I'm finding there are a number of SharedCache functions being called frequently in phases 1 and 3 of analysis that may also be problematic. SharedCache::FindSymbolAtAddrAndApplyToAddr seems to be a popular but slow path.

@bdash (Owner, Author) commented Nov 23, 2024

Are you loading a single additional image at a time, or multiple at once via the Python API? If the latter, could you please share a Python snippet so I can also take a look at the behavior in that case?

@WeiN76LQh

> Are you loading a single additional image at a time, or multiple at once via the Python API? If the latter, could you please share a Python snippet so I can also take a look at the behavior in that case?

I'm just loading single images at a time using the UI.

@WeiN76LQh

> I'm finding there are a number of SharedCache functions being called frequently in phases 1 and 3 of analysis that may also be problematic. SharedCache::FindSymbolAtAddrAndApplyToAddr seems to be a popular but slow path.

I'm currently looking at SharedCache::FindSymbolAtAddrAndApplyToAddr. SharedCache::ParseExportTrie is called every time and the results are cached, but the cache (State().exportInfos) is never actually used. I think I can get some improvements here by actually using the cache, amongst some other minor code changes to avoid locks and repetitive code.
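A minimal sketch of the kind of cache consultation being described, assuming a map keyed per image (the types and helper names are illustrative, not the plugin's actual ones):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative types: a cached list of (address, symbol name) pairs per image.
using ExportList = std::vector<std::pair<uint64_t, std::string>>;
std::unordered_map<uint64_t, ExportList> exportInfos;  // keyed by image base

ExportList ParseExportTrie(uint64_t imageBase);  // hypothetical: the slow path

// Return the cached export list for an image, parsing the trie only once.
const ExportList& ExportsForImage(uint64_t imageBase)
{
	auto it = exportInfos.find(imageBase);
	if (it != exportInfos.end())
		return it->second;  // cache hit: skip re-parsing the export trie

	auto inserted = exportInfos.emplace(imageBase, ParseExportTrie(imageBase));
	return inserted.first->second;
}
```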

@bdash (Owner, Author) commented Nov 24, 2024

I'd missed that FindSymbolAtAddrAndApplyToAddr doesn't consult exportInfos. Amazing!

@WeiN76LQh

> I'd missed that FindSymbolAtAddrAndApplyToAddr doesn't consult exportInfos. Amazing!

It actually gets more wild. Using exportInfos only provided around a 10% performance improvement for loading the Foundation library. What actually turned out to be the issue with that function is the call to GetTypeLibrary. Caching the references to the type libraries provided over a 50% performance increase and dramatically improves multi-core utilisation. (A sketch of that kind of cache follows the numbers below.)

For Foundation:

  • Analysis phase 1 went from ~450 seconds (with exportInfos optimisation) -> ~200 seconds
  • Analysis phase 3 went from ~60 seconds (with exportInfos optimisation) -> ~20 seconds
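As mentioned above, a minimal sketch of caching the type library references (the cache and helper function are hypothetical; it assumes GetTypeLibrary on the view returns a Ref<TypeLibrary>, as the plugin uses it):

```cpp
#include "binaryninjaapi.h"
#include <string>
#include <unordered_map>

using namespace BinaryNinja;

// Hypothetical cache: resolve each type library once per view and reuse the
// reference for subsequent symbol applications instead of calling
// GetTypeLibrary every time.
static std::unordered_map<std::string, Ref<TypeLibrary>> typeLibraryCache;

Ref<TypeLibrary> TypeLibraryForName(BinaryView* view, const std::string& name)
{
	auto it = typeLibraryCache.find(name);
	if (it != typeLibraryCache.end())
		return it->second;  // cached reference, skips the expensive lookup

	Ref<TypeLibrary> library = view->GetTypeLibrary(name);
	typeLibraryCache[name] = library;  // cache even when null so we don't retry
	return library;
}
```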

@WeiN76LQh

Do you want me to do pull requests against your dsc-proposed branch? At this point my changes are intertwined with yours, so if I do a pull request against the main dev branch, they aren't going to merge neatly with dsc-proposed without changes.

@bdash (Owner, Author) commented Nov 24, 2024

Feel free to send a pull request. I'd love to see what you've come up with!

I might end up rebasing it so it applies cleanly to dev and then merging it here, using your PR as the end state to see how to resolve the resulting merge conflicts.

I have personally been trying to stick to branches that apply cleanly to dev where there's not an explicit dependency on earlier work. This will hopefully avoid ending up with significant bug fixes or performance improvements being blocked on invasive changes such as the persistent data structures via immer.

It does mean a bunch of extra work, though. I develop the changes against this branch, then have to port them back to dev, and then deal with the merge conflicts when merging everything together here. I honestly don't know if it's worth the effort. Hopefully we'll see some of the PRs against upstream merged soon.

@WeiN76LQh

I did some heap profiling and it seems that when Binary Ninja is consuming 30GB of RAM due to loading 30 libraries, 90% of the RAM usage is coming from allocations made by the core Binary Ninja API. So, like I think you mentioned before, we're at the mercy of Vector35 solving that issue. The biggest chunk of RAM usage from the DSC plugin is the symbol export lists, which will naturally consume 1 or 2 GB because of the number of strings, so I don't think anything worthwhile can be done there.

I also did some more performance testing. During image loading the bulk of the execution time is in SharedCache::LoadSectionAtAddress, with some in SharedCache::LoadImageWithInstallName. I think we've solved SharedCache::FindSymbolAtAddrAndApplyToAddr, which was also slowing loading. A couple of other functions have a small impact: SharedCache::SaveToDSCView, SharedCache::Store and SharedCache::InitializeHeader.

During the analysis phase most of the execution time is in fixObjCCallTypes and to a lesser extent in SharedCacheWorkflow::FixupStubs.

@bdash (Owner, Author) commented Nov 26, 2024

I think I'll personally hold off on trying to optimize anything further until we start seeing PRs merged upstream. As it stands it is now fast enough for my typical usage (opening a shared cache to poke at the couple of libraries I happened to be interested in at that moment, then throwing it all away). Once changes start to get merged and I'm not trying to work across a giant stack of changes like this I might revisit things again, just because making things faster is a lot of fun.

I'll merge the remaining PRs we've each submitted upstream into this branch, and I'll continue to keep an eye out for any other improvements you make so I can merge them here. That will ensure that there's an easy way for anyone that's interested to get a working, fast shared cache plug-in.

@WeiN76LQh

Yes I agree, I'm pretty much in the same boat. I appreciate all the work you've done to make working with the DSC in Binary Ninja as performant as it is with this branch.

@WeiN76LQh

Not sure what happened between that comment and now, but I'm seeing pretty major performance regressions using this branch. Foundation is taking more like 600 seconds to load on the latest commit of this branch. If I jump back to when I made that comment, I'm still getting the numbers I stated.

@bdash (Owner, Author) commented Nov 28, 2024

I've been consistently seeing times of 2-3 minutes for Foundation. The only time I've seen anything under 30 seconds it was due to a bug resulting in a bunch of analysis work being skipped.

@WeiN76LQh commented Nov 29, 2024

I think I know what's going on. Using BeginBulkModifySymbols in SharedCache::FindSymbolAtAddrAndApplyToAddr is proving to be very expensive: removing the BeginBulkModifySymbols and FinishBulkModifySymbols calls results in a 2.5x to 3x speed increase when loading Foundation.

m_dscView->BeginBulkModifySymbols();

What I don't understand is why BeginBulkModifySymbols is being used in that location in the first place. It seems like it's probably a mistake, as only one symbol is likely to be defined in that situation. My expectation is that you use that function when a block of code is adding a number of symbols in one go.

@bdash (Owner, Author) commented Nov 29, 2024

Yeah, skipping BeginBulkModifySymbols when modifying only a single symbol was a decent speedup when I tested it, as is avoiding creating an undo action in cases where the code skips doing work.

@WeiN76LQh

Yes, I cleaned up that function a little bit as well. The if / else if at the start also seems to be duplicated work, so I think the else if can be removed entirely.

WeiN76LQh and others added 4 commits December 23, 2024 11:38
`SharedCache::m_symbolInfos` isn't being serialized but there is an attempt to deserialize it. This commit adds in the code to serialize it.

I slightly modified the format because I didn't really understand how it was expected to be serialized based on the deserialization code. The deserialization code looked wrong to me, but it's likely a misunderstanding on my part. I kept it similar to how `m_exportInfos` is serialized.
…ad v3 symbol list

We should think through a better way of handling upgrades, as v3 can load v2 dbs just fine, but there is no clean way to upgrade the info currently. We need SharedCache ser/des functions in DSCView.cpp
From what I can tell, the lock taken in `SharedCache::~SharedCache` was purely for the decrement of `sharedCacheReferences`; however, it's an atomic, so a lock isn't necessary. The lock is extremely contentious and therefore often slow to acquire. This resulted in a surprising amount of execution time being spent in the `SharedCache` destructor. Nothing hugely significant, but a quick and easy win to remove this single line of code.
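A minimal sketch of the before/after pattern this commit message describes (the counter name mirrors the one mentioned above; the surrounding class is illustrative):

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// Illustrative: sharedCacheReferences mirrors the counter named in the commit message.
static std::atomic<uint64_t> sharedCacheReferences {0};
static std::mutex sharedCacheMutex;  // the contended lock the commit removes

struct SharedCacheSketch
{
	~SharedCacheSketch()
	{
		// Before: the destructor grabbed the contended mutex just to decrement.
		// std::lock_guard<std::mutex> guard(sharedCacheMutex);
		// --sharedCacheReferences;

		// After: the atomic decrement is already thread-safe, no lock needed.
		sharedCacheReferences.fetch_sub(1, std::memory_order_relaxed);
	}
};
```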
CouleeApps and others added 26 commits December 27, 2024 14:41
Does not seem to do anything other than cause a crash!
The existing view-specific state was stored in several global unordered
maps. Many of these were accessed without locking, including
`viewSpecificMutexes`, which is racy in the face of multiple threads.

View-specific state is stored in a new heap-allocated
`ViewSpecificState` struct that is reference counted via
`std::shared_ptr`. A static map holds a `std::weak_ptr` to each
view-specific state, keyed by session id. `SharedCache` retrieves its
view-specific state during its constructor.

Since `ViewSpecificState` is reference counted it will naturally be
deallocated when the last `SharedCache` instance that references it goes
away. Its corresponding entry will remain in the static map, though
since it only holds a `std::weak_ptr` rather than any state it will not
use much memory. The next time view-specific state is retrieved any
expired entries will be removed from the map.
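A minimal sketch of the lookup scheme described in this commit message (the struct contents and function name are illustrative, not the actual implementation):

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

// Illustrative contents; the real struct holds the per-view mutexes and caches.
struct ViewSpecificState
{
	std::mutex viewOperationsThatInfluenceMetadataMutex;
};

// Shared, reference-counted state per session id. The state is deallocated
// automatically when the last SharedCache referencing it goes away.
std::shared_ptr<ViewSpecificState> ViewSpecificStateForId(uint64_t sessionId)
{
	static std::mutex mapMutex;
	static std::unordered_map<uint64_t, std::weak_ptr<ViewSpecificState>> states;

	std::lock_guard<std::mutex> guard(mapMutex);

	if (auto existing = states[sessionId].lock())
		return existing;  // a SharedCache for this session already created it

	auto state = std::make_shared<ViewSpecificState>();
	states[sessionId] = state;

	// Opportunistically drop expired entries so the map stays small.
	for (auto it = states.begin(); it != states.end();)
	{
		if (it->second.expired())
			it = states.erase(it);
		else
			++it;
	}
	return state;
}
```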
They're surprisingly expensive to look up.
`BackingCache` now tracks the `dyld_cache_mapping_info` for its mappings
so it has access to the memory protections for the region. This means it
can avoid marking some regions as containing code when they don't,
reducing the amount of analysis work that has to be done.

Using `dyld_cache_mapping_info` also makes references to mappings easier
to understand due to its named fields vs the nested `std::pair`s that
were previously in use.
Find the relative selector base address in the Objective-C optimization
data pointed to by the shared cache header, rather than via
`__objc_scoffs`. This is only present on iOS, and not for every iOS
version that encodes selectors via direct offsets.

This also includes some related improvements:
1. Direct selectors get their own pointer type so they're rendered
   correctly in the view.
2. Method lists encoded as lists of lists are now handled.
3. The `dyld_cache_header` type added to the view is truncated to the
   length in the loaded cache. This ensures it is applied to the view.
4. A couple of methods that process method IMPs and selectors are
   updated to check whether the address is valid before attempting to
   process them. They would otherwise fail by throwing an exception;
   checking validity up front is quicker and makes exception
   breakpoints usable.
…tion entry

Also fix type clobbering on anonymous class names
1. Use moves where possible to avoid unnecessary copies.
2. Remove redundant work within SymbolTableModel::updateSymbols. It
   calls setFilter which immediately clears then repopulates m_symbols.
3. Use unordered_map rather than map in `VM`. It is faster and the order
   isn't significant.
4. Avoid multiple accesses to the map with `VM` in the common cases.
5. Optimize the common case within SharedCache::HeaderForAddress.
6. Change return type of SharedCache::HeaderForAddress to avoid copying
   SharedCacheMachOHeaders. It is a large type that is expensive to
   copy.
[immer](https://github.com/arximboldi/immer) provides persistent,
immutable data structures such as vectors and maps. These data
structures support passing by value without copying any data and
structural sharing to copy only a subset of data when a data structure
is mutated. immer is published under the Boost Software License which
should be compatible with its use in this context.

Using these data structures eliminates a lot of the unnecessary copying
of the shared cache's state when retrieving it from the view cache and
beginning to mutate it. Instead of all of the vectors and maps contained
within the state being copied, only the portions of the vectors or maps
that are mutated end up being copied.

The downside is that the APIs used when mutating are less ergonomic than
using the native C++ types.

The upside is that this cuts the time taken for the initial load and
analysis of a macOS shared cache to around 45 seconds (from 70 seconds
with the basic CoW implementation in Vector35#6129) and cuts the time taken to
load and analyze AppKit from 14 minutes to around 8.5 minutes.
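A minimal sketch of how these immer containers behave, purely illustrative of the structural-sharing point above (not code from the branch):

```cpp
#include <cstdio>
#include <immer/map.hpp>
#include <immer/vector.hpp>

int main()
{
	immer::vector<int> original;
	original = original.push_back(1).push_back(2);

	// "Copying" the state is cheap; no element data is duplicated.
	immer::vector<int> snapshot = original;

	// Mutation returns a new vector that shares structure with the old one.
	auto mutated = original.push_back(3);

	std::printf("snapshot has %zu elements, mutated has %zu\n",
		snapshot.size(), mutated.size());  // 2 and 3: the snapshot is unaffected

	immer::map<int, const char*> images;
	auto withEntry = images.set(42, "Foundation");  // returns a new map
	if (const char* const* name = withEntry.find(42))
		std::printf("image 42: %s\n", *name);
	return 0;
}
```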
This change is mostly motivated by simplifying the code, but it also
brings minor correctness and performance benefits.

1. The pointer returned by mmap is stored as a uint8_t* rather than
   void* as that is how it is used. This reduces how often it needs to
   be cast to a different type before it is used.
2. Read methods for primitives delegate to a new Read template function
   that in turn delegates to the general-purpose `Read(void* dest, size_t
   address, size_t length)`. This improves the consistency of bounds
   checking and simplifies the code. The compiler is more than willing
   to inline this so we get less repetition with no overhead.
3. ReadNullTermString now uses std::find to find the nul byte and
   directly constructs the string from that range of bytes. This removes
   an unnecessary allocation that was previously forced by the use of
   reserve followed by shrink_to_fit. It also avoids repeated
   reallocation for longer strings as they grew past the reserved size
   while being built up a character at a time.
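A minimal sketch of the delegation pattern items 2 and 3 describe (the class shape and bounds-check details are illustrative, not the plugin's exact code):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

// Illustrative reader over an mmapped buffer; the real class wraps mmap.
class MappedReader
{
	uint8_t* m_data = nullptr;
	size_t m_length = 0;

public:
	MappedReader(uint8_t* data, size_t length) : m_data(data), m_length(length) {}

	// General-purpose read with a single, centralized bounds check.
	void Read(void* dest, size_t address, size_t length)
	{
		if (address > m_length || length > m_length - address)
			throw std::out_of_range("read past end of mapped file");
		memcpy(dest, m_data + address, length);
	}

	// Primitive reads delegate to the general-purpose Read.
	template <typename T>
	T Read(size_t address)
	{
		T value;
		Read(&value, address, sizeof(T));
		return value;
	}

	uint32_t ReadUInt32(size_t address) { return Read<uint32_t>(address); }

	// Find the nul terminator, then construct the string from the whole range
	// at once instead of growing it a character at a time.
	std::string ReadNullTermString(size_t address)
	{
		if (address >= m_length)
			return {};
		uint8_t* begin = m_data + address;
		uint8_t* end = std::find(begin, m_data + m_length, 0);
		return std::string(begin, end);
	}
};
```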
@bdash (Owner, Author) commented Mar 5, 2025

I'm going to close this out. All of the changes proposed here have since been merged upstream, with the exception of the use of persistent data structures. That was dropped in favor of Vector35#6393.

@bdash bdash closed this Mar 5, 2025
@bdash bdash deleted the dsc-proposed branch March 5, 2025 06:11