[SharedCache] All proposed shared cache improvements #2
Conversation
Wasn't sure where to put this information, but considering these are the majority of changes I'm testing the DSC plugin with, I thought I'd mention here some other areas where there are performance issues that I've found. I'm not an expert on C++ or sampling execution times, so I timed things by using
I think that whilst the lock contention is a major issue, it might be exacerbated by excessive calls to
This method serializes the entire in-memory state to a giant JSON string. With the macOS cache I've been testing with, this results in around 500MB of JSON data and takes around 400-500ms. I have been experimenting with replacing the JSON serialization with FlatBuffers, as that should be more efficient in terms of the size of the resulting data, the time it takes to generate it, and the temporary memory allocations during serialization. It cuts the data size and serialization time in half, but fundamentally it's always going to be slow to serialize this much state. The real solution is to not serialize this data as often. Sadly I don't really understand why it was implemented the way it is, so I'm not sure of the implications of changing that.
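One way to avoid serializing on every change, sketched very roughly below: mark the state dirty on mutation and rebuild the blob lazily, only when something actually asks for it. `State`, `CachedSerializer`, and `SerializeState` are hypothetical stand-ins for illustration, not the plugin's real types.

```cpp
#include <mutex>
#include <string>

struct State
{
	int imageCount = 0;  // stands in for the real in-memory state
};

class CachedSerializer
{
public:
	void Mutate(State newState)
	{
		std::lock_guard<std::mutex> lock(m_mutex);
		m_state = newState;
		m_dirty = true;  // cheap: nothing is serialized on the hot path
	}

	std::string Serialized()
	{
		std::lock_guard<std::mutex> lock(m_mutex);
		if (m_dirty)
		{
			m_cached = SerializeState(m_state);  // expensive, but now rare
			m_dirty = false;
		}
		return m_cached;
	}

private:
	static std::string SerializeState(const State& s)
	{
		return "{\"imageCount\":" + std::to_string(s.imageCount) + "}";
	}

	std::mutex m_mutex;
	State m_state;
	std::string m_cached;
	bool m_dirty = true;
};
```

The serialization cost then scales with how often the blob is read rather than how often the state changes.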
I also see that when loading an iOS shared cache. It barely shows up with a macOS shared cache. I'm not sure whether it's possible to make the locking finer grained to improve parallelism, since there's not much information about what data the locks protect and why. One thing I will look at is whether the work done with the lock held can be made faster.
That's a good point. A lot of work is done with the |
Yeah, that's kind of the same conclusion I've come to. I'm guessing it's about saving state in the database using the metadata API. This would also be useful to plugin makers for accessing information about the DSC from the metadata. Tbh it probably doesn't matter a huge amount (given other issues), but it does add up if you were to bulk load 100s of images.
I've just been working on this and have a solution, although it's kind of messy due to my lack of C++ skills. Basically I had to implement a global lock (per DSC view) for each memory region. If a memory region is not loaded then its respective global lock has to be acquired. The global state is updated to the latest copy and then another check is done to see if the region has been loaded whilst acquiring the lock. If it has then bail, otherwise grab the

I'm finding there are a number of
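The per-region double-checked locking described above might look roughly like this. `Region`, `RegionLoader`, and the loading step are hypothetical stand-ins, not the plugin's actual types; the point is that loads of different regions can proceed in parallel while the global state mutex is only held briefly.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct Region
{
	bool loaded = false;  // the mapped data would live here too
};

class RegionLoader
{
public:
	Region& EnsureLoaded(uint64_t base)
	{
		std::unique_lock<std::mutex> state(m_stateMutex);
		Region& region = m_regions[base];
		if (region.loaded)
			return region;  // fast path: no per-region lock needed
		std::mutex& regionMutex = m_regionMutexes[base];
		state.unlock();

		// Acquire this region's lock; another thread may load it meanwhile.
		std::lock_guard<std::mutex> load(regionMutex);
		state.lock();
		if (region.loaded)
			return region;  // loaded while we waited for the lock: bail
		state.unlock();

		// ... the expensive mapping/parsing work happens here, without
		// holding the global state mutex ...

		state.lock();
		region.loaded = true;
		return region;
	}

private:
	std::mutex m_stateMutex;
	std::unordered_map<uint64_t, Region> m_regions;            // global state
	std::unordered_map<uint64_t, std::mutex> m_regionMutexes;  // per-region locks
};
```

Note the consistent lock ordering (state mutex is never held while waiting on a region mutex), which avoids deadlock between concurrent loaders.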
Are you loading a single additional image at a time, or multiple at once via the Python API? If the latter, could you please share a Python snippet so I can also take a look at the behavior in that case?
I'm just loading single images at a time using the UI |
I'm currently looking at
I'd missed that
It actually gets more wild. Using

For
Do you want me to do pull requests against your
Feel free to send a pull request. I'd love to see what you've come up with! I might end up rebasing it so it applies cleanly to

I have personally been trying to stick to branches that apply cleanly to

It does mean a bunch of extra work, though. I develop the changes against this branch, then have to port them back to
I did some heap profiling and it seems that when Binary Ninja is consuming 30GB of RAM due to loading 30 libraries, 90% of the RAM usage is coming from core Binary Ninja API allocations. So, like I think you mentioned before, we're at the mercy of Vector35 solving that issue. The biggest chunk of RAM usage from the DSC plugin is the symbol export lists, which will naturally consume 1-2 GB because of the number of strings, so I don't think anything worthwhile can be done there.

I also did some more performance testing. During image loading the bulk of the execution time is in

During the analysis phase most of the execution time is in
I think I'll personally hold off on trying to optimize anything further until we start seeing PRs merged upstream. As it stands it is now fast enough for my typical usage (opening a shared cache to poke at the couple of libraries I happened to be interested in at that moment, then throwing it all away). Once changes start to get merged and I'm not trying to work across a giant stack of changes like this I might revisit things again, just because making things faster is a lot of fun. I'll merge the remaining PRs we've each submitted upstream into this branch, and I'll continue to keep an eye out for any other improvements you make so I can merge them here. That will ensure that there's an easy way for anyone that's interested to get a working, fast shared cache plug-in.
Yes, I agree; I'm pretty much in the same boat. I appreciate all the work you've done to make working with the DSC in Binary Ninja as performant as it is with this branch.
Not sure what happened between this comment and now, but I'm seeing pretty major performance regressions using this branch.
I've been consistently seeing times of 2-3 minutes for Foundation. The only time I've seen anything under 30 seconds it was due to a bug resulting in a bunch of analysis work being skipped. |
I think I know what's going on. Using

What I don't understand is why
Yeah, skipping
Yes, I cleaned up that function a little bit as well. The
`SharedCache::m_symbolInfos` isn't being serialized, but there is an attempt to deserialize it. This commit adds the code to serialize it. I slightly modified the format because I didn't really understand how it was expected to be serialized based on the deserialization code. The deserialization code looked wrong to me, but it's likely a misunderstanding on my part. I kept it similar to how `m_exportInfos` is serialized.
…ad v3 symbol list

We should think through a better way of handling upgrades, as v3 can load v2 dbs just fine, but there is no clean way to upgrade the info currently. We need SharedCache ser/des functions in DSCView.cpp
From what I can tell, the lock taken in `SharedCache::~SharedCache` was purely for the decrement of `sharedCacheReferences`; however, it's an atomic, so a lock isn't necessary. The lock is extremely contended and therefore often slow to acquire. This resulted in a surprising amount of execution time being spent in the `SharedCache` destructor. Nothing hugely significant, but removing this single line of code is a quick and easy win.
Does not seem to do anything other than cause a crash!
…ion value in an mlil dataflow query
…by `BNDSCViewGetAllImages`
…criptor-based classes.
The existing view-specific state was stored in several global unordered maps. Many of these were accessed without locking, including `viewSpecificMutexes`, which is racy in the face of multiple threads. View-specific state is stored in a new heap-allocated `ViewSpecificState` struct that is reference counted via `std::shared_ptr`. A static map holds a `std::weak_ptr` to each view-specific state, keyed by session id. `SharedCache` retrieves its view-specific state during its constructor. Since `ViewSpecificState` is reference counted it will naturally be deallocated when the last `SharedCache` instance that references it goes away. Its corresponding entry will remain in the static map, though since it only holds a `std::weak_ptr` rather than any state it will not use much memory. The next time view-specific state is retrieved any expired entries will be removed from the map.
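The registry described above could be sketched like this, assuming hypothetical names (`ViewSpecificState`, `ViewSpecificStateForSession`): a static map of `std::weak_ptr` keyed by session id, with expired entries pruned on the next lookup.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>

struct ViewSpecificState
{
	std::mutex stateMutex;  // example of per-view state that needs sharing
};

std::shared_ptr<ViewSpecificState> ViewSpecificStateForSession(uint64_t sessionId)
{
	static std::mutex mapMutex;
	static std::map<uint64_t, std::weak_ptr<ViewSpecificState>> states;

	std::lock_guard<std::mutex> lock(mapMutex);
	if (auto existing = states[sessionId].lock())
		return existing;  // state for this session is still alive

	auto state = std::make_shared<ViewSpecificState>();
	states[sessionId] = state;

	// Drop entries whose state has already been deallocated; they cost
	// little memory, but there is no reason to let them accumulate.
	for (auto it = states.begin(); it != states.end();)
	{
		if (it->second.expired())
			it = states.erase(it);
		else
			++it;
	}
	return state;
}
```

Because the map holds only `std::weak_ptr`s, the state itself is deallocated as soon as the last `shared_ptr` owner goes away, exactly as the commit message describes.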
They're surprisingly expensive to look up.
`BackingCache` now tracks the `dyld_cache_mapping_info` for its mappings so it has access to the memory protections for the region. This means it can avoid marking some regions as containing code when they don't, reducing the amount of analysis work that has to be done. Using `dyld_cache_mapping_info` also makes references to mappings easier to understand due to its named fields vs the nested `std::pair`s that were previously in use.
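The struct layout below matches `dyld_cache_mapping_info` from dyld's cache format headers; the helper function is a hypothetical sketch of the idea of using the protection bits to decide whether a mapping needs code analysis at all.

```cpp
#include <cstdint>

// Mirrors the mapping record in the dyld shared cache header.
struct dyld_cache_mapping_info
{
	uint64_t address;
	uint64_t size;
	uint64_t fileOffset;
	uint32_t maxProt;
	uint32_t initProt;
};

constexpr uint32_t VM_PROT_EXECUTE = 0x4;

bool MappingMayContainCode(const dyld_cache_mapping_info& mapping)
{
	// Only mappings with the execute bit set can contain code worth
	// analyzing; read-only or data mappings can be skipped.
	return (mapping.initProt & VM_PROT_EXECUTE) != 0;
}
```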
Find the relative selector base address in the Objective-C optimization data pointed to by the shared cache header, rather than via `__objc_scoffs`, which is only present on iOS, and not for every iOS version that encodes selectors via direct offsets. This also includes some related improvements:
1. Direct selectors get their own pointer type so they're rendered correctly in the view.
2. Method lists encoded as lists of lists are now handled.
3. The `dyld_cache_header` type added to the view is truncated to the length in the loaded cache. This ensures it is applied to the view.
4. A couple of methods that process method IMPs and selectors are updated to check whether the address is valid before attempting to process them. They would otherwise fail by throwing an exception if they proceeded, but checking for validity is quicker and makes exception breakpoints usable.
…eser, fix compilation issue on linux
…tion entry

Also fix type clobbering on anonymous class names
1. Use moves where possible to avoid unnecessary copies.
2. Remove redundant work within SymbolTableModel::updateSymbols. It calls setFilter, which immediately clears then repopulates m_symbols.
3. Use unordered_map rather than map in `VM`. It is faster and the order isn't significant.
4. Avoid multiple accesses to the map within `VM` in the common cases.
5. Optimize the common case within SharedCache::HeaderForAddress.
6. Change the return type of SharedCache::HeaderForAddress to avoid copying SharedCacheMachOHeaders. It is a large type that is expensive to copy.
[immer](https://github.com/arximboldi/immer) provides persistent, immutable data structures such as vectors and maps. These data structures support passing by value without copying any data and structural sharing to copy only a subset of data when a data structure is mutated. immer is published under the Boost Software License which should be compatible with its use in this context. Using these data structures eliminates a lot of the unnecessary copying of the shared cache's state when retrieving it from the view cache and beginning to mutate it. Instead of all of the vectors and maps contained within the state being copied, only the portions of the vectors or maps that are mutated end up being copied. The downside is that the APIs used when mutating are less ergonomic than using the native C++ types. The upside is that this cuts the time taken for the initial load and analysis of a macOS shared cache to around 45 seconds (from 70 seconds with the basic CoW implementation in Vector35#6129) and cuts the time taken to load and analyze AppKit from 14 minutes to around 8.5 minutes.
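This is not immer's API, but the core idea of passing containers by value without copying their contents can be illustrated with a minimal copy-on-write wrapper over `std::shared_ptr` (all names here are hypothetical). immer goes further than this sketch: its structural sharing copies only the mutated portion of a vector or map rather than the whole container on first write.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

class CowVector
{
public:
	CowVector() : m_data(std::make_shared<std::vector<int>>()) {}

	size_t size() const { return m_data->size(); }
	int at(size_t i) const { return m_data->at(i); }
	bool SharesStorageWith(const CowVector& other) const
	{
		return m_data == other.m_data;
	}

	void PushBack(int value)
	{
		// Copy the backing storage only when it is actually shared;
		// unshared instances mutate in place with no copying at all.
		if (m_data.use_count() > 1)
			m_data = std::make_shared<std::vector<int>>(*m_data);
		m_data->push_back(value);
	}

private:
	std::shared_ptr<std::vector<int>> m_data;
};
```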
This change is mostly motivated by simplifying the code, but it also brings minor correctness and performance benefits.
1. The pointer returned by mmap is stored as a uint8_t* rather than void*, as that is how it is used. This reduces how often it needs to be cast to a different type before it is used.
2. Read methods for primitives delegate to a new Read template function that in turn delegates to the general-purpose `Read(void* dest, size_t address, size_t length)`. This improves the consistency of bounds checking and simplifies the code. The compiler is more than willing to inline this, so we get less repetition with no overhead.
3. ReadNullTermString now uses std::find to find the NUL byte and directly constructs the string from that range of bytes. This removes an unnecessary allocation that was previously being forced by the use of reserve followed by shrink_to_fit. It also avoids repeated reallocation for longer strings as they grew past the reserved size while being built up a character at a time.
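Point 3 above can be sketched as follows, assuming a plain in-memory buffer stands in for the mmapped file (the real reader's signature may differ): find the NUL terminator first, then construct the string from the whole range in a single allocation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

std::string ReadNullTermString(const uint8_t* data, size_t length, size_t offset)
{
	if (offset >= length)
		return {};
	const uint8_t* begin = data + offset;
	const uint8_t* end = data + length;
	// Locate the NUL (or the end of the buffer if none is found), then
	// build the string from the full range at once: one allocation,
	// no per-character appends or reallocations.
	const uint8_t* nul = std::find(begin, end, uint8_t(0));
	return std::string(begin, nul);
}
```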
I'm going to close this out. All of the changes proposed here have since been merged upstream, with the exception of the use of persistent data structures. That was dropped in favor of Vector35#6393.
This PR exists to document the `dsc-proposed` branch. This branch aggregates a number of pending improvements related to the dyld shared cache plug-in from https://github.com/Vector35/binaryninja-api. It is intended to provide a single branch that can be built to take advantage of a number of significant bug fixes and performance improvements that have not yet been merged upstream.

This branch will periodically be rebased and its contents will change as pull requests are merged upstream or new changes are made.
Included on this branch
Bug fixes:
- Serialize `SharedCache::m_symbolInfos` Vector35/binaryninja-api#6174 by WeiN76LQh
- Fix `DSCObjCProcessor::PostProcessObjCSections` and improve Objective-C processing Vector35/binaryninja-api#6198 by WeiN76LQh

Performance improvements:
Changes for which no pull requests have been created at this time:
Please base further improvements on `dev` rather than this branch when possible to avoid problems when this branch is rebased. Pull requests should be submitted against the upstream repository unless you're fixing an issue introduced in a change that has not yet been merged.