Serialization fixes#1
Conversation
Copying the state from the cache into a new `SharedCache` object is done with a global lock held and is so expensive that it results in much of the shared cache analysis running on a single thread, with others blocked waiting to acquire the lock. The cache now holds a `std::shared_ptr` to the state. New `SharedCache` objects take a reference to the cached state and only create their own copy of it the first time they perform an operation that would mutate it. The cached copy is never mutated, only replaced, so there is no danger of modifying the state out from under a `SharedCache` object. Since the copy happens at first mutation, it is performed without any global locks held. This avoids blocking other threads. This cuts the initial load time of a macOS shared cache from 3 minutes to 70 seconds, and cuts the time taken to load and analyze AppKit from multiple hours to around 14 minutes.
1. Use moves where possible to avoid unnecessary copies. 2. Remove redundant work within SymbolTableModel::updateSymbols. It calls setFilter which immediately clears then repopulates m_symbols. 3. Use unordered_map rather than map in `VM`. It is faster and the order isn't significant. 4. Avoid multiple accesses to the map with `VM` in the common cases. 5. Optimize the common case within SharedCache::HeaderForAddress. 6. Change return type of SharedCache::HeaderForAddress to avoid copying SharedCacheMachOHeaders. It is a large type that is expensive to copy.
[immer](https://github.com/arximboldi/immer) provides persistent, immutable data structures such as vectors and maps. These data structures support passing by value without copying any data and structural sharing to copy only a subset of data when a data structure is mutated. immer is published under the Boost Software License which should be compatible with its use in this context. Using these data structures eliminates a lot of the unnecessary copying of the shared cache's state when retrieving it from the view cache and beginning to mutate it. Instead of all of the vectors and maps contained within the state being copied, only the portions of the vectors or maps that are mutated end up being copied. The downside is that the APIs used when mutating are less ergonomic than using the native C++ types. The upside is that this cuts the time taken for the initial load and analysis of a macOS shared cache to around 45 seconds (from 70 seconds with the basic CoW implementation in Vector35#6129) and cuts the time taken to load and analyze AppKit from 14 minutes to around 8.5 minutes.
There was an issue when deserializing a `mach_header_64` where `cpusubtype` (and another field I think) were being deserialized to the wrong integer type and causing an assertion failure. I've added a macro to avoid these simple issues in the future. Additionally there was another bug with deserializing `std::pair<uint64_t, std::pair<uint64_t, uint64_t>>` where it was not being serialized in an expected way. The deserialization has been corrected and the unused serialization function for it has been removed.
The `symbolInfos` field was not being serialized but on load it attempts to be deserialized, which fails because its missing. This adds in the serialization.
|
Thanks for taking this branch for a test drive and tracking down these issues. I hadn't been testing with debug builds because I had hit assertion failures on a clean build of It looks like there are three main things that this PR addresses:
Additionally, there are the following changes:
For now what I plan to do tomorrow is to make the simple fixes I mentioned above for the I'd recommend you send a fix for the missing Thanks again for testing out my changes. I'd love to hear what impact it had on performance and memory usage in your testing. |
|
Yeah thats a solid summary of everything thats going. In regards to your question for additional change 1, that change was made because I assumed that would trigger an assertion failure and may not read the data out correctly but my understanding of the underlying library is basically none. There doesn't seem to be any code that exercises those small integer serialization paths anyway at the moment. I initially tried out your first draft of the PR you sent to Vector35 and that did exactly as you said, massively improved core utilization and load times were actually pratical. I then tried out this branch, which meant I haven't tested on any of the commits inbetween. In testing I'm currently trying to load around 100 libraries. From my basic observations (which may not be too accurate) CPU utilization might actually be slightly less, with this newest branch using immer, than the first PR you did, but from what I can tell load times are actually faster. IIRC there was a more consistent ~19/20 core load, where as with these latest commits it fluctuates alot between ~5-18 cores, just from eyeballing CPU utilization. Loading of the first 10 libraries in most cases was taking a couple of minutes max (obviously dependent on the libraries being loaded), so I think despite core utilization its actually faster. In regards to memory utilization I can't remember well enough to know how much this branch has improved on your first draft PR. In either case, its again, much much better than the initial implementation by Vector35 (no hate intended). What I am seeing with memory utilization is that the amount being used per library is increasing with each library being loaded. With the first couple of library loads each one is using in the MBs of memory. Later on, when 20/30 have been loaded, its getting into the multiple GBs of additional memory usage everytime a library is loaded. I'm seeing 20-30 GBs of memory usage with ~20 libraries loaded, so its not likely going to be sustainable loading many more. Slower load and analysis times of additional libraries once many are loaded is probably linked to increases in memory usage. Potentially there is a data structure causing exponential memory usage? ~20 libraries being loaded results in a database thats 6GBs on disk. During loading of the database it balloons to using 27GBs of RAM. Most of that consumption seems to come during the deserialization stage. One thing I didn't mention above is this is with analysis set to |
Actually this might be kind of wrong. Eventually that memory usage dips down to ~10GBs. Something doesn't seem to be getting cleared up immediately. I think it might be a Binary Ninja thing. Its not a particularly consistent problem; sometimes it spikes up and then immediately back down once fully loaded. ~10GBs of memory usage from a 6GB database doesn't seem unreasonable. There's still probably a scaling issue here given 10s of libraries loaded leads to that amount of memory usage. |
I've noticed this as well. I think the decrease in CPU utilization is that some of the work these changes eliminate was able to be done in parallel. The core analysis work that remains spends a lot of time blocked while waiting to acquire mutexes. This has the effect of reducing CPU utilization. Some of this is due to API calls made by the shared cache code (
I haven't looked at the behavior when loading many libraries in a row. I'll take a look to see if anything else jumps out at me in the shared cache code. It's possible that there's some poor scaling that remains. |
I'm finding bulk loading libraries as well as saving a database, closing and reloading Binary Ninja and then continuing to bulk load libraries in the same database, is causing issues to crop up. Seems to be a relatively good way to stress test things and ultimately Binary Ninja needs to be able to do this. From my perspective the ultimate test is can it load the entire of the DSC and within a reasonable time frame (sub 1 week). This is something IDA can actually do, although obviously the UI is fucked if you open up the resulting database in it. I know Binary Ninja and IDA are kind of an apples to orange comparison but I think it would be good for Vector35 to aim for that as an end result. If it can't do that then its unlikely its going to be able to load a couple 100 libraries and I think loading a couple 100 libraries is not an unreasonable expectation. |
5e765f7 to
2decb89
Compare
|
FYI, I pushed fixes for the |
|
It looks like even if the shared cache loading performance is improved, analyzing a large portion of the shared cache won't be practical without significant performance improvements from Vector35 to the core analysis. I tried opening all ~350 public frameworks in the macOS shared cache last night. It took close to an hour and around 35GB of RAM to do the initial load from the dyld shared cache, with all analysis deferred. The UI spent a lot of time unresponsive during the initial loading due to Vector35#6169. The analysis ran after the final library was loaded and was still running this morning, around 10 hours later. RAM usage had risen to around 75GB. My machine has 128GB of RAM so swapping wasn't responsible for the slow analysis. Sadly Binary Ninja crashed deep inside core analysis logic a little while after I checked on it :-( |
|
Yes my experience is somewhat similar. I managed to load ~150 libraries with analysis on hold in ~15 minutes. I think it used ~30GB but can't quite remember. IIRC full analysis completed in ~6 hours with RAM usage at ~70GB. Thats at least doable, although that wasn't with your latest commits (from last night) or mine from PR 6172. I'm going to do another run tonight with all the latest stuff and see what happens. It should at least be usable at that point. I had the same issues with the UI. The load this is putting on Binary Ninja is definitely unearthing UI freezing issues as well as RAM usage. Even after all the analysis had completed it would freeze every now and then through usage. More usable than IDA in that regard though. Like you say, without further internal fixes analyzing a large portion of the shared cache won't be practical. I think with your performance improvements its practical from a time duration perspective (maybe sub 1 week for full analysis) but the RAM usage is too problematic at this point. I also experience crashes but fairly rarely when analyzing the DSC by itself. As soon as I do it within a project it crashes all the time, hence this issue. |
There were a couple of issues with serialization causing assertion crashes. With these commits AFAICT the DSC can be saved to a .bndb and then that .bndb can be loaded again without any crashes. There still seem to be issues with doing that in a project if Binary Ninja is reloaded inbetween, but for a standalone .bndb it works fine.