Skip to content

Serialization fixes#1

Closed
WeiN76LQh wants to merge 7 commits into
bdash:dsc-persistent-data-structuresfrom
WeiN76LQh:dsc-persistent-data-structures
Closed

Serialization fixes#1
WeiN76LQh wants to merge 7 commits into
bdash:dsc-persistent-data-structuresfrom
WeiN76LQh:dsc-persistent-data-structures

Conversation

@WeiN76LQh
Copy link
Copy Markdown

@WeiN76LQh WeiN76LQh commented Nov 18, 2024

There were a couple of issues with serialization causing assertion crashes. With these commits AFAICT the DSC can be saved to a .bndb and then that .bndb can be loaded again without any crashes. There still seem to be issues with doing that in a project if Binary Ninja is reloaded inbetween, but for a standalone .bndb it works fine.

bdash and others added 7 commits November 16, 2024 22:56
Copying the state from the cache into a new `SharedCache` object is done
with a global lock held and is so expensive that it results in much of
the shared cache analysis running on a single thread, with others
blocked waiting to acquire the lock.

The cache now holds a `std::shared_ptr` to the state. New `SharedCache`
objects take a reference to the cached state and only create their own
copy of it the first time they perform an operation that would mutate
it. The cached copy is never mutated, only replaced, so there is no
danger of modifying the state out from under a `SharedCache` object.
Since the copy happens at first mutation, it is performed without any
global locks held. This avoids blocking other threads.

This cuts the initial load time of a macOS shared cache from 3 minutes
to 70 seconds, and cuts the time taken to load and analyze AppKit from
multiple hours to around 14 minutes.
1. Use moves where possible to avoid unnecessary copies.
2. Remove redundant work within SymbolTableModel::updateSymbols. It
   calls setFilter which immediately clears then repopulates m_symbols.
3. Use unordered_map rather than map in `VM`. It is faster and the order
   isn't significant.
4. Avoid multiple accesses to the map with `VM` in the common cases.
5. Optimize the common case within SharedCache::HeaderForAddress.
6. Change return type of SharedCache::HeaderForAddress to avoid copying
   SharedCacheMachOHeaders. It is a large type that is expensive to
   copy.
[immer](https://github.com/arximboldi/immer) provides persistent,
immutable data structures such as vectors and maps. These data
structures support passing by value without copying any data and
structural sharing to copy only a subset of data when a data structure
is mutated. immer is published under the Boost Software License which
should be compatible with its use in this context.

Using these data structures eliminates a lot of the unnecessary copying
of the shared cache's state when retrieving it from the view cache and
beginning to mutate it. Instead of all of the vectors and maps contained
within the state being copied, only the portions of the vectors or maps
that are mutated end up being copied.

The downside is that the APIs used when mutating are less ergonomic than
using the native C++ types.

The upside is that this cuts the time taken for the initial load and
analysis of a macOS shared cache to around 45 seconds (from 70 seconds
with the basic CoW implementation in Vector35#6129) and cuts the time taken to
load and analyze AppKit from 14 minutes to around 8.5 minutes.
There was an issue when deserializing a `mach_header_64` where `cpusubtype` (and another field I think) were being deserialized to the wrong integer type and causing an assertion failure. I've added a macro to avoid these simple issues in the future.

Additionally there was another bug with deserializing `std::pair<uint64_t, std::pair<uint64_t, uint64_t>>` where it was not being serialized in an expected way. The deserialization has been corrected and the unused serialization function for it has been removed.
The `symbolInfos` field was not being serialized but on load it attempts to be deserialized, which fails because its missing. This adds in the serialization.
@bdash
Copy link
Copy Markdown
Owner

bdash commented Nov 19, 2024

Thanks for taking this branch for a test drive and tracking down these issues. I hadn't been testing with debug builds because I had hit assertion failures on a clean build of dev. I now see that was due to the missing symbolsIssue serialization.

It looks like there are three main things that this PR addresses:

  1. symbolsInfo is no longer serialized after 6bd0f07, and this causes an assertion failure in debug builds when the deserialization fails to find that member.

    Since that problem already exists in https://github.com/Vector35/binaryninja-api/tree/dev, it'd make sense to submit that as its own bug fix PR upstream.

  2. cputype / cpusubtype were previously being serialized / deserialized as if they were unsigned, despite being signed fields. My changes from [SharedCache] Switch to SAX-based writer API for JSON serialization Vector35/binaryninja-api#6139 resulted in them being serialized as signed, which triggers an assertion failure in debug builds with Deserialize attempts to deserialize them as unsigned.

    The right fix here is probably to update the Deserialize to look for a signed integer, matching the field type. This will require bumping the metadata version number. Alternatively, we could leave the version number alone and force the fields to be serialized as unsigned as they were prior to my changes.

  3. My attempt at preserving the custom serialization for std::pair<uint64_t, std::pair<uint64_t, uint64_t>> failed.

    This was a result of "cleanup" that moved the Serialize implementations into the .cpp file. I neglected to add the declaration of this custom function to the header so it was never selected when instantiating Serialize(…, immer::vector<std::pair<uint64_t, std::pair<uint64_t, uint64_t>>>).

    Declaring the function in the header is sufficient to fix this.

Additionally, there are the following changes:

  1. cff089f updates Deserialize for integer fields to use Uint / Int rather than Uint64 / Int64.

    Does this matter in practice? It looks totally fine to use the 64 variant to retrieve a value of a smaller size. It's only a problem if you go the other way, and try to use the non-64 variant functions when the parsed JSON contained a 64-bit value.

  2. The addition of a macro to be used during deserialization to automatically select the correct type to deserialize. I like the idea. Personally I'd implement this via a template function rather than a macro. Something like:

    template <typename T>
    void DeserializeTo(T& dest, const Value& value) {
        dest = value.Get<T>();
    }
    

    This seems like a good change that would avoid the signed vs unsigned mismatch issue that we saw with cputype / cpusubtype. It's probably best submitted as an independent PR once [SharedCache] Switch to SAX-based writer API for JSON serialization Vector35/binaryninja-api#6139 is merged.

For now what I plan to do tomorrow is to make the simple fixes I mentioned above for the cputype / cpusubtype and std::pair<uint64_t, std::pair<uint64_t, uint64_t>> cases in Vector35#6139, which is the PR that introduces those problems. This will result in me rebasing my branches that depend on that branch (dsc-cash-cow and dsc-persistent-data-structures).

I'd recommend you send a fix for the missing symbolsInfo serialization upstream.

Thanks again for testing out my changes. I'd love to hear what impact it had on performance and memory usage in your testing.

@WeiN76LQh
Copy link
Copy Markdown
Author

Yeah thats a solid summary of everything thats going. In regards to your question for additional change 1, that change was made because I assumed that would trigger an assertion failure and may not read the data out correctly but my understanding of the underlying library is basically none. There doesn't seem to be any code that exercises those small integer serialization paths anyway at the moment.

I initially tried out your first draft of the PR you sent to Vector35 and that did exactly as you said, massively improved core utilization and load times were actually pratical. I then tried out this branch, which meant I haven't tested on any of the commits inbetween.

In testing I'm currently trying to load around 100 libraries. From my basic observations (which may not be too accurate) CPU utilization might actually be slightly less, with this newest branch using immer, than the first PR you did, but from what I can tell load times are actually faster. IIRC there was a more consistent ~19/20 core load, where as with these latest commits it fluctuates alot between ~5-18 cores, just from eyeballing CPU utilization. Loading of the first 10 libraries in most cases was taking a couple of minutes max (obviously dependent on the libraries being loaded), so I think despite core utilization its actually faster.

In regards to memory utilization I can't remember well enough to know how much this branch has improved on your first draft PR. In either case, its again, much much better than the initial implementation by Vector35 (no hate intended). What I am seeing with memory utilization is that the amount being used per library is increasing with each library being loaded. With the first couple of library loads each one is using in the MBs of memory. Later on, when 20/30 have been loaded, its getting into the multiple GBs of additional memory usage everytime a library is loaded. I'm seeing 20-30 GBs of memory usage with ~20 libraries loaded, so its not likely going to be sustainable loading many more. Slower load and analysis times of additional libraries once many are loaded is probably linked to increases in memory usage. Potentially there is a data structure causing exponential memory usage? ~20 libraries being loaded results in a database thats 6GBs on disk. During loading of the database it balloons to using 27GBs of RAM. Most of that consumption seems to come during the deserialization stage.

One thing I didn't mention above is this is with analysis set to full, so the fact that, initially at least, libraries can be loaded and full analyzed in less than 2 minutes is awesome.

@WeiN76LQh
Copy link
Copy Markdown
Author

~20 libraries being loaded results in a database thats 6GBs on disk. During loading of the database it balloons to using 27GBs of RAM. Most of that consumption seems to come during the deserialization stage.

Actually this might be kind of wrong. Eventually that memory usage dips down to ~10GBs. Something doesn't seem to be getting cleared up immediately. I think it might be a Binary Ninja thing. Its not a particularly consistent problem; sometimes it spikes up and then immediately back down once fully loaded. ~10GBs of memory usage from a 6GB database doesn't seem unreasonable. There's still probably a scaling issue here given 10s of libraries loaded leads to that amount of memory usage.

@bdash
Copy link
Copy Markdown
Owner

bdash commented Nov 19, 2024

From my basic observations (which may not be too accurate) CPU utilization might actually be slightly less, with this newest branch using immer, than the first PR you did, but from what I can tell load times are actually faster. IIRC there was a more consistent ~19/20 core load, where as with these latest commits it fluctuates alot between ~5-18 cores, just from eyeballing CPU utilization. Loading of the first 10 libraries in most cases was taking a couple of minutes max (obviously dependent on the libraries being loaded), so I think despite core utilization its actually faster.

I've noticed this as well. I think the decrease in CPU utilization is that some of the work these changes eliminate was able to be done in parallel. The core analysis work that remains spends a lot of time blocked while waiting to acquire mutexes. This has the effect of reducing CPU utilization.

Some of this is due to API calls made by the shared cache code (BNGetAnalysisFunctionsForAddress, called while parsing the export trie). There may be some room for algorithmic improvement to reduce the number of calls and how long is spent blocked. In practice, most of the remaining work is in libbinaryninjacore.dylib. Since that's closed source there's not much to be done about it.

In testing I'm currently trying to load around 100 libraries.

I haven't looked at the behavior when loading many libraries in a row. I'll take a look to see if anything else jumps out at me in the shared cache code. It's possible that there's some poor scaling that remains.

@WeiN76LQh
Copy link
Copy Markdown
Author

I haven't looked at the behavior when loading many libraries in a row. I'll take a look to see if anything else jumps out at me in the shared cache code. It's possible that there's some poor scaling that remains.

I'm finding bulk loading libraries as well as saving a database, closing and reloading Binary Ninja and then continuing to bulk load libraries in the same database, is causing issues to crop up. Seems to be a relatively good way to stress test things and ultimately Binary Ninja needs to be able to do this.

From my perspective the ultimate test is can it load the entire of the DSC and within a reasonable time frame (sub 1 week). This is something IDA can actually do, although obviously the UI is fucked if you open up the resulting database in it. I know Binary Ninja and IDA are kind of an apples to orange comparison but I think it would be good for Vector35 to aim for that as an end result. If it can't do that then its unlikely its going to be able to load a couple 100 libraries and I think loading a couple 100 libraries is not an unreasonable expectation.

@bdash bdash force-pushed the dsc-persistent-data-structures branch from 5e765f7 to 2decb89 Compare November 21, 2024 02:18
@bdash
Copy link
Copy Markdown
Owner

bdash commented Nov 21, 2024

FYI, I pushed fixes for the cputype / cpusubtype and std::pair<uint64_t, std::pair<uint64_t, uint64_t>> cases to Vector35#6139 and rebased Vector35#6129 and dsc-persistent-data-structures on top of it.

@WeiN76LQh WeiN76LQh closed this by deleting the head repository Nov 21, 2024
@bdash
Copy link
Copy Markdown
Owner

bdash commented Nov 21, 2024

It looks like even if the shared cache loading performance is improved, analyzing a large portion of the shared cache won't be practical without significant performance improvements from Vector35 to the core analysis.

I tried opening all ~350 public frameworks in the macOS shared cache last night. It took close to an hour and around 35GB of RAM to do the initial load from the dyld shared cache, with all analysis deferred. The UI spent a lot of time unresponsive during the initial loading due to Vector35#6169.

The analysis ran after the final library was loaded and was still running this morning, around 10 hours later. RAM usage had risen to around 75GB. My machine has 128GB of RAM so swapping wasn't responsible for the slow analysis. Sadly Binary Ninja crashed deep inside core analysis logic a little while after I checked on it :-(

@WeiN76LQh
Copy link
Copy Markdown
Author

Yes my experience is somewhat similar. I managed to load ~150 libraries with analysis on hold in ~15 minutes. I think it used ~30GB but can't quite remember. IIRC full analysis completed in ~6 hours with RAM usage at ~70GB. Thats at least doable, although that wasn't with your latest commits (from last night) or mine from PR 6172. I'm going to do another run tonight with all the latest stuff and see what happens. It should at least be usable at that point.

I had the same issues with the UI. The load this is putting on Binary Ninja is definitely unearthing UI freezing issues as well as RAM usage. Even after all the analysis had completed it would freeze every now and then through usage. More usable than IDA in that regard though. Like you say, without further internal fixes analyzing a large portion of the shared cache won't be practical. I think with your performance improvements its practical from a time duration perspective (maybe sub 1 week for full analysis) but the RAM usage is too problematic at this point.

I also experience crashes but fairly rarely when analyzing the DSC by itself. As soon as I do it within a project it crashes all the time, hence this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants