[SharedCache] Use basic copy-on-write for viewStateCache#6129
Conversation
Thank you so much for your work on this!

Just wanted to give you a heads up on our plan for your PRs. We're in the process of releasing 4.2, and this won't make the cutoff. We're going to accept this PR after the release, so it might be a week or two before it gets accepted. Again, thanks for the great commits!
api/MetadataSerializable.hpp is removed in favor of including core/MetadataSerializable.hpp. Both headers defined types with the same name, leading to One Definition Rule violations and surprising behavior.

The serialization and deserialization contexts are now created on demand during serialization rather than being members of `MetadataSerializable`. This reduces the size of every serializable object by ~220 bytes. The context is passed explicitly as an argument to `Serialize` / `Deserialize`. As a result, `Serialize` / `Deserialize` can now be free functions rather than member functions.

Since `MetadataSerializable` is not used for dynamic dispatch, the virtual methods are removed and the class is updated to be a class template using CRTP. This allows delegating to the derived class's `Load` and `Store` methods without the size overhead of a vtable pointer in every serializable object.

These changes reduce the memory footprint of Binary Ninja after loading the macOS shared cache and loading a single dylib from it from 8.3GB to 4.6GB.
This ensures only one definition ends up in the final binary and makes compilation a little faster.
Building up an in-memory representation of the JSON document is expensive in both CPU and memory. Instead of doing that we can directly write the appropriate types.
[immer](https://github.com/arximboldi/immer) provides persistent, immutable data structures such as vectors and maps. These data structures support passing by value without copying any data and structural sharing to copy only a subset of data when a data structure is mutated. immer is published under the Boost Software License which should be compatible with its use in this context. Using these data structures eliminates a lot of the unnecessary copying of the shared cache's state when retrieving it from the view cache and beginning to mutate it. Instead of all of the vectors and maps contained within the state being copied, only the portions of the vectors or maps that are mutated end up being copied. The downside is that the APIs used when mutating are less ergonomic than using the native C++ types. The upside is that this cuts the time taken for the initial load and analysis of a macOS shared cache to around 45 seconds (from 70 seconds with the basic CoW implementation in Vector35#6129) and cuts the time taken to load and analyze AppKit from 14 minutes to around 8.5 minutes.
bdash@dsc-persistent-data-structures takes copy-on-write one step further by using persistent data structures from immer for the various vectors and maps that make up the in-memory state related to the shared cache. Its copy-on-write and structural sharing significantly reduce the amount of copying that occurs, cutting the time spent loading from the shared cache by nearly 50% vs this PR. It is a more invasive change that could benefit from some refactoring.
1. Continue to serialize the `cputype` / `cpusubtype` fields of `mach_header_64` as unsigned, despite them being signed. This preserves compatibility with the existing metadata version.
2. Add the `Serialize` declaration for the special `std::pair<uint64_t, std::pair<uint64_t, uint64_t>>` overload to the header. This ensures it will be favored over the generic `std::pair<First, Second>` template function and preserves the serialization used with the existing metadata version.
Copying the state from the cache into a new `SharedCache` object is done with a global lock held and is so expensive that it results in much of the shared cache analysis running on a single thread, with others blocked waiting to acquire the lock.

The cache now holds a `std::shared_ptr` to the state. New `SharedCache` objects take a reference to the cached state and only create their own copy of it the first time they perform an operation that would mutate it. The cached copy is never mutated, only replaced, so there is no danger of modifying the state out from under a `SharedCache` object. Since the copy happens at first mutation, it is performed without any global locks held. This avoids blocking other threads.

This cuts the initial load time of a macOS shared cache from 3 minutes to 70 seconds, and cuts the time taken to load and analyze AppKit from multiple hours to around 14 minutes.
Merged via 8024cbe3
The initial state is initialized during `PerformInitialLoad` and is immutable after that point. This required some slight restructuring of how information about memory regions is tracked, as that was previously modified as regions were loaded. Memory regions are now stored in a map from their address range to the `MemoryRegion` object. This makes it cheap to look them up by address, which is a common operation.

The modified state consists of changes since the last save to the `DSCView` / `ViewSpecificState`. This means it is no longer necessary to copy any state when mutating a `SharedCache` instance for the first time. Instead, its data structures start off empty and are populated as images, sections, or symbol information is loaded.

The loaded state consists of all modified state that has since been saved. It lives on the `ViewSpecificState`. Saving modified state merges it into the existing loaded state.

This pattern is carried over to the `Metadata` stored on the `DSCView`. The initial state is stored under its own metadata key, and each modified state is stored under a key with an incrementing number. This means each save of the state only needs to serialize the state that changed, rather than reserializing all of the state all of the time.

There are two huge benefits from these changes:

1. At no point does `SharedCache` have to copy its in-memory state. The basic copy-on-write approach introduced in Vector35#6129 reduced how often these copies are made, but they're still frequent and very expensive.
2. At no point does `SharedCache` have to re-serialize state to JSON that it has already serialized. JSON serialization previously added hundreds of milliseconds to any mutating operation on `SharedCache`.

As a result, this speeds up the initial load of the shared cache by around 2x, and loading of subsequent images improves by about the same.

One trade-off is that the serialization / deserialization logic is more complicated. There are two reasons for this:

1. The state is now split across multiple metadata keys and needs to be merged when it is loaded.
2. The in-memory representation uses pointers to identify memory regions. These relationships have to be re-established after the JSON is deserialized.

As a future direction, it is worth considering whether the logic owned by `SharedCache` could be split in a similar manner to the data. The initial loading of the cache header, loading of images, and handling of symbol information are all mostly independent and work on separate data. If the logic were split into separate classes, it would be easier to reason about which data is valid when, and it would easily permit concurrent loading of multiple images from the shared library in a thread-safe manner.
The process now consistently uses 8-14 CPU cores rather than being limited to 1 core.
A couple of notes:
Copying the `State` when performing the first mutation is still quite expensive. There are a number of large vectors and hash maps whose items are not cheap to copy. There's more scope for improving performance by copying only the parts of the state being mutated. I went down that path, but it was hard to enforce invariants (ensuring that everything we're mutating has been copied), and so it was harder to reason about correctness.