[SharedCache] Use m_exportInfos as an export list cache#6197
Conversation
| return std::vector<std::pair<uint64_t, std::pair<BNSymbolType, std::string>>>(); | ||
|
|
||
| auto exportList = SharedCache::ParseExportTrie(linkeditFile, header); | ||
| std::vector<std::pair<uint64_t, std::pair<BNSymbolType, std::string>>> exportMapping(exportList.size()); |
There was a problem hiding this comment.
Since this is now the only caller of ParseExportTrie, it'd be beneficial to have it produce data in the format that this expects. This will eliminate the need for ParseExportTrie / ReadExportNode to allocate a bunch of Symbol instances only for this code to copy data out of them and then discard them. All those memory allocations and copying all of the names around is surprisingly expensive.
There was a problem hiding this comment.
Yes that was lazy of me, I'll add another commit to do that
| } | ||
|
|
||
|
|
||
| std::vector<std::pair<uint64_t, std::pair<BNSymbolType, std::string>>> SharedCache::GetExportListForHeader(SharedCacheMachOHeader header, std::function<std::shared_ptr<MMappedFileAccessor>()> provideLinkeditFile) |
There was a problem hiding this comment.
This returns a copy of this chonky vector each time it is called. If the thread-safety story permits it, it would be better to return a const reference instead.
It might be worth replacing this whacky nested std::pair with a struct with named fields that describe what they contain. That'd clear up a lot of the confusing pair.first / pair.second.first stuff. There's an ExportNode structure declared in this file that looks to be intended for this purpose, but is currently never used for anything.
There was a problem hiding this comment.
I think this should be possible as once a vector is created it shouldn't be reallocated by m_exportInfos?
There was a problem hiding this comment.
The thread safety problems arise when m_exportInfos is modified. If the hash table needs to grow it will reallocate and moves its contents. This would invalidate any existing references to data it stores.
One solution that doesn't require any additional locking would be to add a layer of indirection: have m_exportInfos store a std::unique_ptr to the vector. Then the vector itself will never move, only the pointer that owns it. You could then safely return a reference to the pointed-at vector from this function.
There was a problem hiding this comment.
I'm not sure using std::unique_ptr is pratical here because m_exportInfos in ViewStateCacheStore causes compiler errors due to copying mechanics when m_exportInfos has a unique_ptr type in its template. shared_ptr works fine and I don't see much of an issue in using that instead.
| auto exportList = GetExportListForHeader(*header, [&]() { | ||
| try { | ||
| auto mapping = MMappedFileAccessor::Open(m_dscView, m_dscView->GetFile()->GetSessionId(), header->exportTriePath)->lock(); | ||
| doSave = true; |
There was a problem hiding this comment.
Is this the only difference between the various "open a file" lambdas passed to GetExportListForHeader? If so you could simplify these call sites by having GetExportListForHeader take bool* didModifyExportList = nullptr. Callers that are interested in knowing that m_exportInfos was modified could pass a pointer to bool in. Others would omit it. This would let you move the file opening logic into GetExportListForHeader.
There was a problem hiding this comment.
I would have done this but the call in SharedCache::InitializeHeader has a different way of loading a mapped file.
bool* didModifyExportList = nullptr - I didn't do this just cause it seemed this is the only instance where it mattered so it seemed 50/50 whether the extra parameter made sense or what I did. For future stuff though it might make more sense to have the parameter so its clear to a caller that they need to do a save if they didn't know.
There was a problem hiding this comment.
Ah, I missed that SharedCache::InitializeHeader is doing something different. Bummer!
|
I think these latest commits fix all the issues @bdash mentioned in the comments. Also there maybe some race issues here with accessing |
| c.m_nonImageRegions = m_nonImageRegions; | ||
| c.m_baseFilePath = m_baseFilePath; | ||
| c.m_exportInfos = m_exportInfos; | ||
| //c.m_exportInfos = m_exportInfos; |
There was a problem hiding this comment.
I guess you commented this out while testing the std::unique_ptr approach.
A significant number of symbols are not being defined because there is currently no support for parsing the .symbols cache file. This commit adds that support. `m_symbolInfos` has been modified to be a vector of references to `Symbol`s which is similar to [this PR](Vector35#6197). It makes more sense for the use case where `m_symbolInfos` is used as a symbol cache, otherwise a bunch of time is spent transforming between the old style of `m_symbolInfos` entries to Binary Ninja `Symbol`s. This commit does require a metadata version bump. I felt this was necessary to determine which symbols to load from the symbols cache. The problem is that the `m_images` container does not store the images in the order they are found in the DSC. The index they are at determines the location of their symbols in the symbols cache file. Rather than converting `m_images` to a vector and relying on its ordering being correct, it seemed more prudent to store the index of the image in the `CacheImage` structure. As this is serialized, the metadata version has to be bumped to accomodate the change.
|
Thank you so much for this PR, this was the intended original functionality of the export trie list in memory! Glad to merge this sometime over the next couple of weeks, may need to be looked at in conjunction with the things listed on #6210. |
|
Hi! Would you be able to rebase this on the latest commit on dev? |
|
Yes, I'll sort that out now |
`SharedCache::ParseExportTrie` is getting called a lot during DSC library loading and analysis. In large part due to the hot path `SharedCache::FindSymbolAtAddrAndApplyToAddr`. Its unnecessary for it to be being called more than once per DSC header as the export list symbol information is stored in `SharedCache::m_exportInfos`. This commit adds the function `SharedCache::GetExportListForHeader`, which will either return the header's list of symbol information cached in `SharedCache::m_exportInfos` or call `SharedCache::ParseExportTrie` and cache the results in `SharedCache::m_exportInfos`. This should also improve the execution time of `SharedCache::LoadAllSymbolsAndWait`. Further improvement here would be to add locking to `SharedCache::GetExportListForHeader` so that races don't result in redundant parsing of the export trie for the same header if multiple threads call `SharedCache::GetExportListForHeader` at the same time for the same header. This only really matters during initial loading because from what I can tell that parses all the export trie's anyway.
…ck if `m_exportInfos` was modified This probably makes more sense than the current solution of using execution of the callback parameter to determine if `m_exportInfos` was modified.
This commit changes 2 things; 1. `m_exportInfos` is now a map where its values are also a map rather than a vector of pairs. The reason for this is that `SharedCache::FindSymbolAtAddrAndApplyToAddr` is a hot path which does by far the most accesses to `m_exportInfos`. In that function it must find the correct symbol for a given address so a map lookup will be much quicker than iterating a vector. The other use cases of `m_exportInfos` would prefer a vector but they are executed very infrequently. 2. The symbols are stored in `m_exportInfos` as references to the `Symbol` type. This makes more sense because otherwise there is a lot of time spent converting to and from a `Symbol` type and a pair of `BNSymbolType` + a `std::string`.
…::FindSymbolAtAddrAndApplyToAddr` if a symbol is found
This avoids expensive copying when returning a value from the map in `SharedCache::GetExportListForHeader`. Additionally it ensures that the value stays alive and at the same location in memory if `m_exportInfos` is modified and requires its storage to be re-allocated. I was unable to use a `unique_ptr` instead of a `shared_ptr` because of copy semantics with `m_exportInfos` in `ViewStateCacheStore`. I don't see things being any worse using `shared_ptr` instead of `unique_ptr` anyway and it means less code changes.
e0d623d to
0f0801a
Compare
SharedCache::ParseExportTrieis getting called a lot during DSC library loading and analysis. In large part due to the hot pathSharedCache::FindSymbolAtAddrAndApplyToAddr. Its unnecessary for it to be being called more than once per DSC header as the export list symbol information is stored inSharedCache::m_exportInfos.This commit adds the function
SharedCache::GetExportListForHeader, which will either return the header's list of symbol information cached inSharedCache::m_exportInfosor callSharedCache::ParseExportTrieand cache the results inSharedCache::m_exportInfos.This should also improve the execution time of
SharedCache::LoadAllSymbolsAndWait.Further improvement here would be to add locking to
SharedCache::GetExportListForHeaderso that races don't result in redundant parsing of the export trie for the same header if multiple threads callSharedCache::GetExportListForHeaderat the same time for the same header. This only really matters during initial loading of the first image because from what I can tell that results in parsing all the export trie's anyway.