Add built flag and fallback path for unbuilt tries in Search/Has#6
Add built flag and fallback path for unbuilt tries in Search/Has#6WJiangzhi wants to merge 2 commits into
built flag and fallback path for unbuilt tries in Search/Has#6Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request improves the robustness of the Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a built flag to the Search and Has classes to allow them to be used without an explicit call to Build(). However, a critical security vulnerability exists where the built flag is set at the start of the Build() method; if an exception occurs during the build, the trie can be left in an incomplete state while the flag is true, potentially leading to out-of-bounds memory access in Do() methods. Additionally, the fallback logic for un-built tries in Search::Do and Has::Do is flawed, as it incorrectly uses uninitialized fail links, resulting in an incorrect Aho-Corasick search. The absence of automated tests for this functionality is also a concern.
| if (!built) { | ||
| while (state != 0 && trie[state].next[c] == -1) { | ||
| state = trie[state].fail; | ||
| } | ||
| if (trie[state].next[c] != -1) { | ||
| state = trie[state].next[c]; | ||
| } else { | ||
| state = 0; | ||
| } | ||
| } else { | ||
| state = 0; | ||
| state = trie[state].next[c]; | ||
| } |
There was a problem hiding this comment.
The fallback logic for when the trie is not built (!built) is incorrect. It relies on fail links (trie[state].fail) to traverse the automaton, but these links are only computed within the Build() method. Before Build() is called, all fail links are 0 (as initialized by the Node constructor).
This means the !built path behaves as follows:
while (state != 0 && trie[state].next[c] == -1) {
state = 0; // Since trie[state].fail is 0
}
if (trie[state].next[c] != -1) {
state = trie[state].next[c];
} else {
state = 0;
}This degenerates the Aho-Corasick search into a simple trie search that resets to the root on a mismatch. This will fail to find many patterns that an Aho-Corasick automaton is designed to find (e.g., overlapping patterns).
For example, with patterns "ab" and "bc", and input "abc", it will find "ab" but miss "bc".
This is a critical correctness issue. The fallback path does not function as a proper Aho-Corasick search. The same issue exists in Has::Do.
| if (!built) { | ||
| while (state != 0 && trie[state].next[c] == -1) { | ||
| state = trie[state].fail; | ||
| } | ||
| if (trie[state].next[c] != -1) { | ||
| state = trie[state].next[c]; | ||
| } else { | ||
| state = 0; | ||
| } | ||
| } else { | ||
| state = trie[state].next[c]; | ||
| } |
There was a problem hiding this comment.
| if (!built) { | ||
| while (state != 0 && trie[state].next[c] == -1) { | ||
| state = trie[state].fail; | ||
| } | ||
| if (trie[state].next[c] != -1) { | ||
| state = trie[state].next[c]; | ||
| } else { | ||
| state = 0; | ||
| } | ||
| } else { | ||
| state = 0; | ||
| state = trie[state].next[c]; | ||
| } |
There was a problem hiding this comment.
The fallback logic for when the trie is not built (!built) is incorrect. It relies on fail links (trie[state].fail) which are not computed until Build() is called. Before Build(), fail links are 0, causing the search to behave like a simple trie search and miss matches. This is a critical correctness issue. For a more detailed explanation, please refer to the comment on the same issue in ZACLib/ZACLib.cpp.
| if (!built) { | ||
| while (state != 0 && trie[state].next[c] == -1) { | ||
| state = trie[state].fail; | ||
| } | ||
| if (trie[state].next[c] != -1) { | ||
| state = trie[state].next[c]; | ||
| } else { | ||
| state = 0; | ||
| } | ||
| } else { | ||
| state = trie[state].next[c]; | ||
| } |
| } | ||
|
|
||
| void Search::Build() { | ||
| built = true; |
There was a problem hiding this comment.
The built flag is set to true at the beginning of the Build() method. If an exception occurs during the execution of Build() (e.g., std::bad_alloc when allocating the queue or during trie traversal), the built flag will remain true even though the trie has not been fully constructed. Subsequent calls to Do() will then use the direct next lookup path, which assumes all next entries are valid indices. Since the trie is incomplete, some next entries may still be -1, leading to an out-of-bounds access when trie[state] is called with state = -1.
void Search::Build() {
std::queue<int> q;
for (int c = 0; c < 256; ++c) {
int nxt = trie[0].next[c];
if (nxt != -1) {
trie[nxt].fail = 0;
q.push(nxt);
} else trie[0].next[c] = 0;
}
while (!q.empty()) {
const int u = q.front();
q.pop();
for (int c = 0; c < 256; ++c) {
int v = trie[u].next[c];
if (v != -1) {
trie[v].fail = trie[trie[u].fail].next[c];
int f = trie[v].fail;
if (trie[f].match_len > trie[v].match_len) {
trie[v].match_len = trie[f].match_len;
trie[v].output_id = trie[f].output_id;
}
q.push(v);
} else {
trie[u].next[c] = trie[trie[u].fail].next[c];
}
}
}
built = true;
}| } | ||
|
|
||
| void Has::Build() { | ||
| built = true; |
There was a problem hiding this comment.
Similar to the Search::Build implementation, the built flag is set to true before the trie is actually built. If the build process fails, the object is left in an inconsistent state where built is true but the trie contains -1 entries, leading to out-of-bounds access in Do().
void Has::Build() {
std::queue<int> q;
for (int c = 0; c < 256; ++c) {
int nxt = trie[0].next[c];
if (nxt != -1) {
trie[nxt].fail = 0;
q.push(nxt);
} else {
trie[0].next[c] = 0;
}
}
while (!q.empty()) {
const int u = q.front(); q.pop();
for (int c = 0; c < 256; ++c) {
int v = trie[u].next[c];
if (v != -1) {
trie[v].fail = trie[trie[u].fail].next[c];
q.push(v);
} else {
trie[u].next[c] = trie[trie[u].fail].next[c];
}
}
}
built = true;
}| } | ||
|
|
||
| void Build() { | ||
| built = true; |
There was a problem hiding this comment.
The built flag is set to true at the beginning of the Build() method in the single-header variant as well. This should be moved to the end of the method to ensure the trie is fully built before the flag is set.
void Build() {
std::queue<int> q;
for (int c = 0; c < 256; ++c) {
int nxt = trie[0].next[c];
if (nxt != -1) {
trie[nxt].fail = 0;
q.push(nxt);
} else trie[0].next[c] = 0;
}
while (!q.empty()) {
const int u = q.front();
q.pop();
for (int c = 0; c < 256; ++c) {
int v = trie[u].next[c];
if (v != -1) {
trie[v].fail = trie[trie[u].fail].next[c];
const int f = trie[v].fail;
if (trie[f].match_len > trie[v].match_len) {
trie[v].match_len = trie[f].match_len;
trie[v].output_id = trie[f].output_id;
}
q.push(v);
} else {
trie[u].next[c] = trie[trie[u].fail].next[c];
}
}
}
built = true;
}| } | ||
|
|
||
| void Build() { | ||
| built = true; |
There was a problem hiding this comment.
The built flag is set to true at the beginning of the Build() method for the Has class in the single-header variant. This should be moved to the end of the method.
void Build() {
std::queue<int> q;
for (int c = 0; c < 256; ++c) {
int nxt = trie[0].next[c];
if (nxt != -1) {
trie[nxt].fail = 0;
q.push(nxt);
} else {
trie[0].next[c] = 0;
}
}
while (!q.empty()) {
const int u = q.front();
q.pop();
for (int c = 0; c < 256; ++c) {
int v = trie[u].next[c];
if (v != -1) {
trie[v].fail = trie[trie[u].fail].next[c];
q.push(v);
} else {
trie[u].next[c] = trie[trie[u].fail].next[c];
}
}
}
built = true;
}|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a built flag to the Search and Has classes, enabling a fallback mechanism for Do() operations when the Aho-Corasick trie has not been explicitly built via Build(). The built flag is correctly managed, being set to false on AddRule() and true on Build(). The optimized Do() path for built tries is also correctly implemented. However, the fallback paths for unbuilt tries in both Search::Do and Has::Do implement a brute-force string matching algorithm. This approach is highly inefficient and deviates from the stated motivation of handling uncompacted trie states by leveraging the trie structure and fail links, instead bypassing the trie entirely.
| if (!built) { | ||
| for (size_t i = 0; i < input.size(); ++i) { | ||
| size_t best_len = 0; | ||
| size_t best_rule = Node::kInvalidOutput; | ||
| for (size_t rule_id = 0; rule_id < outputs.size(); ++rule_id) { | ||
| const auto& rule = outputs[rule_id]; | ||
| const size_t len = rule.size(); | ||
| if (len == 0 || len > i + 1) continue; | ||
| if (std::memcmp(input.data() + i + 1 - len, rule.data(), len) == 0 && len > best_len) { | ||
| best_len = len; | ||
| best_rule = rule_id; | ||
| } | ||
| } | ||
|
|
||
| if (best_rule != Node::kInvalidOutput) { | ||
| result.push_back(Match{i + 1 - best_len, best_len, best_rule}); | ||
| } | ||
| } | ||
| return result; |
There was a problem hiding this comment.
The fallback path for !built in Search::Do performs a brute-force string comparison for each rule against the input. This approach is highly inefficient, with a time complexity of O(N * M * L) (where N is input length, M is number of rules, L is max rule length). The original Aho-Corasick algorithm, even in an uncompacted state, would typically offer better performance by leveraging the trie structure and fail links. The current implementation completely bypasses the trie for unbuilt states, which contradicts the motivation of handling -1 next entries and following fail links in an uncompacted trie. Consider implementing a less optimized, but still trie-based, search for unbuilt tries, or explicitly documenting this performance characteristic.
| if (!built) { | ||
| for (size_t i = 0; i < input.size(); ++i) { | ||
| for (const auto& rule : outputs) { | ||
| const size_t len = rule.size(); | ||
| if (len == 0 || len > i + 1) continue; | ||
| if (std::memcmp(input.data() + i + 1 - len, rule.data(), len) == 0) return true; | ||
| } | ||
| } | ||
| return false; |
There was a problem hiding this comment.
The fallback path for !built in Has::Do also performs a brute-force string comparison for each rule. This is highly inefficient, with a time complexity of O(N * M * L) (where N is input length, M is number of rules, L is max rule length). This approach completely bypasses the trie structure and its potential, even in an unbuilt state, for more efficient searching. Consider implementing a trie-based search for unbuilt tries, or explicitly documenting this performance characteristic.
| if (!built) { | ||
| for (size_t i = 0; i < input.size(); ++i) { | ||
| size_t best_len = 0; | ||
| size_t best_rule = Node::kInvalidOutput; | ||
| for (size_t rule_id = 0; rule_id < outputs.size(); ++rule_id) { | ||
| const auto& rule = outputs[rule_id]; | ||
| const size_t len = rule.size(); | ||
| if (len == 0 || len > i + 1) continue; | ||
| if (std::memcmp(input.data() + i + 1 - len, rule.data(), len) == 0 && len > best_len) { | ||
| best_len = len; | ||
| best_rule = rule_id; | ||
| } | ||
| } | ||
|
|
||
| if (best_rule != Node::kInvalidOutput) { | ||
| result.push_back(Match{i + 1 - best_len, best_len, best_rule}); | ||
| } | ||
| } | ||
| return result; |
There was a problem hiding this comment.
The fallback path for !built in Search::Do (single-header version) performs a brute-force string comparison, which is highly inefficient (O(N * M * L)). This bypasses the trie structure entirely for unbuilt states, which contradicts the stated motivation of handling -1 next entries and following fail links in an uncompacted trie. Consider a trie-based search for unbuilt tries, or clearly document this performance characteristic.
| if (!built) { | ||
| for (size_t i = 0; i < input.size(); ++i) { | ||
| for (const auto& rule : outputs) { | ||
| const size_t len = rule.size(); | ||
| if (len == 0 || len > i + 1) continue; | ||
| if (std::memcmp(input.data() + i + 1 - len, rule.data(), len) == 0) return true; | ||
| } | ||
| } | ||
| return false; |
There was a problem hiding this comment.
The fallback path for !built in Has::Do (single-header version) performs a brute-force string comparison, which is highly inefficient (O(N * M * L)). This bypasses the trie structure entirely for unbuilt states, which contradicts the stated motivation of handling -1 next entries and following fail links in an uncompacted trie. Consider a trie-based search for unbuilt tries, or clearly document this performance characteristic.
Motivation
SearchandHascan be used safely without callingBuild()by providing a fallback runtime path when the trie hasn't been compacted (i.e.nextentries may be-1).Description
bool builtmember toSearchandHasand mark itfalseonAddRule()andtrueinBuild()to track whether the trie was built.Do()implementations to branch onbuiltso that when not built they followfaillinks and handle-1nextentries, and when built they use the fast directnextlookup.emplace_back) instead of an inline initializer to keep initialization consistent with the newbuilthandling.Testing
Codex Task