-
Notifications
You must be signed in to change notification settings - Fork 66
[ML] Truncate oversized field values to prevent autodetect process crash #2929
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
valeriy42
wants to merge
11
commits into
elastic:main
Choose a base branch
from
valeriy42:fix/is-2796
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+687
−9
Open
Changes from all commits
Commits
Show all changes
11 commits
Select commit
Hold shift + click to select a range
05704f7
Fix autodetect process crash from oversized field values by truncatin…
valeriy42 4c6e66e
refactor
valeriy42 310f228
implement hash suffix
valeriy42 50379b4
Change hash suffix to 16 characters
valeriy42 4e3a430
Merge branch 'main' of https://github.com/elastic/ml-cpp into fix/is-…
valeriy42 dd955e5
Add tests
valeriy42 3f6471f
formatting
valeriy42 ac73b63
clean up
valeriy42 2226971
fix unit test
valeriy42 72c3d6b
review comments
valeriy42 691e8cf
formatting
valeriy42 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,144 @@ | ||
| /* | ||
| * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one | ||
| * or more contributor license agreements. Licensed under the Elastic License | ||
| * 2.0 and the following additional limitation. Functionality enabled by the | ||
| * files subject to the Elastic License 2.0 may only be used in production when | ||
| * invoked by an Elasticsearch process with a license key installed that permits | ||
| * use of machine learning features. You may not use this file except in | ||
| * compliance with the Elastic License 2.0 and the foregoing additional | ||
| * limitation. | ||
| */ | ||
| #ifndef INCLUDED_ml_model_CFieldValueTruncator_h | ||
| #define INCLUDED_ml_model_CFieldValueTruncator_h | ||
|
|
||
| #include <core/CHashing.h> | ||
|
|
||
| #include <model/ImportExport.h> | ||
|
|
||
| #include <cstdio> | ||
| #include <string> | ||
|
|
||
| namespace ml { | ||
| namespace model { | ||
|
|
||
| //! \brief Enforces term field length constraints with collision prevention. | ||
| //! | ||
| //! In the anomaly detection domain, term fields (by, over, partition, influencer) | ||
| //! are categorical identifiers that must satisfy two invariants: | ||
| //! 1. **Bounded Length** - Prevent memory amplification and OOM crashes | ||
| //! 2. **Unique Identity** - Distinct field values must remain distinguishable | ||
| //! | ||
| //! Values exceeding MAX_FIELD_VALUE_LENGTH (256 chars) are transformed using | ||
| //! collision-safe truncation: | ||
| //! - Retain PREFIX_LENGTH (239) characters of original value | ||
| //! - Append HASH_SEPARATOR ('$') | ||
| //! - Append HASH_HEX_DIGITS (16) character hex hash of complete original value | ||
| //! | ||
| //! Format: "<prefix_239_chars>$<hash_16_hex_chars>" | ||
| //! Example: "very_long_field_value_that_exceeds_limit_(...)$a1b2c3d4e5f67890" | ||
| //! | ||
| //! The 256-character limit aligns with Elasticsearch's ignore_above default | ||
| //! for keyword fields. The hash suffix ensures data integrity while maintaining | ||
| //! human readability (first 239 characters visible) and compatibility with | ||
| //! prefix-based filtering. | ||
| class MODEL_EXPORT CFieldValueTruncator { | ||
| public: | ||
| //! Maximum length for term fields in anomaly detection. | ||
| static constexpr std::size_t MAX_FIELD_VALUE_LENGTH = 256; | ||
|
|
||
| //! Collision prevention format components | ||
| static constexpr char HASH_SEPARATOR = '$'; | ||
| static constexpr std::size_t HASH_HEX_DIGITS = 16; // 16 hex chars = full 64-bit hash | ||
| static constexpr std::size_t HASH_SUFFIX_LENGTH = 1 /* separator */ + HASH_HEX_DIGITS; // 17 total | ||
|
|
||
| //! Content prefix length (readable portion after truncation) | ||
| static constexpr std::size_t PREFIX_LENGTH = MAX_FIELD_VALUE_LENGTH - HASH_SUFFIX_LENGTH; // 239 | ||
|
|
||
| // Domain invariants (enforced at compile-time) | ||
| static_assert(PREFIX_LENGTH + HASH_SUFFIX_LENGTH == MAX_FIELD_VALUE_LENGTH, | ||
| "Term field format invariant: prefix + suffix = total length"); | ||
| static_assert(PREFIX_LENGTH >= 200, | ||
| "Readable prefix must be substantial for human comprehension"); | ||
|
|
||
| //! Check if a term field value exceeds the domain constraint. | ||
| //! \return true if the value requires length enforcement | ||
| static bool needsTruncation(const std::string& value) { | ||
| return value.size() > MAX_FIELD_VALUE_LENGTH; | ||
| } | ||
|
|
||
| //! Enforce term field length constraint in-place. | ||
| //! Applies collision-safe truncation for values exceeding the limit. | ||
| //! \param[in,out] value Field value to constrain | ||
| //! \return true if truncation was applied, false if already within limit | ||
| static bool truncate(std::string& value) { | ||
| if (needsTruncation(value) == false) { | ||
| return false; | ||
| } | ||
|
|
||
| std::string originalValue = std::move(value); | ||
| value.assign(originalValue, 0, PREFIX_LENGTH); | ||
| appendCollisionPreventionSuffix(originalValue, value); | ||
|
|
||
| return true; | ||
| } | ||
|
|
||
| //! Enforce term field length constraint, returning constrained copy. | ||
| //! \param value Original field value | ||
| //! \return Copy with length constraint enforced | ||
| static std::string truncated(const std::string& value) { | ||
| if (needsTruncation(value) == false) { | ||
| return value; | ||
| } | ||
|
|
||
| std::string result; | ||
| result.reserve(MAX_FIELD_VALUE_LENGTH); | ||
| result.assign(value, 0, PREFIX_LENGTH); | ||
| appendCollisionPreventionSuffix(value, result); | ||
|
|
||
| return result; | ||
| } | ||
|
|
||
| private: | ||
| //! \brief Hash encoding for collision prevention. | ||
| //! | ||
| //! Encapsulates the technical details of hash computation and formatting. | ||
| //! Separated from domain logic for clarity and testability. | ||
| struct HashEncoding { | ||
| //! Compute collision-resistant identity hash. | ||
| //! Uses safeMurmurHash64 (endian-neutral) for state persistence safety. | ||
| static std::uint64_t compute(const std::string& value) { | ||
| return core::CHashing::safeMurmurHash64(value.data(), | ||
| static_cast<int>(value.size()), | ||
| 0); // Fixed seed for determinism | ||
| } | ||
|
|
||
| //! Format 64-bit hash as zero-padded lowercase hex string. | ||
| //! \param hash The hash value to format | ||
| //! \param[out] buffer Must be at least HASH_HEX_DIGITS + 1 bytes | ||
| //! \return Pointer to null-terminated hex string in buffer | ||
| static const char* toHex(std::uint64_t hash, char* buffer) { | ||
| // %016llx produces 16-char zero-padded lowercase hex (full 64 bits) | ||
| std::snprintf(buffer, HASH_HEX_DIGITS + 1, "%016llx", | ||
| static_cast<unsigned long long>(hash)); | ||
| return buffer; | ||
| } | ||
| }; | ||
|
|
||
| //! Append collision-prevention suffix: separator + hash. | ||
| //! \param originalValue Complete untruncated value for hash computation | ||
| //! \param[in,out] prefix Truncated prefix to which suffix is appended | ||
| static void appendCollisionPreventionSuffix(const std::string& originalValue, | ||
| std::string& prefix) { | ||
| std::uint64_t identityHash = HashEncoding::compute(originalValue); | ||
|
|
||
| prefix.reserve(MAX_FIELD_VALUE_LENGTH); | ||
| prefix.push_back(HASH_SEPARATOR); | ||
|
|
||
| char hashHexBuffer[HASH_HEX_DIGITS + 1]; | ||
| prefix.append(HashEncoding::toHex(identityHash, hashHexBuffer)); | ||
| } | ||
| }; | ||
| } | ||
| } | ||
|
|
||
| #endif // INCLUDED_ml_model_CFieldValueTruncator_h | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is copying the string
value, right? (because the return type isstring)Is that intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It creates a copy because truncation is sometimes used at places where the caller does not own the input. Hence, having only in-place truncation is insufficient.