Elasticsearch Machine Learning C++

Toolchain

Language: C++20 (CMAKE_CXX_STANDARD 20).
Build system: CMake (primary) or Gradle wrapper (./gradlew).
Compilers: GCC 13.3.0 on Linux (built from source, installed to /usr/local/gcc133/), Xcode Clang on macOS (Xcode 15.2+ for Ventura/Sonoma), Visual Studio 2022 Professional (MSVC) on Windows.
Key dependencies: Boost 1.86.0 (dynamic linking, includes Boost.Json for JSON handling), PyTorch 2.7.1 (libtorch), libxml2.
Header-only libraries: Eigen and valijson, managed by the 3rd_party/ CMake system (pulled automatically during configuration).
Platforms: Linux x86_64/aarch64, macOS aarch64, Windows x86_64.
Toolchain files: Auto-selected from cmake/<os>-<arch>.cmake or set via CMAKE_TOOLCHAIN_FILE.

Build & Run Commands

Configure and build (default RelWithDebInfo):

cmake -B cmake-build-relwithdebinfo
cmake --build cmake-build-relwithdebinfo -j$(nproc)

Or via Gradle:

./gradlew :compile

Set ML_DEBUG=1 to switch to a Debug build. Compiler caching (sccache/ccache) is auto-detected.

Refer to CONTRIBUTING.md and the build-setup/ directory for full platform-specific setup instructions.

Project Structure

bin/                    # Application executables
  autodetect/           #   Anomaly detection
  categorize/           #   Log categorization
  controller/           #   Process lifecycle controller
  data_frame_analyzer/  #   Data frame analytics (classification, regression)
  normalize/            #   Anomaly score normalization
  pytorch_inference/    #   PyTorch model inference
lib/                    # Shared libraries
  api/                  #   JSON/REST API layer
  core/                 #   Platform abstractions, I/O, logging, compression
  maths/                #   Mathematical and statistical algorithms
    analytics/          #     Boosted tree, data frame analytics
    common/             #     Bayesian optimisation, distributions, time series
    time_series/        #     Time series decomposition, forecasting
  model/                #   Anomaly detection models
  seccomp/              #   Seccomp/sandbox filters
  test/                 #   Shared test utilities (CBoostTestXmlOutput, etc.)
  ver/                  #   Version information
include/                # Public headers (mirrors lib/ structure)
3rd_party/              # Header-only third-party libraries (Eigen, valijson), licenses
cmake/                  # CMake toolchain files, helper functions, test runners
build-setup/            # Platform-specific build environment instructions
.buildkite/             # CI pipeline definitions (Buildkite)
.ci/                    # Packer scripts for building Orka macOS CI VMs
.github/workflows/      # GitHub Actions (automatic backport)
dev-tools/              # Developer scripts (clang-format, benchmarks)

Libraries must not have circular dependencies. The dependency order is roughly: core -> maths -> model -> api -> bin/*.

CMake Helper Functions

The build uses custom CMake functions defined in cmake/functions.cmake. Use these instead of raw add_library/add_executable — they handle platform-specific sources, linking, installation, and Windows resource generation automatically.

Adding a Shared Library

Set ML_LINK_LIBRARIES then call ml_add_library:

project("ML MyLib")

set(ML_LINK_LIBRARIES
  ${Boost_LIBRARIES}
  MlCore
  )

ml_add_library(MlMyLib SHARED
  CMyClass.cc
  CMyOtherClass.cc
  )

Libraries are named with the Ml prefix (e.g. MlCore, MlModel). The function handles shared library versioning, RPATH, and installation. Use SHARED for distributed libraries or STATIC for internal-only ones.

For libraries that should not be installed/distributed (e.g. internal helpers), use ml_add_non_distributed_library instead.

Adding an Executable

Set ML_LINK_LIBRARIES then call ml_add_executable. A Main.cc file is included automatically — do not list it in the sources:

project("ML MyApp")

set(ML_LINK_LIBRARIES
  ${Boost_LIBRARIES}
  MlCore
  MlApi
  MlVer
  )

ml_add_executable(myapp
  CCmdLineParser.cc
  )

The function creates a companion OBJECT library (MlMyApp) from the listed sources, which test executables can link against. The executable itself always builds from Main.cc plus those objects.

For executables not intended for distribution (dev tools, benchmarks), use ml_add_non_distributed_executable.

Adding a Test Executable

Test executables live in unittest/ subdirectories. Set ML_LINK_LIBRARIES (including ${Boost_LIBRARIES_WITH_UNIT_TEST} and MlTest), then call ml_add_test_executable:

project("ML MyLib unit tests")

set(SRCS
  CMyClassTest.cc
  CMyOtherClassTest.cc
  Main.cc
  )

set(ML_LINK_LIBRARIES
  ${Boost_LIBRARIES_WITH_UNIT_TEST}
  MlCore
  MlMyLib
  MlTest
  )

ml_add_test_executable(mylib ${SRCS})

The _target argument (e.g. mylib) is used to derive the test executable name (ml_test_mylib) and the CMake targets test_mylib and test_mylib_individually.

Registering Tests with the Build

After creating the test executable, register it in test/CMakeLists.txt by adding an ml_add_test call alongside the existing entries:

ml_add_test(lib/core/unittest core)
ml_add_test(lib/maths/common/unittest maths_common)
ml_add_test(lib/maths/time_series/unittest maths_time_series)
ml_add_test(lib/maths/analytics/unittest maths_analytics)
ml_add_test(lib/model/unittest model)
ml_add_test(lib/api/unittest api)
ml_add_test(lib/ver/unittest ver)
ml_add_test(lib/seccomp/unittest seccomp)
ml_add_test(bin/controller/unittest controller)
ml_add_test(bin/pytorch_inference/unittest pytorch_inference)
ml_add_test(lib/mylib/unittest mylib)          # <-- new entry

The first argument is the relative path to the unittest directory; the second is the target name matching ml_add_test_executable. Note how nested libraries use underscores in the target name (e.g. lib/maths/common/unittest -> maths_common).

Platform-Specific Sources

If a source file has a platform-specific variant (e.g. CMyClass_Linux.cc, CMyClass_Darwin.cc), the ml_generate_platform_sources function (called internally by all ml_add_* functions) will automatically substitute the platform-specific file at build time. Just list the base filename (CMyClass.cc) in your sources.

Testing

Tests use the Boost.Test framework. Each library and application has a unittest/ subdirectory containing test files and a Main.cc entry point.

Running Tests

Run all tests:

cmake --build cmake-build-relwithdebinfo -t test

Run tests for a specific library:

cmake --build cmake-build-relwithdebinfo -t test_core
cmake --build cmake-build-relwithdebinfo -t test_model

Run specific test cases (wildcards supported):

TESTS="*/testPersist" cmake --build cmake-build-relwithdebinfo -t test_model

Run tests individually in separate processes (better isolation, per-suite parallelism):

cmake --build cmake-build-relwithdebinfo -j8 -t test_individually
cmake --build cmake-build-relwithdebinfo -j8 -t test_api_individually

Run all test cases from all suites in a single CTest invocation (optimal cross-suite parallelism):

cmake --build cmake-build-relwithdebinfo -t test_all_parallel

Pass extra flags to the Boost.Test runner:

TEST_FLAGS="--random" cmake --build cmake-build-relwithdebinfo -t test

Precommit (format + test)

cmake --build cmake-build-relwithdebinfo -j8 -t precommit

Or: ./gradlew precommit

Writing Tests

Test files are named CClassNameTest.cc and placed in lib/<module>/unittest/ or bin/<module>/unittest/.
Each test file uses BOOST_AUTO_TEST_SUITE(CClassNameTest) / BOOST_AUTO_TEST_CASE(testMethodName).
Use real classes over mocks wherever possible. Tests should reflect real-world usage.
Every class should have a corresponding test suite; every public method should have a test.
Test cases must be completely independent from one another — they may be run in parallel across separate processes, so they must not depend on execution order or share mutable state.

Formatting & Style

Code is formatted with clang-format (LLVM-based style, 4-space indent). Run before committing:

cmake --build cmake-build-relwithdebinfo -t format

Or: ./gradlew format

The CI pipeline enforces formatting via the check-style step; PRs that fail formatting will not pass CI.

The full coding standard is in STYLEGUIDE.md. Key points:

Naming Conventions

Classes: CClassName; structs: SStructName; enums: EEnumName
Member variables: m_ClassMember, s_StructMember
Static members: ms_ClassStatic
Methods: methodName (camelCase)
Type aliases: TTypeName (e.g. using TDoubleVec = std::vector<double>)
Constants: CONSTANT_NAME
Non-boolean accessors: clientId (not getClientId)
Boolean accessors: isComplete (not complete)
Files: CClassName.cc / CClassName.h
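
The conventions above can be seen together in a minimal sketch. The module name example and the types SClientInfo and CClientRegistry are hypothetical and exist only to illustrate the prefixes; they are not taken from the codebase.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

namespace ml {
namespace example { // hypothetical module, as if the code lived in lib/example/

// Constants: CONSTANT_NAME
const std::size_t MAX_CLIENTS{16};

// Structs: SStructName, struct members: s_StructMember
struct SClientInfo {
    std::string s_ClientId;
    bool s_Complete{false};
};

// Classes: CClassName
class CClientRegistry {
public:
    // Type aliases: TTypeName
    using TClientInfoVec = std::vector<SClientInfo>;

public:
    // Methods: camelCase. Returns false instead of throwing when full.
    bool addClient(std::string clientId) {
        if (this->isFull()) {
            return false;
        }
        m_Clients.emplace_back(SClientInfo{std::move(clientId), false});
        ++ms_TotalAdded;
        return true;
    }

    // Non-boolean accessor: clients(), not getClients()
    const TClientInfoVec& clients() const { return m_Clients; }

    // Boolean accessor: isFull(), not full()
    bool isFull() const { return m_Clients.size() >= MAX_CLIENTS; }

    static std::size_t totalAdded() { return ms_TotalAdded; }

private:
    // Member variables: m_ClassMember
    TClientInfoVec m_Clients;
    // Static members: ms_ClassStatic
    static std::size_t ms_TotalAdded;
};

std::size_t CClientRegistry::ms_TotalAdded{0};
}
}
```

In a real module the class would be split across CClientRegistry.h and CClientRegistry.cc, following the file naming rule above.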

Code Conventions

Use nullptr, never 0 or NULL.
No exceptions — use return codes for error handling. Catch third-party exceptions at the smallest scope.
No assert(). No C-style casts. No macros unless unavoidable.
Prefer smart pointers over raw pointers; prefer references over pointers.
Scope member function calls with this->.
Use auto when the type is obvious; avoid it when the type is unclear.
Prefer emplace_back over push_back, range-based for loops, and uniform initializers.
override must be used consistently; virtual must not appear alongside override.
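
As a rough illustration of several of these rules at once, the sketch below uses hypothetical classes (CParserBase, CDoubleParser, CValueCollector) in a hypothetical example namespace. std::stod stands in for a throwing third-party call; it is an assumption made for the example, not how the codebase parses numbers.

```cpp
#include <exception>
#include <memory>
#include <string>
#include <utility>
#include <vector>

namespace ml {
namespace example { // hypothetical namespace for illustration

class CParserBase {
public:
    virtual ~CParserBase() = default;
    // Reports failure through the return value rather than by throwing.
    virtual bool parse(const std::string& text, double& result) const = 0;
};

class CDoubleParser : public CParserBase {
public:
    // override is used without repeating virtual.
    bool parse(const std::string& text, double& result) const override {
        // The try block wraps only the call that can throw.
        try {
            result = std::stod(text);
        } catch (const std::exception&) {
            return false;
        }
        return true;
    }
};

class CValueCollector {
public:
    using TDoubleVec = std::vector<double>;
    using TStrVec = std::vector<std::string>;

public:
    // Ownership is expressed with a smart pointer, not a raw pointer.
    explicit CValueCollector(std::unique_ptr<CParserBase> parser)
        : m_Parser{std::move(parser)} {}

    // Returns false if any text failed to parse; valid values are kept.
    bool addAll(const TStrVec& texts) {
        if (m_Parser == nullptr) {
            return false;
        }
        bool allParsed{true};
        for (const auto& text : texts) {
            // Member function calls are scoped with this->.
            allParsed = this->add(text) && allParsed;
        }
        return allParsed;
    }

    const TDoubleVec& values() const { return m_Values; }

private:
    bool add(const std::string& text) {
        double value{0.0};
        if (m_Parser->parse(text, value) == false) {
            return false;
        }
        m_Values.emplace_back(value);
        return true;
    }

private:
    std::unique_ptr<CParserBase> m_Parser;
    TDoubleVec m_Values;
};
}
}
```

A caller would check the return value of addAll and decide how to react, for example by logging the failure, instead of relying on an exception propagating.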

File Layout

Implementation files (.cc): own header first, then other ML headers, third-party headers, standard library headers.
Group includes by library with blank lines between groups (clang-format will sort within groups).
Use unnamed namespaces in .cc files for file-local helpers, not private class members.
Forward-declare classes in headers rather than including their headers.
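
The forward declaration and unnamed namespace rules can be sketched in a single listing. The three "files" below are hypothetical and are concatenated only so the example compiles on its own; in the repository each would be a separate file with the include ordering described above.

```cpp
// --- CReportWriter.h (hypothetical header) ---
#include <string>

namespace ml {
namespace example {
// Forward declaration: this header only uses CDataSummary by reference,
// so it does not need to include CDataSummary.h.
class CDataSummary;

class CReportWriter {
public:
    std::string write(const CDataSummary& summary) const;
};
}
}

// --- CDataSummary.h (hypothetical header) ---
namespace ml {
namespace example {
class CDataSummary {
public:
    explicit CDataSummary(double mean) : m_Mean{mean} {}
    double mean() const { return m_Mean; }

private:
    double m_Mean;
};
}
}

// --- CReportWriter.cc (hypothetical implementation file) ---
// In a real .cc file the includes would come here: own header first,
// then other ML headers, third-party headers, standard library headers.
namespace {
// File-local helper kept in an unnamed namespace instead of being
// declared as a private member of CReportWriter.
std::string label(double value) {
    return value < 0.0 ? "negative" : "non-negative";
}
}

namespace ml {
namespace example {
std::string CReportWriter::write(const CDataSummary& summary) const {
    return "mean is " + label(summary.mean());
}
}
}
```

Keeping label out of the class declaration means the header does not change when the helper does, so dependants of CReportWriter.h need not be recompiled.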

Documentation

Doxygen comments (exclamation mark style: //!) are required for all header files and public/protected methods.
Implementation files use regular C++ comments, not Doxygen.
Focus comments on the "why", not the "what".
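
A minimal sketch of the comment styles, using a hypothetical CRunningMean class. The exact Doxygen tags and section layout expected in headers are an assumption here; STYLEGUIDE.md and existing headers are the reference.

```cpp
#include <cstddef>

namespace ml {
namespace example { // hypothetical namespace for illustration

//! \brief
//! Maintains the mean of a stream of values.
class CRunningMean {
public:
    //! Add \p value to the running mean.
    void add(double value) {
        ++m_Count;
        // A regular comment explaining why: the incremental update
        // avoids holding a large running sum, which would lose
        // precision on long streams.
        m_Mean += (value - m_Mean) / static_cast<double>(m_Count);
    }

    //! The mean of the values added so far, or zero if none were added.
    double mean() const { return m_Mean; }

    //! The number of values added so far.
    std::size_t count() const { return m_Count; }

private:
    //! The number of values added.
    std::size_t m_Count{0};
    //! The current mean.
    double m_Mean{0.0};
};
}
}
```

The Doxygen comments describe the interface, while the plain comment inside add records the reason for the chosen update formula rather than restating the arithmetic.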

License Headers

All source files must include the Elastic License 2.0 header. Copy from copyright_code_header.txt or any existing source file.

CI

CI runs on Buildkite (ml-cpp-pr-builds). The pipeline builds and tests on all platforms (Linux x86_64, Linux aarch64, macOS aarch64, Windows x86_64) in both RelWithDebInfo and Debug configurations. It also runs:

clang-format style validation
Snyk security/license scanning
Java integration tests against Elasticsearch

Automatic backporting is handled by a GitHub Action (.github/workflows/backport.yml) — add version labels (e.g. v9.3.0) to a PR and a backport PR is created automatically when it merges.

Pull Requests

Title must be prefixed with [ML] (e.g. [ML] Fix anomaly scoring edge case).
Label with :ml (mandatory), a type label (>bug, >enhancement, >feature, >refactoring, >test, >docs), and version labels for applicable releases.
Squash-and-merge is the standard merge strategy; keep commits clean for review but don't squash manually.
Backports start after merging to main. Add version labels to trigger automatic backport PRs.

Best Practices for Automation Agents

Always read existing code before editing to understand patterns and conventions.
Never edit unrelated files; keep diffs tightly scoped.
Run clang-format before presenting any code changes.
Match the naming conventions exactly; the prefixes (C, S, E, T, m_, s_, ms_) are strictly followed throughout the codebase.
When adding new classes, follow the existing directory and namespace structure. Production code in lib/foo/ uses namespace ml::foo.
When adding tests, place them in the corresponding unittest/ directory and register them in test/CMakeLists.txt with ml_add_test.
Do not introduce new third-party dependencies without discussion.
Do not add AI attribution trailers (e.g. Co-Authored-By) to commit messages.
Commit messages should follow the [ML] Summary of change format.
If unsure about a convention, check a nearby file for the established pattern — consistency with surrounding code is the highest priority.

Stay aligned with CONTRIBUTING.md, STYLEGUIDE.md, and the build-setup/ guides; this AGENTS file summarizes but does not replace those authoritative documents.
