This repository contains code for training and evaluating machine learning models on the EMBER datasets for static malware detection. The project benchmarks multiple machine learning algorithms including gradient boosting methods (LightGBM, XGBoost, CatBoost), ensemble methods (Random Forest, Extra Trees), and neural networks (MLP).
The repository contains research implementations for both EMBER datasets:
- /experiments/2018: Multi-algorithm approach (LightGBM, XGBoost, CatBoost, Random Forest, Extra Trees, MLP)
- /experiments/2024: LightGBM-focused implementation with challenge set parsing

| Aspect | 2018 Implementation | 2024 Implementation |
|---|---|---|
| Python Version | 3.6.15 | 3.12.3 |
| Dataset | EMBER 2018 | EMBER2024 |
| Data Library | ember package | thrember package |
The source directory is where you copy the experiment files from either /experiments/2018/ or /experiments/2024/ depending on which EMBER dataset you want to work with. This directory will contain all the implementation code for your chosen experiment:
- ember_training.ipynb: Main notebook for training both baseline and hyperparameter-tuned models on the EMBER dataset. It includes:
  - Data loading and preprocessing from EMBER vectorized features
  - Baseline model training with default hyperparameters
  - Hyperparameter optimization using Optuna
  - Model serialization and storage
  - Training time tracking
- ember_evaluation.ipynb: Evaluation notebook that benchmarks all trained models. It includes:
  - Model loading and performance evaluation
  - Multiple metrics calculation (ROC AUC, PR AUC, F1-Score)
  - Detection rates at fixed false positive rates (1% and 0.1% FPR)
  - Measurement of inference time and other metrics
  - Feature importance analysis
  - Comparative visualization of results
- model_wrappers.py: Provides a unified interface for different ML frameworks through abstract base classes:
  - ModelWrapper: Abstract base class defining the common interface
  - LGBMWrapper: Wrapper for LightGBM models
  - 2018 version only:
    - XGBWrapper: Wrapper for XGBoost models
    - CatBoostWrapper: Wrapper for CatBoost models
    - SKLearnWrapper: Base wrapper for scikit-learn models
    - RandomForestWrapper: Wrapper for Random Forest classifier
    - ExtraTreesWrapper: Wrapper for Extra Trees classifier
    - KerasWrapper: Wrapper for neural network models with integrated data scaling
- evaluation_utils.py: Helper functions for model evaluation and benchmarking:
  - get_performance_scores() calculates PR AUC and optimal F1 score
  - predict_and_time() measures inference time with warm-up runs
  - benchmark_models() evaluates and compares all trained models
  - 2018 version:
    - get_fpr()* calculates false positive rates
    - find_threshold()* finds decision thresholds at target FPR using iterative search
  - 2024 version:
    - find_threshold_for_fpr() finds decision thresholds at target FPR using ROC curve analysis
- challenge_parser.py: Utility for filtering and processing the EMBER2024 challenge set JSONL files to extract only PE files (Win32, Win64, Dot_Net)

Note: The get_fpr() and find_threshold() functions are adopted from the original ember repository (Copyright 2018 H. Anderson and P. Roth) to maintain methodological consistency with the original research.
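
Picking a decision threshold for a target false positive rate can be illustrated with a short, dependency-free sketch. This is not the code of get_fpr(), find_threshold(), or find_threshold_for_fpr(); it swaps in a simple sort over the benign scores, and the names `threshold_at_fpr` and `detection_rate` are hypothetical. It assumes labels of 0 (benign) and 1 (malicious) and flags a sample when its score is strictly above the threshold.

```python
from typing import Sequence


def threshold_at_fpr(y_true: Sequence[int], y_score: Sequence[float],
                     target_fpr: float) -> float:
    """Return a threshold whose FPR on the benign samples is <= target_fpr.

    Illustrative helper (not part of the repository). A sample counts as
    flagged when score > threshold.
    """
    # Benign scores, highest first.
    benign = sorted((s for y, s in zip(y_true, y_score) if y == 0), reverse=True)
    if not benign:
        raise ValueError("need at least one benign sample")
    # Number of benign samples that may be flagged without exceeding the target.
    # int() rounds down, so floating-point error can only make this stricter.
    allowed = int(target_fpr * len(benign))
    if allowed >= len(benign):
        return float("-inf")  # every sample may be flagged
    # At most `allowed` benign scores lie strictly above this value.
    return benign[allowed]


def detection_rate(y_true: Sequence[int], y_score: Sequence[float],
                   threshold: float) -> float:
    """Fraction of malicious samples with score > threshold."""
    malicious = [s for y, s in zip(y_true, y_score) if y == 1]
    if not malicious:
        raise ValueError("need at least one malicious sample")
    return sum(s > threshold for s in malicious) / len(malicious)
```

With the 1% and 0.1% operating points mentioned above, one would call `threshold_at_fpr(labels, scores, 0.01)` or `threshold_at_fpr(labels, scores, 0.001)` and pass the result to `detection_rate`. Ties at the threshold value are resolved conservatively, so the achieved FPR can be lower than the target.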
This repository does not include the dataset files directly due to their size and to ensure you are using the official versions. You must download the datasets from their original sources and place them in the /data directory for the notebooks to function correctly.
Instructions:
You only need to download the dataset corresponding to the experiment you plan to run:
- Choose and download your dataset:
  - For EMBER 2018 experiments: Download the EMBER 2018 dataset by following the instructions in https://github.com/elastic/ember
  - For EMBER2024 experiments: Download the EMBER2024 dataset by following the instructions in https://github.com/FutureComputing4AI/EMBER2024
- After downloading, ensure the data files are placed within the /data directory according to the format expected by your chosen experiment.

A dedicated directory stores all trained machine learning models in their native formats. Models are stored with prefixes indicating whether they are baseline (baseline_) or hyperparameter-tuned (tuned_) versions.
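
The prefix convention makes it easy to separate the two model variants programmatically. The following is a minimal sketch, assuming the models sit as plain files in a single directory; the function name `group_models_by_variant` and any file names used with it are hypothetical and not taken from the repository.

```python
from pathlib import Path
from typing import Dict, List


def group_models_by_variant(model_dir: Path) -> Dict[str, List[str]]:
    """Group model file names by their baseline_ / tuned_ prefix.

    Illustrative helper (not part of the repository). Files without one of
    the two prefixes are ignored.
    """
    groups: Dict[str, List[str]] = {"baseline": [], "tuned": []}
    for path in sorted(model_dir.iterdir()):
        if not path.is_file():
            continue
        for variant in groups:
            if path.name.startswith(variant + "_"):
                groups[variant].append(path.name)
    return groups
```

Calling it on the model directory returns a dictionary with one sorted list of file names per variant, which can then be fed to whatever loading code the chosen experiment uses.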
A separate directory stores the output from model evaluation, including:
- Performance metrics in tabular format
- Visualization plots
- Comparative analysis results
- Feature importance rankings
Important Note for Jupyter Users: When running the notebooks, ensure that the Jupyter environment considers the root of this repository as the working directory, not the folder where the notebook is located. This is required for proper path resolution to data, models, and utilities.
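
If the Jupyter environment starts in the notebook's own folder, the working directory can be corrected from the first notebook cell. The sketch below walks up from the current directory until it finds a marker file. It assumes that requirements.txt lives in the repository root; the helper name `find_repo_root` is hypothetical and not part of the repository.

```python
import os
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, marker: str = "requirements.txt") -> Optional[Path]:
    """Return the closest ancestor of `start` (or `start` itself) containing `marker`.

    Illustrative helper; the marker file is an assumption about the layout.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    return None


def use_repo_root_as_cwd() -> Path:
    """Change the working directory to the repository root and return it."""
    root = find_repo_root(Path.cwd())
    if root is None:
        raise RuntimeError("repository root not found; start Jupyter from the repo")
    os.chdir(root)
    return root
```

Running `use_repo_root_as_cwd()` once at the top of a notebook should make relative paths to data, models, and utilities resolve from the repository root. Starting Jupyter from the root directory in the first place avoids the need for it.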
To work with either research implementation:
- Choose your target dataset (2018 or 2024)
- Copy the experiment files from the respective folder (experiments/2018/ or experiments/2024/) to the /src directory
- Download the appropriate dataset (see Dataset Storage section)
- Set up your environment using the installation instructions below
- Run the notebooks to train and evaluate models
- Create virtual environment:

      # For EMBER 2018 (Python 3.6.15)
      $ /usr/bin/python3.6 -m venv .venv
      # For EMBER2024 (Python 3.12.3)
      $ /usr/bin/python3.12 -m venv .venv

- Activate environment and install base dependencies:

      $ source .venv/bin/activate
      $ python -m pip install --upgrade pip
      $ python -m pip install -r requirements.txt

- Install Jupyter kernel:

      $ python -m ipykernel install --user --name=ml-static-malware-analysis --display-name="Python (ml-static-malware-analysis)"

For EMBER 2018 experiments, install the ember package and LIEF:

    $ python -m pip install ./ember
    $ python -m easy_install lief-0.9.0-py3.6-linux.egg

Note:
- ./ember refers to the locally cloned EMBER 2018 repository
- Ensure you have downloaded the LIEF 0.9.0 egg file from the official release before running the installation commands

For EMBER2024 experiments, install the thrember package:

    $ python -m pip install ./EMBER2024

Note: ./EMBER2024 refers to the locally cloned EMBER2024 repository

Issue: thrember import may fail with newer OpenSSL versions (e.g., 3.0.13) due to a regex pattern mismatch in the oscrypto library.
Root Cause: The regex pattern in oscrypto/_openssl/_libcrypto_ctypes.py line 43 uses \\d\\.\\d\\.\\d[a-z]* which expects exactly one digit for the patch version, but newer OpenSSL versions like "3.0.13" have multiple digits in the patch version that break this pattern.
Fix: Update the regex to handle multiple digits in the patch version:

    $ sed -i 's/\\\\d\[/\\\\d+\[/' .venv/lib/python3.12/site-packages/oscrypto/_openssl/_libcrypto_ctypes.py

This changes \\d\\.\\d\\.\\d[a-z]* to \\d\\.\\d\\.\\d+[a-z]*, adding a + quantifier only to the last digit matcher (patch version) to allow multiple digits.
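
The effect of the patch can be checked in isolation with Python's re module. The sketch below compares the two patterns on sample version banners. The surrounding word-boundary anchors, the capture group, and the banner strings are assumptions made for illustration, not copied from oscrypto or from a real system; the helper name `extract_version` is hypothetical.

```python
import re
from typing import Optional

# Pattern as described above: exactly one digit for the patch version.
ORIGINAL = r"\b(\d\.\d\.\d[a-z]*)\b"
# Patched pattern: one or more digits for the patch version.
PATCHED = r"\b(\d\.\d\.\d+[a-z]*)\b"


def extract_version(pattern: str, version_string: str) -> Optional[str]:
    """Return the first version number matched by `pattern`, or None."""
    match = re.search(pattern, version_string)
    return match.group(1) if match else None


# Assumed example banners; real output depends on the installed OpenSSL.
new_banner = "OpenSSL 3.0.13 30 Jan 2024"
old_banner = "OpenSSL 1.1.1f  31 Mar 2020"

print(extract_version(ORIGINAL, new_banner))  # None: "3.0.1" is followed by another digit
print(extract_version(PATCHED, new_banner))   # 3.0.13
print(extract_version(ORIGINAL, old_banner))  # 1.1.1f
print(extract_version(PATCHED, old_banner))   # 1.1.1f
```

The patched pattern still matches older single-digit patch versions with a letter suffix, so the change should not affect environments where the import already worked.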
This repository and its source code are licensed under the GNU Affero General Public License v3 (AGPL-v3). This choice is required because this project incorporates and derives from code originally from the ember repository, which is also licensed under AGPL-v3.
Our research notebooks are inspired by the original work in both the ember and ember2024 repositories. Notably, we have adopted a similar plotting method to ensure our results are visually consistent with the original research. To maintain methodological consistency, the get_fpr and find_threshold functions were also directly copied from the ember project.
This work also includes components from the ember2024 repository, which are licensed under the Apache License 2.0.
This project is licensed under the AGPL-v3, with the full text available in the LICENSE file. Information regarding third-party components and their respective licenses (Apache 2.0 and MIT) can be found in the NOTICE file.
This work has been accepted for presentation at the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026).
- Read the Paper: Author's Accepted Version
- Conference Website: https://icaart.scitevents.org
Mandatory Disclaimer: This contribution has been accepted for presentation at ICAART 2026. The final authenticated version will be available online at the SCITEPRESS Digital Library.
If you use this code or our findings in your research, please cite the paper as follows:
A Comparative Benchmark of Machine Learning Models for Static Malware Analysis: From EMBER 2018 to the Challenges of 2024. David-Cristian Horvath and Imre Zsigmond. Accepted at the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Marbella, Spain, March 5–7, 2026.
@conference{icaart26,
author={David{-}Cristian Horvath and Imre Zsigmond},
title={A Comparative Benchmark of Machine Learning Models for Static Malware Analysis: From EMBER 2018 to the Challenges of 2024},
booktitle={Proceedings of the 18th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART},
year={2026},
pages={2284-2291},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0014218700004052},
isbn={978-989-758-796-2},
issn={2184-433X},
}