Machine Learning Static Malware Analysis

This repository contains code for training and evaluating machine learning models on the EMBER datasets for static malware detection. The project benchmarks multiple machine learning algorithms including gradient boosting methods (LightGBM, XGBoost, CatBoost), ensemble methods (Random Forest, Extra Trees), and neural networks (MLP).

Repository Structure

`/experiments` - Research Implementation Code

Contains research implementations for both EMBER datasets:

- /experiments/2018: Multi-algorithm approach (LightGBM, XGBoost, CatBoost, Random Forest, Extra Trees, MLP)
- /experiments/2024: LightGBM-focused implementation with challenge set parsing

| Aspect | 2018 Implementation | 2024 Implementation |
| --- | --- | --- |
| Python Version | 3.6.15 | 3.12.3 |
| Dataset | EMBER 2018 | EMBER2024 |
| Data Library | `ember` package | `thrember` package |
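
Because the two implementations target different interpreters, a quick check at the top of a notebook can catch a mismatched environment early. The following is a minimal sketch, not code from this repository: the `EXPECTED_PYTHON` mapping and the `check_python` helper are hypothetical names, and only the major and minor versions from the table are compared.

```python
import sys

# Major/minor versions taken from the table above; this mapping is illustrative only.
EXPECTED_PYTHON = {"2018": (3, 6), "2024": (3, 12)}


def check_python(dataset_year, version_info=None):
    """Return True if the interpreter matches the version expected for the dataset."""
    if version_info is None:
        version_info = sys.version_info
    expected = EXPECTED_PYTHON[dataset_year]
    # Compare only (major, minor); the patch level is ignored in this sketch.
    return tuple(version_info[:2]) == expected


if __name__ == "__main__":
    for year in sorted(EXPECTED_PYTHON):
        status = "ok" if check_python(year) else "mismatch"
        print("EMBER", year, "->", status)
```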

`/src` - Source Code

Copy the experiment files from either /experiments/2018/ or /experiments/2024/ into this directory, depending on which EMBER dataset you want to work with. After copying, it contains all the implementation code for the chosen experiment:

Jupyter Notebooks

ember_training.ipynb: Main notebook for training both baseline and hyperparameter-tuned models on the EMBER dataset. It includes:
- Data loading and preprocessing from EMBER vectorized features
- Baseline model training with default hyperparameters
- Hyperparameter optimization using Optuna
- Model serialization and storage
- Training time tracking
ember_evaluation.ipynb: Evaluation notebook that benchmarks all trained models. It includes:
- Model loading and performance evaluation
- Multiple metrics calculation (ROC AUC, PR AUC, F1-Score)
- Detection rates at fixed false positive rates (1% and 0.1% FPR)
- Measurement of inference time and other metrics
- Feature importance analysis
- Comparative visualization of results
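
To make the "detection rate at a fixed false positive rate" metric concrete, here is a minimal sketch using only the standard library. It is not the repository's implementation: the function names are hypothetical, labels are assumed to be 0 for benign and 1 for malicious, and a sample is flagged when its score is strictly above the threshold.

```python
def threshold_at_fpr(benign_scores, target_fpr):
    """Return a threshold t such that at most target_fpr of benign scores exceed t."""
    ordered = sorted(benign_scores, reverse=True)
    allowed = int(target_fpr * len(ordered))  # benign samples we may misclassify
    if allowed >= len(ordered):
        return float("-inf")
    # Scores strictly above ordered[allowed] number at most `allowed`.
    return ordered[allowed]


def detection_rate(malicious_scores, threshold):
    """Fraction of malicious samples scored strictly above the threshold (TPR)."""
    hits = sum(1 for score in malicious_scores if score > threshold)
    return hits / len(malicious_scores)


def detection_rate_at_fpr(y_true, y_score, target_fpr):
    """Split scores by label, pick the threshold on benign scores, report the TPR."""
    benign = [s for y, s in zip(y_true, y_score) if y == 0]
    malicious = [s for y, s in zip(y_true, y_score) if y == 1]
    threshold = threshold_at_fpr(benign, target_fpr)
    return threshold, detection_rate(malicious, threshold)


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    benign = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.70, 0.90]
    malicious = [0.60, 0.75, 0.80, 0.95, 0.99]
    y_true = [0] * len(benign) + [1] * len(malicious)
    print(detection_rate_at_fpr(y_true, benign + malicious, 0.1))
```

With so few benign samples the achievable false positive rates are coarse; on a full test set the same logic resolves 1% and 0.1% operating points much more finely.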

Utilities (`/src/utils`)

model_wrappers.py: Provides a unified interface for different ML frameworks through abstract base classes:
- ModelWrapper: Abstract base class defining the common interface
- LGBMWrapper: Wrapper for LightGBM models
- 2018 version only:
  - XGBWrapper: Wrapper for XGBoost models
  - CatBoostWrapper: Wrapper for CatBoost models
  - SKLearnWrapper: Base wrapper for scikit-learn models
  - RandomForestWrapper: Wrapper for Random Forest classifier
  - ExtraTreesWrapper: Wrapper for Extra Trees classifier
  - KerasWrapper: Wrapper for neural network models with integrated data scaling
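
The wrapper pattern described above can be sketched as follows. This is an illustration of the idea only: the method names (`fit`, `predict_scores`) are assumptions and may differ from those in model_wrappers.py, and `PriorWrapper` is a toy stand-in that does not exist in the repository.

```python
from abc import ABC, abstractmethod


class ModelWrapper(ABC):
    """Common interface; the method names here are assumed for illustration."""

    @abstractmethod
    def fit(self, X, y):
        """Train the underlying model and return self."""

    @abstractmethod
    def predict_scores(self, X):
        """Return one maliciousness score per row of X."""


class PriorWrapper(ModelWrapper):
    """Toy stand-in for a framework wrapper: always predicts the training positive rate."""

    def __init__(self):
        self.prior = None

    def fit(self, X, y):
        self.prior = sum(y) / len(y)
        return self

    def predict_scores(self, X):
        return [self.prior for _ in X]


if __name__ == "__main__":
    # Any code written against ModelWrapper works with every concrete wrapper.
    wrapper = PriorWrapper().fit([[0], [1], [2], [3]], [0, 1, 1, 1])
    print(wrapper.predict_scores([[5], [6]]))
```

The benefit of this design is that training and benchmarking code can loop over wrappers without branching on the underlying framework.
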
evaluation_utils.py: Helper functions for model evaluation and benchmarking:
- get_performance_scores() calculates PR AUC and optimal F1 score
- predict_and_time() measures inference time with warm-up runs
- benchmark_models() evaluates and compares all trained models
- 2018 version: get_fpr()* calculates the false positive rate, and find_threshold()* finds the decision threshold for a target FPR using an iterative search
- 2024 version: find_threshold_for_fpr() finds the decision threshold for a target FPR using ROC curve analysis
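
Timing inference with warm-up runs, as predict_and_time() is described as doing, can be sketched like this. The function name `time_predictions`, its parameters, and the choice of the median as the summary statistic are assumptions made for illustration; the actual signature in evaluation_utils.py may differ.

```python
import statistics
import time


def time_predictions(predict_fn, X, warmup_runs=2, timed_runs=5):
    """Call predict_fn(X) a few times untimed, then return (result, median seconds)."""
    for _ in range(warmup_runs):
        predict_fn(X)  # result discarded; lets caches and lazy initialisation settle

    durations = []
    result = None
    for _ in range(timed_runs):
        start = time.perf_counter()
        result = predict_fn(X)
        durations.append(time.perf_counter() - start)

    # The median is less sensitive to a single slow run than the mean.
    return result, statistics.median(durations)


if __name__ == "__main__":
    scores, seconds = time_predictions(lambda X: [0.5 for _ in X], list(range(1000)))
    print(len(scores), "predictions, median", seconds, "s")
```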

Additional Tools (EMBER2024 only)

challenge_parser.py: Utility for filtering and processing the EMBER2024 challenge set JSONL files to extract only PE files (Win32, Win64, Dot_Net)
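
The filtering step can be sketched as below. This is not the code from challenge_parser.py: the field name `file_type` is an assumption about the JSONL schema, and `filter_pe_records` is a hypothetical helper. Only the three PE type labels come from the description above.

```python
import json

# PE type labels from the description above.
PE_FILE_TYPES = {"Win32", "Win64", "Dot_Net"}


def filter_pe_records(lines, type_field="file_type"):
    """Yield parsed JSON records whose type field is one of the PE types.

    `type_field` is an assumed key name; adjust it to the actual schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get(type_field) in PE_FILE_TYPES:
            yield record


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        '{"sha256": "a", "file_type": "Win32"}',
        '{"sha256": "b", "file_type": "ELF"}',
        '{"sha256": "c", "file_type": "Dot_Net"}',
    ]
    print([record["sha256"] for record in filter_pe_records(sample)])
```

Because the function accepts any iterable of lines, an open file handle can be passed directly and large JSONL files are processed one line at a time.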

Note: The get_fpr() and find_threshold() functions are adopted from the original ember repository (Copyright 2018 H. Anderson and P. Roth) to maintain methodological consistency with the original research.

`/data` - Dataset Storage

This repository does not include the dataset files directly due to their size and to ensure you are using the official versions. You must download the datasets from their original sources and place them in the /data directory for the notebooks to function correctly.

Instructions:

You only need to download the dataset corresponding to the experiment you plan to run:

Choose and download your dataset:
- For EMBER 2018 experiments: Download the EMBER 2018 dataset by following the instructions in https://github.com/elastic/ember
- For EMBER2024 experiments: Download the EMBER2024 dataset by following the instructions in https://github.com/FutureComputing4AI/EMBER2024
After downloading, ensure the data files are placed within the /data directory according to the format expected by your chosen experiment.

`/models` - Trained Model Storage

This directory stores all trained machine learning models in their native formats. Models are stored with prefixes indicating whether they are baseline (baseline_) or hyperparameter-tuned (tuned_) versions.
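
As a small illustration of the naming convention, the sketch below groups file names by their prefix. The helper `group_models_by_prefix` and the example file names are hypothetical; only the `baseline_` and `tuned_` prefixes come from this README.

```python
# Prefixes described above.
PREFIXES = ("baseline_", "tuned_")


def group_models_by_prefix(file_names):
    """Return {prefix: sorted file names}; names without a known prefix are ignored."""
    groups = {prefix: [] for prefix in PREFIXES}
    for name in sorted(file_names):
        for prefix in PREFIXES:
            if name.startswith(prefix):
                groups[prefix].append(name)
                break
    return groups


if __name__ == "__main__":
    # Example names are invented; real names depend on the experiment you run.
    names = ["tuned_lightgbm.txt", "baseline_lightgbm.txt", "notes.md"]
    print(group_models_by_prefix(names))
```

To apply it to the directory itself, pass `(p.name for p in pathlib.Path("models").iterdir())` as `file_names`.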

`/results` - Evaluation Results

This directory stores the output from model evaluation, including:

- Performance metrics in tabular format
- Visualization plots
- Comparative analysis results
- Feature importance rankings

Getting Started

Important Note for Jupyter Users: When running the notebooks, make sure Jupyter uses the root of this repository as the working directory, not the folder that contains the notebook. Paths to data, models, and utilities are resolved relative to the repository root.
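
One way to enforce this from inside a notebook is to walk up from the current directory until the repository root is found, then change into it. The sketch below is a suggestion rather than part of the repository: `find_repo_root` is a hypothetical helper, and it assumes the root can be recognised by the presence of the `src` and `experiments` folders.

```python
from pathlib import Path

# Top-level folders described in this README, used here as root markers (an assumption).
MARKERS = ("src", "experiments")


def find_repo_root(start, markers=MARKERS):
    """Return the closest ancestor of `start` (or `start` itself) containing all markers."""
    start = Path(start).resolve()
    for candidate in [start] + list(start.parents):
        if all((candidate / marker).is_dir() for marker in markers):
            return candidate
    return None


# Possible use in the first notebook cell:
#   import os
#   root = find_repo_root(Path.cwd())
#   if root is not None:
#       os.chdir(str(root))
```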

To work with either research implementation:

- Choose your target dataset (2018 or 2024)
- Copy the experiment files from the respective folder (experiments/2018/ or experiments/2024/) to the /src directory
- Download the appropriate dataset (see Dataset Storage section)
- Set up your environment using the installation instructions below
- Run the notebooks to train and evaluate models
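
The copy step can also be scripted. The following is a minimal sketch with a hypothetical helper, `copy_experiment`; it avoids `shutil.copytree(..., dirs_exist_ok=True)` because that argument requires Python 3.8, while the 2018 environment uses Python 3.6. Note that it overwrites files in the destination that have the same relative path.

```python
import shutil
from pathlib import Path


def copy_experiment(experiment_dir, src_dir):
    """Copy every file under experiment_dir into src_dir, preserving sub-folders."""
    experiment_dir = Path(experiment_dir)
    src_dir = Path(src_dir)
    copied = []
    for source in sorted(experiment_dir.rglob("*")):
        if source.is_file():
            target = src_dir / source.relative_to(experiment_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(str(source), str(target))  # overwrites an existing file
            copied.append(target)
    return copied


# Possible use from the repository root:
#   copy_experiment("experiments/2024", "src")
```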

Environment Setup

Common Setup Steps

Create virtual environment:

# For EMBER 2018 (Python 3.6.15)
$ /usr/bin/python3.6 -m venv .venv

# For EMBER2024 (Python 3.12.3)
$ /usr/bin/python3.12 -m venv .venv

Activate environment and install base dependencies:

$ source .venv/bin/activate
$ python -m pip install --upgrade pip
$ python -m pip install -r requirements.txt

Install Jupyter kernel:

$ python -m ipykernel install --user --name=ml-static-malware-analysis --display-name="Python (ml-static-malware-analysis)"

EMBER 2018 Specific Steps

$ python -m pip install ./ember
$ python -m easy_install lief-0.9.0-py3.6-linux.egg

Note:

- ./ember refers to the locally cloned EMBER 2018 repository.
- Ensure you have downloaded the LIEF 0.9.0 egg file from the official release before running the installation commands.

EMBER2024 Specific Steps

$ python -m pip install ./EMBER2024

Note:

./EMBER2024 refers to the locally cloned EMBER2024 repository

Troubleshooting EMBER2024

Issue: thrember import may fail with newer OpenSSL versions (e.g., 3.0.13) due to a regex pattern mismatch in the oscrypto library.

Root Cause: The regex on line 43 of oscrypto/_openssl/_libcrypto_ctypes.py, `\\d\\.\\d\\.\\d[a-z]*`, expects exactly one digit for the patch version. Newer OpenSSL versions such as "3.0.13" have a multi-digit patch version, so the pattern fails to match.

Fix: Update the regex to handle multiple digits in the patch version:

$ sed -i 's/\\\\d\[/\\\\d+\[/' .venv/lib/python3.12/site-packages/oscrypto/_openssl/_libcrypto_ctypes.py

This changes `\\d\\.\\d\\.\\d[a-z]*` to `\\d\\.\\d\\.\\d+[a-z]*`. The + quantifier is added only to the last digit matcher (the patch version), so multi-digit patch versions are accepted.
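
The effect of the change can be reproduced in isolation with Python's `re` module. In this sketch the surrounding `\b(...)\b` anchors and the sample version strings are assumptions about how the library applies the pattern; only the inner pattern before and after the fix comes from the description above.

```python
import re

# Inner patterns from the text above; the \b(...)\b wrapper is an assumption.
OLD_PATTERN = r"\b(\d\.\d\.\d[a-z]*)\b"
NEW_PATTERN = r"\b(\d\.\d\.\d+[a-z]*)\b"


def extract_version(pattern, version_string):
    """Return the first version-like substring matched by pattern, or None."""
    match = re.search(pattern, version_string)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Sample strings in the style of OpenSSL version banners (illustrative).
    for banner in ("OpenSSL 1.1.1f  31 Mar 2020", "OpenSSL 3.0.13 30 Jan 2024"):
        print(banner)
        print("  old:", extract_version(OLD_PATTERN, banner))
        print("  new:", extract_version(NEW_PATTERN, banner))
```

Under these assumptions the old pattern finds no version in the 3.0.13 banner, because the single-digit patch matcher cannot end on a word boundary inside "13", while the new pattern matches the full version.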

Licensing and Attribution

This repository and its source code are licensed under the GNU Affero General Public License v3 (AGPL-v3). This choice is required because this project incorporates and derives from code originally from the ember repository, which is also licensed under AGPL-v3.

Our research notebooks are inspired by the original work in both the ember and ember2024 repositories. Notably, we have adopted a similar plotting method to ensure our results are visually consistent with the original research. To maintain methodological consistency, the get_fpr and find_threshold functions were also directly copied from the ember project.

This work also includes components from the ember2024 repository, which are licensed under the Apache License 2.0.

This project is licensed under the AGPL-v3, with the full text available in the LICENSE file. Information regarding third-party components and their respective licenses (Apache 2.0 and MIT) can be found in the NOTICE file.

Publication & Accepted Version

This work has been accepted for presentation at the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026).

Read the Paper: Author's Accepted Version
Conference Website: https://icaart.scitevents.org

Mandatory Disclaimer: This contribution has been accepted for presentation at ICAART 2026. The final authenticated version will be available online at the SCITEPRESS Digital Library.

Citing

If you use this code or our findings in your research, please cite the paper as follows:

David-Cristian Horvath and Imre Zsigmond. "A Comparative Benchmark of Machine Learning Models for Static Malware Analysis: From EMBER 2018 to the Challenges of 2024." Accepted at the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Marbella, Spain, March 5–7, 2026.

BibTeX

@conference{icaart26,
  author={David{-}Cristian Horvath and Imre Zsigmond},
  title={A Comparative Benchmark of Machine Learning Models for Static Malware Analysis: From EMBER 2018 to the Challenges of 2024},
  booktitle={Proceedings of the 18th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART},
  year={2026},
  pages={2284-2291},
  publisher={SciTePress},
  organization={INSTICC},
  doi={10.5220/0014218700004052},
  isbn={978-989-758-796-2},
  issn={2184-433X},
}
