davidcristian/ml-static-malware-analysis

Machine Learning Static Malware Analysis

This repository contains code for training and evaluating machine learning models on the EMBER datasets for static malware detection. The project benchmarks multiple machine learning algorithms including gradient boosting methods (LightGBM, XGBoost, CatBoost), ensemble methods (Random Forest, Extra Trees), and neural networks (MLP).

Repository Structure

/experiments - Research Implementation Code

Contains research implementations for both EMBER datasets:

  • /experiments/2018: Multi-algorithm approach (LightGBM, XGBoost, CatBoost, Random Forest, Extra Trees, MLP)
  • /experiments/2024: LightGBM-focused implementation with challenge set parsing

| Aspect | 2018 Implementation | 2024 Implementation |
| --- | --- | --- |
| Python Version | 3.6.15 | 3.12.3 |
| Dataset | EMBER 2018 | EMBER2024 |
| Data Library | ember package | thrember package |

/src - Source Code

The /src directory is the working location for the experiment you choose: copy the files from either /experiments/2018/ or /experiments/2024/ into it, depending on which EMBER dataset you want to work with. After copying, it contains all the implementation code for that experiment:

Jupyter Notebooks

  • ember_training.ipynb: Main notebook for training both baseline and hyperparameter-tuned models on the EMBER dataset. It includes:

    • Data loading and preprocessing from EMBER vectorized features
    • Baseline model training with default hyperparameters
    • Hyperparameter optimization using Optuna
    • Model serialization and storage
    • Training time tracking
  • ember_evaluation.ipynb: Evaluation notebook that benchmarks all trained models. It includes:

    • Model loading and performance evaluation
    • Multiple metrics calculation (ROC AUC, PR AUC, F1-Score)
    • Detection rates at fixed false positive rates (1% and 0.1% FPR)
    • Measurement of inference time and other metrics
    • Feature importance analysis
    • Comparative visualization of results
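
As a rough illustration of what "detection rate at a fixed false positive rate" means, the sketch below picks a score threshold from the benign samples and then measures how many malicious samples exceed it. It is a simplified quantile-based approach in plain Python, not the routine used in the notebooks; the function names `threshold_at_fpr` and `detection_rate` are hypothetical.

```python
def threshold_at_fpr(benign_scores, target_fpr):
    """Return a threshold whose false positive rate is at most target_fpr.

    A sample is flagged as malicious when its score is strictly greater
    than the returned threshold.
    """
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign samples that may be flagged without exceeding the target
    allowed = int(target_fpr * len(ranked))
    if allowed >= len(ranked):
        return float("-inf")  # every benign sample may be flagged
    return ranked[allowed]


def detection_rate(malicious_scores, threshold):
    """Fraction of malicious samples scored strictly above the threshold."""
    flagged = sum(1 for score in malicious_scores if score > threshold)
    return flagged / len(malicious_scores)


# Toy scores standing in for model outputs (assumed to lie in [0, 1])
benign = [i / 10 for i in range(10)]
malicious = [0.95, 0.85, 0.5, 0.99]

threshold = threshold_at_fpr(benign, target_fpr=0.1)
rate = detection_rate(malicious, threshold)
```

With tied scores at the threshold, this approach flags fewer benign samples than the target allows, so the achieved FPR never exceeds the requested one.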

Utilities (/src/utils)

  • model_wrappers.py: Provides a unified interface for different ML frameworks through abstract base classes:

    • ModelWrapper: Abstract base class defining the common interface
    • LGBMWrapper: Wrapper for LightGBM models
    • 2018 version only:
      • XGBWrapper: Wrapper for XGBoost models
      • CatBoostWrapper: Wrapper for CatBoost models
      • SKLearnWrapper: Base wrapper for scikit-learn models
      • RandomForestWrapper: Wrapper for Random Forest classifier
      • ExtraTreesWrapper: Wrapper for Extra Trees classifier
      • KerasWrapper: Wrapper for neural network models with integrated data scaling
  • evaluation_utils.py: Helper functions for model evaluation and benchmarking:

    • get_performance_scores() calculates PR AUC and optimal F1 score
    • predict_and_time() measures inference time with warm-up runs
    • benchmark_models() evaluates and compares all trained models
    • 2018 version: get_fpr()* calculates the false positive rate, and find_threshold()* finds the decision threshold for a target FPR using iterative search
    • 2024 version: find_threshold_for_fpr() finds decision thresholds at target FPR using ROC curve analysis
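
The two ideas described above, a common wrapper interface and inference timing with warm-up runs, can be sketched together as follows. This is a minimal sketch under stated assumptions: `BaseWrapper`, `ConstantModel`, and `timed_predict` are hypothetical names, and the method signatures of the repository's actual `ModelWrapper` and `predict_and_time()` may differ.

```python
import time
from abc import ABC, abstractmethod


class BaseWrapper(ABC):
    """Hypothetical common interface that every framework wrapper implements."""

    @abstractmethod
    def fit(self, X, y):
        """Train the underlying model and return self."""

    @abstractmethod
    def predict_proba(self, X):
        """Return one malicious-class score per row of X."""


class ConstantModel(BaseWrapper):
    """Toy stand-in for a real wrapper: always predicts the training base rate."""

    def fit(self, X, y):
        self.rate = sum(y) / len(y)
        return self

    def predict_proba(self, X):
        return [self.rate for _ in X]


def timed_predict(model, X, warmup=2, repeats=5):
    """Return (scores, mean seconds per call), ignoring the warm-up calls."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    for _ in range(warmup):
        model.predict_proba(X)  # warm caches before measuring
    start = time.perf_counter()
    for _ in range(repeats):
        scores = model.predict_proba(X)
    elapsed = (time.perf_counter() - start) / repeats
    return scores, elapsed
```

Because the timing helper only relies on the abstract interface, the same function could time any wrapper regardless of the framework behind it.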

Additional Tools (EMBER2024 only)

  • challenge_parser.py: Utility for filtering and processing the EMBER2024 challenge set JSONL files to extract only PE files (Win32, Win64, Dot_Net)
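
The kind of filtering described above can be sketched as a generator over JSONL lines. This is not the code in challenge_parser.py: the function name `filter_pe_records` and the `file_type` field name are assumptions made for illustration, and only the three type labels come from the description above.

```python
import json

# File type labels treated as PE files, as listed in the description above
PE_TYPES = {"Win32", "Win64", "Dot_Net"}


def filter_pe_records(lines, type_field="file_type"):
    """Yield parsed JSONL records whose type field names a PE format.

    type_field is a hypothetical key; check the dataset's actual schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get(type_field) in PE_TYPES:
            yield record


sample = [
    '{"sha256": "a", "file_type": "Win32"}',
    "",
    '{"sha256": "b", "file_type": "ELF"}',
    '{"sha256": "c", "file_type": "Dot_Net"}',
]
kept = [record["sha256"] for record in filter_pe_records(sample)]
```

An open file object can be passed directly as `lines`, which keeps memory use flat for large JSONL files.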

* Note: The get_fpr() and find_threshold() functions are adopted from the original ember repository (Copyright 2018 H. Anderson and P. Roth) to maintain methodological consistency with the original research.

/data - Dataset Storage

This repository does not include the dataset files directly due to their size and to ensure you are using the official versions. You must download the datasets from their original sources and place them in the /data directory for the notebooks to function correctly.

Instructions:

You only need to download the dataset corresponding to the experiment you plan to run:

  1. Choose and download your dataset:

  2. After downloading, ensure the data files are placed within the /data directory according to the format expected by your chosen experiment.

/models - Trained Model Storage

This directory stores all trained machine learning models in their native formats. Models are stored with prefixes indicating whether they are baseline (baseline_) or hyperparameter-tuned (tuned_) versions.
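
As a small illustration of the naming convention, the sketch below groups the files in a models directory by their prefix. The helper name `group_models_by_kind` is hypothetical and not part of the repository.

```python
from pathlib import Path


def group_models_by_kind(models_dir):
    """Group file names in models_dir by their baseline_/tuned_ prefix."""
    groups = {"baseline": [], "tuned": []}
    for path in sorted(Path(models_dir).iterdir()):
        for kind in groups:
            if path.name.startswith(kind + "_"):
                groups[kind].append(path.name)
    return groups
```

Files without either prefix are ignored, so unrelated files in the directory do not affect the result.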

/results - Evaluation Results

This directory stores the output from model evaluation, including:

  • Performance metrics in tabular format
  • Visualization plots
  • Comparative analysis results
  • Feature importance rankings

Getting Started

Important Note for Jupyter Users: When running the notebooks, ensure that the Jupyter environment considers the root of this repository as the working directory, not the folder where the notebook is located. This is required for proper path resolution to data, models, and utilities.
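
One way to satisfy this requirement from inside a notebook is to walk up from the current directory until the repository root is found, then change into it. The sketch below is an assumption-laden example: the helper `find_repo_root` is hypothetical, and it identifies the root by the presence of `src` and `data` directories, which matches the layout described in this README but is not something the notebooks themselves check.

```python
import os
from pathlib import Path


def find_repo_root(start, markers=("src", "data")):
    """Return the closest ancestor of start that contains all marker directories."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if all((candidate / marker).is_dir() for marker in markers):
            return candidate
    raise FileNotFoundError("no ancestor of {} contains {}".format(start, markers))


# Typical use in the first notebook cell:
# os.chdir(find_repo_root(Path.cwd()))
```

Configuring the Jupyter server or editor to start in the repository root achieves the same result without any code.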

To work with either research implementation:

  1. Choose your target dataset (2018 or 2024)
  2. Copy the experiment files from the respective folder (experiments/2018/ or experiments/2024/) to the /src directory
  3. Download the appropriate dataset (see Dataset Storage section)
  4. Set up your environment using the installation instructions below
  5. Run the notebooks to train and evaluate models

Environment Setup

Common Setup Steps

  1. Create virtual environment:

    # For EMBER 2018 (Python 3.6.15)
    $ /usr/bin/python3.6 -m venv .venv
    
    # For EMBER2024 (Python 3.12.3)
    $ /usr/bin/python3.12 -m venv .venv
  2. Activate environment and install base dependencies:

    $ source .venv/bin/activate
    $ python -m pip install --upgrade pip
    $ python -m pip install -r requirements.txt
  3. Install Jupyter kernel:

    $ python -m ipykernel install --user --name=ml-static-malware-analysis --display-name="Python (ml-static-malware-analysis)"

EMBER 2018 Specific Steps

$ python -m pip install ./ember
$ python -m easy_install lief-0.9.0-py3.6-linux.egg

Note:

EMBER2024 Specific Steps

$ python -m pip install ./EMBER2024

Note:

Troubleshooting EMBER2024

Issue: thrember import may fail with newer OpenSSL versions (e.g., 3.0.13) due to a regex pattern mismatch in the oscrypto library.

Root Cause: The regex pattern in oscrypto/_openssl/_libcrypto_ctypes.py line 43 uses \\d\\.\\d\\.\\d[a-z]* which expects exactly one digit for the patch version, but newer OpenSSL versions like "3.0.13" have multiple digits in the patch version that break this pattern.

Fix: Update the regex to handle multiple digits in the patch version:

$ sed -i 's/\\\\d\[/\\\\d+\[/' .venv/lib/python3.12/site-packages/oscrypto/_openssl/_libcrypto_ctypes.py

This changes \\d\\.\\d\\.\\d[a-z]* to \\d\\.\\d\\.\\d+[a-z]*, adding a + quantifier only to the last digit matcher (patch version) to allow multiple digits.
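
The effect of the fix can be checked in isolation with Python's `re` module. In this sketch the patterns are written as raw strings, so each backslash appears once rather than doubled. How oscrypto anchors the pattern within the full OpenSSL version string is not shown here, so `re.fullmatch` on the bare version number is used as an assumption for illustration; `matches_whole_version` is a hypothetical helper.

```python
import re

OLD_PATTERN = r"\d\.\d\.\d[a-z]*"   # one digit for the patch version
NEW_PATTERN = r"\d\.\d\.\d+[a-z]*"  # one or more digits for the patch version


def matches_whole_version(pattern, version):
    """Return True when the pattern matches the entire version string."""
    return re.fullmatch(pattern, version) is not None


old_matches_legacy = matches_whole_version(OLD_PATTERN, "1.1.1w")
old_matches_modern = matches_whole_version(OLD_PATTERN, "3.0.13")
new_matches_legacy = matches_whole_version(NEW_PATTERN, "1.1.1w")
new_matches_modern = matches_whole_version(NEW_PATTERN, "3.0.13")
```

The old pattern accepts "1.1.1w" but rejects "3.0.13", while the patched pattern accepts both.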

Licensing and Attribution

This repository and its source code are licensed under the GNU Affero General Public License v3 (AGPL-v3). This choice is required because this project incorporates and derives from code originally from the ember repository, which is also licensed under AGPL-v3.

Our research notebooks are inspired by the original work in both the ember and ember2024 repositories. Notably, we have adopted a similar plotting method to ensure our results are visually consistent with the original research. To maintain methodological consistency, the get_fpr and find_threshold functions were also directly copied from the ember project.

This work also includes components from the ember2024 repository, which are licensed under the Apache License 2.0.

The full text of the AGPL-v3 is available in the LICENSE file. Information regarding third-party components and their respective licenses (Apache 2.0 and MIT) can be found in the NOTICE file.

Publication & Accepted Version

This work has been accepted for presentation at the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026).

Mandatory Disclaimer: This contribution has been accepted for presentation at ICAART 2026. The final authenticated version will be available online at the SCITEPRESS Digital Library.

Citing

If you use this code or our findings in your research, please cite the paper as follows:

A Comparative Benchmark of Machine Learning Models for Static Malware Analysis: From EMBER 2018 to the Challenges of 2024
David-Cristian Horvath and Imre Zsigmond
Accepted at the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Marbella, Spain, March 5–7, 2026.

BibTeX

@conference{icaart26,
  author={David{-}Cristian Horvath and Imre Zsigmond},
  title={A Comparative Benchmark of Machine Learning Models for Static Malware Analysis: From EMBER 2018 to the Challenges of 2024},
  booktitle={Proceedings of the 18th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART},
  year={2026},
  pages={2284-2291},
  publisher={SciTePress},
  organization={INSTICC},
  doi={10.5220/0014218700004052},
  isbn={978-989-758-796-2},
  issn={2184-433X},
}
