This repository contains the code and experiments for the paper "Feature Importance in Machine Learning Models for Static Malware Detection".
This project analyzes which static PE file features drive malware detection decisions across different machine learning architectures. Using the EMBER 2018 dataset, we compare tree-based and neural network models with a focus on feature importance, interpretability, and robustness, rather than performance alone.

Models compared:

- LightGBM
- Random Forest
- Feedforward Neural Network (FFNN)
- Convolutional Neural Network (CNN)

Key findings:

- Tree-based models achieve the highest accuracy on clean data.
- Neural networks are less accurate but degrade more gracefully under feature perturbation.
- Imports, string-based metadata, and entropy-related features consistently signal malware across models.
- Different architectures rely on distinct subsets of the feature space.
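
One way to quantify how much two architectures share their most important features is the Jaccard overlap of their top-k feature sets. The following is a minimal sketch, not the code used in the paper: the function name `top_k_overlap`, the choice of Jaccard overlap, and the toy importance vectors are all assumptions made for illustration.

```python
import numpy as np

def top_k_overlap(importance_a, importance_b, k):
    """Jaccard overlap between the k highest-scoring feature indices of two models."""
    top_a = set(np.argsort(importance_a)[-k:].tolist())
    top_b = set(np.argsort(importance_b)[-k:].tolist())
    return len(top_a & top_b) / len(top_a | top_b)

# Hypothetical importance scores over six features (not real results).
imp_tree = np.array([0.40, 0.30, 0.20, 0.05, 0.05, 0.00])
imp_net = np.array([0.05, 0.35, 0.10, 0.30, 0.00, 0.20])

overlap = top_k_overlap(imp_tree, imp_net, k=3)
print(f"top-3 overlap: {overlap:.2f}")
```

A value near 1 would mean the two models lean on nearly the same features; a value near 0 would mean they rely on largely disjoint subsets.
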

Dataset:

- EMBER 2018 v2 feature dataset
- ~800k labeled samples, 2,381 features per file
- Static PE features only (no execution or dynamic analysis)

Methodology:

- Model-specific feature importance extraction
- Correlation analysis of high-importance features
- Robustness testing via Gaussian noise perturbation
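
The robustness test can be pictured as follows: add zero-mean Gaussian noise to the feature matrix at increasing strengths and record how accuracy changes. This is a hedged sketch rather than the repository's implementation. The function `accuracy_under_noise`, the per-feature scaling of the noise by each feature's standard deviation, the noise levels, and the toy classifier are assumptions chosen for illustration.

```python
import numpy as np

def accuracy_under_noise(predict, X, y, sigmas, seed=0):
    """Return {sigma: accuracy} after adding Gaussian noise to every feature.

    Noise for each feature is drawn from N(0, (sigma * std_of_that_feature)^2),
    so sigma is expressed in units of the feature's own spread (an assumption).
    """
    rng = np.random.default_rng(seed)
    feature_std = X.std(axis=0)
    results = {}
    for sigma in sigmas:
        noise = rng.normal(0.0, 1.0, size=X.shape) * sigma * feature_std
        results[sigma] = float(np.mean(predict(X + noise) == y))
    return results

# Toy stand-in for a trained detector: thresholds the first feature.
def toy_predict(X):
    return (X[:, 0] > 0).astype(int)

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))   # synthetic data, not EMBER
y_demo = toy_predict(X_demo)          # labels the toy model gets right by construction

curve = accuracy_under_noise(toy_predict, X_demo, y_demo, sigmas=[0.0, 0.5, 2.0])
for sigma, acc in curve.items():
    print(f"sigma={sigma}: accuracy={acc:.3f}")
```

In practice `predict` would wrap a trained model's label prediction, and comparing the resulting curves across models shows which ones degrade more gracefully.
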
This work evaluates static, feature-based malware detection. Gaussian perturbations are used as a stress test and do not represent realistic adversarial attacks.
- Raquel Ana Magalhães Bush
- Brian Kade Betterton