This project analyzes the Common Vulnerabilities and Exposures (CVE) dataset to uncover historical trends, assess severity levels, and forecast future risks in cybersecurity.
It demonstrates skills in data preprocessing, time-series analysis, statistical modeling, and visualization using Python.
Key Questions:
- Which vulnerability types are becoming more frequent?
- Can we predict the likelihood of high-risk vulnerabilities appearing over time?
- Source: CVE Dataset (Kaggle)
- Includes:
  - cvss: Severity score (CVSS)
  - pub_date/mod_date: Publication and modification dates
  - cwe_code/cwe_name: Weakness type (e.g., CWE-79 for XSS)
  - access_/impact_: Attack complexity, vector, and CIA triad impact
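
As a rough illustration of how the fields listed above might be prepared for analysis, the sketch below fills missing values, applies an ordinal encoding, and counts CVEs per year and CWE. The rows are synthetic, and the column name `access_complexity` and the LOW/MEDIUM/HIGH ordering are assumptions for illustration (the dataset only indicates an `access_` prefix), so adjust them to the real schema.

```python
import pandas as pd

# Tiny synthetic sample standing in for the Kaggle CVE table (values are made up).
df = pd.DataFrame({
    "cvss": [7.5, None, 4.3, 9.8],
    "pub_date": ["2015-03-01", "2016-07-12", "2016-11-30", "2017-01-05"],
    "cwe_code": [79, 89, 79, 119],
    # Hypothetical column name; the real dataset may name it differently.
    "access_complexity": ["LOW", "LOW", None, "HIGH"],
})

# Parse publication dates and derive the year used for trend analysis.
df["pub_date"] = pd.to_datetime(df["pub_date"])
df["year"] = df["pub_date"].dt.year

# Missing values: median for the numeric score, most frequent value for categories.
df["cvss"] = df["cvss"].fillna(df["cvss"].median())
df["access_complexity"] = df["access_complexity"].fillna(
    df["access_complexity"].mode()[0]
)

# Ordinal encoding: map ordered categories to integers (assumed ordering).
complexity_order = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}
df["access_complexity_ord"] = df["access_complexity"].map(complexity_order)

# Year-by-year counts per CWE, and the most frequent CWE codes overall.
yearly = df.groupby(["year", "cwe_code"]).size().unstack(fill_value=0)
top_cwes = df["cwe_code"].value_counts().head(5)

print(yearly)
print(top_cwes)
```
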
- Clean and preprocess CVE data (missing values, ordinal encoding).
- Identify year-by-year trends and the top software weakness categories (CWEs).
- Detect outliers and correlations among vulnerability features.
- Forecast future occurrences of CWE-79 using Exponential Smoothing.
- Build a logistic regression model to predict the likelihood of CWE-79 vulnerabilities.
- Temporal Analysis: Visualized logarithmic growth of CVEs over 20+ years.
- Forecasting: Predicted CWE-79 vulnerability counts with a 5-year horizon.
- Statistical Rigor: Applied Pearson, Spearman, and covariance analysis.
- Classification: Trained a logistic regression classifier to flag CWE-79 vulnerabilities and evaluated it with accuracy and an ROC curve.
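
For the correlation measures named above, a small sketch with `scipy` and `pandas` might look like the following. The data is synthetic, and `impact_confidentiality` and `access_complexity` are hypothetical column names assumed to hold ordinal-encoded values.

```python
import pandas as pd
from scipy import stats

# Synthetic, ordinal-encoded severity features (made-up values).
df = pd.DataFrame({
    "cvss": [2.1, 4.3, 5.0, 6.8, 7.5, 9.3, 10.0],
    "impact_confidentiality": [0, 1, 1, 1, 2, 2, 2],  # hypothetical column
    "access_complexity": [2, 1, 2, 1, 0, 0, 0],       # hypothetical column
})

# Pearson measures linear association; Spearman measures monotonic
# (rank-based) association, which suits ordinal features better.
pearson_r, pearson_p = stats.pearsonr(df["cvss"], df["impact_confidentiality"])
spearman_rho, spearman_p = stats.spearmanr(df["cvss"], df["impact_confidentiality"])

# Pairwise covariance matrix, e.g. as input to a heatmap.
cov = df.cov()

print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3g})")
print(cov)
```
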
- Line charts of CVE growth by year.
- Top 5 CWE category trends.
- Covariance heatmap of severity metrics.
- Time-series forecast and ROC curve.
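
The ROC curve listed above comes from a binary classifier's predicted probabilities. The sketch below shows one way to compute it with scikit-learn for a logistic regression model. The features and labels are randomly generated stand-ins (label 1 playing the role of "is CWE-79"), so the resulting scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic features and binary labels (1 stands in for "is CWE-79").
rng = np.random.default_rng(0)
n_samples = 400
X = rng.normal(size=(n_samples, 3))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n_samples) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Hold out a stratified test set so class balance is preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predicted probability of the positive class drives the ROC curve.
proba = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)
accuracy = clf.score(X_test, y_test)

print(f"accuracy = {accuracy:.3f}, ROC AUC = {auc:.3f}")
```

The `fpr` and `tpr` arrays can be passed directly to `matplotlib` (`plt.plot(fpr, tpr)`) to draw the curve.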
- Language: Python
- Libraries: pandas, numpy, scikit-learn, statsmodels, matplotlib, seaborn, plotly, scipy, kagglehub
- Add NLP features (TF-IDF, BERT)
- Implement ARIMA/Prophet for advanced forecasting
- Predict CVSS severity score using regression
MIT License – see LICENSE for details.