SaaSRevCast: SaaS Revenue Forecasting Pipeline

A machine learning project for forecasting SaaS company revenue using regression models, Optuna hyperparameter optimization, and SHAP-based model explainability.

📋 Project Overview

SaaSRevCast is a predictive analytics pipeline designed to forecast revenue for SaaS (Software as a Service) companies using historical financial and market data. The project implements multiple regression models, performs rigorous hyperparameter tuning with Optuna, and provides interpretable insights through SHAP (SHapley Additive exPlanations) values.

🎯 Key Features

  • Time-Series Revenue Forecasting: Predicts SaaS company revenues using temporal data (2020-2024)
  • Multiple ML Models: Implements and compares Linear Regression, Random Forest, XGBoost, and Support Vector Regression
  • Automated Hyperparameter Tuning: Uses Optuna to search hyperparameters over 30+ trials
  • Model Explainability: SHAP analysis for feature importance and impact visualization
  • Feature Engineering: Advanced lag features, growth rates, profit margins, and customer metrics
  • Comprehensive Evaluation: RMSE, MAE, and MAPE metrics for model comparison

📁 Project Structure

SaasRevCast/
│
├── data/                                    # Dataset and results
│   ├── saas_financial_market_dataset.csv   # Main dataset (2,500+ records)
│   ├── results_saas_revcast.csv            # Model predictions and results
│   ├── model_comparison.csv                # Performance metrics comparison
│   └── feature_importance.csv              # SHAP feature importance values
│
├── notebooks/                               # Jupyter notebooks
│   └── MLPipeline.ipynb                    # Complete ML pipeline implementation
│
├── paper/                                   # Research documentation
│   └── Final Term Paper ML.pdf             # Complete research paper
│
└── README.md                                # Project documentation (this file)

📊 Dataset

The dataset contains financial and operational metrics for multiple SaaS companies across different industries and regions:

  • Size: 2,500+ records
  • Time Period: 2020-2024
  • Companies: Multiple SaaS companies across various industries
  • Features:
    • Revenue (USD)
    • Expenses (USD)
    • Profit (USD)
    • Customer Count
    • Churn Rate
    • ARPU (Average Revenue Per User)
    • Market Share (%)
    • Industry, Region, Founded Year

🔧 Engineered Features

  1. Lag Features: revenue_lag_1, market_share_lag_1, customer_count_lag_1, churn_rate_lag_1
  2. Growth Metrics: revenue_growth (percentage change in revenue relative to the previous period)
  3. Financial Ratios: profit_margin, expenses_per_customer
  4. Log Transformation: log_revenue (target variable) to reduce skewness
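
The engineered features listed above could be built with pandas roughly as follows. This is a minimal sketch, not the notebook's actual code: the raw column names (company, year, revenue, expenses, profit, customer_count, churn_rate, market_share) and the helper name engineer_features are assumptions made for illustration, and the sketch assumes one row per company per year.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lag, growth, ratio, and log features per company.

    Column names are assumed for illustration; adjust them to the real dataset.
    """
    out = df.sort_values(["company", "year"]).copy()
    grouped = out.groupby("company")

    # Previous-period values within each company (first period becomes NaN).
    for col in ["revenue", "market_share", "customer_count", "churn_rate"]:
        out[f"{col}_lag_1"] = grouped[col].shift(1)

    # Percentage change in revenue relative to the previous period.
    out["revenue_growth"] = (
        (out["revenue"] - out["revenue_lag_1"]) / out["revenue_lag_1"] * 100
    )

    # Financial ratios.
    out["profit_margin"] = out["profit"] / out["revenue"]
    out["expenses_per_customer"] = out["expenses"] / out["customer_count"]

    # Log-transformed target to reduce skewness.
    out["log_revenue"] = np.log(out["revenue"])
    return out


# Tiny made-up frame, only to show the shape of the output.
demo = pd.DataFrame({
    "company": ["A", "A", "B", "B"],
    "year": [2020, 2021, 2020, 2021],
    "revenue": [100.0, 120.0, 200.0, 150.0],
    "expenses": [80.0, 90.0, 150.0, 140.0],
    "profit": [20.0, 30.0, 50.0, 10.0],
    "customer_count": [10, 12, 20, 15],
    "churn_rate": [0.05, 0.04, 0.10, 0.12],
    "market_share": [1.0, 1.2, 2.0, 1.5],
})
features = engineer_features(demo)
```

Rows for each company's first period have NaN lag and growth values, so they would need to be dropped or imputed before training.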

🤖 Models Implemented

  1. Linear Regression (Baseline)
  2. Random Forest Regressor (with Optuna tuning)
  3. XGBoost Regressor
  4. Support Vector Regression (SVR)

📈 Model Performance

Models are evaluated using:

  • RMSE (Root Mean Squared Error)
  • MAE (Mean Absolute Error)
  • MAPE (Mean Absolute Percentage Error)

Results are saved in data/model_comparison.csv for detailed comparison.
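
For reference, the three metrics can be computed with NumPy as in the sketch below. The helper name regression_metrics and the sample numbers are illustrative. Because the target is log_revenue, predictions may need to be converted back with np.exp before computing metrics in USD; this README does not say which scale the notebook reports.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and MAPE (in percent) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    return {
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "MAE": float(np.mean(np.abs(errors))),
        # MAPE is undefined when any true value is zero.
        "MAPE": float(np.mean(np.abs(errors / y_true)) * 100),
    }


# Made-up values, only to show the call.
metrics = regression_metrics([100, 200, 400], [110, 180, 400])
```

RMSE penalizes large errors more heavily than MAE, while MAPE expresses error relative to the size of each true value, which makes it easier to compare companies of different scale.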

🔍 Explainability

The project uses SHAP (SHapley Additive exPlanations) to provide:

  • Feature importance rankings
  • Feature impact visualization
  • Individual prediction explanations

🚀 Getting Started

Prerequisites

pip install pandas numpy matplotlib seaborn scikit-learn xgboost optuna shap

Running the Pipeline

  1. Navigate to the notebooks/ directory
  2. Open MLPipeline.ipynb in Jupyter Notebook or VS Code
  3. Run all cells sequentially to:
    • Load and preprocess data
    • Engineer features
    • Train models
    • Evaluate performance
    • Generate SHAP visualizations

📑 Research Paper

A complete research paper documenting the methodology, experiments, results, and insights is available in the paper/ folder:

📄 Final Term Paper ML.pdf

The paper includes:

  • Literature review
  • Detailed methodology
  • Experimental setup and results
  • Model comparison and analysis
  • Conclusions and future work

📊 Results & Outputs

All results are automatically saved to the data/ folder:

  • Model predictions
  • Performance metrics
  • Feature importance scores
  • Comparison tables

🛠️ Technical Stack

  • Python 3.x
  • Data Processing: Pandas, NumPy
  • Visualization: Matplotlib, Seaborn
  • Machine Learning: Scikit-learn, XGBoost
  • Optimization: Optuna
  • Explainability: SHAP
  • Environment: Jupyter Notebook

📝 Data Splits

  • Training Set: 2020-2022
  • Validation Set: 2023 (for hyperparameter tuning)
  • Test Set: 2024 (for final evaluation)
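
A chronological split like this can be expressed as simple year filters. The sketch below assumes a year column and uses a hypothetical helper named split_by_year; neither name is taken from the notebook.

```python
import pandas as pd


def split_by_year(df: pd.DataFrame, year_col: str = "year"):
    """Split rows into train (2020-2022), validation (2023), and test (2024)."""
    train = df[df[year_col].between(2020, 2022)]
    val = df[df[year_col] == 2023]
    test = df[df[year_col] == 2024]
    return train, val, test


# Made-up frame with one row per year, only to show the call.
demo = pd.DataFrame({
    "year": [2020, 2021, 2022, 2023, 2024],
    "revenue": [1.0, 2.0, 3.0, 4.0, 5.0],
})
train, val, test = split_by_year(demo)
```

Splitting by year instead of at random keeps later observations out of the training data, so the test score reflects forecasting on periods the model has not seen.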

🎓 Use Cases

This project is suitable for:

  • Revenue forecasting for SaaS businesses
  • Financial planning and budgeting
  • Investor analysis and due diligence
  • Academic research in time-series forecasting
  • Learning ML pipelines and model explainability

📧 Contact & Contributions

SaaSRevCast was developed as a machine learning research project. For questions or contributions, start with the research paper in the paper/ folder, which documents the methodology and references.

📄 License

This project is for educational and research purposes.


Note: The complete technical details, mathematical formulations, and experimental results are documented in the research paper located in the paper/ folder.