A machine learning system for predicting flight ticket prices using Python, scikit-learn, XGBoost, CatBoost, and FastAPI.
Airlines dynamically price tickets based on multiple factors including:
- Route: Source and destination cities affect base pricing
- Timing: Departure time, day of week, and season impact prices
- Airline carrier: Different airlines have different pricing strategies
- Number of stops: Direct flights typically cost more than connecting flights
- Booking class: Business vs economy affects pricing significantly
This system predicts flight prices to help:
- Travelers: Find optimal booking times and routes
- Airlines: Optimize pricing strategies
- Travel agencies: Provide better price estimates
The model is trained on the Kaggle Flight Price Prediction Dataset. The best model, a Random Forest, reaches R² = 0.9912 in the training run reported below, well above the R² > 0.85 target.
- Comprehensive EDA with visualizations
- 6 ML models trained and compared
- Hyperparameter tuning with RandomizedSearchCV
- FastAPI REST API for predictions
- Docker containerization for deployment
- Full reproducibility with random_state=42
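
RandomizedSearchCV evaluates a fixed number of randomly sampled parameter settings instead of every combination in a grid. The sketch below reduces that idea to the standard library so it runs without any dependencies. The search space values and the `toy_score` function are illustrative assumptions, not the project's actual configuration; in the real pipeline the score would come from cross-validation.

```python
import random

# Hypothetical search space; the parameter names match the Random Forest
# hyperparameters reported in the training output, the values are assumed.
SEARCH_SPACE = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}


def sample_candidates(space, n_iter, seed=42):
    """Draw n_iter random parameter combinations, reproducibly."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


def toy_score(params):
    """Stand-in for a cross-validated score (higher is better)."""
    depth_penalty = 0.0 if params["max_depth"] is None else 0.01
    return (
        params["n_estimators"] / 1000
        - depth_penalty
        - params["min_samples_leaf"] / 100
    )


def random_search(space, n_iter=10, seed=42):
    """Return the sampled candidate with the highest score."""
    candidates = sample_candidates(space, n_iter, seed)
    return max(candidates, key=toy_score)


if __name__ == "__main__":
    print(random_search(SEARCH_SPACE))
```

Because the generator is seeded, repeated runs sample the same candidates and select the same winner.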
Screenshots:
- Dataset Overview, Missing Values
- Price Distribution, Feature Distributions
- Correlation Heatmap, Datetime Features
- Model Comparison, Best Model Performance
- Swagger UI, ReDoc
Download the dataset from Kaggle and place it in the dataset/ folder:
- Go to Kaggle Flight Price Prediction Dataset
- Download the dataset (requires Kaggle account)
- Extract and place the CSV file:
mkdir -p dataset
mv ~/Downloads/Clean_Dataset.csv dataset/

The dataset directory is gitignored due to file size.
- Python 3.13+
- uv package manager
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone repository
git clone https://github.com/devnovikov/flight-price-prediciton.git
cd flight-price-prediciton
# Create virtual environment and install dependencies
uv sync
# Activate virtual environment
source .venv/bin/activate # On Windows: .venv\Scripts\activate

# Start Jupyter
uv run jupyter notebook notebooks/notebook.ipynb

The notebook includes:
- Complete exploratory data analysis
- Feature engineering
- Model training and comparison
- Hyperparameter tuning
- All visualizations saved to screenshots/
# Train model using production script
uv run python src/train/train.py

Expected output:
============================================================
FLIGHT PRICE PREDICTION - MODEL TRAINING
============================================================
Random State: 42
Loading data from dataset/Clean_Dataset.csv...
Loaded 300,153 rows, 12 columns
Preprocessing data...
Features shape: (300153, 11)
============================================================
TRAINING MODELS
============================================================
Baseline models:
Training Linear Regression...
Training Ridge Regression...
Training Lasso Regression...
Models with hyperparameter tuning:
Tuning Random Forest...
Tuning XGBoost...
Tuning CatBoost...
============================================================
MODEL COMPARISON (sorted by RMSE)
============================================================
Model RMSE R2 MAE
Random Forest 2127.354825 0.991221 768.780202
XGBoost 2189.536252 0.990700 1069.969849
CatBoost 3084.850592 0.981539 1718.064701
Lasso Regression 7012.694003 0.904598 4622.719640
Ridge Regression 7012.704891 0.904598 4623.039426
Linear Regression 7012.705115 0.904598 4622.989854
============================================================
BEST MODEL
============================================================
Best Model: Random Forest
RMSE: 2,127.35
R2: 0.9912
Best Hyperparameters: {'n_estimators': 300, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': None}
# Start FastAPI server
uv run uvicorn src.predict.predict:app --reload

The server starts at http://localhost:8000
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
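
The prediction endpoint can also be called from Python. The following is a minimal sketch using only the standard library; it assumes the server is running locally on port 8000 and that `/predict` accepts the JSON fields used in the curl example in this section. The helper names `build_predict_request` and `predict` are illustrative and not part of the project.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # local development server

# Same fields as the curl example for POST /predict.
EXAMPLE_PAYLOAD = {
    "airline": "IndiGo",
    "source": "Delhi",
    "destination": "Cochin",
    "date_of_journey": "2026-03-15",
    "dep_time": "10:30",
    "arrival_time": "14:45",
    "duration": "4h 15m",
    "total_stops": "1 stop",
}


def build_predict_request(payload, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for /predict."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(payload, base_url=BASE_URL, timeout=10):
    """Send the request and decode the JSON response (server must be up)."""
    request = build_predict_request(payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    print(predict(EXAMPLE_PAYLOAD))
```

Separating request construction from sending keeps the payload logic testable without a running server.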
curl http://localhost:8000/

Response:
{
"status": "healthy",
"model_loaded": true,
"version": "1.0.0"
}

curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{
"airline": "IndiGo",
"source": "Delhi",
"destination": "Cochin",
"date_of_journey": "2026-03-15",
"dep_time": "10:30",
"arrival_time": "14:45",
"duration": "4h 15m",
"total_stops": "1 stop"
}'

Response:
{
"predicted_price": 8542.75,
"currency": "INR",
"model_version": "1.0.0",
"features_used": ["Journey_day", "Journey_month", "..."]
}

# Ensure model is trained first
uv run python src/train/train.py
# Build Docker image
docker build -t flight-price-api .

# Run container
docker run -p 8000:8000 flight-price-api

# Health check
curl http://localhost:8000/
# Make prediction
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{
"airline": "IndiGo",
"source": "Delhi",
"destination": "Cochin",
"date_of_journey": "2026-03-15",
"dep_time": "10:30",
"arrival_time": "14:45",
"duration": "4h 15m",
"total_stops": "1 stop"
}'

flight-price-prediciton/
├── dataset/ # Kaggle dataset (gitignored)
│ └── Clean_Dataset.csv
├── models/ # Trained model artifacts
│ └── best_model.joblib
├── notebooks/
│ └── notebook.ipynb # EDA and model training
├── screenshots/ # Visualizations (13 images)
├── src/
│ ├── __init__.py
│ ├── train/
│ │ ├── __init__.py
│ │ ├── constants.py # Configuration constants
│ │ ├── preprocessing.py # Data preprocessing
│ │ └── train.py # Production training script
│ └── predict/
│ ├── __init__.py
│ ├── schemas.py # Pydantic models
│ ├── model_loader.py # Model loading utility
│ ├── preprocessing.py # Inference preprocessing
│ └── predict.py # FastAPI application
├── tests/
│ └── __init__.py
├── specs/ # Feature specifications
├── .gitignore
├── .dockerignore
├── pyproject.toml # Dependencies (uv)
├── uv.lock # Locked versions
├── Dockerfile
└── README.md
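
The inference preprocessing in src/predict/preprocessing.py is not reproduced here. As a rough sketch of what such a step might do, the code below turns the request fields from the `/predict` example into numeric features. Only `Journey_day` and `Journey_month` appear in the API response shown above; the other feature names, the `"non-stop"` label, and all function names are assumptions made for illustration.

```python
import re
from datetime import date, time


def parse_duration_minutes(text):
    """Convert strings such as '4h 15m', '2h' or '45m' to total minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*", text)
    if not match or not any(match.groups()):
        raise ValueError(f"Unrecognised duration: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


def parse_stops(text):
    """Map 'non-stop' (assumed label) to 0 and 'N stop(s)' to N."""
    cleaned = text.strip().lower()
    if cleaned == "non-stop":
        return 0
    match = re.fullmatch(r"(\d+) stops?", cleaned)
    if not match:
        raise ValueError(f"Unrecognised stops value: {text!r}")
    return int(match.group(1))


def extract_features(request):
    """Derive numeric features from a /predict request dictionary."""
    journey = date.fromisoformat(request["date_of_journey"])
    departure = time.fromisoformat(request["dep_time"])
    return {
        "Journey_day": journey.day,
        "Journey_month": journey.month,
        "Dep_hour": departure.hour,          # hypothetical feature name
        "Dep_min": departure.minute,         # hypothetical feature name
        "Duration_minutes": parse_duration_minutes(request["duration"]),
        "Total_stops": parse_stops(request["total_stops"]),
    }


if __name__ == "__main__":
    example = {
        "date_of_journey": "2026-03-15",
        "dep_time": "10:30",
        "duration": "4h 15m",
        "total_stops": "1 stop",
    }
    print(extract_features(example))
```

Raising `ValueError` on unrecognised input lets the API layer reject malformed requests before they reach the model.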
| Model | RMSE | R² Score | MAE |
|---|---|---|---|
| Random Forest | 2127.354825 | 0.991221 | 768.780202 |
| XGBoost | 2189.536252 | 0.990700 | 1069.969849 |
| CatBoost | 3084.850592 | 0.981539 | 1718.064701 |
| Lasso Regression | 7012.694003 | 0.904598 | 4622.719640 |
| Ridge Regression | 7012.704891 | 0.904598 | 4623.039426 |
| Linear Regression | 7012.705115 | 0.904598 | 4622.989854 |
The best model (Random Forest) reaches R² = 0.9912, well above the R² > 0.85 target.
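
For reference, the three metrics in the table can be computed as in the sketch below. It uses plain Python to show the definitions; scikit-learn offers equivalent functions, and the training script's own evaluation code may differ. The sample values are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, R2 and MAE for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "RMSE": math.sqrt(ss_res / n),
        "R2": 1 - ss_res / ss_tot,
        "MAE": sum(abs(e) for e in errors) / n,
    }


if __name__ == "__main__":
    # Made-up prices, only to exercise the function.
    actual = [100, 200, 300, 400]
    predicted = [110, 190, 310, 390]
    print(regression_metrics(actual, predicted))
```

RMSE penalizes large errors more heavily than MAE, which is why a model can rank differently on the two columns.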
- Random State: All random operations use random_state=42
- Dependencies: Locked with uv.lock for exact versions
- Dataset: Download instructions provided above
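
To illustrate what a fixed random state buys, the sketch below performs a seeded train/test split with the standard library. It is a stand-in for the scikit-learn utilities that accept `random_state`; the function name and the 20% test size are assumptions, not values taken from the project.

```python
import random


def train_test_split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]


if __name__ == "__main__":
    first = train_test_split_indices(10)
    second = train_test_split_indices(10)
    print(first == second)  # same seed, same split
```

With the same seed, every run sees identical training and test rows, so metric differences between runs reflect code changes rather than sampling noise.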
The current model can be deployed to Google Cloud Run to serve predictions. Follow the steps below.
- Install prerequisites
# Install Google Cloud SDK
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
# Login and set project
gcloud auth login
gcloud config set project YOUR_PROJECT_ID

- Enable required APIs

gcloud services enable run.googleapis.com cloudbuild.googleapis.com

- Build and push to Google Container Registry

gcloud builds submit --tag gcr.io/YOUR_PROJECT_ID/flight-price-api

- Deploy to Cloud Run

gcloud run deploy flight-price-api \
--image gcr.io/YOUR_PROJECT_ID/flight-price-api \
--platform managed \
--region us-central1 \
--memory 4Gi \
--cpu 2 \
--timeout 600 \
--allow-unauthenticated

- Language: Python 3.13
- Package Manager: uv
- ML Frameworks: scikit-learn, XGBoost, CatBoost
- API: FastAPI, uvicorn, Pydantic
- Visualization: matplotlib, seaborn
- Containerization: Docker
This project is for educational purposes. The dataset is from Kaggle and subject to its license terms.












