This project is an end-to-end machine learning system for predicting used car prices. It follows a production-oriented workflow: raw data preprocessing and feature engineering, then model training and evaluation, and finally preparation for deployment as a Flask-based web application.
- Clean and preprocess real-world used car data
- Perform feature engineering using meaningful raw features
- Train and evaluate multiple regression models
- Select the best-performing model based on objective evaluation metrics
- Prepare the project for full-stack deployment using Flask
- Source: Kaggle – Used Cars Dataset
- Kaggle dataset link: https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data/
- Target Variable: price
- The dataset contains vehicle specifications, condition, usage history, and location data
- Download the dataset file (vehicles.csv), place it in '../data/raw', and then run the notebooks one by one, in order
- Language: Python
- Libraries:
  - pandas, numpy
  - scikit-learn
  - xgboost
  - matplotlib, seaborn
  - tqdm
- Deployment: Flask (planned)
- Exploratory Data Analysis (EDA)
- Data Cleaning
  - Handling missing values
  - Removing outliers
- Feature Engineering
  - Creating derived features (e.g. car_age)
  - Preparing categorical features for encoding
- Model Training & Evaluation
- Model Selection
- Model Saving for Deployment
- Full-Stack Application (Flask)
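
The cleaning and feature engineering steps above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the function name `clean_and_engineer`, the price and odometer bounds, the reference year, and the `"unknown"` placeholder are all assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical subset of categorical columns to fill; adjust to the real schema.
CATEGORICAL_COLS = ["manufacturer", "condition", "fuel", "transmission"]


def clean_and_engineer(df: pd.DataFrame, reference_year: int = 2021) -> pd.DataFrame:
    """Drop unusable rows, trim outliers, and add a car_age feature."""
    df = df.copy()

    # Rows without a target or key numeric features cannot be used for training.
    df = df.dropna(subset=["price", "year", "odometer"])

    # Illustrative outlier bounds only; the project's real thresholds are not stated.
    df = df[df["price"].between(500, 100_000)]
    df = df[df["odometer"].between(0, 400_000)]

    # Derived feature: vehicle age relative to an assumed reference year.
    df["car_age"] = reference_year - df["year"].astype(int)
    df = df[df["car_age"] >= 0]

    # Make missing categories explicit so the encoder sees a regular value.
    present = [col for col in CATEGORICAL_COLS if col in df.columns]
    if present:
        df[present] = df[present].fillna("unknown")

    return df.reset_index(drop=True)


raw = pd.DataFrame(
    {
        "price": [5000, None, 10, 20000, 8000],
        "year": [2010, 2015, 2012, 2018, 2005],
        "odometer": [120000, 50000, 80000, None, 200000],
        "fuel": ["gas", "diesel", None, "gas", None],
    }
)
cleaned = clean_and_engineer(raw)
print(cleaned)
```

In this toy frame only the first and last rows survive: the others lack a price, fall below the assumed price floor, or lack an odometer reading.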
- price – vehicle selling price
- year – manufacturing year
- odometer – mileage
- cylinders – number of engine cylinders
- manufacturer – car brand
- model – vehicle model
- condition – overall vehicle condition
- fuel – fuel type
- transmission – transmission type
- drive – drivetrain
- type – vehicle body type
- size – vehicle size category
- paint_color – exterior color
- title_status – legal title status
- state – vehicle location (US state)
Categorical features are encoded using one-hot encoding during the preprocessing stage inside the machine learning pipeline.
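
One common way to keep one-hot encoding inside the pipeline is scikit-learn's `ColumnTransformer`. The sketch below assumes that approach; the helper name `build_pipeline`, the column subsets, and the hyperparameters are illustrative and not taken from the project.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_pipeline(numeric_cols, categorical_cols):
    """Bundle encoding and the regressor so inference reuses the same steps."""
    preprocess = ColumnTransformer(
        transformers=[
            # handle_unknown="ignore" encodes unseen categories as all zeros
            # instead of raising an error at prediction time.
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
            ("num", "passthrough", numeric_cols),
        ]
    )
    return Pipeline(
        steps=[
            ("preprocess", preprocess),
            ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
        ]
    )


# Tiny made-up training frame, only to show the pipeline running end to end.
train = pd.DataFrame(
    {
        "manufacturer": ["ford", "toyota", "ford", "honda"],
        "fuel": ["gas", "gas", "diesel", "gas"],
        "odometer": [120000, 60000, 90000, 30000],
        "car_age": [10, 5, 8, 2],
    }
)
prices = [6000, 15000, 9000, 22000]

pipeline = build_pipeline(["odometer", "car_age"], ["manufacturer", "fuel"])
pipeline.fit(train, prices)

# "bmw" never appeared during training; the encoder still handles it.
new_car = pd.DataFrame(
    [{"manufacturer": "bmw", "fuel": "gas", "odometer": 50000, "car_age": 4}]
)
prediction = pipeline.predict(new_car)
print(prediction)
```

Because the encoder lives inside the pipeline, the fitted category vocabulary travels with the saved model, and the serving code can pass raw feature values directly.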
The following regression models were trained and evaluated:
- Linear Regression
- Ridge Regression
- Lasso Regression
- ElasticNet
- K-Nearest Neighbors Regressor
- Decision Tree Regressor
- Random Forest Regressor
- Gradient Boosting Regressor
- XGBoost Regressor
Note: Support Vector Regression (SVR) was intentionally excluded due to scalability limitations and poor suitability for deployment.
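
A comparison loop over several candidates might look like the sketch below. It runs on synthetic data from `make_regression` rather than the car dataset, covers only a few of the listed models (XGBoost is omitted to avoid the extra dependency), and uses default-like hyperparameters, so the resulting ranking says nothing about the project's real results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the real project would use the preprocessed car features.
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit every candidate on the same split and record its hold-out MAE.
results = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    results[name] = mean_absolute_error(y_test, estimator.predict(X_test))

# Lower MAE is better, so sort ascending.
ranking = sorted(results.items(), key=lambda item: item[1])
for name, mae in ranking:
    print(f"{name:<20} MAE = {mae:.2f}")
```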
Models were evaluated using:
- MAE (Mean Absolute Error) ↓
- RMSE (Root Mean Squared Error) ↓
- R² Score ↑
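
As a sketch of how these three metrics can be computed with scikit-learn, the hypothetical helper `regression_report` below returns them together. The prices in the usage example are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, RMSE, and R² for a set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    # RMSE is the square root of MSE, so it is in the same units as the price.
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r2 = r2_score(y_true, y_pred)
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Made-up prices: errors of 1000, 1000, and 3000.
y_true = [10000, 20000, 30000]
y_pred = [11000, 19000, 33000]
report = regression_report(y_true, y_pred)
print(report)
```

Here the RMSE (about 1915) is noticeably larger than the MAE (about 1667) because squaring gives the single 3000 error more weight.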
After comparing all trained models, the Random Forest Regressor was selected as the final model because it achieved:
- Lowest MAE (average pricing error)
- Lowest RMSE (penalizing large errors)
- Highest R² score (~0.89)
- Strong generalization performance
The Random Forest model provided the best balance between accuracy, robustness, and production readiness.
| Metric | Value |
|---|---|
| MAE | ~2200 |
| RMSE | ~4700 |
| R² Score | ~0.89 |
The final trained model was saved using joblib for deployment:
`joblib.dump(model, "models/used_car_price_model.pkl")`
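
A save-and-reload round trip can be checked as in the sketch below. It trains a small stand-in model on made-up numbers and writes to a temporary directory instead of `models/`, so it runs without the project's files; the feature values are illustrative only.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in model on made-up (car_age, odometer) -> price rows.
X = np.array([[5, 60000], [10, 120000], [2, 20000], [15, 180000]])
y = np.array([18000, 9000, 26000, 4000])
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Write to a temporary directory so the sketch does not touch the repository.
model_path = Path(tempfile.mkdtemp()) / "used_car_price_model.pkl"
joblib.dump(model, model_path)

# At deployment time the application would load the file once at startup.
restored = joblib.load(model_path)
print(restored.predict(X[:1]))
```

A model saved this way should be loaded with the same scikit-learn version it was trained with, since pickled estimators are not guaranteed to be compatible across versions.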
## Deployment Strategy:
- User inputs raw feature values through a web form
- Backend handles preprocessing and encoding
- Model returns real-time price prediction
- This design ensures clean user input and consistent preprocessing during inference.
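
Since the Flask application is still planned, the following is only a sketch of what a prediction endpoint could look like. The `/predict` route, the `create_app` factory, the `FEATURES` subset, and the use of a JSON body instead of an HTML form are all assumptions made for illustration. The model is passed in as an argument, so any object with a `predict` method works, including a saved pipeline loaded with joblib.

```python
import pandas as pd
from flask import Flask, jsonify, request

# Hypothetical subset of the raw input features expected from the client.
FEATURES = ["year", "odometer", "manufacturer", "fuel"]


def create_app(model):
    """Build a Flask app around any object exposing predict(DataFrame)."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True) or {}

        # Reject incomplete requests before they reach the model.
        missing = [name for name in FEATURES if name not in payload]
        if missing:
            return jsonify({"error": f"missing fields: {missing}"}), 400

        # One-row frame of raw values; encoding is left to the model pipeline.
        row = pd.DataFrame([{name: payload[name] for name in FEATURES}])
        price = float(model.predict(row)[0])
        return jsonify({"predicted_price": round(price, 2)})

    return app


class ConstantModel:
    """Stand-in for the trained pipeline, used only to exercise the route."""

    def predict(self, frame):
        return [12345.678] * len(frame)


app = create_app(ConstantModel())
client = app.test_client()
ok_response = client.post(
    "/predict",
    json={"year": 2015, "odometer": 80000, "manufacturer": "ford", "fuel": "gas"},
)
bad_response = client.post("/predict", json={"year": 2015})
print(ok_response.get_json(), bad_response.status_code)
```

Keeping preprocessing inside the model object means the route only validates input and forwards raw values, which matches the goal of consistent preprocessing during inference.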