Create the best machine learning regression model that can predict the price (target variable) of a diamond when certain characteristics (features) are given.
Two datasets are given in a Kaggle competition:
- Train Set: with the following columns.
| Feature | Description |
|---|---|
| id | only for test & sample submission files, id for prediction sample identification |
| carat | weight of the diamond |
| cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
| color | diamond colour, from J (worst) to D (best) |
| clarity | a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
| x | length in mm |
| y | width in mm |
| z | depth in mm |
| depth | total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79) |
| table | width of top of diamond relative to widest point (43--95) |
| Target variable | Description |
|---|---|
| price | price in USD |
- Test Set: with the same features as the training set, but without the price, which has to be predicted using the ML model.
- Submit Set: the test set with only the "id" and "price" columns. This set has to be submitted to the Kaggle competition, which returns a score based on the RMSE. The lower this score, the better the model you are delivering.
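A minimal sketch of how such a submission file can be built; the tiny in-memory frame and the `DummyRegressor` stand in for the real `diamonds_test.csv` and trained model, and are illustrative only:

```python
import pandas as pd
from sklearn.dummy import DummyRegressor

# Stand-ins for diamonds_test.csv and the trained model (illustrative).
test = pd.DataFrame({"id": [0, 1, 2], "carat": [0.3, 0.7, 1.1]})
model = DummyRegressor(strategy="mean").fit(test[["carat"]], [500, 2500, 6000])

# The submission must contain only the "id" and "price" columns.
submission = pd.DataFrame({
    "id": test["id"],
    "price": model.predict(test[["carat"]]),
})
submission.to_csv("submission.csv", index=False)
```

In the real project the predictions come from the fitted regression model rather than a dummy baseline.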
- Sklearn | Machine Learning Library with regression models
- TensorFlow Keras | ML library
- H2O | AutoML module for machine learning
- diamonds_train.CSV | Shape (40345, 11)
- diamonds_test.CSV | Shape (13449, 10)
- Models submitted to the Kaggle competition
- Jupyter Notebooks of the process of each model
Cleaning Dataset: the following decisions have to be made:
- Ensure there are no null values
- Avoid non-numeric values
- Drop highly correlated features
- Standardize or normalize features with wide value ranges
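The cleaning steps above can be sketched as follows; the toy frame and its column names mirror the feature table, but the exact encoding and dropped columns are assumptions, not the project's definitive choices:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for diamonds_train.csv (illustrative values).
df = pd.DataFrame({
    "carat": [0.3, 0.7, None, 1.1],
    "cut": ["Ideal", "Premium", "Good", "Fair"],
    "x": [4.3, 5.7, 6.1, 6.6],
    "y": [4.3, 5.7, 6.1, 6.6],
})

df = df.dropna()  # ensure there are no null values

# Avoid non-numeric values: encode the ordinal "cut" as integers.
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
df["cut"] = df["cut"].map({c: i for i, c in enumerate(cut_order)})

df = df.drop(columns=["y"])  # drop a highly correlated feature (x ~ y)

# Standardize wide-range numeric features to zero mean, unit variance.
df[["carat", "x"]] = StandardScaler().fit_transform(df[["carat", "x"]])
```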
Identify the best model: different models yield different results. Some of them are analysed in the project:
- LinearRegression
- RandomForestRegressor
- LassoCV
- SGDRegressor
- Keras Sequential Dense Neural Network
- SVR
- Polynomial RFR
- Sum of two Polynomial RFRs
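One way to read "Polynomial RFR" is a RandomForestRegressor fitted on polynomial feature expansions; the sketch below shows that combination as a sklearn pipeline on synthetic data (the degree and tree count are assumptions, not the project's tuned values):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data in place of the cleaned diamonds features.
X, y = make_regression(n_samples=200, n_features=4, noise=10, random_state=0)

poly_rfr = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # expand features
    RandomForestRegressor(n_estimators=100, random_state=0),
)
poly_rfr.fit(X, y)
preds = poly_rfr.predict(X)
```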
Analyse the results: there are metrics from sklearn that help you choose the best model:
- R2_score
- RMSE
- Scatter plot of y_pred vs y_true
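The two numeric metrics can be computed with sklearn as below; the price values are made up purely to make the snippet self-contained:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative true and predicted prices (USD), not project results.
y_true = np.array([500.0, 2500.0, 6000.0, 9000.0])
y_pred = np.array([550.0, 2400.0, 6300.0, 8700.0])

r2 = r2_score(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5  # RMSE, the competition metric
```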
As can be appreciated, the best solution found is the Polynomial Random Forest Regressor. These are the plots that demonstrate the relationship between the true and the predicted values.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
Learning and enjoying every day. Next steps:
- Test new models and explore better neural networks
- Use H2O to learn the AutoML function


