The dataset was downloaded from Kaggle (here).
It contains data about houses in Ames, Iowa: 79 explanatory variables and a continuous dependent variable, the sale price.
```
pandas==1.2.3
numpy==1.19.5
matplotlib==3.4.1
seaborn==0.11.1
scikit-learn==0.24.1
```
- EDA
- Examining relationships between variables (used Pearson's correlation for numerical features, Kendall rank correlation for ordinal features, decision tree importance for nominal features)
- Feature selection based on the previous step: dropped features that were highly correlated with each other to avoid multicollinearity
- Transforming the dataset:
- Imputing missing values (for numerical features I used median, for ordinal and nominal I used the most frequent value)
- Scaling numerical features and the output variable
- Encoding ordinal and nominal features (used OrdinalEncoder for ordinal features and OneHotEncoder for nominal features)
- Performed cross-validation and grid search to find the best estimator, using RMSE as the scoring metric
- Transforming the test set
- Making predictions
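The preprocessing and model-selection steps above can be sketched with a scikit-learn `ColumnTransformer` and `GridSearchCV`. The column lists, the quality scale, and the `C` grid below are illustrative placeholders, not the project's actual configuration:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Hypothetical column groups -- the real lists come from the EDA step.
numerical = ["GrLivArea", "TotalBsmtSF"]
ordinal = ["ExterQual", "KitchenQual"]
nominal = ["Neighborhood", "MSZoning"]

# Fixing the category order makes the ordinal encoding meaningful
# (worst to best); this scale is an assumption.
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

# Median imputation + scaling for numerical features,
# most-frequent imputation + encoding for ordinal/nominal ones.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numerical),
    ("ord", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OrdinalEncoder(categories=[quality_order] * len(ordinal)))]), ordinal),
    ("nom", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), nominal),
])

model = Pipeline([("prep", preprocess), ("svr", SVR())])

# RMSE as the scoring method; scikit-learn maximizes scores,
# hence the "neg_" prefix.
search = GridSearchCV(model,
                      param_grid={"svr__C": [0.1, 1, 10]},
                      scoring="neg_root_mean_squared_error",
                      cv=5)
```

Calling `search.fit(X_train, y_train)` then runs the whole grid search; `search.best_estimator_` can be applied to the transformed test set for the final predictions.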
- Correlation matrix
- Numerical features with high correlation with the output variable
- Relationship between ordinal variables and the output variable
- Nominal feature importances
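The correlation analysis behind the first two plots reduces to pandas one-liners. The toy frame and column names below are illustrative, not the actual Ames columns used:

```python
import pandas as pd

# Toy frame standing in for the Ames data; values are made up.
df = pd.DataFrame({
    "GrLivArea": [856, 1262, 1786, 1717, 2198],        # numerical
    "OverallQual": [7, 6, 7, 7, 8],                    # ordinal (integer-coded)
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Pearson's correlation for numerical features vs. the target...
pearson = df["GrLivArea"].corr(df["SalePrice"], method="pearson")
# ...and Kendall rank correlation for ordinal ones.
kendall = df["OverallQual"].corr(df["SalePrice"], method="kendall")

# The correlation-matrix plot is then e.g. sns.heatmap(df.corr(), annot=True)
```

For nominal features, which have no order at all, a fitted `DecisionTreeRegressor`'s `feature_importances_` plays the same role.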
SVR turned out to be the best estimator, achieving an average RMSE of 0.353 during cross-validation.
- Using IterativeImputer instead of SimpleImputer could improve the score
- More advanced feature engineering could improve the score
- Possibly using ANN or XGBoost could lead to better results
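For the first point, swapping `SimpleImputer` for `IterativeImputer` is a small change; the imputer is still experimental in scikit-learn and must be enabled explicitly. A minimal sketch on toy data:

```python
import numpy as np
# IterativeImputer is experimental and needs this explicit enable import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, np.nan],
              [np.nan, 10.0]])

# Each feature with missing values is modelled as a function of the others,
# instead of SimpleImputer's per-column median.
imputed = IterativeImputer(random_state=0).fit_transform(X)
```

Because it exploits correlations between features, it can give more plausible fills than a single per-column statistic, at the cost of extra fitting time.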