Predicting passenger survival using existing and engineered features, and understanding feature importance.
On April 15, 1912, the RMS Titanic sank while crossing the Atlantic Ocean from Southampton, England, to New York City. In this Kaggle exercise, I'll be using the Titanic passenger dataset to determine which features correlate most strongly with passenger survival. Binary classification is widely applicable in the real world: with algorithms like logistic regression, we can isolate the features that best separate one outcome from the other (success or failure, or here, survival or death). Using the Kaggle Titanic dataset, I'll be answering the following questions and building a binary classification model to predict survival.
- How many passenger classes did the Titanic have, and what was the median age in each class?
- Were there more families or single passengers on the Titanic?
- Which single passenger characteristic was associated with the highest probability of survival?
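As a rough illustration of how these questions translate into code, here is a minimal Python sketch. It is not the analysis from this repo (the exploratory work here was done in SQL). The rows in `toy` are synthetic and are not taken from the Kaggle data; only the column names (`Survived`, `Pclass`, `Sex`, `Age`, `SibSp`, `Parch`) follow the Kaggle `train.csv`. The helper function names and the `is_female` feature are hypothetical, and "alone" is assumed to mean `SibSp + Parch == 0`.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic toy rows for illustration only (NOT real Kaggle data).
# Column names follow the Kaggle train.csv layout.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1],
    "Pclass":   [1, 3, 2, 3, 1, 2, 3, 1],
    "Sex":      ["female", "male", "female", "male",
                 "female", "male", "male", "male"],
    "Age":      [38.0, 22.0, 27.0, 35.0, 54.0, 30.0, 4.0, 48.0],
    "SibSp":    [1, 0, 0, 0, 1, 0, 3, 0],
    "Parch":    [0, 0, 2, 0, 0, 0, 1, 0],
})


def median_age_by_class(df):
    """Question 1: median age within each passenger class."""
    return df.groupby("Pclass")["Age"].median()


def family_vs_alone(df):
    """Question 2: count passengers traveling alone vs. with family.

    Assumption: a passenger is 'alone' when SibSp + Parch == 0.
    """
    is_alone = (df["SibSp"] + df["Parch"]) == 0
    return is_alone.map({True: "alone", False: "family"}).value_counts()


def logistic_coefficients(df):
    """Question 3: fit a logistic regression and return its coefficients.

    Larger absolute coefficients suggest stronger association with
    survival, but the features here are unscaled, so magnitudes are
    only loosely comparable.
    """
    X = pd.DataFrame({
        "is_female": (df["Sex"] == "female").astype(int),  # hypothetical feature
        "Pclass": df["Pclass"],
        "Age": df["Age"],
    })
    model = LogisticRegression(max_iter=1000).fit(X, df["Survived"])
    return pd.Series(model.coef_[0], index=X.columns)


print(median_age_by_class(toy))
print(family_vs_alone(toy))
print(logistic_coefficients(toy))
```

To run the same helpers on the real data, one could load it with `pd.read_csv("titanic_data/train.csv")` and fill or drop missing `Age` values before fitting, since `LogisticRegression` does not accept NaNs.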
- Python 3.7.4
- pandas 1.2.3
- numpy 1.19.5
- matplotlib 3.2.0
- seaborn 0.11.1
- scikit-learn (sklearn) 0.24.2
- titanic_data/data-train.csv: Transformed train dataset
- titanic_data/data-test.csv: Transformed test dataset
- titanic_data/train.csv: Original train dataset from Kaggle
- titanic_data/test.csv: Original test dataset from Kaggle
Exploratory Analysis & Feature Engineering: Conducted in PostgreSQL
- create_base_titanic_tables.sql
- titanic_survival_by_feature_high_level.sql
- titanic_train_test_missing_vals.sql
- titanic_overall_survival.sql
- titanic_sex_age_survival.sql
- titanic_survival_fare_per_passenger.sql
- titanic_survival_title.sql
- titanic_is_woman_child.sql
- titanic_cabin_level_embarked.sql
- titanic_one_family_mixed_group_alone.sql
- titanic_family_size_by_class.sql
- titanic_train_test_raw_v2.sql
- titanic_estimate_cabin_level_logic.sql
- titanic_train_test_raw_v3.sql
- titanic_train_test_wcg_v0.sql
- titanic_train_test_ml_features_v0.sql
Modeling & Visualization: Conducted in Python with Jupyter notebooks