A machine learning project that predicts passenger survival in the Titanic disaster from demographic and travel information.
The goal is to explore the dataset, engineer meaningful features, and build predictive models capable of estimating survival probability.
The sinking of the RMS Titanic is one of the most famous maritime disasters. Using passenger information such as age, sex, class, and ticket data, the objective is to predict whether a passenger survived.
This project applies data analysis and machine learning techniques to identify the most important factors influencing survival.
Dataset from Kaggle:
Titanic - Machine Learning from Disaster
Features include:
- Passenger class
- Name
- Sex
- Age
- Number of siblings/spouses aboard
- Number of parents/children aboard
- Ticket number
- Fare
- Cabin
- Port of embarkation
Target variable:
- Survived (0 = No, 1 = Yes)
titanic-ml-project/
    data/
    notebooks/
        01_eda.ipynb
        02_feature_engineering.ipynb
        03_modeling.ipynb
    src/
        preprocessing.py
        features.py
        model.py
        train.py
    reports/
Key insights discovered during analysis:
- Women had a significantly higher survival rate than men.
- First-class passengers survived more frequently than third-class passengers.
- Children had a higher survival probability than adults.
- Passengers traveling in small families had better survival rates.
- Passengers embarking at Cherbourg showed higher survival rates, likely due to a higher proportion of first-class passengers.
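Group comparisons like the ones above can be reproduced with a simple `groupby`. The following is a minimal sketch, assuming the Kaggle column names (`Survived`, `Sex`, `Pclass`). The helper `survival_rate_by` is hypothetical, and the small hand-made frame only demonstrates the call; it does not reflect the real statistics.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the mean of `Survived` (the survival rate) per value of `column`."""
    return df.groupby(column)["Survived"].mean().sort_values(ascending=False)


# Toy rows for illustration only; these are NOT taken from the Titanic dataset.
toy = pd.DataFrame(
    {
        "Sex": ["female", "female", "male", "male"],
        "Pclass": [1, 3, 1, 3],
        "Survived": [1, 1, 1, 0],
    }
)

rates_by_sex = survival_rate_by(toy, "Sex")
print(rates_by_sex)
```

On the real training data, the same helper could be called as `survival_rate_by(train, "Pclass")` or with any other categorical column.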
Several new features were created to improve model performance:
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1
Passengers traveling in small families showed higher survival rates.
train["HasCabin"] = train["Cabin"].notna().astype(int)
Passengers with recorded cabin information had a higher chance of survival.
train['Deck'] = train['Cabin'].apply(lambda x: x[0] if pd.notnull(x) else 'U')
Deck information was extracted from the Cabin column.
train = pd.get_dummies(train, columns=["Embarked"], drop_first=True)
Categorical variables were transformed using one-hot encoding.
train['Fare_Per_Person'] = train['Fare'] / train['FamilySize']
This feature provides a more accurate representation of individual socioeconomic standing by normalizing costs across family units.
train['IsAlone'] = (train['FamilySize'] == 1).astype(int)
While FamilySize provides granular counts, this binary flag simplifies the feature space so the model can focus on the survival gap between passengers traveling alone and those traveling with family, without the noise of specific family sizes.
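The individual steps above can be collected into a single function. This is a minimal sketch rather than the project's actual `features.py`: the name `add_features` is hypothetical, and `.str[0].fillna("U")` is used as an equivalent of the `apply`/`lambda` form for extracting the deck letter.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with the engineered columns described above."""
    out = df.copy()
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    out["HasCabin"] = out["Cabin"].notna().astype(int)
    # The first character of the cabin string is the deck letter; "U" marks unknown.
    out["Deck"] = out["Cabin"].str[0].fillna("U")
    out["Fare_Per_Person"] = out["Fare"] / out["FamilySize"]
    # One-hot encode the port of embarkation, dropping the first level.
    out = pd.get_dummies(out, columns=["Embarked"], drop_first=True)
    return out


# Toy rows for illustration only (not real passengers).
toy = pd.DataFrame(
    {
        "SibSp": [1, 0],
        "Parch": [2, 0],
        "Cabin": ["C85", None],
        "Fare": [80.0, 8.0],
        "Embarked": ["C", "S"],
    }
)
features = add_features(toy)
print(features)
```

Working on a copy keeps the raw frame untouched, so the same function can be applied to both the training and test sets.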
The following models were evaluated:
- Random Forest
- Gradient Boosting
- AdaBoost
- XGBoost
- Voting Classifier
Model performance was evaluated using accuracy on validation data.
Best performing model:
Voting Classifier with AdaBoost and XGBoost
Cross-validation accuracy: 82-84%
Kaggle leaderboard accuracy: 78%
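A voting ensemble of this kind might be set up roughly as follows. This is a sketch under several assumptions, not the project's actual configuration: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the example needs only scikit-learn, and the soft-voting mode, hyperparameters, fold count, and synthetic data are all illustrative choices that the results above do not specify.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; replace with the engineered Titanic features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
        # Stand-in for XGBoost (assumption); xgboost.XGBClassifier could be used instead.
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",  # average predicted probabilities (assumed, not stated in the source)
)

# 5-fold cross-validated accuracy (the fold count is an assumption).
scores = cross_val_score(ensemble, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The printed numbers describe the synthetic data only and are not comparable to the scores reported above.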
Most important features:
- Sex
- Pclass
- Fare
- Age
- FamilySize
- HasCabin
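The ranking above does not say how importance was measured. One common option is the impurity-based importance exposed by tree ensembles in scikit-learn, sketched below on synthetic data with placeholder feature names. The choice of `RandomForestClassifier` and its settings are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; replace with the engineered Titanic features and labels.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort them from most to least important.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so permutation importance (`sklearn.inspection.permutation_importance`) is a reasonable cross-check.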
Clone the repository:
git clone https://github.com/rodrigofl-dev/titanic-ml-project.git
Install dependencies:
pip install -r requirements.txt
Run the training pipeline:
python src/run.py
This will:
- Load raw data
- Perform preprocessing
- Create features
- Train the model
- Generate predictions
Predictions will be saved in:
data/submissions/submission.csv
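The Kaggle competition expects a CSV with two columns, `PassengerId` and `Survived`. A minimal sketch of assembling such a file is shown here; the helper `build_submission` is hypothetical and the IDs and predictions are made up for illustration.

```python
import pandas as pd


def build_submission(passenger_ids, predictions) -> pd.DataFrame:
    """Assemble the two-column frame Kaggle expects: PassengerId and Survived."""
    return pd.DataFrame(
        {
            "PassengerId": list(passenger_ids),
            "Survived": [int(p) for p in predictions],
        }
    )


# Hypothetical IDs and predictions, for illustration only.
submission = build_submission([892, 893, 894], [0, 1, 0])
csv_text = submission.to_csv(index=False)
print(csv_text)
```

To write the file to the location given above, call `submission.to_csv("data/submissions/submission.csv", index=False)` after making sure the directory exists.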
Technologies used:
- Python
- pandas
- numpy
- scikit-learn
- seaborn
- matplotlib
- Jupyter Notebook
Rodrigo Lopes - Data Science project developed as part of a machine learning portfolio.




