For an interactive preview (with all the charts displayed):
- You can edit and run the notebook by importing it into your Databricks account. After clicking Workspace, select Import from any folder's menu and paste the URL that shows up → here
- If you don't have a Databricks account, set one up and create a workspace first.
- If you would like a detailed explanation of how it works first → click here
- PySpark is the ‘product’ of the collaboration between Apache Spark and Python.
- PySpark is the Python API for Apache Spark, an open source distributed computing framework that provides some of the most popular tools used to carry out common Big Data related tasks.
Aim: To create an ML model with PySpark that predicts which passengers survived the sinking of the Titanic.
→ The Titanic ML competition is almost legendary, and nearly everyone (competitor or not) who has tackled the challenge did so with either Python or R. I decided to use PySpark instead, running a notebook in Databricks, to show how easy it can be to work with PySpark, namely regarding:
- EDA
- Feature Selection
- Feature Engineering
- Train-Test Split (within the training set)
- Pipelines
- Classification and Evaluation: a baseline model and hyperparameter tuning with CrossValidator
