The code for the SQL, Python, and data model sections is written using Spark SQL. To run the code, you will need the prerequisites listed below.
Prerequisites
Windows users: please set up WSL and a local Ubuntu virtual machine by following the instructions here.
Install the above prerequisites in your Ubuntu terminal; if you have trouble installing Docker, follow the steps here (only Step 1 is necessary).
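Before moving on, it can help to confirm the required tools are actually on your PATH. A minimal sketch using only the Python standard library — the tool names below are the ones this setup relies on; adjust the list if yours differs:

```python
import shutil

def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

# Tools this setup relies on; adjust if your list differs.
required = ["git", "docker"]
for tool in missing_tools(required):
    print(f"Missing: {tool} -- install it before continuing")
```

If the script prints nothing, everything on the list was found.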
Fork the data_engineering_for_beginners_code repository.
After forking, clone the repo to your local machine and start the containers as shown below:
# Replace your-user-name with your GitHub username
git clone https://github.com/your-user-name/data_engineering_for_beginners_code.git
cd data_engineering_for_beginners_code
docker compose up -d --build
sleep 30
Open Jupyter Lab at http://localhost:8888 and run the code in ./notebooks/starter-notebook.ipynb to create the data and check that your setup worked.
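The fixed `sleep 30` is usually enough, but on slower machines the containers may need longer to start. One alternative is to poll the Jupyter endpoint until it answers, as in this sketch using only the Python standard library (the URL is the one from the step above):

```python
import time
import urllib.error
import urllib.request

def wait_until(check, timeout=120, interval=2):
    """Call `check()` every `interval` seconds until it returns True
    or `timeout` seconds elapse; return True on success."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False

def jupyter_is_up(url="http://localhost:8888"):
    """True if the Jupyter Lab endpoint returns any HTTP response."""
    try:
        urllib.request.urlopen(url, timeout=5)
        return True
    except urllib.error.HTTPError:
        return True   # the server responded, even if with an error status
    except OSError:
        return False  # connection refused: not up yet

# Usage against the live stack (uncomment once the containers are up):
# print("ready" if wait_until(jupyter_is_up) else "timed out")
```

`wait_until` is deliberately generic, so the same helper can poll the Airflow UI in the next step.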
After the data is created, open the Airflow UI at http://localhost:8080/, trigger the DAG, and ensure that it runs successfully.
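Airflow also exposes a stable REST API (POST /api/v1/dags/{dag_id}/dagRuns), so the DAG can be triggered from code instead of the UI. A minimal sketch using only the standard library — the credentials (airflow/airflow) and the DAG id are assumptions; check your docker-compose file and the Airflow UI for the real values:

```python
import base64
import json
import urllib.request

def build_trigger_request(dag_id, base_url="http://localhost:8080",
                          user="airflow", password="airflow"):
    """Build a POST request for Airflow's stable REST API endpoint
    that creates a new DAG run: POST /api/v1/dags/{dag_id}/dagRuns.
    The default credentials here are an assumption -- verify yours."""
    url = f"{base_url}/api/v1/dags/{dag_id}/dagRuns"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps({"conf": {}}).encode(),  # empty run configuration
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )

# Usage against the live stack (uncomment once the containers are up);
# "my_dag" is a placeholder -- use the DAG id shown in the Airflow UI:
# with urllib.request.urlopen(build_trigger_request("my_dag")) as resp:
#     print(resp.status)
```

A successful call returns the created DAG run as JSON; you can still watch its progress in the UI.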
After you are done, shut down the containers with:
# -v also removes the volumes, deleting the data created above
docker compose down -v