josephmachado/data_engineering_for_beginners_code

Data Engineering for Beginners

The code for the SQL, Python, and data model sections is written using Spark SQL. To run it, you will need the prerequisites listed below.

Setup

Prerequisites

  1. git version >= 2.37.1
  2. Docker version >= 20.10.17 and Docker Compose v2 version >= v2.10.2

Windows users: please set up WSL and a local Ubuntu virtual machine by following the instructions here.

Install the above prerequisites from your Ubuntu terminal. If you have trouble installing Docker, follow the steps here (only Step 1 is necessary).
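
To confirm that the installed tools meet the minimum versions listed above, a small shell check can help. This is a minimal sketch, assuming GNU `sort -V` is available (as it is on Ubuntu). The helper names `version_ge`, `first_version`, and `check_tool` are hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash

# version_ge A B: succeed when version A >= version B.
# Assumes GNU `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# first_version: print the first x.y.z number found on stdin.
first_version() {
  grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# check_tool NAME MINIMUM FOUND: report whether FOUND satisfies MINIMUM.
check_tool() {
  if [ -z "$3" ]; then
    echo "$1: not found"
  elif version_ge "$3" "$2"; then
    echo "$1: $3 OK (>= $2)"
  else
    echo "$1: $3 is too old (need >= $2)"
  fi
}

# Minimum versions are taken from the prerequisites list.
check_tool "git" 2.37.1 "$(git --version 2>/dev/null | first_version)"
check_tool "docker" 20.10.17 "$(docker --version 2>/dev/null | first_version)"
check_tool "docker compose" 2.10.2 "$(docker compose version 2>/dev/null | first_version)"
```

Each line of output names one tool and says whether it is missing, too old, or new enough.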

Fork this repository, data_engineering_for_beginners_code.

After forking, clone the repo to your local machine and start the containers as shown below:

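# Replace your-user-name with your GitHub username
git clone https://github.com/your-user-name/data_engineering_for_beginners_code.git
cd data_engineering_for_beginners_code
# Build the images and start the containers in the background
docker compose up -d --build
# Give the containers about 30 seconds to finish starting
sleep 30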

Open JupyterLab at http://localhost:8888 and run the notebook ./notebooks/starter-notebook.ipynb to create the data and confirm that your setup works.

After the data is created, open the Airflow UI at http://localhost:8080/, trigger the DAG, and make sure it runs successfully.
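
The fixed `sleep 30` in the setup commands may be too short on slower machines. As an alternative, you could poll the two URLs until they respond. This is a minimal sketch, assuming `curl` is installed. `wait_for_url` is a hypothetical helper, not part of this repository, and its 120-second default timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash

# wait_for_url URL [TIMEOUT_SECONDS]: poll URL until the server returns any
# HTTP response, or give up after roughly TIMEOUT_SECONDS (default 120).
wait_for_url() {
  local url="$1" timeout="${2:-120}" waited=0
  # curl exits 0 as soon as it receives an HTTP response of any status.
  until curl --silent --output /dev/null --max-time 5 "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $url" >&2
      return 1
    fi
    sleep 2
    waited=$((waited + 2))
  done
  echo "$url is responding"
}
```

For example, `wait_for_url http://localhost:8888 && wait_for_url http://localhost:8080` returns once both JupyterLab and the Airflow UI answer. A response only shows that the web server is up, so the notebook and the DAG still need to be run as described above.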

Shut down

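When you are done, shut down the containers with the command below. The -v flag also removes the volumes declared in the Compose file, so any data stored in them is deleted.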

docker compose down -v
