The project consists of 9 Docker images defined in docker-compose:
- namenode - image for the Apache Hadoop namenode
- datanode-1 - image for Apache Hadoop datanode-1
- datanode-2 - image for Apache Hadoop datanode-2
- spark-master - image for the master node of the Spark standalone cluster
- spark-worker-1 - image for a worker node of the Spark standalone cluster
- spark-worker-2 - image for a worker node of the Spark standalone cluster
- spark-worker-3 - image for a worker node of the Spark standalone cluster
- ftpd_server - image (Dockerfile) for the FTP server which simulates the source system
- pyspark-etl - image (Dockerfile) for the PySpark jobs of the Data Lake
To start the Docker containers:

docker-compose up --build

Note: before starting Docker you should first upload the source data!
There are 4 layers.

The first layer is the FTP source system. It consists of 5 JSON files which are expected to be filled before data processing and before the Docker containers are started (check the README for details). It is our "source" system.
The next layer (Bronze) consists of 5 ORC "tables" and represents raw storage; the source code for filling it is in this repository.
Each table is partitioned by the "ctl_loading" field, which is a technical identifier of a data load.
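Purely as an illustration of this pattern (the paths, the "users" entity and the load id are hypothetical; only the ctl_loading column comes from the description above), a write into the Bronze layer could look roughly like this:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze_etl_sketch").getOrCreate()

# Hypothetical example: read one raw JSON entity uploaded to the landing/source area
raw_df = spark.read.json("hdfs://namenode:9000/landing/users.json")

ctl_loading = 20240101  # technical identifier of this data load, normally supplied by the pipeline

(raw_df
 .withColumn("ctl_loading", F.lit(ctl_loading))
 .write
 .mode("append")
 .partitionBy("ctl_loading")   # every Bronze table is partitioned by the load id
 .orc("hdfs://namenode:9000/bronze/users"))
```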
To run the data pipeline job:

docker exec -it pyspark-etl /bin/bash

Inside the docker container:

cd /app && sh run_bronze_etl.sh

The next layer (Silver) consists of 5 Parquet "tables": two dimension tables maintained as SCD2 and three snapshot fact tables.
The Silver layer represents a star schema with 2 historical dimension tables and 3 snapshot fact tables; this is usually where historical data is stored. A hedged SCD2 sketch follows below.
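The SCD2 logic is adapted from the blogpost credited at the end of this README. The sketch below only illustrates the general idea; the table, business key and technical column names (user_id, valid_from, valid_to, is_active) and the paths are hypothetical, not the exact project code:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("silver_scd2_sketch").getOrCreate()

# Hypothetical inputs: the current SCD2 dimension and one Bronze increment
dim = spark.read.parquet("hdfs://namenode:9000/silver/dim_users")
inc = (spark.read.orc("hdfs://namenode:9000/bronze/users")
       .where(F.col("ctl_loading") == 20240108))        # assumes one change per key in the increment

key = "user_id"                                          # hypothetical business key
attrs = [c for c in inc.columns if c not in (key, "ctl_loading")]

active = dim.where(F.col("is_active"))
history = dim.where(~F.col("is_active"))                 # already closed versions stay untouched

# Keys whose attributes differ between the active version and the increment
changed = (active.alias("d").join(inc.alias("i"), key)
           .where(F.concat_ws("|", *[F.col(f"d.{c}") for c in attrs]) !=
                  F.concat_ws("|", *[F.col(f"i.{c}") for c in attrs]))
           .select(key))

unchanged_active = active.join(changed, key, "left_anti")
closed = (active.join(changed, key, "left_semi")          # close the old version
          .withColumn("valid_to", F.current_date())
          .withColumn("is_active", F.lit(False)))
new_versions = (inc.join(changed, key, "left_semi")       # open a new version (brand-new keys omitted for brevity)
                .select(key, *attrs)
                .withColumn("valid_from", F.current_date())
                .withColumn("valid_to", F.lit(None).cast("date"))
                .withColumn("is_active", F.lit(True)))

result = history.unionByName(unchanged_active).unionByName(closed).unionByName(new_versions)

# Write to a temporary location (and swap afterwards), since the source path is still being read
result.write.mode("overwrite").parquet("hdfs://namenode:9000/silver/dim_users_tmp")
```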
To run the data pipeline that populates the whole layer:

docker exec -it pyspark-etl /bin/bash

Inside the docker container:

cd /app && sh run_silver_etl.sh

The last layer (Gold) consists of the final aggregated Parquet "table" weekly_business_aggregate.
The Gold layer is the Data Mart part: a fully denormalized, aggregated structure that can be queried by BI tools.
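As a sketch of what building such an aggregate can look like (the fact/dimension names, the snapshot_date and amount columns and the paths are hypothetical; only weekly_business_aggregate and business_id come from this README):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gold_etl_sketch").getOrCreate()

# Hypothetical Silver inputs: one snapshot fact and one SCD2 dimension (active versions only)
fact = spark.read.parquet("hdfs://namenode:9000/silver/fact_transactions")
dim_users = (spark.read.parquet("hdfs://namenode:9000/silver/dim_users")
             .where(F.col("is_active")))

weekly = (fact.join(dim_users, "user_id")
          .withColumn("week", F.date_trunc("week", F.col("snapshot_date")))
          .groupBy("week", "business_id")
          .agg(F.sum("amount").alias("total_amount"),
               F.countDistinct("user_id").alias("active_users")))

(weekly.write
 .mode("overwrite")
 .parquet("hdfs://namenode:9000/gold/weekly_business_aggregate"))
```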
To run the data pipeline that populates the whole layer:

docker exec -it pyspark-etl /bin/bash

Inside the docker container:

cd /app && sh run_gold_etl.sh

Further improvements:
- Add a scheduler and orchestrator, e.g. Airflow
- Add a metadata management system, e.g. PostgreSQL + a self-written service
- Make the SCD2 algorithm more general, e.g. for the case when an increment brings several changes per one business_id or user_id (a sketch of one possible approach follows after this list)
- Add a synthetic data generator for incremental loading tests
- Add tests for the code (not done yet due to time limits)
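One possible direction for the "several changes per key" improvement (not the current implementation; the change_ts and user_id columns are hypothetical) is to pre-collapse the increment with a window function so that each key carries an ordered chain of versions before the regular SCD2 merge:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("scd2_multi_change_sketch").getOrCreate()

inc = (spark.read.orc("hdfs://namenode:9000/bronze/users")
       .where(F.col("ctl_loading") == 20240115))

# Order all changes of one key inside the batch by a (hypothetical) change timestamp
w = Window.partitionBy("user_id").orderBy("change_ts")

versions = (inc
            .withColumn("valid_from", F.col("change_ts"))
            .withColumn("valid_to", F.lead("change_ts").over(w))   # the next change closes the version
            .withColumn("is_active", F.col("valid_to").isNull()))  # only the last change stays open

# 'versions' can then be merged into the dimension: closed versions go straight to history,
# and only the open version per key participates in the usual SCD2 comparison.
```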
Credits:
- Dockerfile - the docker-compose setup on which my docker-compose is based
- Docker image - the Big Data Europe team project, which provides the Docker images for Apache Spark and Apache Hadoop
- Blogpost - the SCD2 algorithm for Spark, which I have generalised a bit
