The project consists of 9 Docker images defined in docker-compose:
- namenode - image for the Apache Hadoop namenode
- datanode-1 - image for Apache Hadoop datanode-1
- datanode-2 - image for Apache Hadoop datanode-2
- spark-master - image for the master node of the Spark standalone cluster
- spark-worker-1 - image for a worker node of the Spark standalone cluster
- spark-worker-2 - image for a worker node of the Spark standalone cluster
- spark-worker-3 - image for a worker node of the Spark standalone cluster
- ftpd_server - image (Dockerfile) for the FTP server which simulates the source system
- pyspark-etl - image (Dockerfile) for the PySpark jobs of the Data Lake
To start the Docker containers:

docker-compose up --build

Note: before starting Docker you should first upload the source data!
There are 4 layers.

The first layer is the FTP source system. It consists of 5 JSON files which are expected to be filled before data processing and before the Docker containers are started (check the README for details). It is our "source" system.
The next layer (Bronze) consists of 5 ORC "tables" and represents raw storage; the source code for filling it is in this repository.
Each table is partitioned by the "ctl_loading" field, which is a technical identifier of a data load.
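Purely as an illustration of this pattern (the paths, the "users" entity and the load id are hypothetical; only the ctl_loading column comes from the description above), a write into the Bronze layer could look roughly like this:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze_etl_sketch").getOrCreate()

# Hypothetical example: read one raw JSON entity uploaded to the landing/source area
raw_df = spark.read.json("hdfs://namenode:9000/landing/users.json")

ctl_loading = 20240101  # technical identifier of this data load, normally supplied by the pipeline

(raw_df
 .withColumn("ctl_loading", F.lit(ctl_loading))
 .write
 .mode("append")
 .partitionBy("ctl_loading")   # every Bronze table is partitioned by the load id
 .orc("hdfs://namenode:9000/bronze/users"))
```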
To run the data pipeline job:

docker exec -it pyspark-etl /bin/bash

Inside the docker container:

cd /app && sh run_bronze_etl.sh

The next layer (Silver) consists of 5 Parquet "tables": two dimension tables maintained as SCD2 and three snapshot fact tables.
The Silver layer represents a star schema with 2 historical dimension tables and 3 snapshot fact tables; this is usually where historical data is stored. A hedged SCD2 sketch follows below.
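The SCD2 logic is adapted from the blogpost credited at the end of this README. The sketch below only illustrates the general idea; the table, business key and technical column names (user_id, valid_from, valid_to, is_active) and the paths are hypothetical, not the exact project code:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("silver_scd2_sketch").getOrCreate()

# Hypothetical inputs: the current SCD2 dimension and one Bronze increment
dim = spark.read.parquet("hdfs://namenode:9000/silver/dim_users")
inc = (spark.read.orc("hdfs://namenode:9000/bronze/users")
       .where(F.col("ctl_loading") == 20240108))        # assumes one change per key in the increment

key = "user_id"                                          # hypothetical business key
attrs = [c for c in inc.columns if c not in (key, "ctl_loading")]

active = dim.where(F.col("is_active"))
history = dim.where(~F.col("is_active"))                 # already closed versions stay untouched

# Keys whose attributes differ between the active version and the increment
changed = (active.alias("d").join(inc.alias("i"), key)
           .where(F.concat_ws("|", *[F.col(f"d.{c}") for c in attrs]) !=
                  F.concat_ws("|", *[F.col(f"i.{c}") for c in attrs]))
           .select(key))

unchanged_active = active.join(changed, key, "left_anti")
closed = (active.join(changed, key, "left_semi")          # close the old version
          .withColumn("valid_to", F.current_date())
          .withColumn("is_active", F.lit(False)))
new_versions = (inc.join(changed, key, "left_semi")       # open a new version (brand-new keys omitted for brevity)
                .select(key, *attrs)
                .withColumn("valid_from", F.current_date())
                .withColumn("valid_to", F.lit(None).cast("date"))
                .withColumn("is_active", F.lit(True)))

result = history.unionByName(unchanged_active).unionByName(closed).unionByName(new_versions)

# Write to a temporary location (and swap afterwards), since the source path is still being read
result.write.mode("overwrite").parquet("hdfs://namenode:9000/silver/dim_users_tmp")
```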
To run the data pipeline that populates the whole layer:

docker exec -it pyspark-etl /bin/bash

Inside the docker container:

cd /app && sh run_silver_etl.sh

The last layer (Gold) consists of the final aggregated Parquet "table" weekly_business_aggregate.
The Gold layer is the Data Mart part: a fully denormalized, aggregated structure that can be queried by BI tools.
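As a sketch of what building such an aggregate can look like (the fact/dimension names, the snapshot_date and amount columns and the paths are hypothetical; only weekly_business_aggregate and business_id come from this README):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gold_etl_sketch").getOrCreate()

# Hypothetical Silver inputs: one snapshot fact and one SCD2 dimension (active versions only)
fact = spark.read.parquet("hdfs://namenode:9000/silver/fact_transactions")
dim_users = (spark.read.parquet("hdfs://namenode:9000/silver/dim_users")
             .where(F.col("is_active")))

weekly = (fact.join(dim_users, "user_id")
          .withColumn("week", F.date_trunc("week", F.col("snapshot_date")))
          .groupBy("week", "business_id")
          .agg(F.sum("amount").alias("total_amount"),
               F.countDistinct("user_id").alias("active_users")))

(weekly.write
 .mode("overwrite")
 .parquet("hdfs://namenode:9000/gold/weekly_business_aggregate"))
```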
To run the data pipeline that populates the whole layer:

docker exec -it pyspark-etl /bin/bash

Inside the docker container:

cd /app && sh run_gold_etl.sh

Further improvements:
- Add a scheduler and orchestrator, e.g. Airflow
- Add a metadata management system, e.g. PostgreSQL + a self-written service
- Make the SCD2 algorithm more general, e.g. for the case when an increment brings several changes per one business_id or user_id (a sketch of one possible approach follows after this list)
- Add a synthetic data generator for incremental loading tests
- Add tests for the code (not done yet due to time limits)
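One possible direction for the "several changes per key" improvement (not the current implementation; the change_ts and user_id columns are hypothetical) is to pre-collapse the increment with a window function so that each key carries an ordered chain of versions before the regular SCD2 merge:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("scd2_multi_change_sketch").getOrCreate()

inc = (spark.read.orc("hdfs://namenode:9000/bronze/users")
       .where(F.col("ctl_loading") == 20240115))

# Order all changes of one key inside the batch by a (hypothetical) change timestamp
w = Window.partitionBy("user_id").orderBy("change_ts")

versions = (inc
            .withColumn("valid_from", F.col("change_ts"))
            .withColumn("valid_to", F.lead("change_ts").over(w))   # the next change closes the version
            .withColumn("is_active", F.col("valid_to").isNull()))  # only the last change stays open

# 'versions' can then be merged into the dimension: closed versions go straight to history,
# and only the open version per key participates in the usual SCD2 comparison.
```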
Credits:
- Dockerfile - the docker-compose setup on which my docker-compose is based
- Docker image - the Big Data Europe team project, which provides the Docker images for Apache Spark and Apache Hadoop
- Blogpost - the SCD2 algorithm for Spark, which I have generalised a bit
