SimpleDataLake

Project docker architecture

The project consists of 9 Docker images wired together in docker-compose (see the connection sketch after this list):

  • namenode - image for Apache Hadoop namenode
  • datanode-1 - image for Apache Hadoop datanode-1
  • datanode-2 - image for Apache Hadoop datanode-2
  • spark-master - image for master node of spark standalone cluster
  • spark-worker-1 - image for worker node of spark standalone cluster
  • spark-worker-2 - image for worker node of spark standalone cluster
  • spark-worker-3 - image for worker node of spark standalone cluster
  • ftpd_server - image (built from a Dockerfile) for the FTP server that simulates the source system
  • pyspark-etl - image (built from a Dockerfile) for the PySpark ETL jobs of the Data Lake
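
The containers find each other on the compose network by service name. Below is a minimal PySpark sketch of how the pyspark-etl container would reach the Spark master and the HDFS namenode; the service names come from docker-compose, while the ports are assumptions and may differ in this setup.

```python
# Minimal connectivity sketch for the pyspark-etl container.
# Service names (spark-master, namenode) match docker-compose;
# the ports (7077, 9000) are assumptions and may need adjusting.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("simple-data-lake-smoke-test")
    .master("spark://spark-master:7077")                          # standalone master
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:9000")  # HDFS namenode
    .getOrCreate()
)

# Quick round trip through HDFS to verify the wiring.
df = spark.createDataFrame([(1, "ok")], ["id", "status"])
df.write.mode("overwrite").parquet("/tmp/smoke_test")
spark.read.parquet("/tmp/smoke_test").show()
```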

To start the Docker containers

docker-compose up --build

Before starting Docker you should upload the source data first!

Project Data Lake Infrastructure

Data Lake layers

There are 4 layers:

FTP Source

This layer consists of 5 JSON files. They are expected to be filled before data processing and before the Docker containers are started (check the README for details); this is our "source" system.
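
How the JSON files are pulled from the FTP server is defined by the repo's own jobs; the sketch below only illustrates the general pattern with Python's ftplib, using placeholder credentials, directory, and file names.

```python
# Hypothetical sketch of downloading the source JSON files from the
# ftpd_server container. The host matches the compose service name;
# credentials and paths are placeholders, not the repo's real config.
from ftplib import FTP

with FTP("ftpd_server") as ftp:
    ftp.login(user="ftpuser", passwd="ftppass")         # placeholder credentials
    for name in ftp.nlst():                             # list files in the FTP root
        if name.endswith(".json"):
            with open(f"/tmp/{name}", "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)
```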

Bronze layer

This layer consists of 5 ORC "tables" and represents raw storage; the source code that populates it is in this repository.

Each table is partitioned by the "ctl_loading" field, a technical identifier of the data load.
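
A minimal sketch of the Bronze write pattern follows; the paths and the way ctl_loading is generated are assumptions, not the repo's actual code.

```python
# Bronze write sketch: raw JSON in, ORC out, partitioned by ctl_loading.
# Paths and the load-id convention are assumptions for illustration only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-example").getOrCreate()

ctl_loading = 20240101120000          # technical identifier of this data load

raw = spark.read.json("/landing/users.json")            # hypothetical landing path
(raw
 .withColumn("ctl_loading", F.lit(ctl_loading))
 .write
 .mode("append")
 .partitionBy("ctl_loading")
 .orc("/bronze/users"))                                  # hypothetical Bronze path
```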

To run the data pipeline job:

docker exec -it pyspark-etl /bin/bash

Inside the Docker container:

cd /app && sh run_bronze_etl.sh

Silver layer

This layer consists of 5 Parquet "tables": two dimensions maintained as SCD2 and three snapshot facts.

The Silver layer represents a star schema with 2 historical dimension tables and 3 snapshot fact tables; historical data is usually stored here.
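
A single SCD2 step for one dimension might look like the sketch below; the key and column names (business_id, valid_from, valid_to, is_current), the paths, and the single tracked attribute are assumptions rather than the repo's actual schema, and brand-new keys are omitted for brevity.

```python
# Hedged SCD2 sketch: close the open version of changed keys and open a new one.
# Schema, paths, and the tracked attribute are assumptions, not the repo's code.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-example").getOrCreate()

dim = spark.read.parquet("/silver/dim_business")         # current dimension state
inc = spark.read.orc("/bronze/business")                  # latest Bronze increment
now = F.current_timestamp()

cur = dim.filter("is_current")

# Keys whose tracked attribute changed in this increment.
changed_keys = (cur.alias("d").join(inc.alias("i"), "business_id")
                   .filter(F.col("d.name") != F.col("i.name"))   # example attribute
                   .select("business_id"))

# Open rows that have to be closed, and their closed replacements.
to_close = cur.join(changed_keys, "business_id", "left_semi")
closed = (to_close.withColumn("valid_to", now)
                  .withColumn("is_current", F.lit(False)))

# New open versions coming from the increment.
opened = (inc.join(changed_keys, "business_id", "left_semi")
             .withColumn("valid_from", now)
             .withColumn("valid_to", F.lit(None).cast("timestamp"))
             .withColumn("is_current", F.lit(True)))

# Everything else (history and unchanged open rows) carries over as-is.
untouched = dim.exceptAll(to_close)

result = untouched.unionByName(closed).unionByName(opened)    # assumes matching schemas
result.write.mode("overwrite").parquet("/silver/dim_business_new")  # write, then swap
```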

To run the data pipeline that populates the whole layer:

docker exec -it pyspark-etl /bin/bash

Inside the Docker container:

cd /app && sh run_silver_etl.sh

Gold Layer

This layer consists of the final aggregated Parquet "table" weekly_business_aggregate.

The Gold layer plays the role of a Data Mart: a fully denormalized, aggregated structure that can be queried by BI tools.
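
A minimal sketch of the weekly roll-up could look as follows; the fact table path, column names, and metrics are assumptions used only to illustrate the shape of the Gold job.

```python
# Gold aggregation sketch: roll a Silver snapshot fact up to one row per week
# and business. Paths, columns, and metrics are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gold-example").getOrCreate()

facts = spark.read.parquet("/silver/fact_reviews")       # hypothetical snapshot fact

weekly = (facts
          .withColumn("week", F.date_trunc("week", F.col("snapshot_date")))
          .groupBy("week", "business_id")
          .agg(F.count("*").alias("review_cnt"),
               F.avg("stars").alias("avg_stars")))

weekly.write.mode("overwrite").parquet("/gold/weekly_business_aggregate")
```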

To run the data pipeline that populates the whole layer:

docker exec -it pyspark-etl /bin/bash

Inside the Docker container:

cd /app && sh run_gold_etl.sh

Potential improvements

  • Add a scheduler and orchestrator, e.g. Airflow
  • Add a metadata management system, e.g. PostgreSQL plus a self-written service
  • Make the SCD2 algorithm more general, e.g. handle increments that bring several changes for one business_id or user_id
  • Add a synthetic data generator for testing incremental loads
  • Add tests for the code; this has not been done yet due to time constraints

Materials Used in Task

  • Dockerfile - the docker-compose on which this project's docker-compose is based
  • Docker image - the Big Data Europe project, which provides the Docker images for Apache Spark and Apache Hadoop
  • Blogpost - an SCD2 algorithm for Spark, which I have generalised a bit
