Northwind Data Lakehouse with Delta Lake

Creating a data lakehouse on Databricks Community Edition.

📖 Project

👨🏻‍🏫 Introduction

This project builds a Data Lakehouse - using Databricks and Delta Lake - for a database containing the sales data of a fictitious company called “Northwind Traders”, which imports and exports specialty foods around the world.

The project uses the Community Edition of Databricks, which imposes restrictions: Delta Live Tables, cloud partner integrations, GitHub integration, and job scheduling are unavailable - tools that would otherwise enrich the project considerably.

To run the project, the following are required:

  • Databricks Runtime 11.0 (community edition)
  • Apache Spark 3.3.0
  • Scala 2.12

Only structured data was used in the project, but the workspace and project structure - a Data Lakehouse - remain scalable to semi-structured and unstructured data, depending on the use case.
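As a rough sketch of the ingestion step (this is not the project's actual notebook code - the file path and table name are assumptions), loading one raw Northwind CSV and persisting it as a Delta table on Databricks looks roughly like:

```python
# Minimal Databricks-notebook sketch: raw CSV -> Delta table.
# "/FileStore/tables/orders.csv" and "bronze_orders" are hypothetical
# names - adjust them to wherever the CSVs were uploaded.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided on Databricks

raw_orders = (
    spark.read
    .option("header", "true")       # Northwind CSVs include a header row
    .option("inferSchema", "true")  # let Spark derive the column types
    .csv("/FileStore/tables/orders.csv")
)

# Writing in Delta format is what upgrades a plain file lake to a lakehouse:
# the table gains ACID transactions, schema enforcement, and time travel.
raw_orders.write.format("delta").mode("overwrite").saveAsTable("bronze_orders")
```

On Community Edition this runs in a notebook cell; the resulting table is then queryable from Hive SQL (`SELECT * FROM bronze_orders`) or through the Pandas-style APIs.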

🎯 Goal

  • Create a data lakehouse from the CSV files of the database using the following technologies:

    • PySpark
    • Koalas
    • Spark Pandas
    • Hive SQL
  • With the Data Lakehouse created, conduct a business analysis to answer the following questions:

    • What are the 5 least-sold products?
    • What are the top 5 customers with the highest number of purchases?
    • What are the top 5 customers with the highest purchase value?
    • Which employee made the most sales last year?
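Each of these questions reduces to a group-and-aggregate over the order tables. As a plain-Python illustration of the logic (the rows and customer IDs below are invented for the example; the notebooks express the same idea as a PySpark `groupBy`/`count` or a Hive SQL `GROUP BY`):

```python
# Toy illustration of the "top 5 customers by number of purchases" question.
# The sample rows are hypothetical, not taken from the Northwind data.
from collections import Counter

orders = [  # (order_id, customer_id)
    (1, "ALFKI"), (2, "ANATR"), (3, "ALFKI"),
    (4, "BERGS"), (5, "ALFKI"), (6, "ANATR"),
]

# Count orders per customer, then take the 5 largest counts.
purchases_per_customer = Counter(customer for _, customer in orders)
top_5 = purchases_per_customer.most_common(5)
print(top_5)  # [('ALFKI', 3), ('ANATR', 2), ('BERGS', 1)]
```

`Counter.most_common(5)` mirrors the `ORDER BY count DESC LIMIT 5` shape the same question takes in SQL.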

💽 Database

🗺 Data Lakehouse

🗄 Notebooks

📦 Folder Structure

├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         and a short `-` delimited description, e.g.
│                         `1.0-initial-data-exploration`.
│
├── references         <- Figures, manuals, and all other explanatory materials.

About

Creating a Simple Data Lakehouse using Delta Lake on Databricks. My 1st Data Engineering Project.
