Please run the script 'prepareenv.sh'; it will prepare everything for you.
Workflow description:
load_csv: reads the CSV file from /mnt
push_to_sql: pushes the data as-is, without any modification, to the PostgreSQL database
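A minimal sketch of these first two tasks; the file name, table name, and connection string below are placeholders, not the DAG's actual values:

```python
import pandas as pd
from sqlalchemy import create_engine

def load_csv(path="/mnt/credit_data.csv"):
    # Read the raw CSV exactly as mounted; no transformation yet.
    return pd.read_csv(path)

def push_to_sql(df, table="raw_data"):
    # Write the untouched rows to Postgres, replacing the table on reruns.
    engine = create_engine("postgresql://user:password@postgres:5432/airflow_db")
    df.to_sql(table, engine, if_exists="replace", index=False)
```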
load_and_feature_engineering: loads the data from SQL and performs feature engineering on it, which includes the following (sketched in code after these steps):
Add the class label based on the business requirement that a customer is considered defaulted if the number of delinquent days exceeds 90
Drop features that are unrelated to the analysis from a business standpoint (like the customer ID), as well as features with a high correlation according to a correlation matrix that was prepared beforehand
Rename the features so they are easier to query, removing the whitespace from the names and the % characters from inside some feature entries, since these would otherwise break the push to SQL after encoding
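A sketch of these steps, assuming illustrative column names such as days_delinquent and customer_id (the real DAG may use different ones):

```python
def feature_engineering(df):
    # Business rule: defaulted if delinquent for more than 90 days.
    df["defaulted"] = (df["days_delinquent"] > 90).astype(int)

    # Drop identifiers and the features flagged by the precomputed
    # correlation matrix (only customer_id is shown here).
    df = df.drop(columns=["customer_id"], errors="ignore")

    # Make names SQL-friendly: strip whitespace from column names and
    # remove '%' characters from string entries.
    df.columns = [c.strip().replace(" ", "_") for c in df.columns]
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.replace("%", "", regex=False)
    return df
```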
handing_missing_data: imputes missing categorical values with the mode, or with 'Others' as a new category for columns where empty values are expected, and imputes missing numerical values with the mean.
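A sketch of that imputation logic, where EXPECTED_EMPTY is a hypothetical set of columns known to arrive blank:

```python
EXPECTED_EMPTY = {"occupation"}  # hypothetical columns where blanks are expected

def handle_missing_data(df):
    for col in df.select_dtypes(include="object"):
        if col in EXPECTED_EMPTY:
            df[col] = df[col].fillna("Others")           # blanks become a category
        else:
            df[col] = df[col].fillna(df[col].mode()[0])  # most frequent value
    for col in df.select_dtypes(include="number"):
        df[col] = df[col].fillna(df[col].mean())         # mean imputation
    return df
```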
push_to_elastic: pushes the ready data to Elasticsearch to create insights.
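One way this push could look with the official elasticsearch Python client; the host and index name are assumptions:

```python
from elasticsearch import Elasticsearch, helpers

def push_to_elastic(df, index="credit_insights"):
    # Bulk-index one document per DataFrame row.
    es = Elasticsearch("http://elasticsearch:9200")
    actions = ({"_index": index, "_source": row} for row in df.to_dict(orient="records"))
    helpers.bulk(es, actions)
```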
encoding: performs one-hot encoding on the categorical variables and a manual label encoding on some of the binary categorical variables
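A sketch of the encoding step; the binary column gender and its mapping are illustrative, and pd.get_dummies stands in for whichever one-hot encoder the DAG actually uses:

```python
import pandas as pd

def encode(df):
    # Manual label encoding for a binary categorical (illustrative column).
    df["gender"] = df["gender"].map({"Male": 0, "Female": 1})
    # One-hot encode whatever categorical columns remain.
    categorical = df.select_dtypes(include="object").columns.tolist()
    return pd.get_dummies(df, columns=categorical)
```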
scaling: performs scaling on the numerical features
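The README does not name the scaler, so the sketch below assumes sklearn's StandardScaler:

```python
from sklearn.preprocessing import StandardScaler

def scale(df, numeric_cols):
    # StandardScaler is an assumption; any sklearn scaler slots in the same way.
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
    return df
```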
push_readyDF_to_sql: sends the prepared data frame to SQL so that, in case further DAGs want to train different models, they can load the data from this ready table in SQL
split-and_balance_training: since our problem is classification on imbalanced data, this task splits the data into training and testing sets and applies SMOTE to balance the training data.
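A sketch of the split-and-balance step using imbalanced-learn; the 80/20 split ratio and the random seeds are assumptions:

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

def split_and_balance(X, y):
    # Stratified split keeps the class ratio intact in the test set;
    # SMOTE then oversamples the minority class in the training set only.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)
    return X_train, X_test, y_train, y_test
```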
4 different tasks to train 4 different models: decision tree, random forest, bagging, and XGBoost
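The four training tasks presumably each fit one estimator, along these lines (default hyperparameters are an assumption):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from xgboost import XGBClassifier

MODELS = {
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
    "bagging": BaggingClassifier(),
    "xgboost": XGBClassifier(),
}

def train(name, X_train, y_train):
    # Each of the four training tasks fits one of these estimators.
    return MODELS[name].fit(X_train, y_train)
```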
4 scoring tasks that compute the accuracy, recall, precision, and F1-score of each model
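Each scoring task could then produce one record along these lines:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def score(model, name, X_test, y_test):
    y_pred = model.predict(X_test)
    # One record per model, ready to be shipped to Elasticsearch.
    return {
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
```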
push_score_to_es: pushes those score records to Elasticsearch so that dashboards can be drawn for the scores and used to monitor how the models perform.
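A sketch of that final push, assuming the 8.x elasticsearch client (7.x would pass body= instead of document=); the index name is hypothetical:

```python
from elasticsearch import Elasticsearch

def push_score_to_es(score_record, index="model_scores"):
    # Index a single score document; the dashboards read from this index.
    es = Elasticsearch("http://elasticsearch:9200")
    es.index(index=index, document=score_record)
```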
Additional information:
The data is passed between the tasks in the DAG using XCom
Other internal functions are used inside the DAG to avoid repeating code, such as a function that creates the engine used to connect to the database; the details about them are in the DAG itself.
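Both points above might look roughly like this; the task id, the connection URI, and the helper name get_engine are placeholders for whatever the DAG actually defines:

```python
from sqlalchemy import create_engine

def get_engine():
    # Shared helper so every task builds its DB connection the same way;
    # the URI is a placeholder, not the project's real credentials.
    return create_engine("postgresql://user:password@postgres:5432/airflow_db")

def downstream_task(ti):
    # Airflow injects 'ti' (the task instance) into the callable; xcom_pull
    # retrieves whatever an upstream task pushed or returned via XCom.
    records = ti.xcom_pull(task_ids="load_csv")
    print(f"received {len(records)} records from load_csv")
```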