alonsofabila-dev/data-breakable-toy

data-breakable-toy

This project is meant to build knowledge and skills in data engineering, working with ELT pipelines over batch and streaming data using cloud technologies. It was developed using the Databricks Data Intelligence Platform and AWS Kinesis.

App Architecture

Screenshot 2025-03-14 at 4 47 10 p m

Prerequisites

Note

The Free Trial has some limitations: you cannot create your own compute cluster and can only work with a serverless cluster. For the limitations of a serverless cluster, see: https://docs.databricks.com/aws/en/compute/serverless/limitations

Follow the next steps to set up the project:

Cloning the Repo into the Databricks Workspace

Batch processing

  1. Set up a volume:
  • Creating a volume to Store data in Unity Catalog:
    • On the left panel click on Catalog > Click on Workspace (catalog) > Click on Default (schema).

    • Once in the default schema, in the upper-right corner click on the Create dropdown and select Volume.

    • Give the volume a name and select Managed volume.

    • Once the volume is created, click the Upload to this volume button located in the upper-right corner.

    • Browse to or drag and drop the social_media_engagement.csv file and click the Upload button.

  2. Creating the Batch ETL pipeline:
  • Creating a Job Workflow to process batch data:
    • On the left panel click on Workflows.

    • In the upper-right corner click on Create Job.

      a. Give the task a name that matches the notebook it will run.

      b. In the path field, browse to the location where you cloned the repository and select the Ingest Data notebook.

      • For the Ingest Data and Data Exploratory notebook tasks, under Parameters in the task form, create a parameter with the key 'source' whose value is the path to the social_media_engagement.csv file located in the workspace catalog.
      c. Click on the Create Task button.

      d. Click on the Add Task button and select Notebook.

      • Notice that the Depends on field in the form now has the previous task selected. This means the subsequent task will not begin until the previous task has finished.

      • For the Data Exploratory notebook task, clear the Depends on field. For the Transform Data notebook task, add the Data Exploratory notebook task as a dependency along with the Ingest Data notebook task.

      e. After you finish setting up the workflow, click on the Run Now button in the upper-right corner.

Note

Repeat steps 'a' to 'd' for every notebook under the Batch Data Pipeline folder in the cloned repository until you reach a workflow graph like this one:

Screenshot 2025-03-14 at 2 49 54 p m
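The task dependencies described above can also be written down as data, which makes it easier to check the graph before clicking through the UI. The following is a minimal sketch, not the project's own code. It assumes the catalog and schema are named `workspace` and `default`, uses a hypothetical volume name (`raw_data`) and a placeholder repository path, and covers only the three notebooks named in the steps. The dictionary layout loosely mirrors the shape of a Databricks Jobs task definition (`task_key`, `depends_on`, `notebook_task`), but it is intended only for local reasoning about the run order.

```python
from graphlib import TopologicalSorter

# Hypothetical values: replace with the volume name and repo location you used.
VOLUME_NAME = "raw_data"
REPO_PATH = "/Workspace/Users/<your-user>/data-breakable-toy/Batch Data Pipeline"


def volume_file_path(catalog, schema, volume, filename):
    """Build the path of a file stored in a Unity Catalog volume."""
    return f"/Volumes/{catalog}/{schema}/{volume}/{filename}"


def notebook_task(task_key, notebook, depends_on=(), parameters=None):
    """Describe one notebook task and the tasks it must wait for."""
    task = {
        "task_key": task_key,
        "notebook_task": {"notebook_path": f"{REPO_PATH}/{notebook}"},
        "depends_on": [{"task_key": key} for key in depends_on],
    }
    if parameters:
        task["notebook_task"]["base_parameters"] = dict(parameters)
    return task


def build_tasks():
    """Tasks as described in the steps: two independent tasks, then Transform Data."""
    source = volume_file_path(
        "workspace", "default", VOLUME_NAME, "social_media_engagement.csv"
    )
    params = {"source": source}
    return [
        notebook_task("ingest_data", "Ingest Data", parameters=params),
        notebook_task("data_exploratory", "Data Exploratory", parameters=params),
        notebook_task(
            "transform_data",
            "Transform Data",
            depends_on=("ingest_data", "data_exploratory"),
        ),
    ]


def run_order(tasks):
    """Return one valid execution order that respects every dependency."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task["depends_on"]}
        for task in tasks
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for key in run_order(build_tasks()):
        print(key)
```

Because Ingest Data and Data Exploratory have no dependencies, they can run in parallel, while Transform Data always comes last. The same `source` value is what you would paste into the task's parameter field.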

Stream processing

  1. Set up a Pipeline:
  • Creating a Pipeline to process stream:

    • On the left panel click on Pipelines located under the Data Engineering menu > Click on Create Pipeline dropdown > Select ETL pipeline.

    • Give the pipeline a name.

    • Select Continuous as the pipeline mode.

    • Under the Source Code block select the path to the Streaming ETL Pipeline notebook located in the cloned repository.

    • For the destination select the Workspace catalog and the default schema.

    • In the Advanced configuration you'll need to create four key/value pairs, one for each of the following keys, using exactly these names and your own values:

      • awsAccessKeyId
      • awsSecretKey
      • kinesisStreamName
      • kinesisRegion
    • Click on the Create button and the pipeline will start ingesting data from Kinesis.
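The four configuration keys above have to reach the Kinesis reader inside the pipeline notebook. The following is a hedged sketch of how that mapping could look; it is not taken from the repository's notebook. The helper name `kinesis_reader_options` is hypothetical, and the option names (`streamName`, `region`, `awsAccessKey`, `awsSecretKey`, `initialPosition`) are the ones I understand the Databricks Kinesis source to accept, so verify them against the connector documentation for your runtime.

```python
# Keys exactly as they are named in the pipeline's advanced configuration.
REQUIRED_KEYS = (
    "awsAccessKeyId",
    "awsSecretKey",
    "kinesisStreamName",
    "kinesisRegion",
)


def kinesis_reader_options(conf):
    """Validate the pipeline configuration and map it to reader options.

    `conf` is any mapping of configuration key to value. In a notebook it
    could be built with spark.conf.get(key) for each required key.
    """
    missing = [key for key in REQUIRED_KEYS if not conf.get(key)]
    if missing:
        raise ValueError(
            "Missing pipeline configuration keys: " + ", ".join(missing)
        )
    return {
        "streamName": conf["kinesisStreamName"],
        "region": conf["kinesisRegion"],
        "awsAccessKey": conf["awsAccessKeyId"],
        "awsSecretKey": conf["awsSecretKey"],
        # Assumption: start from the newest records rather than replaying the stream.
        "initialPosition": "latest",
    }


# Inside Databricks the options would then be passed to the streaming reader,
# for example:
#   options = kinesis_reader_options({k: spark.conf.get(k) for k in REQUIRED_KEYS})
#   stream = spark.readStream.format("kinesis").options(**options).load()
```

Failing fast on a missing key gives a clearer error than a continuous pipeline that starts and then cannot authenticate. For anything beyond a learning project, storing the credentials in a secret scope rather than in plain configuration values would be the safer choice.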

Data Dashboard

  1. Setting up a Dashboard in Databricks:

    • Once the batch and streaming data have been processed, click on Dashboards, located under the SQL menu on the left panel.

    • In the upper-right corner click the arrow dropdown next to the Create dashboard button and select Import Dashboard from File.

    • Select the file Social Media Engagement.lvdash.json from the cloned repository.

    • Click on Import Dashboard.

    Sample of the generated report in the Dashboard

    Screenshot 2025-03-14 at 3 58 22 p m
