GitHub - inv-fourier-transform/Databricks-learnings: My experiments with Databricks

Databricks - A high-level summary of my day-wise learnings

Day 1 – Databricks Basics

Understood what Databricks is and why it is used over Pandas and Hadoop for large-scale data processing
Learned the high-level idea of the Lakehouse architecture
Explored Databricks Workspace, Compute, and data browsing concepts
Created the first Databricks notebook
Ran basic PySpark commands

Day 2 – Apache Spark Fundamentals

Learned Spark’s high-level architecture (Driver, Executors, DAG)
Understood the difference between Spark DataFrames and RDDs
Learned why Spark uses lazy evaluation and when execution is triggered
Used Databricks notebook magic commands (%python, %sql, %fs)
Ran operations using filter, select, groupBy, orderBy
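
The lazy-evaluation idea can be sketched without a cluster. The block below is plain Python, not the Spark API: generators stand in for transformations such as filter and select, and `list()` stands in for an action such as `collect()`. The rows and function names are made up for illustration.

```python
# Plain-Python analogy for Spark's lazy evaluation (not the Spark API).
read_log = []  # records when a source row is actually read


def read_rows():
    """Stand-in for a data source; logs each row as it is pulled."""
    rows = [
        {"city": "Pune", "amount": 120},
        {"city": "Delhi", "amount": 80},
        {"city": "Mumbai", "amount": 200},
    ]
    for row in rows:
        read_log.append(row["city"])
        yield row


def keep_large(rows, threshold):
    """Stand-in for a filter transformation."""
    return (row for row in rows if row["amount"] > threshold)


def select_city(rows):
    """Stand-in for a select transformation."""
    return (row["city"] for row in rows)


# Building the pipeline only describes the work; nothing is read yet.
pipeline = select_city(keep_large(read_rows(), threshold=100))
rows_read_before_action = len(read_log)

# The "action" pulls data through every step in one pass.
result = list(pipeline)
rows_read_after_action = len(read_log)

print(rows_read_before_action, rows_read_after_action, result)
```

No row is read while the pipeline is being assembled; the source is only touched once the final `list()` call asks for results, which mirrors how Spark defers work until an action runs.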

Day 3 – PySpark Data Operations

Compared PySpark and Pandas for large-scale data processing
Performed joins (inner, left, right, outer) on Spark DataFrames
Used window functions for running totals and rankings
Understood window functions vs groupBy aggregations
Learned when and why to use (or avoid) User-Defined Functions (UDFs)
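
The difference between a groupBy aggregation and a window function can be shown on a tiny in-memory dataset. This is a plain-Python sketch with invented sales rows, not PySpark code; in PySpark the two cases roughly correspond to `groupBy(...).agg(...)` and an aggregate applied `.over(...)` a `Window.partitionBy(...).orderBy(...)` specification.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical sales rows, deliberately not sorted.
sales = [
    {"store": "A", "day": 1, "amount": 10},
    {"store": "A", "day": 2, "amount": 30},
    {"store": "B", "day": 1, "amount": 5},
    {"store": "A", "day": 3, "amount": 20},
    {"store": "B", "day": 2, "amount": 15},
]


def group_by_total(rows):
    """groupBy-style aggregation: collapses to one output value per store."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["store"]] += row["amount"]
    return dict(totals)


def with_running_total(rows):
    """Window-style calculation: every input row is kept and gains a column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["store"]].append(row)  # "partition by store"
    out = []
    for store in sorted(partitions):
        ordered = sorted(partitions[store], key=lambda r: r["day"])  # "order by day"
        running = accumulate(r["amount"] for r in ordered)
        for row, total in zip(ordered, running):
            out.append({**row, "running_total": total})
    return out


totals = group_by_total(sales)
windowed = with_running_total(sales)

print(totals)
for row in windowed:
    print(row)
```

The aggregation returns two values (one per store), while the window version returns all five rows, each annotated with its running total.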

Day 4 – Delta Lake Fundamentals

Learned what Delta Lake is and how it adds reliability on top of Parquet
Understood ACID transactions and their role in safe concurrent reads/writes
Clearly distinguished ACID Consistency from Schema Enforcement
Learned how schema enforcement prevents invalid data from entering tables
Created Delta tables using the managed-table approach
Observed with examples how schema enforcement works
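
Schema enforcement can be pictured as a check that runs before any data is written. The toy class below is an assumption-laden illustration in plain Python (`ToyTable` and `SchemaMismatchError` are invented names); Delta Lake itself compares the schema of the incoming DataFrame with the table schema at write time rather than inspecting rows one by one.

```python
class SchemaMismatchError(Exception):
    """Raised when a row does not match the table schema."""


class ToyTable:
    """Hypothetical in-memory table that rejects mismatched writes."""

    def __init__(self, schema):
        self.schema = schema  # column name -> expected Python type
        self.rows = []

    def append(self, new_rows):
        """Validate the whole batch first, then write all rows or none."""
        for row in new_rows:
            if set(row) != set(self.schema):
                raise SchemaMismatchError(f"columns {sorted(row)} do not match the schema")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise SchemaMismatchError(
                        f"column {column!r} expects {expected.__name__}, got {row[column]!r}"
                    )
        self.rows.extend(new_rows)


table = ToyTable({"id": int, "name": str})
table.append([{"id": 1, "name": "Asha"}])

try:
    # The second row carries a string where an integer id is expected.
    table.append([{"id": 2, "name": "Ravi"}, {"id": "three", "name": "Meera"}])
    rejected = False
except SchemaMismatchError:
    rejected = True

print(rejected, len(table.rows))
```

Because the batch is validated before anything is stored, the bad write leaves the table exactly as it was, including the valid row that arrived in the same batch.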

Day 5 – Delta Lake Advanced Operations

Learned Delta Lake time travel and table version history
Understood MERGE operations for safe upserts
Learned how OPTIMIZE reduces small files and improves read performance
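Understood how ZORDER BY, used as part of OPTIMIZE, co-locates related values in the same files to speed up queries
Learned how VACUUM removes data files that are no longer referenced by the table and are older than the retention period, which limits how far back time travel can reach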
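
How MERGE, time travel, and VACUUM interact can be sketched with a toy versioned table. This is a deliberately simplified model in plain Python: `ToyVersionedTable` is an invented class that stores a full snapshot per version and vacuums by version count, whereas Delta Lake keeps a transaction log over data files and vacuums by file retention time.

```python
class ToyVersionedTable:
    """Hypothetical table that keeps one full snapshot per version."""

    def __init__(self):
        self.versions = {0: {}}  # version number -> {key value: row}
        self.latest = 0

    def merge(self, updates, key):
        """Upsert: replace rows whose key matches, insert the rest, as one new version."""
        snapshot = dict(self.versions[self.latest])
        for row in updates:
            snapshot[row[key]] = row
        self.latest += 1
        self.versions[self.latest] = snapshot

    def read(self, version=None):
        """Read the latest version, or an older one ("time travel")."""
        version = self.latest if version is None else version
        if version not in self.versions:
            raise KeyError(f"version {version} is no longer available")
        snapshot = self.versions[version]
        return [snapshot[k] for k in sorted(snapshot)]

    def vacuum(self, keep_last):
        """Drop all but the newest `keep_last` versions."""
        cutoff = self.latest - keep_last + 1
        self.versions = {v: s for v, s in self.versions.items() if v >= cutoff}


table = ToyVersionedTable()
table.merge([{"id": 1, "qty": 5}, {"id": 2, "qty": 7}], key="id")  # version 1
table.merge([{"id": 2, "qty": 9}, {"id": 3, "qty": 1}], key="id")  # version 2

latest_rows = table.read()
version_1_rows = table.read(version=1)

table.vacuum(keep_last=1)
try:
    table.read(version=1)
    old_version_readable = True
except KeyError:
    old_version_readable = False

print(latest_rows, version_1_rows, old_version_readable)
```

The second merge updates id 2 and inserts id 3 in a single step, version 1 stays readable until the vacuum, and afterwards only the latest version can be queried.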

Day 6 – Medallion Architecture & Incremental Processing

Learned the Bronze → Silver → Gold (Medallion) architecture and responsibilities of each layer
Understood best practices for raw, cleaned, and business-ready data layers
Learned the concept of incremental processing and why it is essential at scale
Understood common incremental processing patterns (timestamp-based, ID-based, change-based)
Learned why incrementality must be applied across Bronze, Silver, and Gold layers
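
The timestamp-based pattern can be sketched as a watermark that remembers the newest record already processed. The function and column names (`process_increment`, `updated_at`) and the sample rows are assumptions for illustration; the timestamps are ISO 8601 strings in one fixed format so that string comparison matches chronological order.

```python
def process_increment(source_rows, last_watermark):
    """Return only rows newer than the watermark, plus the advanced watermark."""
    new_rows = [row for row in source_rows if row["updated_at"] > last_watermark]
    if new_rows:
        last_watermark = max(row["updated_at"] for row in new_rows)
    return new_rows, last_watermark


# Hypothetical source table.
source = [
    {"order_id": 1, "updated_at": "2024-01-01T10:00:00"},
    {"order_id": 2, "updated_at": "2024-01-01T11:00:00"},
]

# First run: an empty watermark means everything is new.
first_batch, watermark = process_increment(source, last_watermark="")

# A new record arrives; the second run picks up only that record.
source.append({"order_id": 3, "updated_at": "2024-01-02T09:00:00"})
second_batch, watermark = process_increment(source, watermark)

# Nothing changed, so a third run has no work to do.
third_batch, watermark = process_increment(source, watermark)

print(len(first_batch), len(second_batch), len(third_batch), watermark)
```

Each run touches only the rows that arrived since the previous run, which is what keeps the cost proportional to new data rather than to the full table.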

Day 7 – Databricks Jobs & Workflows

Understood the difference between Databricks notebooks and Jobs
Learned how multi-task workflows model pipelines as dependent tasks
Learned how parameters make jobs reusable without changing code
Understood scheduling for automated, reliable job execution
Learned the importance of error handling, retries, and fail-fast behaviour
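
Dependent tasks, retries, and fail-fast behaviour can be sketched with the standard library. This is not the Databricks Jobs API: `run_workflow` and the three task functions are hypothetical, and `graphlib.TopologicalSorter` (Python 3.9+) is used only to derive a valid execution order.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order; retry failures, then abort the whole run."""
    attempts = {}
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            attempts[name] = attempt
            try:
                tasks[name]()
                break  # task succeeded, move on to the next one
            except Exception as error:
                if attempt == max_retries + 1:
                    # Fail fast: downstream tasks never start.
                    raise RuntimeError(
                        f"task {name!r} failed after {attempt} attempts"
                    ) from error
    return attempts


calls = {"silver": 0}


def bronze():
    pass  # placeholder for ingestion


def silver():
    calls["silver"] += 1
    if calls["silver"] == 1:
        raise ConnectionError("transient failure on the first attempt")


def gold():
    pass  # placeholder for aggregation


tasks = {"bronze": bronze, "silver": silver, "gold": gold}
dependencies = {"silver": {"bronze"}, "gold": {"silver"}}  # task -> tasks it waits for

attempts = run_workflow(tasks, dependencies)
print(attempts)
```

The transient failure in the silver task is absorbed by one retry; a task that kept failing would raise and prevent the gold task from running on incomplete data.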

Day 8 – Data Governance & Organization

Learned the Catalog → Schema → Table hierarchy in Databricks
Understood access control using GRANT and REVOKE
Learned the importance of data lineage for debugging and impact analysis
Understood the difference between managed and external tables and when to use each

Day 9 – Analytics with Databricks SQL

Learned the role of SQL Warehouses for fast, isolated analytical workloads
Understood complex analytical queries for trends, comparisons, and rankings
Learned how dashboards present curated business insights
Understood the role of visualizations and filters for interactive analytics

Day 10 – Query Performance & Optimization

Learned how query execution plans determine how SQL queries are executed
Understood partitioning strategies for efficient data skipping
Learned how OPTIMIZE reduces small file overhead
Understood how ZORDER improves data skipping within partitions
Learned when and when not to use caching for faster query performance
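
Data skipping relies on per-file minimum and maximum statistics: a file can be ignored when the value being searched for falls outside its range. The sketch below is a simplified single-column model in plain Python with invented helper names; real Z-ordering interleaves several columns, so sorting on one column only approximates the effect.

```python
def write_files(values, rows_per_file):
    """Split values into 'files' and record min/max statistics for each."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append({"rows": chunk, "min": min(chunk), "max": max(chunk)})
    return files


def files_scanned(files, wanted):
    """Count files whose [min, max] range could contain the wanted value."""
    return sum(1 for f in files if f["min"] <= wanted <= f["max"])


# 16 hypothetical customer ids arriving in a scrambled but deterministic order.
arrival_order = [(i * 7) % 16 for i in range(16)]

# Files written as data arrives: wide, overlapping value ranges.
unclustered = write_files(arrival_order, rows_per_file=4)

# Files rewritten after clustering on the id: narrow, disjoint ranges.
clustered = write_files(sorted(arrival_order), rows_per_file=4)

scanned_unclustered = files_scanned(unclustered, wanted=5)
scanned_clustered = files_scanned(clustered, wanted=5)

print(scanned_unclustered, scanned_clustered)
```

With the same data and the same file count, clustering narrows each file's value range, so a point lookup has to open fewer files.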

Day 11 – Analytics & Experimentation

Learned how to use descriptive statistics in PySpark to understand data distributions
Understood hypothesis testing to distinguish signal from noise
Learned how to design A/B tests using control and treatment groups
Practiced real feature engineering to convert raw data into meaningful patterns
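
One common way to compare a control and a treatment group is a two-proportion z-test. The notes do not say which test was used, so this is an assumed choice, and the conversion counts are invented. The sketch uses only the standard library.

```python
from math import erfc, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null hypothesis
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Hypothetical experiment: control converts 100/1000, treatment 150/1000.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)

# Identical groups should show no effect at all.
same_z, same_p = two_proportion_z_test(100, 1000, 100, 1000)

print(round(z, 2), p_value < 0.05, same_z, same_p)
```

A small p-value suggests the observed difference is unlikely to be noise alone; identical conversion rates give a z-score of zero and a p-value of one.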

Day 12 – MLflow

Learned core MLflow components: Tracking, Models, & Model Registry
Understood experiment tracking using runs, parameters, metrics, & artifacts
Learned why model logging is essential beyond saving plain serialized .pkl files
Understood how the MLflow UI is used to compare runs & manage model lifecycle

Day 13 – Spark ML Pipeline

Trained multiple models (Linear Regression, Decision Tree, Random Forest) & compared their performance using the R² metric.
Implemented MLflow tracking for model parameters, metrics, & artifacts.
Built Spark ML Pipelines to automate feature assembly & model training in a reproducible workflow.
Learned the difference between Estimators (learn from data) & Transformers (transform data without learning).
Evaluated feature importance & practiced hyperparameter tuning for better model performance.
Compared scikit-learn & Spark ML models to select the best model for deployment.
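
The Estimator and Transformer distinction can be sketched in plain Python. The classes below are invented stand-ins rather than Spark ML classes; they mirror the convention that an Estimator's `fit()` returns a fitted model, and that the fitted model is itself a Transformer.

```python
class MeanCenterer:
    """Estimator: fit() learns a value (the mean) from the data."""

    def fit(self, values):
        mean = sum(values) / len(values)
        return MeanCentererModel(mean)


class MeanCentererModel:
    """Transformer produced by fitting: applies what was learned, learns nothing new."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        return [v - self.mean for v in values]


class Doubler:
    """Transformer with no learned state at all."""

    def transform(self, values):
        return [v * 2 for v in values]


train = [2.0, 4.0, 6.0]
new_data = [10.0, 12.0]

model = MeanCenterer().fit(train)  # learning happens here, once
centered_train = model.transform(train)
centered_new = model.transform(new_data)  # reuses the training mean
doubled = Doubler().transform(new_data)

print(model.mean, centered_train, centered_new, doubled)
```

The new data is centered with the mean learned from the training data, which is the property that makes a fitted pipeline reproducible at prediction time.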

Day 14 – AI & Analytics in Databricks

Explored Databricks AI capabilities for analytics & ML workflows.
Used Databricks Genie to convert natural language questions into SQL queries.
Queried business metrics directly from Lakehouse tables using Genie.
Explored Mosaic AI through a simple NLP sentiment analysis task.
Generated synthetic reviews for an existing Gold-layer table without creating a new dataset.
Applied a pre-trained transformer model for binary sentiment classification.
Tracked experiments, parameters, & metrics using MLflow.

Folders and files

.gitignore
01_bronze_ingestion.ipynb
02_silver_cleaning.ipynb
03_gold_aggregates.ipynb
DataBricks_tutorial.ipynb
Day_2.ipynb
Day_3.ipynb
Day_4.ipynb
Day_5.ipynb
Day_6.ipynb
Day_7.ipynb
Day_8.ipynb
Day_9.ipynb
Day_10.ipynb
Day_11.ipynb
Day_12.ipynb
Day_13.ipynb
Day_14.ipynb
LICENSE
README.md
