An end-to-end data engineering project on Databricks demonstrating a complete analytics platform - from raw API ingestion, through medallion architecture ETL, to a deployed Databricks App and AI-ready Genie space.
Prathmesh Bunde
This project demonstrates how to build a production-ready analytics platform on Databricks using real-time earthquake data from the USGS API.
Data Flow:
- USGS API → PySpark Custom Data Source fetches earthquake events in parallel (a sketch follows this list)
- Bronze Layer → Raw GeoJSON data stored in Delta format with CDF enabled
- Silver Layer → Cleansed, deduplicated, and enriched with significance categories
- Gold Layer → Analytics-ready aggregations (map data, daily summaries, regional summaries)
- Consumption → Databricks App (Streamlit 3D visualization) + AI/BI Genie (natural language queries)
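
The repo's actual implementation lives in `utils/datasource.py`; as a rough sketch, a parallel USGS reader can be written with the PySpark Python Data Source API (available in Spark 4.0+ and recent Databricks runtimes). The `days` option, the per-day partitioning, and the trimmed schema below are illustrative assumptions, not the repo's exact code:

```python
from datetime import date, timedelta

from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition


class USGSDataSource(DataSource):
    """Illustrative sketch of a batch data source for the USGS FDSN API."""

    @classmethod
    def name(cls):
        return "usgs"

    def schema(self):
        # Trimmed schema for illustration; the real source carries more fields
        return "id STRING, time BIGINT, mag DOUBLE, sig INT, place STRING"

    def reader(self, schema):
        return USGSReader(self.options)


class USGSReader(DataSourceReader):
    def __init__(self, options):
        self.days = int(options.get("days", "7"))  # hypothetical option

    def partitions(self):
        # One partition per day, so Spark fetches the date windows in parallel
        today = date.today()
        return [
            InputPartition((today - timedelta(days=i)).isoformat())
            for i in range(self.days)
        ]

    def read(self, partition):
        import requests  # imported on the executor

        start = date.fromisoformat(partition.value)
        resp = requests.get(
            "https://earthquake.usgs.gov/fdsnws/event/1/query",
            params={
                "format": "geojson",
                "starttime": start.isoformat(),
                "endtime": (start + timedelta(days=1)).isoformat(),
            },
            timeout=30,
        )
        resp.raise_for_status()
        for feature in resp.json()["features"]:
            props = feature["properties"]
            yield (
                feature["id"],
                props.get("time"),
                props.get("mag"),
                props.get("sig"),
                props.get("place"),
            )


# `spark` is the active SparkSession (predefined in Databricks notebooks).
# Register once per session, then read like any built-in format.
spark.dataSource.register(USGSDataSource)
df = spark.read.format("usgs").option("days", "3").load()
```
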
Before starting this project, you should be comfortable with:
- Databricks - Familiarity with the platform
- Python - Intermediate level (functions, classes, data structures)
- SQL - Intermediate level (JOINs, aggregations, window functions)
- Spark Basics - Understanding of DataFrames, transformations, and actions
- Delta Lake Fundamentals - ACID transactions, time travel, schema evolution
- IDE - Visual Studio Code, PyCharm, or similar
- Git - Basic version control operations
- Streamlit - Basic familiarity with building apps
| Requirement | Details |
|---|---|
| Databricks Account | Databricks Free Edition or any Databricks workspace |
| IDE | VS Code, Cursor, or PyCharm with Databricks extension |
| Python | 3.11+ |
| Package Manager | uv (recommended) or pip |
| Databricks CLI | v0.200+ (installation guide) |
```
earthquake_analytics/
├── etl/                               # ETL notebooks (medallion architecture)
│   ├── 01_raw_ingestion.ipynb         # Bronze: API → raw_events
│   ├── 02_silver_transformation.ipynb # Silver: Cleansing & enrichment
│   └── 03_gold_aggregation.ipynb      # Gold: Analytics-ready tables
├── streamlit_app/                     # Databricks App
│   └── app.py                         # 3D earthquake visualisation dashboard
├── utils/                             # Shared Python utilities
│   ├── helpers.py                     # Catalog, write, CDF, metadata functions
│   └── datasource.py                  # USGS API datasource registration
├── resources/                         # Databricks Asset Bundle configs
├── fixtures/                          # Test data
├── databricks.yml                     # Bundle configuration
└── pyproject.toml                     # Python dependencies
```
```bash
git clone https://github.com/your-username/earthquake_analytics.git
cd earthquake_analytics
```

Using uv (recommended):

```bash
uv sync --dev
```

Or using pip:

```bash
pip install -e ".[dev]"
```

Configure the Databricks CLI:

```bash
databricks configure
```

You'll be prompted for:

- Databricks Host: Your workspace URL (e.g., https://adb-1234567890.12.azuredatabricks.net)
- Personal Access Token: Generate one from User Settings → Developer → Access Tokens
macOS/Linux (zsh or bash):

```bash
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
export DATABRICKS_TOKEN="your-personal-access-token"
```

Windows (Git Bash):

```bash
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
export DATABRICKS_TOKEN="your-personal-access-token"
```

Windows (PowerShell):

```powershell
$env:DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
$env:DATABRICKS_TOKEN="your-personal-access-token"
```

Verify the connection:

```bash
databricks workspace list /
```

Update the following for your environment:
Update the workspace host URL in `databricks.yml`:

```yaml
workspace:
  host: https://your-workspace.cloud.databricks.com
```

Update the SQL Warehouse ID in `app.yml`:

```yaml
env:
  - name: 'DATABRICKS_WAREHOUSE_ID'
    value: 'your-databricks-sql-warehouse-id'
```

Development deployment:

```bash
databricks bundle deploy --target dev
```

Production deployment:

```bash
databricks bundle deploy --target prod
```

Run the ETL job:

```bash
databricks bundle run earthquake_etl_job
```

Or run the notebooks individually from the Databricks workspace.
The app is deployed automatically with the bundle. Access it from your Databricks workspace under Apps.
| Variable | Description | Default |
|---|---|---|
| `DATABRICKS_HOST` | Workspace URL | - |
| `DATABRICKS_TOKEN` | Personal access token | - |
| `DATABRICKS_WAREHOUSE_ID` | SQL Warehouse ID for the Streamlit app | - |
| `CATALOG` | Unity Catalog name | `quake_lake` |
| `SCHEMA` | Schema name | `usgs_earthquakes_dev` |
Edit the widget parameters in the ETL notebooks (a sketch of the widget setup follows this list):

- Catalog/Schema: Target location for tables
- Write Mode: `merge` (incremental) or `overwrite` (full refresh)
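
A minimal sketch of how such widgets are typically declared at the top of a Databricks notebook; the widget names and defaults below mirror the tables above but are assumptions about this repo's notebooks:

```python
# Declare notebook parameters (dbutils is predefined in Databricks notebooks)
dbutils.widgets.text("catalog", "quake_lake")
dbutils.widgets.text("schema", "usgs_earthquakes_dev")
dbutils.widgets.dropdown("write_mode", "merge", ["merge", "overwrite"])

# Read the current values for use in the ETL logic
catalog = dbutils.widgets.get("catalog")
schema = dbutils.widgets.get("schema")
write_mode = dbutils.widgets.get("write_mode")
```
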
For the Streamlit app to query the gold tables, ensure your user or service principal has the following Unity Catalog permissions:
```sql
-- Grant catalog access
GRANT USE CATALOG ON CATALOG <your-catalog> TO `<user-or-group>`;

-- Grant schema access
GRANT USE SCHEMA ON SCHEMA <your-catalog>.<your-schema> TO `<user-or-group>`;

-- Grant read access to gold tables
GRANT SELECT ON TABLE <your-catalog>.<your-schema>.gold_events_map TO `<user-or-group>`;
GRANT SELECT ON TABLE <your-catalog>.<your-schema>.gold_daily_summary TO `<user-or-group>`;
GRANT SELECT ON TABLE <your-catalog>.<your-schema>.gold_regional_summary TO `<user-or-group>`;
```

Alternatively, grant SELECT on all tables in the schema:

```sql
GRANT SELECT ON SCHEMA <your-catalog>.<your-schema> TO `<user-or-group>`;
```

The pipeline uses Delta Lake Change Data Feed (CDF) for efficient incremental processing.
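
The helper lives in `utils/helpers.py` and its internals are not reproduced here; below is a minimal sketch of how such a function could be built on top of CDF. The checkpoint-table layout (columns `source_table` and `last_version`) is an assumption for illustration:

```python
def read_incremental_or_full(spark, source_table, checkpoint_table):
    """Return (df, current_version, is_incremental) - illustrative sketch."""
    # Latest commit version of the source Delta table
    current_version = spark.sql(
        f"DESCRIBE HISTORY {source_table} LIMIT 1"
    ).first()["version"]

    # Last version this pipeline processed (assumed checkpoint schema)
    checkpoint = (
        spark.table(checkpoint_table)
        .where(f"source_table = '{source_table}'")
        .select("last_version")
        .collect()
    )

    if not checkpoint:
        # First run: no checkpoint yet, fall back to a full read
        return spark.table(source_table), current_version, False

    # Incremental run: read only rows changed since the checkpoint via CDF.
    # (A production version would also guard against an empty version range.)
    df = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", checkpoint[0]["last_version"] + 1)
        .table(source_table)
        .where("_change_type IN ('insert', 'update_postimage')")
    )
    return df, current_version, True
```

In the pipeline it is invoked like this:
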
```python
# Only reads new/changed records since the last run
df, version, is_incremental = read_incremental_or_full(
    spark, source_table, checkpoint_table
)
```

Events are classified using the USGS significance score (0-1000+) instead of magnitude, which gives more consistent categories (a PySpark sketch follows the table):
| Significance | Category | Colour |
|---|---|---|
| 600+ | Severe | 🔴 Red |
| 400-599 | Major | 🟠 Orange |
| 200-399 | Moderate | 🟡 Yellow |
| 100-199 | Minor | 🟢 Green |
| <100 | Low | 🔵 Cyan |
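
In PySpark this mapping can be expressed with chained `when` conditions; the raw column name `sig` (the USGS significance field) and the output column name below are assumptions, not necessarily the repo's exact names:

```python
from pyspark.sql import functions as F

# Toy input; in the silver notebook this would be the cleansed events table
events = spark.createDataFrame([("ev1", 650), ("ev2", 42)], "id STRING, sig INT")

# Bucket events into categories by USGS significance score
categorised = events.withColumn(
    "significance_category",
    F.when(F.col("sig") >= 600, "Severe")
    .when(F.col("sig") >= 400, "Major")
    .when(F.col("sig") >= 200, "Moderate")
    .when(F.col("sig") >= 100, "Minor")
    .otherwise("Low"),
)
```
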
Interactive globe with earthquake columns sized by significance score.
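
The real dashboard is in `streamlit_app/app.py` and is not reproduced here; a minimal sketch of a 3D column visualisation with Streamlit and pydeck could look like this (the column names mirror the gold map table's description but are assumptions):

```python
import pandas as pd
import pydeck as pdk
import streamlit as st

# In the real app this DataFrame would be queried from gold_events_map
map_df = pd.DataFrame(
    {"longitude": [142.4, -70.9], "latitude": [38.3, -33.4], "sig": [910, 640]}
)

# One 3D column per event, height driven by the significance score
layer = pdk.Layer(
    "ColumnLayer",
    data=map_df,
    get_position=["longitude", "latitude"],
    get_elevation="sig",
    elevation_scale=1000,
    radius=50000,
    get_fill_color=[230, 57, 70, 160],
    pickable=True,
)

st.pydeck_chart(
    pdk.Deck(
        layers=[layer],
        initial_view_state=pdk.ViewState(latitude=0, longitude=0, zoom=1),
    )
)
```
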
Once the gold tables are populated, create a Genie space in Databricks:
- Navigate to AI/BI → Genie Spaces
- Create a new space and connect to your gold tables
- Ask natural language questions like:
- "What regions had the most high-significance earthquakes last month?"
- "Show me the trend of daily earthquake counts"
- "Which day had the highest maximum significance?"
