Skip to content

Latest commit

Β 

History

History
175 lines (126 loc) Β· 4.74 KB

File metadata and controls

175 lines (126 loc) Β· 4.74 KB

🎧 SpotiFlow: End-to-End K-pop Artist Data Pipeline from Spotify API

SpotiFlow is an end-to-end data pipeline project that extracts K-pop artist data from the Spotify API, processes it using Airflow, stores the data in both PostgreSQL and ClickHouse, and visualizes insights using Metabase. This project demonstrates a full-stack data engineering workflow including ETL orchestration, data warehousing, containerization, and dashboarding.


πŸš€ Project Architecture

              +----------------+                     
              | Spotify API    |                     
              +----------------+                     
                      |                               
                  (Extract)                          
                      |                               
              +----------------+                     
              | Airflow (DAG)  |                     
              +----------------+                     
                |     |     |                       
            Extract  Transform  Load               
                |     |     |                       
       +--------+     |     +------------+          
       |              ↓                  ↓          
+----------------+        +-----------------------+
| PostgreSQL     |        | ClickHouse            |
| (raw data)     |        | (for analytics)       |
+----------------+        +-----------------------+
                                     |                
                               +-----------+          
                               | Metabase  |          
                               +-----------+          

πŸ“¦ Tech Stack

Domain Stack
Language Python 3.11
Data Source Spotify Web API
Workflow Orchestration Apache Airflow
Storage PostgreSQL (OLTP), ClickHouse (OLAP)
Visualization Metabase
Infrastructure Docker, Docker Compose
Others requests, dotenv, psycopg2, clickhouse-driver

πŸ”§ Setup Instructions

1. Clone the repository

git clone https://github.com/your-username/spotiflow.git
cd spotiflow

2. Set up environment variables

Create a .env file in the project root with the following contents:

SPOTIFY_CLIENT_ID=your_spotify_client_id
SPOTIFY_CLIENT_SECRET=your_spotify_client_secret

3. Start the containers

docker-compose up --build -d

This will spin up:

  • Airflow (Webserver + Scheduler)
  • PostgreSQL
  • ClickHouse
  • Metabase

Note: For Apple Silicon (M1/M2), platform: linux/amd64 is used to ensure compatibility.

4. Initialize Airflow

Inside the container:

docker exec -it airflow-webserver bash
airflow db init
airflow users create --username admin --password admin --firstname Chaerin --lastname Lee --role Admin --email test@example.com

Then visit: http://localhost:8080


πŸ’  Airflow DAGs

  • spotify_etl_dag.py: Main DAG that runs three tasks:
    • extract_task: Calls Spotify API and fetches K-pop artist data
    • load_task: Saves to PostgreSQL and migrates to ClickHouse
    • clickhouse_task: (Optional) Transformations or aggregations

You can trigger the DAG manually or set it to run on a schedule (@daily).


πŸ“Š Metabase Dashboard

Once Metabase is up:

  1. Visit http://localhost:3000

  2. Add a new database β†’ Select ClickHouse

  3. Set up connection:

    • Host: clickhouse
    • Port: 8123
    • Username: default
    • Password: (leave blank)
    • Database: default
  4. Browse Data β†’ spotify_artists table

  5. Create charts:

    • Popularity vs Followers (scatter)
    • Genre-wise average popularity
    • Trends over fetched_at (time-series)

πŸ“ˆ Sample Queries

-- Top 10 K-pop artists by followers
SELECT name, followers, popularity
FROM spotify_artists
ORDER BY followers DESC
LIMIT 10;

-- Average popularity per genre
SELECT genres, avg(popularity) AS avg_popularity
FROM spotify_artists
GROUP BY genres
ORDER BY avg_popularity DESC;

🧠 Lessons Learned

  • Mastered Spotify API authentication and data extraction
  • Gained hands-on experience with Airflow DAGs and scheduling
  • Handled ClickHouse integration and batch insert optimizations
  • Troubleshot Docker container compatibility issues on ARM architecture
  • Learned to visualize data effectively using Metabase

✨ Future Improvements

  • Support more Spotify entities (Albums, Tracks, Playlists)
  • Enable full CI/CD workflow for Airflow DAG deployments
  • Add real-time data streaming (e.g., via Kafka)
  • Improve dashboards with filters, drill-downs, and dynamic widgets

πŸ“„ License

MIT License


πŸ™‹β€οΈ Author

Chaerin Lee