Data Engineering · Cloud Pipeline

Spotify Analytics Pipeline

End-to-end automated data pipeline — Spotify API to Power BI, orchestrated on AWS

Spotify API · Airflow DAG · AWS EC2 · AWS S3 · Power BI

Daily

Pipeline Schedule

Airflow DAG auto-refresh

AWS

Cloud Infrastructure

EC2 orchestration + S3 storage

6+

Dashboard Visuals

Power BI + Deneb custom charts

REST

Spotify Web API

OAuth2 · track + artist data

What It Does

A fully automated, cloud-hosted data pipeline that extracts Spotify listening data on a daily schedule, stores it in AWS S3, and presents track popularity, audio features, and artist trends in interactive Power BI dashboards. No manual intervention is required after deployment: the Airflow DAG handles scheduling, retries, and logging end to end.

The project demonstrates a production-grade data engineering workflow — not just a notebook analysis — by combining REST API ingestion, workflow orchestration, cloud object storage, and business intelligence visualization into a single automated system.

Why These Choices

  • Apache Airflow for orchestration: Provides DAG-based dependency management, retry logic, and a visual UI for monitoring pipeline runs — far more robust than a cron job for a multi-step workflow.

  • AWS EC2 as the compute host: Runs the Airflow scheduler continuously in the cloud, decoupled from any local machine. EC2 gives full control over the environment without the complexity of managed orchestration services.

  • AWS S3 as the data lake: Object storage scales well beyond this project's needs, costs close to nothing at this volume, and holds plain CSV snapshots that Power BI can read, so there is no database server to manage.

  • Power BI + Deneb for visualization: Deneb (Vega-Lite inside Power BI) enables custom chart grammar beyond the standard visuals — used here for audio feature radar charts and dynamic track image displays.

Architecture

Layer 1 — Ingestion

Spotify Web API

OAuth2 · REST

get_data.py

tracks · artists · audio features

CSV / JSON

local staging
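The OAuth2 step of the ingestion layer can be sketched as follows. This is a minimal illustration, not the contents of get_data.py: it assumes the Client Credentials flow (the page does not say which flow the project uses, and user-specific endpoints such as listening history would need the Authorization Code flow instead), and the function name `build_token_request` is hypothetical. It only builds the request, so it can be inspected without network access.

```python
import base64
import urllib.parse
import urllib.request

# Spotify's token endpoint for the OAuth2 flows.
TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) a Client Credentials token request."""
    # The client id and secret go into a Basic auth header as base64("id:secret").
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    basic = base64.b64encode(credentials).decode("ascii")
    # The grant type is sent as a form-encoded body.
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` returns a JSON body whose `access_token` field is then sent as a Bearer token on later API calls.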

Layer 2 — Orchestration

dag.py

Airflow DAG definition

Daily Schedule

retry · logging · alerts

AWS EC2

Airflow scheduler host

DAG tasks: fetch_token · pull_tracks · pull_audio_features · upload_to_s3
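A DAG with these four tasks might look roughly like the sketch below. It is not the project's dag.py: the DAG id, start date, retry settings, and the strictly sequential ordering are assumptions, and `_placeholder` stands in for the real task logic. Airflow is imported inside `build_dag` only so the constants can be read without an Airflow install; the `schedule` argument requires Airflow 2.4 or later (older releases use `schedule_interval`).

```python
from datetime import datetime, timedelta

# Task names as listed in the architecture; run order is assumed to be sequential.
TASK_ORDER = ["fetch_token", "pull_tracks", "pull_audio_features", "upload_to_s3"]

# Illustrative retry settings, not the project's actual values.
DEFAULT_ARGS = {"retries": 2, "retry_delay": timedelta(minutes=5)}


def _placeholder(step_name: str) -> None:
    """Hypothetical stand-in for the real work done by each task."""
    print(f"running {step_name}")


def build_dag():
    """Create the daily DAG; requires Apache Airflow to be installed."""
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    dag = DAG(
        dag_id="spotify_daily",           # hypothetical name
        schedule="@daily",                # one run per day
        start_date=datetime(2023, 1, 1),  # assumed start date
        catchup=False,                    # do not backfill missed days
        default_args=DEFAULT_ARGS,
    )
    previous = None
    for task_id in TASK_ORDER:
        task = PythonOperator(
            task_id=task_id,
            python_callable=_placeholder,
            op_kwargs={"step_name": task_id},
            dag=dag,
        )
        if previous is not None:
            previous >> task  # each task waits for the one before it
        previous = task
    return dag
```

For the scheduler to discover the DAG, the file in the DAGs folder would assign the result at module level, for example `dag = build_dag()`.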

Layer 3 — Storage

AWS S3 Bucket

partitioned by date

tracks/

CSV snapshots

artists/

CSV snapshots
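Date partitioning usually comes down to how object keys are named. The helper below is a sketch of one possible layout; the `year/month/day` key scheme and the function name `snapshot_key` are assumptions, since the page does not show the bucket's actual structure.

```python
from datetime import date


def snapshot_key(dataset: str, run_date: date, extension: str = "csv") -> str:
    """Return a date-partitioned object key for one daily snapshot."""
    # Only the two prefixes shown in the storage layer are accepted.
    if dataset not in {"tracks", "artists"}:
        raise ValueError(f"unknown dataset: {dataset!r}")
    # Example result: tracks/2023/12/22/tracks.csv
    return f"{dataset}/{run_date:%Y/%m/%d}/{dataset}.{extension}"
```

With boto3, such a key would typically be passed as the `Key` argument of the S3 client's `upload_file(Filename, Bucket, Key)` call. Keeping one object per dataset per day makes it easy to reload or delete a single run.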

Layer 4 — Visualization

Power BI Desktop

CSV snapshots read from S3

Deneb Visuals

Vega-Lite custom charts

Softify_Ec2.pbix

published report

Python 3.x · Apache Airflow · AWS EC2 · AWS S3 · Power BI · Deneb / Vega-Lite · Spotify Web API · OAuth2 · pandas

Live Dashboard

Embedded Power BI report — fully interactive

Data updated 12/22/23 · Power BI

Dashboard Highlights

Track Popularity Trends

Line charts tracking popularity score over time for top tracks, updated each daily run.

Audio Feature Radar

Deneb Vega-Lite radar chart comparing danceability, energy, valence, acousticness, and speechiness per track.

Artist Leaderboard

Ranked table of artists by total streams and follower count, with dynamic album art loaded from the Spotify CDN.

Listening Heatmap

Day-of-week vs hour matrix showing when tracks were played, surfacing listening behaviour patterns.

Genre Distribution

Donut chart breaking down the genre mix of top-played artists, refreshed on each pipeline run.

Pipeline Run Log

A table of recent DAG execution timestamps and row counts ingested, giving full pipeline observability inside the report.

Skills Demonstrated

REST API Integration

OAuth2 token management, paginated endpoint consumption, rate-limit handling against the Spotify Web API.
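Pagination and rate-limit handling can be sketched as below. Spotify's paging objects carry an `items` list and a `next` URL that is null on the last page, and a rate-limited call returns HTTP 429 with a `Retry-After` header. Everything else here is an assumption for illustration: the `RateLimited` exception, the `iter_items` helper, and the injected `fetch` callable (which would wrap the real HTTP call) are hypothetical names, not the project's code.

```python
import time
from typing import Callable, Iterator, Mapping, Optional


class RateLimited(Exception):
    """Raised by the fetch callable on HTTP 429; carries the Retry-After value."""

    def __init__(self, retry_after: float) -> None:
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def iter_items(
    fetch: Callable[[str], Mapping],
    first_url: str,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield every item from a paged endpoint, following `next` until it is null."""
    url: Optional[str] = first_url
    while url:
        for attempt in range(max_retries + 1):
            try:
                page = fetch(url)
                break
            except RateLimited as exc:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
                sleep(exc.retry_after)  # wait as long as the API asked
        yield from page["items"]
        url = page.get("next")
```

Injecting `fetch` and `sleep` keeps the loop testable without network access or real delays.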

Workflow Orchestration

DAG authoring in Apache Airflow with task dependencies, SLA monitoring, and automatic retries on failure.

Cloud Infrastructure

EC2 instance provisioning, IAM role configuration, S3 bucket policies, and cost-conscious instance sizing.
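As an illustration of the IAM side, the sketch below builds a policy document that lets the EC2 instance role write objects only under the snapshot prefixes. The bucket name, the `Sid`, the choice of `s3:PutObject` alone, and the helper name `upload_only_policy` are assumptions; the project's real policy is not shown on this page.

```python
import json


def upload_only_policy(bucket: str, prefixes: tuple = ("tracks", "artists")) -> dict:
    """Build an IAM policy document allowing uploads under the given prefixes only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WriteSnapshots",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # One ARN per prefix, so nothing else in the bucket is writable.
                "Resource": [f"arn:aws:s3:::{bucket}/{p}/*" for p in prefixes],
            }
        ],
    }


# Hypothetical bucket name, used only to show the resulting JSON.
print(json.dumps(upload_only_policy("example-spotify-bucket"), indent=2))
```

Attaching a policy like this to an instance role avoids storing long-lived AWS keys on the Airflow host.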

Business Intelligence

Power BI data modelling, DAX measures, and advanced Deneb (Vega-Lite) custom visual grammar.

Python · Apache Airflow · AWS EC2 · AWS S3 · Power BI · Deneb · Spotify API