Data Engineering · Cloud Pipeline
Spotify Analytics Pipeline
End-to-end automated data pipeline — Spotify API to Power BI, orchestrated on AWS
Daily
Pipeline Schedule
Airflow DAG auto-refresh
AWS
Cloud Infrastructure
EC2 orchestration + S3 storage
6+
Dashboard Visuals
Power BI + Deneb custom charts
REST
Spotify Web API
OAuth2 · track + artist data
What It Does
A fully automated, cloud-hosted data pipeline that extracts Spotify listening data on a daily schedule, stores it on AWS S3, and renders track popularity, audio features, and artist trends inside interactive Power BI dashboards. No manual intervention required after deployment — the Airflow DAG handles scheduling, retries, and logging end-to-end.
The project demonstrates a production-grade data engineering workflow — not just a notebook analysis — by combining REST API ingestion, workflow orchestration, cloud object storage, and business intelligence visualization into a single automated system.
Why These Choices
Apache Airflow for orchestration: Provides DAG-based dependency management, retry logic, and a visual UI for monitoring pipeline runs — far more robust than a cron job for a multi-step workflow.
AWS EC2 as the compute host: Runs the Airflow scheduler continuously in the cloud, decoupled from any local machine. EC2 gives full control over the environment without the complexity of managed orchestration services.
AWS S3 as the data lake: Object storage scales without capacity planning, costs close to nothing at this volume, and can be read by Power BI as a data source, so there is no database server to manage.
Power BI + Deneb for visualization: Deneb (Vega-Lite inside Power BI) enables custom chart grammar beyond the standard visuals — used here for audio feature radar charts and dynamic track image displays.
Architecture
Layer 1 — Ingestion
Spotify Web API
OAuth2 · REST
get_data.py
tracks · artists · audio features
CSV / JSON
local staging
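The ingestion layer above can be sketched roughly as follows. This is a minimal illustration and not the contents of get_data.py: it assumes the OAuth2 client-credentials flow (the page only says "OAuth2"), and the helper names `build_token_request`, `http_get_json`, and `fetch_all_items` are hypothetical. It shows how a token request is formed and how a paginated endpoint can be walked while respecting HTTP 429 responses.

```python
import base64
import json
import time
import urllib.error
import urllib.parse
import urllib.request

# Spotify Accounts token endpoint.
TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_token_request(client_id, client_secret):
    """Build (but do not send) a client-credentials token request."""
    creds = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    return urllib.request.Request(
        TOKEN_URL,
        data=body,  # supplying a body makes this a POST
        headers={
            "Authorization": f"Basic {creds}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


def http_get_json(url, token):
    """GET a URL with a bearer token; return (status, lowercased headers, body)."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(req) as resp:
            headers = {k.lower(): v for k, v in resp.headers.items()}
            return resp.status, headers, json.load(resp)
    except urllib.error.HTTPError as err:
        headers = {k.lower(): v for k, v in err.headers.items()}
        return err.code, headers, {}


def fetch_all_items(first_url, token, get=http_get_json, sleep=time.sleep,
                    max_retries=3):
    """Follow the `next` links of a paging object, backing off on HTTP 429."""
    items, url, retries = [], first_url, 0
    while url:
        status, headers, page = get(url, token)
        if status == 429 and retries < max_retries:
            # Wait for the number of seconds the API asks for, then retry.
            sleep(int(headers.get("retry-after", "1")))
            retries += 1
            continue
        if status != 200:
            raise RuntimeError(f"GET {url} failed with HTTP {status}")
        retries = 0
        items.extend(page.get("items", []))
        url = page.get("next")  # None on the last page ends the loop
    return items
```

The `get` and `sleep` parameters are injectable so the pagination and back-off logic can be exercised without network access.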
Layer 2 — Orchestration
dag.py
Airflow DAG definition
Daily Schedule
retry · logging · alerts
AWS EC2
Airflow scheduler host
DAG tasks: fetch_token → pull_tracks → pull_audio_features → upload_to_s3
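A DAG with this task chain might look something like the sketch below. It is not the project's dag.py: the `dag_id`, start date, and retry values are assumptions, the `build_dag` helper is hypothetical, and the `schedule` argument assumes Airflow 2.4 or later (older 2.x releases use `schedule_interval`). Only the four task names and the daily schedule come from the description above.

```python
from datetime import datetime, timedelta

# Task order taken from the DAG description above.
TASK_ORDER = ["fetch_token", "pull_tracks", "pull_audio_features", "upload_to_s3"]

# Assumed retry policy; the real values are not stated in the source.
DEFAULT_ARGS = {"retries": 2, "retry_delay": timedelta(minutes=5)}


def build_dag(callables, dag_id="spotify_daily"):
    """Wire one PythonOperator per task name into a linear chain.

    `callables` maps each task name in TASK_ORDER to a Python function.
    """
    # Imported lazily so this module can be read without Airflow installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id=dag_id,                    # hypothetical name
        start_date=datetime(2023, 1, 1),  # assumed start date
        schedule="@daily",
        catchup=False,                    # do not backfill missed days
        default_args=DEFAULT_ARGS,
    ) as dag:
        tasks = [
            PythonOperator(task_id=name, python_callable=callables[name])
            for name in TASK_ORDER
        ]
        # fetch_token >> pull_tracks >> pull_audio_features >> upload_to_s3
        for upstream, downstream in zip(tasks, tasks[1:]):
            upstream >> downstream
    return dag
```

For the scheduler to discover the DAG, the DAG file would assign the result at module level, for example `dag = build_dag({...})` with one function per task name.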
Layer 3 — Storage
AWS S3 Bucket
partitioned by date
tracks/
CSV snapshots
artists/
CSV snapshots
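One way to produce date-partitioned snapshots under the tracks/ and artists/ prefixes is sketched below. The `dt=YYYY-MM-DD` key layout and the function names are assumptions for illustration; the page states only that the bucket is partitioned by date. The upload step uses boto3's `upload_file`, imported lazily so the key logic can be tried without AWS credentials.

```python
from datetime import date
from pathlib import Path

# Prefixes named in the storage layer above.
DATASETS = {"tracks", "artists"}


def snapshot_key(dataset, run_date, filename=None):
    """Return an S3 object key for one dataset on one run date."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    name = filename or f"{dataset}.csv"
    # Assumed layout: <dataset>/dt=<ISO date>/<file name>
    return f"{dataset}/dt={run_date.isoformat()}/{name}"


def upload_snapshot(local_path, bucket, dataset, run_date):
    """Upload a staged CSV to its date partition and return the key used."""
    import boto3  # lazy import; requires AWS credentials at call time

    key = snapshot_key(dataset, run_date, Path(local_path).name)
    boto3.client("s3").upload_file(str(local_path), bucket, key)
    return key
```

Keeping each run under its own date prefix means a rerun for a given day overwrites only that day's snapshot and leaves earlier ones intact.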
Layer 4 — Visualization
Power BI Desktop
S3 data source connector
Deneb Visuals
Vega-Lite custom charts
Softify_Ec2.pbix
published report
Live Dashboard
Embedded Power BI report — fully interactive
Data updated 12/22/23 · Power BI
Dashboard Highlights
Track Popularity Trends
Line charts tracking popularity score over time for top tracks, updated each daily run.
Audio Feature Radar
Deneb Vega-Lite radar chart comparing danceability, energy, valence, acousticness, and speechiness per track.
Artist Leaderboard
Ranked table of artists by total streams and follower count, with dynamic album art loaded from the Spotify CDN.
Listening Heatmap
Day-of-week vs hour matrix showing when tracks were played, surfacing listening behaviour patterns.
Genre Distribution
Donut chart breaking down the genre mix of top-played artists, refreshed on each pipeline run.
Pipeline Run Log
A table of recent DAG execution timestamps and row counts ingested, giving full pipeline observability inside the report.
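The day-of-week by hour matrix behind the Listening Heatmap can be illustrated with a short sketch. It assumes play timestamps are available as ISO 8601 strings in UTC (for example a `played_at` field); the function name `listening_matrix` is hypothetical, and the actual report may compute this in Power BI rather than in Python.

```python
from collections import Counter
from datetime import datetime


def listening_matrix(played_at):
    """Count plays per (weekday, hour); rows are Monday..Sunday, columns 0..23."""
    counts = Counter()
    for stamp in played_at:
        # Accept a trailing "Z" by converting it to an explicit UTC offset.
        ts = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        counts[(ts.weekday(), ts.hour)] += 1
    return [[counts[(day, hour)] for hour in range(24)] for day in range(7)]
```

Timestamps are bucketed as given, so converting to the listener's local time zone beforehand would be needed for the hours to reflect local listening habits.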
Skills Demonstrated
REST API Integration
OAuth2 token management, paginated endpoint consumption, rate-limit handling against the Spotify Web API.
Workflow Orchestration
DAG authoring in Apache Airflow with task dependencies, SLA monitoring, and automatic retries on failure.
Cloud Infrastructure
EC2 instance provisioning, IAM role configuration, S3 bucket policies, and cost-conscious instance sizing.
Business Intelligence
Power BI data modelling, DAX measures, and advanced Deneb (Vega-Lite) custom visual grammar.
