Building a YouTube Channel Analytics Dashboard with Airflow, Spark, and Grafana
Ambuso Dismas

Introduction

In the rapidly evolving creator economy, data-driven insights are essential for YouTube content creators to refine their strategies, maximize audience engagement, and drive channel growth. While YouTube Studio offers built-in analytics, its capabilities are often limited in terms of flexibility, depth, and customization.

To overcome these constraints, I developed an end-to-end automated data engineering pipeline orchestrated with Apache Airflow. This system extracts raw data from the YouTube Data API, transforms and enriches it using Apache Spark, stores the results in a PostgreSQL database, and presents interactive visualizations through Grafana. Apache Airflow manages and schedules these workflows as DAGs (Directed Acyclic Graphs), ensuring reliable, repeatable, and scalable data processing.

The result is a dynamic analytics dashboard that empowers creators with full visibility into their channel’s performance—enabling smarter decisions, faster.

Problem Statement

YouTube creators struggle with analyzing their channel performance because:

  1. YouTube’s built-in analytics are limited and inflexible
  2. Manually collecting and managing data is slow, error-prone, and doesn’t scale
  3. There’s no easy, centralized way to track key metrics over time
  4. Figuring out the best times to publish and what content works best requires complex analysis
  5. There is no automated, reliable system that regularly pulls data, organizes it, and turns it into actionable insights

Solution Architecture

To tackle this, I built a full ETL pipeline with visualization, including:

  1. Extract: A Python script fetches video and channel data from the YouTube API
  2. Transform: Apache Spark cleans and enriches the data with key metrics
  3. Load: The data is saved into a PostgreSQL database
  4. Orchestrate: Apache Airflow automates and schedules the entire workflow
  5. Visualize: Grafana dashboards display interactive insights

All components run smoothly on an Azure VM, offering an affordable, scalable solution for creators and small teams.
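Both the extraction and transformation scripts that follow read their configuration from environment variables loaded with python-dotenv. Here's a rough sketch of the kind of .env file they assume; the variable names come straight from the scripts, while the values are placeholders you would swap for your own:

YOUTUBE_API_KEY=your_api_key_here
YOUTUBE_CHANNEL_ID=UCxxxxxxxxxxxxxxxxxxxxxx
OUTPUT_FILENAME=/home/main/youtube/ambuso.json
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_password_here
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=analytics
POSTGRES_SSLMODE=require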

Implementation Details

Data Extraction
The extraction process uses the YouTube Data API v3 to automatically gather channel and video information. The script handles all the API authentication, fetches the channel’s uploaded videos playlist, and pulls detailed metadata for each video — like views, likes, comments, and publish dates.

This setup ensures up-to-date data is collected efficiently without manual intervention.

Here’s a snippet of the extraction script:


import os
from googleapiclient.discovery import build
from datetime import datetime
import json
import sys
from dotenv import load_dotenv

load_dotenv()

YOUTUBE_API_KEY = os.getenv("YOUTUBE_API_KEY")
CHANNEL_ID = os.getenv("YOUTUBE_CHANNEL_ID")
OUTPUT_FILENAME = os.getenv("OUTPUT_FILENAME")


def validate_config():
    if not YOUTUBE_API_KEY:
        print("Error: YOUTUBE_API_KEY environment variable not set.", file=sys.stderr)
        sys.exit(1)
    if not CHANNEL_ID:
        print("Error: YOUTUBE_CHANNEL_ID environment variable not set.", file=sys.stderr)
        sys.exit(1)
    if not OUTPUT_FILENAME:
        print("Error: OUTPUT_FILENAME environment variable not set.", file=sys.stderr)
        sys.exit(1)


def build_youtube_client():
    return build("youtube", "v3", developerKey=YOUTUBE_API_KEY)


def get_playlist_id(youtube, channel_id):
    try:
        request = youtube.channels().list(part="contentDetails", id=channel_id)
        response = request.execute()
        # Return the playlist ID containing all uploaded videos
        return response['items'][0]['contentDetails']['relatedPlaylists']['uploads'] if response['items'] else None
    except Exception as e:
        print(f"Error fetching channel details: {e}", file=sys.stderr)
        return None


def get_video_ids(youtube, playlist_id):
    video_ids, next_page_token = [], None
    while True:
        try:
            # Fetch playlist items in batches of 50 (max allowed)
            request = youtube.playlistItems().list(
                part="contentDetails", 
                playlistId=playlist_id, 
                maxResults=50, 
                pageToken=next_page_token
            )
            response = request.execute()
            # Extract video IDs from the current page
            video_ids.extend([item['contentDetails']['videoId'] for item in response['items']])
            # Check if there is a next page, else exit loop
            next_page_token = response.get('nextPageToken')
            if not next_page_token:
                break
        except Exception as e:
            print(f"Error fetching playlist items: {e}", file=sys.stderr)
            break
    return video_ids


def get_video_details(youtube, video_ids):
    all_video_data = []
    # Process video IDs in chunks of 50 due to API limits
    for i in range(0, len(video_ids), 50):
        video_chunk = ",".join(video_ids[i:i+50])
        try:
            request = youtube.videos().list(part="snippet,statistics", id=video_chunk)
            response = request.execute()
            # Timestamp for when data was extracted
            timestamp = datetime.utcnow().isoformat() + 'Z'
            # Extract relevant video info and stats
            for item in response['items']:
                video_data = {
                    'videoId': item['id'],
                    'title': item['snippet']['title'],
                    'publishedAt': item['snippet']['publishedAt'],
                    'viewCount': item['statistics'].get('viewCount'),
                    'likeCount': item['statistics'].get('likeCount'),
                    'commentCount': item['statistics'].get('commentCount'),
                    'extractionTimestamp': timestamp
                }
                all_video_data.append(video_data)
        except Exception as e:
            print(f"Error fetching video details: {e}", file=sys.stderr)
    return all_video_data


if __name__ == "__main__":
    print("Starting YouTube data extraction...")

    # Step 1: Check if environment variables are set
    validate_config()

    # Step 2: Create YouTube API client
    youtube = build_youtube_client()

    # Step 3: Get playlist ID for channel uploads
    playlist_id = get_playlist_id(youtube, CHANNEL_ID)
    if not playlist_id:
        print(f"Could not retrieve playlist ID for channel {CHANNEL_ID}.", file=sys.stderr)
        sys.exit(1)

    # Step 4: Retrieve all video IDs from the uploads playlist
    video_ids = get_video_ids(youtube, playlist_id)
    if not video_ids:
        print(f"No videos found for channel {CHANNEL_ID}.", file=sys.stderr)
        sys.exit(1)

    # Step 5: Get detailed video data for all video IDs
    video_data = get_video_details(youtube, video_ids)
    if not video_data:
        print("No video data fetched.", file=sys.stderr)
        sys.exit(1)

    # Step 6: Write the collected video data to the output JSON file
    try:
        with open(OUTPUT_FILENAME, 'w') as f:
            json.dump(video_data, f, indent=4)
        print(f"Data saved to {OUTPUT_FILENAME}")
    except IOError as e:
        print(f"Error writing data to {OUTPUT_FILENAME}: {e}", file=sys.stderr)
        sys.exit(1)

    print("Extraction process finished.")
    sys.exit(0)

Data Extraction & Transformation
Extraction Process
The data extraction script efficiently pulls video-level data from the YouTube Data API v3. It follows these core steps:

  1. Validates required environment variables and the API key
  2. Retrieves the channel’s uploads playlist ID
  3. Iteratively fetches all video IDs, handling pagination
  4. Collects detailed video metadata in batches (50 per request, respecting API quotas)
  5. Stores the raw JSON data for transformation

Transformation & Enrichment

Using Apache Spark, the raw data is cleaned and enriched for analysis. The transformation process:

  1. Converts date strings into proper timestamps
  2. Derives time-based dimensions (year, month, day, hour)
  3. Adds calculated metrics such as engagement rates
  4. Categorizes videos by performance tiers (normal, high, viral)

This ensures the dataset is structured, analytics-ready, and suited to deeper analysis.

import os
import psycopg2
from pyspark.sql import SparkSession
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

POSTGRES_USER = os.getenv("POSTGRES_USER")
POSTGRES_PASSWORD = os.getenv("POSTGRES_PASSWORD")
POSTGRES_HOST = os.getenv("POSTGRES_HOST")
POSTGRES_PORT = os.getenv("POSTGRES_PORT")
POSTGRES_DB = os.getenv("POSTGRES_DB")
POSTGRES_SSLMODE = os.getenv("POSTGRES_SSLMODE")

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("YouTube Data Transformation") \
    .config("spark.jars", "/home/main/downloads/postgresql-42.7.3.jar") \
    .getOrCreate()

# Load JSON data
df_raw = spark.read.option("multiline", "true").json("/home/main/youtube/ambuso.json")
df_raw.createOrReplaceTempView("youtube_videos_raw")

# Transform raw data
sql_query = """
    SELECT
        videoId,
        title,
        publishedAt,
        CAST(viewCount AS INT) AS viewCount,
        CAST(likeCount AS INT) AS likeCount,
        CAST(commentCount AS INT) AS commentCount,
        extractionTimestamp
    FROM youtube_videos_raw
    WHERE viewCount IS NOT NULL
    ORDER BY viewCount DESC
"""
df_transformed = spark.sql(sql_query)
df_transformed.createOrReplaceTempView("youtube_transformed")

# Enrich data
enriched_query = """
    SELECT
        videoId,
        title,
        publishedAt,
        viewCount,
        likeCount,
        commentCount,
        extractionTimestamp,
        YEAR(TO_TIMESTAMP(publishedAt)) AS year,
        MONTH(TO_TIMESTAMP(publishedAt)) AS month,
        DAY(TO_TIMESTAMP(publishedAt)) AS day_of_month,
        DATE_FORMAT(TO_TIMESTAMP(publishedAt), 'EEEE') AS day_of_week,
        HOUR(TO_TIMESTAMP(publishedAt)) AS hour,
        WEEKOFYEAR(TO_TIMESTAMP(publishedAt)) AS week,
        CASE
            WHEN viewCount > 100000 THEN 'viral'
            WHEN viewCount > 5000 THEN 'high'
            ELSE 'normal'
        END AS performance_class
    FROM youtube_transformed
"""
df_enriched = spark.sql(enriched_query)
df_enriched.show()

# Create schema and table if needed
try:
    conn = psycopg2.connect(
        dbname=POSTGRES_DB,
        user=POSTGRES_USER,
        password=POSTGRES_PASSWORD,
        host=POSTGRES_HOST,
        port=POSTGRES_PORT,
        sslmode=POSTGRES_SSLMODE
    )
    conn.autocommit = True
    cursor = conn.cursor()

    cursor.execute("CREATE SCHEMA IF NOT EXISTS dataengineering;")
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS dataengineering.youtube_videos_enriched (
            videoId TEXT,
            title TEXT,
            publishedAt TIMESTAMP,
            viewCount INT,
            likeCount INT,
            commentCount INT,
            extractionTimestamp TIMESTAMP,
            year INT,
            month INT,
            day_of_month INT,
            day_of_week TEXT,
            hour INT,
            week INT,
            performance_class TEXT
        );
    """)
    cursor.close()
    conn.close()
    print("Schema and table checked or created.")
except Exception as e:
    print("Failed to create schema or table:", e)

# Write to PostgreSQL
jdbc_url = f"jdbc:postgresql://{POSTGRES_HOST}:{POSTGRES_PORT}/{POSTGRES_DB}"
connection_properties = {
    "user": POSTGRES_USER,
    "password": POSTGRES_PASSWORD,
    "driver": "org.postgresql.Driver"
}

df_enriched.write \
    .jdbc(
        url=jdbc_url,
        table="dataengineering.youtube_videos_enriched",
        mode="append",
        properties=connection_properties
    )

print("Enriched data successfully appended to PostgreSQL.")

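One thing to note: the enrichment query above adds time dimensions and a performance class, but the engagement-rate metric mentioned earlier isn't computed in this snippet. Here's a minimal sketch of how it could be bolted onto the same Spark step; the column name engagement_rate and the (likes + comments) / views formula are my own choices, not part of the original pipeline:

# Hypothetical extra metric: engagement rate = (likes + comments) / views
engagement_query = """
    SELECT
        *,
        ROUND(
            (COALESCE(likeCount, 0) + COALESCE(commentCount, 0)) / CAST(viewCount AS DOUBLE),
            4
        ) AS engagement_rate
    FROM youtube_transformed
    WHERE viewCount > 0
"""
df_with_engagement = spark.sql(engagement_query)
df_with_engagement.show()

If this column were persisted, the dataengineering.youtube_videos_enriched table would also need a matching numeric column.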

Data Transformation & Loading

The transformation script processes raw JSON data extracted from the YouTube API and prepares it for analytics. Key steps include:

  1. Ingests raw video metadata from JSON files
  2. Cleans the data by casting types and filtering out incomplete records
  3. Enriches the dataset with:
     • Temporal breakdowns: year, month, day, hour, week
     • Day of the week: for publishing time analysis
     • Performance classification: based on custom view count thresholds

Once enriched, the data is written into a structured PostgreSQL table for downstream visualization and querying.
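After a pipeline run, a quick query against the warehouse is an easy way to confirm the load worked. A sketch of the kind of check one might run, assuming the camelCase, quoted column names used in the dashboard queries later in this post:

SELECT
    performance_class,
    COUNT(*) AS rows_loaded,
    MAX("extractionTimestamp") AS latest_snapshot
FROM dataengineering.youtube_videos_enriched
GROUP BY performance_class;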

Workflow Orchestration with Airflow

To automate and manage the data pipeline, Apache Airflow is used for workflow orchestration. A custom DAG (Directed Acyclic Graph) defines the execution order and dependencies across tasks. Key components of the DAG include:

  • Triggering the YouTube API extraction script
  • Transforming raw data with Apache Spark
  • Loading the enriched dataset into PostgreSQL
  • Scheduling: the pipeline runs on a regular interval (e.g., daily) to keep insights fresh
  • Error handling and logging to ensure transparency and reliability

This setup enables end-to-end automation with minimal manual intervention.

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

default_args = {
    'start_date': datetime(2025, 6, 7),
    'retries': 1,
}

with DAG(
    'youtube_data_pipeline',
    default_args=default_args,
    description='A pipeline to extract and transform YouTube data',
    schedule='@daily',  # run the pipeline once a day
    catchup=False,
) as dag:

    extract_task = BashOperator(
        task_id='extract_youtube_data',
        bash_command='/home/main/youtube/env/bin/python /home/main/youtube/extract.py',
    )

    transform_task = BashOperator(
        task_id='transform_and_load',
        bash_command='/home/main/youtube/env/bin/python /home/main/youtube/transform.py',
    )

    extract_task >> transform_task


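The DAG above is intentionally minimal. The retries, retry delays, and failure notifications mentioned under error handling can be layered on through default_args and a callback; here's a rough sketch of that idea (the notify_failure hook is illustrative and not part of the original pipeline):

from datetime import datetime, timedelta


def notify_failure(context):
    # Illustrative failure hook: log the failing task; this could just as easily
    # send an email or Slack message instead of printing.
    task_instance = context["task_instance"]
    print(f"Task {task_instance.task_id} failed for run {context['ds']}")


default_args = {
    'start_date': datetime(2025, 6, 7),
    'retries': 3,                           # retry transient API/Spark failures
    'retry_delay': timedelta(minutes=10),   # wait between retry attempts
    'on_failure_callback': notify_failure,  # run the hook whenever a task fails
}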

Data Visualization with Grafana

Once the data pipeline was in place, I built an interactive Grafana dashboard to visualize key YouTube channel insights. The dashboard helps creators quickly answer essential questions like:

  1. How is the channel growing over time?
  2. Which videos are driving the most engagement?
  3. When is the best time to publish content?
  4. What patterns exist in audience behavior?

The visuals include time-series charts, performance breakdowns, publishing heatmaps, and engagement metrics—all updating automatically as new data flows into the database.

[Grafana dashboard screenshots]

Here's a sample query behind the most-viewed videos panel. The DISTINCT ON keeps only one snapshot per video, since each pipeline run appends a fresh set of rows:


SELECT *
FROM (
    SELECT DISTINCT ON ("videoId")
        "videoId",
        "title",
        "viewCount",
        "publishedAt"
    FROM dataengineering.youtube_videos_enriched
    ORDER BY "videoId", "publishedAt" DESC
) sub
ORDER BY "viewCount" DESC
LIMIT 10;

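A similar query can drive the publishing-time heatmap, grouping performance by publish day and hour. The aggregation below is one possible choice, and because the pipeline appends a fresh snapshot on every run, it may need the same DISTINCT ON de-duplication as the query above:

SELECT
    day_of_week,
    hour,
    COUNT(*) AS videos_published,
    ROUND(AVG("viewCount")) AS avg_views
FROM dataengineering.youtube_videos_enriched
GROUP BY day_of_week, hour
ORDER BY avg_views DESC;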

Data Analysis and Insights

Top Performing Videos

These are the most viewed DW Documentary videos, based on the bar chart above:

| Rank | Video ID | Title | Views | Published Date |
|------|----------|-------|-------|----------------|
| 1 | NXaVLXSZdEw | Germany: The discreet lives of the super rich | 20,087,376 | 2019-06-09 15:00:04 |
| 2 | 1ReFNdkQ5Y8 | Brides for sale - Bulgaria's Roma marriage market | 19,107,926 | 2018-03-10 16:00:00 |
| 3 | TZ0j6kr4ZJ0 | Kiribati: a drowning paradise in the South Pacific | 13,202,687 | 2017-11-08 20:00:01 |
| 4 | ew7-Z2xCMgE | India’s prostitution villages | 12,474,846 | 2023-06-19 16:00:34 |
| 5 | -rZkdPXP6H4 | What happened to Otto Warmbier in North Korea? | 11,968,726 | 2020-11-27 17:00:00 |
| 6 | tT0ob3cHPmE | Hürtgen forest and the end of World War II | 11,540,030 | 2020-05-02 18:00:27 |
| 7 | JsPHKDuP-Hk | The end of a superpower - The collapse of the Soviet Union | 11,279,087 | 2022-03-03 20:00:04 |
| 8 | t6m49vNjEGs | How the rich get richer - Money in the world economy | 11,088,418 | 2017-07-05 15:01:14 |
| 9 | gqDCrdZVZnk | The world’s most dangerous arms dealer | 10,221,158 | 2023-06-29 16:00:25 |
| 10 | CVuVlk2E_e4 | Money, happiness and eternal life - Greed (Director's Cut) | 9,242,989 | 2017-06-24 02:00:01 |

Engagement Metrics Over Time

These are the engagement metrics for the top five videos as shown on the pie chart:

| Published Time | Likes | Comments |
|----------------|-------|----------|
| 2014-12-22 00:00:00.000 | 49,170 | 7,249 |
| 2015-05-23 00:00:00.000 | 247,170 | 678 |
| 2015-08-22 00:00:00.000 | 4,444 | 220 |
| 2016-05-30 00:00:00.000 | 50,105 | 1,980 |
| 2016-07-08 00:00:00.000 | 143,385 | 17,622 |

Content Strategy Recommendations

Based on the analytics, here are the key data-driven takeaways:

1. Top Performing Videos Summary
The Top 10 most viewed videos on the DW YouTube channel cover powerful and socially relevant topics such as extreme wealth, human rights issues, environmental crises, and historical events. These videos consistently attract multi-million view counts, with the most popular reaching over 20 million views. Most of them were published between 2017 and 2023, showing enduring viewer interest in in-depth documentaries.

2. Engagement Metrics Summary
The likes and comments table highlights strong audience interaction, with some videos gathering over 240,000 likes. The data spans from 2014 to 2025, indicating that even older videos maintain notable engagement levels.

Challenges & Solutions

API Rate Limiting
Challenge: The YouTube API enforces a daily quota (10,000 units)
Solution: Implemented batch requests and efficient pagination to reduce API load

Data Quality Issues
Challenge: Incomplete or missing stats for certain videos
Solution: Added error handling and null-safe transformations in Spark
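
To put the quota constraint in perspective: channels.list, playlistItems.list, and videos.list reads each cost 1 unit, and both list endpoints return up to 50 items per call. For a channel with roughly 1,000 videos, a full extraction therefore costs about 1 + 20 + 20 = 41 units, so even daily runs stay comfortably below the 10,000-unit daily quota.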

Future Enhancements

a. Sentiment Intelligence
Harness Natural Language Processing (NLP) to dig into viewer comments and decode how audiences feel about different videos. Track emotional trends and tie them back to content themes to understand what truly resonates.

b. Competitive Edge
Introduce benchmarking tools to stack your channel against similar creators. See where you stand, uncover blind spots, and spot emerging opportunities to lead rather than follow in your niche.

c. Intelligent Publishing
Train a lightweight machine learning model to identify your golden hours for publishing. Beyond timing, use historical patterns to recommend high-impact content types for specific days.

d. Auto-Insights & Alerts
Generate automated, easy-to-digest reports straight to your inbox. Get real-time alerts when performance dips or spikes, helping you react faster to unexpected trends.

e. Unified Social Analytics
Expand visibility by integrating other platforms—Instagram, TikTok, X (Twitter). View engagement holistically and track how cross-posting influences YouTube performance.

GitHub: https://github.com/Ambuso/Dw-Youtube-Channel-Analysis
