Building a YouTube Channel Analytics Dashboard with Airflow, Spark, and Grafana
Ambuso Dismas

Introduction

In the rapidly evolving creator economy, data-driven insights are essential for YouTube content creators to refine their strategies, maximize audience engagement, and drive channel growth. While YouTube Studio offers built-in analytics, its capabilities are often limited in terms of flexibility, depth, and customization.

To overcome these constraints, I developed an end-to-end automated data engineering pipeline orchestrated with Apache Airflow. This system extracts raw data from the YouTube Data API, transforms and enriches it using Apache Spark, stores the results in a PostgreSQL database, and presents interactive visualizations through Grafana. Apache Airflow manages and schedules these workflows as DAGs (Directed Acyclic Graphs), ensuring reliable, repeatable, and scalable data processing.

The result is a dynamic analytics dashboard that empowers creators with full visibility into their channel’s performance—enabling smarter decisions, faster.

Problem Statement

YouTube creators struggle with analyzing their channel performance because:

  1. YouTube’s built-in analytics are limited and inflexible
  2. Manually collecting and managing data is slow, error-prone, and doesn’t scale
  3. There’s no easy, centralized way to track key metrics over time
  4. Figuring out the best times to publish and what content works best requires complex analysis
  5. There is no automated, reliable system that regularly pulls data, organizes it, and turns it into actionable insights

Solution Architecture

To tackle this, I built a full ETL pipeline with visualization, including:

  1. Extract: A Python script fetches video and channel data from the YouTube API
  2. Transform: Apache Spark cleans and enriches the data with key metrics
  3. Load: The data is saved into a PostgreSQL database
  4. Orchestrate: Apache Airflow automates and schedules the entire workflow
  5. Visualize: Grafana dashboards display interactive insights

All components run smoothly on an Azure VM, offering an affordable, scalable solution for creators and small teams.
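Both the extraction and transformation scripts that follow read their configuration from environment variables loaded with python-dotenv. Here's a rough sketch of the kind of .env file they assume; the variable names come straight from the scripts, while the values are placeholders you would swap for your own:

YOUTUBE_API_KEY=your_api_key_here
YOUTUBE_CHANNEL_ID=UCxxxxxxxxxxxxxxxxxxxxxx
OUTPUT_FILENAME=/home/main/youtube/ambuso.json
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_password_here
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=analytics
POSTGRES_SSLMODE=require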

Implementation Details

Data Extraction
The extraction process uses the YouTube Data API v3 to automatically gather channel and video information. The script handles all the API authentication, fetches the channel’s uploaded videos playlist, and pulls detailed metadata for each video — like views, likes, comments, and publish dates.

This setup ensures up-to-date data is collected efficiently without manual intervention.

Here’s a snippet of the extraction script:


import os
from googleapiclient.discovery import build
from datetime import datetime
import json
import sys
from dotenv import load_dotenv

load_dotenv()

YOUTUBE_API_KEY = os.getenv("YOUTUBE_API_KEY")
CHANNEL_ID = os.getenv("YOUTUBE_CHANNEL_ID")
OUTPUT_FILENAME = os.getenv("OUTPUT_FILENAME")


def validate_config():
    if not YOUTUBE_API_KEY:
        print("Error: YOUTUBE_API_KEY environment variable not set.", file=sys.stderr)
        sys.exit(1)
    if not CHANNEL_ID:
        print("Error: YOUTUBE_CHANNEL_ID environment variable not set.", file=sys.stderr)
        sys.exit(1)
    if not OUTPUT_FILENAME:
        print("Error: OUTPUT_FILENAME environment variable not set.", file=sys.stderr)
        sys.exit(1)


def build_youtube_client():
    return build("youtube", "v3", developerKey=YOUTUBE_API_KEY)


def get_playlist_id(youtube, channel_id):
    try:
        request = youtube.channels().list(part="contentDetails", id=channel_id)
        response = request.execute()
        # Return the playlist ID containing all uploaded videos
        return response['items'][0]['contentDetails']['relatedPlaylists']['uploads'] if response['items'] else None
    except Exception as e:
        print(f"Error fetching channel details: {e}", file=sys.stderr)
        return None


def get_video_ids(youtube, playlist_id):
    video_ids, next_page_token = [], None
    while True:
        try:
            # Fetch playlist items in batches of 50 (max allowed)
            request = youtube.playlistItems().list(
                part="contentDetails", 
                playlistId=playlist_id, 
                maxResults=50, 
                pageToken=next_page_token
            )
            response = request.execute()
            # Extract video IDs from the current page
            video_ids.extend([item['contentDetails']['videoId'] for item in response['items']])
            # Check if there is a next page, else exit loop
            next_page_token = response.get('nextPageToken')
            if not next_page_token:
                break
        except Exception as e:
            print(f"Error fetching playlist items: {e}", file=sys.stderr)
            break
    return video_ids


def get_video_details(youtube, video_ids):
    all_video_data = []
    # Process video IDs in chunks of 50 due to API limits
    for i in range(0, len(video_ids), 50):
        video_chunk = ",".join(video_ids[i:i+50])
        try:
            request = youtube.videos().list(part="snippet,statistics", id=video_chunk)
            response = request.execute()
            # Timestamp for when data was extracted
            timestamp = datetime.utcnow().isoformat() + 'Z'
            # Extract relevant video info and stats
            for item in response['items']:
                video_data = {
                    'videoId': item['id'],
                    'title': item['snippet']['title'],
                    'publishedAt': item['snippet']['publishedAt'],
                    'viewCount': item['statistics'].get('viewCount'),
                    'likeCount': item['statistics'].get('likeCount'),
                    'commentCount': item['statistics'].get('commentCount'),
                    'extractionTimestamp': timestamp
                }
                all_video_data.append(video_data)
        except Exception as e:
            print(f"Error fetching video details: {e}", file=sys.stderr)
    return all_video_data


if __name__ == "__main__":
    print("Starting YouTube data extraction...")

    # Step 1: Check if environment variables are set
    validate_config()

    # Step 2: Create YouTube API client
    youtube = build_youtube_client()

    # Step 3: Get playlist ID for channel uploads
    playlist_id = get_playlist_id(youtube, CHANNEL_ID)
    if not playlist_id:
        print(f"Could not retrieve playlist ID for channel {CHANNEL_ID}.", file=sys.stderr)
        sys.exit(1)

    # Step 4: Retrieve all video IDs from the uploads playlist
    video_ids = get_video_ids(youtube, playlist_id)
    if not video_ids:
        print(f"No videos found for channel {CHANNEL_ID}.", file=sys.stderr)
        sys.exit(1)

    # Step 5: Get detailed video data for all video IDs
    video_data = get_video_details(youtube, video_ids)
    if not video_data:
        print("No video data fetched.", file=sys.stderr)
        sys.exit(1)

    # Step 6: Write the collected video data to the output JSON file
    try:
        with open(OUTPUT_FILENAME, 'w') as f:
            json.dump(video_data, f, indent=4)
        print(f"Data saved to {OUTPUT_FILENAME}")
    except IOError as e:
        print(f"Error writing data to {OUTPUT_FILENAME}: {e}", file=sys.stderr)
        sys.exit(1)

    print("Extraction process finished.")
    sys.exit(0)

Data Extraction & Transformation
Extraction Process
The data extraction script efficiently pulls video-level data from the YouTube Data API v3. It follows these core steps:

  1. Validates required environment variables and the API key
  2. Retrieves the channel’s uploads playlist ID
  3. Iteratively fetches all video IDs, handling pagination
  4. Collects detailed video metadata in batches (50 per request, respecting API quotas)
  5. Stores the raw JSON data for transformation

Transformation & Enrichment

Using Apache Spark, the raw data is cleaned and enriched for analysis. The transformation process:

  1. Converts date strings into proper timestamps
  2. Derives time-based dimensions (year, month, day, hour)
  3. Adds calculated metrics such as engagement rates
  4. Categorizes videos by performance tiers (normal, high, viral)

This ensures the dataset is structured, analytics-ready, and suited to deeper analysis.

import os
import psycopg2
from pyspark.sql import SparkSession
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

POSTGRES_USER = os.getenv("POSTGRES_USER")
POSTGRES_PASSWORD = os.getenv("POSTGRES_PASSWORD")
POSTGRES_HOST = os.getenv("POSTGRES_HOST")
POSTGRES_PORT = os.getenv("POSTGRES_PORT")
POSTGRES_DB = os.getenv("POSTGRES_DB")
POSTGRES_SSLMODE = os.getenv("POSTGRES_SSLMODE")

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("YouTube Data Transformation") \
    .config("spark.jars", "/home/main/downloads/postgresql-42.7.3.jar") \
    .getOrCreate()

# Load JSON data
df_raw = spark.read.option("multiline", "true").json("/home/main/youtube/ambuso.json")
df_raw.createOrReplaceTempView("youtube_videos_raw")

# Transform raw data
sql_query = """
    SELECT
        videoId,
        title,
        publishedAt,
        CAST(viewCount AS INT) AS viewCount,
        CAST(likeCount AS INT) AS likeCount,
        CAST(commentCount AS INT) AS commentCount,
        extractionTimestamp
    FROM youtube_videos_raw
    WHERE viewCount IS NOT NULL
    ORDER BY viewCount DESC
"""
df_transformed = spark.sql(sql_query)
df_transformed.createOrReplaceTempView("youtube_transformed")

# Enrich data
enriched_query = """
    SELECT
        videoId,
        title,
        publishedAt,
        viewCount,
        likeCount,
        commentCount,
        extractionTimestamp,
        YEAR(TO_TIMESTAMP(publishedAt)) AS year,
        MONTH(TO_TIMESTAMP(publishedAt)) AS month,
        DAY(TO_TIMESTAMP(publishedAt)) AS day_of_month,
        DATE_FORMAT(TO_TIMESTAMP(publishedAt), 'EEEE') AS day_of_week,
        HOUR(TO_TIMESTAMP(publishedAt)) AS hour,
        WEEKOFYEAR(TO_TIMESTAMP(publishedAt)) AS week,
        CASE
            WHEN viewCount > 100000 THEN 'viral'
            WHEN viewCount > 5000 THEN 'high'
            ELSE 'normal'
        END AS performance_class
    FROM youtube_transformed
"""
df_enriched = spark.sql(enriched_query)
df_enriched.show()

# Create schema and table if needed
try:
    conn = psycopg2.connect(
        dbname=POSTGRES_DB,
        user=POSTGRES_USER,
        password=POSTGRES_PASSWORD,
        host=POSTGRES_HOST,
        port=POSTGRES_PORT,
        sslmode=POSTGRES_SSLMODE
    )
    conn.autocommit = True
    cursor = conn.cursor()

    cursor.execute("CREATE SCHEMA IF NOT EXISTS dataengineering;")
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS dataengineering.youtube_videos_enriched (
            videoId TEXT,
            title TEXT,
            publishedAt TIMESTAMP,
            viewCount INT,
            likeCount INT,
            commentCount INT,
            extractionTimestamp TIMESTAMP,
            year INT,
            month INT,
            day_of_month INT,
            day_of_week TEXT,
            hour INT,
            week INT,
            performance_class TEXT
        );
    """)
    cursor.close()
    conn.close()
    print("Schema and table checked or created.")
except Exception as e:
    print("Failed to create schema or table:", e)

# Write to PostgreSQL
jdbc_url = f"jdbc:postgresql://{POSTGRES_HOST}:{POSTGRES_PORT}/{POSTGRES_DB}"
connection_properties = {
    "user": POSTGRES_USER,
    "password": POSTGRES_PASSWORD,
    "driver": "org.postgresql.Driver"
}

df_enriched.write \
    .jdbc(
        url=jdbc_url,
        table="dataengineering.youtube_videos_enriched",
        mode="append",
        properties=connection_properties
    )

print("Enriched data successfully appended to PostgreSQL.")

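One thing to note: the enrichment query above adds time dimensions and a performance class, but the engagement-rate metric mentioned earlier isn't computed in this snippet. Here's a minimal sketch of how it could be bolted onto the same Spark step; the column name engagement_rate and the (likes + comments) / views formula are my own choices, not part of the original pipeline:

# Hypothetical extra metric: engagement rate = (likes + comments) / views
engagement_query = """
    SELECT
        *,
        ROUND(
            (COALESCE(likeCount, 0) + COALESCE(commentCount, 0)) / CAST(viewCount AS DOUBLE),
            4
        ) AS engagement_rate
    FROM youtube_transformed
    WHERE viewCount > 0
"""
df_with_engagement = spark.sql(engagement_query)
df_with_engagement.show()

If this column were persisted, the dataengineering.youtube_videos_enriched table would also need a matching numeric column.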

Data Transformation & Loading

The transformation script processes raw JSON data extracted from the YouTube API and prepares it for analytics. Key steps include:

  1. Ingests raw video metadata from JSON files
  2. Cleans the data by casting types and filtering out incomplete records
  3. Enriches the dataset with:
     • Temporal breakdowns: year, month, day, hour, week
     • Day of the week: for publishing time analysis
     • Performance classification: based on custom view count thresholds

Once enriched, the data is written into a structured PostgreSQL table for downstream visualization and querying.
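After a pipeline run, a quick query against the warehouse is an easy way to confirm the load worked. A sketch of the kind of check one might run, assuming the camelCase, quoted column names used in the dashboard queries later in this post:

SELECT
    performance_class,
    COUNT(*) AS rows_loaded,
    MAX("extractionTimestamp") AS latest_snapshot
FROM dataengineering.youtube_videos_enriched
GROUP BY performance_class;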

Workflow Orchestration with Airflow

To automate and manage the data pipeline, Apache Airflow is used for workflow orchestration. A custom DAG (Directed Acyclic Graph) defines the execution order and dependencies across tasks. Key components of the DAG include:

  • Triggering the YouTube API extraction script
  • Transforming raw data with Apache Spark
  • Loading the enriched dataset into PostgreSQL
  • Scheduling: the pipeline runs on a regular interval (e.g., daily) to keep insights fresh
  • Error handling and logging to ensure transparency and reliability

This setup enables end-to-end automation with minimal manual intervention.

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

default_args = {
    'start_date': datetime(2025, 6, 7),
    'retries': 1,
}

with DAG(
    'youtube_data_pipeline',
    default_args=default_args,
    description='A pipeline to extract and transform YouTube data',
    schedule='@daily',  # run the pipeline once a day
    catchup=False,
) as dag:

    extract_task = BashOperator(
        task_id='extract_youtube_data',
        bash_command='/home/main/youtube/env/bin/python /home/main/youtube/extract.py',
    )

    transform_task = BashOperator(
        task_id='transform_and_load',
        bash_command='/home/main/youtube/env/bin/python /home/main/youtube/transform.py',
    )

    extract_task >> transform_task


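The DAG above is intentionally minimal. The retries, retry delays, and failure notifications mentioned under error handling can be layered on through default_args and a callback; here's a rough sketch of that idea (the notify_failure hook is illustrative and not part of the original pipeline):

from datetime import datetime, timedelta


def notify_failure(context):
    # Illustrative failure hook: log the failing task; this could just as easily
    # send an email or Slack message instead of printing.
    task_instance = context["task_instance"]
    print(f"Task {task_instance.task_id} failed for run {context['ds']}")


default_args = {
    'start_date': datetime(2025, 6, 7),
    'retries': 3,                           # retry transient API/Spark failures
    'retry_delay': timedelta(minutes=10),   # wait between retry attempts
    'on_failure_callback': notify_failure,  # run the hook whenever a task fails
}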

Data Visualization with Grafana

Once the data pipeline was in place, I built an interactive Grafana dashboard to visualize key YouTube channel insights. The dashboard helps creators quickly answer essential questions like:

  1. How is the channel growing over time?
  2. Which videos are driving the most engagement?
  3. When is the best time to publish content?
  4. What patterns exist in audience behavior?

The visuals include time-series charts, performance breakdowns, publishing heatmaps, and engagement metrics—all updating automatically as new data flows into the database.

[Grafana dashboard screenshots]

Here's a sample query behind the most-viewed videos panel. The DISTINCT ON keeps only one snapshot per video, since each pipeline run appends a fresh set of rows:


SELECT *
FROM (
    SELECT DISTINCT ON ("videoId")
        "videoId",
        "title",
        "viewCount",
        "publishedAt"
    FROM dataengineering.youtube_videos_enriched
    ORDER BY "videoId", "publishedAt" DESC
) sub
ORDER BY "viewCount" DESC
LIMIT 10;

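A similar query can drive the publishing-time heatmap, grouping performance by publish day and hour. The aggregation below is one possible choice, and because the pipeline appends a fresh snapshot on every run, it may need the same DISTINCT ON de-duplication as the query above:

SELECT
    day_of_week,
    hour,
    COUNT(*) AS videos_published,
    ROUND(AVG("viewCount")) AS avg_views
FROM dataengineering.youtube_videos_enriched
GROUP BY day_of_week, hour
ORDER BY avg_views DESC;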

Data Analysis and Insights

Top Performing Videos

These are the most viewed DW Documentary videos, based on the bar chart above:

| Rank | Video ID | Title | Views | Published Date |
|------|----------|-------|-------|----------------|
| 1 | NXaVLXSZdEw | Germany: The discreet lives of the super rich | 20,087,376 | 2019-06-09 15:00:04 |
| 2 | 1ReFNdkQ5Y8 | Brides for sale - Bulgaria's Roma marriage market | 19,107,926 | 2018-03-10 16:00:00 |
| 3 | TZ0j6kr4ZJ0 | Kiribati: a drowning paradise in the South Pacific | 13,202,687 | 2017-11-08 20:00:01 |
| 4 | ew7-Z2xCMgE | India’s prostitution villages | 12,474,846 | 2023-06-19 16:00:34 |
| 5 | -rZkdPXP6H4 | What happened to Otto Warmbier in North Korea? | 11,968,726 | 2020-11-27 17:00:00 |
| 6 | tT0ob3cHPmE | Hürtgen forest and the end of World War II | 11,540,030 | 2020-05-02 18:00:27 |
| 7 | JsPHKDuP-Hk | The end of a superpower - The collapse of the Soviet Union | 11,279,087 | 2022-03-03 20:00:04 |
| 8 | t6m49vNjEGs | How the rich get richer - Money in the world economy | 11,088,418 | 2017-07-05 15:01:14 |
| 9 | gqDCrdZVZnk | The world’s most dangerous arms dealer | 10,221,158 | 2023-06-29 16:00:25 |
| 10 | CVuVlk2E_e4 | Money, happiness and eternal life - Greed (Director's Cut) | 9,242,989 | 2017-06-24 02:00:01 |

Engagement Metrics Over Time

These are the engagement metrics for the top five videos as shown on the pie chart:

| Published Time | Likes | Comments |
|----------------|-------|----------|
| 2014-12-22 00:00:00.000 | 49,170 | 7,249 |
| 2015-05-23 00:00:00.000 | 247,170 | 678 |
| 2015-08-22 00:00:00.000 | 4,444 | 220 |
| 2016-05-30 00:00:00.000 | 50,105 | 1,980 |
| 2016-07-08 00:00:00.000 | 143,385 | 17,622 |

Content Strategy Recommendations

Based on the analytics, here are the key data-driven takeaways:

1. Top Performing Videos Summary
The Top 10 most viewed videos on the DW YouTube channel cover powerful and socially relevant topics such as extreme wealth, human rights issues, environmental crises, and historical events. These videos consistently attract multi-million view counts, with the most popular reaching over 20 million views. Most of them were published between 2017 and 2023, showing enduring viewer interest in in-depth documentaries.

2. Engagement Metrics Summary
The likes and comments table highlights strong audience interaction, with some videos gathering over 240,000 likes. The data spans from 2014 to 2025, indicating that even older videos maintain notable engagement levels.

Challenges & Solutions

API Rate Limiting
Challenge: The YouTube API enforces a daily quota (10,000 units)
Solution: Implemented batch requests and efficient pagination to reduce API load

Data Quality Issues
Challenge: Incomplete or missing stats for certain videos
Solution: Added error handling and null-safe transformations in Spark
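
To put the quota constraint in perspective: channels.list, playlistItems.list, and videos.list reads each cost 1 unit, and both list endpoints return up to 50 items per call. For a channel with roughly 1,000 videos, a full extraction therefore costs about 1 + 20 + 20 = 41 units, so even daily runs stay comfortably below the 10,000-unit daily quota.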

Future Enhancements

a. Sentiment Intelligence
Harness Natural Language Processing (NLP) to dig into viewer comments and decode how audiences feel about different videos. Track emotional trends and tie them back to content themes to understand what truly resonates.

b. Competitive Edge
Introduce benchmarking tools to stack your channel against similar creators. See where you stand, uncover blind spots, and spot emerging opportunities to lead rather than follow in your niche.

c. Intelligent Publishing
Train a lightweight machine learning model to identify your golden hours for publishing. Beyond timing, use historical patterns to recommend high-impact content types for specific days.

d. Auto-Insights & Alerts
Generate automated, easy-to-digest reports straight to your inbox. Get real-time alerts when performance dips or spikes, helping you react faster to unexpected trends.

e. Unified Social Analytics
Expand visibility by integrating other platforms—Instagram, TikTok, X (Twitter). View engagement holistically and track how cross-posting influences YouTube performance.

GitHub: https://github.com/Ambuso/Dw-Youtube-Channel-Analysis
