🧩 The problem
I wanted to automate a boring and repetitive workflow: every time a YouTube
video is published on my channel, I want an associated post on my personal
Jekyll blog.
Until very recently I handled all this by hand. The process was tedious
because, except for the video summary (more on this later), it was merely a
copy-paste operation of fields into the YAML front matter.
✅ The solution
📶 Feeds
The solution is to leverage the RSS feeds provided by YouTube itself, plus a
few Python libraries:
- 📶 feedparser: the core dependency
- 💻 PyYAML: parses the existing front matter
- 🐌 slugify: not crucial for this use case but useful
Every YouTube channel, in fact, has a feed at this URL format:
https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}
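A quick way to see what such a feed contains is to point feedparser straight at that URL; the snippet below is just a sanity check, with a placeholder channel id:

```python
import feedparser

# Placeholder: substitute your own channel id.
CHANNEL_ID = 'UCxxxxxxxxxxxxxxxxxxxxxx'
FEED_URL = f'https://www.youtube.com/feeds/videos.xml?channel_id={CHANNEL_ID}'

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # Each entry carries the video link, title, publication date and description.
    print(entry.link, '-', entry.title)
```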
⚙️ Algorithm
Essentially the algorithm works like this:
- get and parse the feed file
- for each news item (`entry`) in the feed file, extract:
  - url
  - title
  - published date
  - tags (via the video description, using a regex)
- create a new markdown file in the `_posts` directory using the variables of step 2, and avoid changing existing auto-posts
Steps 1 and 2 are quite simple thanks to a list comprehension:
```python
import datetime

import feedparser


def extract_feed_youtube_data(feed_source: str) -> list[dict]:
    d = feedparser.parse(feed_source)
    data = [
        {
            'url': e['link'],
            'title': e['title'],
            # feedparser parses the published date into a 9-field
            # time.struct_time in UTC.
            'published': datetime.datetime(*e['published_parsed'][:6]),
            # Use a regex for this.
            # If there are no hash tags available just use default ones.
            'tags': get_video_tags(e['summary'])
            if ('summary' in e and e['summary'] != '')
            else list(STANDARD_TAGS),
            # The video description is not used at the moment.
            'sm': e.get('summary', ''),
        }
        for e in d.entries if 'link' in e
    ]
```
Before returning the data you can also perform a cleanup:
```python
    for e in data:
        # Always use default tags. This also works in case videos do not have
        # a description.
        e['tags'] += STANDARD_TAGS
        # Unique.
        e['tags'] = sorted(set(e['tags']))
        e['summary'] = ' '.join(e['tags'])
    return data
```
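The `get_video_tags` helper and the `STANDARD_TAGS` constant are not shown in this post; here is a minimal sketch of what they could look like, assuming tags are plain `#hashtag` tokens in the description (both the regex and the default tag values are my placeholders):

```python
import re

# Fallback tags applied to every auto-post (placeholder values).
STANDARD_TAGS = ['youtube', 'video']

# A hashtag: "#" followed by letters, digits or underscores.
HASHTAG_REGEX = re.compile(r'#(\w+)')


def get_video_tags(description: str) -> list[str]:
    # Extract the hashtags from the video description, lowercased.
    return [tag.lower() for tag in HASHTAG_REGEX.findall(description)]
```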
Step 3 involves an f-string. We need to take care of specific fields so that
Jekyll does not throw YAML parsing errors when the values contain quotes; this
can happen in the `title` and `description` fields specifically, which is why
they use YAML block scalars (`|-`).
```python
def create_markdown_blog_post(posts_base_directory: str, data: dict) -> str:
    return f"""---
title: |-
  {data["title"]}
tags: [{youtube_tags_to_jekyll_front_matter_tags(data["tags"])}]
related_links: ['{data["url"]}']
updated: {datetime_object_to_jekyll_front_matter_utc(data["published"])}
description: |-
  {data["title"]}
lang: 'en'
---

{generate_youtube_embed_code(get_youtube_video_id(data["url"]))}

<div markdown="0">
{data["summary"]}
</div>

*Note: post auto-generated from YouTube feeds.*
"""
```
🎯 Result
As you can see, the content of each auto post is very basic: besides the
standard fields, the body contains an HTML YouTube embed code and a list of
hashtags extracted from the video description using a regex. The description
(`summary`) was left out on purpose. The idea for the future is to get the
video transcription and let an LLM (via Ollama) generate a summary, which I
would of course need to proofread manually. I also cannot copy the video
description verbatim because of SEO (duplicate content).
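As a very rough sketch of that idea, assuming the transcript is already available as plain text and the `ollama` Python client is installed with a local model pulled (model name and prompt are placeholders):

```python
import ollama  # assumes a running Ollama server reachable with default settings


def summarize_transcript(transcript: str, model: str = 'llama3.2') -> str:
    # Ask the local model for a short, blog-ready summary of the transcript.
    response = ollama.chat(
        model=model,
        messages=[{
            'role': 'user',
            'content': 'Summarize this video transcript in one short paragraph '
                       'suitable for a blog post:\n\n' + transcript,
        }],
    )
    return response['message']['content']
```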
Another improvement could involve replicating a subset of the fields of the
auto-post to send toots to Mastodon using its API.
▶️ Running
At the moment the script is triggered by a local pre-commit hook which also
installs the Python dependencies in a separate environment:
```yaml
- repo: local
  hooks:
    - id: generate_posts
      name: generate_posts
      entry: python ./.scripts/youtube/generate_posts.py
      verbose: true
      always_run: true
      pass_filenames: false
      language: python
      types: [python]
      additional_dependencies: ['feedparser>=6,<7', 'python-slugify[unidecode]>=8,<9', 'PyYAML>=6,<7']
```
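For completeness, the script's entry point (step 3 of the algorithm: writing the files and skipping existing auto-posts) could look roughly like this. The `_posts` path, the filename scheme and the `FEED_URL` constant are assumptions on my part, `slugify` comes from the python-slugify dependency declared above, and the plain file-existence check is a simplification of the PyYAML-based front matter handling mentioned earlier:

```python
import pathlib

from slugify import slugify

# Placeholder channel id.
FEED_URL = 'https://www.youtube.com/feeds/videos.xml?channel_id=UCxxxxxxxxxxxxxxxxxxxxxx'
POSTS_DIR = pathlib.Path('_posts')


def main():
    for post in extract_feed_youtube_data(FEED_URL):
        # Jekyll expects post filenames like YYYY-MM-DD-title.md.
        filename = f"{post['published']:%Y-%m-%d}-{slugify(post['title'])}.md"
        path = POSTS_DIR / filename
        # Never touch an existing auto-post.
        if path.exists():
            continue
        path.write_text(create_markdown_blog_post(str(POSTS_DIR), post))


if __name__ == '__main__':
    main()
```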
🎉 Conclusion
And that's it, really. This script saves time and it's the kind of automation I
like. Thankfully YouTube still provides RSS feeds, although I already had to fix
the script once to adapt to a structural change on their side.
If you are interested in the source code you can find it on
its repo.
You can comment here and check my YouTube channel for related content!