Python HTML to Markdown
Mateen Kiani (@kiani0x01)

Publish Date: Aug 5

Why Convert HTML to Markdown?

HTML is the backbone of web content, but Markdown offers simplicity and readability. Developers often need to extract HTML from templates, emails, or web scrapers and turn it into Markdown for docs or static site generators. Yet, converting every tag, link, or image can be a headache without the right approach.

"A small library can save huge time when dealing with conversions."

This guide walks through popular libraries and custom methods to transform HTML into clean Markdown. Ready to see how to turn a chunk of HTML into Markdown in just a few lines of Python?

Choosing a Library

When starting out, pick a library that fits your needs:

  • markdownify: Simple API, customizable tags.
  • html2text: Supports nested structures and tables.
  • Custom parser: Use BeautifulSoup for full control.

Each option comes with pros and cons in terms of flexibility, dependencies, and output style.

Basic Conversion with markdownify

First, install the library:

pip install markdownify

Then run a basic conversion:

from markdownify import markdownify as md

html = '''
<h1>Welcome</h1>
<p>This is a <strong>test</strong> of HTML to markdown.</p>
'''
# markdownify defaults to underlined (Setext) headings; ask for '#' headings
markdown = md(html, heading_style='ATX')
print(markdown)

The output looks like:

# Welcome

This is a **test** of HTML to markdown.

You can change how tags map to Markdown with options such as heading_style, or with custom rules defined on a MarkdownConverter subclass.

Handling Links and Images

Most converters handle links and images by default, but you can adjust the output with options. This example asks for ATX headings and "-" bullets, and drops <img> tags entirely:

from markdownify import markdownify as md

markdown = md(
    html,
    heading_style='ATX',  # '#'-style headings
    bullets='-',          # marker used for list items
    strip=['img'],        # remove <img> tags from the output
)

Tip: To customize image syntax, catch <img> tags with BeautifulSoup and build Markdown strings manually.

Custom Parsing with BeautifulSoup

When libraries fall short, parse HTML yourself:

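from bs4 import BeautifulSoup, NavigableString

def inline_md(el):
    # Render a node in document order, converting nested <a> and <img> inline
    if isinstance(el, NavigableString):
        return str(el)
    if el.name == 'a':
        text = ''.join(inline_md(c) for c in el.children)
        return f"[{text}]({el.get('href', '')})"
    if el.name == 'img':
        return f"![{el.get('alt', '')}]({el.get('src', '')})"
    return ''.join(inline_md(c) for c in el.children)

def html_to_md(html):
    soup = BeautifulSoup(html, 'html.parser')
    md_lines = []
    for el in soup.find_all(['h1', 'p', 'a', 'img']):
        # Links and images inside a heading, paragraph, or link are rendered
        # inline by that parent; skipping them here avoids duplicate output
        if el.name in ('a', 'img') and el.find_parent(['h1', 'p', 'a']):
            continue
        prefix = '# ' if el.name == 'h1' else ''
        md_lines.append(prefix + inline_md(el).strip())
    return '\n\n'.join(md_lines)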

Collecting fragments in a list and joining them once at the end, as html_to_md does, is usually tidier and faster than appending to a string in a loop.

Saving to a File

Once you have your Markdown, write it out:

md_content = html_to_md(html)
with open('output.md', 'w', encoding='utf-8') as f:
    f.write(md_content)

For line-by-line writing, see "Python write to file line by line".
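
Writing one file extends naturally to a whole folder. Here is a minimal sketch, assuming a hypothetical convert_dir helper that accepts any HTML-to-Markdown function (html_to_md above, or markdownify's markdownify). It uses pathlib's rglob, which covers the same ground as glob or os.walk.

```python
from pathlib import Path
from typing import Callable, List


def convert_dir(src: str, dest: str, convert: Callable[[str], str]) -> List[Path]:
    """Convert every .html file under src into a .md file under dest.

    convert_dir is a hypothetical helper; convert is any function that
    maps an HTML string to a Markdown string.
    """
    src_root, dest_root = Path(src), Path(dest)
    written = []
    for html_path in sorted(src_root.rglob('*.html')):
        # Mirror the source folder layout, swapping the extension for .md
        out_path = dest_root / html_path.relative_to(src_root).with_suffix('.md')
        out_path.parent.mkdir(parents=True, exist_ok=True)
        html = html_path.read_text(encoding='utf-8')
        out_path.write_text(convert(html), encoding='utf-8')
        written.append(out_path)
    return written
```

A call such as convert_dir('site', 'docs', html_to_md) would then mirror site/ into docs/; both folder names are placeholders.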

Advanced Tips and Best Practices

  • Handle tables by mapping <table> tags to pipe syntax.
  • Use regex to convert inline styles when needed.
  • Automate batch conversions with glob or os.walk.
  • Normalize whitespace to avoid extra blank lines.
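
Two of these tips can be sketched with the standard library alone. This example swaps in Python's built-in html.parser for BeautifulSoup to stay dependency-free. It assumes a simple, well-formed table: explicit closing tags, no colspan or rowspan, no nested tables, and the first row as the header. The names TableToPipe, table_to_pipe, and normalize_blank_lines are hypothetical.

```python
import re
from html.parser import HTMLParser


class TableToPipe(HTMLParser):
    """Collect the cell text of a simple <table>, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('th', 'td') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('th', 'td') and self._cell is not None:
            # Collapse whitespace and escape pipes so cells stay on one line
            text = ' '.join(''.join(self._cell).split())
            self._row.append(text.replace('|', r'\|'))
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_pipe(html):
    """Render the collected rows as a pipe table (first row = header)."""
    parser = TableToPipe()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ''
    header, *body = parser.rows
    lines = ['| ' + ' | '.join(header) + ' |',
             '| ' + ' | '.join('---' for _ in header) + ' |']
    lines += ['| ' + ' | '.join(row) + ' |' for row in body]
    return '\n'.join(lines)


def normalize_blank_lines(markdown):
    """Collapse runs of two or more blank lines into a single blank line."""
    # Note: this also touches blank lines inside code blocks
    collapsed = re.sub(r'\n(?:[ \t]*\n){2,}', '\n\n', markdown)
    return collapsed.strip() + '\n'
```

Trailing spaces on non-blank lines are left alone on purpose, since two trailing spaces mark a hard line break in Markdown.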

Conclusion

Converting HTML to Markdown in Python can be quick and reliable with the right tools. For most tasks, libraries like markdownify or html2text handle the heavy lifting. When you need full control, BeautifulSoup offers a flexible way to parse and rebuild content. Try these methods in your next project to keep your documentation DRY and easy to maintain.
