Page Zen: The Open-Source Article Cleaning API You've Been Waiting For
Rohith Gilla (@gillarohith)

Publish Date: Jun 19

In today's information-rich world, we're constantly bombarded with cluttered web articles filled with ads, popups, navigation menus, and other distractions. What if you could extract just the essential content from any article with a simple API call? Meet Page Zen - an open-source, self-hostable solution that transforms messy web articles into clean, readable content.

🚀 What is Page Zen?

Page Zen is a powerful Go-based API service that takes any article URL and returns clean, distraction-free content in multiple formats. Whether you're building a reading app, content aggregator, or just want to save articles without the clutter, Page Zen has you covered.

Key Features

Clean Article Extraction - Removes ads, navigation, social widgets, and other noise

Multiple Output Formats - Get content as clean text or markdown

Open Graph Metadata - Extract rich social media metadata

Medium Article Support - Handles Medium and other popular publishing platforms

Self-Hostable - Complete control over your data and infrastructure

Open Source - MIT licensed, community-driven development

🌟 Why Choose Page Zen?

1. Open Source & Self-Hostable

Unlike proprietary services that lock you into their ecosystem, Page Zen is completely open source. You can:

  • Host it on your own infrastructure
  • Customize it for your specific needs
  • Never worry about API rate limits or service shutdowns
  • Maintain complete control over your data

2. Works with Any Article Platform

Page Zen intelligently handles content from various sources:

  • Medium articles
  • News websites
  • Blog posts
  • Technical documentation
  • And virtually any web article!

3. Rich Metadata Extraction

Beyond just cleaning content, Page Zen extracts comprehensive Open Graph metadata:

  • Article title and description
  • Author information
  • Publication dates
  • Social media images
  • Twitter Card data
  • And much more!

🛠️ Easy to Deploy

Getting started with Page Zen is incredibly simple. The project includes Docker support for easy deployment:

# Clone the repository and move into it
git clone https://github.com/rohithgilla12/page-zen.git
cd page-zen

# Run with Docker Compose (detached)
docker-compose up -d

That's it! Your article cleaning API is now running locally.

📝 API Usage Examples

Extract Article Content

curl -X POST http://localhost:8080/extract \
  -H "Content-Type: application/json" \
  -d '{"url": "https://itnext.io/essential-cli-tui-tools-for-developers-7e78f0cd27db", "include_markdown":true}'
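
For callers who prefer code over curl, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_extract_request` and `extract_article` are hypothetical, and the base URL assumes the local Docker setup from the curl example. The shape of the `/extract` response is not shown in this post, so the sketch returns the decoded JSON as-is instead of guessing at field names.

```python
import json
import urllib.request

# Assumption: Page Zen is listening on localhost:8080, as in the curl example.
BASE_URL = "http://localhost:8080"


def build_extract_request(article_url, include_markdown=True, base_url=BASE_URL):
    """Build the POST request for /extract without sending it (hypothetical helper)."""
    payload = {
        # Strip stray whitespace so the URL is sent exactly as intended.
        "url": article_url.strip(),
        "include_markdown": include_markdown,
    }
    return urllib.request.Request(
        f"{base_url}/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_article(article_url, include_markdown=True, timeout=30):
    """Send the request and return the decoded JSON body as a dict."""
    request = build_extract_request(article_url, include_markdown)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the service running, `extract_article("https://itnext.io/essential-cli-tui-tools-for-developers-7e78f0cd27db")` sends the same JSON body as the curl command above. Keeping request construction separate from sending makes the payload easy to inspect or unit test without a live server.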

Extract Open Graph Data Only

curl -X POST http://localhost:8080/opengraph \
  -H "Content-Type: application/json" \
  -d '{"url": "https://dev.to/gillarohith/develop-url-shortener-application-with-redwood-js-3cf7"}'

Response:

{
  "url": "https://dev.to/gillarohith/develop-url-shortener-application-with-redwood-js-3cf7",
  "open_graph": {
    "title": "Develop URL shortener application with Redwood JS.",
    "description": "Develop URL shortener application with RedwoodJS            Introduction            What is...",
    "image": "https://media2.dev.to/dynamic/image/width=1000,height=500,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77phvxr1c3i00fvv0jly.png",
    "url": "https://dev.to/gillarohith/develop-url-shortener-application-with-redwood-js-3cf7",
    "type": "article",
    "site_name": "DEV Community",
    "twitter_card": "summary_large_image",
    "twitter_site": "@thepracticaldev",
    "twitter_creator": "@gillarohith",
    "twitter_title": "Develop URL shortener application with Redwood JS.",
    "twitter_description": "Develop URL shortener application with RedwoodJS            Introduction            What is..."
  },
  "success": true
}
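
A response like the one above maps naturally onto a link-preview card. The following Python sketch shows one way to consume it. The `LinkPreview` class and `preview_from_response` function are hypothetical names, the field names are taken from the sample response, and treating `"success": false` as an error is an assumption, since the failure format is not shown in this post.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkPreview:
    """Hypothetical container for the fields a preview card usually needs."""
    title: str
    description: str
    image: Optional[str]
    site_name: Optional[str]


def preview_from_response(body: str) -> LinkPreview:
    """Turn a /opengraph JSON response into a LinkPreview."""
    data = json.loads(body)
    # Assumption: a falsy "success" flag signals a failed extraction.
    if not data.get("success"):
        raise ValueError("Page Zen reported an unsuccessful extraction")
    og = data.get("open_graph") or {}
    # Descriptions can contain long whitespace runs (see the sample above),
    # so collapse them to single spaces before display.
    description = " ".join(og.get("description", "").split())
    return LinkPreview(
        # Fall back to the Twitter title when the Open Graph title is absent.
        title=og.get("title") or og.get("twitter_title", ""),
        description=description,
        image=og.get("image"),
        site_name=og.get("site_name"),
    )
```

Optional fields are read with `.get()` because not every page publishes every Open Graph tag.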

🎯 Perfect Use Cases

Content Aggregators: Build clean RSS feeds or news aggregators

Reading Apps: Create distraction-free reading experiences

Research Tools: Extract clean content for analysis

Social Media Tools: Get rich preview data for link sharing

Documentation: Convert web articles to clean markdown

🔧 Advanced Features

Page Zen goes beyond basic article extraction:

  • Image Processing: Converts complex picture elements to simple img tags
  • URL Resolution: Handles relative URLs and converts them to absolute paths
  • Smart Content Detection: Uses Mozilla's Readability algorithm for accurate content extraction
  • Configurable Cleaning: Remove specific elements based on your needs
  • Comprehensive Logging: Built-in structured logging for debugging and monitoring
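
To make the URL resolution feature concrete, here is a small Python illustration of the general technique of resolving relative links against the page they came from. It is not Page Zen's actual Go implementation, only a sketch of the idea using the standard library's `urljoin`; the `resolve_urls` helper and the example.com URLs are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_urls(page_url, links):
    """Resolve each link against the URL of the page it was found on."""
    return [urljoin(page_url, link) for link in links]


page = "https://example.com/blog/post/"
links = [
    "images/cover.png",               # relative to the current directory
    "/static/app.css",                # relative to the site root
    "../about",                       # parent directory
    "//cdn.example.org/a.png",        # protocol-relative
    "https://cdn.example.org/b.png",  # already absolute, left unchanged
]

for original, resolved in zip(links, resolve_urls(page, links)):
    print(f"{original} -> {resolved}")
```

Absolute links matter for extracted content because the cleaned article is usually displayed somewhere other than its original host, where relative paths would no longer point at anything.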

🌍 Join the Community

Page Zen is more than just a tool - it's a community-driven project that welcomes contributions:

  • 🐛 Report bugs and suggest features
  • 💻 Contribute code and improvements
  • 📖 Improve documentation
  • ⭐ Star the repo to show your support

🚀 Get Started Today

Ready to clean up the web? Here's how to get started:

  1. Try it locally: Clone the repo and run with Docker
  2. Deploy to production: Use the included Dockerfile for easy deployment
  3. Integrate: Start making API calls from your application
  4. Customize: Fork the project and adapt it to your needs

Links:

  • GitHub repository: https://github.com/rohithgilla12/page-zen

Page Zen - Because the web deserves to be readable.
