🧠 NewsPulse AI – Real-Time News Analysis with LLMs & Web Scraping via Bright Data
Suman Kumar (@sumankalia) · Published May 25

This is a submission for the Bright Data AI Web Access Hackathon
We built NewsPulse AI to explore a simple but powerful question:

"What if you could instantly see how different media outlets spin the same story?"

In a world flooded with headlines, bias, and misinformation, NewsPulse AI acts like an AI-powered research assistant. You ask a question, and it fetches, scrapes, analyzes, and visualizes fresh news articles in real time, just like a human researcher, but supercharged.

🚀 What It Does

  • 🔍 Enter any news-related query (e.g. farmers protest, AI in education, etc.)
  • 🌐 NewsPulse fetches real-time articles via Bright Data’s MCP scraping infrastructure
  • 🧠 LangChain & GPT-3.5 process each article for:
    • Sentiment (positive/neutral/negative)
    • Bias
    • Political lean
    • Toxicity and propaganda presence
  • 📊 Get aggregate insights and transparent logs instantly

What I Built

NewsPulse AI is an intelligent real-time news analysis engine that allows users to query any news-related topic and instantly receive a stream of analyzed articles. It mimics how a human researcher might discover, navigate, extract, and interact with web content—but does it entirely autonomously.

The project solves the problem of understanding media bias, misinformation, and sentiment across different news sources in real-time. Whether a user wants to explore how different platforms cover a political event, assess the emotional tone of news about a public figure, or analyze the presence of propaganda or toxicity in media, NewsPulse AI provides deep insights in seconds.
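
To make that concrete, here is a minimal sketch of what the per-article analysis step could look like with LangChain and GPT-3.5. The prompt wording, the analyzeArticle helper, and the exact JSON keys are illustrative assumptions, not our exact production chain:

import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";

// Illustrative sketch only: ask GPT-3.5 for a structured verdict per article
const model = new ChatOpenAI({ modelName: "gpt-3.5-turbo", temperature: 0 });

const analysisPrompt = PromptTemplate.fromTemplate(
  `Analyze the following news article and reply with JSON only, using the keys
"sentiment" (positive|neutral|negative), "bias" (low|medium|high),
"politicalLean" (left|center|right), "toxicity" (true|false) and
"propaganda" (true|false).

Title: {title}
Content: {content}`
);

// Hypothetical helper: run the chain for one scraped article
export async function analyzeArticle({ title, content }) {
  const response = await analysisPrompt.pipe(model).invoke({ title, content });
  return JSON.parse(response.content); // e.g. { sentiment: "negative", bias: "medium", ... }
}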

🔧 Deep Integration with FastMCP via STDIO

🔥 One of the key differentiators of our solution is that we have directly integrated Bright Data’s official MCP source code with our Node.js backend using FastMCP over standard input/output (STDIO).

We are:

  • Running the FastMCP server inside our Express (Node.js) environment
  • Communicating with it through STDIO (stdin/stdout)
  • Launching and managing scraping methods programmatically from Node.js

This level of integration is not just plug-and-play—it required fine-tuning STDIO communication and handling input/output streams carefully. But it makes our backend much more flexible and efficient, enabling real-time task execution without relying on HTTP or RPC overhead.

It also aligns perfectly with Bright Data’s vision for real-time AI agents interacting with the open web, giving us full control over tool orchestration, logging, and performance.
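
For illustration, here is a stripped-down sketch of how a runToolCall helper over STDIO can be wired up. The child-process path, the newline-delimited JSON framing, and the message shape are simplifying assumptions (the real MCP protocol is JSON-RPC, and our actual implementation handles errors, timeouts, and logging):

import { spawn } from "child_process";
import readline from "readline";

// Spawn the FastMCP server script as a child process (path is hypothetical)
const mcp = spawn("node", ["./mcp/server.js"], {
  env: { ...process.env, API_TOKEN: process.env.BRIGHTDATA_API_TOKEN },
  stdio: ["pipe", "pipe", "inherit"],
});

const pending = new Map(); // request id -> resolve callback
let nextId = 1;

// Read one JSON message per line from the server's stdout
readline.createInterface({ input: mcp.stdout }).on("line", (line) => {
  try {
    const msg = JSON.parse(line);
    if (pending.has(msg.id)) {
      pending.get(msg.id)(msg.result);
      pending.delete(msg.id);
    }
  } catch {
    // Ignore non-JSON output such as server logs
  }
});

// Send a tool invocation over stdin and await its result
export function runToolCall(name, args) {
  return new Promise((resolve) => {
    const id = nextId++;
    pending.set(id, resolve);
    mcp.stdin.write(JSON.stringify({ id, tool: name, arguments: args }) + "\n");
  });
}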

🛠️ Tech Stack & Architecture

Our project is built entirely on modern, open-source tools optimized for real-time web data extraction and analysis:

  • Frontend: React + Vite
  • Backend: Node.js + Express
  • AI Orchestration: LangChain + OpenAI GPT-3.5 Turbo
  • Scraping Infrastructure: Bright Data MCP (FastMCP Server)
  • Communication: REST APIs + WebSocket (for real-time logs & results; see the sketch below)
  • Deployment: Hosted on an EC2 instance with persistent STDIO-based communication
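
As a rough sketch of that WebSocket piece (the port, URL parameter, and message shape are assumptions, not our exact code), the backend pushes scraping and analysis logs to the browser while the REST API returns the final results:

import { WebSocketServer, WebSocket } from "ws";

// Hypothetical port; in our deployment this sits next to the Express server
const wss = new WebSocketServer({ port: 4003 });
const clients = new Map(); // userId -> socket

wss.on("connection", (socket, req) => {
  const userId = new URL(req.url, "http://localhost").searchParams.get("userId");
  clients.set(userId, socket);
  socket.on("close", () => clients.delete(userId));
});

// Called from the scraping/analysis pipeline to surface live, transparent logs
export function emitLog(userId, message) {
  const socket = clients.get(userId);
  if (socket && socket.readyState === WebSocket.OPEN) {
    socket.send(JSON.stringify({ type: "log", message, at: Date.now() }));
  }
}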

⚡ No Database Used:
We do not persist any data. Everything — from scraping to analysis — happens live from the open web. This ensures:

  • Real-time results for every query
  • No stale or outdated information
  • Transparent validation that scraping works on-demand
  • Compliance with the hackathon’s goal of showcasing live data access

This stateless, no-DB approach proves the reliability of Bright Data’s infrastructure in powering real-time AI agents without any dependency on pre-stored content.

Project Architecture

Demo

Video Demo

📁 GitHub Repo: https://github.com/sumankalia/news-pulse-ai

🌐 Live Site:
Frontend: http://ec2-16-170-239-65.eu-north-1.compute.amazonaws.com:5173/
Backend: http://ec2-16-170-239-65.eu-north-1.compute.amazonaws.com:4002/api/articles/ping

📸 Screenshots:

🔍 Search Input Interface – Users can enter any news-related query, such as “farmers protest” or “AI in education,” to fetch real-time news articles and analyze them instantly.
Search field for user query

⚙️ Real-Time Query Processing – The system uses Bright Data’s MCP server to fetch and analyze fresh articles from Indian news sources. Logs show live scraping and analysis updates for transparency and debugging.
Analyzing the query and showing logs

📰 Scraped Article Snapshot – A detailed preview of an individual article showing title, source, timestamp, and extracted summary. Each result is processed for bias, sentiment, and lean.
One scraped result from the list

📊 Sentiment, Bias, and Political Lean Breakdown – Each article undergoes NLP-based analysis to categorize tone (positive/neutral/negative), detect media bias, and predict political inclination.
All the sentiment analysis details

📈 Aggregate Insights Dashboard – Provides a summary of all fetched articles, highlighting sentiment distribution, bias frequency, and lean trends to help users quickly assess media coverage patterns.
Overall analysis of all the articles

How I Used Bright Data's Infrastructure

This project is powered by Bright Data’s MCP infrastructure, particularly the FastMCP server, which enables our AI agent to simulate human browsing behavior and extract structured information in real time. Here’s how the four key actions were implemented:

1. Discover

We use LangChain with a custom PromptTemplate to dynamically route user queries to one of four scraping methods:

  • scrape_as_article: Direct article URLs
  • scrape_a_homepage: Homepage or latest headline queries
  • search_via_google: Informational and broad-topic queries
  • search_via_bing: Dynamic, JS-heavy search result pages

switch (result.method) {
  case "scrape_as_article":
    data = await runToolCall("scrape_as_article", { url: query });
    break;
  case "scrape_a_homepage":
    data = await runToolCall("scrape_a_homepage", {
      url: result?.homepageUrl,
    });
    break;
  case "search_via_google":
    data = await runToolCall("search_via_google", { query });
    break;
  case "search_via_bing":
    processedResults = await searchViaBing({ query, userId });
    break;
}

LangChain decides the optimal route based on user intent, and our backend follows through using the selected method.
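
A simplified sketch of that routing step is below. The prompt text, the routeQuery helper, and the JSON shape are assumptions for illustration; our production prompt is more detailed:

import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";

// Double braces escape literal JSON braces inside a LangChain template
const routerPrompt = PromptTemplate.fromTemplate(
  `You route news queries to a scraping method. Given the user query below,
reply with JSON only: {{"method": "<scrape_as_article | scrape_a_homepage | search_via_google | search_via_bing>", "homepageUrl": "<url or null>"}}

Query: {query}`
);

// Hypothetical helper: returns e.g. { method: "search_via_google", homepageUrl: null }
export async function routeQuery(query) {
  const model = new ChatOpenAI({ modelName: "gpt-3.5-turbo", temperature: 0 });
  const response = await routerPrompt.pipe(model).invoke({ query });
  return JSON.parse(response.content);
}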

2. Access

We leverage Bright Data’s FastMCP to access dynamic and protected web pages like news homepages or search results.

Here’s what happens:

  • First, we load the target page (e.g., a news homepage or Bing/Google search results).
  • We then scrape all the article links visible on that page.
  • Next, we go article-by-article, scraping each one individually for full content and metadata.
  • This method helps us analyze multiple perspectives from a single page, all without hitting stale data or relying on pre-saved content.

This smart multi-link scraping approach powers our real-time insights across 50+ articles per query.

// Step 1: Navigate to the target page (homepage or search results)
await runToolCall("scraping_browser_navigate", {
  url: "https://www.newswebsite.com/",
});

// Step 2: Wait for articles to load on the page
await runToolCall("scraping_browser_wait_for", {
  selector: "article a[href]",
  timeout: 10000,
});

// Step 3: Extract all article links
const linksResult = await runToolCall("scraping_browser_links", {});

// Step 4: Iterate through each link and trigger detailed article scraping
for (const link of linksResult.links) {
  await runToolCall("scrape_as_article", { url: link });
}

3. Extract

We extract detailed article metadata, including:

  • title, content, url, published date, author, source, and image

Bright Data supports extraction via:

  • rawHtml – full HTML of the scraped content
  • markdown – clean, AI-friendly summary format
server.addTool({
  name: "scrape_a_homepage",
  description:
    "Scrape a single webpage URL with advanced options for " +
    "content extraction and get back the results in MarkDown language. " +
    "This tool can unlock any webpage even if it uses bot detection or " +
    "CAPTCHA.",
  parameters: z.object({ url: z.string().url() }),
  execute: tool_fn("scrape_a_homepage", async ({ url }) => {
    let response = await axios({
      url: "https://api.brightdata.com/request",
      method: "POST",
      data: {
        url,
        zone: unlocker_zone,
        format: "raw",
        data_format: "markdown",
      },
      headers: api_headers(),
      responseType: "text",
    });
    return response.data;
  }),
});

We utilized a combination of custom and prebuilt functions to clean the raw data and extract the necessary information for analysis.
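
As a rough example of that cleaning step (the field names and heuristics here are simplified assumptions, not the exact helpers we use), the markdown returned by the scraper is normalized into the fields the analysis chain expects:

// Simplified sketch: turn scraped markdown into { title, content, url, ... }
function cleanArticle(markdown, url) {
  const lines = markdown.split("\n").map((l) => l.trim()).filter(Boolean);

  // Assume the first heading is the title; everything else is body text
  const titleLine = lines.find((l) => l.startsWith("#")) || lines[0] || "";
  const title = titleLine.replace(/^#+\s*/, "");

  const content = lines
    .filter((l) => !l.startsWith("#"))
    .join(" ")
    .replace(/\[(.*?)\]\((.*?)\)/g, "$1") // strip markdown links, keep the anchor text
    .slice(0, 8000); // keep the prompt within GPT-3.5's context window

  return { title, content, url, scrapedAt: new Date().toISOString() };
}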

4. Interact

This is where Bright Data’s full capabilities come to life. In the search_via_bing method, we simulate a full browser interaction flow using MCP tools:

  1. Navigate to Bing.com
  2. Wait for the page to load
  3. Clear the input field
  4. Enter the search text
  5. Press Enter
  6. Wait ~5 seconds
  7. Wait for a result-related HTML selector to appear
  8. Scrape result links using scraping_browser_links

This closely mimics human behavior, allowing us to pull data from otherwise inaccessible or JS-heavy websites.

const url = "https://www.bing.com/news";
const searchSelector = "input#sb_form_q";
const searchText = query;
const processedResults = [];

// Navigate to the webpage using scraping_browser_navigate
await runToolCall("scraping_browser_navigate", {
  url,
});

// Wait for the search box to be available
await runToolCall("scraping_browser_wait_for", {
  selector: searchSelector,
  timeout: 10000,
});

// Clear the search field first using scraping_browser_type
await runToolCall("scraping_browser_type", {
  selector: searchSelector,
  text: "",
  submit: false,
});

// Type the search text using scraping_browser_type
await runToolCall("scraping_browser_type", {
  selector: searchSelector,
  text: searchText,
  submit: false,
});

// Press Enter to submit the search
await runToolCall("scraping_browser_press", {
  key: "Enter",
});

// Wait for the search results to load (let the network settle first)
await new Promise((resolve) => setTimeout(resolve, 3000));

// Wait for the search results container
await runToolCall("scraping_browser_wait_for", {
  selector: 'a.linkBtn[aria-label="Best match"]',
  timeout: 50000,
});

// Additional wait for the search results to be fully loaded
await new Promise((resolve) => setTimeout(resolve, 2000));

// Wait for article links to be present
await runToolCall("scraping_browser_wait_for", {
  selector: "article a[href*='/articles/']",
  timeout: 5000,
});

// Get all links from the page using scraping_browser_links
const linksResult = await runToolCall("scraping_browser_links", {});

Performance Improvements

Before integrating Bright Data’s MCP server, our initial architecture relied on:

  • Sequential proxy rotation using Bright Data’s Residential, Web Unlocker, and Mobile proxies
  • Headless browser automation via Puppeteer
  • Custom code for region-based rotation and JavaScript-rendered scraping

While functional, this method had significant drawbacks:

  • High latency per article (6–12 seconds)
  • Increased code complexity and maintenance overhead
  • Failures on JS-heavy or protected pages

🚀 Transition to Bright Data MCP (FastMCP Server)

By switching to the FastMCP server, we achieved:

  • ~80% reduction in scraping latency per article
  • 💡 Seamless access to protected and JavaScript-heavy sites with zero-code browser interaction
  • ⚙️ Lightweight, declarative scraping powered by STDIO communication between our Node.js backend and MCP server
  • 📦 Simplified architecture (less boilerplate, no Puppeteer or manual proxy handling)
  • ✅ Greater reliability across countries and site types — with support for region-specific scraping

We are now scraping 50+ articles across multiple countries in real time without bottlenecks or rate-limiting issues.

You can even compare it with our old Puppeteer-based proxy project here:
👉 Legacy Puppeteer Proxy Scraper: https://inspiring-taffy-5808f5.netlify.app/
and see how Bright Data MCP gave us a 10x better development and performance experience.

Future Improvements

We built NewsPulse AI to meet hackathon goals with a real-time, stateless architecture. For the next phase, we plan to:

  • Integrate a Vector DB (like Pinecone or Qdrant) to enable semantic search and avoid redundant scraping.
  • Add a Scalable Job Queue for handling spikes using tools like BullMQ or Redis.
  • Implement Auth & API Keys to support user-specific usage and rate-limiting.
  • Secure Secrets Properly, moving all credentials to secret managers for a real deployment.
  • Improve UX with features like sentiment trend visualizations, historical comparisons, and saved analyses.

These updates will make the platform more robust, scalable, and production-ready.

Final Notes

NewsPulse AI showcases how powerful AI agents become when paired with open, real-time, structured web data. We didn’t just build a tool—we built a thinking system that mimics human research patterns at internet speed.

Lovingly crafted by Suman and his wife Sarita. 💫

🙌 Shoutout
Big thanks to the team at Bright Data! Loved integrating your MCP platform.

If you're reading this and found Bright Data useful, give their repo some love:
🌟 https://github.com/luminati-io/brightdata-mcp

Comments (13)

  • Nevo David (May 25, 2025)

    Been cool seeing steady progress - it adds up. What do you think actually keeps things growing over time? Habits? Luck? Just showing up?

    • Suman Kumar (May 26, 2025)

      I think it's mostly habits and consistency. Even on days when things aren’t perfect, just showing up makes a difference over time.

  • Dotallio (May 26, 2025)

    Super impressive to see the real-time insights without any DB in the loop. Really curious, how do you handle scaling when queries spike up?

    • Suman Kumar (May 26, 2025)

      Absolutely — scaling is definitely something we’re thinking about for the next stage.

      Right now, the system is optimized to meet the hackathon goals: fully real-time, stateless, and DB-free, focusing on live data access and analysis. It performs well under moderate load and showcases the core value of Bright Data’s infrastructure.

      But for production-level traffic or query spikes, we’d definitely need to:

        • Implement request queues and concurrent workers
        • Add rate-limiting to protect both the system and the target sites
        • Possibly introduce a temporary cache layer (e.g., Redis) for recent results
        • Eventually move to autoscaling infrastructure like AWS Fargate or GCP Cloud Run

      So yes, the current setup is hackathon-ready — but scaling and load management are high on the roadmap as we evolve this into a production-grade tool. 🙌

  • Ranjan Dailata (May 26, 2025)

    The Github link in this blog post is broken.

  • Ranjan Dailata (May 26, 2025)

    Suggestion - After the news analysis gets completed, it would be great to programmatically scroll to the "Article Analysis" section.

    • Suman Kumar (May 26, 2025)

      Added the scroll brother, thanks

  • Shweta Kale (May 26, 2025)

    Loved the idea!!

    I had a question though – I noticed you used the API https://api.brightdata.com instead of @brightdata/mcp. How does that work? Does @brightdata/mcp use the API under the hood, or are they two separate things? In the documentation, the only method I saw was using @brightdata/mcp.

    • Suman Kumar (May 26, 2025)

      Heyy thanks! Glad you liked it 😄

      Yes, @brightdata/mcp uses the same API.

      So actually, I just took Bright Data’s FastMCP server code and plugged it into my Express backend directly. It’s basically doing the same thing as @brightdata/mcp, just manually. I'm still using api.brightdata.com under the hood, but with a bit more control over how things run.

  • Jin Park (May 27, 2025)

    Very interesting and impressive project!
    But what makes it different from, say, engines like ground.news?

    • Suman Kumar (May 28, 2025)

      Thanks for the great question! 🙌

      Ground News offers static, outlet-level bias insights.
      NewsPulse AI gives dynamic, live, article-level intelligence.

      Right now, we use OpenAI for content analysis, but we’re already planning to train our own models tailored for news sentiment, propaganda detection, and political bias — optimized for real-time media monitoring and transparency.

      • Jin Park (May 28, 2025)

        That's really cool!
        I am definitely going to keep an eye out for your continued development! :)
