🧠 NewsPulse AI – Real-Time News Analysis with LLMs & Web Scraping via Bright Data
Suman Kumar (@sumankalia) · Published May 25

This is a submission for the Bright Data AI Web Access Hackathon
We built NewsPulse AI to explore a simple but powerful question:

"What if you could instantly see how different media outlets spin the same story?"

In a world flooded with headlines, bias, and misinformation, NewsPulse AI acts like an AI-powered research assistant. You ask a question, and it fetches, scrapes, analyzes, and visualizes fresh news articles in real time, just like a human researcher, but supercharged.

🚀 What It Does

  • 🔍 Enter any news-related query (e.g. farmers protest, AI in education, etc.)
  • 🌐 NewsPulse fetches real-time articles via Bright Data’s MCP scraping infrastructure
  • 🧠 LangChain & GPT-3.5 process each article for:
    • Sentiment (positive/neutral/negative)
    • Bias
    • Political lean
    • Toxicity and propaganda presence
  • 📊 Get aggregate insights and transparent logs instantly

What I Built

NewsPulse AI is an intelligent real-time news analysis engine that allows users to query any news-related topic and instantly receive a stream of analyzed articles. It mimics how a human researcher might discover, navigate, extract, and interact with web content—but does it entirely autonomously.

The project solves the problem of understanding media bias, misinformation, and sentiment across different news sources in real-time. Whether a user wants to explore how different platforms cover a political event, assess the emotional tone of news about a public figure, or analyze the presence of propaganda or toxicity in media, NewsPulse AI provides deep insights in seconds.
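
To make that concrete, here is a minimal sketch of what the per-article analysis step could look like with LangChain and GPT-3.5. The prompt wording, the analyzeArticle helper, and the exact JSON keys are illustrative assumptions, not our exact production chain:

import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";

// Illustrative sketch only: ask GPT-3.5 for a structured verdict per article
const model = new ChatOpenAI({ modelName: "gpt-3.5-turbo", temperature: 0 });

const analysisPrompt = PromptTemplate.fromTemplate(
  `Analyze the following news article and reply with JSON only, using the keys
"sentiment" (positive|neutral|negative), "bias" (low|medium|high),
"politicalLean" (left|center|right), "toxicity" (true|false) and
"propaganda" (true|false).

Title: {title}
Content: {content}`
);

// Hypothetical helper: run the chain for one scraped article
export async function analyzeArticle({ title, content }) {
  const response = await analysisPrompt.pipe(model).invoke({ title, content });
  return JSON.parse(response.content); // e.g. { sentiment: "negative", bias: "medium", ... }
}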

🔧 Deep Integration with FastMCP via STDIO

🔥 One of the key differentiators of our solution is that we have directly integrated Bright Data’s official MCP source code with our Node.js backend using FastMCP over standard input/output (STDIO).

We are:

  • Running the FastMCP server inside our Express (Node.js) environment
  • Communicating with it through STDIO (stdin/stdout)
  • Launching and managing scraping methods programmatically from Node.js

This level of integration is not just plug-and-play—it required fine-tuning STDIO communication and handling input/output streams carefully. But it makes our backend much more flexible and efficient, enabling real-time task execution without relying on HTTP or RPC overhead.

It also aligns perfectly with Bright Data’s vision for real-time AI agents interacting with the open web, giving us full control over tool orchestration, logging, and performance.
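
For illustration, here is a stripped-down sketch of how a runToolCall helper over STDIO can be wired up. The child-process path, the newline-delimited JSON framing, and the message shape are simplifying assumptions (the real MCP protocol is JSON-RPC, and our actual implementation handles errors, timeouts, and logging):

import { spawn } from "child_process";
import readline from "readline";

// Spawn the FastMCP server script as a child process (path is hypothetical)
const mcp = spawn("node", ["./mcp/server.js"], {
  env: { ...process.env, API_TOKEN: process.env.BRIGHTDATA_API_TOKEN },
  stdio: ["pipe", "pipe", "inherit"],
});

const pending = new Map(); // request id -> resolve callback
let nextId = 1;

// Read one JSON message per line from the server's stdout
readline.createInterface({ input: mcp.stdout }).on("line", (line) => {
  try {
    const msg = JSON.parse(line);
    if (pending.has(msg.id)) {
      pending.get(msg.id)(msg.result);
      pending.delete(msg.id);
    }
  } catch {
    // Ignore non-JSON output such as server logs
  }
});

// Send a tool invocation over stdin and await its result
export function runToolCall(name, args) {
  return new Promise((resolve) => {
    const id = nextId++;
    pending.set(id, resolve);
    mcp.stdin.write(JSON.stringify({ id, tool: name, arguments: args }) + "\n");
  });
}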

🛠️ Tech Stack & Architecture

Our project is built entirely on modern, open-source tools optimized for real-time web data extraction and analysis:

  • Frontend: React + Vite
  • Backend: Node.js + Express
  • AI Orchestration: LangChain + OpenAI GPT-3.5 Turbo
  • Scraping Infrastructure: Bright Data MCP (FastMCP Server)
  • Communication: REST APIs + WebSocket (for real-time logs & results; see the sketch below)
  • Deployment: Hosted on an EC2 instance with persistent STDIO-based communication
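
As a rough sketch of that WebSocket piece (the port, URL parameter, and message shape are assumptions, not our exact code), the backend pushes scraping and analysis logs to the browser while the REST API returns the final results:

import { WebSocketServer, WebSocket } from "ws";

// Hypothetical port; in our deployment this sits next to the Express server
const wss = new WebSocketServer({ port: 4003 });
const clients = new Map(); // userId -> socket

wss.on("connection", (socket, req) => {
  const userId = new URL(req.url, "http://localhost").searchParams.get("userId");
  clients.set(userId, socket);
  socket.on("close", () => clients.delete(userId));
});

// Called from the scraping/analysis pipeline to surface live, transparent logs
export function emitLog(userId, message) {
  const socket = clients.get(userId);
  if (socket && socket.readyState === WebSocket.OPEN) {
    socket.send(JSON.stringify({ type: "log", message, at: Date.now() }));
  }
}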

⚡ No Database Used:
We do not persist any data. Everything — from scraping to analysis — happens live from the open web. This ensures:

  • Real-time results for every query
  • No stale or outdated information
  • Transparent validation that scraping works on-demand
  • Compliance with the hackathon’s goal of showcasing live data access

This stateless, no-DB approach proves the reliability of Bright Data’s infrastructure in powering real-time AI agents without any dependency on pre-stored content.

Project Architecture

Demo

Video Demo

📁 GitHub Repo: https://github.com/sumankalia/news-pulse-ai

🌐 Live Site:
Frontend: http://ec2-16-170-239-65.eu-north-1.compute.amazonaws.com:5173/
Backend: http://ec2-16-170-239-65.eu-north-1.compute.amazonaws.com:4002/api/articles/ping

📸 Screenshots:

🔍 Search Input Interface – Users can enter any news-related query, such as “farmers protest” or “AI in education,” to fetch real-time news articles and analyze them instantly.
Search field for user query

⚙️ Real-Time Query Processing – The system uses Bright Data’s MCP server to fetch and analyze fresh articles from Indian news sources. Logs show live scraping and analysis updates for transparency and debugging.
Analyzing the query and showing logs

📰 Scraped Article Snapshot – A detailed preview of an individual article showing title, source, timestamp, and extracted summary. Each result is processed for bias, sentiment, and lean.
One scraped result from the list

📊 Sentiment, Bias, and Political Lean Breakdown – Each article undergoes NLP-based analysis to categorize tone (positive/neutral/negative), detect media bias, and predict political inclination.
All the sentiment analysis details

📈 Aggregate Insights Dashboard – Provides a summary of all fetched articles, highlighting sentiment distribution, bias frequency, and lean trends to help users quickly assess media coverage patterns.
Overall analysis of all the articles

How I Used Bright Data's Infrastructure

This project is powered by Bright Data’s MCP infrastructure, particularly the FastMCP server, which enables our AI agent to simulate human browsing behavior and extract structured information in real time. Here’s how the four key actions were implemented:

1. Discover

We use LangChain with a custom PromptTemplate to dynamically route user queries to one of four scraping methods:

  • scrape_as_article: Direct article URLs
  • scrape_a_homepage: Homepage or latest headline queries
  • search_via_google: Informational and broad-topic queries
  • search_via_bing: Dynamic, JS-heavy search result pages

switch (result.method) {
  case "scrape_as_article":
    data = await runToolCall("scrape_as_article", { url: query });
    break;
  case "scrape_a_homepage":
    data = await runToolCall("scrape_a_homepage", {
      url: result?.homepageUrl,
    });
    break;
  case "search_via_google":
    data = await runToolCall("search_via_google", { query });
    break;
  case "search_via_bing":
    processedResults = await searchViaBing({ query, userId });
    break;
}

LangChain decides the optimal route based on user intent, and our backend follows through using the selected method.
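
A simplified sketch of that routing step is below. The prompt text, the routeQuery helper, and the JSON shape are assumptions for illustration; our production prompt is more detailed:

import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";

// Double braces escape literal JSON braces inside a LangChain template
const routerPrompt = PromptTemplate.fromTemplate(
  `You route news queries to a scraping method. Given the user query below,
reply with JSON only: {{"method": "<scrape_as_article | scrape_a_homepage | search_via_google | search_via_bing>", "homepageUrl": "<url or null>"}}

Query: {query}`
);

// Hypothetical helper: returns e.g. { method: "search_via_google", homepageUrl: null }
export async function routeQuery(query) {
  const model = new ChatOpenAI({ modelName: "gpt-3.5-turbo", temperature: 0 });
  const response = await routerPrompt.pipe(model).invoke({ query });
  return JSON.parse(response.content);
}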

2. Access

We leverage Bright Data’s FastMCP to access dynamic and protected web pages like news homepages or search results.

Here’s what happens:

  • First, we load the target page (e.g., a news homepage or Bing/Google search results).
  • We then scrape all the article links visible on that page.
  • Next, we go article-by-article, scraping each one individually for full content and metadata.
  • This method helps us analyze multiple perspectives from a single page, all without hitting stale data or relying on pre-saved content.

This smart multi-link scraping approach powers our real-time insights across 50+ articles per query.

// Step 1: Navigate to the target page (homepage or search results)
await runToolCall("scraping_browser_navigate", {
  url: "https://www.newswebsite.com/",
});

// Step 2: Wait for articles to load on the page
await runToolCall("scraping_browser_wait_for", {
  selector: "article a[href]",
  timeout: 10000,
});

// Step 3: Extract all article links
const linksResult = await runToolCall("scraping_browser_links", {});

// Step 4: Iterate through each link and trigger detailed article scraping
for (const link of linksResult.links) {
  await runToolCall("scrape_as_article", { url: link });
}

3. Extract

We extract detailed article metadata, including:

  • title, content, url, published date, author, source, and image

Bright Data supports extraction via:

  • rawHtml – full HTML of the scraped content
  • markdown – clean, AI-friendly summary format
server.addTool({
  name: "scrape_a_homepage",
  description:
    "Scrape a single webpage URL with advanced options for " +
    "content extraction and get back the results in MarkDown language. " +
    "This tool can unlock any webpage even if it uses bot detection or " +
    "CAPTCHA.",
  parameters: z.object({ url: z.string().url() }),
  execute: tool_fn("scrape_a_homepage", async ({ url }) => {
    let response = await axios({
      url: "https://api.brightdata.com/request",
      method: "POST",
      data: {
        url,
        zone: unlocker_zone,
        format: "raw",
        data_format: "markdown",
      },
      headers: api_headers(),
      responseType: "text",
    });
    return response.data;
  }),
});

We utilized a combination of custom and prebuilt functions to clean the raw data and extract the necessary information for analysis.
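
As a rough example of that cleaning step (the field names and heuristics here are simplified assumptions, not the exact helpers we use), the markdown returned by the scraper is normalized into the fields the analysis chain expects:

// Simplified sketch: turn scraped markdown into { title, content, url, ... }
function cleanArticle(markdown, url) {
  const lines = markdown.split("\n").map((l) => l.trim()).filter(Boolean);

  // Assume the first heading is the title; everything else is body text
  const titleLine = lines.find((l) => l.startsWith("#")) || lines[0] || "";
  const title = titleLine.replace(/^#+\s*/, "");

  const content = lines
    .filter((l) => !l.startsWith("#"))
    .join(" ")
    .replace(/\[(.*?)\]\((.*?)\)/g, "$1") // strip markdown links, keep the anchor text
    .slice(0, 8000); // keep the prompt within GPT-3.5's context window

  return { title, content, url, scrapedAt: new Date().toISOString() };
}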

4. Interact

This is where Bright Data’s full capabilities come to life. In the search_via_bing method, we simulate a full browser interaction flow using MCP tools:

  1. Navigate to Bing.com
  2. Wait for the page to load
  3. Clear the input field
  4. Enter the search text
  5. Press Enter
  6. Wait ~5 seconds
  7. Wait for a result-related HTML selector to appear
  8. Scrape result links using scraping_browser_links

This closely mimics human behavior, allowing us to pull data from otherwise inaccessible or JS-heavy websites.

const url = "https://www.bing.com/news";
const searchSelector = "input#sb_form_q";
const searchText = query;
const processedResults = [];

// Navigate to the webpage using scraping_browser_navigate
await runToolCall("scraping_browser_navigate", {
  url,
});

// Wait for the search box to be available
await runToolCall("scraping_browser_wait_for", {
  selector: searchSelector,
  timeout: 10000,
});

// Clear the search field first using scraping_browser_type
await runToolCall("scraping_browser_type", {
  selector: searchSelector,
  text: "",
  submit: false,
});

// Type the search text using scraping_browser_type
await runToolCall("scraping_browser_type", {
  selector: searchSelector,
  text: searchText,
  submit: false,
});

// Press Enter to submit the search
await runToolCall("scraping_browser_press", {
  key: "Enter",
});

// Wait for the search results to load (let the network settle first)
await new Promise((resolve) => setTimeout(resolve, 3000));

// Wait for the search results container
await runToolCall("scraping_browser_wait_for", {
  selector: 'a.linkBtn[aria-label="Best match"]',
  timeout: 50000,
});

// Additional wait for the search results to be fully loaded
await new Promise((resolve) => setTimeout(resolve, 2000));

// Wait for article links to be present
await runToolCall("scraping_browser_wait_for", {
  selector: "article a[href*='/articles/']",
  timeout: 5000,
});

// Get all links from the page using scraping_browser_links
const linksResult = await runToolCall("scraping_browser_links", {});

Performance Improvements

Before integrating Bright Data’s MCP server, our initial architecture relied on:

  • Sequential proxy rotation using Bright Data’s Residential, Web Unlocker, and Mobile proxies
  • Headless browser automation via Puppeteer
  • Custom code for region-based rotation and JavaScript-rendered scraping

While functional, this method had significant drawbacks:

  • High latency per article (6–12 seconds)
  • Increased code complexity and maintenance overhead
  • Failures on JS-heavy or protected pages

🚀 Transition to Bright Data MCP (FastMCP Server)

By switching to the FastMCP server, we achieved:

  • ~80% reduction in scraping latency per article
  • 💡 Seamless access to protected and JavaScript-heavy sites with zero-code browser interaction
  • ⚙️ Lightweight, declarative scraping powered by STDIO communication between our Node.js backend and MCP server
  • 📦 Simplified architecture (less boilerplate, no Puppeteer or manual proxy handling)
  • ✅ Greater reliability across countries and site types — with support for region-specific scraping

We are now scraping 50+ articles across multiple countries in real time without bottlenecks or rate-limiting issues.

You can even compare it with our old Puppeteer-based proxy project here:
👉 Legacy Puppeteer Proxy Scraper: https://inspiring-taffy-5808f5.netlify.app/
and see how Bright Data MCP gave us a 10x better development and performance experience.

Future Improvements

We built NewsPulse AI to meet hackathon goals with a real-time, stateless architecture. For the next phase, we plan to:

  • Integrate a Vector DB (like Pinecone or Qdrant) to enable semantic search and avoid redundant scraping.
  • Add a Scalable Job Queue for handling spikes using tools like BullMQ or Redis.
  • Implement Auth & API Keys to support user-specific usage and rate-limiting.
  • Secure Secrets Properly, moving all credentials to secret managers for a real deployment.
  • Improve UX with features like sentiment trend visualizations, historical comparisons, and saved analyses.

These updates will make the platform more robust, scalable, and production-ready.

Final Notes

NewsPulse AI showcases how powerful AI agents become when paired with open, real-time, structured web data. We didn’t just build a tool—we built a thinking system that mimics human research patterns at internet speed.

Lovingly crafted by Suman and his wife Sarita. 💫

🙌 Shoutout
Big thanks to the team at Bright Data! Loved integrating your MCP platform.

If you're reading this and found Bright Data useful, give their repo some love:
🌟 https://github.com/luminati-io/brightdata-mcp

Comments (13)

  • Nevo David (May 25, 2025)

    Been cool seeing steady progress - it adds up. What do you think actually keeps things growing over time? Habits? Luck? Just showing up?

    • Suman Kumar (May 26, 2025)

      I think it's mostly habits and consistency. Even on days when things aren’t perfect, just showing up makes a difference over time.

  • Dotallio (May 26, 2025)

    Super impressive to see the real-time insights without any DB in the loop. Really curious, how do you handle scaling when queries spike up?

    • Suman Kumar (May 26, 2025)

      Absolutely — scaling is definitely something we’re thinking about for the next stage.

      Right now, the system is optimized to meet the hackathon goals: fully real-time, stateless, and DB-free, focusing on live data access and analysis. It performs well under moderate load and showcases the core value of Bright Data’s infrastructure.

      But for production-level traffic or query spikes, we’d definitely need to:

        • Implement request queues and concurrent workers
        • Add rate-limiting to protect both the system and the target sites
        • Possibly introduce a temporary cache layer (e.g., Redis) for recent results
        • Eventually move to autoscaling infrastructure like AWS Fargate or GCP Cloud Run

      So yes, the current setup is hackathon-ready — but scaling and load management are high on the roadmap as we evolve this into a production-grade tool. 🙌

  • Ranjan Dailata (May 26, 2025)

    The Github link in this blog post is broken.

  • Ranjan Dailata (May 26, 2025)

    Suggestion - After the news analysis gets completed, it would be great to programmatically scroll to the "Article Analysis" section.

    • Suman Kumar (May 26, 2025)

      Added the scroll brother, thanks

  • Shweta Kale (May 26, 2025)

    Loved the idea!!

    I had a question though – I noticed you used the API https://api.brightdata.com instead of @brightdata/mcp. How does that work? Does @brightdata/mcp use the API under the hood, or are they two separate things? In the documentation, the only method I saw was using @brightdata/mcp.

    • Suman Kumar (May 26, 2025)

      Heyy thanks! Glad you liked it 😄

      Yes, @brightdata/mcp uses the same API.

      So actually, I just took Bright Data’s FastMCP server code and plugged it into my Express backend directly. It’s basically doing the same thing as @brightdata/mcp, just manually. I'm still using api.brightdata.com under the hood, but with a bit more control over how things run.

  • Jin Park (May 27, 2025)

    Very interesting and impressive project!
    But what makes it different from, say, engines like ground.news?

    • Suman Kumar (May 28, 2025)

      Thanks for the great question! 🙌

      Ground News offers static, outlet-level bias insights.
      NewsPulse AI gives dynamic, live, article-level intelligence.

      Right now, we use OpenAI for content analysis, but we’re already planning to train our own models tailored for news sentiment, propaganda detection, and political bias — optimized for real-time media monitoring and transparency.

      • Jin Park (May 28, 2025)

        That's really cool!
        I am definitely going to keep an eye out for your continued development! :)
