State of the Art BrightDataQL with the MCP Agent
Ranjan Dailata



Publish Date: May 22

This is a submission for the Bright Data AI Web Access Hackathon

Problem Statement

Large language models (LLMs) like GPT-4 and Gemini are incredibly capable, yet they often face a major limitation: a lack of access to fresh, structured, and reliable real-world data.

What I Built

BrightDataQL-Agent

Introducing the State-of-the-Art Bright Data MCP Agent - A next-generation web data intelligence system powered by Bright Data’s Model Context Protocol (MCP) and BrightDataQL (Bright Data Query Language), a schema-first, declarative query interface for extracting structured information.

The agent doesn't just "browse" or "scrape" the web. It discovers, navigates, interacts with, and extracts high-value information from complex, protected, or JavaScript-heavy web environments just as a human would, and transforms it into structured, machine-consumable output via a natural language interface.

New to Bright Data? Please make sure to sign up here - Bright Data

Technology Stack

  • Bright Data MCP (Model Context Protocol)
  • Python + asyncio
  • LangChain
  • Pydantic-ai
  • Logfire (Observability Platform)
  • LLM Agent (Google Gemini via LangChain)
  • MCP Agent Client (mcp_use)
  • Streamlit

Real-world problem

Most AI systems struggle to perform well in production environments because they rely on:

  • Outdated or incomplete information
  • Predefined APIs with limited access
  • Unstructured or noisy web data
  • Static models disconnected from real-time change

The web is a highly dynamic source of information, with vast amounts of content updated every day. AI's full potential is hindered unless it can access, understand, and use live, reliable data effectively.

Real-World Application Domains

This solution is already transforming industries like:

  • Healthcare – Extracting clinical trial data, provider info, and insurance rules

  • Finance – Aggregating loan offers, exchange rates, and investment opportunities

  • Real Estate – Collecting property listings, price histories, and agent info

  • Job Search – Mining remote job boards for fresh opportunities

  • Retail – Tracking competitive pricing and product availability in real time

Source Code

Please visit BrightDataQL-Agent to download the open-source code.

Demo

Real-estate use-case

Real-estate-demo

Amazon Scraping Browser Demo

Amazon-Scrapping-Browser

Behind the scenes

 Thought: I need to execute the steps in the provided BEGIN...END block. First, I will navigate to the Amazon website. Then, I will type "wireless earbuds" into the search bar and submit the search. After waiting for the search results to load, I will click on the first search result. Finally, I will extract the title, price, and description of the product and return them in a JSON format.

Action: scraping_browser_navigate
Action Input: {"url": "https://www.amazon.com"}
Observation: {'status': 'OK'}
Thought: Now I need to type "wireless earbuds" into the search bar.
Action: scraping_browser_type
Action Input: {"selector": "input[name='field-keywords']", "text": "wireless earbuds"}
Observation: {'status': 'OK'}
Thought: Now I need to click the submit button to start the search.
Action: scraping_browser_click
Action Input: {"selector": "input[type='submit']"}
Observation: {'status': 'OK'}
Thought: Now I need to wait for the search results to load.
Action: scraping_browser_wait_for
Action Input: {"selector": "div[data-component-type='s-search-result']"}
Observation: {'status': 'OK'}
Thought: Now I need to click on the first search result.
Action: scraping_browser_click
Action Input: {"selector": "div[data-component-type='s-search-result'] a.a-link-normal"}
Observation: {'status': 'OK'}
Thought: Now I need to wait for the product page to load.
Action: scraping_browser_wait_for
Action Input: {"selector": "#productTitle"}
Observation: {'status': 'OK'}
Thought: Now I need to get the product title.
Action: scraping_browser_get_text
Action Input: {"selector": "#productTitle"}
Observation: {'status': 'OK', 'text': 'TOZO T6 True Wireless Earbuds Bluetooth 5.3 Headphones Wireless Earphones IPX8 Waterproof Stereo Sound 
Headsets with Wireless Charging Case Black'}
Thought: Now I need to get the product price.
Action: scraping_browser_get_text
Action Input: {"selector": ".a-price .a-offscreen"}
Observation: {'status': 'OK', 'text': '$25.49'}
Thought: Now I need to get the product description.
Action: scraping_browser_get_text
Action Input: {"selector": "#feature-bullets"}
Observation: {'status': 'OK', 'text': '[\n    "\\ud83d\\udca8[Hi-Fi Stereo Sound Quality]: TOZO T6 Offers a truly authentic sound and powerful bass performance with 10 mm large size speaker driver - the drive area is 1.77 times than the normal drive area.",\n    "\\ud83d\\udca8[Bluetooth 5.3 Technology]: Adopt the advanced Bluetooth 5.3 technology. TOZO T6 Support HSP, HFP, A2DP, AVRCP, which greatly improve the transmission speed and providing you with a low-latency listening experience.",\n    "\\ud83d\\udca8[One step pairing]: Pick up 2 Headsets from Charging box They will connect each other automatically, then only one step easily enter mobile phone Bluetooth setting to pair the earbuds.",\n    "\\ud83d\\udca8[IPX8 Waterproof]: Earbuds and charging case inner Nano-coating makes it possible to waterproof for 1 meters deep for 30 minutes. Waterproof and sweatproof latest technology can effectively prevent from the damage of the splash by sweat and raindrops.",\n    "\\ud83d\\udca8[Charge on the go]: TOZO T6 wireless earbuds can last for over 6 hours’ playtime from a single charge and 30 extra hours in the compact charging case. Charging case support wireless charging and Type-C charging. Providing convenient charging way with no strings attached."\n]'}    
Thought: I have successfully extracted the title, price, and description of the product. Now I need to return them in a JSON format.
Final Answer: {"title": "TOZO T6 True Wireless Earbuds Bluetooth 5.3 Headphones Wireless Earphones IPX8 Waterproof Stereo Sound Headsets with 
Wireless Charging Case Black", "price": "$25.49", "description": "[\n    \"\\ud83d\\udca8[Hi-Fi Stereo Sound Quality]: TOZO T6 Offers a truly 
authentic sound and powerful bass performance with 10 mm large size speaker driver - the drive area is 1.77 times than the normal drive area.\",\n    \"\\ud83d\\udca8[Bluetooth 5.3 Technology]: Adopt the advanced Bluetooth 5.3 technology. TOZO T6 Support HSP, HFP, A2DP, AVRCP, which 
greatly improve the transmission speed and providing you with a low-latency listening experience.\",\n    \"\\ud83d\\udca8[One step pairing]: 
Pick up 2 Headsets from Charging box They will connect each other automatically, then only one step easily enter mobile phone Bluetooth setting to pair the earbuds.\",\n    \"\\ud83d\\udca8[IPX8 Waterproof]: Earbuds and charging case inner Nano-coating makes it possible to waterproof for 1 meters deep for 30 minutes. Waterproof and sweatproof latest technology can effectively prevent from the damage of the splash by sweat 
and raindrops.\",\n    \"\\ud83d\\udca8[Charge on the go]: TOZO T6 wireless earbuds can last for over 6 hours’ playtime from a single charge and 30 extra hours in the compact charging case. Charging case support wireless charging and Type-C charging. Providing convenient charging way with no strings attached.\"\n]"}

How I Used Bright Data's Infrastructure

Discover – Find relevant content across the open web
How it works: BrightDataQL prompts allow you to describe your discovery intent naturally (e.g., “Search Amazon for the best wireless earbuds”).

The underlying LLM interprets this and selects tools such as:

  • search_engine
  • scrape_as_html
  • scrape_as_markdown
  • scraping_browser_navigate
  • scraping_browser_click
  • scraping_browser_type
  • scraping_browser_wait_for
  • scraping_browser_get_text
  • scraping_browser_get_html
  • session_stats

Bright Data's residential and datacenter IPs make requests appear to come from real users, bypassing rate limits and bot detection.
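A discovery intent can be handed to the agent as a natural-language prompt that nudges the LLM toward the right tools. The sketch below is purely illustrative: `build_discovery_query` is a hypothetical helper, not part of Bright Data's SDK or the mcp_use library; only the tool names come from the list above.

```python
# Hypothetical sketch: composing a natural-language discovery prompt that
# steers the LLM-backed MCP agent toward Bright Data's discovery tools.
# build_discovery_query is an illustrative helper, not a library API.

DISCOVERY_TOOLS = ["search_engine", "scrape_as_markdown", "scrape_as_html"]

def build_discovery_query(intent, tools=None):
    """Render a discovery intent into a prompt the agent can act on."""
    tools = tools or DISCOVERY_TOOLS
    tool_hint = ", ".join(tools)
    return (
        f"Discovery intent: {intent}\n"
        f"Prefer these MCP tools: {tool_hint}.\n"
        "Return the findings as structured JSON."
    )

query = build_discovery_query("Search Amazon for the best wireless earbuds")
print(query)
```

The resulting string would then be passed to `agent.run(query)` exactly as in the examples later in this post.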

Access – Navigate even the most complex or protected websites

How it works: Bright Data offers proxy rotation, IP geolocation, and session management, making it ideal for sites that deploy:

  • Anti-bot JavaScript
  • Captchas
  • Behavioral analysis

The MCP agent infrastructure, with its scraping browser and server tools, ensures access is retained across multiple steps.

Extract – Pull structured, real-time data at scale

How it works: Bright Data's scraping browser, search, and Web Unlocker tools make data extraction straightforward on their own. BrightDataQL goes further: it lets you express your intent in plain, English-like language, which the MCP Agent turns into a query that extracts the relevant information with ease. By pairing LLMs with a BrightDataQL schema, structured data extraction becomes possible in real time.

Interact – Engage with dynamic, JavaScript-rendered pages as a human would

  • Bright Data uses headless browsers that execute JavaScript just like Chrome or Firefox.
  • Tools like scraping_browser_type, scraping_browser_click, and scraping_browser_wait_for simulate real human interactions, which are crucial for modern, SPA-based websites.
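These human-like interactions can be thought of as an ordered list of (tool, arguments) steps rendered into a BEGIN...END block like the queries shown in the Examples section below. The renderer here is a hypothetical sketch, not part of BrightDataQL itself; only the tool names and selectors come from this post.

```python
# Illustrative sketch (not a Bright Data API): representing the human-like
# interaction steps as (tool, args) pairs and rendering them into a
# BEGIN...END BrightDataQL-style block.

STEPS = [
    ("scraping_browser_navigate", ["https://www.amazon.com"]),
    ("scraping_browser_type", ["input[name='field-keywords']", "wireless earbuds"]),
    ("scraping_browser_click", ["input[type='submit']"]),
    ("scraping_browser_wait_for", ["div[data-component-type='s-search-result']"]),
]

def render_query(steps):
    """Render (tool, args) steps into a BEGIN...END query block."""
    lines = ["BEGIN"]
    for tool, args in steps:
        rendered_args = ", ".join(f'"{a}"' for a in args)
        lines.append(f"  {tool}({rendered_args})")
    lines.append("END")
    return "\n".join(lines)

print(render_query(STEPS))
```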

Examples

Here are a few real-world examples demonstrating the capabilities of BrightDataQL - an extremely user-friendly, easy-to-use query language designed for web data extraction.

The following simple example uses the MCP search and browser tools for structured data extraction.

import os
import asyncio

from mcp_agent.brightdata_mcp_agent import get_mcp_agent

async def main():
    max_steps = int(os.environ["MAX_MCP_AGENT_STEPS"])
    agent = get_mcp_agent(max_steps)

    query = """
BEGIN
  LET amazon = scraping_browser_navigate("https://www.amazon.com")
  THEN scraping_browser_type("input[name='field-keywords']", "wireless earbuds")
  THEN scraping_browser_click("input[type='submit']")
  THEN scraping_browser_wait_for("div[data-component-type='s-search-result']")
  THEN scraping_browser_click("div[data-component-type='s-search-result'] a.a-link-normal")
  THEN scraping_browser_wait_for("#productTitle")
  LET title = scraping_browser_get_text("#productTitle")
  LET price = scraping_browser_get_text(".a-price .a-offscreen")
  LET description = scraping_browser_get_text("#feature-bullets")

RETURN {
  "title": title,
  "price": price,
  "description": description
}
END
"""

    result = await agent.run(query)
    print("Structured Output:\n", result)

if __name__ == "__main__":
    asyncio.run(main())

Here's a slightly more complex extraction that leverages Bright Data's MCP browser tools to visit multiple websites, apply some logic, and return structured data.

import os
import asyncio

from mcp_agent.brightdata_mcp_agent import get_mcp_agent

async def main():
    max_steps = int(os.environ["MAX_MCP_AGENT_STEPS"])
    agent = get_mcp_agent(max_steps)

    query = """
BEGIN
  # Search on Amazon
  scraping_browser_navigate("https://www.amazon.com")
  scraping_browser_type("input[name='field-keywords']", "best wireless earbuds")
  scraping_browser_click("input[type='submit']")
  scraping_browser_wait_for("div[data-component-type='s-search-result']")
  scraping_browser_click("div[data-component-type='s-search-result']")
  scraping_browser_wait_for("#productTitle")

  LET amazon_name = scraping_browser_get_text("#productTitle")
  LET amazon_price = scraping_browser_get_text(".a-price .a-offscreen")
  LET amazon_rating = scraping_browser_get_text("span[data-asin][data-variation] span.a-icon-alt")

  # Search on Walmart
  scraping_browser_navigate("https://www.walmart.com")
  scraping_browser_type("input[name='query']", amazon_name)
  scraping_browser_click("button[type='submit']")
  scraping_browser_wait_for("a[data-type='itemTitles']")
  scraping_browser_click("a[data-type='itemTitles']")
  scraping_browser_wait_for("h1")

  LET walmart_name = scraping_browser_get_text("h1")
  LET walmart_price = scraping_browser_get_text("span[class*='price']")
  LET walmart_rating = scraping_browser_get_text("span[class*='stars']")

  # Logic to choose best deal
  LET best_store = IF(amazon_price < walmart_price, "Amazon", "Walmart")
  LET reason = IF(amazon_price < walmart_price, "Amazon has the lower price", "Walmart has the lower price")

RETURN {
  "amazon_product": {
    "name": amazon_name,
    "price": amazon_price,
    "rating": amazon_rating
  },
  "walmart_product": {
    "name": walmart_name,
    "price": walmart_price,
    "rating": walmart_rating
  },
  "recommendation": {
    "best_store": best_store,
    "reason": reason
  }
}
END
"""

    result = await agent.run(query)
    print("==== Structured Response ====")
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Next is a healthcare use case driven by the MCP Agent: structured data extraction while navigating the web, using a combination of Bright Data's MCP server and browser tools.

import os
import asyncio

from dotenv import load_dotenv
from mcp_agent.brightdata_mcp_agent import get_mcp_agent

async def main():
    # Load .env if needed
    load_dotenv()

    max_steps = int(os.environ["MAX_MCP_AGENT_STEPS"])

    # Create MCP Agent
    agent = get_mcp_agent(max_steps)

    url = "https://www.zocdoc.com/primary-care-doctors"

    schema = """
    {
        providers[] {
            name,
            specialty,
            rating,
            review_count,
            accepted_insurance[],
            appointments[] {
            date,
            time,
            mode,
            location,
            availability_status
            }
        }
    }
    """

    prompt = f"""
    You are a healthcare data extractor bot.

    Your task is to extract structured provider information and real-time appointment availability from this page:  
    {url}

    Use the schema below to return information on primary care providers available in New York City.  
    Simulate required interactions like location filters, next-week calendar navigation, and selecting virtual visits.

    Schema:
    {schema}

    Output in JSON format only.
    """

    # Run query
    result = await agent.run(
        prompt,
        max_steps=max_steps,
    )
    print(f"\nResult: {result}")

if __name__ == "__main__":
    asyncio.run(main())

Performance Improvements

BrightDataQL, combined with MCP, significantly enhances AI performance by enabling agents to operate on live, structured, and domain-specific web data. This improves accuracy, grounds responses in real-time evidence, reduces hallucination, and allows dynamic interaction with complex web pages, leading to faster, more relevant, and adaptive insights.

The following table summarizes the AI performance improvements delivered by BrightDataQL together with Bright Data's MCP Server.

| Area | Without BrightDataQL | BrightDataQL + MCP |
| --- | --- | --- |
| Accuracy | Based on stale or summarized data | Based on up-to-date, live structured data |
| Relevance | Generalized, generic responses | Tailored insights from domain-specific sources |
| Grounding | High hallucination risk | Data grounded in real-time web evidence |
| Comprehension | Struggles with web noise (ads, JS, etc.) | Interacts as a user + uses schema-guided parsing |
| Speed-to-Insight | Manual web scraping or API calls | Autonomous, real-time data extraction |
| Adaptability | Static training data | Dynamic context from any URL or topic |

Benefits of BrightDataQL in AI-Based Web Scraping

Improved Structure & Clarity

  • Schema-first approach: BrightDataQL defines what to extract clearly before any execution, improving precision.
  • Structured output: Ensures data is returned in a consistent, machine-readable format (like JSON).
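Because the agent promises a consistent, machine-readable reply, its output can be sanity-checked against the schema's required fields before downstream use. The sketch below is an illustrative stdlib-only helper (`validate_output` is not part of BrightDataQL); the field names mirror the Amazon example earlier in this post.

```python
import json

# Minimal shape check for the agent's structured JSON output. The required
# fields mirror the Amazon example in this post; validate_output is an
# illustrative helper, not part of BrightDataQL.

REQUIRED_FIELDS = ("title", "price", "description")

def validate_output(raw, required=REQUIRED_FIELDS):
    """Parse the agent's JSON reply and report any missing schema fields."""
    data = json.loads(raw)
    missing = [f for f in required if f not in data]
    return data, missing

sample = '{"title": "TOZO T6 True Wireless Earbuds", "price": "$25.49", "description": "..."}'
data, missing = validate_output(sample)
print(missing)  # an empty list when every required field is present
```

A check like this makes schema mismatches surface immediately instead of failing silently downstream.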

More Intelligent Extraction

  • Semantic alignment: Agents can match schema fields with page content using NLP, not just CSS/XPath.
  • Handles noise gracefully: Better at ignoring irrelevant page elements like ads, cookie banners, and modals.

Dynamic Adaptability

  • Schema reusability: Templates can be reused or extended easily for multiple websites or domains.
  • Resilience to HTML changes: BrightDataQL decouples logic from page layout, so minor DOM shifts don’t break extraction.

Faster Development & Debugging

  • Low-code/no-code UX: Developers specify desired data instead of writing brittle scraping logic.
  • Easy to debug: Schema mismatches or missing fields are easier to spot and correct.

Cross-Site Generalization

  • Domain-agnostic design: One BrightDataQL schema can work across multiple similar sites (e.g., Zillow, Redfin).
  • Fewer custom scrapers: Reduces need for site-specific parsers or scrapers.

Compliance & Ethics

  • Intent-aware extraction: Focuses only on relevant and permitted data.
  • Audit-friendly: BrightDataQL templates can be reviewed or version-controlled for transparency.

Better Integration with AI Models

  • Natural prompt pairing: Schemas become plug-and-play with LLM instructions, improving comprehension and output.
  • Real-time grounding: Data returned is grounded in what was actually found, reducing hallucinations.

Content Credits - The contents of this blog post were formatted with ChatGPT to make them more professional and polished for the target audience.

Cover Image Credit - Updated cover image featuring BrightDataQL and the MCP Agent, based on brightdata-mcp
