This is a submission for the Bright Data AI Web Access Hackathon
Problem Statement
Large language models (LLMs) like GPT-4 and Gemini are incredibly capable, yet they face a major limitation: a lack of access to fresh, structured, and reliable real-world data.
What I Built
Introducing the State-of-the-Art Bright Data MCP Agent - A next-generation web data intelligence system powered by Bright Data’s Model Context Protocol (MCP) and BrightDataQL (Bright Data Query Language), a schema-first, declarative query interface for extracting structured information.
The agent doesn't just "browse" or "scrape" the web. It discovers, navigates, interacts with, and extracts high-value information from complex, protected, or JavaScript-heavy web environments, just as a human would, and transforms it into structured, machine-consumable output through a natural language interface.
If you're new to Bright Data, please make sure to sign up here - Bright Data
Technology Stack
- Bright Data MCP (Model Context Protocol)
- Python + asyncio
- LangChain
- Pydantic-ai
- Logfire (Observability Platform)
- LLM Agent (Google Gemini via LangChain)
- MCP Agent Client (mcp_use)
- Streamlit
Real-world problem
Most AI systems struggle to perform well in production environments because they rely on:
- Outdated or incomplete information
- Predefined APIs with limited access
- Unstructured or noisy web data
- Static models disconnected from real-time change
The web is a dynamic, vast source of information that is updated every day. Unless an AI can access, understand, and use this live, reliable data effectively, its full potential is hindered.
Real-World Application Domains
This solution can transform industries such as:
Healthcare – Extracting clinical trial data, provider info, and insurance rules
Finance – Aggregating loan offers, exchange rates, and investment opportunities
Real Estate – Collecting property listings, price histories, and agent info
Job Search – Mining remote job boards for fresh opportunities
Retail – Tracking competitive pricing and product availability in real time
Source Code
Please visit BrightDataQL-Agent to download the open-source code.
Demo
Real-estate use-case
Amazon Scraping Browser Demo
Behind the scenes
Thought: I need to execute the steps in the provided BEGIN...END block. First, I will navigate to the Amazon website. Then, I will type "wireless earbuds" into the search bar and submit the search. After waiting for the search results to load, I will click on the first search result. Finally, I will extract the title, price, and description of the product and return them in a JSON format.
Action: scraping_browser_navigate
Action Input: {"url": "https://www.amazon.com"}
Observation: {'status': 'OK'}
Thought: Now I need to type "wireless earbuds" into the search bar.
Action: scraping_browser_type
Action Input: {"selector": "input[name='field-keywords']", "text": "wireless earbuds"}
Observation: {'status': 'OK'}
Thought: Now I need to click the submit button to start the search.
Action: scraping_browser_click
Action Input: {"selector": "input[type='submit']"}
Observation: {'status': 'OK'}
Thought: Now I need to wait for the search results to load.
Action: scraping_browser_wait_for
Action Input: {"selector": "div[data-component-type='s-search-result']"}
Observation: {'status': 'OK'}
Thought: Now I need to click on the first search result.
Action: scraping_browser_click
Action Input: {"selector": "div[data-component-type='s-search-result'] a.a-link-normal"}
Observation: {'status': 'OK'}
Thought: Now I need to wait for the product page to load.
Action: scraping_browser_wait_for
Action Input: {"selector": "#productTitle"}
Observation: {'status': 'OK'}
Thought: Now I need to get the product title.
Action: scraping_browser_get_text
Action Input: {"selector": "#productTitle"}
Observation: {'status': 'OK', 'text': 'TOZO T6 True Wireless Earbuds Bluetooth 5.3 Headphones Wireless Earphones IPX8 Waterproof Stereo Sound Headsets with Wireless Charging Case Black'}
Thought: Now I need to get the product price.
Action: scraping_browser_get_text
Action Input: {"selector": ".a-price .a-offscreen"}
Observation: {'status': 'OK', 'text': '$25.49'}
Thought: Now I need to get the product description.
Action: scraping_browser_get_text
Action Input: {"selector": "#feature-bullets"}
Observation: {'status': 'OK', 'text': '[\n "\\ud83d\\udca8[Hi-Fi Stereo Sound Quality]: TOZO T6 Offers a truly authentic sound and powerful bass performance with 10 mm large size speaker driver - the drive area is 1.77 times than the normal drive area.",\n "\\ud83d\\udca8[Bluetooth 5.3 Technology]: Adopt the advanced Bluetooth 5.3 technology. TOZO T6 Support HSP, HFP, A2DP, AVRCP, which greatly improve the transmission speed and providing you with a low-latency listening experience.",\n "\\ud83d\\udca8[One step pairing]: Pick up 2 Headsets from Charging box They will connect each other automatically, then only one step easily enter mobile phone Bluetooth setting to pair the earbuds.",\n "\\ud83d\\udca8[IPX8 Waterproof]: Earbuds and charging case inner Nano-coating makes it possible to waterproof for 1 meters deep for 30 minutes. Waterproof and sweatproof latest technology can effectively prevent from the damage of the splash by sweat and raindrops.",\n "\\ud83d\\udca8[Charge on the go]: TOZO T6 wireless earbuds can last for over 6 hours’ playtime from a single charge and 30 extra hours in the compact charging case. Charging case support wireless charging and Type-C charging. Providing convenient charging way with no strings attached."\n]'}
Thought: I have successfully extracted the title, price, and description of the product. Now I need to return them in a JSON format.
Final Answer: {"title": "TOZO T6 True Wireless Earbuds Bluetooth 5.3 Headphones Wireless Earphones IPX8 Waterproof Stereo Sound Headsets with Wireless Charging Case Black", "price": "$25.49", "description": "[\n \"\\ud83d\\udca8[Hi-Fi Stereo Sound Quality]: TOZO T6 Offers a truly authentic sound and powerful bass performance with 10 mm large size speaker driver - the drive area is 1.77 times than the normal drive area.\",\n \"\\ud83d\\udca8[Bluetooth 5.3 Technology]: Adopt the advanced Bluetooth 5.3 technology. TOZO T6 Support HSP, HFP, A2DP, AVRCP, which greatly improve the transmission speed and providing you with a low-latency listening experience.\",\n \"\\ud83d\\udca8[One step pairing]: Pick up 2 Headsets from Charging box They will connect each other automatically, then only one step easily enter mobile phone Bluetooth setting to pair the earbuds.\",\n \"\\ud83d\\udca8[IPX8 Waterproof]: Earbuds and charging case inner Nano-coating makes it possible to waterproof for 1 meters deep for 30 minutes. Waterproof and sweatproof latest technology can effectively prevent from the damage of the splash by sweat and raindrops.\",\n \"\\ud83d\\udca8[Charge on the go]: TOZO T6 wireless earbuds can last for over 6 hours’ playtime from a single charge and 30 extra hours in the compact charging case. Charging case support wireless charging and Type-C charging. Providing convenient charging way with no strings attached.\"\n]"}
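Because the trace ends with a JSON string, the result is immediately machine-consumable. Here's a quick sketch of parsing it downstream; the literal below is abbreviated from the trace above, and the field check is illustrative rather than part of the agent:

```python
import json

# Abbreviated version of the agent's Final Answer above (a JSON string).
final_answer = (
    '{"title": "TOZO T6 True Wireless Earbuds", '
    '"price": "$25.49", '
    '"description": "[...]"}'
)

# The structured output loads straight into a dictionary.
product = json.loads(final_answer)

# Sanity-check that every field requested from the agent came back.
missing = {"title", "price", "description"} - product.keys()
assert not missing, f"agent response missing fields: {missing}"

print(product["title"], "-", product["price"])
```

No HTML cleanup or regex scraping is needed on the consumer side, which is the point of asking the agent to return JSON in the first place.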
How I Used Bright Data's Infrastructure
Discover – Find relevant content across the open web
How it works: BrightDataQL prompts allow you to describe your discovery intent naturally (e.g., “Search Amazon for the best wireless earbuds”).
The underlying LLM interprets this intent and selects from tools such as:
- search_engine
- scrape_as_html
- scrape_as_markdown
- scraping_browser_navigate
- scraping_browser_click
- scraping_browser_type
- scraping_browser_wait_for
- scraping_browser_get_text
- scraping_browser_get_html
- session_stats
Bright Data's residential and datacenter IPs make requests appear to come from real users, helping bypass rate limits and bot detection.
Access – Navigate even the most complex or protected websites
How it works: Bright Data offers proxy rotation, IP geolocation, and session management, making it ideal for sites that deploy:
- Anti-bot JavaScript
- Captchas
- Behavioral analysis
The MCP agent infrastructure, with its scraping browser and server tools, ensures access is retained across multiple steps.
Extract – Pull structured, real-time data at scale
How it works: Bright Data's scraping browser, search, and Web Unlocker tools make data extraction straightforward on their own. BrightDataQL goes further: it lets you describe what you need in an English-like, schema-first language, which the MCP Agent turns into a query to extract the relevant information with ease. By pairing the LLM with a BrightDataQL schema, structured data extraction happens in real time.
Interact – Engage with dynamic, JavaScript-rendered pages as a human would
- Bright Data uses headless browsers that execute JavaScript just like Chrome or Firefox.
- Tools like scraping_browser_type, scraping_browser_click, and scraping_browser_wait_for simulate real human interactions, which are crucial for modern, SPA-based websites.
Examples
Here are a few real-world examples demonstrating the capabilities of BrightDataQL, a user-friendly, easy-to-work-with query language designed for web data extraction.
First, a simple example that uses the MCP search and browser tools for structured data extraction.
import asyncio
import os

from mcp_agent.brightdata_mcp_agent import get_mcp_agent


async def main():
    max_steps = int(os.environ["MAX_MCP_AGENT_STEPS"])
    agent = get_mcp_agent(max_steps)

    query = """
    BEGIN
        LET amazon = scraping_browser_navigate("https://www.amazon.com")
        THEN scraping_browser_type("input[name='field-keywords']", "wireless earbuds")
        THEN scraping_browser_click("input[type='submit']")
        THEN scraping_browser_wait_for("div[data-component-type='s-search-result']")
        THEN scraping_browser_click("div[data-component-type='s-search-result'] a.a-link-normal")
        THEN scraping_browser_wait_for("#productTitle")
        LET title = scraping_browser_get_text("#productTitle")
        LET price = scraping_browser_get_text(".a-price .a-offscreen")
        LET description = scraping_browser_get_text("#feature-bullets")
        RETURN {
            "title": title,
            "price": price,
            "description": description
        }
    END
    """

    result = await agent.run(query)
    print("Structured Output:\n", result)


if __name__ == "__main__":
    asyncio.run(main())
Here's a slightly more complex extraction that leverages Bright Data's MCP browser tools to visit multiple websites, apply some logic, and return structured data.
import asyncio
import os

from mcp_agent.brightdata_mcp_agent import get_mcp_agent


async def main():
    max_steps = int(os.environ["MAX_MCP_AGENT_STEPS"])
    agent = get_mcp_agent(max_steps)

    query = """
    BEGIN
        # Search on Amazon
        scraping_browser_navigate("https://www.amazon.com")
        scraping_browser_type("input[name='field-keywords']", "best wireless earbuds")
        scraping_browser_click("input[type='submit']")
        scraping_browser_wait_for("div[data-component-type='s-search-result']")
        scraping_browser_click("div[data-component-type='s-search-result']")
        scraping_browser_wait_for("#productTitle")
        LET amazon_name = scraping_browser_get_text("#productTitle")
        LET amazon_price = scraping_browser_get_text(".a-price .a-offscreen")
        LET amazon_rating = scraping_browser_get_text("span[data-asin][data-variation] span.a-icon-alt")

        # Search on Walmart
        scraping_browser_navigate("https://www.walmart.com")
        scraping_browser_type("input[name='query']", amazon_name)
        scraping_browser_click("button[type='submit']")
        scraping_browser_wait_for("a[data-type='itemTitles']")
        scraping_browser_click("a[data-type='itemTitles']")
        scraping_browser_wait_for("h1")
        LET walmart_name = scraping_browser_get_text("h1")
        LET walmart_price = scraping_browser_get_text("span[class*='price']")
        LET walmart_rating = scraping_browser_get_text("span[class*='stars']")

        # Logic to choose best deal
        LET best_store = IF(amazon_price < walmart_price, "Amazon", "Walmart")
        LET reason = IF(amazon_price < walmart_price, "Amazon has the lower price", "Walmart has the lower price")

        RETURN {
            "amazon_product": {
                "name": amazon_name,
                "price": amazon_price,
                "rating": amazon_rating
            },
            "walmart_product": {
                "name": walmart_name,
                "price": walmart_price,
                "rating": walmart_rating
            },
            "recommendation": {
                "best_store": best_store,
                "reason": reason
            }
        }
    END
    """

    result = await agent.run(query)
    print("==== Structured Response ====")
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
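One caveat worth noting about the IF(amazon_price < walmart_price, ...) step: the scraped prices come back as strings like "$25.49", so they have to be normalized to numbers before a comparison is meaningful. The agent's LLM typically handles this implicitly, but if you post-process the output yourself, the normalization looks roughly like the sketch below; the parse_price helper and the sample values are illustrative, not part of BrightDataQL:

```python
import re


def parse_price(text: str) -> float:
    """Extract a numeric value from a scraped price string like '$25.49'."""
    match = re.search(r"[\d,]+(?:\.\d+)?", text)
    if not match:
        raise ValueError(f"no numeric price found in {text!r}")
    return float(match.group().replace(",", ""))


# Illustrative scraped values, not live data.
amazon_price = parse_price("$25.49")
walmart_price = parse_price("$24.99")

best_store = "Amazon" if amazon_price < walmart_price else "Walmart"
print(best_store)  # Walmart
```

Comparing the raw strings instead would silently produce lexicographic ordering, which gives wrong answers for prices with different digit counts.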
Finally, a healthcare use case: the MCP Agent navigates the web using a combination of Bright Data's MCP server and browser tools to extract structured data.
import asyncio
import os

from dotenv import load_dotenv

from mcp_agent.brightdata_mcp_agent import get_mcp_agent


async def main():
    # Load .env if needed
    load_dotenv()
    max_steps = int(os.environ["MAX_MCP_AGENT_STEPS"])

    # Create MCP Agent
    agent = get_mcp_agent(max_steps)

    url = "https://www.zocdoc.com/primary-care-doctors"
    schema = """
    {
        providers[] {
            name,
            specialty,
            rating,
            review_count,
            accepted_insurance[],
            appointments[] {
                date,
                time,
                mode,
                location,
                availability_status
            }
        }
    }
    """
    prompt = f"""
    You are a healthcare data extractor bot.
    Your task is to extract structured provider information and real-time appointment availability from this page:
    {url}
    Use the schema below to return information on primary care providers available in New York City.
    Simulate required interactions like location filters, next-week calendar navigation, and selecting virtual visits.
    Schema:
    {schema}
    Output in JSON format only.
    """

    # Run query
    result = await agent.run(
        prompt,
        max_steps=max_steps,
    )
    print(f"\nResult: {result}")


if __name__ == "__main__":
    asyncio.run(main())
Performance Improvements
BrightDataQL, combined with MCP, significantly enhances AI performance by enabling agents to operate on live, structured, domain-specific web data. This improves accuracy, grounds responses in real-time evidence, reduces hallucination, and allows dynamic interaction with complex web pages, leading to faster, more relevant, and adaptive insights.
The following table summarizes the AI performance improvements delivered by BrightDataQL and Bright Data's MCP server.
| Area | Without BrightDataQL | BrightDataQL + MCP |
|---|---|---|
| Accuracy | Based on stale or summarized data | Based on up-to-date, live structured data |
| Relevance | Generalized, generic responses | Tailored insights from domain-specific sources |
| Grounding | High hallucination risk | Data grounded in real-time web evidence |
| Comprehension | Struggles with web noise (ads, JS, etc.) | Interacts as a user + uses schema-guided parsing |
| Speed-to-Insight | Manual web scraping or API calls | Autonomous, real-time data extraction |
| Adaptability | Static training data | Dynamic context from any URL or topic |
Benefits of BrightDataQL in AI-Based Web Scraping
Improved Structure & Clarity
- Schema-first approach: BrightDataQL defines what to extract clearly before any execution, improving precision.
- Structured output: Ensures data is returned in a consistent, machine-readable format (like JSON).
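The same schema-first discipline can be enforced on the consuming side as well. Here's a minimal sketch using a Python dataclass to validate the agent's JSON output against the fields a schema promised; the `Product` class and the sample payload are illustrative, and in this project the Pydantic models from the stack above could play the same role:

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Product:
    title: str
    price: str
    description: str


def validate(payload: str) -> Product:
    """Parse agent JSON output and fail loudly if a schema field is missing."""
    data = json.loads(payload)
    expected = {f.name for f in fields(Product)}
    missing = expected - data.keys()
    if missing:
        raise ValueError(f"missing schema fields: {missing}")
    return Product(**{k: data[k] for k in expected})


product = validate('{"title": "TOZO T6", "price": "$25.49", "description": "..."}')
print(product.title)  # TOZO T6
```

Validating at the boundary like this turns a silently malformed agent response into an immediate, debuggable error.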
More Intelligent Extraction
- Semantic alignment: Agents can match schema fields with page content using NLP, not just CSS/XPath.
- Handles noise gracefully: Better at ignoring irrelevant page elements like ads, cookie banners, and modals.
Dynamic Adaptability
- Schema reusability: Templates can be reused or extended easily for multiple websites or domains.
- Resilience to HTML changes: BrightDataQL decouples logic from page layout, so minor DOM shifts don’t break extraction.
Faster Development & Debugging
- Low-code/no-code UX: Developers specify desired data instead of writing brittle scraping logic.
- Easy to debug: Schema mismatches or missing fields are easier to spot and correct.
Cross-Site Generalization
- Domain-agnostic design: One BrightDataQL schema can work across multiple similar sites (e.g., Zillow, Redfin).
- Fewer custom scrapers: Reduces need for site-specific parsers or scrapers.
Compliance & Ethics
- Intent-aware extraction: Focuses only on relevant and permitted data.
- Audit-friendly: BrightDataQL templates can be reviewed or version-controlled for transparency.
Better Integration with AI Models
- Natural prompt pairing: Schemas become plug-and-play with LLM instructions, improving comprehension and output.
- Real-time grounding: Data returned is grounded in what was actually found, reducing hallucinations.
Content Credits - The contents of this blog post were formatted with ChatGPT to make them more professional and polished for the target audience.
Cover Image Credit - Updated cover image with the BrightDataQL and MCP Agent based on brightdata-mcp