WebCrawlAI: An AI-Powered Web Scraper Built Using Bright Data

Publish Date: Dec 28 '24


This is a submission for the Bright Data Web Scraping Challenge: Build a Web Scraper API to Solve Business Problems

What I Built

I created an AI-powered web scraper called WebCrawlAI.

It can scrape any type of data from a given website and return only the information you need.

Key Features:

Scrapes all kinds of data from websites.
Filters and provides only the relevant information based on your requirements.
Easy-to-use API for seamless integration into your projects.

Website:

Visit the live project here: WebCrawlAI

API Endpoint:

[POST]: https://webcrawlai.onrender.com/scrape-and-parse
Payload:

{
    "url": "",
    "parse_description": ""
}
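
A minimal sketch of calling this endpoint from Python with only the standard library. The endpoint URL and the two payload fields come from the article; sending the payload as JSON with a `Content-Type: application/json` header is an assumption, and the response format is not documented here, so the body is returned as plain text. The function names are hypothetical.

```python
import json
import urllib.request

# Endpoint as given in the article.
API_URL = "https://webcrawlai.onrender.com/scrape-and-parse"


def build_scrape_request(url: str, parse_description: str) -> urllib.request.Request:
    """Build the POST request carrying the two-field payload shown above."""
    payload = json.dumps({"url": url, "parse_description": parse_description})
    return urllib.request.Request(
        API_URL,
        data=payload.encode("utf-8"),
        # Assumption: the API expects a JSON body.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape_and_parse(url: str, parse_description: str, timeout: float = 120.0) -> str:
    """Send the request and return the raw response body as text.

    The response schema is not documented in the article, so nothing is
    decoded into a guessed structure here.
    """
    request = build_scrape_request(url, parse_description)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `scrape_and_parse("https://example.com", "all headings on the page")` would send one request to the live service, so it is worth keeping the timeout generous.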

Technologies Used:

Gemini API: For powerful AI capabilities.
Render: To deploy and host the project.
Flask (3.0.0): For building the web API.
BeautifulSoup (4.12.2): For parsing and extracting data from HTML.
Selenium (4.16.0): For automating web browsing and handling dynamic content.
lxml: For fast and efficient XML and HTML parsing.
html5lib: For parsing HTML documents in a web browser-like manner.
python-dotenv (1.0.0): For managing environment variables.
google-generativeai (0.3.1): For integrating AI-powered features into the scraper.

How It Solves a Business Problem

Web scraping is a critical tool for businesses that rely on large amounts of data.

However, scraping interactive or complex websites can be challenging. WebCrawlAI solves this problem by:

Automating the data extraction process.
Handling complex websites, including those with dynamic content or CAPTCHA challenges.
Providing clean and structured data ready for analysis.

Businesses can use this tool for market research, competitor analysis, price monitoring, content aggregation, and more.

It saves time, reduces manual effort, and ensures accurate results.

Demo

Check out the project live: WebCrawlAI
And the code: GitHub (github.com/ArjunCodess/WebCrawlAI)

Here’s a preview of how it works:

Input the website URL and a description of the data you want to extract.
The scraper fetches and parses the data, returning only the relevant results.
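
The two steps above could be sketched roughly as follows. This is not the project's actual code: it assumes the page text has already been fetched, that it is split into fixed-size chunks, and that each chunk is sent to the model together with the user's description. `ask_model` is a hypothetical callable standing in for the Gemini call, and the 6000-character chunk size is an arbitrary choice.

```python
def split_into_chunks(text: str, max_chars: int = 6000) -> list[str]:
    """Split page text into pieces small enough for one model request."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def build_prompt(chunk: str, parse_description: str) -> str:
    """Combine the user's description with one chunk of page text."""
    return (
        "Extract only the information matching this description: "
        f"{parse_description}\n"
        "Return an empty string if nothing matches.\n\n"
        f"Page content:\n{chunk}"
    )


def parse_page(text, parse_description, ask_model, max_chars=6000):
    """Run every chunk through ask_model and join the non-empty answers."""
    answers = []
    for chunk in split_into_chunks(text, max_chars):
        answer = ask_model(build_prompt(chunk, parse_description)).strip()
        if answer:
            answers.append(answer)
    return "\n".join(answers)
```

Passing the model call in as a parameter keeps the chunking and prompt logic testable without any API key.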

How I Used Bright Data

To extend WebCrawlAI's capabilities, I used Bright Data's Scraping Browser.

Here is what Bright Data contributed:

Automated Proxy Management: Ensures reliable connections and avoids blocks.
CAPTCHA Solving: Handles CAPTCHA challenges seamlessly.
Fully Hosted Browsers: Runs and scales Selenium scripts without the need for local infrastructure.
Zero Operational Overhead: No need to maintain scraping or browser infrastructure, allowing me to focus on the API's core functionality.
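
A hedged sketch of how a Selenium script might attach to a hosted browser like this. It assumes the service exposes a standard remote WebDriver endpoint protected by a username and password. The host value is a placeholder you would copy from your own Bright Data dashboard, and both function names are hypothetical. Only Selenium's generic `Remote` driver is used.

```python
from urllib.parse import quote


def build_webdriver_url(username: str, password: str, host: str) -> str:
    """Build an authenticated WebDriver URL; host is a placeholder you supply."""
    # Credentials are percent-encoded so special characters survive in the URL.
    return f"https://{quote(username, safe='')}:{quote(password, safe='')}@{host}"


def fetch_rendered_html(webdriver_url: str, target_url: str) -> str:
    """Open target_url in a remote browser and return the rendered HTML."""
    # Imported here so the helper above works without Selenium installed.
    from selenium.webdriver import ChromeOptions, Remote

    driver = Remote(command_executor=webdriver_url, options=ChromeOptions())
    try:
        driver.get(target_url)
        return driver.page_source
    finally:
        driver.quit()  # always release the remote session
```

Because the browser runs remotely, the same script works on a small host such as Render without a local Chrome install.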

Additional Prompts

My submission qualifies for:

Prompt 1: Scrape Data from Complex, Interactive Websites. WebCrawlAI excels at handling dynamic websites and interactive elements, making it a powerful solution for scraping even the most challenging sites.

Thank you for reviewing my submission!
I hope WebCrawlAI demonstrates the potential of combining AI and web scraping to solve real-world business challenges.

My Other Project

🚀 Excited to share Portify, the easiest way to create stunning portfolios in minutes!

Choose from sleek templates, customize effortlessly, and get a shareable link for your work. Perfect for developers, designers, and creatives.

Teaser Page: https://dub.sh/portify-teaser
GitHub: https://github.com/ArjunCodess/portify
Early Access: https://getportify.vercel.app (Create yours at /create!)

Comments 32 total

Kudzai MurimiDec 28, 2024
Great Job!
- Arjun Vijay PrakashDec 28, 2024
  Thanks a lot!
Anmol BaranwalDec 28, 2024
Nice 🔥
- Arjun Vijay PrakashDec 28, 2024
  Thanks, man!
𝚂𝚊𝚞𝚛𝚊𝚋𝚑 𝚁𝚊𝚒Dec 28, 2024
This is awesome @arjuncodess
- Arjun Vijay PrakashDec 29, 2024
  Thank you so much! Means a lot!
Data with JohnsonDec 28, 2024
AI web scraper idea is awesome. Where's the GitHub?
- Arjun Vijay PrakashDec 29, 2024
  Thank you! Glad you liked the idea!
  
  Oh, yeah, thanks for reminding me - just added it in the article.
  Here is the GitHub - github.com/ArjunCodess/WebCrawlAI
Rohith SinghDec 29, 2024
great tool for web scraping!!
- Arjun Vijay PrakashDec 29, 2024
  Thank you! Appreciate the support!
K Om Senapati Dec 29, 2024
Cool 🧊
- Arjun Vijay PrakashDec 29, 2024
  Thanks, bro! ❄️
Tanuj SharmaDec 29, 2024
Hello brother, it was not able to return any results when I used it against amazon.com. For your info, Amazon is one of the world's hardest websites to scrape, and the same goes for Walmart; both of these implement more than 5 anti-bot CAPTCHAs on their websites.
- Arjun Vijay PrakashDec 29, 2024
  I see. But you aren't giving it the unique product URL, I guess. Let me try it out both ways.
  
  As expected.
  
  For the next unique product URL example, I'm using this product: a.co/d/jdUb6sa
  
  And yes, it works!
  - Tanuj SharmaDec 29, 2024
    It works on a single product, I agree, but you are talking about using it for real business needs. In those scenarios crawling is done for more than 100 million products at once, and there it won't work.
    
    Here, you can try it yourself:
    
    Url : walmart.com/all-departments
    
    Input prompt : give me all category URLs, titles, and SKUs down to the deepest level of subcategories available. The final output would be a huge list of around 16k categories/subcategories, going up to 5 nested subcategories.
    - Arjun Vijay PrakashDec 29, 2024
      
      It works, brother.
      
      But yes, this issue remains:
      
      Any suggestions for this? (using gemini pro here)
      - Dustin WashingtonDec 29, 2024
        You basically need to tell Gemini to perform pagination with some notion of "max entries per JSON string" and specify a delimiter token you can use to find the JSON string boundaries. Then split the response on that delimiter, attempt to decode each section, and merge them until you hit something invalid within the Gemini context window.
        
        Gemini can't count exactly as it generates, but it will be approximately right, and this lets the API at least return a valid and usable response even if it doesn't contain literally every product on the page.
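
The splitting and merging half of this suggestion can be sketched as below. It assumes the prompt has asked the model to emit a JSON array per batch and to separate batches with a delimiter token. The `<<<END_CHUNK>>>` token and the function name are hypothetical.

```python
import json

# Hypothetical token the prompt asks the model to emit between JSON arrays.
DELIMITER = "<<<END_CHUNK>>>"


def merge_delimited_json(response_text: str, delimiter: str = DELIMITER) -> list:
    """Merge JSON arrays separated by delimiter, stopping at the first bad section."""
    merged = []
    for section in response_text.split(delimiter):
        section = section.strip()
        if not section:
            continue  # ignore empty pieces, e.g. after a trailing delimiter
        try:
            items = json.loads(section)
        except json.JSONDecodeError:
            break  # output was probably cut off here; keep what decoded so far
        if not isinstance(items, list):
            break  # unexpected shape; treat it like invalid output
        merged.extend(items)
    return merged
```

With this approach a response truncated mid-array still yields every complete batch before the cut, instead of failing to decode as a whole.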
        
        Arjun Vijay PrakashDec 30, 2024
        Thanks for the suggestion!
        I'll look into implementing this method to improve the functionality.
        Appreciate your insight, this is super helpful!
Tanuj SharmaDec 29, 2024
- Arjun Vijay PrakashDec 29, 2024
  Let me try it again:
  
  Walmart Product Link for testing: https://www.walmart.com/ip/Star-Wars-Force-N-Telling-Vader-Star-Wars-Toys-for-Kids-Ages-4-and-Up-Walmart-Exclusive/5254334148?classType=REGULAR&athbdg=L1600&sid=9f74642d-e12d-4e30-970b-914104b1f54b
  
  Response:
  
  First try:
  
  Second try:
  
  So yes, scraping some large websites doesn't really work on the first or even second try, but eventually, it does.
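
Since a request may only succeed on a later attempt, a caller could wrap it in a small retry helper. This is a generic sketch, not part of WebCrawlAI. The function name, the attempt count, and the doubling delay are all assumptions.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation until it returns a truthy result or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            result = operation()
            last_error = None
        except Exception as error:  # remember the failure and try again
            result, last_error = None, error
        if result:
            return result
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x ... base_delay
    if last_error is not None:
        raise last_error  # the final attempt failed with an exception
    return None  # every attempt returned an empty result
```

The `sleep` parameter exists only so the delays can be replaced in tests.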
  - Tanuj SharmaDec 29, 2024
    You are still not getting my point, brother. Don't take it as offensive; I have been working in large-scale web scraping, mostly in Python, and I have a pretty good understanding of what companies look for when it comes to large-scale web scraping. Anyway, it's a great project and I appreciate your efforts. I just got to know about this challenge, so let me come up with my submission for all three prompts; you can also regress my submission.
    - Arjun Vijay PrakashDec 29, 2024
      Got it, brother, and no offence taken at all!
      Really appreciate you sharing your expertise.
      - Tanuj SharmaDec 29, 2024
        Hello Brother again,
        
        I just reviewed your project codebase. You have simply given the HTML content to the Gemini Pro model, and this isn't going to work for large-scale web scraping. No matter which LLM you use, paid or free, none of them is capable of performing complex web scraping on its own; scraping is not a straightforward software engineering task, it's quite complicated. AI is not yet capable of parsing complicated information out of HTML. It can definitely work for websites that have a clear DOM structure, to be exact, those with clear names in div classes such as product_description/description, product_title/title, product_price/price. But for large-scale and sophisticated scraping you have to write generic web scraping code in core Python.
        
        You have to understand that any complex web scraping system has these components:
        
        1. Loading the Web Page:
        For this, one can use a dynamic browser such as Selenium, Bright Data Browser and other private browsers, or Playwright, or directly use raw HTTP libraries such as "requests", "urllib3", "httpx" and the holy "curl" CLI.
        
        2. Captcha Bypass (if strict rate limiting is in place):
        
        Possible workarounds:
        2.1. Use high-availability proxies such as Bright Data Proxies. I have used them in one of my large-scale web scraping projects and they are really good.
        2.2. Bypass CAPTCHAs using specifically crafted scripts for common CAPTCHAs such as Cloudflare, Imperva, Google ReCaptcha (v1, v2, v3, v4), GeeTest Slide Captcha (v1 to v4), the recent puzzle-piece-based CAPTCHAs, and lastly those where you move objects according to the direction shown in a static image.
        2.3. Solve them using paid CAPTCHA-solving APIs.
        
        3. Parse the HTML Source and Get the Required Data:
        One can parse the HTML source using the standard bs4 library, as well as the lxml library for parsing XML-based documents, for example a sitemap.xml.
        
        This parsing can be done using two methods :
        
        Static Parsing
        
        Dynamic Parsing
        
        For website-specific scrapers, one can use static parsing techniques, where the sections to extract are predefined by class names and HTML tag names that may or may not change over time.
        
        For generic use cases, one must use dynamic parsing techniques that implement a "parent-child-sibling" relationship-based scraping approach.
        
        So, in this scenario, one has to write generic code for each kind of page section using the dynamic parsing approach. It can't be done in a few lines of Python; this type of generic API requires several months of dedicated coding across different market domains.
        
        Only then can one develop a truly generic crawl API.
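
To make the static versus dynamic distinction concrete, here is a small sketch. It uses the standard library's ElementTree on well-formed sample markup so that it runs without extra packages; real pages would normally go through bs4 or lxml as mentioned above. The sample markup and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup. Note the meaningless wrapper
# classes ("a1", "b7") that a static selector could not rely on.
SAMPLE = """
<div>
  <div class="a1"><span>Title</span><span>Blue Kettle</span></div>
  <div class="b7"><span>Price</span><span class="product_price">19.99</span></div>
</div>
"""


def static_price(root):
    """Static parsing: rely on a known class name."""
    node = root.find(".//span[@class='product_price']")
    return node.text if node is not None else None


def value_after_label(root, label):
    """Dynamic parsing: find the label text, then read its next sibling."""
    for parent in root.iter():
        children = list(parent)
        # The last child has no next sibling, so it cannot be a label.
        for index, child in enumerate(children[:-1]):
            if (child.text or "").strip() == label:
                return (children[index + 1].text or "").strip()
    return None
```

The static version breaks as soon as the class name changes, while the sibling-based version keeps working as long as the visible label sits next to its value.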
        
        4. Scraping logic for handling different kinds of pagination, such as simple numbered-click pagination (e.g., 1 -> 2 -> 3 -> 4 ... max page), infinite-scroll pagination, and "Load More" button pagination.
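
Two of these pagination styles can be sketched as generic loops. The fetch callables are hypothetical stand-ins for whatever actually loads a page or a batch. The second loop assumes that a "Load More" or infinite-scroll listing hands back some token for the next batch, which is common but not universal.

```python
from typing import Callable, Iterator, Optional, Tuple


def crawl_numbered_pages(
    fetch_page: Callable[[int], list],
    max_pages: int,
) -> Iterator:
    """Yield items page by page until a page is empty or max_pages is reached."""
    for page_number in range(1, max_pages + 1):
        items = fetch_page(page_number)
        if not items:
            break  # an empty page is treated as the end of the listing
        yield from items


def crawl_with_cursor(
    fetch_batch: Callable[[Optional[str]], Tuple[list, Optional[str]]],
    max_batches: int,
) -> Iterator:
    """Yield items batch by batch, following the cursor each batch returns."""
    cursor = None  # None requests the first batch
    for _ in range(max_batches):
        items, cursor = fetch_batch(cursor)
        yield from items
        if cursor is None:
            break  # no further batch is available
```

Both loops take a hard upper bound so that a site with endless or cyclic pagination cannot keep the crawler running forever.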
        
        ***Note: Large-scale web scraping, i.e., real-world web scraping, often requires speedy execution. If one always uses the dynamic browser for every page load, it is going to take weeks to scrape even millions of pages, so one must know when to use dynamic web browsers and when to use raw HTTP-based browsing.
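
One simple way to act on this note is to try the cheap HTTP request first and fall back to a browser only when the response looks incomplete. The sketch below is one possible heuristic, not an established recipe. All three callables and the function name are hypothetical.

```python
def fetch_with_fallback(url, http_fetch, browser_fetch, looks_complete):
    """Try the cheap HTTP fetch first; use the browser only when needed.

    Returns a (html, source) tuple where source is "http" or "browser".
    """
    try:
        html = http_fetch(url)
    except OSError:
        html = None  # network-level failure: let the browser try
    if html is not None and looks_complete(html):
        return html, "http"
    return browser_fetch(url), "browser"
```

Here `looks_complete` might check for a marker that only appears once the data has rendered, such as a price element. Recording which path served each page also shows how often the expensive browser is really needed.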
        
        I hope this helps you see the big picture behind large-scale web scraping systems, and that anybody reading this huge comment learns the key aspects of real-world scraping.
        
        If you need any more clarification, let me know.
        
        Here is the snapshot of your code method:
        
        Thanks & Regards
        Tanuj Sharma
        
        Arjun Vijay PrakashDec 30, 2024
        Thank you so much for providing such a detailed and insightful explanation. 🙏
        
        You’re right - my current implementation is quite basic.
        While it works for simple, well-structured websites, I now see how it falls short for more complex or large-scale use cases.
        
        Thanks again for sharing your knowledge!
Sachin BahegavankarDec 30, 2024
Super Man, maybe there are some issues but I know you will fix those. Great brother keep it up 👍
- Arjun Vijay PrakashDec 30, 2024
  Thanks, brother!
  
  You're right, there is an issue right now: the scraper isn't working at the moment.
  
  I lost all my credits due to the unexpected attention this post received. I'm working on getting it back up and running soon.
ProCodersDec 30, 2024
Wow! Thank You!
- Arjun Vijay PrakashDec 30, 2024
  You're very welcome!
A. SurinaDec 30, 2024
I tried extraction of my name on Google as well as case information on Spokanecounty.org.

I couldn't get it to work at all.

It would be nice to pull the case info on a dirty lawyer operating in Spokane county. Thoughts ?
- Arjun Vijay PrakashDec 31, 2024
  I don’t think this aligns with the ethical or intended use of my project.
  
  Thanks for the comment. Have a good one.
Quân TrầnJan 2, 2025
Thank you for this useful sharing.
- Arjun Vijay PrakashJan 2, 2025
  Welcome! Glad you liked it!
