web-crawler
A simple web crawler written in Python that stores the metadata and main content of each crawled web page in a database.
Purpose and Functionality
The web crawler starts from a base URL, crawls the pages it finds, and extracts metadata such as the title, description, image, locale, and type, together with the main page content, storing this information in a MongoDB database. It can crawl multiple levels of depth and respects the robots.txt rules of the websites it visits.
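The repository's own crawler code is not reproduced here, but as a rough sketch of the behaviour described above, the robots.txt check and metadata extraction could look roughly like this. The helper names, the reliance on Open Graph (og:) meta tags, and the permissive fallback when robots.txt cannot be read are illustrative assumptions, not the project's actual implementation:

```python
import urllib.robotparser
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup


def is_allowed(url, user_agent="*"):
    # Respect robots.txt: only fetch URLs the site allows for our user agent.
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()
    except Exception:
        return True  # assumption: treat an unreadable robots.txt as permissive
    return parser.can_fetch(user_agent, url)


def extract_page(url):
    # Fetch the page and collect the metadata fields described above.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    def og(prop):
        # Read an Open Graph tag such as <meta property="og:type" content="...">.
        tag = soup.find("meta", property=f"og:{prop}")
        return tag["content"] if tag and tag.has_attr("content") else None

    return {
        "url": url,
        "title": soup.title.string.strip() if soup.title and soup.title.string else None,
        "description": og("description"),
        "image": og("image"),
        "locale": og("locale"),
        "type": og("type"),
        "content": soup.get_text(separator=" ", strip=True),
    }
```

A dictionary shaped like the one returned by extract_page can be inserted directly into a MongoDB collection with pymongo.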
Dependencies
The project requires the following dependencies:
- requests
- beautifulsoup4
- pymongo
You can install the dependencies using the following command:
pip install -r requirements.txt
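For reference, the requirements file is expected to list the three packages above; a minimal requirements.txt consistent with that list would look like the following (the repository's file may pin specific versions):

```
requests
beautifulsoup4
pymongo
```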
Setting Up and Running the Web Crawler
- Clone the repository:
git clone https://github.com/schBenedikt/web-crawler.git
cd web-crawler
- Install the dependencies:
pip install -r requirements.txt
- Ensure that MongoDB is running on your local machine. The web crawler connects to MongoDB at localhost:27017 and uses a database named search_engine (a quick connectivity check is sketched after this list).
- Run the web crawler:
python…
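Before starting a crawl, you can verify that a local MongoDB instance is reachable on the port the crawler expects. A minimal check with pymongo, using the search_engine database name mentioned above, might look like this:

```python
from pymongo import MongoClient

# Connect to the local MongoDB instance the crawler expects (localhost:27017).
client = MongoClient("localhost", 27017, serverSelectionTimeoutMS=2000)

# "ping" raises ServerSelectionTimeoutError if the server is not reachable.
client.admin.command("ping")

# The crawler stores its data in the search_engine database.
db = client["search_engine"]
print("MongoDB is reachable. Existing collections:", db.list_collection_names())
```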