web-crawler
A simple web crawler written in Python that stores the metadata and main content of each crawled web page in a database.
Purpose and Functionality
The web crawler starts from a base URL, crawls the pages it finds, and extracts metadata such as the title, description, image, locale, and type, together with the main page content, storing this information in a MongoDB database. It can crawl multiple levels of depth and respects the robots.txt rules of the websites it visits.
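The repository's own crawler code is not reproduced here, but as a rough sketch of the behaviour described above, the robots.txt check and metadata extraction could look roughly like this. The helper names, the reliance on Open Graph (og:) meta tags, and the permissive fallback when robots.txt cannot be read are illustrative assumptions, not the project's actual implementation:

```python
import urllib.robotparser
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup


def is_allowed(url, user_agent="*"):
    # Respect robots.txt: only fetch URLs the site allows for our user agent.
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()
    except Exception:
        return True  # assumption: treat an unreadable robots.txt as permissive
    return parser.can_fetch(user_agent, url)


def extract_page(url):
    # Fetch the page and collect the metadata fields described above.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    def og(prop):
        # Read an Open Graph tag such as <meta property="og:type" content="...">.
        tag = soup.find("meta", property=f"og:{prop}")
        return tag["content"] if tag and tag.has_attr("content") else None

    return {
        "url": url,
        "title": soup.title.string.strip() if soup.title and soup.title.string else None,
        "description": og("description"),
        "image": og("image"),
        "locale": og("locale"),
        "type": og("type"),
        "content": soup.get_text(separator=" ", strip=True),
    }
```

A dictionary shaped like the one returned by extract_page can be inserted directly into a MongoDB collection with pymongo.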
Dependencies
The project requires the following dependencies:
- requests
- beautifulsoup4
- pymongo
You can install the dependencies using the following command:
pip install -r requirements.txt
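For reference, the requirements file is expected to list the three packages above; a minimal requirements.txt consistent with that list would look like the following (the repository's file may pin specific versions):

```
requests
beautifulsoup4
pymongo
```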
Setting Up and Running the Web Crawler
- Clone the repository:
git clone https://github.com/schBenedikt/web-crawler.git
cd web-crawler
- Install the dependencies:
pip install -r requirements.txt
- Ensure that MongoDB is running on your local machine. The web crawler connects to MongoDB at localhost:27017 and uses a database named search_engine (a quick connectivity check is sketched after this list).
- Run the web crawler:
python…
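Before starting a crawl, you can verify that a local MongoDB instance is reachable on the port the crawler expects. A minimal check with pymongo, using the search_engine database name mentioned above, might look like this:

```python
from pymongo import MongoClient

# Connect to the local MongoDB instance the crawler expects (localhost:27017).
client = MongoClient("localhost", 27017, serverSelectionTimeoutMS=2000)

# "ping" raises ServerSelectionTimeoutError if the server is not reachable.
client.admin.command("ping")

# The crawler stores its data in the search_engine database.
db = client["search_engine"]
print("MongoDB is reachable. Existing collections:", db.list_collection_names())
```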