I'm excited to share some new features and improvements in my custom search engine project. This search engine is designed to work seamlessly with my web crawler, providing efficient and accurate search results. Let's dive into the latest updates!
@aminnairi has already asked why I don't use a NoSQL database. The search engine and the web crawler now use MongoDB as a NoSQL database, which leads to faster search results.
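As a rough illustration of why this helps, here is a minimal sketch of a MongoDB text index over the crawled pages; the collection name `pages` and the indexed fields are assumptions for this example, not confirmed details of the project.

```python
from pymongo import MongoClient, TEXT

# Connect to the local MongoDB instance the crawler writes to.
client = MongoClient("localhost", 27017)
db = client["search_engine"]

# Hypothetical collection name; a text index over title and description
# lets MongoDB answer keyword queries without scanning every document.
db.pages.create_index([("title", TEXT), ("description", TEXT)])

# A keyword search then becomes a single indexed query.
results = db.pages.find({"$text": {"$search": "python tutorial"}})
```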
The AI is still a Llama model, now llama-3.3-70b.
In the search results, you can right-click to display a preview of the website. In addition, favicons are only loaded once all search results have been loaded successfully. They are cached locally for a short time so that they do not have to be retrieved again on every search.
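A minimal sketch of what such a temporary favicon cache could look like; the cache directory and the direct /favicon.ico fetch are assumptions for illustration, not the project's actual implementation.

```python
import os
import requests
from urllib.parse import urlparse

CACHE_DIR = "favicon_cache"  # hypothetical local cache directory

def get_favicon(page_url: str) -> bytes:
    """Return the favicon for a page, fetching it only on a cache miss."""
    domain = urlparse(page_url).netloc
    path = os.path.join(CACHE_DIR, f"{domain}.ico")
    if os.path.exists(path):  # already cached: no network request needed
        with open(path, "rb") as f:
            return f.read()
    os.makedirs(CACHE_DIR, exist_ok=True)
    resp = requests.get(f"https://{domain}/favicon.ico", timeout=5)
    resp.raise_for_status()
    with open(path, "wb") as f:  # store temporarily for later searches
        f.write(resp.content)
    return resp.content
```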
The databases can now be managed via the settings page, and there is now also the option to add multiple databases at the same time. When a search is made, the system checks whether a website is saved in more than one database so that the same website is not displayed more than once.
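A minimal sketch of how such cross-database deduplication could work, assuming each result is a dictionary with a url field; the field name is an assumption for this example.

```python
def deduplicate(results):
    """Keep only the first occurrence of each URL across all databases."""
    seen = set()
    unique = []
    for result in results:
        url = result["url"]  # assumed field name for the page address
        if url not in seen:
            seen.add(url)
            unique.append(result)
    return unique

# Example: the same site stored in two databases appears only once.
merged = deduplicate([
    {"url": "https://example.com", "title": "Example"},
    {"url": "https://example.com", "title": "Example"},
])
```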
This brings us to the filter functions:
The metadata is used to retrieve the various website types, which can then be used for filtering. However, since the same type may appear under variant spellings (for example "website" and "Website"), these can be combined into an "all websites" type.
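A minimal sketch of how variant type labels could be folded into one filter category; the normalization rule here is an illustrative assumption, not the project's exact logic.

```python
def normalize_type(page_type: str) -> str:
    """Fold variant spellings (e.g. "website", "Website") into one category."""
    if page_type.strip().lower() == "website":
        return "all websites"
    return page_type
```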
web-crawler
A simple web crawler using Python that stores the metadata and main content of each web page in a database.
Purpose and Functionality
The web crawler is designed to crawl web pages starting from a base URL, extract metadata such as title, description, image, locale, and type, along with the main content, and store this information in a MongoDB database. The crawler can handle multiple levels of depth and respects the robots.txt rules of the websites it visits.
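To make the flow concrete, here is a minimal sketch of a single crawl step under these assumptions: the collection is named pages, and the metadata comes from Open Graph tags with a plain title-tag fallback; the actual crawler may differ in its details.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
pages = client["search_engine"]["pages"]  # assumed collection name

def allowed_by_robots(url: str) -> bool:
    """Check the site's robots.txt before fetching the page."""
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch("*", url)

def crawl_page(url: str) -> None:
    """Fetch one page, extract its metadata and main content, store both."""
    if not allowed_by_robots(url):
        return
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    def meta(prop: str):
        tag = soup.find("meta", property=prop)
        return tag["content"] if tag and tag.has_attr("content") else None

    pages.update_one(
        {"url": url},  # upsert so re-crawls refresh instead of duplicating
        {"$set": {
            "url": url,
            "title": meta("og:title") or (soup.title.string if soup.title else None),
            "description": meta("og:description"),
            "image": meta("og:image"),
            "locale": meta("og:locale"),
            "type": meta("og:type"),
            "content": soup.get_text(" ", strip=True),
        }},
        upsert=True,
    )
```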
Dependencies
The project requires the following dependencies:
requests
beautifulsoup4
pymongo
You can install the dependencies using the following command:
pip install -r requirements.txt
Setting Up and Running the Web Crawler
Clone the repository:
git clone https://github.com/schBenedikt/web-crawler.git
cd web-crawler
Install the dependencies:
pip install -r requirements.txt
Ensure that MongoDB is running on your local machine. The web crawler connects to MongoDB at localhost:27017 and uses a database named search_engine.
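To confirm that this connection works before the first crawl, you can run a short pymongo snippet like the following (a sketch, assuming a default unauthenticated local MongoDB):

```python
from pymongo import MongoClient

# Matches the connection the crawler expects: localhost:27017, db "search_engine".
client = MongoClient("localhost", 27017, serverSelectionTimeoutMS=2000)
client.admin.command("ping")  # raises an error if MongoDB is not reachable
print(client["search_engine"].list_collection_names())
```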