A key capability is its standby mode, which runs the Actor as a persistent API server. This eliminates the usual start-up delay, a common pain point in many systems, and lets users interact with the system immediately through direct API calls.
This blog post explains how SuperScraper works, walks through its implementation details, and provides code snippets that demonstrate its core functionality.
SuperScraper is built on Crawlee, a web scraping and browser automation library for Node.js that helps you build reliable crawlers in JavaScript and TypeScript. Crawlee lets you extract data for AI, LLMs, RAG, or GPTs and download HTML, PDF, JPG, PNG, and other files from websites. It works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, supports both headful and headless mode, and comes with proxy rotation.
Crawlee covers your crawling and scraping end to end and helps you build reliable scrapers, fast. Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration, and Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or cloud while staying configurable to suit your project's needs.
SuperScraper transforms a traditional scraper into an API server. Instead of running with static inputs and waiting for completion, it starts only once, stays active, and listens for incoming requests.
How to enable standby mode
To activate standby mode, you enable it in the Actor's settings and make the Actor run an HTTP server that listens for incoming requests.
Server setup
The project uses Node.js' built-in http module to create a server that listens on the desired port. After the server starts, a check ensures users interact with it by sending HTTP requests rather than running the Actor in the traditional input-and-wait fashion. This keeps SuperScraper operating as a persistent server.
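Here is a minimal sketch of that setup; the exact port handling and startup check differ in the real project, and the ACTOR_STANDBY_PORT variable and handleRequest helper below are assumptions used only for illustration:

import { createServer } from 'node:http';
import { Actor, log } from 'apify';

await Actor.init();

// Assumption: the platform exposes the standby port via an environment variable.
const port = Number(process.env.ACTOR_STANDBY_PORT ?? 8080);

const server = createServer((req, res) => {
    // handleRequest is a hypothetical helper that parses parameters and enqueues the request;
    // the result is sent back later through the stored response object (see below).
    handleRequest(req, res).catch((err) => log.exception(err, 'Failed to handle request'));
});

server.listen(port, () => log.info(`SuperScraper is listening on port ${port}`));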
Handling multiple crawlers
SuperScraper processes user requests using multiple instances of Crawlee’s PlaywrightCrawler. Since each PlaywrightCrawler instance can only handle one proxy configuration, a separate crawler is created for each unique proxy setting.
For example, if a user sends one request with “normal” proxies and another with residential US proxies, a separate crawler has to be created for each proxy configuration. To manage this, we store the crawlers in a key-value map, where the key is the stringified proxy configuration.
const crawlers = new Map<string, PlaywrightCrawler>();
When a new request from the user arrives, the following logic runs: if a crawler for this proxy configuration already exists in the map, it is reused; otherwise, a new crawler is created. The request is then added to that crawler’s queue so it can be processed.
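A minimal sketch of that logic might look like this; createAndStartCrawler, crawlerOptions, targetUrl, and responseId are illustrative names rather than the project's exact identifiers:

// Reuse the crawler for this proxy configuration, or create one if it doesn't exist yet.
const key = JSON.stringify(crawlerOptions.proxyConfigurationOptions);
const crawler = crawlers.get(key) ?? await createAndStartCrawler(crawlerOptions);

// Enqueue the user's request; the responseId doubles as the request's uniqueKey
// so the crawler can later find the stored response object and answer it.
await crawler.addRequests([{
    url: targetUrl,
    uniqueKey: responseId,
    userData: { responseId },
}]);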
The function below initializes new crawlers with predefined settings and behaviors. Each crawler utilizes its own in-memory queue created with the MemoryStorage client. This approach is used for two key reasons:
Performance: In-memory queues are faster, and there's no need to persist them when SuperScraper migrates.
Isolation: Using a separate queue prevents interference with the shared default queue of the SuperScraper Actor, avoiding potential bugs when multiple crawlers use it simultaneously.
At the end of the function, we start the crawler and log a message if it terminates for any reason. Next, we add the newly created crawler to the key-value map containing all crawlers, and finally, we return the crawler.
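A condensed sketch of such a function, pieced together from the description above, might look like the following; the specific options and handler bodies are assumptions for illustration, and the response helpers are the ones shown later in this post:

import { Actor, log } from 'apify';
import { PlaywrightCrawler, RequestQueue } from 'crawlee';
import { MemoryStorage } from '@crawlee/memory-storage';

const createAndStartCrawler = async (proxyConfigurationOptions = {}) => {
    // Each crawler gets its own in-memory request queue: faster, never persisted,
    // and isolated from the Actor's shared default queue.
    const storageClient = new MemoryStorage({ persistStorage: false });
    const requestQueue = await RequestQueue.open(null, { storageClient });

    const proxyConfiguration = await Actor.createProxyConfiguration(proxyConfigurationOptions);

    const crawler = new PlaywrightCrawler({
        keepAlive: true, // keep running and wait for new requests instead of exiting
        requestQueue,
        proxyConfiguration,
        requestHandler: async ({ request, page }) => {
            // Scrape the page and answer the original HTTP request.
            const html = await page.content();
            sendSuccResponseById(request.uniqueKey, html, 'text/html');
        },
        failedRequestHandler: async ({ request }, err) => {
            sendErrorResponseById(request.uniqueKey, JSON.stringify({ errorMessage: err.message }));
        },
    });

    // Start the crawler and log a message if it ever terminates.
    crawler.run()
        .then(() => log.warning('Crawler ended'))
        .catch((err) => log.exception(err, 'Crawler failed'));

    // Register the new crawler under its stringified proxy configuration and return it.
    crawlers.set(JSON.stringify(proxyConfigurationOptions), crawler);
    return crawler;
};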
The server is created with a request listener function that takes two arguments: the user’s request and a response object, which is used to send scraped data back to the user. These response objects are stored in a key-value map so they can be accessed later in the code. The key is a randomly generated string shared between a request and its corresponding response object; it also serves as the request’s uniqueKey.
const responses = new Map<string, ServerResponse>();
Saving response objects
The following function stores a response object in the key-value map:
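In its simplest form, it just records the response object under its ID; addResponse here is an assumed name for the helper:

import { ServerResponse } from 'node:http';

// Store the response object so a crawler can later send the scraped data through it.
export const addResponse = (responseId: string, response: ServerResponse) => {
    responses.set(responseId, response);
};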
Once a crawler finishes processing a request, it retrieves the corresponding response object using the key and sends the scraped data back to the user:
export const sendSuccResponseById = (responseId: string, result: unknown, contentType: string) => {
    const res = responses.get(responseId);
    if (!res) {
        log.info(`Response for request ${responseId} not found`);
        return;
    }
    res.writeHead(200, { 'Content-Type': contentType });
    res.end(result);
    responses.delete(responseId);
};
Error handling
There is similar logic to send a response back if an error occurs during scraping:
export const sendErrorResponseById = (responseId: string, result: string, statusCode: number = 500) => {
    const res = responses.get(responseId);
    if (!res) {
        log.info(`Response for request ${responseId} not found`);
        return;
    }
    res.writeHead(statusCode, { 'Content-Type': 'application/json' });
    res.end(result);
    responses.delete(responseId);
};
Adding timeouts during migrations
During migration, SuperScraper adds timeouts to pending responses to handle termination cleanly.
export const addTimeoutToAllResponses = (timeoutInSeconds: number = 60) => {
    const migrationErrorMessage = {
        errorMessage: 'Actor had to migrate to another server. Please, retry your request.',
    };
    // `responses` is a Map, so iterate over its keys directly.
    for (const key of responses.keys()) {
        setTimeout(() => {
            sendErrorResponseById(key, JSON.stringify(migrationErrorMessage));
        }, timeoutInSeconds * 1000);
    }
};
Managing migrations
SuperScraper handles migrations by timing out active responses, so no request lingers while the server transitions. Users get clear feedback instead of a hanging connection, which keeps operation stable during migrations.
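To put this into effect, the timeout helper from the previous section would be registered on the platform's migration event. A minimal sketch, assuming the Apify SDK's migrating event and a 60-second timeout:

import { Actor } from 'apify';

// When the platform announces a migration, give every pending response a deadline
// so clients get the "please retry" error instead of a silently dropped connection.
Actor.on('migrating', () => {
    addTimeoutToAllResponses(60);
});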
Build your own
This guide showed how to build and manage a standby web scraper using Apify’s platform and Crawlee. The implementation handles multiple proxy configurations through PlaywrightCrawler instances while managing request-response cycles efficiently to support diverse scraping needs.
Standby mode transforms SuperScraper into a persistent API server, eliminating start-up delays. The migration handling system keeps operations stable during server transitions. You can build on this foundation to create web scraping tools tailored to your requirements.
To get started, explore the project on GitHub or learn more about Crawlee to build your own scalable web scraping tools.