WebReinvent Internal Docs

Request Storage

Crawlee has several request storage types that are useful for specific tasks. Requests are stored on the local disk in the directory defined by the CRAWLEE_STORAGE_DIR environment variable. If this variable is not defined, Crawlee defaults CRAWLEE_STORAGE_DIR to ./storage in the current working directory.
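
The fallback rule described above can be sketched in a few lines. This is only an illustration: resolveStorageDir is a hypothetical helper that is not part of Crawlee, and the /tmp/crawl-data path is an assumed example value.

```javascript
// Hypothetical helper, not part of Crawlee: returns the storage directory
// that the rule above would select for a given set of environment variables.
function resolveStorageDir(env = process.env) {
    // Fall back to ./storage (relative to the current working directory)
    // when CRAWLEE_STORAGE_DIR is not defined.
    return env.CRAWLEE_STORAGE_DIR ?? './storage';
}

console.log(resolveStorageDir({})); // prints "./storage"
console.log(resolveStorageDir({ CRAWLEE_STORAGE_DIR: '/tmp/crawl-data' })); // prints "/tmp/crawl-data"
```

In practice the variable is usually set in the shell before the script starts, for example by prefixing the start command with CRAWLEE_STORAGE_DIR=/tmp/crawl-data.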

Request queue

The request queue is a storage of URLs to crawl. The queue is used for the deep crawling of websites, where we start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders.
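
To make the two crawling orders concrete, here is a toy model of a queue, assuming that new URLs are either appended to the end (breadth-first) or placed at the front (depth-first). It is a sketch in plain JavaScript and not Crawlee's implementation; the links graph, the crawlOrder function, and its forefront flag are all hypothetical names chosen for illustration.

```javascript
// Toy link graph standing in for a website (hypothetical data).
const links = {
    '/': ['/a', '/b'],
    '/a': ['/a1', '/a2'],
    '/b': ['/b1'],
    '/a1': [],
    '/a2': [],
    '/b1': [],
};

// Visits every page reachable from `start` and returns the visit order.
// forefront = false appends newly found URLs (breadth-first);
// forefront = true puts them at the head of the queue (depth-first).
function crawlOrder(start, { forefront = false } = {}) {
    const queue = [start];
    const seen = new Set(queue);
    const order = [];
    while (queue.length > 0) {
        const url = queue.shift();
        order.push(url);
        // Only enqueue URLs that were not seen before.
        const fresh = links[url].filter((next) => !seen.has(next));
        fresh.forEach((next) => seen.add(next));
        if (forefront) queue.unshift(...fresh);
        else queue.push(...fresh);
    }
    return order;
}

console.log(crawlOrder('/'));
// prints [ '/', '/a', '/b', '/a1', '/a2', '/b1' ]
console.log(crawlOrder('/', { forefront: true }));
// prints [ '/', '/a', '/a1', '/a2', '/b', '/b1' ]
```

The only difference between the two orders is where newly discovered URLs are inserted, which is why a single queue structure can serve both.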

Each Crawlee project run is associated with a default request queue. Typically, it is used to store URLs to crawl in the specific crawler run. Its usage is optional.

In Crawlee, the request queue is represented by the RequestQueue class.

The request queue is managed by the MemoryStorage class. Its data is kept in memory and is also off-loaded to the local directory specified by the CRAWLEE_STORAGE_DIR environment variable. The following code demonstrates the usage of the default request queue:

import { CheerioCrawler } from 'crawlee';

// The crawler will automatically process requests from the queue.
// It's used the same way for Puppeteer/Playwright crawlers.
const crawler = new CheerioCrawler({
    // Note that we're not specifying the requestQueue here
    async requestHandler({ $, crawler, enqueueLinks }) {
        // Add new request to the queue
        await crawler.addRequests([{ url: 'https://example.com/new-page' }]);
        // Add links found on page to the queue
        await enqueueLinks();
    },
});

// Add the initial requests.
// Note that we are not opening the request queue explicitly beforehand.
await crawler.addRequests([
    { url: 'https://example.com/1' },
    { url: 'https://example.com/2' },
    { url: 'https://example.com/3' },
    // ...
]);

// Run the crawler
await crawler.run();

Request list

The request list is not a storage per se. It represents the list of URLs to crawl, which is kept in the memory of the crawler run (or optionally in the default key-value store associated with the run, if specified). The list is used for crawling a large number of URLs when we know in advance all the URLs the crawler should visit and no URLs will be added during the run. The URLs can be provided either in code or parsed from a text file hosted on the web.
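
As a rough illustration of the second option, the sketch below turns the contents of a text file into a sources array of the shape used for a request list. It is not Crawlee's own parser: extractSources and its regular expression are hypothetical, and the text is assumed to have been downloaded already.

```javascript
// Hypothetical helper, not part of Crawlee: extracts unique http(s) URLs
// from plain text and returns them as request list sources.
function extractSources(text) {
    // Simplified pattern: a URL runs until whitespace, a quote, or an angle bracket.
    const matches = text.match(/https?:\/\/[^\s"'<>]+/g) ?? [];
    // A Set drops duplicates while preserving first-seen order.
    return [...new Set(matches)].map((url) => ({ url }));
}

const fileContents = [
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
    'http://www.example.com/page-1',
].join('\n');

console.log(extractSources(fileContents));
// prints two sources: page-1 and page-2 (the duplicate is dropped)
```

Because the whole list is known before the run starts, it can be built once and handed to the crawler unchanged.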

The request list is created exclusively for the crawler run, and only if its usage is explicitly specified in the code. Its usage is optional.

In Crawlee, the request list is represented by the RequestList class.

import { RequestList, PuppeteerCrawler } from 'crawlee';

// Prepare the sources array with URLs to visit
const sources = [
    { url: 'http://www.example.com/page-1' },
    { url: 'http://www.example.com/page-2' },
    { url: 'http://www.example.com/page-3' },
];

// Open the request list.
// List name is used to persist the sources and the list state in the key-value store
const requestList = await RequestList.open('my-list', sources);

// The crawler will automatically process requests from the list.
// It's used the same way for Cheerio/Playwright crawlers.
const crawler = new PuppeteerCrawler({
    requestList,
    async requestHandler({ page, request }) {
        // Process the page (extract data, take page screenshot, etc).
        // No more requests could be added to the request list here
    },
});

// Run the crawler
await crawler.run();

Cleaning up the storages

Default storages are purged before the crawler starts, unless specified otherwise. This happens as early as when we try to open some storage (e.g. via RequestQueue.open()) or when we work with a default storage via one of the helper methods (e.g. crawler.addRequests(), which calls RequestQueue.open() under the hood). If we don't work with storages explicitly in our code, the purging will eventually happen when the run method of our crawler is executed. If we need to purge the storages sooner, we can call the purgeDefaultStorages() helper explicitly:

import { purgeDefaultStorages } from 'crawlee';

await purgeDefaultStorages();

Calling this function cleans up the default request storage directory (and also the request list stored in the default key-value store). It is a shortcut for running the (optional) purge method of the StorageClient interface; in other words, it calls the purge method of the underlying storage implementation currently in use. To make sure the storage is purged only once for a given execution context, set onlyPurgeOnce to true in the options object.
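
The effect of onlyPurgeOnce can be pictured with a small guard. This is a simplified, synchronous model and not Crawlee's code: createPurger and the purge callback are hypothetical, and real purging is asynchronous.

```javascript
// Toy model of the "purge only once" behaviour described above.
// In Crawlee itself the call would be:
//   await purgeDefaultStorages({ onlyPurgeOnce: true });
function createPurger(purge) {
    let alreadyPurged = false;
    // Returns true when a purge was performed and false when it was skipped.
    return function purgeStorages({ onlyPurgeOnce = false } = {}) {
        if (onlyPurgeOnce && alreadyPurged) return false; // skip repeated purges
        alreadyPurged = true;
        purge();
        return true;
    };
}

let purgeCount = 0;
const purgeStorages = createPurger(() => { purgeCount += 1; });

console.log(purgeStorages({ onlyPurgeOnce: true })); // prints true
console.log(purgeStorages({ onlyPurgeOnce: true })); // prints false
console.log(purgeCount); // prints 1
```

Without the flag, every call in this model purges again, which matches the idea that onlyPurgeOnce is an opt-in safeguard.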


Copyright © 2024