WebReinvent Internal Docs
Introduction

How it works

Request and RequestQueue

All crawlers use instances of the Request class to determine where they need to go. At the very least, each request must hold a URL - the web page to open. But a single URL is rarely enough for crawling. Sometimes you have a pre-existing list of your own URLs that you wish to visit, perhaps a thousand of them. Other times you need to build the list dynamically as you crawl, adding more and more URLs as you progress. Most of the time, you will use both options.

The requests are stored in a RequestQueue, a dynamic queue of Request instances. You can seed it with start URLs and also add more requests while the crawler is running. This allows the crawler to open one page, extract interesting URLs such as links to other pages on the same domain, add them to the queue (called enqueuing), and repeat the process to build a queue of a virtually unlimited number of URLs.
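You can, for example, seed the queue with a batch of known URLs up front and keep extending it later. A minimal sketch of that pattern (the example.com URLs are placeholders):

import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// Seed the queue with a pre-existing list of start URLs.
await queue.addRequests([
    { url: 'https://example.com/page-1' },
    { url: 'https://example.com/page-2' },
]);

// Later, typically from inside a request handler, enqueue
// URLs discovered while crawling.
await queue.addRequest({ url: 'https://example.com/discovered' });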

requestHandler

In the requestHandler you tell the crawler what to do at each and every page it visits. You can use it to extract data from the page, process the data, save it, call APIs, do calculations, and so on.

The requestHandler is a user-defined function, invoked automatically by the crawler for each Request from the RequestQueue. It always receives a single argument - a CrawlingContext. Its properties change depending on the crawler class used, but it always includes the request property, which represents the currently crawled URL and related metadata.
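As a minimal sketch of that signature (using CheerioCrawler as one example of a crawler class; log is another property of the CrawlingContext):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Destructure only the CrawlingContext properties you need.
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});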

Adding requests to the crawling queue

Let's add a request to our RequestQueue:

import { RequestQueue } from 'crawlee';
// First you create the request queue instance.
const requestQueue = await RequestQueue.open();
// And then you add one or more requests to it.
await requestQueue.addRequest({ url: 'https://docs.vaah.dev' });

The requestQueue.addRequest() function automatically converts an object with a URL string into a Request instance. So now you have a requestQueue holding one request that points to https://docs.vaah.dev.
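When you need more control, you can also construct the Request instance yourself, for example to attach your own metadata via userData. A sketch (the 'START' label is an arbitrary illustrative value):

import { RequestQueue, Request } from 'crawlee';

const requestQueue = await RequestQueue.open();

// Equivalent to the shorthand above, with extra metadata attached.
await requestQueue.addRequest(new Request({
    url: 'https://docs.vaah.dev',
    // userData carries arbitrary data along with the request.
    userData: { label: 'START' },
}));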

Building a CheerioCrawler


// Add import of CheerioCrawler
import { RequestQueue, CheerioCrawler } from 'crawlee';

const requestQueue = await RequestQueue.open();
await requestQueue.addRequest({ url: 'https://docs.vaah.dev' });

// Create the crawler and add the queue with our URL
// and a request handler to process the page.
const crawler = new CheerioCrawler({
    requestQueue,
    // The `$` argument is the Cheerio object
    // which contains parsed HTML of the website.
    async requestHandler({ $, request }) {
        // Extract <title> text with Cheerio.
        // See Cheerio documentation for API docs.
        const title = $('title').text();
        console.log(`The title of "${request.url}" is: ${title}.`);
    }
});

// Start the crawler and wait for it to finish
await crawler.run();
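For simple cases, Crawlee also lets you skip the explicit RequestQueue and pass start URLs straight to run(), which manages the queue internally. A sketch of that shorthand:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        console.log(`The title of "${request.url}" is: ${$('title').text()}.`);
    },
});

// run() accepts an array of start URLs (or Request objects).
await crawler.run(['https://docs.vaah.dev']);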

Copyright © 2024