
Scaling

As we build our crawler, we might want to control how many requests we make to a website at a time. Crawlee provides several options to fine-tune how many parallel requests are made at any one time, how many requests are made per minute, and how scaling should behave based on the available system resources.

maxRequestsPerMinute

This controls how many total requests can be made per minute. It counts requests on a per-second basis, to ensure there is no burst of requests at the maxConcurrency limit followed by a long period of waiting. By default, it is set to Infinity, which means the crawler will keep scaling up to maxConcurrency. We would set this if we wanted our crawler to work at full throughput, but also not keep hitting the website we're crawling with non-stop requests.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Let the crawler know it can run up to 100 requests concurrently at any time
    maxConcurrency: 100,
    // ...but also ensure the crawler never exceeds 250 requests per minute
    maxRequestsPerMinute: 250,
});
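To make the per-second counting idea concrete, the throttling behavior can be pictured as a sliding-window rate limiter. The sketch below is illustrative only, with an injectable clock for testing; it is not Crawlee's actual implementation.

```javascript
// Minimal sliding-window rate limiter sketch (illustrative only;
// not Crawlee's internal implementation).
class RateLimiter {
    constructor(maxPerMinute, now = Date.now) {
        this.maxPerMinute = maxPerMinute;
        this.now = now; // injectable clock, useful for deterministic tests
        this.timestamps = []; // start times of requests in the last minute
    }

    // Returns true if a request may start now, and records it.
    tryAcquire() {
        const cutoff = this.now() - 60_000;
        // Drop timestamps older than one minute.
        this.timestamps = this.timestamps.filter((t) => t > cutoff);
        if (this.timestamps.length >= this.maxPerMinute) return false;
        this.timestamps.push(this.now());
        return true;
    }
}
```

Because the window slides continuously, requests are spread across the minute instead of being allowed in one burst followed by a long pause.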

minConcurrency and maxConcurrency

These control how many parallel requests can run at any one time. By default, crawlers start with a single parallel request and scale up over time to a maximum of 200 concurrent requests.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Start the crawler right away and ensure there will always be 5 concurrent requests running at any time
    minConcurrency: 5,
    // Ensure the crawler doesn't exceed 15 concurrent requests running at any time
    maxConcurrency: 15,
});
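The idea behind a concurrency limit can be sketched as a bounded worker pool: a fixed number of workers pull tasks from a shared queue, so no more than maxConcurrency tasks run at once. This is a simplified sketch, not Crawlee's actual scaling logic.

```javascript
// Illustrative bounded worker pool (not Crawlee's actual pool).
// Runs the given async tasks with at most `maxConcurrency` in flight.
async function runWithConcurrency(tasks, maxConcurrency) {
    const results = new Array(tasks.length);
    let next = 0; // shared index into the task list
    // Spawn `maxConcurrency` workers that each pull the next pending task.
    const workers = Array.from({ length: maxConcurrency }, async () => {
        while (next < tasks.length) {
            const i = next++;
            results[i] = await tasks[i]();
        }
    });
    await Promise.all(workers);
    return results;
}
```

A real crawler additionally grows the worker count from minConcurrency toward maxConcurrency as system resources allow, rather than fixing it up front.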

Copyright © 2024