Proxy Management
IP address blocking is one of the oldest and most effective ways of preventing access to a website. A good web scraping library should therefore provide easy-to-use but powerful tools for working around it. The most powerful weapon in our anti-IP-blocking arsenal is a proxy server. With Crawlee, we can use our own proxy servers or proxy servers acquired from third-party providers.
Classes Provided
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
Example
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.com',
        'http://proxy-2.com',
    ],
});
IP Rotation and Session Management
proxyConfiguration.newUrl() allows us to pass a sessionId parameter. It is then used to create a sessionId-proxyUrl pair, and subsequent newUrl() calls with the same sessionId will always return the same proxyUrl. This is extremely useful in scraping, because we want to create the impression of a real user, whose requests would not hop between IP addresses.
When no sessionId is provided, our proxy URLs are rotated round-robin.
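The pairing and rotation described above can be modeled in a few lines. This is only a minimal sketch of the behavior, not Crawlee's actual implementation: the StickyProxyRotator class is hypothetical, and assigning the next round-robin URL to a session seen for the first time is an assumption made for illustration.

```javascript
// Hypothetical model of session-sticky proxy rotation; NOT Crawlee's own code.
class StickyProxyRotator {
    constructor(proxyUrls) {
        if (proxyUrls.length === 0) {
            throw new Error('At least one proxy URL is required');
        }
        this.proxyUrls = proxyUrls;
        this.nextIndex = 0;            // position of the next URL in the round-robin cycle
        this.sessionToUrl = new Map(); // stored sessionId -> proxyUrl pairs
    }

    newUrl(sessionId) {
        // A known session always gets the proxy URL it was first paired with.
        if (sessionId !== undefined && this.sessionToUrl.has(sessionId)) {
            return this.sessionToUrl.get(sessionId);
        }
        // Otherwise take the next URL in round-robin order.
        const url = this.proxyUrls[this.nextIndex % this.proxyUrls.length];
        this.nextIndex += 1;
        // Remember the pairing only when a sessionId was given (assumed policy).
        if (sessionId !== undefined) {
            this.sessionToUrl.set(sessionId, url);
        }
        return url;
    }
}

const rotator = new StickyProxyRotator(['http://proxy-1.com', 'http://proxy-2.com']);
const first = rotator.newUrl('session-a');
const second = rotator.newUrl('session-a');            // same URL as `first`
const rotated = [rotator.newUrl(), rotator.newUrl()];  // no sessionId: URLs alternate
```

With Crawlee itself, the equivalent calls would go through proxyConfiguration.newUrl(), passing a sessionId when the same proxy should be reused.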
Browser Fingerprint Generation and Customization
Crawlee provides this feature with zero configuration: the use of fingerprints is enabled by default and available in PlaywrightCrawler and PuppeteerCrawler. Whenever we build a scraper with one of these crawlers, fingerprints are generated for the default browser and operating system out of the box.
In certain cases we want to narrow down the fingerprints used, e.g. to a specific operating system, locale, or browser. This is also possible with Crawlee: the fingerprint generation algorithm can be customized to reflect a particular browser version, among other options.
Let's take a look at the example below:
import { PlaywrightCrawler } from 'crawlee';
import { BrowserName, DeviceCategory, OperatingSystemsName } from '@crawlee/browser-pool';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true, // this is the default
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: [{
                    name: BrowserName.edge,
                    minVersion: 96,
                }],
                devices: [
                    DeviceCategory.desktop,
                ],
                operatingSystems: [
                    OperatingSystemsName.windows,
                ],
            },
        },
    },
});