WebReinvent Internal Docs
Introduction

Setting Up

What is Crawlee

Crawlee is a web scraping and browser automation library for Node.js.

Features

  1. Single interface for HTTP and headless browser crawling
  2. Persistent queue for URLs to crawl (breadth & depth first)
  3. Pluggable storage of both tabular data and files (see the sketch after this list)
  4. Automatic scaling with available system resources
  5. Integrated proxy rotation and session management
  6. Lifecycles customizable with hooks
  7. CLI to bootstrap your projects
  8. Configurable routing, error handling and retries
  9. Dockerfiles ready to deploy
  10. Written in TypeScript with generics
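
To illustrate the persistent request queue and pluggable storage from points 2 and 3 above, here is a minimal TypeScript sketch. It only touches the default storages; the URL and stored values are placeholders.

  import { Dataset, KeyValueStore, RequestQueue } from 'crawlee';

  // Open the default persistent request queue and enqueue a URL to crawl.
  const queue = await RequestQueue.open();
  await queue.addRequest({ url: 'https://crawlee.dev' });

  // Push tabular results into the default dataset...
  await Dataset.pushData({ url: 'https://crawlee.dev', title: 'Crawlee' });

  // ...and keep arbitrary values or files in a key-value store.
  await KeyValueStore.setValue('run-info', { startedAt: new Date().toISOString() });

Note that top-level await assumes an ESM project, which the Crawlee CLI templates generate by default.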

Setup

Use the Crawlee CLI

npx crawlee create my-crawler
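
The CLI scaffolds the project and installs its dependencies. After that, the generated crawler can usually be started with:

  cd my-crawler
  npm start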

Choose your crawler

? Please select the template for your new Crawlee project (Use arrow keys)
> Getting started example [TypeScript] 
  Getting started example [JavaScript] 
  Empty project [TypeScript] 
  Empty project [JavaScript] 
  CheerioCrawler template project [TypeScript] 
  PlaywrightCrawler template project [TypeScript] 
  PuppeteerCrawler template project [TypeScript]

CheerioCrawler

This is a plain HTTP crawler. It parses HTML with the Cheerio library and crawls the web using the specialized got-scraping HTTP client, which masks itself as a browser. It's very fast and efficient, but it can't handle pages that require JavaScript rendering.
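
A minimal CheerioCrawler sketch (the start URL and the 20-request cap are placeholder values for illustration):

  import { CheerioCrawler, Dataset } from 'crawlee';

  const crawler = new CheerioCrawler({
    // $ is a Cheerio handle over the downloaded and parsed HTML.
    async requestHandler({ request, $, enqueueLinks }) {
      const title = $('title').text();
      await Dataset.pushData({ url: request.loadedUrl, title });
      // Enqueue links found on the page so the crawl continues.
      await enqueueLinks();
    },
    maxRequestsPerCrawl: 20,
  });

  await crawler.run(['https://crawlee.dev']);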

PuppeteerCrawler

This crawler crawls using a headless browser controlled by the Puppeteer library. It can drive Chromium or Chrome. Puppeteer is the de-facto standard in headless browser automation.
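
Usage mirrors CheerioCrawler, except the request handler receives a Puppeteer page with the JavaScript-rendered document instead of a Cheerio handle. A minimal sketch with placeholder values:

  import { PuppeteerCrawler, Dataset } from 'crawlee';

  const crawler = new PuppeteerCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
      // page is a Puppeteer Page, so the DOM reflects executed JavaScript.
      const title = await page.title();
      await Dataset.pushData({ url: request.loadedUrl, title });
      await enqueueLinks();
    },
    maxRequestsPerCrawl: 20,
  });

  await crawler.run(['https://crawlee.dev']);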

PlaywrightCrawler

Playwright is a more powerful and full-featured successor to Puppeteer. It can control Chromium, Chrome, Firefox, WebKit and other browsers.
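
A minimal sketch that runs the crawl in Firefox instead of the default Chromium. It assumes the playwright package is installed (the Playwright template ships with it) and uses the launchContext.launcher option to pick the browser; the URL and request cap are placeholders:

  import { PlaywrightCrawler, Dataset } from 'crawlee';
  import { firefox } from 'playwright';

  const crawler = new PlaywrightCrawler({
    // Omit launchContext to fall back to the default Chromium.
    launchContext: { launcher: firefox },
    async requestHandler({ request, page, enqueueLinks }) {
      await Dataset.pushData({ url: request.loadedUrl, title: await page.title() });
      await enqueueLinks();
    },
    maxRequestsPerCrawl: 20,
  });

  await crawler.run(['https://crawlee.dev']);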


Copyright © 2024