WebReinvent Internal Docs
Introduction

Setting Up

What is Crawlee

Crawlee is a web scraping and browser automation library for Node.js.

Features

  1. Single interface for HTTP and headless browser crawling
  2. Persistent queue for URLs to crawl (breadth & depth first)
  3. Pluggable storage of both tabular data and files (see the sketch after this list)
  4. Automatic scaling with available system resources
  5. Integrated proxy rotation and session management
  6. Lifecycles customizable with hooks
  7. CLI to bootstrap your projects
  8. Configurable routing, error handling and retries
  9. Dockerfiles ready to deploy
  10. Written in TypeScript with generics
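
To illustrate the persistent request queue and pluggable storage from points 2 and 3 above, here is a minimal TypeScript sketch. It only touches the default storages; the URL and stored values are placeholders.

  import { Dataset, KeyValueStore, RequestQueue } from 'crawlee';

  // Open the default persistent request queue and enqueue a URL to crawl.
  const queue = await RequestQueue.open();
  await queue.addRequest({ url: 'https://crawlee.dev' });

  // Push tabular results into the default dataset...
  await Dataset.pushData({ url: 'https://crawlee.dev', title: 'Crawlee' });

  // ...and keep arbitrary values or files in a key-value store.
  await KeyValueStore.setValue('run-info', { startedAt: new Date().toISOString() });

Note that top-level await assumes an ESM project, which the Crawlee CLI templates generate by default.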

Setup

Use the Crawlee CLI

npx crawlee create my-crawler
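
The CLI scaffolds the project and installs its dependencies. After that, the generated crawler can usually be started with:

  cd my-crawler
  npm start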

Choose your crawler

? Please select the template for your new Crawlee project (Use arrow keys)
> Getting started example [TypeScript] 
  Getting started example [JavaScript] 
  Empty project [TypeScript] 
  Empty project [JavaScript] 
  CheerioCrawler template project [TypeScript] 
  PlaywrightCrawler template project [TypeScript] 
  PuppeteerCrawler template project [TypeScript]

CheerioCrawler

This is a plain HTTP crawler. It parses HTML with the Cheerio library and crawls the web using the specialized got-scraping HTTP client, which masks itself as a browser. It's very fast and efficient, but it can't handle pages that require JavaScript rendering.
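
A minimal CheerioCrawler sketch (the start URL and the 20-request cap are placeholder values for illustration):

  import { CheerioCrawler, Dataset } from 'crawlee';

  const crawler = new CheerioCrawler({
    // $ is a Cheerio handle over the downloaded and parsed HTML.
    async requestHandler({ request, $, enqueueLinks }) {
      const title = $('title').text();
      await Dataset.pushData({ url: request.loadedUrl, title });
      // Enqueue links found on the page so the crawl continues.
      await enqueueLinks();
    },
    maxRequestsPerCrawl: 20,
  });

  await crawler.run(['https://crawlee.dev']);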

PuppeteerCrawler

This crawler crawls using a headless browser controlled by the Puppeteer library. It can drive Chromium or Chrome. Puppeteer is the de-facto standard in headless browser automation.
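
Usage mirrors CheerioCrawler, except the request handler receives a Puppeteer page with the JavaScript-rendered document instead of a Cheerio handle. A minimal sketch with placeholder values:

  import { PuppeteerCrawler, Dataset } from 'crawlee';

  const crawler = new PuppeteerCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
      // page is a Puppeteer Page, so the DOM reflects executed JavaScript.
      const title = await page.title();
      await Dataset.pushData({ url: request.loadedUrl, title });
      await enqueueLinks();
    },
    maxRequestsPerCrawl: 20,
  });

  await crawler.run(['https://crawlee.dev']);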

PlaywrightCrawler

Playwright is a more powerful and full-featured successor to Puppeteer. It can control Chromium, Chrome, Firefox, WebKit and other browsers.
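
A minimal sketch that runs the crawl in Firefox instead of the default Chromium. It assumes the playwright package is installed (the Playwright template ships with it) and uses the launchContext.launcher option to pick the browser; the URL and request cap are placeholders:

  import { PlaywrightCrawler, Dataset } from 'crawlee';
  import { firefox } from 'playwright';

  const crawler = new PlaywrightCrawler({
    // Omit launchContext to fall back to the default Chromium.
    launchContext: { launcher: firefox },
    async requestHandler({ request, page, enqueueLinks }) {
      await Dataset.pushData({ url: request.loadedUrl, title: await page.title() });
      await enqueueLinks();
    },
    maxRequestsPerCrawl: 20,
  });

  await crawler.run(['https://crawlee.dev']);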


Copyright © 2024