Introduction
Setting up
What is Crawlee
Crawlee is a web scraping and browser automation library.
Features
- Single interface for HTTP and headless browser crawling
- Persistent queue for URLs to crawl (breadth-first and depth-first)
- Pluggable storage of both tabular data and files
- Automatic scaling with available system resources
- Integrated proxy rotation and session management
- Lifecycles customizable with hooks
- CLI to bootstrap your projects
- Configurable routing, error handling and retries
- Dockerfiles ready to deploy
- Written in TypeScript with generics
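The breadth-first and depth-first queue behaviour from the list can be illustrated without Crawlee at all. This is a minimal sketch, not Crawlee's own queue: `Frontier`, `crawlOrder` and the toy `links` graph are hypothetical names made up for illustration. The only idea shown is that appending newly found URLs to the back of the queue gives breadth-first order, while putting them at the front gives depth-first order.

```typescript
// Hypothetical in-memory URL frontier. Crawlee's real queue is persistent
// and more involved; this only demonstrates the two crawl orders.
type Order = 'breadth-first' | 'depth-first';

class Frontier {
    private readonly pending: string[] = [];
    private readonly seen = new Set<string>();
    private readonly order: Order;

    constructor(order: Order) {
        this.order = order;
    }

    // Add newly discovered URLs, skipping any that were queued before.
    add(urls: string[]): void {
        const fresh: string[] = [];
        for (const url of urls) {
            if (!this.seen.has(url)) {
                this.seen.add(url);
                fresh.push(url);
            }
        }
        if (this.order === 'breadth-first') {
            this.pending.push(...fresh); // back of the queue
        } else {
            this.pending.unshift(...fresh); // front of the queue
        }
    }

    // Take the next URL to visit, or undefined when the queue is empty.
    next(): string | undefined {
        return this.pending.shift();
    }
}

// Toy link graph standing in for real pages.
const links: Record<string, string[]> = {
    '/': ['/a', '/b'],
    '/a': ['/a/1'],
    '/b': [],
    '/a/1': [],
};

// Walk the toy graph from '/' and return the order in which pages are visited.
function crawlOrder(order: Order): string[] {
    const frontier = new Frontier(order);
    frontier.add(['/']);
    const visited: string[] = [];
    for (let url = frontier.next(); url !== undefined; url = frontier.next()) {
        visited.push(url);
        frontier.add(links[url] ?? []);
    }
    return visited;
}
```

With this graph, `crawlOrder('breadth-first')` visits `/`, `/a`, `/b`, `/a/1`, while `crawlOrder('depth-first')` visits `/`, `/a`, `/a/1`, `/b`.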
Setup
Use the Crawlee CLI to create a new project:
npx crawlee create my-crawler
Choose your crawler
? Please select the template for your new Crawlee project (Use arrow keys)
> Getting started example [TypeScript]
Getting started example [JavaScript]
Empty project [TypeScript]
Empty project [JavaScript]
CheerioCrawler template project [TypeScript]
PlaywrightCrawler template project [TypeScript]
PuppeteerCrawler template project [TypeScript]
CheerioCrawler
This is a plain HTTP crawler. It downloads pages with got-scraping, a specialized HTTP client that masquerades as a browser, and parses the HTML with the Cheerio library. It is very fast and efficient, but it does not run the page's JavaScript, so it cannot scrape content that is rendered client-side.
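A minimal sketch of what a CheerioCrawler can look like, assuming the `crawlee` package is installed. The request limit and the choice to store only the page title are illustrative assumptions, not something the templates prescribe.

```typescript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // Illustrative limit so an experiment stops after a handful of pages.
    maxRequestsPerCrawl: 20,
    async requestHandler({ request, $, enqueueLinks, log }) {
        // `$` is the Cheerio handle for the HTML fetched over plain HTTP.
        const title = $('title').text();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Store one row per page in the default dataset.
        await Dataset.pushData({ url: request.loadedUrl, title });

        // Queue the links found on this page for crawling.
        await enqueueLinks();
    },
});
```

Calling `await crawler.run(['https://crawlee.dev'])` would then start the crawl from that single seed URL. The URL is only an example.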
PuppeteerCrawler
This crawler uses a headless browser controlled by the Puppeteer library. It can control Chromium or Chrome. Puppeteer is the de facto standard in headless browser automation.
PlaywrightCrawler
This crawler uses a headless browser controlled by the Playwright library, a more powerful and full-featured successor to Puppeteer. It can control Chromium, Chrome, Firefox, WebKit and other browsers.
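For comparison with the HTTP crawler, here is a hedged sketch of a browser-based crawler. It assumes both `crawlee` and `playwright` are installed; the request limit and the stored fields are again illustrative. The main difference is that the handler receives a Playwright `page` instead of a Cheerio `$`, so the title is read after the browser has rendered the page.

```typescript
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Illustrative limit so an experiment stops after a handful of pages.
    maxRequestsPerCrawl: 20,
    async requestHandler({ request, page, enqueueLinks, log }) {
        // `page` is a Playwright Page, so JavaScript on the site has already run.
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Store one row per page in the default dataset.
        await Dataset.pushData({ url: request.loadedUrl, title });

        // Queue the links found on the rendered page for crawling.
        await enqueueLinks();
    },
});
```

As with the other crawlers, `await crawler.run([...])` with a list of start URLs begins the crawl.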