
Result Storage

Crawlee has several result storage types that are useful for specific tasks.

Crawlee's storage is managed by the MemoryStorage class. During the crawler run, all information is kept in memory while also being off-loaded to local files in the respective storage-type folders.

Key-value store

The key-value store is used for saving and reading data records or files. Each data record is represented by a unique key and associated with a MIME content type. Key-value stores are ideal for saving screenshots of web pages, PDFs, or for persisting the state of crawlers.

In Crawlee, the key-value store is represented by the KeyValueStore class. To simplify access to the default key-value store, Crawlee also provides the KeyValueStore.getValue() and KeyValueStore.setValue() helper functions.

The data is stored in the directory specified by the CRAWLEE_STORAGE_DIR environment variable as follows:

{CRAWLEE_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT}
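For example, an object written under the key OUTPUT to the default store (whose ID is default) would typically be persisted as {CRAWLEE_STORAGE_DIR}/key_value_stores/default/OUTPUT.json.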

Example

import { KeyValueStore } from 'crawlee';

// Get the INPUT from the default key-value store
const input = await KeyValueStore.getInput();

// Write the OUTPUT to the default key-value store
await KeyValueStore.setValue('OUTPUT', { myResult: 123 });

// Open a named key-value store
const store = await KeyValueStore.open('some-name');

// Write a record to the named key-value store.
// JavaScript object is automatically converted to JSON,
// strings and binary buffers are stored as they are
await store.setValue('some-key', { foo: 'bar' });

// Read a record from the named key-value store.
// Note that JSON is automatically parsed to a JavaScript object,
// text data is returned as a string, and other data is returned as binary buffer
const value = await store.getValue('some-key');

// Delete a record from the named key-value store
await store.setValue('some-key', null);
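
Because each record carries a MIME content type, binary data such as screenshots or PDFs can be stored as well. The snippet below is a minimal sketch: it assumes a headless-browser page object (here named page, e.g. from Playwright) is already available, and passes an explicit contentType option to setValue().

// Take a screenshot (the `page` object is assumed to come from a headless browser)
const screenshot = await page.screenshot();

// Store the binary buffer with an explicit MIME content type
await store.setValue('some-page-screenshot', screenshot, { contentType: 'image/png' });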

Dataset

Datasets are used to store structured data where each stored object has the same attributes, such as online store products or real estate offers. A dataset can be imagined as a table: each object is a row and its attributes are columns. Datasets are an append-only storage - new records can be added, but existing records cannot be modified or removed.

Each Crawlee project run is associated with a default dataset. Typically, it is used to store crawling results specific to the crawler run. Its usage is optional.

In Crawlee, the dataset is represented by the Dataset class. To simplify writes to the default dataset, Crawlee also provides the Dataset.pushData() helper function.

Example

import { Dataset } from 'crawlee';

// Write a single row to the default dataset
await Dataset.pushData({ col1: 123, col2: 'val2' });

// Open a named dataset
const dataset = await Dataset.open('some-name');

// Write a single row
await dataset.pushData({ foo: 'bar' });

// Write multiple rows
await dataset.pushData([{ foo: 'bar2', col2: 'val2' }, { col3: 123 }]);
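
Stored rows can later be read back for post-processing or export. The following is a minimal sketch using the dataset opened above; it assumes the getData() and forEach() methods of the Dataset class, which return the stored items and iterate over them, respectively.

// Read all rows back from the named dataset
// (getData() resolves to an object whose `items` array holds the stored rows)
const { items } = await dataset.getData();
console.log(items);

// Or iterate over the rows one by one
await dataset.forEach(async (item, index) => {
    console.log(`Row ${index}:`, item);
});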
