Load a dataset from the Hugging Face Hub using the Datasets Server API. This is an API-first approach that does not require a Python installation. For local dataset loading with Python, see the legacy function or the advanced vignette.

Usage

hf_load_dataset(
  dataset,
  split = "train",
  config = NULL,
  limit = 1000,
  offset = 0,
  token = NULL
)

Arguments

dataset

Character string. Dataset name (e.g., "imdb", "squad").

split

Character string. Dataset split: "train", "test", "validation", etc. Default: "train".

config

Character string or NULL. Dataset configuration/subset name. If NULL (default), auto-detected from the dataset's available configs.

limit

Integer. Maximum number of rows to fetch. Default: 1000. Set to Inf to fetch all rows (may be slow for large datasets).

offset

Integer. Row offset for pagination. Default: 0.

token

Character string or NULL. API token for private datasets.

Value

A tibble containing the requested dataset rows, plus .dataset and .split metadata columns recording which dataset and split each row came from.
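Because limit and offset map onto the API's pagination, a large split can be fetched in chunks by advancing offset. A minimal sketch, assuming hf_load_dataset() returns a zero-row tibble once the offset passes the end of the split (an assumption; check the returned row count against your dataset):

```r
# Page through the "train" split 100 rows at a time.
# Assumes a zero-row result signals the end of the split.
page_size <- 100
offset <- 0
pages <- list()
repeat {
  page <- hf_load_dataset("imdb", split = "train",
                          limit = page_size, offset = offset)
  if (nrow(page) == 0) break
  pages[[length(pages) + 1]] <- page
  offset <- offset + page_size
}
imdb <- do.call(rbind, pages)
```

Using do.call(rbind, ...) keeps the sketch dependency-free; dplyr::bind_rows() would work equally well if dplyr is already loaded.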

Examples

if (FALSE) { # \dontrun{
# Load first 1000 rows of IMDB train set
imdb <- hf_load_dataset("imdb", split = "train", limit = 1000)

# Load test set
imdb_test <- hf_load_dataset("imdb", split = "test", limit = 500)
} # }
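The config and token arguments cover multi-configuration and private datasets. A sketch in the same style as the examples above (the dataset names and the HF_TOKEN environment variable are illustrative assumptions, not part of this package):

```r
if (FALSE) { # \dontrun{
# Load a specific configuration of a multi-config dataset
glue_cola <- hf_load_dataset("glue", config = "cola", split = "validation")

# Private dataset: pass an API token, here read from an
# environment variable rather than hard-coded in the script
private <- hf_load_dataset("my-org/private-data",
                           token = Sys.getenv("HF_TOKEN"))
} # }
```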