Generate embeddings for large datasets with automatic checkpointing to disk. Interrupted runs can be resumed from the last completed chunk.
Usage
hf_embed_chunks(
text,
output_dir,
model = "BAAI/bge-small-en-v1.5",
token = NULL,
chunk_size = 1000L,
batch_size = 100L,
max_active = 10L,
resume = TRUE,
progress = TRUE
)
Arguments
- text
Character vector of text(s) to embed.
- output_dir
Character string. Directory to write chunk files.
- model
Character string. Model ID from Hugging Face Hub. Default: "BAAI/bge-small-en-v1.5".
- token
Character string or NULL. API token for authentication.
- chunk_size
Integer. Number of texts per disk chunk. Default: 1000.
- batch_size
Integer. Number of texts per API request. Default: 100.
- max_active
Integer. Maximum concurrent requests. Default: 10.
- resume
Logical. Skip already-completed chunks. Default: TRUE.
- progress
Logical. Show progress bar. Default: TRUE.
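The three sizing arguments work at different levels: `chunk_size` controls how many texts go into each checkpoint file on disk, `batch_size` controls how many texts are sent per API request, and `max_active` caps how many requests run concurrently. The exact scheduling is internal to the package; the arithmetic below is only an illustrative sketch of how the defaults relate for a 5000-text input.

```r
# Illustrative arithmetic only -- not part of the package API.
n_texts    <- 5000
chunk_size <- 1000   # texts per checkpoint file on disk
batch_size <- 100    # texts per API request

n_chunks   <- ceiling(n_texts / chunk_size)    # checkpoint files written
n_requests <- ceiling(n_texts / batch_size)    # total API requests
requests_per_chunk <- chunk_size / batch_size  # requests behind each checkpoint

n_chunks            # 5
n_requests          # 50
requests_per_chunk  # 10
```

With `resume = TRUE`, a rerun would skip any of those 5 chunk files already present in `output_dir`, so at most one partially processed chunk's requests are repeated.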
Examples
if (FALSE) { # \dontrun{
# Process a large dataset with checkpointing
texts <- rep("sample text", 5000)
hf_embed_chunks(texts, output_dir = "embeddings_output", chunk_size = 1000)

# Read the chunked results back in
results <- hf_read_chunks("embeddings_output")

# Resume interrupted processing (already-completed chunks are skipped)
hf_embed_chunks(more_texts, output_dir = "embeddings_output", resume = TRUE)
} # }