What is huggingfaceR?

huggingfaceR provides R users with access to machine learning models hosted on the Hugging Face Hub via the Hugging Face Inference API. You can perform natural language processing tasks – classification, embeddings, chat, text generation, and more – without installing Python or managing model weights locally. Note that the Inference API serves a curated subset of the 500,000+ models on the Hub; not every model is available for serverless inference.

Key design principles:

  • No Python required. Authentication and a network connection are all you need.
  • Tidyverse-native. Every function accepts character vectors and returns tibbles.
  • Pipe-friendly. Functions compose naturally with dplyr, tidyr, and the rest of the tidyverse.

Installation

Install the released version from CRAN or the development version from GitHub:

# From CRAN
install.packages("huggingfaceR")

# Development version
# install.packages("devtools")
devtools::install_github("farach/huggingfaceR")

Authentication

Hugging Face requires an API token for inference requests. To obtain one:

  1. Create a free account at huggingface.co.
  2. Navigate to Settings > Access Tokens.
  3. Generate a token with at least read access.

Then configure the token in R:

library(huggingfaceR)

# Store your token persistently (writes to .Renviron)
hf_set_token("hf_your_token_here", store = TRUE)

# Verify authentication
hf_whoami()

After storing the token, it is loaded automatically in future sessions.
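
If you would rather not write to .Renviron (on a shared machine, say, or in CI), a session-only setup is a reasonable alternative; the sketch below assumes store = FALSE skips the persistent write, so check ?hf_set_token to confirm.

# Session-only token (assumes store = FALSE avoids writing to .Renviron;
# see ?hf_set_token to confirm)
hf_set_token("hf_your_token_here", store = FALSE)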

Quick Tour

Classify Text

Assign labels to text using pre-trained classifiers. The default model performs sentiment analysis, but you can supply any classification model available on the Hugging Face Inference API. Not all models on the Hub support serverless inference — use hf_check_inference(model_id) to verify.

# Sentiment analysis
hf_classify("I love using R for data science!")

# Zero-shot classification with custom labels (no training needed)
hf_classify_zero_shot(
  "NASA launches new Mars rover",
  labels = c("science", "politics", "sports", "entertainment")
)
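
Since hf_classify() accepts any classification model served by the Inference API, you can also pass a model ID explicitly. The model argument name below matches the endpoint examples later on this page, and the model ID is a widely used sentiment classifier, shown purely as an illustration.

# Supply a specific classifier from the Hub (model ID shown for illustration)
hf_classify(
  "The plot was predictable but the acting was superb.",
  model = "distilbert-base-uncased-finetuned-sst-2-english"
)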

Generate Embeddings

Convert text into dense numeric vectors that capture semantic meaning. Similar texts produce similar vectors.

sentences <- c(
  "The cat sat on the mat",
  "A feline rested on the rug",
  "The dog played in the park"
)

embeddings <- hf_embed(sentences)
embeddings

# Compute pairwise cosine similarity
hf_similarity(embeddings)
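
For intuition, cosine similarity is the dot product of two vectors divided by the product of their norms; the sketch below assumes each embedding is a plain numeric vector.

# What hf_similarity() computes for each pair of embeddings:
# dot product scaled by the two vector norms
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

Given the sentences above, the first two (near-paraphrases) should score higher against each other than either does against the third.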

Chat with a Language Model

Interact with open-source large language models through a simple interface.

# Single question
hf_chat("What is the tidyverse?")

# Guide the model with a system prompt
hf_chat(
  "Explain logistic regression in two sentences.",
  system = "You are a statistics instructor. Use plain language."
)
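
Because every huggingfaceR function accepts character vectors (see the design principles above), several prompts can go in one call; a minimal sketch:

# Vectorized prompts: one reply per element, returned in a tibble
questions <- c("What is dplyr?", "What is ggplot2?")
hf_chat(questions, system = "Answer in one sentence.")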

Explore the Hub

Search for models and load datasets directly into R without leaving your session.

# Find popular text classification models
hf_search_models(task = "text-classification", limit = 5)

# Load dataset rows into a tibble
imdb <- hf_load_dataset("imdb", split = "train", limit = 100)
head(imdb)
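
Since hf_search_models() returns a tibble, results feed straight into dplyr. The column name used below (downloads) is an assumption, so inspect the returned tibble before relying on it.

# A sketch: rank search results by popularity
# (the downloads column is an assumption -- check names() on the result first)
hf_search_models(task = "text-classification", limit = 20) |>
  dplyr::arrange(dplyr::desc(downloads)) |>
  head(3)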

Working with Data Frames

All huggingfaceR functions accept character vectors and return tibbles, so they integrate naturally into tidyverse pipelines.

library(dplyr)
library(tidyr)

reviews <- tibble(
  product_id = 1:5,
  review = c(
    "Excellent quality, highly recommend!",
    "Broke after one week of use",
    "Good value for the price",
    "Disappointing, not as advertised",
    "Love it! Will buy again"
  )
)

# Add sentiment scores
reviews |>
  mutate(sentiment = hf_classify(review)) |>
  unnest(sentiment) |>
  select(product_id, review, label, score)
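
From here, ordinary dplyr verbs apply. For example, tallying predictions by label:

# Count reviews by predicted sentiment label
reviews |>
  mutate(sentiment = hf_classify(review)) |>
  unnest(sentiment) |>
  count(label)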

Next Steps

For deeper coverage of each capability, see the package vignettes.

For production workloads: The classification and embeddings vignettes cover batch processing functions (hf_embed_batch(), hf_classify_batch(), etc.) that use parallel requests and disk checkpointing for processing thousands of texts efficiently.
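
As a rough sketch of the batch interface (only the function name comes from this page; any further arguments are assumptions, so see the vignettes for the real signatures):

# Batch embedding of a large character vector; consult the embeddings
# vignette for checkpointing and parallelism options
many_texts <- rep(sentences, 1000)
big_embeddings <- hf_embed_batch(many_texts)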

Using Dedicated Inference Endpoints

By default, huggingfaceR sends requests to the free, serverless Hugging Face Inference API. If you need to use a model that isn’t available on the serverless API, or you need dedicated capacity for production workloads, you can deploy a Dedicated Inference Endpoint and point huggingfaceR at it with the endpoint_url parameter.

# Check whether a model supports the free serverless API
hf_check_inference("my-org/my-custom-model")

# If not, deploy a Dedicated Endpoint on huggingface.co/inference-endpoints,
# then pass its URL to any huggingfaceR function:
hf_embed(
  "Embed this with my dedicated endpoint",
  model = "my-org/my-custom-model",
  endpoint_url = "https://my-endpoint-id.us-east-1.aws.endpoints.huggingface.cloud"
)

hf_classify(
  "Classify with a private model",
  model = "my-org/my-classifier",
  endpoint_url = "https://my-endpoint-id.us-east-1.aws.endpoints.huggingface.cloud"
)

# Chat and generate also support endpoint_url
hf_chat(
  "Hello from my dedicated endpoint!",
  model = "my-org/my-llm",
  endpoint_url = "https://my-endpoint-id.us-east-1.aws.endpoints.huggingface.cloud"
)

The endpoint_url parameter is available on all inference functions, including batch variants (hf_embed_batch(), hf_classify_batch(), etc.).
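
Combining the two, a batch call against a dedicated endpoint might look like the sketch below; everything beyond the function name and endpoint_url is an assumption.

# Batch embeddings routed to a dedicated endpoint (a sketch)
hf_embed_batch(
  many_texts,
  model = "my-org/my-custom-model",
  endpoint_url = "https://my-endpoint-id.us-east-1.aws.endpoints.huggingface.cloud"
)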