What is huggingfaceR?

huggingfaceR provides R users with access to machine learning models hosted on the Hugging Face Hub via the Hugging Face Inference API. You can perform natural language processing tasks – classification, embeddings, chat, text generation, and more – without installing Python or managing model weights locally. Note that the Inference API serves a curated subset of the 500,000+ models on the Hub; not every model is available for serverless inference.

Key design principles:

  • No Python required. Authentication and a network connection are all you need.
  • Tidyverse-native. Every function accepts character vectors and returns tibbles.
  • Pipe-friendly. Functions compose naturally with dplyr, tidyr, and the rest of the tidyverse.

Capability matrix

| Workflow | Start with | Returns |
| --- | --- | --- |
| Sentiment and labels | hf_classify(), hf_classify_zero_shot() | one tibble row per text or label |
| Embeddings and search | hf_embed(), hf_similarity(), hf_nearest_neighbors() | vectors, similarities, nearest rows |
| Chat and agents | hf_chat(), hf_conversation(), hf_tool() | assistant messages and tool calls |
| Structured extraction | hf_extract() | one tidy row per input with schema columns |
| Text tasks | hf_summarize(), hf_translate(), hf_ner(), hf_question_answer() | task-specific tidy columns |
| Multimodal | hf_transcribe(), hf_text_to_image(), hf_classify_image() | transcripts, files/raw bytes, image labels |
| Hub workflows | hf_hub_download(), hf_list_providers(), hf_push_dataset() | files, provider metadata, guarded uploads |

Installation

Install the released version from CRAN or the development version from GitHub:

# From CRAN
install.packages("huggingfaceR")

# Development version
# install.packages("devtools")
devtools::install_github("farach/huggingfaceR")

Authentication

Hugging Face requires an API token for inference requests. To obtain one:

  1. Create a free account at huggingface.co.
  2. Follow the Hugging Face access tokens documentation.
  3. Generate a token with at least read access.

Then configure the token in R:

library(huggingfaceR)

# Store your token persistently (writes to .Renviron)
hf_set_token("hf_your_token_here", store = TRUE)

# Verify authentication
hf_whoami()

After storing the token, it is loaded automatically in future sessions.
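
In scripts that are shared or committed, a common alternative to typing the token inline is to read it from an environment variable. The sketch below is a hedged illustration only: the helper get_token_or_stop() is hypothetical, and the variable name HF_TOKEN is an assumption, not something this page documents. Check the package reference for the variable it actually reads.

```r
# Hypothetical helper: read a token from an environment variable and fail
# early with a clear message if it is missing.
# ASSUMPTION: the variable name "HF_TOKEN" is illustrative only.
get_token_or_stop <- function(var = "HF_TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  token
}

# Usage (not run; requires huggingfaceR and a real token).
# ASSUMPTION: store = FALSE is accepted and keeps the token for this session only.
# hf_set_token(get_token_or_stop(), store = FALSE)
```

Failing early keeps a missing token from surfacing later as a less obvious request error.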

Quick Tour

Classify Text

Assign labels to text using pre-trained classifiers. The default model performs sentiment analysis, but you can supply any classification model available on the Hugging Face Inference API. Not all models on the Hub support serverless inference; use hf_check_inference(model_id) to verify.

# Sentiment analysis
hf_classify("I love using R for data science!")
#> # A tibble: 1 × 3
#>   text                             label    score
#>   <chr>                            <chr>    <dbl>
#> 1 I love using R for data science! POSITIVE 1.000

# Zero-shot classification with custom labels (no training needed)
hf_classify_zero_shot(
  "NASA launches new Mars rover",
  labels = c("science", "politics", "sports", "entertainment")
)
#> # A tibble: 4 × 3
#>   text                         label           score
#>   <chr>                        <chr>           <dbl>
#> 1 NASA launches new Mars rover science       0.957  
#> 2 NASA launches new Mars rover entertainment 0.0311 
#> 3 NASA launches new Mars rover sports        0.00785
#> 4 NASA launches new Mars rover politics      0.00395
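
Zero-shot output has one row per text and label, so a common follow-up is to keep only the highest-scoring label for each text. The following is a minimal base R sketch of that step. It uses a hand-built data frame that mirrors the columns and scores printed above rather than a live API call, and the names zs and top_label are illustrative.

```r
# Stand-in for the zero-shot result printed above (no API call is made).
zs <- data.frame(
  text  = rep("NASA launches new Mars rover", 4),
  label = c("science", "entertainment", "sports", "politics"),
  score = c(0.957, 0.0311, 0.00785, 0.00395),
  stringsAsFactors = FALSE
)

# Split by text, then keep the row with the highest score in each group.
top_label <- do.call(rbind, lapply(split(zs, zs$text), function(d) {
  d[which.max(d$score), , drop = FALSE]
}))
rownames(top_label) <- NULL
top_label
```

The same pattern applies unchanged when the input contains several texts, because the split is done per text.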

Generate Embeddings

Convert text into dense numeric vectors that capture semantic meaning. Similar texts produce similar vectors.

sentences <- c(
  "The cat sat on the mat",
  "A feline rested on the rug",
  "The dog played in the park"
)

embeddings <- hf_embed(sentences)
embeddings
#> # A tibble: 3 × 3
#>   text                       embedding   n_dims
#>   <chr>                      <list>       <int>
#> 1 The cat sat on the mat     <dbl [384]>    384
#> 2 A feline rested on the rug <dbl [384]>    384
#> 3 The dog played in the park <dbl [384]>    384

# Compute pairwise cosine similarity
hf_similarity(embeddings)
#> # A tibble: 3 × 3
#>   text_1                     text_2                     similarity
#>   <chr>                      <chr>                           <dbl>
#> 1 The cat sat on the mat     A feline rested on the rug      0.748
#> 2 The cat sat on the mat     The dog played in the park      0.516
#> 3 A feline rested on the rug The dog played in the park      0.555
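
To make the similarity column concrete, here is a small base R sketch of pairwise cosine similarity. It is not the package's implementation. The vectors are made-up three-dimensional values, not real model embeddings, and the names cosine_similarity, toy, and pairwise are illustrative.

```r
# Cosine similarity between two numeric vectors of equal length.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy 3-dimensional vectors (made up for illustration, not model output).
toy <- data.frame(text = c("cat", "feline", "dog"), stringsAsFactors = FALSE)
toy$embedding <- list(c(1, 0, 1), c(1, 0.2, 0.9), c(0, 1, 0.3))

# Every unordered pair of rows: (1,2), (1,3), (2,3).
pairs <- t(combn(nrow(toy), 2))

pairwise <- data.frame(
  text_1 = toy$text[pairs[, 1]],
  text_2 = toy$text[pairs[, 2]],
  similarity = mapply(
    function(i, j) cosine_similarity(toy$embedding[[i]], toy$embedding[[j]]),
    pairs[, 1], pairs[, 2]
  ),
  stringsAsFactors = FALSE
)
pairwise
```

Cosine similarity depends only on the angle between vectors, so two texts can score as similar even when their embeddings differ in magnitude.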

Chat with a Language Model

Interact with open-source large language models through a simple interface.

# Single question
hf_chat("What is the tidyverse?", max_tokens = 60)
#> # A tibble: 1 × 5
#>   role      content                                 model tokens_used tool_calls
#>   <chr>     <chr>                                   <chr>       <int> <list>    
#> 1 assistant The tidyverse is a collection of R pac… meta…          60 <list [0]>

# Guide the model with a system prompt
hf_chat(
  "Explain logistic regression in two sentences.",
  system = "You are a statistics instructor. Use plain language.",
  max_tokens = 80
)
#> # A tibble: 1 × 5
#>   role      content                                 model tokens_used tool_calls
#>   <chr>     <chr>                                   <chr>       <int> <list>    
#> 1 assistant Logistic regression is a statistical m… meta…          69 <list [0]>

Explore the Hub

Search for models and load datasets directly into R without leaving your session.

# Find popular text classification models
hf_search_models(task = "text-classification", limit = 5)
#> # A tibble: 5 × 7
#>   model_id                            author task  downloads likes tags  library
#>   <chr>                               <chr>  <chr>     <int> <int> <lis> <chr>  
#> 1 BAAI/bge-reranker-v2-m3             <NA>   text…  16443234  1053 <chr> senten…
#> 2 ProsusAI/finbert                    <NA>   text…   7648889  1184 <chr> transf…
#> 3 BAAI/bge-reranker-base              <NA>   text…   4167279   238 <chr> senten…
#> 4 cardiffnlp/twitter-roberta-base-se… <NA>   text…   3953164   813 <chr> transf…
#> 5 distilbert/distilbert-base-uncased… <NA>   text…   3644729   910 <chr> transf…

# Load dataset rows into a tibble
imdb <- hf_load_dataset("imdb", split = "train", limit = 100)
head(imdb)
#> # A tibble: 6 × 4
#>   text                                                     label .dataset .split
#>   <chr>                                                    <int> <chr>    <chr> 
#> 1 "I rented I AM CURIOUS-YELLOW from my video store becau…     0 stanfor… train 
#> 2 "\"I Am Curious: Yellow\" is a risible and pretentious …     0 stanfor… train 
#> 3 "If only to avoid making this type of film in the futur…     0 stanfor… train 
#> 4 "This film was probably inspired by Godard's Masculin, …     0 stanfor… train 
#> 5 "Oh, brother...after hearing about this ridiculous film…     0 stanfor… train 
#> 6 "I would put this at the top of my list of films in the…     0 stanfor… train

Extract Structured Data

Turn messy prose into analysis-ready columns.

hf_extract(
  "Amelie is a chef in Paris who mentions burnout.",
  c(name = "string", occupation = "string", city = "string", theme = "string")
)
#> # A tibble: 1 × 4
#>   name   occupation city  theme  
#>   <chr>  <chr>      <chr> <chr>  
#> 1 Amelie chef       Paris burnout
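
One way to picture the "one row per input with schema columns" shape is to map a parsed set of fields onto the schema names, filling anything absent with NA. The sketch below is an illustration under assumptions, not a description of how hf_extract() works internally. The helper row_from_schema() and the fields list are hypothetical, and only "string" fields are handled.

```r
schema <- c(name = "string", occupation = "string", city = "string", theme = "string")

# Hypothetical stand-in for fields parsed from a model response.
# "theme" is deliberately left out to show the NA fill.
fields <- list(name = "Amelie", occupation = "chef", city = "Paris")

# Build a one-row data frame with exactly the schema's columns, in order.
row_from_schema <- function(fields, schema) {
  values <- lapply(names(schema), function(key) {
    value <- fields[[key]]
    if (is.null(value)) NA_character_ else as.character(value)
  })
  names(values) <- names(schema)
  as.data.frame(values, stringsAsFactors = FALSE)
}

extracted <- row_from_schema(fields, schema)
extracted
```

Fixing the columns from the schema, rather than from whatever fields came back, is what lets rows from many inputs be stacked safely.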

Work with Images and Audio

audio <- "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
image <- "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"

transcript <- hf_transcribe(audio, return_timestamps = "word")
substr(transcript$text, 1, 120)
#> [1] " I have a dream that one day this nation will rise up and live out the true meaning of its creed."

hf_classify_image(image, top_k = 3)
#> # A tibble: 3 × 3
#>   image                                                              label score
#>   <chr>                                                              <chr> <dbl>
#> 1 https://huggingface.co/datasets/huggingface/documentation-images/… tabb… 0.277
#> 2 https://huggingface.co/datasets/huggingface/documentation-images/… tige… 0.276
#> 3 https://huggingface.co/datasets/huggingface/documentation-images/… Egyp… 0.140

# Detect objects and keep only the cats (filter() comes from dplyr)
hf_detect_objects(image, threshold = 0.5) |>
  dplyr::filter(label == "cat")
#> # A tibble: 2 × 7
#>   image                                      label score  xmin  ymin  xmax  ymax
#>   <chr>                                      <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 https://huggingface.co/datasets/huggingfa… cat   0.997   156    31   385   146
#> 2 https://huggingface.co/datasets/huggingfa… cat   0.999   145   132   429   341

Working with Data Frames

All huggingfaceR functions accept character vectors and return tibbles, so they integrate naturally into tidyverse pipelines.

library(dplyr)  # tibble(), mutate(), select()
library(tidyr)  # unnest()

reviews <- tibble(
  product_id = 1:5,
  review = c(
    "Excellent quality, highly recommend!",
    "Broke after one week of use",
    "Good value for the price",
    "Disappointing, not as advertised",
    "Love it! Will buy again"
  )
)

# Add sentiment scores
reviews |>
  mutate(sentiment = hf_classify(review)) |>
  unnest(sentiment) |>
  select(product_id, review, label, score)
#> # A tibble: 5 × 4
#>   product_id review                               label    score
#>        <int> <chr>                                <chr>    <dbl>
#> 1          1 Excellent quality, highly recommend! POSITIVE 1.000
#> 2          2 Broke after one week of use          NEGATIVE 0.999
#> 3          3 Good value for the price             POSITIVE 1.000
#> 4          4 Disappointing, not as advertised     NEGATIVE 1.000
#> 5          5 Love it! Will buy again              POSITIVE 1.000

Next Steps

For deeper coverage of each capability, see the package vignettes.

For production workloads: The classification and embeddings vignettes cover batch processing functions (hf_embed_batch(), hf_classify_batch(), etc.) that use parallel requests and disk checkpointing for processing thousands of texts efficiently.
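
The idea behind disk checkpointing can be sketched in base R: split the input into chunks, save each chunk's result to disk, and skip chunks whose file already exists when the job is rerun. This is a simplified, sequential illustration, not the implementation of hf_embed_batch() or hf_classify_batch(). The names process_with_checkpoints() and fake_classify() are hypothetical, and fake_classify() stands in for a real API call.

```r
# Stand-in for an API call; a real pipeline would call an inference function.
fake_classify <- function(texts) {
  data.frame(text = texts, n_chars = nchar(texts), stringsAsFactors = FALSE)
}

# Process texts in chunks, writing one .rds checkpoint file per chunk.
process_with_checkpoints <- function(texts, fn, chunk_size = 2,
                                     dir = tempfile("ckpt_")) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  chunk_id <- ceiling(seq_along(texts) / chunk_size)
  chunks <- split(texts, chunk_id)

  results <- lapply(names(chunks), function(id) {
    path <- file.path(dir, paste0("chunk_", id, ".rds"))
    if (file.exists(path)) {
      return(readRDS(path))  # already done in an earlier run: reuse it
    }
    out <- fn(chunks[[id]])
    saveRDS(out, path)       # checkpoint before moving to the next chunk
    out
  })
  do.call(rbind, results)
}

texts <- c("a", "bb", "ccc", "dddd", "eeeee")
ckpt_dir <- tempfile("ckpt_")
result <- process_with_checkpoints(texts, fake_classify, chunk_size = 2, dir = ckpt_dir)
result
```

Rerunning with the same directory reads the saved chunks instead of calling the function again, so an interrupted job does not have to start over.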

Using Dedicated Inference Endpoints

By default, huggingfaceR sends requests to the free, serverless Hugging Face Inference API. If you need to use a model that isn’t available on the serverless API, or you need dedicated capacity for production workloads, you can deploy a Dedicated Inference Endpoint and point huggingfaceR at it with the endpoint_url parameter.

# Check whether a model supports the free serverless API
hf_check_inference("my-org/my-custom-model")

# If not, deploy a Dedicated Endpoint on huggingface.co/inference-endpoints,
# then pass its URL to any huggingfaceR function:
hf_embed(
  "Embed this with my dedicated endpoint",
  model = "my-org/my-custom-model",
  endpoint_url = "https://my-endpoint-id.us-east-1.aws.endpoints.huggingface.cloud"
)

hf_classify(
  "Classify with a private model",
  model = "my-org/my-classifier",
  endpoint_url = "https://my-endpoint-id.us-east-1.aws.endpoints.huggingface.cloud"
)

# Chat and generate also support endpoint_url
hf_chat(
  "Hello from my dedicated endpoint!",
  model = "my-org/my-llm",
  endpoint_url = "https://my-endpoint-id.us-east-1.aws.endpoints.huggingface.cloud"
)

The endpoint_url parameter is available on all inference functions, including batch variants (hf_embed_batch(), hf_classify_batch(), etc.).
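
When several calls share one endpoint, it can help to bundle the model ID and URL once and reuse them. The sketch below is one possible pattern, not part of the package. The helper dedicated_endpoint() is hypothetical, and the https:// check is an assumption based on the example URLs above.

```r
# Hypothetical helper: validate and bundle endpoint settings for reuse.
dedicated_endpoint <- function(model, endpoint_url) {
  stopifnot(is.character(endpoint_url), length(endpoint_url) == 1)
  if (!grepl("^https://", endpoint_url)) {
    stop("endpoint_url should be an https:// URL.", call. = FALSE)
  }
  list(model = model, endpoint_url = endpoint_url)
}

cfg <- dedicated_endpoint(
  model = "my-org/my-custom-model",
  endpoint_url = "https://my-endpoint-id.us-east-1.aws.endpoints.huggingface.cloud"
)

# Usage (not run; requires huggingfaceR, a token, and a deployed endpoint).
# Equivalent to passing model = and endpoint_url = directly to hf_embed():
# do.call(hf_embed, c(list("Embed this with my dedicated endpoint"), cfg))
```

Keeping the settings in one list means a change of endpoint is made in a single place.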