Embeddings, Similarity, and Semantic Search
Source: vignettes/embeddings-and-similarity.Rmd
Introduction
Text embeddings are dense numeric vectors that represent the semantic content of text. Sentences with similar meanings produce vectors that are close together in high-dimensional space, even if they use different words. This property makes embeddings useful for similarity search, clustering, topic modeling, and as features for machine learning.
huggingfaceR provides a complete embedding workflow: generate vectors
with hf_embed(), measure similarity with
hf_similarity(), search with
hf_nearest_neighbors(), cluster with
hf_cluster_texts(), extract topics with
hf_extract_topics(), and visualize with
hf_embed_umap().
Generating Embeddings with hf_embed()
Basic Usage
hf_embed() accepts a character vector and returns a
tibble with three columns: text, embedding (a
list-column of numeric vectors), and n_dims (the
dimensionality of each vector).
sentences <- c(
"Machine learning is transforming healthcare",
"Deep learning models require large datasets",
"The weather forecast predicts rain tomorrow",
"Clinical trials use statistical methods",
"It will be sunny next week"
)
embeddings <- hf_embed(sentences)
embeddings
#> # A tibble: 5 x 3
#> text embedding n_dims
#> <chr> <list> <int>
#> 1 Machine learning is transforming healthcare <dbl [384]> 384
#> 2 Deep learning models require large datasets <dbl [384]> 384
#> 3 The weather forecast predicts rain tomorrow <dbl [384]> 384
#> 4 Clinical trials use statistical methods <dbl [384]> 384
#> 5 It will be sunny next week <dbl [384]> 384
Accessing the Vectors
Each embedding is stored as a numeric vector inside the list-column. You can extract individual vectors or convert the entire set to a matrix.
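For example, here is a minimal base-R sketch of both operations, assuming the tibble structure shown above (vec_1 and emb_matrix are illustrative names):
# Extract the first embedding as a plain numeric vector
vec_1 <- embeddings$embedding[[1]]
length(vec_1)
#> [1] 384
# Stack all vectors into a 5 x 384 numeric matrix (one row per text)
emb_matrix <- do.call(rbind, embeddings$embedding)
dim(emb_matrix)
#> [1]   5 384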
Choosing a Model
The default model (BAAI/bge-small-en-v1.5) produces
384-dimensional embeddings and offers a good balance of speed and
quality for English text. You can specify any feature-extraction model
from the Hub.
# Use a different embedding model
embeddings_alt <- hf_embed(
sentences,
model = "BAAI/bge-base-en-v1.5" # 768-dimensional, higher quality
)

| Model | Dimensions | Characteristics |
|---|---|---|
| BAAI/bge-small-en-v1.5 | 384 | Fast, good quality (default) |
| BAAI/bge-base-en-v1.5 | 768 | Higher quality, slower |
| intfloat/multilingual-e5-small | 384 | Multilingual support |
Pairwise Similarity with hf_similarity()
hf_similarity() computes cosine similarity between all
pairs of embeddings. Cosine similarity ranges from -1 (opposite) to 1
(identical), with values near 0 indicating no semantic relationship.
hf_similarity(embeddings)
#> # A tibble: 10 x 3
#> text_1 text_2 similarity
#> <chr> <chr> <dbl>
#> 1 Machine learning is transforming healthcare Deep learning models ... 0.82
#> 2 Machine learning is transforming healthcare The weather forecast ... 0.15
#> 3 Machine learning is transforming healthcare Clinical trials use ... 0.61
#> 4 Machine learning is transforming healthcare It will be sunny ... 0.12
#> 5 Deep learning models require large datasets The weather forecast ... 0.11
#> ...
Texts about related topics (ML and healthcare, ML and statistics) score highly, while unrelated pairs (ML and weather) score near zero. This is the foundation for semantic search and document deduplication.
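As a sketch of the deduplication idea, you could flag high-similarity pairs with dplyr (the 0.8 cutoff is an arbitrary illustration, not a package default):
library(dplyr)
# Keep only pairs above an arbitrary similarity cutoff
hf_similarity(embeddings) |>
  filter(similarity > 0.8) |>
  arrange(desc(similarity))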
Tidytext Integration: hf_embed_text()
For data frame workflows, hf_embed_text() adds
embeddings directly to an existing tibble. This is the recommended entry
point when you already have structured data.
Embedding a Data Frame
docs <- tibble(
doc_id = 1:6,
category = c("tech", "tech", "food", "food", "travel", "travel"),
text = c(
"Neural networks power modern AI systems",
"Cloud computing enables scalable applications",
"Fresh pasta requires only flour and eggs",
"Sourdough bread needs a mature starter",
"Tokyo offers incredible street food and temples",
"The Swiss Alps provide world-class hiking trails"
)
)
docs_embedded <- docs |>
hf_embed_text(text)
The result retains all original columns and adds embedding and n_dims.
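Because the metadata survives, you can relate similarity scores back to it. The sketch below assumes hf_similarity() accepts the embedded tibble unchanged, then joins each text's category onto both sides of every pair:
# Attach categories to both members of each pair
pairs <- hf_similarity(docs_embedded) |>
  left_join(docs |> select(text_1 = text, cat_1 = category), by = "text_1") |>
  left_join(docs |> select(text_2 = text, cat_2 = category), by = "text_2")
# Same-category pairs should score higher on average
pairs |>
  group_by(same_category = cat_1 == cat_2) |>
  summarise(mean_similarity = mean(similarity))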
Semantic Search with hf_nearest_neighbors()
Find the documents most similar to a query string. The function embeds the query, computes cosine similarity against all documents, and returns the top matches.
docs_embedded |>
hf_nearest_neighbors("artificial intelligence applications", k = 3)
#> # A tibble: 3 x 4
#> doc_id category text similarity
#> <int> <chr> <chr> <dbl>
#> 1 1 tech Neural networks power modern AI systems 0.78
#> 2 2 tech Cloud computing enables scalable applic... 0.52
#> 3 5 travel Tokyo offers incredible street food ... 0.18
This pattern is useful for FAQ matching, document retrieval, and recommendation systems.
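For instance, a minimal FAQ-matching sketch maps several queries to their single best document (purrr is assumed; the queries are illustrative):
library(purrr)
queries <- c("best place to hike", "how to make bread at home")
# Best match (k = 1) for each query, stacked into one tibble
queries |>
  map(\(q) docs_embedded |> hf_nearest_neighbors(q, k = 1)) |>
  list_rbind() |>
  mutate(query = queries, .before = 1) |>
  select(query, text, similarity)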
Clustering with hf_cluster_texts()
hf_cluster_texts() performs k-means clustering on the
embedding vectors, grouping semantically similar texts together. The
data must already have an embedding column (from
hf_embed_text() or hf_embed()).
articles <- tibble(
id = 1:12,
text = c(
# Technology cluster
"New AI chip doubles processing speed",
"Quantum computing reaches error correction milestone",
"Open-source language model rivals proprietary alternatives",
"Cybersecurity threats increase with IoT adoption",
# Health cluster
"Mediterranean diet linked to reduced heart disease risk",
"New gene therapy shows promise for rare blood disorders",
"Sleep quality affects cognitive performance in older adults",
"Vaccine development accelerates with mRNA technology",
# Environment cluster
"Arctic ice loss accelerates beyond model predictions",
"Renewable energy capacity surpasses coal globally",
"Ocean acidification threatens coral reef ecosystems",
"Urban forests reduce city temperatures by up to 5 degrees"
)
)
clustered <- articles |>
hf_embed_text(text) |>
hf_cluster_texts(k = 3)
clustered |>
select(id, text, cluster) |>
arrange(cluster)
The resulting cluster column assigns each text to a group. Texts about similar topics should be assigned to the same cluster.
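A quick sanity check is to count the cluster sizes; with a corpus as well separated as this one, each cluster should hold the four related headlines (the labels themselves are arbitrary across runs):
# How many texts landed in each cluster?
clustered |>
  count(cluster)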
Topic Extraction with hf_extract_topics()
hf_extract_topics() builds on clustering by extracting
representative keywords for each cluster. This requires the
tidytext package for tokenization and TF-IDF
computation.
library(tidytext)
topics <- articles |>
hf_embed_text(text) |>
hf_extract_topics(text_col = "text", k = 3, top_n = 5)
topics
#> # A tibble: 15 x 3
#> cluster word tf_idf
#> <int> <chr> <dbl>
#> 1 1 ai 0.231
#> 2 1 computing 0.198
#> 3 1 chip 0.165
#> ...
Each cluster is described by the words most distinctive to its documents, making it straightforward to assign human-readable topic labels.
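One way to turn the keywords into quick labels is to collapse each cluster's top words, a small dplyr sketch over the tibble shown above:
# Concatenate each cluster's top words into a short label
topics |>
  group_by(cluster) |>
  summarise(label = paste(word, collapse = ", "))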
Visualization with hf_embed_umap()
hf_embed_umap() reduces high-dimensional embeddings to
2D coordinates using UMAP (Uniform Manifold Approximation and
Projection), suitable for scatter plot visualization. This function
requires the uwot package.
library(ggplot2)
texts <- c(
# Animals
"cats are independent pets", "dogs are loyal companions",
"goldfish are low-maintenance pets", "parrots can mimic speech",
# Vehicles
"sedans are practical family cars", "trucks haul heavy loads",
"bicycles reduce carbon emissions", "motorcycles offer speed and freedom",
# Food
"pizza is a popular dinner choice", "sushi requires fresh fish",
"tacos feature various fillings", "pasta comes in many shapes"
)
coords <- hf_embed_umap(texts, n_neighbors = 4, min_dist = 0.1)
ggplot(coords, aes(umap_1, umap_2, label = text)) +
geom_point(size = 2) +
geom_text(hjust = 0, nudge_x = 0.02, size = 3) +
theme_minimal() +
labs(title = "UMAP Projection of Text Embeddings",
x = "UMAP 1", y = "UMAP 2")Semantically related texts cluster together in the 2D projection. The
n_neighbors and min_dist parameters control
the trade-off between preserving local versus global structure.
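To see that trade-off, you might re-project with more global settings (the values here are illustrative, not recommendations):
# Larger n_neighbors and min_dist favor global structure over local detail
coords_global <- hf_embed_umap(texts, n_neighbors = 8, min_dist = 0.5)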
Practical Example: Document Similarity Analysis
This end-to-end example combines embedding, clustering, and visualization to analyze a set of research paper abstracts.
library(ggplot2)
# Simulate research abstracts from three fields
abstracts <- tibble(
paper_id = 1:15,
field = rep(c("NLP", "genomics", "climate"), each = 5),
abstract = c(
"Transformer architectures improve machine translation quality",
"Attention mechanisms capture long-range text dependencies",
"Pre-training on large corpora enables few-shot learning",
"Named entity recognition benefits from contextual embeddings",
"Sentiment analysis models generalize across domains",
"CRISPR enables precise genome editing in mammalian cells",
"Single-cell RNA sequencing reveals cell type heterogeneity",
"Epigenetic modifications regulate gene expression patterns",
"Protein folding prediction reaches experimental accuracy",
"Microbiome composition correlates with metabolic health",
"Global temperatures rise faster than model projections",
"Carbon capture technology scales to industrial levels",
"Sea level rise threatens coastal infrastructure worldwide",
"Deforestation reduces regional precipitation patterns",
"Methane emissions from permafrost accelerate warming"
)
)
# Embed, cluster, and project to 2D
result <- abstracts |>
hf_embed_text(abstract) |>
hf_cluster_texts(k = 3)
coords <- hf_embed_umap(abstracts$abstract, n_neighbors = 5)
# Combine for plotting
plot_data <- bind_cols(
result |> select(paper_id, field, cluster),
coords |> select(umap_1, umap_2)
)
ggplot(plot_data, aes(umap_1, umap_2, color = factor(cluster), shape = field)) +
geom_point(size = 3) +
theme_minimal() +
labs(
title = "Research Abstracts by Embedding Cluster",
color = "Cluster",
shape = "True Field"
)
When the embedding model captures domain-specific semantics, the automatically discovered clusters should align with the true research fields.
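A simple check of that alignment is to cross-tabulate the discovered clusters against the known fields; perfect alignment puts all five papers from each field into a single cluster:
# Cross-tabulate cluster assignments against the true fields
result |>
  count(field, cluster) |>
  arrange(field, desc(n))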
See Also
- Getting Started – installation and authentication.
- Hub Discovery, Datasets, and Tidymodels Integration – using embeddings as features in supervised learning pipelines.