Working with Embeddings • foundryR

What Are Embeddings?

Embeddings are numerical representations of text that capture semantic meaning. When you convert text to an embedding, you get a vector of numbers (typically 1,536 or 3,072 dimensions depending on the model) where similar meanings result in similar vectors.

This enables powerful capabilities:

Semantic Search: Find documents related to a query by meaning, not just keywords
Similarity Comparison: Measure how similar two pieces of text are
Clustering: Group related documents together
Classification: Use embeddings as features for machine learning models

Unlike keyword matching, embeddings place “automobile” and “car” close together because they mean nearly the same thing, even though neither word contains the other.

Generating Embeddings with foundry_embed()

The foundry_embed() function converts text into embedding vectors:

library(foundryR)

# Single text
embedding <- foundry_embed("Machine learning is transforming industries",
                           model = "my-embeddings")
embedding
#> # A tibble: 1 x 3
#>   text                                       embedding      n_dims
#>   <chr>                                      <list>          <int>
#> 1 Machine learning is transforming industries <dbl [1,536]>   1536

The result is a tibble with:

text: The original input text
embedding: A list-column containing the numeric vector
n_dims: The dimensionality of the embedding
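
Because embedding is a list-column, each cell holds a whole numeric vector rather than a single value. The sketch below builds a small stand-in data frame with the same three columns and shows how to pull out one vector or stack all of them into a matrix. The object name mock and its 4-dimensional values are made up for illustration; they are not real model output.

```r
# Stand-in for the result of foundry_embed(); values are made up for illustration
mock <- data.frame(
  text = c("first text", "second text"),
  stringsAsFactors = FALSE
)
mock$embedding <- list(
  c(0.1, 0.2, 0.3, 0.4),
  c(0.4, 0.3, 0.2, 0.1)
)
mock$n_dims <- lengths(mock$embedding)

# [[ ]] extracts the numeric vector itself; [ ] would return a one-element list
first_vec <- mock$embedding[[1]]
is.numeric(first_vec)
#> [1] TRUE
length(first_vec)
#> [1] 4

# Stack every vector into a matrix with one row per text
emb_matrix <- do.call(rbind, mock$embedding)
dim(emb_matrix)
#> [1] 2 4
```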

Embedding Multiple Texts

Pass a character vector to embed multiple texts in one call:

documents <- c(
  "The quick brown fox jumps over the lazy dog",
  "A fast auburn fox leaps above a sleepy canine",
  "The stock market closed higher today",
  "Financial markets saw gains in afternoon trading"
)

doc_embeddings <- foundry_embed(documents, model = "my-embeddings")
doc_embeddings
#> # A tibble: 4 x 3
#>   text                                            embedding      n_dims
#>   <chr>                                           <list>          <int>
#> 1 The quick brown fox jumps over the lazy dog     <dbl [1,536]>    1536
#> 2 A fast auburn fox leaps above a sleepy canine   <dbl [1,536]>    1536
#> 3 The stock market closed higher today            <dbl [1,536]>    1536
#> 4 Financial markets saw gains in afternoon trading <dbl [1,536]>   1536

Controlling Dimensions

Some embedding models (like text-embedding-3-small and text-embedding-3-large) support reducing the number of output dimensions. Fewer dimensions mean faster similarity computations and less storage, at some cost in precision:

# Reduce to 256 dimensions (model must support this)
compact_embedding <- foundry_embed(
  "Hello, world!",
  model = "my-text-embedding-3-small",
  dimensions = 256
)
compact_embedding$n_dims
#> [1] 256

Computing Similarity with foundry_similarity()

Cosine similarity measures how closely two embeddings point in the same direction, with values ranging from -1 (opposite direction) to 1 (same direction). The foundry_similarity() function computes pairwise similarities for all embeddings in a tibble:

texts <- c(
  "I love programming in R",
  "R is my favorite language for data analysis",
  "The weather is nice today",
  "It's sunny and warm outside"
)

embeddings <- foundry_embed(texts, model = "my-embeddings")
similarities <- foundry_similarity(embeddings)
similarities
#> # A tibble: 6 x 3
#>   text_1                                   text_2                              similarity
#>   <chr>                                    <chr>                                    <dbl>
#> 1 The weather is nice today                It's sunny and warm outside              0.934
#> 2 I love programming in R                  R is my favorite language for da...      0.912
#> 3 I love programming in R                  The weather is nice today                0.723
#> 4 I love programming in R                  It's sunny and warm outside              0.698
#> 5 R is my favorite language for data an... The weather is nice today                0.687
#> 6 R is my favorite language for data an... It's sunny and warm outside              0.672

Results are sorted by similarity (highest first). Notice how texts about similar topics (R programming, weather) have higher similarity scores with each other.
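
To make the calculation concrete, here is a minimal base-R sketch of pairwise cosine similarity on made-up 3-dimensional vectors. It is not the implementation behind foundry_similarity(), only an illustration of the formula; the helper name cosine_sim and the toy vectors are assumptions. Four inputs yield choose(4, 2) = 6 unique pairs, which is why the output above has six rows.

```r
# Cosine similarity: dot product divided by the product of the vector lengths
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Made-up vectors: two "R" texts point one way, two "weather" texts another
toy <- list(
  r_1       = c(0.9, 0.1, 0.0),
  r_2       = c(0.8, 0.2, 0.1),
  weather_1 = c(0.1, 0.9, 0.2),
  weather_2 = c(0.0, 0.8, 0.3)
)

# All unique pairs (each column of `pairs` holds one pair of names)
pairs <- combn(names(toy), 2)
scores <- data.frame(
  text_1 = pairs[1, ],
  text_2 = pairs[2, ],
  similarity = apply(pairs, 2, function(p) cosine_sim(toy[[p[1]]], toy[[p[2]]])),
  stringsAsFactors = FALSE
)

# Sort highest first
scores <- scores[order(scores$similarity, decreasing = TRUE), ]
nrow(scores)
#> [1] 6
```

With these toy values, the two within-topic pairs land at the top of the sorted table and every cross-topic pair scores well below them.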

Use Case: Finding Similar Documents

A common application is finding documents most similar to a query. Here is how to implement a simple semantic search:

library(dplyr)

# Your document collection
documents <- c(
  "How to install R packages using install.packages()",
  "Data visualization with ggplot2 in R",
  "Introduction to machine learning with Python",
  "Statistical hypothesis testing explained",
  "Building web applications with Shiny",
  "Deep learning with TensorFlow and Keras"
)

# Embed all documents
doc_embeddings <- foundry_embed(documents, model = "my-embeddings")

# User query
query <- "How do I create charts and graphs in R?"
query_embedding <- foundry_embed(query, model = "my-embeddings")

# Compute similarity between query and each document
compute_similarity <- function(emb1, emb2) {
  sum(emb1 * emb2) / (sqrt(sum(emb1^2)) * sqrt(sum(emb2^2)))
}

query_vec <- query_embedding$embedding[[1]]

results <- doc_embeddings %>%
  mutate(
    similarity = vapply(embedding, function(e) compute_similarity(query_vec, e), numeric(1))
  ) %>%
  arrange(desc(similarity)) %>%
  select(text, similarity)

# Top 3 most relevant documents
head(results, 3)
#> # A tibble: 3 x 2
#>   text                                          similarity
#>   <chr>                                              <dbl>
#> 1 Data visualization with ggplot2 in R              0.891
#> 2 How to install R packages using install.packages() 0.812
#> 3 Building web applications with Shiny              0.756
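
For larger collections, the per-document loop over the list-column can be replaced by a single matrix product. The sketch below is one way to do that, using made-up 3-dimensional vectors in place of real embeddings; normalize_rows and the toy values are illustrative and not part of foundryR. Once every row is scaled to unit length, its dot product with a unit-length query equals the cosine similarity.

```r
# Scale each row of a matrix to unit (L2) length
normalize_rows <- function(m) {
  m / sqrt(rowSums(m^2))
}

# Made-up document vectors, one row per document
doc_matrix <- rbind(
  ggplot2 = c(0.9, 0.3, 0.1),
  install = c(0.5, 0.7, 0.2),
  python  = c(0.1, 0.2, 0.9)
)
query_vec <- c(0.8, 0.4, 0.1)

# One matrix product gives the cosine similarity to every document
doc_unit   <- normalize_rows(doc_matrix)
query_unit <- query_vec / sqrt(sum(query_vec^2))
similarity <- drop(doc_unit %*% query_unit)

# Rank documents from most to least similar
ranked <- sort(similarity, decreasing = TRUE)
names(ranked)
#> [1] "ggplot2" "install" "python"
```

With real embeddings, doc_matrix would come from do.call(rbind, doc_embeddings$embedding), and the normalized matrix can be computed once and reused for every query.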

Use Case: Clustering Text

Embeddings work well as features for clustering algorithms. Here is an example using stats::kmeans() to group similar texts:

library(dplyr)

# Sample texts to cluster
texts <- c(
  # Tech/Programming cluster
  "Python is great for machine learning",
  "R excels at statistical analysis",
  "JavaScript powers modern web applications",
  "SQL is essential for database queries",
  # Food cluster
  "Italian pasta with tomato sauce",
  "Sushi is a popular Japanese dish",
  "French croissants are flaky and buttery",
  "Mexican tacos with fresh salsa",
  # Sports cluster
  "Soccer is the world's most popular sport",
  "Basketball requires speed and agility",
  "Tennis matches can last for hours",
  "Swimming is excellent cardiovascular exercise"
)

# Generate embeddings
embeddings <- foundry_embed(texts, model = "my-embeddings")

# Convert list-column to matrix for kmeans
embedding_matrix <- do.call(rbind, embeddings$embedding)

# Cluster into 3 groups
set.seed(42)
clusters <- kmeans(embedding_matrix, centers = 3, nstart = 10)

# Add cluster assignments to our data
results <- embeddings %>%
  mutate(cluster = clusters$cluster) %>%
  arrange(cluster) %>%
  select(text, cluster)

results
#> # A tibble: 12 x 2
#>    text                                       cluster
#>    <chr>                                        <int>
#>  1 Python is great for machine learning             1
#>  2 R excels at statistical analysis                 1
#>  3 JavaScript powers modern web applications        1
#>  4 SQL is essential for database queries            1
#>  5 Italian pasta with tomato sauce                  2
#>  6 Sushi is a popular Japanese dish                 2
#>  7 French croissants are flaky and buttery          2
#>  8 Mexican tacos with fresh salsa                   2
#>  9 Soccer is the world's most popular sport         3
#> 10 Basketball requires speed and agility            3
#> 11 Tennis matches can last for hours                3
#> 12 Swimming is excellent cardiovascular exercise    3

The algorithm successfully grouped texts by topic (programming, food, sports) without any labeled training data. The cluster numbers themselves are arbitrary labels and may differ between runs or seeds.
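
One caveat: kmeans() groups points by Euclidean distance, whereas the similarity scores earlier in this article are cosine-based. If the vectors are scaled to unit length first, the two agree, because for unit vectors the squared Euclidean distance equals 2 - 2 * cosine similarity. The sketch below checks that identity and clusters made-up 2-dimensional vectors; the names toy_matrix and unit_matrix are illustrative. Whether a given model already returns unit-length vectors is something to verify rather than assume.

```r
# Made-up 2-D "embeddings": two directions, with different magnitudes
toy_matrix <- rbind(
  c(1.0, 0.1), c(3.0, 0.2), c(2.0, 0.3),   # pointing along x
  c(0.1, 1.0), c(0.3, 3.0), c(0.2, 2.0)    # pointing along y
)

# Scale every row to unit length so only direction matters
unit_matrix <- toy_matrix / sqrt(rowSums(toy_matrix^2))

# For unit vectors: squared Euclidean distance = 2 - 2 * cosine similarity
a <- unit_matrix[1, ]
b <- unit_matrix[4, ]
all.equal(sum((a - b)^2), 2 - 2 * sum(a * b))
#> [1] TRUE

# Cluster on direction rather than magnitude
set.seed(42)
toy_clusters <- kmeans(unit_matrix, centers = 2, nstart = 10)
toy_clusters$cluster
```

The first three rows should share one cluster label and the last three the other; which group is labelled 1 or 2 is arbitrary.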

Tips for Working with Embeddings

Choosing a Model

Azure AI Foundry offers several embedding models:

Model	Dimensions	Notes
text-embedding-ada-002	1,536	Previous generation, widely used
text-embedding-3-small	1,536 (configurable)	Newer, supports dimension reduction
text-embedding-3-large	3,072 (configurable)	Highest quality, more expensive

For most use cases, text-embedding-3-small offers a good balance of quality and cost.

Dimension Trade-offs

Higher-dimensional embeddings generally capture more nuance, but they:

Require more storage space
Take longer to compute similarities
May not significantly improve results for simple tasks

Consider using reduced dimensions (256-512) for large-scale applications where speed matters more than precision.
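
A quick back-of-the-envelope calculation shows how much the dimension count matters at scale. The sketch assumes embeddings are held in memory as R doubles (8 bytes per value) and uses a hypothetical collection of one million documents; the helper name embedding_size_gb is illustrative, and on-disk or database formats may use less space.

```r
# Approximate in-memory size (in GiB) of an embedding matrix stored as doubles
embedding_size_gb <- function(n_docs, n_dims, bytes_per_value = 8) {
  n_docs * n_dims * bytes_per_value / 1024^3
}

n_docs <- 1e6  # hypothetical collection size
dims <- c(256, 1536, 3072)
sizes <- sapply(dims, function(d) embedding_size_gb(n_docs, d))
names(sizes) <- dims
round(sizes, 2)
```

Under these assumptions the three sizes come to roughly 1.9, 11.4, and 22.9 GiB, so dropping from 1,536 to 256 dimensions cuts memory use by a factor of six.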

Handling Large Document Collections

For large collections, consider:

Batch processing: Embed documents in batches to avoid rate limits
Caching: Store embeddings in a database rather than regenerating them
Approximate nearest neighbors: Use libraries like RcppAnnoy for faster similarity search on large datasets

# Example: Processing in batches
batch_embed <- function(texts, model, batch_size = 100) {
  n_batches <- ceiling(length(texts) / batch_size)
  results <- vector("list", n_batches)

  for (i in seq_len(n_batches)) {
    start_idx <- (i - 1) * batch_size + 1
    end_idx <- min(i * batch_size, length(texts))
    batch_texts <- texts[start_idx:end_idx]

    results[[i]] <- foundry_embed(batch_texts, model = model)

    # Brief pause between batches to respect rate limits (skipped after the last batch)
    if (i < n_batches) Sys.sleep(0.5)
  }

  dplyr::bind_rows(results)
}
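
Caching can be as simple as saving the result to disk and reloading it on later runs. The following sketch wraps any embedding function in a file-based cache using saveRDS() and readRDS(). The names cached_embed, embed_fn, and fake_embed are all hypothetical; in practice embed_fn would be something like function(x) foundry_embed(x, model = "my-embeddings").

```r
# Return cached embeddings if the file exists; otherwise compute and save them.
# `embed_fn` is any function that takes a character vector and returns embeddings.
cached_embed <- function(texts, embed_fn, cache_file) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  result <- embed_fn(texts)
  saveRDS(result, cache_file)
  result
}

# Stand-in embedder for demonstration only: counts its calls and
# returns a placeholder data frame instead of real embeddings
n_calls <- 0
fake_embed <- function(texts) {
  n_calls <<- n_calls + 1
  data.frame(text = texts, n_chars = nchar(texts), stringsAsFactors = FALSE)
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached_embed(c("alpha", "beta"), fake_embed, cache_file)  # computes
second <- cached_embed(c("alpha", "beta"), fake_embed, cache_file)  # reads cache
n_calls
#> [1] 1
```

This simple version does not notice when the input texts change. To refresh, delete the cache file or derive its name from a hash of the inputs.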

Preprocessing Text

For best results:

Remove excessive whitespace and special characters
Consider whether to include or exclude punctuation based on your use case
For long documents, embed meaningful chunks (paragraphs, sentences) rather than entire documents
Normalize text (lowercase) if case distinctions are not important for your application
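
The steps above might look like this in base R. This is a minimal sketch: the helper names clean_text and split_paragraphs are illustrative, and splitting on blank lines is just one simple chunking rule among many.

```r
# Collapse runs of whitespace into single spaces and trim the ends
clean_text <- function(x) {
  trimws(gsub("\\s+", " ", x))
}

# Split one document into paragraph chunks on blank lines, then clean each chunk
split_paragraphs <- function(doc) {
  chunks <- strsplit(doc, "\n\\s*\n")[[1]]
  chunks <- clean_text(chunks)
  chunks[nzchar(chunks)]  # drop any empty chunks
}

doc <- "First   paragraph,\nstill the first.\n\n  Second paragraph.  \n\n\n"
chunks <- split_paragraphs(doc)
chunks
#> [1] "First paragraph, still the first." "Second paragraph."

# Optional: lowercase when case carries no meaning for the task
tolower(chunks[2])
#> [1] "second paragraph."
```

The resulting chunks vector can be passed straight to foundry_embed() in place of the whole document.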