Introduction
foundryR integrates seamlessly with the tidymodels ecosystem through
step_foundry_embed(), a recipe step that converts text
columns into embedding vectors. This enables you to incorporate
state-of-the-art text representations into your machine learning
pipelines alongside traditional preprocessing steps.
Why Use Embeddings in ML Pipelines?
Traditional text features like bag-of-words or TF-IDF capture word frequencies but miss semantic meaning. Embeddings provide dense vector representations that understand:
- Synonyms: “happy” and “joyful” produce similar vectors
- Context: “bank” (financial) vs “bank” (river) are distinguished
- Relationships: Semantic similarities are preserved in vector space
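The "similar vectors" idea can be made concrete with cosine similarity, the standard way to compare embeddings. Below is a minimal base-R sketch using made-up three-dimensional vectors purely for illustration; real embeddings from an Azure model have hundreds or thousands of dimensions:

```r
# Cosine similarity: dot product divided by the product of magnitudes.
# Values near 1 mean the vectors point in nearly the same direction.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy vectors for illustration only -- not real model output
happy  <- c(0.8, 0.1, 0.2)
joyful <- c(0.7, 0.2, 0.3)
table_ <- c(0.1, 0.9, -0.4)

cosine_sim(happy, joyful)  # high: near-synonyms
cosine_sim(happy, table_)  # much lower: unrelated words
```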
By converting text to embeddings within a recipe, you get:
- Reproducible preprocessing: The embedding step is part of your documented workflow
- Consistent handling: Training and test data are processed identically
- Pipeline integration: Combine with other preprocessing steps seamlessly
Prerequisites
Install tidymodels if you haven't already:

```r
install.packages("tidymodels")
```

Ensure foundryR is configured with your Azure credentials and you have an embedding model deployed.
Basic Usage
Creating a Recipe with Embeddings
Use step_foundry_embed() to add embedding generation to
your recipe:
```r
library(tidymodels)
library(foundryR)

# Sample data with text and outcome
reviews <- tibble(
  text = c(
    "This product is amazing, highly recommend!",
    "Terrible quality, waste of money",
    "Good value for the price",
    "Disappointed with the purchase",
    "Exceeded my expectations",
    "Would not buy again"
  ),
  sentiment = factor(c("positive", "negative", "positive",
                       "negative", "positive", "negative"))
)

# Create recipe with embedding step
recipe_spec <- recipe(sentiment ~ text, data = reviews) %>%
  step_foundry_embed(
    text,
    model = "text-embedding-3-small", # Your deployment name
    keep_original = FALSE             # Remove original text column
  )

recipe_spec
#> Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor          1
#>
#> Operations:
#>
#> Azure AI Foundry embeddings for text
```

Preparing and Baking the Recipe
```r
# Prepare the recipe (generates embeddings for training data)
prepped_recipe <- prep(recipe_spec, training = reviews)

# Bake to transform data
baked_data <- bake(prepped_recipe, new_data = NULL)
baked_data
#> # A tibble: 6 × 1,537
#>   sentiment text_embed_001 text_embed_002 text_embed_003 ... text_embed_1536
#>   <fct>              <dbl>          <dbl>          <dbl>               <dbl>
#> 1 positive          0.0234        -0.0156         0.0089              0.0123
#> 2 negative         -0.0145         0.0234        -0.0067             -0.0089
#> 3 positive          0.0178        -0.0123         0.0145              0.0098
#> ...
```

The text column is replaced with 1,536 numeric embedding columns (the exact number depends on your embedding model).
Complete ML Pipeline Example
Here’s a full example building a sentiment classifier:
```r
library(tidymodels)
library(foundryR)

# Load your data
set.seed(123)
reviews <- tibble(
  review_text = c(
    # Positive reviews
    "Absolutely love this product! Works perfectly.",
    "Great quality and fast shipping. Very satisfied.",
    "Best purchase I've made this year. Highly recommend!",
    "Exceeded all expectations. Will buy again.",
    "Perfect fit and great value for money.",
    # Negative reviews
    "Complete waste of money. Broke after one use.",
    "Terrible customer service. Never buying again.",
    "Poor quality, doesn't work as advertised.",
    "Disappointed. Much smaller than expected.",
    "Arrived damaged and took forever to ship."
  ),
  sentiment = factor(rep(c("positive", "negative"), each = 5))
)

# Split data
splits <- initial_split(reviews, prop = 0.8, strata = sentiment)
train_data <- training(splits)
test_data <- testing(splits)

# Define recipe with embeddings
embedding_recipe <- recipe(sentiment ~ review_text, data = train_data) %>%
  step_foundry_embed(
    review_text,
    model = "text-embedding-3-small",
    keep_original = FALSE
  ) %>%
  step_normalize(all_numeric_predictors()) # Normalize embedding dimensions

# Define model
log_reg_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Create workflow
sentiment_workflow <- workflow() %>%
  add_recipe(embedding_recipe) %>%
  add_model(log_reg_spec)

# Fit the model
fitted_workflow <- fit(sentiment_workflow, data = train_data)

# Make predictions on test data
predictions <- predict(fitted_workflow, test_data) %>%
  bind_cols(test_data)

# Evaluate
predictions %>%
  metrics(truth = sentiment, estimate = .pred_class)
#> # A tibble: 2 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary             1
#> 2 kap      binary             1
```

Advanced Options
Controlling Embedding Dimensions
Some models support dimension reduction for faster processing:
```r
recipe_spec <- recipe(sentiment ~ text, data = reviews) %>%
  step_foundry_embed(
    text,
    model = "text-embedding-3-small",
    dimensions = 256,      # Reduce from 1536 to 256
    keep_original = FALSE
  )
```

Lower dimensions mean:
- Faster model training
- Less memory usage
- Some loss in semantic precision
Multiple Text Columns
Process multiple text columns independently:
```r
# Data with multiple text fields
data <- tibble(
  title = c("Great Product", "Terrible Experience"),
  description = c("Works as expected", "Broke immediately"),
  outcome = c(1, 0)
)

recipe_spec <- recipe(outcome ~ ., data = data) %>%
  step_foundry_embed(title, model = "text-embedding-3-small",
                     prefix = "title_") %>%
  step_foundry_embed(description, model = "text-embedding-3-small",
                     prefix = "desc_") %>%
  step_rm(title, description) # Remove original text columns
```

Keeping Original Columns
Sometimes you want both the text and embeddings:
```r
recipe_spec <- recipe(sentiment ~ text, data = reviews) %>%
  step_foundry_embed(
    text,
    model = "text-embedding-3-small",
    keep_original = TRUE # Keep the text column
  )
# Useful when you also want to apply other text processing
```

Custom Column Prefix
Control the naming of embedding columns:
```r
recipe_spec <- recipe(sentiment ~ text, data = reviews) %>%
  step_foundry_embed(
    text,
    model = "text-embedding-3-small",
    prefix = "embed_" # Columns will be embed_001, embed_002, etc.
  )
```

Cross-Validation
Embeddings are generated during prep(), so
cross-validation works correctly:
```r
# Create CV folds
folds <- vfold_cv(train_data, v = 5, strata = sentiment)

# Fit resamples
cv_results <- fit_resamples(
  sentiment_workflow,
  resamples = folds,
  metrics = metric_set(accuracy, roc_auc)
)

# Collect metrics
collect_metrics(cv_results)
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
#> 1 accuracy binary     0.875     5  0.0559 Preprocessor1_Model1
#> 2 roc_auc  binary     0.925     5  0.0433 Preprocessor1_Model1
```

Hyperparameter Tuning
Tune the embedding dimensions alongside model hyperparameters:
```r
# Recipe with tunable dimensions
tunable_recipe <- recipe(sentiment ~ text, data = train_data) %>%
  step_foundry_embed(
    text,
    model = "text-embedding-3-small",
    dimensions = tune(), # Will be tuned
    keep_original = FALSE
  ) %>%
  step_normalize(all_numeric_predictors())

# Model with tunable parameters
rf_spec <- rand_forest(
  mtry = tune(),
  trees = 500,
  min_n = tune()
) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Workflow
tunable_workflow <- workflow() %>%
  add_recipe(tunable_recipe) %>%
  add_model(rf_spec)

# Define grid
grid <- grid_regular(
  dimensions(range = c(128, 512)), # Embedding dimensions
  mtry(range = c(10, 50)),
  min_n(range = c(2, 10)),
  levels = 3
)

# Tune (this will take a while due to API calls)
tune_results <- tune_grid(
  tunable_workflow,
  resamples = folds,
  grid = grid,
  metrics = metric_set(accuracy, roc_auc)
)

# Best parameters
show_best(tune_results, metric = "roc_auc")
```

Performance Considerations
API Rate Limits
Embedding generation makes API calls for each text. For large datasets:
- Use batch processing: Consider using foundry_embed_batch() outside the recipe for large training sets
- Cache embeddings: Pre-compute and store embeddings for frequently used datasets
- Reduce cross-validation folds: Fewer folds mean fewer API calls during tuning
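One way to avoid repeated API calls during resampling is to embed once up front and model the resulting numeric columns with a plain recipe. The sketch below uses foundry_embed_batch(), mentioned above; it assumes the function returns one numeric vector per input text, so check the return shape in your installed version before relying on this:

```r
library(dplyr)

# Pre-compute embeddings once (assumed return: a list of numeric
# vectors, one per element of the input character vector)
embeddings <- foundry_embed_batch(
  reviews$review_text,
  model = "text-embedding-3-small"
)

# Bind the embeddings as ordinary numeric columns
embed_mat <- do.call(rbind, embeddings)
colnames(embed_mat) <- sprintf("embed_%04d", seq_len(ncol(embed_mat)))
reviews_embedded <- bind_cols(reviews, as_tibble(embed_mat))

# Cache to disk so reruns and resampling loops skip the API entirely
saveRDS(reviews_embedded, "reviews_embedded.rds")
```

Downstream recipes built on reviews_embedded then make no API calls during prep(), so cross-validation and tuning run at normal speed.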
Cost Management
Each embedding call incurs API costs. Strategies to manage costs:
- Start with smaller dimension sizes during development
- Use a subset of data for initial experimentation
- Pre-compute embeddings for production datasets
Memory Usage
With 1,536 dimensions per text and thousands of observations, memory can grow quickly:
```r
# Estimate memory for 10,000 texts
n_texts <- 10000
n_dims <- 1536
bytes_per_double <- 8

memory_mb <- (n_texts * n_dims * bytes_per_double) / 1024^2
print(paste(round(memory_mb), "MB for embeddings alone"))
#> [1] "117 MB for embeddings alone"
```

Consider dimension reduction for large datasets.
Troubleshooting
“Column already exists” Error
If you run prep() multiple times, column names may conflict.
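Two possible workarounds, sketched here on the assumption that the conflict comes from embedding columns left over in the data passed to prep():

```r
# Workaround 1: always prep() a freshly created recipe rather than
# re-prepping one whose training data already contains embedding columns
recipe_spec <- recipe(sentiment ~ text, data = reviews) %>%
  step_foundry_embed(text, model = "text-embedding-3-small")
prepped <- prep(recipe_spec, training = reviews)

# Workaround 2: give each embedding step a distinct prefix so repeated
# steps cannot collide on column names
recipe_spec <- recipe(sentiment ~ text, data = reviews) %>%
  step_foundry_embed(text, model = "text-embedding-3-small",
                     prefix = "text_v2_")
```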
Rate Limit Errors
If you hit rate limits during prep:
```r
# Prepare in smaller batches
small_sample <- reviews %>% slice_sample(n = 100)
prepped <- prep(recipe_spec, training = small_sample)
```

Missing Credentials
Ensure credentials are set before creating recipes:
```r
# Check setup
foundry_check_setup()

# Set credentials if needed
foundry_set_endpoint("https://your-resource.openai.azure.com")
foundry_set_key("your-api-key")
```

Next Steps
- Learn about Text Embeddings in depth
- Explore Content Safety for responsible AI
- Read the tidymodels documentation for more preprocessing options