Integrating foundryR with onet2r
Source: vignettes/articles/onet2r-integration.Rmd
Why combine the two packages
onet2r reads the U.S. Department of Labor’s O*NET database: occupation titles, descriptions, skills, tasks, and technology requirements for roughly a thousand occupations. It returns tidy tibbles.
foundryR turns text into data with Azure AI Foundry: embeddings for semantic comparison, and chat completions for summarization. The two fit together naturally because onet2r produces the text that foundryR reasons over, and both speak tibbles.
This article builds one end-to-end workflow: pull real occupations from O*NET, embed their titles, and rank them against a plain-language query by meaning rather than by keyword. It closes by summarizing the top match’s real O*NET description with a chat model.
Setup
onet2r is on GitHub, not CRAN. Install both packages with pak:
Each package reads its own credentials from the environment, so nothing secret appears in your code:
library(foundryR)
library(onet2r)
library(dplyr)
# Azure AI Foundry (foundryR)
foundry_set_endpoint(Sys.getenv("AZURE_FOUNDRY_ENDPOINT"))
foundry_set_key(Sys.getenv("AZURE_FOUNDRY_KEY"))
# O*NET (onet2r) reads ONET_API_KEY. Register for a free key at
# https://services.onetcenter.org/developer/ then set it for the session
# (or store it in ~/.Renviron to keep it out of scripts):
Sys.setenv(ONET_API_KEY = "your-onet-key")
Reading occupation data from O*NET
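Requests to either service will fail if a credential from Setup is missing, so it can help to check the environment before the first call. The sketch below uses base R only; `check_env()` is a hypothetical helper written for this article, not a function from foundryR or onet2r, and the variable names are the ones used in the setup code.

```r
# Hypothetical helper (not part of foundryR or onet2r): stop early if any
# required environment variable is unset or empty.
check_env <- function(vars) {
  unset <- vars[!nzchar(Sys.getenv(vars))]
  if (length(unset) > 0) {
    stop("Missing environment variables: ", paste(unset, collapse = ", "),
         call. = FALSE)
  }
  invisible(TRUE)
}

# Variable names taken from the setup code in this article.
required_vars <- c("AZURE_FOUNDRY_ENDPOINT", "AZURE_FOUNDRY_KEY", "ONET_API_KEY")

# Run this once at the top of a session; it errors with the missing names.
# check_env(required_vars)
```

Failing with the name of the missing variable is usually easier to debug than an authentication error returned by the remote service.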
onet_search() matches occupations by keyword and returns a tibble of code and title:
security_matches <- onet_search("information security")
security_matches
onet_occupation() returns the full record for one occupation code as a list. The description field is the plain-language summary O*NET writes for each job:
analyst <- onet_occupation("15-1212.00")
analyst$title
substr(analyst$description, 1, 220)
Building a semantic search index
Keyword search only finds occupations whose titles contain the words you typed. Embeddings find occupations by meaning. Start by pulling a block of occupations and embedding their titles.
onet_occupations() lists occupations in O*NET-SOC code order; the first 150 codes span management, business, and computer occupations.
occupations <- onet_occupations(end = 150)
title_embeddings <- foundry_embed(
  occupations$title,
  model = "text-embedding-3-small"
)
occupation_index <- occupations |>
  mutate(embedding = title_embeddings$embedding)
occupation_index
foundry_embed() returns one row per input with the vector in a list-column, so attaching it back onto the occupation tibble keeps everything in one frame.
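When many comparisons are needed, a list-column of vectors can also be stacked into a plain numeric matrix. The sketch below is a minimal illustration with made-up three-dimensional vectors and a base data frame standing in for the tibble; the column name `embedding` follows the article, while the toy values and the `toy_embeddings` object are assumptions for demonstration only.

```r
# Toy stand-in for the result of foundry_embed(): one row per input, with the
# vector stored in a list-column. Real embeddings are much longer.
toy_embeddings <- data.frame(text = c("Chief Executives", "Web Developers"))
toy_embeddings$embedding <- list(c(0.1, 0.2, 0.3), c(0.4, 0.5, 0.6))

# Stack the list-column into a numeric matrix: one row per input text.
embedding_matrix <- do.call(rbind, toy_embeddings$embedding)
rownames(embedding_matrix) <- toy_embeddings$text

dim(embedding_matrix)  # rows = inputs, columns = embedding dimensions
```

Keeping the list-column is convenient for dplyr pipelines; the matrix form is convenient for linear algebra over the whole index at once.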
Searching by meaning
Embed a natural-language query the same way, then rank every occupation by cosine similarity to it. The query below shares no words with the official titles it should surface.
query <- "protecting company networks and data from hackers and cyber attacks"
query_vec <- foundry_embed(query, model = "text-embedding-3-small")$embedding[[1]]
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
ranked <- occupation_index |>
  mutate(similarity = vapply(embedding, cosine, numeric(1), b = query_vec)) |>
  arrange(desc(similarity)) |>
  select(code, title, similarity)
head(ranked, 5)
The top results are the security- and network-focused occupations, even though the query never uses their title words. That is the advantage of embeddings over keyword lookup: “protecting data from hackers” resolves to “Information Security Analysts” on meaning alone.
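The ranking step can be checked offline on toy data. The sketch below computes cosine similarity for every row of a small matrix in one vectorized step, as an alternative to calling a per-row function. The three-dimensional vectors are invented for illustration and are not real embeddings, so only the mechanics carry over.

```r
# Toy 3-dimensional "embeddings"; real ones have many more dimensions.
titles <- c("Information Security Analysts", "Chief Executives", "Web Developers")
embedding_matrix <- rbind(
  c(0.9, 0.1, 0.0),
  c(0.0, 1.0, 0.2),
  c(0.5, 0.0, 0.5)
)
query_vec <- c(1, 0, 0)

# Cosine similarity of every row against the query in one step:
# dot products divided by the product of the two vector lengths.
row_norms <- sqrt(rowSums(embedding_matrix^2))
similarity <- as.vector(embedding_matrix %*% query_vec) /
  (row_norms * sqrt(sum(query_vec^2)))

# Titles ordered from most to least similar to the query.
titles[order(similarity, decreasing = TRUE)]
```

For a few hundred occupations the per-row version is fast enough; the matrix form mainly pays off when the index grows or many queries are scored.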
Inspecting skills for a match
Skill, knowledge, and task endpoints each return a tidy tibble, so they drop straight into a dplyr pipeline. Here are the first rows of the skills table for the top match:
top_code <- ranked$code[1]
onet_skills(top_code) |>
  head()
Summarizing the match with a chat model
The pieces combine cleanly: take the real O*NET description for the top match and ask a chat model to rewrite it for a specific audience. foundry_chat() takes the prompt as message and returns a tibble; the generated text is in content.
top_occupation <- onet_occupation(top_code)
# Named career_summary to avoid masking base::summary()
career_summary <- foundry_chat(
  message = paste(
    "Summarize this occupation for someone considering a career change,",
    "in two sentences:\n\n",
    top_occupation$description
  )
)
cat(career_summary$content)
Where to take it
This workflow generalizes past a single query. Embed the whole occupation index once, cache it, and you have a reusable semantic job-matcher: score a resume or a free-text career interest against every occupation, cluster occupations by skill profile, or flag near-duplicate roles. onet2r supplies the authoritative text and foundryR turns it into comparable numbers and readable summaries, all in tibbles.
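One way to do the "embed once, cache it" step is to save the result to disk and reload it on later runs. The sketch below is a minimal version using base R's saveRDS() and readRDS(). `cached_embed()` is a hypothetical helper, not part of foundryR, and `fake_embed()` is a stand-in so the example runs without an API call; in practice `embed_fn` would wrap foundry_embed().

```r
# Hypothetical helper (not part of foundryR): compute embeddings once, then
# reuse the saved copy on later runs. `embed_fn` is any function that takes a
# character vector and returns the embeddings.
cached_embed <- function(texts, path, embed_fn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- embed_fn(texts)
  saveRDS(result, path)
  result
}

# Stand-in embedder so the sketch runs offline; it counts how often it is called.
calls <- 0
fake_embed <- function(texts) {
  calls <<- calls + 1
  lapply(texts, function(x) c(nchar(x), 1))
}

cache_path <- tempfile(fileext = ".rds")
first  <- cached_embed(c("Web Developers", "Actuaries"), cache_path, fake_embed)
second <- cached_embed(c("Web Developers", "Actuaries"), cache_path, fake_embed)
identical(first, second)  # the second call reads the cache file
```

This sketch does not detect changes to `texts` or to the embedding model, so delete the cache file (or key the file name on both) whenever either changes.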