Audio is one of the most useful Foundry additions for researchers. You can transcribe interviews, lectures, field recordings, meetings, and focus groups, then keep the result in a tibble with segment-level timing for downstream coding or analysis.
Two ways to reach speech models
foundryR can use audio models in two places:
- Azure OpenAI deployments on your main resource: whisper for transcription and translation, and a text-to-speech model such as gpt-4o-mini-tts for synthesis. These reuse your main endpoint and key and are what the examples below use. Classic whisper is exposed only on the deployment path, so pass api = "deployment".
- A dedicated Speech (LLM Speech) resource for MAI-Transcribe models.
This chunk is illustrative and is not run:
# Only needed for MAI-Transcribe on a dedicated Speech resource:
foundry_set_speech_endpoint(Sys.getenv("AZURE_FOUNDRY_SPEECH_ENDPOINT"))
foundry_set_speech_key("your-speech-key")
A real, public-domain sample
The examples below use a short excerpt from John F. Kennedy’s 1961 inaugural address (“And so, my fellow Americans…”). This clip ships with the package and is the de facto “hello, world” of open-source speech recognition, so the transcript is easy to check against a recording everyone knows.
sample_audio <- system.file("extdata/samples/jfk.wav", package = "foundryR")
basename(sample_audio)
#> [1] "jfk.wav"
Transcribe an audio file
foundry_transcribe() returns one row per file. The
text column holds the transcript and the
phrases list-column holds segment-level timing. We use the
whisper deployment on the main resource; because classic
whisper lives on the deployment path we pass
api = "deployment", and
response_format = "verbose_json" asks the service for
per-segment timing.
transcript <- foundry_transcribe(
sample_audio,
service = "openai",
model = "whisper",
api = "deployment",
response_format = "verbose_json"
)
transcript$text
#> [1] "And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country."
The segment timing lives in the phrases list-column, one
row per recognized segment. A short clip like this one is a single
segment; longer recordings return many:
head(transcript$phrases[[1]])
#> # A tibble: 1 × 7
#> text locale offset_ms duration_ms confidence speaker words
#> <chr> <chr> <int> <int> <dbl> <chr> <list>
#> 1 " And so my fellow Ame… NA 0 11000 NA NA <list>
Synthesize speech
foundry_speak() writes binary audio to disk and returns
the file path and byte count – handy for experiment stimuli,
accessibility assets, and demos. Use your text-to-speech deployment name
for model.
speech <- foundry_speak(
"Hello, world.",
model = "gpt-4o-mini-tts",
voice = "alloy",
path = tempfile(fileext = ".mp3")
)
speech[, c("bytes", "model", "voice", "format")]
#> # A tibble: 1 × 4
#> bytes model voice format
#> <int> <chr> <chr> <chr>
#> 1 25728 gpt-4o-mini-tts alloy mp3
Those bytes are the real audio the model returned.
Translate multilingual recordings
Use foundry_translate_audio() when you want an analysis
corpus in a common language. To keep the example fully reproducible we
first synthesize a short Spanish clip, then translate it to English with
whisper – both are real API calls.
spanish_clip <- foundry_speak(
"La reunion fue muy util.",
model = "gpt-4o-mini-tts",
voice = "alloy",
path = tempfile(fileext = ".mp3")
)
Now translate it to English with the whisper deployment:
translation <- foundry_translate_audio(
spanish_clip$path,
service = "openai",
model = "whisper",
api = "deployment"
)
translation$text
#> [1] "The meeting was very useful."
Notes for researchers
- Inspect head(transcript$phrases[[1]]) before processing long recordings so you know the segment structure your coding scheme has to handle.
- Keep raw audio out of your project repository; store transcripts and IDs.
- Request response_format = "verbose_json" to get segment-level timing from whisper; the default format returns the transcript text only.
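As a minimal sketch of what inspecting the segment structure can lead to, the base R snippet below turns millisecond offsets into readable start and end timestamps, which is often the form a qualitative coding sheet expects. It does not call the API: the phrases data frame is a stand-in with invented values that only mirrors the text, offset_ms, and duration_ms columns shown in the output earlier, and format_ms() is a hypothetical helper, not part of foundryR.

```r
# Stand-in for transcript$phrases[[1]]; the rows and timings are invented
# for illustration and only mirror the column names shown earlier.
phrases <- data.frame(
  text = c(" And so my fellow Americans,",
           " ask not what your country can do for you,"),
  offset_ms = c(0L, 3500L),
  duration_ms = c(3500L, 4200L),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a foundryR function):
# format a millisecond count as "MM:SS.mmm".
format_ms <- function(ms) {
  total_s <- ms %/% 1000
  sprintf(
    "%02d:%02d.%03d",
    as.integer(total_s %/% 60),
    as.integer(total_s %% 60),
    as.integer(ms %% 1000)
  )
}

# One row per segment with readable start/end times and trimmed text.
segments <- data.frame(
  start = format_ms(phrases$offset_ms),
  end = format_ms(phrases$offset_ms + phrases$duration_ms),
  text = trimws(phrases$text),
  stringsAsFactors = FALSE
)

print(segments)
```

With a real transcript, the same steps would presumably apply to transcript$phrases[[1]] in place of the stand-in, provided the columns match what your deployment returns.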