Processing audio - Tinfoil Documentation

Tinfoil's technical security guarantees make it safe to transcribe or voice sensitive content (medical notes, legal drafts, internal communications) in a way no conventional cloud audio API can match. File transcription and text-to-speech are OpenAI-compatible, so existing clients work by changing only the base URL and key. Realtime transcription is only partially OpenAI-compatible (details below). See Audio models for the full list of models.

Realtime transcription

Voxtral Mini Realtime (voxtral-mini-4b-realtime) streams speech-to-text over a WebSocket: you send PCM16 audio chunks as the user speaks, and partial transcripts stream back word-by-word.

Realtime transcription is currently available only through the tinfoil-js SDK. Contact us if you need it in another SDK.

wss://inference.tinfoil.sh/v1/realtime

Authenticate with your API key as a Bearer token in the Authorization header. The tinfoil-js SDK handles this for you and pins the TLS connection to the attested enclave key:

import { TinfoilAI } from "tinfoil";

const client = new TinfoilAI({ apiKey: "<YOUR_API_KEY>" });
const rt = await client.realtime({ model: "voxtral-mini-4b-realtime" });

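// Log the session handshake sent by the server.
rt.on("session.created", (event) => console.log(event));
// base64AudioChunk is a base64-encoded chunk of PCM16 audio captured elsewhere.
rt.send({ type: "input_audio_buffer.append", audio: base64AudioChunk });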

This is Node.js only: browsers can’t expose TLS certificate details, so the connection can’t be pinned to the attested key, and verified realtime isn’t available in the browser yet. If this is a problem, contact us.

OpenAI Realtime-compatible mode

Connect with ?intent=transcription and the endpoint speaks the OpenAI Realtime transcription dialect. Existing OpenAI Realtime clients should work by changing only the URL and key:

wss://inference.tinfoil.sh/v1/realtime?intent=transcription

The session flow:

1. The server sends session.created.
2. Optionally send session.update declaring your input format (defaults to PCM16 mono at 24 kHz; declare {"type": "audio/pcm", "rate": 16000} if you capture at 16 kHz). The server replies with session.updated.
3. Stream audio with input_audio_buffer.append (base64 PCM16). Partial transcripts arrive as conversation.item.input_audio_transcription.delta events.
4. Send input_audio_buffer.commit to end the utterance. The server replies with input_audio_buffer.committed and a conversation.item.input_audio_transcription.completed event carrying the final transcript. You can then stream the next utterance on the same connection.

import { TinfoilAI } from "tinfoil";

const client = new TinfoilAI({ apiKey: process.env.TINFOIL_API_KEY! });
const rt = await client.realtime({ model: "voxtral-mini-4b-realtime" });

// Partial transcripts stream in as deltas; the final transcript arrives on completion.
rt.on("conversation.item.input_audio_transcription.delta", (event) => {
  process.stdout.write(event.delta);
});
rt.on("conversation.item.input_audio_transcription.completed", (event) => {
  console.log("\nFinal:", event.transcript);
});

// Stream PCM16 audio as base64 chunks, then commit to end the utterance.
// pcm16Chunks is assumed to be an iterable of Buffers holding raw PCM16 audio.
for (const chunk of pcm16Chunks) {
  rt.send({ type: "input_audio_buffer.append", audio: chunk.toString("base64") });
}
rt.send({ type: "input_audio_buffer.commit" });
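The loop above assumes that pcm16Chunks already exists. As a minimal sketch of how such chunks might be prepared, the following splits a raw PCM16 mono buffer into fixed-duration pieces and wraps each one in an input_audio_buffer.append event. It is written in Python only to illustrate the payload shape; the helper names chunk_pcm16 and append_events and the 100 ms chunk duration are assumptions of this sketch, not part of any Tinfoil SDK.

```python
import base64
import json

BYTES_PER_SAMPLE = 2  # PCM16 means 16-bit signed samples


def chunk_pcm16(pcm: bytes, sample_rate: int = 24000, chunk_ms: int = 100) -> list[bytes]:
    """Split raw PCM16 mono audio into chunks of roughly chunk_ms milliseconds."""
    if len(pcm) % BYTES_PER_SAMPLE:
        raise ValueError("PCM16 buffer length must be a multiple of 2 bytes")
    chunk_bytes = (sample_rate * chunk_ms // 1000) * BYTES_PER_SAMPLE
    if chunk_bytes <= 0:
        raise ValueError("chunk duration is too short for this sample rate")
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]


def append_events(chunks: list[bytes]) -> list[str]:
    """Serialize each chunk as an input_audio_buffer.append event (JSON text)."""
    return [
        json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
        for chunk in chunks
    ]


# One second of silence at the default 24 kHz rate, split into 100 ms chunks.
silence = bytes(24000 * BYTES_PER_SAMPLE)
events = append_events(chunk_pcm16(silence))
# The commit event that ends the utterance carries no audio.
commit = json.dumps({"type": "input_audio_buffer.commit"})
```

Smaller chunks reduce the delay before the first partial transcript at the cost of more messages; the 100 ms figure here is only a starting point.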

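Turn detection: this mode supports push-to-talk dictation only. The OpenAI spec offers server_vad turn detection, in which the server detects silence and emits input_audio_buffer.speech_started and input_audio_buffer.speech_stopped events. This endpoint does not do that, so the client must send input_audio_buffer.commit to end each utterance. See the realtime model repo for more.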
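Because the server does not detect silence in this mode, the client decides when to send input_audio_buffer.commit. In push-to-talk that is simply the button release. For hands-free capture, one possible approach is a client-side energy threshold, sketched below in Python. The SilenceCommitter class, the threshold of 500, and the quiet-frame count are all assumptions chosen for illustration; nothing here is provided by the endpoint, and real microphones will need tuning.

```python
import math
import struct


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a little-endian PCM16 frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class SilenceCommitter:
    """Signals a commit once speech has been followed by enough quiet frames."""

    def __init__(self, threshold: float = 500.0, quiet_frames: int = 8):
        self.threshold = threshold        # RMS level treated as speech
        self.quiet_frames = quiet_frames  # consecutive quiet frames before commit
        self._heard_speech = False
        self._quiet_run = 0

    def should_commit(self, frame: bytes) -> bool:
        """Feed one frame; return True when the utterance should be committed."""
        if rms(frame) >= self.threshold:
            self._heard_speech = True
            self._quiet_run = 0
            return False
        if not self._heard_speech:
            return False  # leading silence: nothing to commit yet
        self._quiet_run += 1
        if self._quiet_run >= self.quiet_frames:
            # Reset so the next utterance starts from a clean state.
            self._heard_speech = False
            self._quiet_run = 0
            return True
        return False
```

A caller would pass each captured frame to should_commit before or after sending it, and send the commit event whenever the method returns True.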

Native mode

Connect with ?model=voxtral-mini-4b-realtime instead and the endpoint speaks the leaner vLLM dialect: flat transcription.delta events, a session.update carrying only {"model": ...}, client commits with a final flag, and a transcription.done event (with usage) after the final commit. Audio must be PCM16 mono at 16 kHz. Each connection carries one transcription session.
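Native mode has a fixed input format, so it can help to check source audio before streaming it. Assuming the audio starts life as a WAV file, the following Python sketch uses the standard-library wave module to confirm that it is PCM16 mono at 16 kHz. The function names check_native_format and make_wav are hypothetical helpers for this example; make_wav exists only to produce test input.

```python
import io
import wave


def check_native_format(wav_bytes: bytes) -> list[str]:
    """Return a list of problems; an empty list means PCM16 mono at 16 kHz."""
    problems = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        if w.getnchannels() != 1:
            problems.append(f"expected 1 channel, got {w.getnchannels()}")
        if w.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {8 * w.getsampwidth()}-bit")
        if w.getframerate() != 16000:
            problems.append(f"expected 16000 Hz, got {w.getframerate()} Hz")
    return problems


def make_wav(channels: int, rate: int, seconds: float = 0.1) -> bytes:
    """Build a silent 16-bit WAV in memory, used here only to exercise the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(bytes(int(rate * seconds) * channels * 2))
    return buf.getvalue()
```

Audio that fails the check would need to be downmixed or resampled before it is sent; how to do that is outside the scope of this sketch.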

File transcription

For recorded audio on disk, transcribe the whole file in one request over the OpenAI-compatible /v1/audio/transcriptions endpoint. The model is required: use voxtral-small-24b for accuracy or whisper-large-v3-turbo for fast, lightweight transcription.

from tinfoil import TinfoilAI

client = TinfoilAI(api_key="<YOUR_API_KEY>")

with open("meeting.mp3", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="voxtral-small-24b",  # or "whisper-large-v3-turbo"
        file=audio,
    )

print(result.text)
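For more than one recording, the same call can be wrapped in a loop. The sketch below assumes a client exposing audio.transcriptions.create as in the example above; the transcribe_all helper and the fake_client stand-in are hypothetical and exist only so the example runs offline. In real use you would pass a TinfoilAI client instead.

```python
from pathlib import Path
from types import SimpleNamespace


def transcribe_all(client, paths, model):
    """Transcribe each file in turn and return {file name: transcript text}."""
    transcripts = {}
    for path in map(Path, paths):
        with path.open("rb") as audio:
            result = client.audio.transcriptions.create(model=model, file=audio)
        transcripts[path.name] = result.text
    return transcripts


# Offline stand-in so the sketch runs without network access or an API key.
# It mimics only the call shape used above, not real transcription.
def _fake_create(model, file):
    return SimpleNamespace(text=f"{len(file.read())} bytes via {model}")


fake_client = SimpleNamespace(
    audio=SimpleNamespace(transcriptions=SimpleNamespace(create=_fake_create))
)
```

Requests are issued one at a time here to keep the example simple; the model name is passed explicitly because the endpoint requires it.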

Text-to-speech

Synthesize speech over the OpenAI-compatible /v1/audio/speech endpoint, which returns WAV audio. Use qwen3-tts for low-latency speech or voxtral-tts for the larger multilingual model. Both require a voice.

from tinfoil import TinfoilAI

client = TinfoilAI(api_key="<YOUR_API_KEY>")

response = client.audio.speech.create(
    model="qwen3-tts",
    voice="serena",
    input="Your audio never leaves the enclave.",
)
with open("speech.wav", "wb") as f:
    f.write(response.read())

Voices are the upstream models’ default preset voices. qwen3-tts has 9: aiden, dylan, eric, ono_anna, ryan, serena, sohee, uncle_fu, vivian. voxtral-tts has 20: neutral_female, neutral_male, casual_female, casual_male, cheerful_female, ar_male, de_female, de_male, es_female, es_male, fr_female, fr_male, hi_female, hi_male, it_female, it_male, nl_female, nl_male, pt_female, pt_male.
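Since each model accepts only its own preset voices, a client can catch a mismatched voice before making a request. The following is a minimal sketch that hard-codes the voice lists given above; the VOICES table and check_voice function are illustrative names, and the lists would need updating if the upstream presets change.

```python
# Preset voices per model, copied from the lists above.
VOICES = {
    "qwen3-tts": {
        "aiden", "dylan", "eric", "ono_anna", "ryan",
        "serena", "sohee", "uncle_fu", "vivian",
    },
    "voxtral-tts": {
        "neutral_female", "neutral_male", "casual_female", "casual_male",
        "cheerful_female", "ar_male", "de_female", "de_male",
        "es_female", "es_male", "fr_female", "fr_male",
        "hi_female", "hi_male", "it_female", "it_male",
        "nl_female", "nl_male", "pt_female", "pt_male",
    },
}


def check_voice(model: str, voice: str) -> None:
    """Raise ValueError if the voice is not a known preset for the model."""
    if model not in VOICES:
        raise ValueError(f"unknown text-to-speech model: {model}")
    if voice not in VOICES[model]:
        options = ", ".join(sorted(VOICES[model]))
        raise ValueError(f"{voice!r} is not a {model} voice; choose one of: {options}")


check_voice("qwen3-tts", "serena")  # passes silently
```

Calling check_voice("qwen3-tts", "fr_female") raises a ValueError, because fr_female belongs to voxtral-tts.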
