Realtime transcription
Voxtral Mini Realtime (voxtral-mini-4b-realtime) streams speech-to-text over a WebSocket: you send PCM16 audio chunks as the user speaks, and partial transcripts stream back word-by-word.
Authenticate with a Bearer token in the Authorization header. The tinfoil-js SDK handles this for you and pins the TLS connection to the attested enclave key.
OpenAI Realtime-compatible mode
Connect with ?intent=transcription and the endpoint speaks the OpenAI Realtime transcription dialect. Existing OpenAI Realtime clients should work by changing only the URL and key. A session proceeds as follows:
- The server sends session.created.
- Optionally send session.update declaring your input format (defaults to PCM16 mono at 24kHz; declare {"type": "audio/pcm", "rate": 16000} if you capture at 16kHz). The server replies session.updated.
- Stream audio with input_audio_buffer.append (base64 PCM16). Partial transcripts arrive as conversation.item.input_audio_transcription.delta events.
- Send input_audio_buffer.commit to end the utterance. The server replies with input_audio_buffer.committed and a conversation.item.input_audio_transcription.completed event carrying the final transcript. You can then stream the next utterance on the same connection.
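The client side of this exchange can be sketched as a few message builders plus a small event handler. This is a minimal illustration, not the SDK's implementation: the nesting of the format object under session.audio.input.format and the "delta" and "transcript" field names are assumptions borrowed from OpenAI's Realtime transcription events, and TranscriptCollector is a hypothetical helper name. Opening the WebSocket and authenticating are left out.

```python
import base64
import json


def session_update(rate=24000):
    """Build a session.update declaring PCM16 input at the given sample rate."""
    # Assumption: the format object nests under session.audio.input.format,
    # as in OpenAI's Realtime transcription sessions.
    return json.dumps({
        "type": "session.update",
        "session": {
            "type": "transcription",
            "audio": {"input": {"format": {"type": "audio/pcm", "rate": rate}}},
        },
    })


def append_audio(pcm16):
    """Wrap a raw PCM16 chunk in an input_audio_buffer.append message."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16).decode("ascii"),
    })


def commit():
    """Build the message that ends the current utterance."""
    return json.dumps({"type": "input_audio_buffer.commit"})


class TranscriptCollector:
    """Accumulates partial text and records each completed utterance."""

    def __init__(self):
        self.partial = ""
        self.utterances = []

    def handle(self, raw):
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "conversation.item.input_audio_transcription.delta":
            # Assumption: the incremental text is in the "delta" field.
            self.partial += event.get("delta", "")
        elif kind == "conversation.item.input_audio_transcription.completed":
            # Assumption: the final text is in the "transcript" field.
            self.utterances.append(event.get("transcript", self.partial))
            self.partial = ""
```

Each builder returns a JSON string ready to send as a WebSocket text frame. Feeding every incoming frame to handle() keeps partial up to date for display, and resetting it on completion lets the next utterance start clean on the same connection.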
Turn detection: this mode only supports push-to-talk dictation. The OpenAI spec uses server_vad turn detection, in which the server detects silence and emits speech_started and speech_stopped events. Our mode does not do that. See the realtime model repo for details.
Native mode
Connecting with ?model=voxtral-mini-4b-realtime instead speaks the leaner vLLM dialect: flat transcription.delta events, a session.update carrying only {"model": ...}, client commits with a final flag, and transcription.done (with usage) after the final commit. Audio must be PCM16 mono at 16kHz. Each connection carries exactly one transcription session.
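As a rough sketch of the native message sequence, the generator below splits a PCM16 buffer into 100 ms chunks (3,200 bytes at 16kHz mono) and yields the messages for one session. The append and commit message type names and the "audio" and "final" field names are assumptions modeled on the vLLM realtime dialect, and the 100 ms chunk size is an arbitrary choice rather than a documented requirement.

```python
import base64
import json

SAMPLE_RATE = 16000    # native mode requires PCM16 mono at 16kHz
BYTES_PER_SAMPLE = 2   # 16-bit samples


def chunk_pcm16(pcm, chunk_ms=100):
    """Yield consecutive chunks holding chunk_ms milliseconds of audio each."""
    size = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def native_messages(pcm, model="voxtral-mini-4b-realtime"):
    """Yield the JSON messages for one native-mode transcription session."""
    # session.update carries only the model name in this dialect.
    yield json.dumps({"type": "session.update", "model": model})
    for chunk in chunk_pcm16(pcm):
        # Assumption: audio travels base64-encoded in an "audio" field.
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
    # Assumption: a boolean "final" field marks the last commit of the session.
    yield json.dumps({"type": "input_audio_buffer.commit", "final": True})
```

One second of audio yields twelve messages: the session.update, ten appends, and the final commit. After sending the last one, a client would keep reading until transcription.done arrives, then close the connection.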
File transcription
For recorded audio on disk, transcribe the whole file in one request over the OpenAI-compatible /v1/audio/transcriptions endpoint. The model parameter is required: use voxtral-small-24b for accuracy or whisper-large-v3-turbo for fast, lightweight transcription.
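A minimal sketch of such a request using only the Python standard library might look like the following. BASE_URL is a hypothetical placeholder, not the real host, and the multipart field names "model" and "file" follow OpenAI's transcription endpoint. Plain urllib does none of the enclave attestation or key pinning that the SDK performs, so treat this purely as an illustration of the request shape.

```python
import urllib.request
import uuid

BASE_URL = "https://inference.example.invalid"  # hypothetical placeholder host


def build_transcription_request(audio, filename, model, api_key):
    """Build (without sending) a multipart POST to /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    # First part: the required model field. Second part: the audio file itself.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/v1/audio/transcriptions",
        data=head + audio + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to urllib.request.urlopen would send it. Building the request separately from sending it keeps the example easy to inspect and test offline.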
Text-to-speech
Synthesize speech over the OpenAI-compatible /v1/audio/speech endpoint, which returns WAV audio. Use qwen3-tts for low-latency speech or voxtral-tts for the larger multilingual model. Both require a voice.
Voices are the upstream models’ default preset voices.
qwen3-tts has 9: aiden, dylan, eric, ono_anna, ryan, serena, sohee, uncle_fu, vivian. voxtral-tts has 20: neutral_female, neutral_male, casual_female, casual_male, cheerful_female, ar_male, de_female, de_male, es_female, es_male, fr_female, fr_male, hi_female, hi_male, it_female, it_male, nl_female, nl_male, pt_female, pt_male.
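Because both models require a voice and the preset names differ between them, it can help to validate the pair before sending a request. The sketch below builds a JSON body for /v1/audio/speech. The "model", "input", and "voice" field names follow OpenAI's speech endpoint, and build_speech_body is a hypothetical helper name. The voice table is copied from the lists above.

```python
import json

# Preset voices per model, as listed above.
VOICES = {
    "qwen3-tts": {
        "aiden", "dylan", "eric", "ono_anna", "ryan",
        "serena", "sohee", "uncle_fu", "vivian",
    },
    "voxtral-tts": {
        "neutral_female", "neutral_male", "casual_female", "casual_male",
        "cheerful_female", "ar_male", "de_female", "de_male", "es_female",
        "es_male", "fr_female", "fr_male", "hi_female", "hi_male",
        "it_female", "it_male", "nl_female", "nl_male", "pt_female", "pt_male",
    },
}


def build_speech_body(model, text, voice):
    """Return the JSON request body, rejecting unknown model/voice pairs."""
    if model not in VOICES:
        raise ValueError(f"unknown TTS model: {model}")
    if voice not in VOICES[model]:
        raise ValueError(f"{model} has no voice named {voice}")
    return json.dumps({"model": model, "input": text, "voice": voice}).encode("utf-8")
```

The returned bytes would be POSTed with a Content-Type of application/json and the same Bearer token used elsewhere. The response body is the WAV audio and can be written straight to a file.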
