Audio models

Voxtral Small 24B

voxtral-small-24b

Parameters: 24B
Quantization: None (served in BF16)
Capabilities: Speech-to-text transcription, audio Q&A, summarization, translation, voice-triggered function calling
Audio Duration: Up to 30 minutes (transcription) or 40 minutes (understanding)
Audio Format: Supports .mp3 and .wav files
Languages: English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian
Best for: Speech transcription with automatic language detection, answering questions from spoken input, generating summaries from audio, and triggering functions from voice commands
Model weights: mistralai/Voxtral-Small-24B-2507
Configuration repo: tinfoilsh/confidential-voxtral-small-24b

Audio + Text: Built on the Mistral Small 3.1 foundation, this model combines speech processing with strong text capabilities, including function calling from voice commands.
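The duration and format limits listed for this model can be checked before an upload is attempted. The following is a minimal sketch using only the Python standard library; the helper names `wav_duration_seconds` and `check_audio` are hypothetical and not part of any Tinfoil SDK, and the duration helper handles .wav input only.

```python
import io
import wave
from pathlib import PurePath

# Limits copied from the Voxtral Small 24B entry: 30 min for transcription,
# 40 min for understanding, and .mp3 / .wav as accepted formats.
MAX_SECONDS = {"transcription": 30 * 60, "understanding": 40 * 60}
ALLOWED_SUFFIXES = {".mp3", ".wav"}


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_audio(filename: str, duration_s: float, task: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable.

    `task` is "transcription" or "understanding" (hypothetical labels chosen
    for this sketch, matching the two limits in the catalog entry).
    """
    problems = []
    suffix = PurePath(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported format {suffix!r}")
    limit = MAX_SECONDS[task]
    if duration_s > limit:
        problems.append(f"{duration_s:.0f}s exceeds the {limit}s limit for {task}")
    return problems
```

A 32-minute recording would fail this check for transcription but pass it for understanding, which may be a useful signal for choosing the task or splitting the file first.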

Voxtral TTS

voxtral-tts

Parameters: 4B
Quantization: None (served in BF16)
Capabilities: Text-to-speech synthesis that converts text into natural-sounding spoken audio
Audio Format: Returns generated audio via the /v1/audio/speech endpoint with model=voxtral-tts
Best for: High-quality voice output for assistants, narration, and accessibility, kept inside a secure enclave
Model weights: mistralai/Voxtral-4B-TTS-2603
Configuration repo: tinfoilsh/confidential-realtime-models
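As a rough illustration of the request shape, the sketch below builds (but does not send) a POST to the /v1/audio/speech endpoint with model=voxtral-tts. Several details are assumptions rather than documented facts: the base URL and API key are placeholders, the `input` field name follows the OpenAI speech API shape, and the endpoint may require further fields such as a voice. Sending this request with plain `urllib` would not perform any enclave verification, so treat it as a picture of the payload only.

```python
import json
import urllib.request

# Hypothetical placeholders: substitute a real base URL and API key.
BASE_URL = "https://inference.example.invalid"
API_KEY = "YOUR_API_KEY"


def build_speech_request(text: str, model: str = "voxtral-tts") -> urllib.request.Request:
    """Build, without sending, a POST request for the /v1/audio/speech endpoint."""
    # Assumption: the body uses the OpenAI-style "model" and "input" fields.
    body = {"model": model, "input": text}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing `model="qwen3-tts"` would target the other text-to-speech model in this catalog through the same endpoint.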

Qwen3 TTS

qwen3-tts

Parameters: 1.7B
Quantization: None (served in BF16)
Capabilities: Text-to-speech synthesis with custom voice support
Audio Format: Returns generated audio via the /v1/audio/speech endpoint (the default model for that endpoint)
Best for: Lightweight, low-latency speech generation where a smaller model keeps the GPU footprint down
Model weights: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Configuration repo: tinfoilsh/confidential-realtime-models

Voxtral Mini Realtime

voxtral-mini-4b-realtime

Parameters: 4B
Quantization: None (served in BF16)
Capabilities: Streaming speech-to-text transcription, with partial transcripts appearing word by word as audio arrives (~480 ms behind the speaker)
Audio Format: PCM16 mono over WebSocket (16 kHz native; 24 kHz accepted on the OpenAI-compatible endpoint and resampled in-enclave)
Languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, Dutch
Best for: Live dictation, voice interfaces, and any UI that should show words as they’re spoken rather than after the recording ends
Model weights: mistralai/Voxtral-Mini-4B-Realtime-2602
Configuration repo: tinfoilsh/confidential-realtime-models

This model is picky. Realtime transcription streams over a WebSocket, and a verified connection to it is currently possible only through the tinfoil-js SDK, and only in Node, not in browsers. If this is a problem, contact us.

Streaming by architecture: The causal audio encoder emits text incrementally with a fixed lookahead, so transcripts are append-only and never revised. The model is served over the WebSocket /v1/realtime endpoint, including an OpenAI Realtime API-compatible mode. See the realtime section of the processing audio guide.
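Because the realtime endpoint expects PCM16 mono audio at 16 kHz, captured samples usually need to be converted and split into small frames before streaming. The sketch below shows one way to prepare that data with the Python standard library. It assumes little-endian byte order and an arbitrary 100 ms chunk size, neither of which is stated in this catalog, and it does not open the WebSocket itself, since verified connections currently go through the tinfoil-js SDK in Node. The helper names are hypothetical.

```python
import struct

SAMPLE_RATE_HZ = 16_000  # native rate listed in the Voxtral Mini Realtime entry
BYTES_PER_SAMPLE = 2     # PCM16 means 16-bit signed integer samples


def floats_to_pcm16(samples: list[float]) -> bytes:
    """Convert mono float samples in [-1.0, 1.0] to PCM16 bytes.

    Out-of-range values are clipped. Little-endian order is an assumption.
    """
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_pcm16(pcm: bytes, chunk_ms: int = 100) -> list[bytes]:
    """Split PCM16 bytes into chunks of chunk_ms milliseconds each.

    The final chunk may be shorter. The 100 ms default is an arbitrary choice.
    """
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]
```

At 16 kHz mono, one second of audio is 32,000 bytes, so 100 ms chunks are 3,200 bytes each. Smaller chunks reduce the delay before audio reaches the model, at the cost of more messages.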
