AI audio interviews

Conduct voice-based AI interviews with real-time transcription.

AI audio interviews let respondents speak their answers instead of typing. The AI conducts a real-time voice conversation with adaptive follow-ups and automatic transcription.

Tags: ai, audio, voice, interviews | Setup time: 10-20 minutes | Level: Intermediate | Audience: Researchers, UX teams

Steps

1. Add an AI Audio Interview question
   In the survey editor, click + Add and choose AI Audio Interview. This creates a voice-based interview block.
2. Configure voice settings
   Set the interview topic, AI personality, and system prompt. Choose a session mode: Pipeline, which chains separate speech-to-text (STT), LLM, and text-to-speech (TTS) components for flexibility and cost control, or Realtime, which uses native multimodal models from OpenAI or Google Gemini for the lowest latency. For Pipeline mode, select your preferred TTS and STT providers.
3. Review transcripts
   After respondents complete the voice interview, transcripts are generated automatically and appear in the Results tab alongside the audio recordings.

AI audio interviews remove the friction of typing, making it easier for respondents to share detailed, nuanced feedback through natural conversation.

Two session modes are available: Pipeline (separate STT, LLM, and TTS components for maximum flexibility) and Realtime (native multimodal models from OpenAI or Google Gemini for the lowest latency).
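The difference between the two modes can be pictured as a small settings object. The sketch below is purely illustrative: the class name, field names, and provider strings are hypothetical and do not reflect the product's actual configuration format or API. It assumes only what is described here, namely that Pipeline mode needs separately chosen STT and TTS providers, while Realtime mode relies on a single OpenAI or Google Gemini model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceInterviewConfig:
    """Hypothetical settings object; all field names are illustrative."""
    topic: str
    personality: str
    system_prompt: str
    session_mode: str                        # "pipeline" or "realtime"
    stt_provider: Optional[str] = None       # used in Pipeline mode
    tts_provider: Optional[str] = None       # used in Pipeline mode
    realtime_provider: Optional[str] = None  # "openai" or "gemini"

    def validate(self) -> None:
        """Raise ValueError if the settings do not fit the chosen mode."""
        if self.session_mode == "pipeline":
            # Pipeline mode chains separate components, so both are required.
            if not (self.stt_provider and self.tts_provider):
                raise ValueError("Pipeline mode needs an STT and a TTS provider")
        elif self.session_mode == "realtime":
            # Realtime mode uses one multimodal model for listening and speaking.
            if self.realtime_provider not in ("openai", "gemini"):
                raise ValueError("Realtime mode needs 'openai' or 'gemini'")
        else:
            raise ValueError(f"Unknown session mode: {self.session_mode!r}")


config = VoiceInterviewConfig(
    topic="Onboarding experience",
    personality="Warm and curious",
    system_prompt="Ask open-ended questions and follow up on pain points.",
    session_mode="pipeline",
    stt_provider="Deepgram Flux",
    tts_provider="Cartesia Sonic-3",
)
config.validate()  # passes: both Pipeline components are set
```

Keeping the mode-specific checks in one place makes it obvious which settings matter for each mode when switching between them.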

Text-to-speech providers: Cartesia Sonic-3 (recommended, ~40 ms latency, multilingual), ElevenLabs Flash v2.5, Deepgram Aura-2, Rime Arcana (fast English with expressive voices), xAI TTS (20 languages, inline speech tags), Inworld TTS (emotional range for interactive contexts), and more. In Realtime mode, OpenAI and Google Gemini handle speech natively.

Speech-to-text providers: Deepgram Flux (recommended for voice agents, integrated turn detection), Cartesia Ink Whisper (fastest, 80+ languages), AssemblyAI Universal-3 Pro (promptable for domain-specific terms), Deepgram Nova-3 (best accuracy, with a 6.84% word error rate (WER), 36 languages), and ElevenLabs Scribe V2 (99+ languages with word-level timestamps).
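For readers unfamiliar with the WER figure quoted above, word error rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. The following is a minimal sketch of that standard definition. It is not how any provider computes its published figure, and the lowercasing and whitespace splitting are simplifying assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four reference words gives a WER of 0.25 (25%).
wer = word_error_rate("the app keeps crashing", "the app keeps flashing")
```

By this measure, a 6.84% WER corresponds to roughly 7 word errors per 100 reference words.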

Pricing: Pipeline mode costs approximately $0.05/minute. OpenAI Realtime runs at approximately $0.30/minute, and Google Gemini Realtime at approximately $0.08/minute.
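To budget a study, multiply the per-minute rate by the expected interview length and the number of respondents. The sketch below uses the approximate rates quoted above. The function and dictionary names are illustrative, and actual billing may differ depending on the providers selected.

```python
# Approximate per-minute rates in USD, taken from the figures above.
RATES_PER_MINUTE = {
    "pipeline": 0.05,
    "openai_realtime": 0.30,
    "gemini_realtime": 0.08,
}


def estimate_cost(mode: str, minutes_per_interview: float, interviews: int) -> float:
    """Return a rough total cost in USD, rounded to cents."""
    if mode not in RATES_PER_MINUTE:
        raise ValueError(f"Unknown mode: {mode!r}")
    total = RATES_PER_MINUTE[mode] * minutes_per_interview * interviews
    return round(total, 2)


# Compare all three options for 100 interviews of 15 minutes each.
estimates = {
    mode: estimate_cost(mode, minutes_per_interview=15, interviews=100)
    for mode in RATES_PER_MINUTE
}
```

For 100 interviews of 15 minutes each (1,500 minutes in total), this comes to roughly $75 in Pipeline mode, $450 with OpenAI Realtime, and $120 with Google Gemini Realtime.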

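Built-in audio enhancement (ai-coustics) applies voice focus and speaker isolation to improve transcription accuracy in noisy environments. Adaptive interruption handling uses machine learning to reject false barge-ins, such as coughs and back-channeling (short acknowledgments like "mm-hm"), so the AI does not stop speaking when the respondent has not actually interrupted.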

Transcripts are automatically generated for every voice interview, making analysis straightforward even with audio data.