ML//Multimodal//TTS
Text-to-speech: convert text into natural-sounding audio.
NaturalSpeech (Microsoft, now at v3), ElevenLabs, and Cartesia reached near-human quality by 2024.
The frontier has merged with omnimodels: GPT-4o and Gemini 2.0 generate speech natively, without a separate TTS stage.
Full-duplex speech: Moshi (Kyutai) and the OpenAI Realtime API listen and speak at the same time.
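Toy sketch of the full-duplex idea: a listener and a speaker run concurrently, and detected user speech cuts the speaker off (barge-in). This is not how Moshi or the Realtime API work internally. The frame strings and the names `listen`, `speak`, `full_duplex`, and `run_demo` are hypothetical, and no real audio or network API is involved.

```python
import asyncio

async def listen(mic_frames, events, barge_in):
    """Consume incoming frames while the speaker keeps running."""
    for frame in mic_frames:
        await asyncio.sleep(0)  # yield so the speaker can interleave
        events.append(("heard", frame))
        if frame == "user-speech":  # hypothetical voice-activity signal
            barge_in.set()          # tell the speaker to stop talking

async def speak(out_frames, events, barge_in):
    """Emit output frames until done or interrupted by the listener."""
    for frame in out_frames:
        if barge_in.is_set():
            events.append(("stopped", frame))
            return
        events.append(("spoke", frame))
        await asyncio.sleep(0)  # yield so the listener can interleave

async def full_duplex(mic_frames, out_frames):
    events = []
    barge_in = asyncio.Event()
    # Both directions are active at once; neither waits for the other's turn.
    await asyncio.gather(
        listen(mic_frames, events, barge_in),
        speak(out_frames, events, barge_in),
    )
    return events

def run_demo():
    mic = ["silence", "silence", "user-speech"]
    out = ["a0", "a1", "a2", "a3", "a4"]
    return asyncio.run(full_duplex(mic, out))

if __name__ == "__main__":
    for event in run_demo():
        print(event)
```

In a half-duplex pipeline the speaker would finish all five frames before any input was read. Here the input is read during output, and the output stops early.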