ML//Multimodal//TTS

- Text-to-speech: convert text into natural-sounding audio.


NaturalSpeech (Microsoft, now at v3), ElevenLabs, and Cartesia reached near-human quality by 2024.

The frontier has merged with omnimodels: GPT-4o and Gemini 2.0 generate speech natively instead of relying on a separate TTS stage.

Full-duplex speech: Moshi (Kyutai) and the OpenAI Realtime API listen and speak at the same time, so the system can keep hearing the user while it is talking.
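
To make "full-duplex" concrete, here is a minimal sketch of the control flow only, not of how Moshi or the Realtime API are built. It assumes audio arrives and leaves as small chunks; `listen`, `speak`, and `full_duplex` are hypothetical names, and plain strings stand in for audio frames.

```python
import asyncio


async def listen(mic_chunks, log):
    # Hypothetical stand-in for reading microphone frames.
    for chunk in mic_chunks:
        log.append(("heard", chunk))
        await asyncio.sleep(0)  # yield control so the speaker can run


async def speak(tts_chunks, log):
    # Hypothetical stand-in for playing synthesized audio frames.
    for chunk in tts_chunks:
        log.append(("spoke", chunk))
        await asyncio.sleep(0)  # yield control so the listener can run


async def full_duplex(mic_chunks, tts_chunks):
    log = []
    # Run both directions concurrently instead of taking turns.
    await asyncio.gather(listen(mic_chunks, log), speak(tts_chunks, log))
    return log


log = asyncio.run(full_duplex(["u1", "u2"], ["a1", "a2"]))
print(log)
```

In a half-duplex (turn-based) design the two loops would run one after the other; here the event log interleaves "heard" and "spoke" entries, which is the property the note refers to.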