Direct: ElevenLabs, Deepgram, OpenAI, Google, Microsoft Azure Speech, Amazon Polly
Indirect: Retell AI, Vapi, LiveKit, PlayHT
Architectural ownership of State Space Models, the compute primitive Cartesia's founders pioneered for deep learning, which gives the company a multi-year research lead at the layer where latency and scaling advantages compound.
Cartesia runs State Space Model architectures across the full voice stack (TTS, STT, voice agents, on-device inference, multilingual generation, emotion control), trading the quadratic cost of transformer attention for linear scaling and sub-100ms latency.
Building human-like AI voices that speak, clone, dub, and converse in 70+ languages.
Having established defensible voice quality and market share through its API, ElevenLabs is now becoming a multimodal generation platform with an enterprise go-to-market engine.
Voice AI infrastructure for real-time speech-to-text, text-to-speech, and voice agents.
Deepgram controls the full vertical stack from bare-metal training hardware to a Rust inference runtime, a cost and latency moat that API competitors riding hyperscaler infrastructure cannot replicate without years of capex.