Otter.ai, Fireflies.ai, Grain (transcription/summarization-focused).
Gong, Chorus (now ZoomInfo), Clari Copilot.
Google Gemini Live, OpenAI ChatGPT Voice, Recall.ai.
Hume AI (emotion AI), Cogito (behavioral AI for calls).
Social intelligence (knowing when to speak, when to stay silent, and how to read conversational cues) is a technically harder problem than transcription or summarization. The team's backgrounds in AR/VR at Meta and real-time systems at Citadel Securities give it rare expertise in low-latency multimodal processing. Meeting transcription is commoditized; live participation is not.
Using socially aware multimodal inference from audio and video, real-time sentiment analysis during calls, and multimodal data curation for training.
Building human-like AI voices that speak, clone, dub, and converse in 70+ languages.
Having established defensible voice quality and market share through its API, ElevenLabs is now becoming a multimodal generation platform with an enterprise go-to-market engine.
Voice AI infrastructure for real-time speech-to-text, text-to-speech, and voice agents.
Deepgram controls the full vertical stack from bare-metal training hardware to a Rust inference runtime, a cost and latency moat that API competitors riding hyperscaler infrastructure cannot replicate without years of capex.