Hello, I'm Hamamoto from TIMEWELL.
"Is this really AI?" — that was the reaction from many people who tried Sesame's demo when it launched in February 2025. Within weeks, over one million users had tested it, generating more than five million minutes of conversation. What was different wasn't the information the AI provided — it was how it sounded while providing it.
What Is Sesame AI?
Sesame is a voice AI and smart glasses startup focused on one specific problem: making AI voice feel like talking to a person rather than a machine.
Company overview:
- Founded: 2022
- Founders: Brendan Iribe (Oculus co-founder, former CEO) and Ankit Kumar (former CTO, Ubiquity6)
- Total funding: $307.6M
- Key products: Maya, Miles voice assistants; CSM-1B open-source model
Funding history:
| Period | Round | Amount | Key Investors |
|---|---|---|---|
| 2023 | Seed | Undisclosed | a16z, Spark Capital, Matrix Partners |
| October 2025 | Series B | $250M | Sequoia, Spark Capital |
Sequoia published a blog post titled "A New Era of Voice" explaining their investment thesis — which gives a sense of how seriously they're treating this category.
Maya and Miles: What Made Them Go Viral
The February 2025 demo introduced two voice assistant characters:
- Maya — warm, friendly, empathetic conversational style
- Miles — more measured, witty, intellectually oriented style
Users can choose based on preference. The distinction isn't just personality labeling — the voice qualities, pacing, and tonal choices are genuinely different.
What Separates Sesame from Siri, Alexa, and Google Assistant
Traditional AI voice assistants use a two-step pipeline:
[LLM] → text → [TTS engine] → audio
The text-to-speech conversion step is where expressiveness is lost. The TTS engine generates audio from text, but the text itself doesn't carry tonal information — when to pause, when to inflect, what emotional register to use. The result sounds like a system reading text aloud.
Sesame's approach:
[Conversational model] → audio directly (with rhythm, emotion, expressiveness)
The end-to-end generation means the model produces the audio itself, not a text representation that gets converted. Nuance that would be lost in transcription is preserved.
Sesame calls the result Voice Presence — a concept combining:
- Emotional expression through tone (joy, surprise, empathy)
- Natural timing and pauses matching human conversation rhythm
- Context-aware tonal adjustment
- Wit and humor in appropriate situations
The research paper supporting this approach is titled "Crossing the Uncanny Valley of Voice" — directly addressing why AI voice has felt slightly wrong, and what it takes to fix it.
CSM-1B: Open-Source Voice Generation
In March 2025, Sesame released CSM-1B (Conversational Speech Model, 1 billion parameters) under the Apache 2.0 license.
Technical specifications:
- 1B parameters
- Apache 2.0 license (commercial use permitted)
- Output format: RVQ audio codes
- Input: text and audio
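The "RVQ audio codes" output format refers to residual vector quantization: audio is represented as stacked codebook indices, where each successive codebook quantizes the residual error left by the previous one, and a separate decoder turns the codes back into a waveform. The toy sketch below uses tiny hand-picked codebooks purely for illustration; it is not CSM-1B's actual codec, which learns its codebooks at scale:

```python
def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vec, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Encode vec as one index per codebook stage.

    Each stage quantizes the residual left over by the previous one,
    so later stages progressively refine the approximation."""
    codes, residual = [], list(vec)
    for cb in codebooks:
        i = nearest(residual, cb)
        codes.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return codes

def rvq_decode(codes, codebooks):
    """Sum the selected entry from each stage to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for i, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[i])]
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],    # coarse stage
    [[0.0, 0.0], [0.2, -0.1]],   # refinement stage
]
codes = rvq_encode([1.2, 0.9], codebooks)
approx = rvq_decode(codes, codebooks)
```

The practical upshot is compression: instead of raw audio samples, the model only has to predict a short sequence of small integers per frame, which is what makes audio tractable for a 1B-parameter transformer.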
The open-source release serves multiple purposes: it accelerates research in voice AI, builds developer ecosystem around Sesame's approach, and establishes credibility in the research community — the same approach ElevenLabs and others have used to generate attention.
The Smart Glasses Vision
Sesame isn't positioning itself as a voice assistant company in isolation. The end goal is an ambient AI interface — a lightweight, all-day wearable that operates primarily through voice.
Brendan Iribe's Oculus experience is relevant here. Building the Oculus Rift required solving miniaturization, weight distribution, battery life, and comfort challenges at consumer scale. Those problems have direct analogies in smart glasses development.
The case for voice as the primary interface for wearables:
- Hands-free — no screen interaction required
- Low friction — faster than typing, natural conversational flow
- Wearable-native — smart glasses and earbuds need voice as primary input
This is where Sesame's long-term strategy diverges from pure software voice AI companies. The combination of voice AI technology with hardware development capability — building both the software and the device — mirrors the vertical integration strategy that has made Apple so effective.
Current Limitations and Challenges
Task execution uncertainty: LLMs are strong at conversation but less reliable at executing precise actions. Error recovery in voice interfaces is harder than in visual interfaces.
Privacy: Always-on listening raises legitimate concerns about data handling. Public space use creates additional complexity.
Context limitations without vision: Voice-only interfaces lack the visual context that clarifies ambiguous requests. Multimodal integration (camera + voice) addresses this but adds hardware complexity.
Security: Prompt injection via voice — using audio to manipulate AI behavior — is an active area of security research.
Competitive Landscape
| Company | Approach | Differentiator |
|---|---|---|
| Sesame | End-to-end voice generation, Voice Presence | Natural expressiveness, smart glasses roadmap |
| ElevenLabs | TTS platform, voice cloning | Multi-language, voice marketplace |
| OpenAI | GPT Advanced Voice Mode | Integrated with leading LLM |
| Amazon | Alexa | Smart home integration |
| Apple | Siri | Device ecosystem |
| Google | Assistant | Search and services integration |
Sesame's differentiation is technical (end-to-end generation vs. two-step pipeline), hardware-oriented (smart glasses as destination), and focused (conversation quality specifically, rather than breadth of features).
Enterprise Applications
Voice AI in business contexts has specific use cases where the quality improvement Sesame offers matters:
Customer support: Human-sounding AI that handles the emotional dimensions of service conversations more naturally than robotic-sounding alternatives.
Internal assistants: Hands-free information retrieval and task logging during work sessions, without requiring screen interaction.
Sales support: Voice-based CRM data entry, pre-meeting briefings, follow-up management.
Training: Conversation practice, role-playing scenarios, language learning support.
Considerations for enterprise deployment:
- Identify use cases where voice specifically outperforms text interfaces
- Establish clear data handling policies for voice recordings
- Plan integration with existing systems through API
- Build error handling for voice-specific failure modes
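As one concrete example of that last point: a misheard voice command, unlike a typed one, cannot be visually verified before it runs, so low-confidence transcriptions and destructive requests typically need a spoken re-prompt or confirmation step. A minimal sketch, with illustrative stub names and a threshold chosen arbitrarily for the example:

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    text: str
    confidence: float  # 0.0-1.0, as reported by the speech recognizer

# Both values below are placeholders a real deployment would tune.
DESTRUCTIVE_KEYWORDS = ("delete", "cancel", "send")
CONFIDENCE_THRESHOLD = 0.85

def handle_voice_command(t: Transcript) -> str:
    """Decide whether to execute, confirm, or re-prompt."""
    if t.confidence < CONFIDENCE_THRESHOLD:
        # Too uncertain to act on: ask the user to repeat.
        return f"re-prompt: I heard '{t.text}' — could you say that again?"
    if any(k in t.text.lower() for k in DESTRUCTIVE_KEYWORDS):
        # Irreversible actions get an explicit spoken confirmation.
        return f"confirm: You want to {t.text}, correct?"
    return f"execute: {t.text}"

print(handle_voice_command(Transcript("delete the draft", 0.95)))
```

The same gate structure extends naturally to other voice-specific failure modes, such as barge-in handling or ambiguous entity references.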
TIMEWELL supports evaluation and implementation of voice AI technology through WARP consulting services, and ZEROCK provides enterprise AI infrastructure that enables integration with tools including voice AI systems.
Summary
Key facts about Sesame AI:
- Founded by Oculus co-founder Brendan Iribe and AR/VR expert Ankit Kumar
- Raised $307.6M total (Series B: $250M from Sequoia, October 2025)
- Maya and Miles attracted 1M+ users and 5M+ minutes of conversation within weeks of launch
- CSM-1B open-sourced under Apache 2.0 (commercial use permitted)
- "Voice Presence" concept — emotional intelligence + natural timing + contextual awareness
- Long-term goal: AI smart glasses as ambient interface
The broader signal: voice as an AI interaction modality is moving from "functional but robotic" toward genuinely natural. Companies that deploy voice AI now will have a meaningful experience advantage once that quality gap closes.