Hello, I'm Hamamoto from TIMEWELL.
"Is this really AI?" — that was the reaction from many people who tried Sesame's demo when it launched in February 2025. Within weeks, over one million users had tested it, generating more than five million minutes of conversation. What was different wasn't the information the AI provided — it was how it sounded while providing it.
What Is Sesame AI?
Sesame is a voice AI and smart glasses startup focused on one specific problem: making AI voice feel like talking to a person rather than a machine.
Company overview:
- Founded: 2022
- Founders: Brendan Iribe (Oculus co-founder, former CEO) and Ankit Kumar (former CTO, Ubiquity6)
- Total funding: $307.6M
- Key products: Maya, Miles voice assistants; CSM-1B open-source model
Funding history:
| Period | Round | Amount | Key Investors |
|---|---|---|---|
| 2023 | Seed | Undisclosed | a16z, Spark Capital, Matrix Partners |
| October 2025 | Series B | $250M | Sequoia, Spark Capital |
Sequoia published a blog post titled "A New Era of Voice" explaining their investment thesis — which gives a sense of how seriously they're treating this category.
Maya and Miles: What Made Them Go Viral
The February 2025 demo introduced two voice assistant characters:
- Maya — warm, friendly, empathetic conversational style
- Miles — more measured, witty, intellectually oriented style
Users can choose based on preference. The distinction isn't just personality labeling — the voice qualities, pacing, and tonal choices are genuinely different.
What Separates Sesame from Siri, Alexa, and Google Assistant
Traditional AI voice assistants use a two-step pipeline:
[LLM] → text → [TTS engine] → audio
The text-to-speech conversion step is where expressiveness is lost. The TTS engine generates audio from text, but the text itself doesn't carry tonal information — when to pause, when to inflect, what emotional register to use. The result sounds like a system reading text aloud.
Sesame's approach:
[Conversational model] → audio directly (with rhythm, emotion, expressiveness)
The end-to-end generation means the model produces the audio itself, not a text representation that gets converted. Nuance that would be lost in transcription is preserved.
Sesame calls the result Voice Presence — a concept combining:
- Emotional expression through tone (joy, surprise, empathy)
- Natural timing and pauses matching human conversation rhythm
- Context-aware tonal adjustment
- Wit and humor in appropriate situations
The research paper supporting this approach is titled "Crossing the Uncanny Valley of Voice" — directly addressing why AI voice has felt slightly wrong, and what it takes to fix it.
CSM-1B: Open-Source Voice Generation
In March 2025, Sesame released CSM-1B (Conversational Speech Model, 1 billion parameters) under the Apache 2.0 license.
Technical specifications:
- 1B parameters
- Apache 2.0 license (commercial use permitted)
- Output format: RVQ audio codes
- Input: text and audio
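The "RVQ audio codes" output format refers to residual vector quantization: audio is represented as stacked codebook indices, where each successive codebook quantizes the residual error left by the previous one, and a separate decoder turns the codes back into a waveform. The toy sketch below uses tiny hand-picked codebooks purely for illustration; it is not CSM-1B's actual codec, which learns its codebooks at scale:

```python
def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vec, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Encode vec as one index per codebook stage.

    Each stage quantizes the residual left over by the previous one,
    so later stages progressively refine the approximation."""
    codes, residual = [], list(vec)
    for cb in codebooks:
        i = nearest(residual, cb)
        codes.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return codes

def rvq_decode(codes, codebooks):
    """Sum the selected entry from each stage to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for i, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[i])]
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],    # coarse stage
    [[0.0, 0.0], [0.2, -0.1]],   # refinement stage
]
codes = rvq_encode([1.2, 0.9], codebooks)
approx = rvq_decode(codes, codebooks)
```

The practical upshot is compression: instead of raw audio samples, the model only has to predict a short sequence of small integers per frame, which is what makes audio tractable for a 1B-parameter transformer.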
The open-source release serves multiple purposes: it accelerates research in voice AI, builds developer ecosystem around Sesame's approach, and establishes credibility in the research community — the same approach ElevenLabs and others have used to generate attention.
The Smart Glasses Vision
Sesame isn't positioning itself as a voice assistant company in isolation. The end goal is an ambient AI interface — a lightweight, all-day wearable that operates primarily through voice.
Brendan Iribe's Oculus experience is relevant here. Building the Oculus Rift required solving miniaturization, weight distribution, battery life, and comfort challenges at consumer scale. Those problems have direct analogies in smart glasses development.
The case for voice as the primary interface for wearables:
- Hands-free — no screen interaction required
- Low friction — faster than typing, natural conversational flow
- Wearable-native — smart glasses and earbuds need voice as primary input
This is where Sesame's long-term strategy diverges from pure software voice AI companies. The combination of voice AI technology with hardware development capability — building both the software and the device — mirrors the vertical integration strategy that has made Apple so effective.
Current Limitations and Challenges
Task execution uncertainty: LLMs are strong at conversation but less reliable at executing precise actions. Error recovery in voice interfaces is harder than in visual interfaces.
Privacy: Always-on listening raises legitimate concerns about data handling. Public space use creates additional complexity.
Context limitations without vision: Voice-only interfaces lack the visual context that clarifies ambiguous requests. Multimodal integration (camera + voice) addresses this but adds hardware complexity.
Security: Prompt injection via voice — using audio to manipulate AI behavior — is an active area of security research.
Competitive Landscape
| Company | Approach | Differentiator |
|---|---|---|
| Sesame | End-to-end voice generation, Voice Presence | Natural expressiveness, smart glasses roadmap |
| ElevenLabs | TTS platform, voice cloning | Multi-language, voice marketplace |
| OpenAI | GPT Advanced Voice Mode | Integrated with leading LLM |
| Amazon | Alexa | Smart home integration |
| Apple | Siri | Device ecosystem |
| Google | Assistant | Search and services integration |
Sesame's differentiation is technical (end-to-end generation vs. two-step pipeline), hardware-oriented (smart glasses as destination), and focused (conversation quality specifically, rather than breadth of features).
Enterprise Applications
Voice AI in business contexts has specific use cases where the quality improvement Sesame offers matters:
Customer support: Human-sounding AI that handles the emotional dimensions of service conversations more naturally than robotic-sounding alternatives.
Internal assistants: Hands-free information retrieval and task logging during work sessions, without requiring screen interaction.
Sales support: Voice-based CRM data entry, pre-meeting briefings, follow-up management.
Training: Conversation practice, role-playing scenarios, language learning support.
Considerations for enterprise deployment:
- Identify use cases where voice specifically outperforms text interfaces
- Establish clear data handling policies for voice recordings
- Plan integration with existing systems through API
- Build error handling for voice-specific failure modes
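As one concrete example of that last point: a misheard voice command, unlike a typed one, cannot be visually verified before it runs, so low-confidence transcriptions and destructive requests typically need a spoken re-prompt or confirmation step. A minimal sketch, with illustrative stub names and a threshold chosen arbitrarily for the example:

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    text: str
    confidence: float  # 0.0-1.0, as reported by the speech recognizer

# Both values below are placeholders a real deployment would tune.
DESTRUCTIVE_KEYWORDS = ("delete", "cancel", "send")
CONFIDENCE_THRESHOLD = 0.85

def handle_voice_command(t: Transcript) -> str:
    """Decide whether to execute, confirm, or re-prompt."""
    if t.confidence < CONFIDENCE_THRESHOLD:
        # Too uncertain to act on: ask the user to repeat.
        return f"re-prompt: I heard '{t.text}' — could you say that again?"
    if any(k in t.text.lower() for k in DESTRUCTIVE_KEYWORDS):
        # Irreversible actions get an explicit spoken confirmation.
        return f"confirm: You want to {t.text}, correct?"
    return f"execute: {t.text}"

print(handle_voice_command(Transcript("delete the draft", 0.95)))
```

The same gate structure extends naturally to other voice-specific failure modes, such as barge-in handling or ambiguous entity references.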
TIMEWELL supports evaluation and implementation of voice AI technology through WARP consulting services, and ZEROCK provides enterprise AI infrastructure that enables integration with tools including voice AI systems.
Summary
Key facts about Sesame AI:
- Founded by Oculus co-founder Brendan Iribe and AR/VR expert Ankit Kumar
- Raised $307.6M total (Series B: $250M from Sequoia, October 2025)
- Maya and Miles attracted 1M+ users and 5M+ minutes of conversation within weeks of launch
- CSM-1B open-sourced under Apache 2.0 (commercial use permitted)
- "Voice Presence" concept — emotional intelligence + natural timing + contextual awareness
- Long-term goal: AI smart glasses as ambient interface
The broader signal: voice as an AI interaction modality is moving from "functional but robotic" toward genuinely natural. Companies that deploy voice AI now will have a meaningful experience advantage once that quality gap closes.