Voice chatting with today's AI assistants often feels like a stilted CB radio exchange: you must wait your turn to speak. You say something, the AI responds, and you cannot interrupt or carry on a natural back-and-forth. Even the advanced voice modes in ChatGPT, Gemini, and Alexa follow this rigid pattern. The AI cannot listen while it speaks or think while you talk; it processes your entire utterance, then generates a reply. This half-duplex, one-speaker-at-a-time design makes conversations feel robotic, and many users avoid voice interfaces entirely.
That could change with a new generation of interaction models from Thinking Machines, an AI startup founded by Mira Murati, the former Chief Technology Officer of OpenAI. Murati led the development of GPT-4, ChatGPT, and DALL-E 3 before leaving in late 2024 to start her own venture. Thinking Machines aims to create AI that truly engages in human dialogue: listening, thinking, and responding simultaneously, just as a person would.
The Problem with Current Voice AI
Current voice AI operates on a simple pipeline: your speech is transcribed, the transcript is fed into a language model, the model generates a text response, and text-to-speech converts that response into audio. The whole process takes several seconds, during which the AI is blind and deaf to everything else. It cannot perceive pauses, sighs, background noises, or physical actions like holding up an object, and it cannot change course midstream if you interrupt. This is why voice conversations with AI feel unnatural: they lack the real-time, multi-sensory awareness that humans take for granted.
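To see where those seconds go, here is a minimal sketch of that blocking pipeline in Python. The stage functions and latencies are illustrative stand-ins, not any vendor's actual API:

```python
# Minimal sketch of the conventional half-duplex voice pipeline.
# Each stage blocks until it finishes, so the assistant is effectively
# blind and deaf for the entire turn. All names and delays are made up.
import time

def transcribe(audio: bytes) -> str:
    time.sleep(0.5)                      # simulated speech-to-text latency
    return "what's the weather like?"

def generate_reply(text: str) -> str:
    time.sleep(1.5)                      # simulated language-model latency
    return "It's sunny and 72 degrees."

def synthesize(text: str) -> bytes:
    time.sleep(0.5)                      # simulated text-to-speech latency
    return text.encode()

def handle_turn(utterance: bytes) -> bytes:
    # Strictly sequential: nothing overlaps, and an interruption that
    # arrives mid-pipeline is simply lost.
    return synthesize(generate_reply(transcribe(utterance)))

print(handle_turn(b"..."))               # seconds pass before any audio returns
```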
Introducing Multi-Stream, Micro-Turn Architecture
Thinking Machines' innovation is a dual-model system: a lightweight interaction model that is always present with the user, and a more powerful background model that handles complex tasks. The interaction model processes audio and visual inputs in rapid 200-millisecond chunks, continuously updating its internal state. This allows it to react instantly—interrupting when appropriate, nodding along, or even noticing when you take a sip of coffee. Meanwhile, the background model works on deeper reasoning, such as answering a research question or performing an arithmetic calculation, and hands its results to the interaction model when ready.
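Thinking Machines has not published code, but the micro-turn loop can be pictured roughly as follows. The 200-millisecond chunk size comes from the company's description; the class name and the toy backchannel policy are assumptions made for illustration:

```python
# Rough sketch of a micro-turn loop: the always-on interaction model ingests
# fixed-size chunks and may emit a tiny response (a nod, a filler word, an
# interruption) after any chunk, instead of waiting for a full utterance.
CHUNK_MS = 200  # per the article: inputs arrive in ~200 ms chunks

def capture_chunk(i: int) -> dict:
    """Stand-in for 200 ms of microphone and camera input."""
    return {"audio": f"audio[{i}]", "video": f"frame[{i}]"}

class InteractionModel:
    """Stand-in for the small, always-on model with rolling state."""
    def __init__(self) -> None:
        self.state: list[dict] = []

    def step(self, chunk: dict) -> str | None:
        self.state.append(chunk)         # continuously update internal state
        if len(self.state) % 5 == 0:     # toy policy: occasional backchannel
            return "uh-huh"
        return None                      # usually: keep listening silently

model = InteractionModel()
for i in range(15):                      # ~3 simulated seconds of conversation
    micro = model.step(capture_chunk(i))
    if micro:
        print(f"[{i * CHUNK_MS} ms] {micro}")  # spoken without ending the user's turn
```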
In demo videos, Thinking Machines shows its models (still in research preview) engaging in video chats where they track objects held up by the user, keep a running tally of words, and correct pronunciation or factual errors in real time. For instance, when a speaker mispronounces 'acai,' the AI gently interrupts with the correct pronunciation and fact-checks the claim that acai bowls originated in Argentina. That level of contextual awareness goes well beyond what commercial voice systems offer today.
Background on Mira Murati and Thinking Machines
Mira Murati joined OpenAI in 2018 and rose to become CTO, overseeing the development of groundbreaking AI models. She played a key role in the safety and alignment work that readied ChatGPT for public use. In September 2024, she announced her departure to pursue a new venture, later revealed as Thinking Machines. The startup has already attracted significant venture capital and top AI researchers. Murati's vision is to move beyond text-based prompts and create AI that understands the full context of human interaction, including tone, gesture, and environment.
The name 'Thinking Machines' evokes Thinking Machines Corporation, the 1980s company that pioneered massively parallel processing with its Connection Machine supercomputers and drew inspiration from neural networks. Murati's new company carries that torch into the age of large language models and real-time interaction.
How It Works Under the Hood
Interaction model: A small, fast transformer that runs continuously, processing frames of audio and video (if available) every 200ms. It maintains a short-term memory of recent events and can generate micro-responses—like fillers ('uh-huh'), interruptions, or quick acknowledgments—without waiting for the background model.
Background model: A larger, slower model that runs in parallel. It receives the full context from the interaction model at intervals and performs deeper reasoning. Once it produces a result, it feeds back to the interaction model, which then delivers the final response.
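One minimal way to picture that fast/slow handoff is a thread-and-queue arrangement, sketched below; none of the names, timings, or mechanisms here come from Thinking Machines:

```python
# Hedged sketch of the two-model split: a slow background worker reasons in
# parallel while the fast loop keeps responding every ~200 ms and delivers
# the result as soon as it lands. Everything here is illustrative.
import queue
import threading
import time

results: queue.Queue[str] = queue.Queue()

def background_model(question: str) -> None:
    """Stand-in for the larger, slower reasoning model."""
    time.sleep(2.0)                      # simulated deep-reasoning latency
    results.put(f"Here's what I found about {question!r}.")

# The user asks something hard; hand it to the background model...
threading.Thread(target=background_model,
                 args=("a research question",), daemon=True).start()

# ...while the interaction model keeps the conversation alive.
for step in range(15):
    try:
        print(results.get_nowait())      # background result ready: deliver it
        break
    except queue.Empty:
        if step == 3:
            print("Hmm, let me think...")  # micro-response fills the gap
    time.sleep(0.2)                      # next 200 ms micro-turn
```

The key property is that the fast loop never blocks on the slow one: the user keeps getting feedback on a 200-millisecond timescale even while heavy reasoning is still in flight.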
This architecture loosely mirrors the human brain: the brainstem handles immediate reflexes, while the cortex works through complex thought. The interaction model acts as the 'fast brain' and the background model as the 'slow brain.' Together, they enable fluid conversation.
Challenges and Limitations
Thinking Machines' new models are not yet ready for prime time. The company acknowledges that they struggle with very long conversations due to memory constraints, and they require reliable, low-latency connectivity to function well. The current interaction model is relatively small because larger models are too slow to serve in this setting. Scaling up while maintaining speed is a major engineering challenge.
Moreover, the models may occasionally interrupt at wrong moments or misunderstand visual cues, especially in noisy environments. Privacy concerns also arise: always-on microphones and cameras raise questions about data collection. However, Thinking Machines states they are designing with privacy in mind, processing as much as possible on-device and giving users control over what is captured.
Potential Impact on AI Assistants and Beyond
If Thinking Machines succeeds, it could revolutionize not just voice assistants but also robotics, customer service, education, and telepresence. Imagine an AI tutor that can see your confused expression and adjust its explanation, or a customer service bot that can hear frustration in your voice and escalate gracefully. On-device implementations could make smart glasses or augmented reality applications more natural, allowing AI to interact with the world in real time.
Other major players like Google, OpenAI, and Meta are also working on similar 'full-duplex' voice systems. Google has experimented with real-time conversation in its dialogue models from LaMDA onward, and OpenAI's GPT-4o demonstrated some real-time voice capabilities at its launch. However, Thinking Machines' multi-stream approach appears more specialized for true conversational turn-taking with micro-interruptions.
The company plans to release a public demo later this year and is already in talks with hardware manufacturers to integrate the technology into smart speakers, wearables, and automotive systems. The race to natural voice AI is heating up, and Thinking Machines may have a significant lead.
Ben Patterson, a veteran technology writer who has covered AI for over two decades, notes that the current state of AI voice chat is a barrier to mass adoption. He has been testing AI assistants since the early days of Siri and believes that the lack of real-time listening and interruption is the main reason users avoid voice interfaces. He says, 'Once you experience an AI that can truly converse—that can wait while you sip coffee and then jump in when you stumble on a word—you'll never go back to the old turn-taking systems.'
Thinking Machines' research preview points to a future where AI voice chat feels as natural as talking to a friend. While challenges remain, the underlying architecture represents a fundamental shift from today's rigid, transcribe-then-reply paradigm. As the company works on scaling and robustness, the rest of the industry will be watching closely.
Source: PCWorld News