These videos are worth a watch. There are tons of impressive moments, but they had me at the very first one where a woman says: "I'm going to tell you a story," and then pauses for a long, luxurious sip from a cup of coffee, and the model ... does nothing, just waits. Take my money.
Speaking of taking my money, what's the economic model for a company like this? They've published a fair amount about their architecture - enough that I imagine frontier labs could implement it. Patents? Trade secrets? It's hard for me to understand how you'd be able to beat the training compute and know-how at Anthropic/GOOG/oAI/Meta without some sort of legal protection.
I can't wait to see what these model architectures do with like 30-40% lower latency and more model intelligence. Very appealing. For reference, these look to be roughly 1/10 the size of Opus 4.7 / GPT 5.x series -- 275B, 12B active. So there's lots of room to add intelligence, and lots of hope that we could see lower latency.
> They've published a fair amount about their architecture - enough that I imagine frontier labs could implement.
i think the real ones know this is the tip of the iceberg? hparam tuning, data recipes, data collection, custom kernels, rl/eval infra, all immensely deep topics that would condense multiple decades of phd lifetimes to produce SOTA performance (in both senses of the word) like this.
i would also calibrate what you are impressed by. simply waiting is a posttrain thing - the fact that gemini and oai have not prioritized it is not something you should overindex on so hard. what they showed with full duplex is technically far, far harder to achieve
In China it's become well known that promising new companies will get an offer from either Alibaba or Tencent. In the US, it's probably similar. Everything that's out in the open can get acquired or simply copied. Maybe that is what Thinking Machines is hoping for as well?
they hire leading researchers, and leading researchers won't work for you unless they're able to publish
> leading researchers won't work for you unless they're able to publish
oh, honey.
Do we want the whole of humanity to get richer, or a few individuals (company owners)?
Which seems bizarre. Companies can’t afford to just give things away right?
Yes they can. Your research papers are not the whole story. It's like how Google could open source their entire monorepo and very little would change. No one else could operate it.
The noteworthy things to me are that the architecture is a transformer that takes in text, image, and audio input and produces text and audio output, all trained together, and it works in near real-time through interleaving inputs and outputs rather than pure generation of the output from a given prompt.
> Time-Aligned Micro-Turns. The interaction model works with micro-turns continuously interleaving the processing of 200ms worth of input and generation of 200ms worth of output. Rather than consuming a complete user-turn and generating a complete response, both input and output tokens are treated as streams. Working with 200ms chunks of these streams enables near real-time concurrency of multiple input and output modalities.
That's probably the main thing that distinguishes it from the multimodal models from other frontier labs as far as I can tell.
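The micro-turn idea in the quoted passage can be sketched in a few lines. Everything here is illustrative: `run_micro_turns`, `model_step`, and `make_toy_model` are made-up names, and a real system would drive this loop from live audio buffers rather than a Python list. The point is that "say nothing this micro-turn" is a first-class output, which is exactly the wait-while-she-sips behavior in the demo.

```python
MICRO_TURN_MS = 200  # chunk size from the quoted passage: 200 ms per micro-turn

def run_micro_turns(input_chunks, model_step):
    """Interleave ingestion and generation one chunk at a time.

    model_step is a hypothetical stand-in for the model: it consumes one
    input chunk (None = silence) and returns one output chunk (None = the
    model chooses to stay quiet). Input and output are treated as streams,
    never as a complete user turn followed by a complete response.
    """
    outputs = []
    for chunk in input_chunks:
        # One micro-turn: process 200 ms of input, emit 200 ms of output.
        outputs.append(model_step(chunk))
    return outputs

def make_toy_model():
    """Toy policy: stay silent while the user talks, reply after a pause."""
    heard = {"flag": False}
    def step(chunk):
        if chunk is not None:
            heard["flag"] = True
            return None        # user is talking: wait
        if heard["flag"]:
            heard["flag"] = False
            return "ack"       # pause detected: take the floor
        return None
    return step

print(run_micro_turns(["hi", "there", None, None], make_toy_model()))
# -> [None, None, 'ack', None]
```

A classic turn-based pipeline would instead consume all four chunks and only then generate; the chunked loop is what lets output decisions happen concurrently with ongoing input.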
What's really interesting for me about multimodal architectures from the ground up is that we might start to see applications where different modalities are "facets" of the same thing. Like a coding agent that sees "code" + "IDE" + "memory mapping" + feedback from different plugins as different modalities. And it gets to output in them as well - text where it needs to, actions (not <action>call_something(params)</action> like we have today) and so on. Being able to "sit still" until one of the modalities triggers is really interesting.
We can do these things today, but they're "bolted on" as afterthoughts. Yet they work remarkably well. I wonder how well they'd work if trained in this combined regime, from the ground up.
Aside from how impressive the model is, the demos here are very well done! Quirky and short, unlike what we're used to from Anthropic and OpenAI.
Very cool! The demos felt fairly contrived - e.g., count things while I talk. I wonder what more useful or commercial applications look like.
Yes! This is a big thing I've noticed in all AI demos. If the best use case you can think of to show off your tech is to book a holiday, which I could easily do myself, does your service really add much value? Or is it simply that the real uses will be nuanced and specialised, and not suited for a quick general-audience demo? I'm not sure.
In theory I would expect it to do everything the current frontier models are capable of, but with the added benefit of real-time interactivity for better collaboration. The biggest benefit may be the real-time video input: it can take in that input in parallel with producing outputs steered by it, rather than ingesting a whole video or all the images at once and then producing a single output for all of that.
This looks similar to things people are already building locally with Gemma4 and TTS; just a bit fancier.
Local models will catch up soon.
Very cool demo, I wonder what would be the billion dollar applications of a thing like this.
Very cool tech. I think people are underrating how this will be used.
incredibly impressive demos. I wonder what the training data for these models looks like?
is it separate batches of special "skills" that are added post-training? how can they guarantee the models won't eventually lose a skill?
Simultaneous speech is best.
That's neat and definitely the next step. But to be honest, I don't want an AI talking to me like that.
Same here.
Presumably it will be possible to adjust that behavior with settings, the system prompt, etc. Not that most users will make such adjustments, though.
I'm currently teaching a class on AI-related issues at a university in Tokyo. Many of the students were surprised when I showed them that they can change the response behavior of chatbots to make them more or less verbose, sycophantic, etc. It shifted the direction of our discussions on the possible impacts of AI on the people who use it.
Really really cool. If they can serve this efficiently it would disrupt a lot of things.
am i the only person not impressed by this? it still feels awkward, with the pauses, and doesn't openai offer voice cadence already?
Same here. I don't see anything there that nobody else can catch up on eventually. I must be missing something here. It's all cute, but mmm
What I will say is that this is probably the first model after gemini live to do some of these things. It feels similar to gemini live, which I don't think is what they were going for exactly, but IMO it is still impressive as I don't think anyone else has matched full duplex video/audio/tool calling.
The next gemini releases are coming next week though; we will see how that matches up!
This deserves to be at the top of HN, shame it seems like it's not going to make it. Some of the demos are hilarious. Clearly having the model appropriately choose when to speak is a major thing that has been missing from voice models to date. It seems like the latency is still a touch too high to be truly human-like though.