Apple's Revamped Siri Falls Short of the AI Assistant It Needs to Be
Early tests of the new Siri on Mac suggest genuine progress on some fronts, but the gap between Apple's ambitions and its current reality remains wide.
Image credit: CNET, Smart Home. Used under fair use for news commentary.
Picture a user sitting at a Mac, asking Siri to pull context from an email thread, cross-reference a calendar entry, and draft a reply. It is the kind of compound, multi-step task that every major AI lab has been promising for the better part of three years. Whether the new Siri can actually do it reliably, in 2025, without hallucinating half the details, is a more complicated question than the marketing suggests.
Apple has been unusually quiet about the technical architecture underpinning what it calls Apple Intelligence, and that reticence makes independent evaluation harder than it should be. What we do have are early hands-on tests, including a structured 10-round evaluation published by ZDNet, which found the updated Siri "off to a promising start" while acknowledging that "Apple still has more work to do." That framing, cautiously optimistic but clearly hedged, is about right based on what is publicly available.
To be precise, the new Siri is not a single monolithic model. Apple has described a layered system in which on-device models handle privacy-sensitive tasks and more capable server-side models, routed through what Apple calls Private Cloud Compute, handle heavier inference. The company has also integrated access to ChatGPT for queries that fall outside its own models' competence. This architecture is, in principle, sensible. Keeping sensitive data on-device is not just a marketing talking point; it reflects a genuine engineering constraint that most cloud-first competitors simply sidestep. The question is whether the resulting system is coherent enough to feel like a single assistant rather than a patchwork of handoffs.
Early evidence suggests the coherence is partial at best. The ZDNet test found Siri performing well on discrete, well-scoped tasks: setting reminders, answering factual questions, summarising documents within Apple's own apps. Where it struggled was with the kind of cross-application reasoning that would actually justify calling this a next-generation assistant. Asking Siri to do something that requires it to hold context across multiple apps, or to reason about ambiguous intent, still produces inconsistent results. This is not surprising: this class of task, sometimes called "agentic" or "tool-use" reasoning, remains genuinely hard even for frontier models with far more disclosed compute behind them. But it does matter for how we assess Apple's claims.
Apple is entering this space with some structural disadvantages relative to Google and Microsoft. Both of those companies have spent years integrating large language models into productivity software with enormous user bases, generating the kind of real-world feedback loops that accelerate improvement. Apple's ecosystem is deep but narrower in the enterprise context where AI assistants arguably matter most. Siri's historical weakness on complex queries is well documented, and the accumulated user scepticism is not trivial to overcome. A single product cycle, even a genuinely improved one, does not erase years of "Hey Siri, I didn't get that."
The HomeKit angle adds another layer of complexity. CNET has framed the current moment as an opportunity for consumers to build out Apple-compatible smart home setups ahead of anticipated improvements to both Siri AI and Apple Home capabilities. That framing is commercially reasonable but slightly premature as a technical matter. Smart home control is, in some ways, the easiest test case for an AI assistant: the commands are relatively constrained, the latency requirements are forgiving, and the failure modes are obvious. If the new Siri cannot reliably handle "turn off the lights in the living room when I leave" without occasional misfires, the more ambitious agentic use cases are not going anywhere soon.
I know I am being picky here, but the distinction between "improved" and "genuinely capable" matters enormously when consumers are being asked to build hardware ecosystems around a software promise. HomeKit devices represent real money spent on a bet that Apple's software will catch up to the ambient computing vision the company has been gesturing at for years. The bet may well pay off. It is too early to say with confidence either way.
What does the broader research context tell us? The class of systems Apple is building, on-device small models combined with cloud-based large models and third-party integrations, is increasingly well understood in the academic literature. Work from researchers at institutions including Stanford and MIT has explored the latency-accuracy tradeoffs in hybrid on-device/cloud inference, and the consensus is roughly that the approach is sound but that the routing logic, deciding which queries go where, is harder to get right than it appears. A poorly calibrated router sends too many queries to the cloud (eroding the privacy promise) or keeps too many on-device (eroding quality). Apple has not published details of how it handles this routing, which makes external evaluation of that specific component essentially impossible.
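To make the routing tradeoff concrete, here is a toy sketch in Python. It is not Apple's method, which is unpublished. It assumes a hypothetical router that sends a query to the cloud whenever a made-up difficulty score exceeds a threshold, and it counts the two failure modes described above. The names `Query`, `route`, `summarize`, the `device_capacity` cutoff, and the sample queries are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    text: str
    difficulty: float  # hypothetical score in [0, 1]; higher means harder
    sensitive: bool    # hypothetical flag: the query touches private data


def route(query: Query, threshold: float) -> str:
    """Send a query to the cloud only if its difficulty exceeds the threshold."""
    return "cloud" if query.difficulty > threshold else "device"


def summarize(queries, threshold, device_capacity=0.5):
    """Count both failure modes for a given routing threshold.

    device_capacity is an assumed cutoff above which the small on-device
    model is taken to give a poor answer.
    """
    sensitive_to_cloud = 0  # sensitive queries that leave the device
    quality_misses = 0      # hard queries kept on the small model
    for q in queries:
        target = route(q, threshold)
        if target == "cloud" and q.sensitive:
            sensitive_to_cloud += 1
        if target == "device" and q.difficulty > device_capacity:
            quality_misses += 1
    return {"sensitive_to_cloud": sensitive_to_cloud,
            "quality_misses": quality_misses}


# Invented sample workload, for illustration only.
QUERIES = [
    Query("set a timer", 0.1, False),
    Query("summarise my medical email", 0.4, True),
    Query("draft a reply using my calendar", 0.7, True),
    Query("explain a tax rule", 0.9, False),
]

if __name__ == "__main__":
    for threshold in (0.2, 0.5, 0.95):
        print(threshold, summarize(QUERIES, threshold))
```

With this invented workload, a low threshold (0.2) sends both sensitive queries off the device, while a high one (0.95) keeps two hard queries on the small model. A real router would have to estimate difficulty and sensitivity from the query itself, which is where the calibration problem actually lives.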
The ChatGPT integration is also worth examining carefully. Apple has positioned it as an optional extension, something users can invoke when Siri acknowledges it cannot handle a query natively. In practice, the boundary between what Siri handles natively and what it punts to ChatGPT is not always transparent to the user, and that opacity raises legitimate questions. If a user asks a sensitive question and Siri routes it to OpenAI's servers without a clear notification, the privacy architecture Apple has carefully constructed starts to look more porous. Apple has said it will notify users before sending queries to ChatGPT, but how consistently that notification appears in real use is something that deserves ongoing scrutiny.
There is also the question of what "new" actually means here. Some of what Apple is shipping under the Apple Intelligence umbrella, writing tools, image generation, basic summarisation, is genuinely new for Apple's platform but is incremental over what competitors shipped twelve to eighteen months ago. That is not a condemnation; platform integration matters, and a feature that works seamlessly inside Mail and Notes is more useful than a standalone chatbot that requires context-switching. But it is important to be clear about the novelty gradient. The on-device model architecture, particularly the work Apple has reportedly done on model compression and efficient inference on Apple Silicon, is closer to genuinely new, at least at the consumer hardware scale Apple operates at. The end-to-end system, though, is best understood as Apple catching up to a competitive field rather than leading it.
The practical implication for consumers is this: if you are already inside Apple's ecosystem and you have been frustrated by Siri's historical limitations, the new version appears to be meaningfully better on the tasks it was already decent at, and somewhat better on the tasks it was previously terrible at. That is real progress. If you are evaluating whether to build a HomeKit-heavy smart home or to rely on Apple Intelligence for serious productivity work, the honest answer is that the system is improving but has not yet demonstrated the reliability that would justify betting heavily on it. This is based on limited public testing data; Apple has not released benchmark results under controlled conditions, and independent replication of the ZDNet-style evaluations is still sparse.
What I would want to see next, to actually update my assessment in a meaningful direction, is systematic evaluation across a diverse set of users and task types, not a single journalist's 10-round test on one machine. I would want to see Apple publish, or at least describe in technical terms, the error rates on its routing logic. I would want independent researchers to probe the privacy boundary between on-device processing and ChatGPT handoffs under adversarial conditions. And I would want to see the HomeKit integration tested not in ideal conditions but in the kind of messy, multi-device, occasionally-offline environments that real homes actually are.
Apple is building something that could, in a few product cycles, be genuinely impressive. The hardware foundation, Apple Silicon's neural engine performance, the tight software-hardware integration, the privacy architecture, is arguably stronger than what any competitor has at the consumer level. The software, for now, is still catching up to that foundation. Whether it gets there fast enough to matter in a market that is moving extremely quickly is the real question, and nobody has a confident answer to it yet.