A thread on Hacker News is currently exploding with 232 comments and nearly 450 points — and the discussion revolves around a demo that seems almost impossible on paper: an iPhone 17 Pro running a 400 billion parameter LLM locally, without the cloud, without external hardware.
The @anemll account on X posted the demo, and reactions range from "this changes everything" to "this is technically cheating". The truth lies somewhere in the middle.
What's actually happening?
The trick is an open-source approach called Flash-MoE, built on the Mixture of Experts architecture. The core idea is that an MoE model never needs all of its weights active at once: for each token, only a small fraction of the model (a handful of "experts") is activated. So the iPhone's 12 GB of RAM doesn't have to hold 200+ GB of weights in live memory at once; the experts a token actually routes to are loaded from flash storage on the fly.
The result? It works. Technically. But the speed is rough: 0.6 tokens per second, or roughly one token every 1.7 seconds. Not exactly something you'd want to chat with in real time.
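To put 0.6 tokens per second in perspective, here is the arithmetic. The 0.75 words-per-token figure is a common rule of thumb for English text, not a number from the demo itself:

```python
tokens_per_sec = 0.6                         # throughput reported in the demo
sec_per_token = 1 / tokens_per_sec           # latency per generated token
words_per_min = tokens_per_sec * 0.75 * 60   # ~0.75 words/token is a rough heuristic

print(f"{sec_per_token:.2f} s/token, ~{words_per_min:.0f} words/min")
```

Around 27 words per minute, roughly the pace of slow hunt-and-peck typing, which is why this reads as a proof of concept rather than a usable assistant.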
Why care then?
Because this is a proof of concept, not a product. And it's exactly the kind of demo that historically signals a shift. A year ago, 7B models on phones were experimental; now they're mainstream. Hardware requirements keep falling, and the Flash-MoE approach suggests that the limit for what's "too big for a phone" may not be as fixed as we thought.
Apple itself has positioned the A19 Pro with Neural Accelerators and an improved cooling system precisely for local LLM workloads. They are obviously not aiming for 400B models — but someone outside Apple is doing it now, with existing hardware.
The HN comment section is divided. Some believe this is an engineering feat worth watching. Others point out that "loading parts of a model from storage" is not the same as true local inference in the traditional sense, and that the comparison is flawed.
Regardless: this is early signal territory. No mainstream tech editorial has picked it up yet, and that's precisely why it's worth noting now.
Source: @anemll on X, discussed on Hacker News (HN AI Best). These are community-driven observations — not yet verified by independent benchmarks.
