A thread on Hacker News is currently exploding with 232 comments and nearly 450 points — and the discussion revolves around a demo that seems almost impossible on paper: an iPhone 17 Pro running a 400 billion parameter LLM locally, without the cloud, without external hardware.
The @anemll account on X posted the demo, and reactions range from "this changes everything" to "this is technically cheating". The truth lies somewhere in the middle.
What's actually happening?
The trick is an open-source approach called Flash-MoE, built on the Mixture of Experts architecture. The core idea is that an MoE model never needs all of its weights active at once: for each token, only a small fraction of the model (a handful of "experts") is activated. So the iPhone's 12 GB of RAM doesn't have to hold 200+ GB of weights in live memory at once; the experts a token actually routes to are loaded from flash storage on the fly.
The result? It works. Technically. But the speed is rough: 0.6 tokens per second, or roughly one token every 1.7 seconds. Not exactly something you'd want to chat with in real time.
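To put 0.6 tokens per second in perspective, here is the arithmetic. The 0.75 words-per-token figure is a common rule of thumb for English text, not a number from the demo itself:

```python
tokens_per_sec = 0.6                         # throughput reported in the demo
sec_per_token = 1 / tokens_per_sec           # latency per generated token
words_per_min = tokens_per_sec * 0.75 * 60   # ~0.75 words/token is a rough heuristic

print(f"{sec_per_token:.2f} s/token, ~{words_per_min:.0f} words/min")
```

Around 27 words per minute, roughly the pace of slow hunt-and-peck typing, which is why this reads as a proof of concept rather than a usable assistant.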
Why care then?
Because this is a proof of concept, not a product. And it's exactly the kind of demo that historically signals a shift. A year ago, 7B models on phones were experimental; now they're mainstream. Hardware requirements keep falling, and the Flash-MoE approach suggests that the limit for what's "too big for a phone" may not be as fixed as we thought.
Apple itself has positioned the A19 Pro with Neural Accelerators and an improved cooling system precisely for local LLM workloads. They are obviously not aiming for 400B models — but someone outside Apple is doing it now, with existing hardware.
The HN comment section is divided. Some believe this is an engineering feat worth watching. Others point out that "loading parts of a model from storage" is not the same as true local inference in the traditional sense, and that the comparison is flawed.
Regardless: this is early signal territory. No mainstream tech editorial has picked it up yet, and that's precisely why it's worth noting now.
Source: @anemll on X, discussed on Hacker News (HN AI Best). These are community-driven observations — not yet verified by independent benchmarks.
