A thread on Hacker News that's exploding right now centers on OpenRouter's recent blog post, "Royale: Last Agent Standing," in which the company pitted LLMs against each other in long-running agent tasks to see which ones survive longest without crashing, hallucinating themselves into a corner, or simply stopping.
The results are interesting enough on their own, but what's really setting the comments section alight is the underlying question: what happens when these agents don't just live inside a chat box, but are controlling something physical?
And here the OpenRouter data runs headlong into a rather uncomfortable reality from academia. Research from institutions including Carnegie Mellon and King's College London is unequivocal: none of today's popular LLMs are truly ready for general physical robot control. Not Claude, not Grok, not any of them.
The concrete numbers from the research are fairly sobering: prompt attacks degrade performance by 21.2% on average, while perception attacks hit even harder at 30.2%. In practice, that means a robot controlled by an LLM can be manipulated into doing something entirely unintended, whether by a note on the floor, an unusual instruction, or just a little noise in the camera feed.

Models in spatial navigation scenarios (think fire evacuation) have also been documented confidently recommending a route toward the server room instead of the emergency exit. Not because they're unintelligent, but because they lack what researchers call "embodiment": a genuine understanding that mistakes in the physical world don't come with an undo button.
The HN debate revolves around precisely this point: OpenRouter's benchmark measures agent robustness in digital environments, but the community is loudly asking whether we're fooling ourselves into believing that "lasts a long time in an agent loop" means "safe enough to move things around in the real world."
These are early signals from community sources, so take them with a grain of salt — but the temperature of the discussion is high enough that this will likely surface in more established media before long.
Worth keeping an eye on: how model providers respond to this kind of benchmark criticism, and whether we'll soon see dedicated "physical safety" evaluations as standard — not just "where did the token stream break?"
