A thread currently blowing up on r/LocalLLaMA has sparked real buzz: Alibaba's Qwen team has dropped a new series of compact models with little warning, and the community's verdict is unambiguous: people are impressed.

It's not just about the models being small. It's about what they can actually achieve.

Qwen3.5-9B is the model stealing the show right now. At 4-bit quantization it fits on a single RTX 3060 with 12GB of VRAM, a reasonably priced, three-year-old card. Yet reported benchmarks have it beating GPT-5 Nano and Gemini 2.5 Flash-Lite on vision tasks by double-digit margins: on MathVision it scores 78.9 against Google's 62.2. That is no small difference.

A 9B model that outperforms Google's and OpenAI's mini-models — and runs locally on consumer hardware.
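A back-of-envelope calculation shows why the 12GB card is enough. This is a rough sketch using my own assumptions (about 0.5 bytes per parameter at 4-bit, plus a flat allowance for KV cache and runtime buffers), not figures from the release itself:

```python
# Rough memory estimate for a 9B model at 4-bit quantization.
# Assumptions (mine, not from the post): 4-bit weights take
# ~0.5 bytes/parameter; KV cache and runtime buffers add a few GB.

def weight_gb(params_billion: float, bits_per_param: int = 4) -> float:
    """Approximate weight memory in GB for a quantized model."""
    return params_billion * bits_per_param / 8

weights = weight_gb(9)   # ~4.5 GB of 4-bit weights
overhead = 3.0           # assumed KV cache + buffers, in GB

print(f"weights ~{weights:.1f} GB, total ~{weights + overhead:.1f} GB of 12 GB")
```

Even with a generous overhead allowance, the total stays comfortably under the 3060's 12GB, which is consistent with the community reports.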

One of the most interesting aspects is the MoE model Qwen3.5-35B-A3B. It has 35 billion parameters in total but activates only 3 billion during inference — and still surpasses the previous generation's 235B-A22B model. This tells us something important: Alibaba is pushing hard on architecture and data quality rather than just stacking more parameters. It's a clear trend we're going to see more of.
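The MoE numbers are worth making concrete. The following is illustrative arithmetic based only on the parameter counts mentioned above (35B-A3B versus the previous generation's 235B-A22B); the active-parameter share is a rough proxy for per-token compute, not a benchmark:

```python
# Illustrative MoE arithmetic from the figures in the post:
# 35B total / 3B active vs. the older 235B total / 22B active.

def active_share(total_b: float, active_b: float) -> float:
    """Fraction of parameters used on each forward pass."""
    return active_b / total_b

new_gen = active_share(35, 3)    # Qwen3.5-35B-A3B
old_gen = active_share(235, 22)  # previous-generation 235B-A22B

print(f"new: {new_gen:.1%} active, old: {old_gen:.1%} active")
print(f"active params per token: 3B vs 22B (~{22 / 3:.0f}x fewer)")
```

Roughly 7x fewer parameters touched per token while reportedly matching or beating the larger model: that is the architecture-and-data story in one line.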

All models are natively multimodal (text, image, and video from the same weights), support a 262K context window (expandable to around 1M tokens), and cover 201 languages and dialects. They are already available via Ollama, LM Studio, llama.cpp, and MLX.

The smallest models (0.8B and 2B) push this even further: they are designed to run directly on mobile phones, requiring 3GB to 5GB of total memory.

A couple of caveats are worth mentioning. These are early signals from community sources, and user experiences vary. Some report hallucinations on specialized coding tasks (especially Solidity), while others have diametrically opposite experiences. Such variations are common at launch, and more systematic testing will follow.

Why is this important? Because the threshold for what can run locally — on your own machine, without API costs, without data sharing — just dropped again. And it's happening fast.

Keep an eye on this. Mainstream tech media hasn't picked it up yet.