A thread on r/LocalLLaMA is currently exploding, pointing to a paywalled Financial Times article: DeepSeek is reportedly set to release V4 next week, not just as an upgraded text model but with image and video generation built into its architecture from the ground up.
These are not modules bolted on after the fact. According to what's circulating in the community, V4 is built as a natively multimodal model, trained on text, images, and video from day one. In theory, that lets it reason across modalities more coherently than its competitors: it understands visual context while writing, and textual intent while generating video.
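For intuition on what "natively multimodal" usually means at the input layer: a single transformer consumes one interleaved sequence of text tokens and image or video patch embeddings, instead of routing images through a separate, glued-on encoder. The sketch below shows that generic early-fusion pattern only; nothing about DeepSeek's actual V4 architecture is public, and all sizes and names here are illustrative.

```python
import torch
import torch.nn as nn

# Generic early-fusion sketch: text tokens and image patches are projected
# into one shared embedding space and fed to one transformer. This is the
# common pattern for natively multimodal models, not DeepSeek's design.
d_model = 512
text_embed = nn.Embedding(32_000, d_model)       # text token vocabulary
patch_embed = nn.Linear(16 * 16 * 3, d_model)    # 16x16 RGB patches -> d_model
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=2,
)

text_ids = torch.randint(0, 32_000, (1, 12))     # e.g. "a red cube rolling..."
patches = torch.randn(1, 64, 16 * 16 * 3)        # one 128x128 frame as 64 patches

# Early fusion: both modalities live in the same sequence, so every attention
# layer can mix visual and textual context from the very first layer onward.
sequence = torch.cat([text_embed(text_ids), patch_embed(patches)], dim=1)
hidden = backbone(sequence)                      # (1, 12 + 64, d_model)
```

The contrast with a "glued on afterwards" module is exactly this: there is no separate vision pipeline whose output gets stapled to the text model late in the stack.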
The numbers being thrown around are impressive: videos up to 30 minutes long, lighting and material reflections on par with production studio tools, and a strong grasp of object motion and spatial relationships. All of this from a model that reportedly activates only around 32 billion of its one trillion total parameters per token, an efficiency trick that should make inference significantly cheaper than in its predecessor, V3.
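That active-versus-total split is the hallmark of a sparse mixture-of-experts model: a router picks a handful of expert networks per token, so only a sliver of the weights participate in any single forward pass. With the rumored figures, that's roughly 32B / 1T, about 3% active per token. Here is a minimal top-k routing sketch to make the idea concrete; it is a generic MoE layer, not DeepSeek's actual router, and the expert count and dimensions are made up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Generic top-k mixture-of-experts layer: each token is routed to
    k of n_experts feed-forward blocks, so active parameters per token
    scale with k rather than with the total number of experts."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=64, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                             # x: (n_tokens, d_model)
        gate_logits = self.router(x)                  # (n_tokens, n_experts)
        weights, chosen = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # normalize over the k picks
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = chosen[:, slot] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

With these toy numbers, each token touches 2 of 64 experts, roughly 3% of the expert weights. Scale the same principle up and you get the rumored shape of V4: a trillion parameters on disk, but only ~32 billion doing work on any given token.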
And that's exactly where the catch lies. We're still talking about early signals from community sources and a paywalled FT article. No one has seen the model run live, and comparisons to Sora, Midjourney, and Stable Diffusion are based on expected specifications, not actual benchmarks. r/LocalLLaMA is, of course, ecstatic, but enthusiasm in these threads is not the same as proof.
What makes this interesting, however, is the timing and the source. The FT is hardly a rumor mill, and DeepSeek has previously surprised the market with models that delivered far beyond what their price tag would suggest. If V4 actually launches next week with these capabilities, it's not just a jab at OpenAI and Google — it's potentially an earthquake for the entire commercial image and video generation industry.
Keep an eye on official DeepSeek channels and follow the thread on r/LocalLLaMA. This is moving fast.
