Video becomes Google's next model surface
Google has unveiled nine demos of Gemini Omni and Gemini 3.5 Flash. The most compelling element is Omni: a model Google describes as capable of combining images, audio, video, and text as input and generating video as output.
The DeepMind model card makes the case more concrete. Gemini Omni Flash is described as a transformer-based model with native multimodal support for text, vision, video, and audio inputs. The output is video with audio. That shifts Gemini from understanding media to also producing and editing it.
Video AI moves from prompt to dialogue: change the scene, keep the thread, adjust the details.
What the demos show
Among other things, Google demonstrates conversational video editing, where users can modify environments, actions, camera angles, or details across multiple turns. The point is not merely to generate a clip from a prompt, but to treat video as a working object that can be iterated upon.
The Flow update adds further context. Google says Gemini Omni Flash is coming to Google Flow and Google Flow Music, with a focus on precise video editing, agentic experiences, and creative workflows. Omni is also intended to assist with character consistency, preserving identity and voice across scenes.

Gemini 3.5 Flash is the other half of the story
This announcement is not solely about video. Google uses the same demo package to position Gemini 3.5 Flash as a model for agentic tasks. The DeepMind model card describes 3.5 Flash as a multimodal reasoning model that accepts up to 1M input tokens and produces up to 64K output tokens.
Google says 3.5 Flash is generally available through Antigravity, the Gemini API in AI Studio, Android Studio, Gemini Enterprise Agent Platform, and Gemini Enterprise. It is also connected to AI Mode in Search and is rolling out in the Gemini app.
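For teams planning agentic workloads, the context limits quoted from the model card translate into a simple budget check before a request is sent. The sketch below is illustrative only: it reads "1M" and "64K" as round decimal figures, which is an assumption, and the names check_budget and BudgetCheck are hypothetical helpers, not part of any Google SDK. Token counts would have to come from whatever tokenizer or counting endpoint the platform provides.

```python
from dataclasses import dataclass

# Assumption: limits as summarised in the article; the exact values
# (for example 1,048,576 rather than 1,000,000) should be checked
# against the official model card.
MAX_INPUT_TOKENS = 1_000_000
MAX_OUTPUT_TOKENS = 64_000


@dataclass(frozen=True)
class BudgetCheck:
    """Result of comparing a planned request against the limits."""
    fits: bool
    input_headroom: int   # negative when the input is over the limit
    output_headroom: int  # negative when the requested output is over the limit


def check_budget(input_tokens: int, requested_output_tokens: int) -> BudgetCheck:
    """Hypothetical helper: does a planned request fit the stated limits?"""
    if input_tokens < 0 or requested_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_headroom = MAX_INPUT_TOKENS - input_tokens
    output_headroom = MAX_OUTPUT_TOKENS - requested_output_tokens
    return BudgetCheck(
        fits=input_headroom >= 0 and output_headroom >= 0,
        input_headroom=input_headroom,
        output_headroom=output_headroom,
    )


if __name__ == "__main__":
    # A long transcript plus a modest answer fits; an oversized input does not.
    print(check_budget(250_000, 8_000))
    print(check_budget(1_200_000, 8_000))
```

A check like this is cheap to run before each agent step and makes it obvious when a task needs to be split or summarised first.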
Use cases and pitfalls
Companies will quickly move to test tools like these for campaigns, training videos, product demos, internal communications, and social formats. The potential gains are significant: fewer costly shoots, faster iteration, and a lower barrier to localised content.
But video carries more risk than text. It looks finished even when it is wrong. Rights, privacy, labelling, synthetic personas, manipulated events, and industry regulations all need to be addressed before such tools become routine.
Conclusion
The Gemini Omni demos make clear that Google has no intention of treating video AI as a side market. The company wants to make multimodal video a core part of the Gemini platform, tightly integrated with agentic workflows, Flow, the Gemini app, and developer tooling.
For users and organisations, this represents both a genuine opportunity and a real challenge. It is a viable production capability, but only if the right routines for labelling, rights management, source verification, and human review are built alongside it.
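One way to make those routines concrete is a release gate that blocks publication until every check has been signed off. The following is a minimal sketch under stated assumptions: the check names, the VideoAsset record, and both helper functions are hypothetical and would need to be adapted to an organisation's own policy and tooling.

```python
from dataclasses import dataclass, field

# Hypothetical check names mirroring the four routines named above.
REQUIRED_CHECKS = (
    "labelled_as_synthetic",
    "rights_cleared",
    "sources_verified",
    "human_reviewed",
)


@dataclass
class VideoAsset:
    """Illustrative record for a generated clip awaiting release."""
    title: str
    checks: dict[str, bool] = field(default_factory=dict)


def missing_checks(asset: VideoAsset) -> list[str]:
    """Return the required checks that are absent or not yet passed."""
    return [name for name in REQUIRED_CHECKS if not asset.checks.get(name, False)]


def ready_to_publish(asset: VideoAsset) -> bool:
    """A clip is releasable only when no required check is missing."""
    return not missing_checks(asset)


if __name__ == "__main__":
    draft = VideoAsset(
        title="Product demo, localised",
        checks={"labelled_as_synthetic": True, "rights_cleared": True},
    )
    print(ready_to_publish(draft))
    print(missing_checks(draft))
```

The design choice is to treat a missing entry the same as a failed one, so that a clip cannot slip through simply because nobody recorded a review.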
