Gemini Omni: Reasoning and creation in one model

On the same day, Google DeepMind also released Gemini Omni — a multimodal model that unifies reasoning and content creation in a single architecture.

It takes text, images, audio, or video as input and can produce output in any of those modalities. The tagline is clear: "where Gemini's ability to reason meets the ability to create."

Conversational video editing

Gemini Omni's headline feature is video editing through natural conversation. Think of it as "Nano Banana, but for video" — you edit step by step through dialogue, with each change maintaining consistency.

Key demo scenarios:

  • Natural language editing: "Turn the apartment lights on one by one in sync with the music" — and the video follows
  • Reference-based editing: Give it a reference image, and it adapts the video to match the style
  • Multi-turn iteration: Change the scene, then the camera angle, then the lighting — each step builds on the last
  • Physics understanding: The model understands how objects behave in the real world, keeping outputs realistic
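
To make the multi-turn idea concrete, here is a toy sketch of what "each step builds on the last" means as state. It is purely illustrative: EditSession and Revision are hypothetical names invented for this example, not part of any Gemini API, and no video is actually processed. The only point is that every turn extends the previous revision instead of starting over from the source clip.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Revision:
    """One state of the clip: the instructions applied so far, in order."""
    instructions: tuple[str, ...] = ()

    def apply(self, instruction: str) -> "Revision":
        # A new revision extends the previous one rather than restarting
        # from the untouched source clip.
        return Revision(self.instructions + (instruction,))


@dataclass
class EditSession:
    """Hypothetical multi-turn session (illustration only, not a real API)."""
    history: list[Revision] = field(default_factory=lambda: [Revision()])

    def edit(self, instruction: str) -> Revision:
        # Always build on the most recent revision in the conversation.
        revision = self.history[-1].apply(instruction)
        self.history.append(revision)
        return revision


session = EditSession()
session.edit("change the scene to a rooftop at dusk")
session.edit("switch to a low camera angle")
latest = session.edit("warm up the lighting")
print(latest.instructions)
```

The final revision carries all three instructions in order, which is the consistency property the demos emphasize: the lighting change applies to the rooftop scene at the new camera angle, not to the original footage.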

How it works

Gemini Omni fuses reasoning and generation in a single architecture rather than a pipeline of separate stages. So when you say "make the violin invisible but keep the playing motion," it doesn't regenerate the whole clip.

It accepts reference materials (images, text, audio, video) and blends them into a coherent output.

It is available now in the Gemini app and Google Flow.

Industry impact

Gemini Omni pushes AI video from "generate and pray" toward "edit through conversation." That shift could reshape workflows in content creation, advertising, and film pre-production.

Compared to Sora, Gemini Omni emphasizes interactive editing and natural language control rather than one-shot text-to-video generation.

HN reaction

The Hacker News post reached 271 points. Commenters compared it to Runway, Sora, and Pika. The main skepticism concerned real-world quality versus the demo videos, but the direction (conversational editing plus multi-turn consistency) was widely praised as the right approach.