Google pairs a 4-second image model with a 10-cent-per-second video model and stitches them into one API

Google did not ship one new model on Thursday. It shipped two, and then wired them together.

The company opened API access to a pair of Gemini models: Nano Banana 2 Lite, an image generator whose official handle is gemini-3.1-flash-lite-image, and Gemini Omni Flash, a video generator. Google says Nano Banana 2 Lite can render a 1K image in about four seconds for roughly $0.20 per image, and that Omni Flash produces video at $0.10 per second of output. Those numbers are not independent benchmarks; they are vendor pricing and timing claims. Read them as Google's opening bid, not as measured parity with OpenAI, Runway, or ByteDance.

The deeper story is what Google is doing with the pair. The two models are priced and exposed so that a developer can call image generation and video generation from the same Gemini API surface, hand the image output into the video model, and get back a clip. Nano Banana 2 Lite targets the boring end of the image market: e-commerce product shots, ad iterations, the kind of work that has to be cheap enough to throw away. Omni Flash, according to its DeepMind model card, is positioned as a "lightweight" video model with Gemini's world knowledge baked into generation and editing. You can talk to it in natural language to edit a clip, feed it image, text, or short video references, and ask it to keep labels, logos, or product text in sync with motion.

The composition is the product. A demo Google showed for an e-commerce "Omni product studio" generates product stills with Nano Banana 2 Lite, then animates them into short marketing clips with Omni Flash, all from a single prompt. A second demo, for a renovation service called Space Lift, points the same pipeline at a different job: text-to-image for room mockups, then image-to-video for a fly-through. The pitch is that a small team can run thousands of these per day at the listed prices.
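To make the throughput pitch concrete, here is a back-of-envelope sketch of what one still-plus-clip run would cost at Google's listed prices. The two price constants come from the vendor figures quoted above; the function names, the 6-second clip length, and the 1,000-runs-per-day volume are illustrative assumptions, not anything Google has published.

```python
from decimal import Decimal

# Vendor-listed prices quoted in this article (Google's claims, not measurements).
IMAGE_PRICE_USD = Decimal("0.20")             # per 1K image, Nano Banana 2 Lite
VIDEO_PRICE_PER_SECOND_USD = Decimal("0.10")  # per second of output, Omni Flash


def pipeline_cost(clip_seconds: int, stills: int = 1) -> Decimal:
    """Cost of one run: `stills` images plus one clip of `clip_seconds`.

    Hypothetical helper for illustration; ignores retries, discarded
    generations, and any volume discounts.
    """
    if clip_seconds <= 0 or stills < 1:
        raise ValueError("need at least one still and a positive clip length")
    return stills * IMAGE_PRICE_USD + clip_seconds * VIDEO_PRICE_PER_SECOND_USD


def daily_cost(runs_per_day: int, clip_seconds: int, stills: int = 1) -> Decimal:
    """Total daily spend for a fixed number of identical pipeline runs."""
    return runs_per_day * pipeline_cost(clip_seconds, stills)


if __name__ == "__main__":
    # Assumed workload: 1,000 runs per day, one still and a 6-second clip each.
    print(pipeline_cost(6))     # 0.80
    print(daily_cost(1000, 6))  # 800.00
```

Under those assumptions the video seconds, not the still, dominate the bill, which is why clip length is the first number a buyer would want to tune.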

Google is not trying to win a capability crown against OpenAI's GPT Image 2 or Sora here. It is trying to set the floor. VentureBeat's coverage framed the release as Google's fastest and cheapest image model to date. Ars Technica made the same point and noted that the price undercuts several frontier image APIs. The framing matters because it shifts the competitive question from "which model draws better" to "which model can a pipeline call ten thousand times before lunch."

Google is also candid about what is missing. Its own Omni Flash model card lists four launch constraints that will decide real-world fit. Output is capped at 10 seconds per clip, so the model cannot yet generate a full narrative scene. There is no audio reference upload. The API will accept reference clips of up to 3 seconds, but the model "does not yet handle them reliably," per the card. And character consistency degrades across scene cuts and camera moves, which is a hard ceiling for anyone who needs a protagonist to walk from one room to another with the same face.
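A buyer could encode those four constraints as a pre-flight check before sending any job to the video model. The sketch below is one way to do that; the `VideoJob` fields and the `check_job` function are hypothetical names invented for illustration and do not correspond to any Gemini API request schema. Only the 10-second cap, the 3-second reference limit, the missing audio upload, and the consistency caveat come from the model card as reported here.

```python
from __future__ import annotations

from dataclasses import dataclass

# Limits as reported from the Omni Flash model card.
MAX_OUTPUT_SECONDS = 10.0
MAX_REFERENCE_CLIP_SECONDS = 3.0


@dataclass
class VideoJob:
    """Hypothetical description of a planned clip (not an API schema)."""
    output_seconds: float
    reference_clip_seconds: float | None = None  # None means no video reference
    has_audio_reference: bool = False
    scene_cuts: int = 0


def check_job(job: VideoJob) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for a job against the launch constraints."""
    errors: list[str] = []
    warnings: list[str] = []

    if job.output_seconds <= 0:
        errors.append("output length must be positive")
    elif job.output_seconds > MAX_OUTPUT_SECONDS:
        errors.append("output exceeds the 10-second clip cap")

    if job.has_audio_reference:
        errors.append("audio reference upload is not supported at launch")

    if job.reference_clip_seconds is not None:
        if job.reference_clip_seconds > MAX_REFERENCE_CLIP_SECONDS:
            errors.append("reference clip exceeds the 3-second limit")
        else:
            warnings.append("video references are accepted but not handled reliably")

    if job.scene_cuts > 0:
        warnings.append("character consistency may degrade across scene cuts")

    return errors, warnings


if __name__ == "__main__":
    job = VideoJob(output_seconds=12, reference_clip_seconds=2, scene_cuts=1)
    errors, warnings = check_job(job)
    print(errors)
    print(warnings)
```

Splitting hard limits from soft caveats mirrors the card's own distinction: two of the constraints reject a job outright, while the other two only lower the odds of a usable result.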

None of these constraints is fatal for the e-commerce and renovation demos Google showed. All of them bite if a buyer tries to use Omni Flash as a general-purpose video model.

That is also the backdrop for a point QbitAI's coverage raises: as of this release, Google has still not shipped Gemini 3.5 Pro. The multimodal side of Gemini, by contrast, is now wide open to developers. Read together, those are two different bets on two different timelines. Google is releasing the cheap, chainable multimodal pieces while the bigger Gemini 3.5 Pro story stays on the bench.

The watch item is whether the image-to-video pipeline becomes a default workflow for retail, real estate, and short-form video shops, or whether buyers treat the two models as separate purchases. If pipelines win, Google's bet on throughput economics looks correct, and the 10-second video cap and the 3-second reference limit become the next constraints to lift. If buyers keep the models apart, Thursday's release looks more like a price cut than a strategy.
