The Chinese AI lab's new model predicts each frame from the one before, in real time, in response to voice. The "no fixed end" claim is Shengshu's, not independently verified.
Shengshu Technology, a Beijing AI video lab, released Vidu S1 on Friday: an AI video model the company says can keep generating footage as long as someone keeps talking to it. Announced at Beijing's Global Digital Economy Conference, the model is built around an autoregressive diffusion architecture that conditions each new frame on the one that came before, which is closer to how a webcam produces video than how a generator traditionally works.
That distinction is what Vidu S1 is selling. Most AI video models, including OpenAI's Sora and Runway's Gen-4, take a text prompt and produce a fixed-length clip, then stop. Tools such as Kuaishou's Kling can extend a clip but still produce it in chunks. Shengshu's claim with Vidu S1, per the company's product page and Lei Feng Net's coverage, is that the model never finishes a session. It keeps predicting the next frame for as long as it keeps receiving a voice signal and visual context.
The technical backbone is a hybrid of autoregressive prediction and diffusion-based denoising. A diffusion model starts from noise and removes it step by step to form an image; an autoregressive model predicts each new element from the ones that came before. Vidu S1 combines the two: the diffusion side denoises a frame, that frame is fed back as the starting condition for the next frame, and a voice signal is injected into the conditioning input so the model can react to what the user says mid-stream. The result, in theory, is a video stream rather than a video file.
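To make that loop concrete, here is a toy sketch in Python. It is not Shengshu's code, and nothing in it comes from the company's materials: the names `denoise_frame` and `stream`, the four-step schedule, and the way a voice value nudges the frame are all assumptions chosen for illustration. Frames are short lists of numbers standing in for images.

```python
import random


def denoise_frame(prev_frame, voice_signal, steps=4, seed=0):
    """Toy few-step denoiser (hypothetical, for illustration only).

    Starts from random noise and moves it, step by step, toward a target
    that depends on the previous frame and the voice conditioning value.
    """
    rng = random.Random(seed)
    # Assumption: conditioning is modeled as "previous frame plus voice nudge".
    target = [p + voice_signal for p in prev_frame]
    # Diffusion side: begin from pure noise, not from the previous frame.
    frame = [rng.gauss(0.0, 1.0) for _ in prev_frame]
    for step in range(steps):
        # Remove a growing share of the remaining noise; the last step
        # (alpha == 1.0) removes what is left.
        alpha = 1.0 / (steps - step)
        frame = [x + alpha * (t - x) for x, t in zip(frame, target)]
    return frame


def stream(first_frame, voice_signals):
    """Autoregressive side: each finished frame conditions the next one.

    `voice_signals` may be any iterable, including an endless generator,
    so the loop has no built-in final frame.
    """
    frame = first_frame
    for i, voice in enumerate(voice_signals):
        frame = denoise_frame(frame, voice, seed=i)
        yield frame
```

Calling `list(stream([0.0, 0.0], [0.5, -0.5, 1.0]))` yields three frames, one per voice value. Because `stream` is a generator, feeding it an unbounded source of voice values would keep it producing frames until the input stops, which is the behavior the "video stream rather than a video file" framing describes.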
What makes this plausible on paper is the inference stack wrapped around the model. Shengshu's TurboDiffusion framework bundles several inference optimizations: sparse attention variants (SLA and SpargeAttention), low-bit attention (SageAttention), and few-step generation. The company says the full stack runs on consumer-grade GPUs at 540p resolution and 25 frames per second, peaking at 42 fps. Shengshu also built a serving engine called TurboServe that it claims keeps generation stable across hours of input without the character's identity drifting. None of these claims (the frame rate, the resolution, or the "no drift" line) has been independently benchmarked in the sources reviewed. The numbers come from Shengshu's own product materials and the Sina Tech coverage of the launch, supplemented by ZOL's summary.
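The frame rates translate directly into a time budget, which is why few-step generation matters. The short Python calculation below works that out. The 25 and 42 fps figures are Shengshu's; the four-step denoising count is a hypothetical value used only to show the arithmetic, since the company has not published a step count in the sources reviewed.

```python
def frame_budget_ms(fps):
    """Milliseconds available to produce one frame at a given frame rate."""
    return 1000.0 / fps


def per_step_budget_ms(fps, denoise_steps):
    """Even split of the frame budget across denoising steps.

    Assumption: all of the frame time goes to denoising, which ignores
    audio handling, decoding, and network overhead.
    """
    return frame_budget_ms(fps) / denoise_steps


sustained = frame_budget_ms(25)     # 40.0 ms per frame
peak = frame_budget_ms(42)          # roughly 23.8 ms per frame
step = per_step_budget_ms(25, 4)    # 10.0 ms per step with a hypothetical 4 steps
```

At 25 fps the whole pipeline has 40 ms per frame; at the claimed 42 fps peak that shrinks to about 24 ms. Every extra denoising step divides that budget further, so a model needing dozens of steps per frame could not meet it without the kind of optimizations the company describes.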
The character-creation side is a separate practical shift. Traditional AI video pipelines need a few seconds of reference footage, or a fine-tuned model per character, to keep a face consistent across shots. Vidu S1, according to Shengshu, takes a single image and infers identity, look, and style from it, with custom voices available from presets or via voice cloning. That keeps setup costs low, which is the real reason the use-case list leans toward AI companionship, virtual idols, livestream interaction, brand spokespeople, and game NPCs rather than film and TV.
The independent picture is thin. Shengshu has not released a model card, technical paper, or API pricing for Vidu S1 in the sources reviewed. The only third-party reference is the Beijing Software and Information Services Association's 2025 evaluation, which names Shengshu a "benchmark enterprise" for new-mode new-application work, a recognition of the company's standing in Beijing's digital economy, not a measurement of Vidu S1's output. The product is now live on vidu.cn and is offered as a model-as-a-service to enterprise customers, with the broader industry framing captured in a national science and technology briefing on China's AI sector.
The watch item is whether the architecture holds up under stress. Hours-long coherent video, identity consistency without per-character fine-tuning, and 25 fps on a consumer GPU are the three claims that matter most. If the product survives third-party testing, AI video moves from a production tool into a live-service primitive. If it doesn't, Vidu S1 is another streaming-inference product with a fresh name. The next signal will be the first independent developer publishing latency numbers and drift measurements from a real workload.
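For a sense of what such a test might compute, the Python sketch below shows one plausible approach, not an established benchmark for Vidu S1. It assumes the tester already has an identity embedding for the reference image and for sampled frames (from a face-recognition model of their choice, not shown here) plus a list of per-frame latencies. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identity_drift(reference, frame_embeddings):
    """Drift per sampled frame: 0.0 means identical to the reference,
    larger values mean the character looks less like the source image."""
    return [1.0 - cosine_similarity(reference, e) for e in frame_embeddings]


def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=95 for p95 latency."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

A tester would plot `identity_drift` over an hours-long session to see whether it stays flat, and compare `percentile(latencies_ms, 95)` against the 40 ms per frame that 25 fps implies.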