AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

March 01, 2026

Mode Seeking meets Mean Seeking for Fast Long Video Generation

Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos: the student learns long-range coherence and motion from limited long videos via supervised flow matching, while inheriting local realism by aligning each of its sliding-window segments to a frozen short-video teacher, resulting in a fast, few-step long video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion, and long-range consistency. Project website: https://primecai.github.io/mmm/.
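
The split described in the abstract, one global term on the full long video plus a local term on every sliding window, can be illustrated with a small sketch. This is not the authors' implementation: the function names `sliding_windows` and `combined_loss`, the window length, the stride, and the simple weighted-sum combination are all assumptions made for illustration, and the per-window loss values stand in for whatever the local Distribution Matching head would actually produce.

```python
from typing import List, Sequence, Tuple


def sliding_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges that cover a long clip.

    Hypothetical helper: window and stride are illustrative parameters,
    not values taken from the paper.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if num_frames <= window:
        # The clip is already short enough to be a single window.
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    # Add a final window flush with the end so the tail frames are covered.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


def combined_loss(global_loss: float,
                  local_losses: Sequence[float],
                  local_weight: float = 1.0) -> float:
    """Global long-video term plus the weighted mean of per-window local terms.

    Assumed combination rule; the paper may weight or sample windows differently.
    """
    if not local_losses:
        return global_loss
    return global_loss + local_weight * sum(local_losses) / len(local_losses)


if __name__ == "__main__":
    windows = sliding_windows(num_frames=10, window=4, stride=3)
    print(windows)
    # Placeholder numbers standing in for real per-window losses.
    print(combined_loss(1.0, [0.5, 1.5], local_weight=2.0))
```

In this toy setup each window would be scored against the frozen short-video teacher, while the single global term would come from supervised flow matching on the full clip.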

TLDR: This paper introduces a new training paradigm called Mode Seeking meets Mean Seeking (MMM) for generating long videos by decoupling local fidelity from long-term coherence using a Decoupled Diffusion Transformer, allowing for fast, minute-scale video generation.
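
As background for the name, the two kinds of objective can be written in a common textbook form. The notation below (linear interpolation path, $x_0$ as noise, $x_1$ as data, $p_\theta$ for the student, $p_{\mathrm{teacher}}$ for the frozen teacher) is an assumption chosen for illustration; the paper's exact parameterization may differ.

```latex
% Mean-seeking side: a regression (flow matching) loss along x_t = (1-t) x_0 + t x_1.
\mathcal{L}_{\mathrm{FM}}(\theta)
  = \mathbb{E}_{t,\,x_0,\,x_1}
    \bigl\| v_\theta(x_t, t) - (x_1 - x_0) \bigr\|^2,
\qquad x_t = (1 - t)\,x_0 + t\,x_1 .

% The minimizer of a squared-error regression is a conditional mean:
v^\ast(x, t) = \mathbb{E}\bigl[\, x_1 - x_0 \mid x_t = x \,\bigr].

% Mode-seeking side: reverse KL from the student to the teacher.
D_{\mathrm{KL}}\bigl(p_\theta \,\|\, p_{\mathrm{teacher}}\bigr)
  = \mathbb{E}_{x \sim p_\theta}
    \bigl[ \log p_\theta(x) - \log p_{\mathrm{teacher}}(x) \bigr].
```

Because the reverse KL takes its expectation under the student, it penalizes student samples that fall where the teacher assigns low density, which is why it is usually described as mode-seeking, whereas the squared-error regression averages over all targets consistent with $x_t$.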

Relevance: (10/10)
Novelty: (9/10)
Clarity: (9/10)
Potential Impact: (9/10)
Overall: (9/10)

Authors: Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, Arash Vahdat