Daily papers related to Image/Video/Multimodal Generation from cs.CV
February 01, 2026
While recent video diffusion models (VDMs) produce visually impressive results, they struggle to maintain 3D structural consistency, often exhibiting object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives provide no explicit incentive for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient, self-supervised framework that uses a geometry foundation model to automatically derive dense preference signals, which guide VDMs via Direct Preference Optimization (DPO). This steers the generative distribution toward inherent 3D consistency without requiring human annotations. Using only a small number of preference pairs, VideoGPA significantly improves temporal stability, physical plausibility, and motion coherence, consistently outperforming state-of-the-art baselines in extensive experiments.
TLDR: VideoGPA improves 3D consistency in video generation by using a geometry foundation model to provide preference signals that guide video diffusion models via Direct Preference Optimization, leading to enhanced temporal stability and physical plausibility.
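The abstract does not give the exact training objective, but DPO applied to diffusion models (as in Diffusion-DPO) typically compares the denoising errors of the fine-tuned model and a frozen reference on a preferred ("winner") and a dispreferred ("loser") sample. Below is a minimal, self-contained sketch of that per-pair loss; the function name and the scalar-error inputs are illustrative assumptions, not VideoGPA's actual implementation.

```python
import math

def diffusion_dpo_loss(err_w_theta, err_l_theta, err_w_ref, err_l_ref, beta=0.1):
    """Hypothetical per-pair Diffusion-DPO loss from scalar denoising errors.

    err_w_* / err_l_*: denoising MSE of the fine-tuned model (theta) and the
    frozen reference (ref) on the preferred (w) and dispreferred (l) clips.
    In VideoGPA's setting, the preference labels would come from a geometry
    foundation model scoring 3D consistency, not from human annotators.
    """
    # Reward the model for lowering error on the winner relative to the
    # reference more than it lowers error on the loser.
    margin = -beta * ((err_w_theta - err_w_ref) - (err_l_theta - err_l_ref))
    # -log(sigmoid(margin)), written out explicitly.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At parity (no improvement on either clip) the loss is log(2) ~ 0.693;
# improving on the winner relative to the loser drives it below that.
```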