Daily papers related to Image/Video/Multimodal Generation from cs.CV
January 01, 2026
Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks by generating a continuous sequence of frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that uses a VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.
TLDR: The paper introduces VIPER, a new benchmark for evaluating generative video reasoning models with a focus on process-aware assessment and a new metric, Process-outcome Consistency (POC@r), to address outcome-hacking issues in current evaluation frameworks.
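How POC@r aggregates the judge's verdicts is not spelled out in the abstract, but one plausible reading is that a sample counts as consistent only when the final outcome is correct and the rubric-based process score reaches the threshold r. Below is a minimal sketch under that assumption; the JudgedSample schema and field names are illustrative, not the paper's actual interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgedSample:
    """One generated video as scored by a VLM judge (hypothetical schema)."""
    outcome_correct: bool   # final result judged correct
    process_score: float    # rubric score for the intermediate frames, in [0, 1]


def poc_at_r(samples: List[JudgedSample], r: float) -> float:
    """Fraction of samples whose outcome is correct AND whose process
    score reaches the threshold r (one plausible reading of POC@r)."""
    if not samples:
        return 0.0
    consistent = sum(1 for s in samples if s.outcome_correct and s.process_score >= r)
    return consistent / len(samples)


# Outcome-hacking shows up as correct outcomes paired with low process scores.
batch = [
    JudgedSample(outcome_correct=True, process_score=1.0),   # genuinely solved
    JudgedSample(outcome_correct=True, process_score=0.4),   # outcome-hacked
    JudgedSample(outcome_correct=False, process_score=0.9),  # valid steps, wrong result
]
print(poc_at_r(batch, r=1.0))  # 0.33..., only the first sample counts
```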
In this work, we show that the impact of model capacity varies across timesteps: it is crucial for the early and late stages but largely negligible during the intermediate stage. Accordingly, we propose FlowBlending, a stage-aware multi-model sampling strategy that employs a large model and a small model at capacity-sensitive stages and intermediate stages, respectively. We further introduce simple criteria to choose stage boundaries and provide a velocity-divergence analysis as an effective proxy for identifying capacity-sensitive regions. Across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B), FlowBlending achieves up to 1.65x faster inference with 57.35% fewer FLOPs, while maintaining the visual fidelity, temporal coherence, and semantic alignment of the large models. FlowBlending is also compatible with existing sampling-acceleration techniques, enabling up to 2x additional speedup. Project page is available at: https://jibin86.github.io/flowblending_project_page.
TLDR: The paper introduces FlowBlending, a stage-aware multi-model sampling strategy for video generation that selectively uses smaller models during less capacity-sensitive stages, achieving faster inference and fewer FLOPs without sacrificing visual quality.
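The routing idea can be illustrated with a plain Euler sampler for a flow-matching model that switches between two velocity networks by timestep. This is only a sketch under assumptions: the stage boundaries, the model call signature, and the single Euler update are placeholders rather than the paper's implementation, which selects boundaries with its own criteria (e.g. velocity divergence).

```python
import torch


def flow_blended_sample(large_model, small_model, x, num_steps=50,
                        early_boundary=0.2, late_boundary=0.8):
    """Euler sampling over t in [0, 1]: route capacity-sensitive early/late
    steps to the large model and intermediate steps to the small model.
    Boundary values here are illustrative placeholders."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i / num_steps
        use_large = t < early_boundary or t >= late_boundary
        model = large_model if use_large else small_model
        with torch.no_grad():
            t_batch = torch.full((x.shape[0],), t, device=x.device)
            v = model(x, t_batch)   # predicted velocity at this timestep
        x = x + dt * v              # Euler update along the flow
    return x


# Toy usage with stand-in velocity fields (any callable with this signature works).
large = lambda x, t: -x
small = lambda x, t: -0.9 * x
out = flow_blended_sample(large, small, torch.randn(2, 4))
```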
Inspired by the remarkable success of autoregressive models in language modeling, the visual generation community has widely adopted this paradigm. However, the sequential token-by-token decoding mechanism inherent in traditional autoregressive models leads to low inference efficiency. In this paper, we propose RadAR, an efficient and parallelizable framework designed to accelerate autoregressive visual generation while preserving its representational capacity. Our approach is motivated by the observation that visual tokens exhibit strong local dependencies and spatial correlations with their neighbors, a property not fully exploited in standard raster-scan decoding orders. Specifically, we organize the generation process around a radial topology: an initial token is selected as the starting point, and all other tokens are grouped into multiple concentric rings according to their spatial distances from this center. Generation then proceeds in a ring-wise manner, from inner to outer regions, enabling the parallel prediction of all tokens within the same ring. This design not only preserves the structural locality and spatial coherence of visual scenes but also substantially increases parallelization. Furthermore, to address the risk of inconsistent predictions arising from simultaneous token generation with limited context, we introduce a nested attention mechanism that dynamically refines implausible outputs during the forward pass, thereby mitigating error accumulation and preventing model collapse. By integrating radial parallel prediction with dynamic output correction, RadAR significantly improves generation efficiency.
TLDR: This paper introduces RadAR, a novel autoregressive visual generation framework that uses a radial topology generation process to improve inference efficiency through parallel processing while preserving spatial coherence and addressing inconsistency with a nested attention mechanism.
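The ring-wise decoding order can be sketched independently of the model. The snippet below groups the positions of an H x W token grid into concentric rings around a center token, from inner to outer; the Chebyshev radius and the grid-center starting token are assumptions made for illustration, since the abstract does not specify the distance metric or how the initial token is chosen.

```python
from collections import defaultdict


def radial_ring_order(height, width, center=None):
    """Group token positions of an H x W grid into concentric rings around a
    center token, ordered from inner to outer."""
    if center is None:
        center = (height // 2, width // 2)          # assumed center; paper may differ
    cy, cx = center
    rings = defaultdict(list)
    for y in range(height):
        for x in range(width):
            radius = max(abs(y - cy), abs(x - cx))  # Chebyshev radius (assumption)
            rings[radius].append((y, x))
    return [rings[r] for r in sorted(rings)]


# Tokens within a ring would be predicted in parallel, conditioning on all
# previously generated (inner) rings.
for step, ring in enumerate(radial_ring_order(4, 4)):
    print(f"step {step}: {len(ring)} tokens in parallel")
```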
Recent advances in text-to-video (T2V) generation have achieved strong visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods, which mainly rely on graphics or prompt extension, struggle to generalize beyond simple simulated environments or to learn implicit physical reasoning, and training data rich in physical interactions and phenomena is scarce. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, which leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. We then formulate a principled Physics-aware Groupwise Direct Preference Optimization (PhyGDPO) framework that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency, and we propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO.
TLDR: The paper introduces PhyGDPO, a physics-aware groupwise direct preference optimization framework for text-to-video generation that leverages a physics-augmented dataset and VLM-based rewards to improve physical consistency.
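For intuition about the groupwise Plackett-Luce model that PhyGDPO builds on, the sketch below computes the negative log-likelihood of a reward-induced ranking over a group of candidates. In a DPO-style setup the per-candidate scores would be implicit rewards such as beta * (log pi_theta(y|x) - log pi_ref(y|x)), and the ranking would come from the physics-based reward; the actual PhyGDPO objective, Physics-Guided Rewarding weights, and LoRA-SR scheme are not reproduced here.

```python
import torch


def plackett_luce_nll(scores: torch.Tensor, ranking: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of an observed ranking under the Plackett-Luce
    model. `scores` holds one score per candidate in the group (shape [K]);
    `ranking` lists candidate indices from most to least preferred."""
    ordered = scores[ranking]                  # scores in preference order
    nll = torch.zeros((), dtype=scores.dtype)
    for k in range(len(ordered) - 1):
        # Probability that the k-th ranked candidate beats all remaining ones.
        nll = nll - (ordered[k] - torch.logsumexp(ordered[k:], dim=0))
    return nll


# Toy group of 3 candidates, ranked [2, 0, 1] by a (hypothetical) physics reward.
scores = torch.tensor([0.3, -0.1, 0.8], requires_grad=True)
loss = plackett_luce_nll(scores, torch.tensor([2, 0, 1]))
loss.backward()   # gradients push preferred candidates' scores up
```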