ArXiv CS.CV Papers (Image/Video Generation)

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although the continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
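
To make the JVP computation mentioned in the abstract concrete, here is a minimal sketch of a Jacobian-vector product computed with forward-mode dual numbers in plain Python. The `Dual` class, the `jvp` helper, and `toy_net` are hypothetical illustrations and are not the paper's FlashAttention-2 kernel; the sketch only shows that a JVP carries a tangent alongside each value in a single forward pass.

```python
import math


class Dual:
    """A value paired with its tangent (directional derivative)."""

    def __init__(self, val, tan=0.0):
        self.val = val
        self.tan = tan

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

    __rmul__ = __mul__


def tanh(x):
    """tanh on a Dual, using d/dx tanh(x) = 1 - tanh(x)^2."""
    t = math.tanh(x.val)
    return Dual(t, (1.0 - t * t) * x.tan)


def jvp(f, primals, tangents):
    """Return (f(primals), J_f(primals) @ tangents) in one forward pass."""
    duals = [Dual(p, t) for p, t in zip(primals, tangents)]
    out = f(*duals)
    return out.val, out.tan


def toy_net(x, t):
    """Hypothetical stand-in for a time-conditioned network f(x, t)."""
    return tanh(x * t) + 0.5 * x


# Tangent (0, 1) asks for the derivative of the output with respect to t only.
value, tangent = jvp(toy_net, (1.0, 0.5), (0.0, 1.0))
```

In a real model the same idea is applied per layer, which is why attention needs its own JVP-capable kernel.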

TLDR: The paper introduces score-regularized continuous-time consistency model (rCM) to improve the quality and diversity of large-scale image and video diffusion models, achieving significant acceleration and performance gains over existing methods.

Relevance: (10/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang

UniVideo: Unified Understanding, Generation, and Editing for Videos

Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.

TLDR: The paper introduces UniVideo, a unified framework for video understanding, generation, and editing, leveraging a dual-stream architecture with an MLLM and MMDiT, demonstrating strong performance and generalization capabilities across various tasks.

Relevance: (10/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (9/10)

Authors: Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen

LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation

Video diffusion models (DMs) have enabled high-quality video synthesis. However, their computation costs scale quadratically with sequence length because self-attention has quadratic complexity. While linear attention lowers the cost, fully replacing quadratic attention requires expensive pretraining due to the limited expressiveness of linear attention and the complexity of spatiotemporal modeling in video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving the original model's performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and inefficiency of existing objectives for this transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is efficient and recovers model performance. Extensive experiments show that our method achieves a 1.25-2.00x speedup while preserving generation quality, and our 4-step distilled model further delivers a 15.92x latency reduction with minimal visual quality drop.
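
The cost difference between quadratic and linear attention comes from the order of the matrix products. The sketch below, in plain Python, computes the same kernelized attention two ways: once by forming the n x n score matrix, and once by first summarizing keys and values into a small d x d_v matrix. The feature map `phi` (ELU plus one) is a common choice in the linear-attention literature and is an assumption here; the abstract does not say which map LinVideo uses, and this is not softmax attention.

```python
import math


def phi(m):
    """Positive feature map elu(x) + 1, applied elementwise (assumed choice)."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def attention_quadratic(q, k, v):
    """Forms the full n x n score matrix: O(n^2) in sequence length."""
    scores = matmul(phi(q), transpose(phi(k)))
    weights = [[s / sum(row) for s in row] for row in scores]
    return matmul(weights, v)


def attention_linear(q, k, v):
    """Same function, reassociated: O(n) in sequence length."""
    fq, fk = phi(q), phi(k)
    kv = matmul(transpose(fk), v)            # d x d_v summary of keys and values
    z = [sum(col) for col in zip(*fk)]       # length-d normalizer
    numer = matmul(fq, kv)                   # n x d_v
    denom = [sum(a * b for a, b in zip(row, z)) for row in fq]
    return [[x / d for x in row] for row, d in zip(numer, denom)]


# Toy inputs: n = 3 tokens, d = d_v = 2.
q = [[0.1, -0.3], [0.7, 0.2], [-0.5, 0.4]]
k = [[0.2, 0.1], [-0.4, 0.6], [0.3, -0.2]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out_quadratic = attention_quadratic(q, k, v)
out_linear = attention_linear(q, k, v)
max_gap = max(abs(a - b)
              for ra, rb in zip(out_quadratic, out_linear)
              for a, b in zip(ra, rb))
```

The two outputs agree up to floating-point error; only the second avoids the n x n intermediate.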

TLDR: The paper introduces LinVideo, a post-training framework for video diffusion models that achieves O(n) attention complexity by selectively replacing quadratic self-attention layers with linear attention layers, resulting in significant speedups with minimal performance degradation.

Relevance: (10/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Yushi Huang, Xingtong Ge, Ruihao Gong, Chengtao Lv, Jun Zhang

Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing

Image editing with natural language has gained significant popularity, yet existing methods struggle with intricate object intersections and fine-grained spatial relationships due to the lack of an explicit reasoning process. While Chain-of-Thought (CoT) has been explored to enhance reasoning, purely textual CoT or CoT augmented with coordinate information is fundamentally limited in its ability to represent intricate visual layouts and lacks the necessary visual cues to guide the generation of fine-grained, pixel-level details. To address these challenges, we propose Multimodal Reasoning Edit (MURE), a novel framework that shifts the visual editing process from purely text-based reasoning to a series of interleaved textual and visual rationales. Our framework performs image editing using a natively multimodal, interleaved text-image CoT. This approach generates a step-by-step chain of reasoning where a textual description is followed by a corresponding visual cue, such as a positional mask that defines the intended edit regions or a representation of new content. Furthermore, to mitigate the hallucination phenomenon of large language models, we introduce the Multimodal Deep Confidence (MMDC) reasoning paradigm. This paradigm explores a tree of visual reasoning paths at each step. By pruning low-quality branches using a deep confidence score from a reward model, it ensures the model consistently follows a high-quality trajectory towards the final edited result. The proposed method decomposes complex editing tasks into interdependent sub-tasks, achieving greater precision at each stage and yielding high-fidelity edited results. We define the formulation for interleaved text-image chains and release the first CoT-Edit-14K dataset, comprising 14K high-quality editing examples. Extensive experiments show that our method yields significant improvements across three image editing benchmarks.

TLDR: The paper introduces MURE, a novel framework for image editing that uses interleaved text-image Chain-of-Thought reasoning and Multimodal Deep Confidence to address the limitations of text-based methods in handling intricate object intersections and fine-grained spatial relationships.

Relevance: (9/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (9/10)

Authors: Zhentao Zou, Zhengrong Yue, Kunpeng Du, Binlei Bao, Hanting Li, Haizhen Xie, Guozheng Xu, Yue Zhou, Yali Wang, Jie Hu, Xue Jiang, Xinghao Chen

Controllable Video Synthesis via Variational Inference

Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.

TLDR: This paper introduces a controllable video synthesis method based on variational inference, letting users mix controls of varying granularity while maintaining diversity for under-specified elements. Experiments suggest improved controllability, diversity, and 3D consistency over prior work.

Relevance: (10/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (9/10)

Authors: Haoyi Duan, Yunzhi Zhang, Yilun Du, Jiajun Wu

MultiCOIN: Multi-Modal COntrollable Video INbetweening

Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creator's intent. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls that operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.

TLDR: The paper introduces MultiCOIN, a multi-modal controllable video inbetweening framework using Diffusion Transformers, enabling fine-grained control over intermediate frames via depth, motion trajectories, text prompts, and target regions.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, Nanxuan Zhao

Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing

Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength, which is then paired with the edit instruction, enabling explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model's modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength in instruction-driven editing, from subtle to strong, across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.

TLDR: This paper introduces Kontinuous Kontext, an instruction-driven image editing model that allows continuous control over the strength of the edit, trained on a synthesized dataset of image-edit-instruction-strength quadruplets.

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Rishubh Parihar, Or Patashnik, Daniil Ostashev, R. Venkatesh Babu, Daniel Cohen-Or, Kuan-Chieh Wang

FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control

We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based motion representation that encodes each point with a segmentation ID, a temporally consistent trajectory ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and more efficient inference, while maintaining robustness under unaligned conditions. To train such a unified point trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned condition. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting various applications such as motion cloning, drag-based image-to-video, motion interpolation, camera redirection, flexible action control and mesh animations.

TLDR: FlexTraj is a framework for image-to-video generation that uses a unified point-based motion representation with trajectory control, trained with an annealing strategy for robustness under unaligned conditions, supporting various applications.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Zhiyuan Zhang, Can Wang, Dongdong Chen, Jing Liao

InstructX: Towards Unified Visual Editing with MLLM Guidance

With recent advances in Multimodal Large Language Models (MLLMs) showing strong visual understanding and reasoning, interest is growing in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices. Moreover, the integration of MLLMs and diffusion models remains an open challenge in some difficult tasks, such as video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze the cooperation and distinction between images and videos in unified modeling. (1) We show that training on image data can lead to emergent video editing capabilities without explicit supervision, thereby alleviating the constraints imposed by scarce video training data. (2) By incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method can handle a broad range of image and video editing tasks and achieves state-of-the-art performance.

TLDR: InstructX is a unified framework for image and video editing using MLLMs and diffusion models, demonstrating that training on image data can induce video editing capabilities and achieve state-of-the-art performance.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Chong Mou, Qichao Sun, Yanze Wu, Pengze Zhang, Xinghui Li, Fulong Ye, Songtao Zhao, Qian He

Reinforcing Diffusion Models by Direct Group Preference Optimization

While reinforcement learning methods such as Group Relative Policy Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which utilize relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics. Code is available at https://github.com/Luo-Yihong/DGPO.
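
As a rough illustration of what "relative information of samples within groups" can look like, the sketch below standardizes rewards within one group of samples generated from the same prompt, which is the normalization commonly used by GRPO-style methods. It is not DGPO's objective, which the abstract does not spell out; the function name and the reward values are hypothetical.

```python
import statistics


def group_relative_scores(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four samples generated from the same prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
scores = group_relative_scores(rewards)

# Samples with positive scores are preferred over the group average.
best = max(range(len(rewards)), key=lambda i: scores[i])
```

Because the scores depend only on how a sample compares with its group, they are unaffected by adding a constant to every reward in the group.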

TLDR: The paper introduces Direct Group Preference Optimization (DGPO), a new reinforcement learning algorithm for diffusion models that directly learns from group-level preferences, allowing for efficient deterministic sampling and faster training compared to existing methods.

Relevance: (9/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Yihong Luo, Tianyang Hu, Jing Tang

VideoVerse: How Far is Your T2V Generator from a World Model?

The recent rapid advancement of Text-to-Video (T2V) generation technologies, which are critical to building "world models", makes the existing benchmarks increasingly insufficient to evaluate state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which is an essential capability for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that focuses on evaluating whether a T2V model can understand complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions from the perspective of dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. We then develop a human-preference-aligned, QA-based evaluation pipeline using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing in-depth analysis on how far the current T2V generators are from world models.
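
A benchmark built from binary evaluation questions can be scored as a pass rate per evaluation dimension. The sketch below shows one such aggregation in plain Python; the aggregation rule, the function name, and the dimension labels are assumptions for illustration, since the abstract does not describe how VideoVerse combines answers.

```python
from collections import defaultdict


def score_by_dimension(answers):
    """Aggregate (dimension, passed) pairs into a pass rate per dimension."""
    totals = defaultdict(lambda: [0, 0])  # dimension -> [passed, asked]
    for dimension, passed in answers:
        totals[dimension][0] += int(passed)
        totals[dimension][1] += 1
    return {dim: p / n for dim, (p, n) in totals.items()}


# Hypothetical yes/no verdicts from a vision-language model for one video.
answers = [
    ("temporal_causality", True),
    ("temporal_causality", False),
    ("world_knowledge", True),
    ("world_knowledge", True),
]
scores = score_by_dimension(answers)
```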

TLDR: The paper introduces VideoVerse, a new benchmark for Text-to-Video models that assesses their understanding of temporal causality and world knowledge, addressing limitations in existing benchmarks.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, Lei Zhang

Real-Time Motion-Controllable Autoregressive Video Diffusion

Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: https://kesenzhao.github.io/AR-Drag.github.io/.

TLDR: The paper introduces AR-Drag, a reinforcement learning-enhanced autoregressive video diffusion model for real-time, motion-controllable image-to-video generation, achieving high fidelity and precise motion alignment with reduced latency and a relatively small parameter size.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Kesen Zhao, Jiaxin Shi, Beier Zhu, Junbao Zhou, Xiaolong Shen, Yuan Zhou, Qianru Sun, Hanwang Zhang

RetouchLLM: Training-free White-box Image Retouching

Image retouching not only enhances visual quality but also serves as a means of expressing personal preferences and emotions. However, existing learning-based approaches require large-scale paired data and operate as black boxes, making the retouching process opaque and limiting their adaptability to handle diverse, user- or image-specific adjustments. In this work, we propose RetouchLLM, a training-free white-box image retouching system, which requires no training data and performs interpretable, code-based retouching directly on high-resolution images. Our framework progressively enhances the image in a manner similar to how humans perform multi-step retouching, allowing exploration of diverse adjustment paths. It comprises two main modules: a visual critic that identifies differences between the input and reference images, and a code generator that produces executable code. Experiments demonstrate that our approach generalizes well across diverse retouching styles, while natural language-based user interaction enables interpretable and controllable adjustments tailored to user intent.

TLDR: RetouchLLM is a training-free, white-box image retouching system that uses a visual critic and code generator to perform interpretable and controllable adjustments on high-resolution images, guided by natural language.

Relevance: (7/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Moon Ye-Bin, Roy Miles, Tae-Hyun Oh, Ismail Elezi, Jiankang Deng

CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving

Generative models have been widely applied to world modeling for environment simulation and future state prediction. With advancements in autonomous driving, there is a growing demand not only for high-fidelity video generation under various controls, but also for producing diverse and meaningful information such as depth estimation. To address this, we propose CVD-STORM, a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE) that generates long-term, multi-view videos with 4D reconstruction capabilities under various control inputs. Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Subsequently, we integrate this VAE into the video diffusion process to significantly improve generation quality. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics. Additionally, the jointly-trained Gaussian Splatting Decoder effectively reconstructs dynamic scenes, providing valuable geometric information for comprehensive scene understanding.

TLDR: The paper introduces CVD-STORM, a cross-view video diffusion model for autonomous driving that uses a spatial-temporal reconstruction VAE to generate long-term, multi-view videos with 4D reconstruction capabilities, achieving improved FID and FVD scores.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Tianrui Zhang, Yichen Liu, Zilin Guo, Yuxin Guo, Jingcheng Ni, Chenjing Ding, Dan Xu, Lewei Lu, Zehuan Wu

TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than intervening directly on latents or attention per sample, as in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete. Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and VBench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.

TLDR: The paper introduces TTOM, a training-free framework that aligns Video Foundation Model outputs with spatiotemporal layouts during inference using optimization and memorization techniques, achieving better text-image alignment in compositional video generation tasks.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Leigang Qu, Ziyang Wang, Na Zheng, Wenjie Wang, Liqiang Nie, Tat-Seng Chua

MONKEY: Masking ON KEY-Value Activation Adapter for Personalization

Personalizing diffusion models allows users to generate new images that incorporate a given subject, allowing more control than a text prompt. These models often fall short, however, when they simply recreate the subject image and ignore the text prompt. We observe that one popular method for personalization, IP-Adapter, automatically generates masks that definitively segment the subject from the background during inference. We propose to use this automatically generated mask on a second pass to mask the image tokens, thus restricting them to the subject, not the background, allowing the text prompt to attend to the rest of the image. For text prompts describing locations and places, this produces images that accurately depict the subject while definitively matching the prompt. We compare our method to a few other test-time personalization methods, and find our method displays high prompt and source image alignment.

TLDR: The paper introduces MONKEY, a method that improves personalized image generation in diffusion models by masking image tokens based on automatically generated subject masks, enhancing prompt alignment while preserving subject fidelity.

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: James Baker

Rectified-CFG++ for Flow Based Models

Classifier-free guidance (CFG) is the workhorse for steering large diffusion models toward text-conditioned targets, yet its native application to rectified flow (RF) based models provokes severe off-manifold drift, yielding visual artifacts, text misalignment, and brittle behaviour. We present Rectified-CFG++, an adaptive predictor-corrector guidance that couples the deterministic efficiency of rectified flows with a geometry-aware conditioning rule. Each inference step first executes a conditional RF update that anchors the sample near the learned transport path, then applies a weighted conditional correction that interpolates between conditional and unconditional velocity fields. We prove that the resulting velocity field is marginally consistent and that its trajectories remain within a bounded tubular neighbourhood of the data manifold, ensuring stability across a wide range of guidance strengths. Extensive experiments on large-scale text-to-image models (Flux, Stable Diffusion 3/3.5, Lumina) show that Rectified-CFG++ consistently outperforms standard CFG on benchmark datasets such as MS-COCO, LAION-Aesthetic, and T2I-CompBench. Project page: https://rectified-cfgpp.github.io/
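As a rough illustration of the two-stage step described above, the sketch below contrasts standard CFG on velocity fields with a predictor-corrector update. It is one possible reading of the abstract, not the paper's exact rule: the half-step predictor, the correction weight `alpha`, and the callables `v_cond_fn` and `v_uncond_fn` are all assumptions made for the example.

```python
def cfg_velocity(v_cond, v_uncond, w):
    """Standard CFG: extrapolate from the unconditional toward the conditional velocity."""
    return [u + w * (c - u) for c, u in zip(v_cond, v_uncond)]


def predictor_corrector_step(x, t, dt, v_cond_fn, v_uncond_fn, alpha):
    """One guided Euler step: conditional predictor, then a weighted correction.

    v_cond_fn and v_uncond_fn are hypothetical callables (x, t) -> velocity list.
    alpha is an assumed correction weight.
    """
    # Predictor: half step along the conditional velocity only.
    v_c = v_cond_fn(x, t)
    x_mid = [xi + 0.5 * dt * vi for xi, vi in zip(x, v_c)]
    t_mid = t + 0.5 * dt
    # Corrector: add the conditional-minus-unconditional gap measured at the predicted point.
    gap = [c - u for c, u in zip(v_cond_fn(x_mid, t_mid), v_uncond_fn(x_mid, t_mid))]
    v_hat = [vi + alpha * gi for vi, gi in zip(v_c, gap)]
    return [xi + dt * vi for xi, vi in zip(x, v_hat)]


def v_cond(x, t):
    """Toy constant conditional velocity field, for illustration only."""
    return [1.0 for _ in x]


def v_uncond(x, t):
    """Toy constant unconditional velocity field, for illustration only."""
    return [0.5 for _ in x]


x_next = predictor_corrector_step([0.0, 0.0], 0.0, 0.1, v_cond, v_uncond, alpha=2.0)
```

Setting `alpha` to zero reduces the step to a plain conditional Euler update, so in this sketch the guidance acts only as a correction on top of the conditional path.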

TLDR: The paper introduces Rectified-CFG++, an improved classifier-free guidance method for rectified flow models that addresses off-manifold drift and enhances text-to-image generation quality, demonstrating superior performance on large-scale models.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Shreshth Saini, Shashank Gupta, Alan C. Bovik

PickStyle: Video-to-Video Style Transfer with Context-Style Adapters

We address the task of video style transfer with diffusion models, where the goal is to preserve the context of an input video while rendering it in a target style specified by a text prompt. A major challenge is the lack of paired video data for supervision. We propose PickStyle, a video-to-video style transfer framework that augments pretrained video diffusion backbones with style adapters and benefits from paired still image data with source-style correspondences for training. PickStyle inserts low-rank adapters into the self-attention layers of conditioning modules, enabling efficient specialization for motion-style transfer while maintaining strong alignment between video content and style. To bridge the gap between static image supervision and dynamic video, we construct synthetic training clips from paired images by applying shared augmentations that simulate camera motion, ensuring temporal priors are preserved. In addition, we introduce Context-Style Classifier-Free Guidance (CS-CFG), a novel factorization of classifier-free guidance into independent text (style) and video (context) directions. CS-CFG ensures that context is preserved in generated video while the style is effectively transferred. Experiments across benchmarks show that our approach achieves temporally coherent, style-faithful, and content-preserving video translations, outperforming existing baselines both qualitatively and quantitatively.
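To make the idea of separate context and style directions concrete, here is a minimal sketch of two-direction guidance in the form commonly used for two-condition diffusion models. The exact CS-CFG formulation may differ; the three predictions, the weights `w_context` and `w_style`, and the function name are assumptions for illustration.

```python
def two_direction_guidance(pred_none, pred_context, pred_both, w_context, w_style):
    """Combine three model predictions into one guided prediction.

    pred_none: prediction with neither video context nor style text (assumed available).
    pred_context: prediction with video context only.
    pred_both: prediction with video context and style text.
    """
    guided = []
    for n, c, b in zip(pred_none, pred_context, pred_both):
        context_dir = c - n  # direction contributed by the video context
        style_dir = b - c    # direction contributed by the style prompt
        guided.append(n + w_context * context_dir + w_style * style_dir)
    return guided


guided = two_direction_guidance([0.0], [1.0], [3.0], w_context=2.0, w_style=1.5)
```

Because the two directions are scaled independently, context preservation and style strength can be tuned separately; with both weights set to 1 the result equals the fully conditioned prediction.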

TLDR: PickStyle proposes a video style transfer framework using diffusion models and style adapters trained on paired image data with synthetic motion augmentations and a novel Context-Style Classifier-Free Guidance to achieve temporally coherent and style-accurate video translations.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Soroush Mehraban, Vida Adeli, Jacob Rommann, Babak Taati, Kyryl Truskovskyi

Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models

Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in a target modality? We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the data-generating process than unimodal training. Empirically, we show that using unpaired data from auxiliary modalities -- such as text, audio, or images -- consistently improves downstream performance across diverse unimodal targets such as image and audio. Our project page: https://unpaired-multimodal.github.io/

TLDR: The paper introduces a new training paradigm, UML, that leverages unpaired multimodal data to improve unimodal representation learning by sharing parameters across modalities, showing improved performance in image and audio tasks.

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Sharut Gupta, Shobhita Sundaram, Chenyu Wang, Stefanie Jegelka, Phillip Isola

One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting

Text-guided image inpainting aims to reconstruct masked regions according to text prompts. The longstanding challenges are preserving the unmasked regions while achieving semantic consistency between the unmasked and the inpainted masked regions. Prior work has failed to address both at once, typically remedying only one of them. As we observe, this stems from the entanglement of hybrid (e.g., mid-and-low) frequency bands that encode different image properties and exhibit different robustness to text prompts during denoising. In this paper, we propose a null-text-null frequency-aware diffusion model, dubbed NTN-Diff, for text-guided image inpainting. It decomposes semantic consistency across masked and unmasked regions into per-frequency-band consistencies while preserving the unmasked regions, addressing both challenges together. Based on the diffusion process, we further divide denoising into early (high-level noise) and late (low-level noise) stages, in which the mid-and-low frequency bands are disentangled. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during the text-guided denoising process. It meanwhile serves as guidance for the null-text denoising process, which denoises the low-frequency band for the masked regions. A subsequent text-guided denoising process at the late stage then achieves semantic consistency for the mid-and-low frequency bands across masked and unmasked regions while preserving the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over state-of-the-art diffusion models for text-guided image inpainting. Our code can be accessed from https://github.com/htyjers/NTN-Diff.
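The notion of separating frequency bands can be shown on a toy 1D signal. The sketch below is a generic discrete Fourier band split, not the NTN-Diff procedure: the method operates on images inside a diffusion process, and the cut-offs `low_cut` and `mid_cut` are arbitrary values chosen for the example.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform (O(n^2)), adequate for a toy example."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def split_bands(signal, low_cut, mid_cut):
    """Split a 1D signal into low, mid, and high bands that sum back to the input."""
    n = len(signal)
    spectrum = dft(signal)
    bands = ([0j] * n, [0j] * n, [0j] * n)
    for k, coeff in enumerate(spectrum):
        freq = min(k, n - k)  # fold negative frequencies onto positive ones
        idx = 0 if freq <= low_cut else (1 if freq <= mid_cut else 2)
        bands[idx][k] = coeff
    return [idft(band) for band in bands]


signal = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 2.0, 1.0]
low, mid, high = split_bands(signal, low_cut=1, mid_cut=2)
```

Because the three masks partition the spectrum, the bands add back to the original signal, which is what allows each band to be treated separately and then recombined.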

TLDR: The paper introduces NTN-Diff, a novel diffusion model for text-guided image inpainting that disentangles frequency bands during denoising to balance reconstruction quality and semantic consistency between masked and unmasked regions. It claims to outperform existing diffusion models in this task.

Relevance: (8/10)

Novelty: (8/10)

Clarity: (7/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Haipeng Liu, Yang Wang, Meng Wang

InstructUDrag: Joint Text Instructions and Object Dragging for Interactive Image Editing

Text-to-image diffusion models have shown great potential for image editing, with techniques such as text-based and object-dragging methods emerging as key approaches. However, each of these methods has inherent limitations: text-based methods struggle with precise object positioning, while object dragging methods are confined to static relocation. To address these issues, we propose InstructUDrag, a diffusion-based framework that combines text instructions with object dragging, enabling simultaneous object dragging and text-based image editing. Our framework treats object dragging as an image reconstruction process, divided into two synergistic branches. The moving-reconstruction branch utilizes energy-based gradient guidance to move objects accurately, refining cross-attention maps to enhance relocation precision. The text-driven editing branch shares gradient signals with the reconstruction branch, ensuring consistent transformations and allowing fine-grained control over object attributes. We also employ DDPM inversion and inject prior information into noise maps to preserve the structure of moved objects. Extensive experiments demonstrate that InstructUDrag facilitates flexible, high-fidelity image editing, offering both precision in object relocation and semantic control over image content.

TLDR: The paper introduces InstructUDrag, a diffusion-based image editing framework combining text instructions and object dragging for precise object relocation and semantic control of image content.

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Haoran Yu, Yi Shi

NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

Compositional training has been the de-facto paradigm in existing Multimodal Large Language Models (MLLMs), where pre-trained vision encoders are connected with pre-trained LLMs through continuous multimodal pre-training. However, the multimodal scaling property of this paradigm remains difficult to explore due to the separated training. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study its design space and scaling property under a practical setting, i.e., data constraint. Through careful study of various choices in MLLM, we obtain the optimal meta-architecture that best balances performance and training cost. After that, we further explore the scaling properties of the native MLLM and indicate the positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Besides that, our findings and results provide in-depth insights for the future study of native MLLMs.

TLDR: This paper explores native, end-to-end training of Multimodal Large Language Models (MLLMs) under data constraints, identifying an optimal meta-architecture and scaling relationships to achieve competitive performance on multimodal benchmarks.

Relevance: (5/10)

Novelty: (7/10)

Clarity: (8/10)

Potential Impact: (6/10)

Overall: (6/10)

Read Paper (PDF)

Authors: Changyao Tian, Hao Li, Gen Luo, Xizhou Zhu, Weijie Su, Hanming Deng, Jinguo Zhu, Jie Shao, Ziran Zhu, Yunpeng Liu, Lewei Lu, Wenhai Wang, Hongsheng Li, Jifeng Dai

RePainter: Empowering E-commerce Object Removal via Spatial-matting Reinforcement Learning

In web data, product images are central to boosting user engagement and advertising efficacy on e-commerce platforms, yet intrusive elements such as watermarks and promotional text remain major obstacles to delivering clear and appealing product visuals. Although diffusion-based inpainting methods have advanced, they still face challenges in commercial settings due to unreliable object removal and limited domain-specific adaptation. To tackle these challenges, we propose RePainter, a reinforcement learning framework that integrates spatial-matting trajectory refinement with Group Relative Policy Optimization (GRPO). Our approach modulates attention mechanisms to emphasize background context, generating higher-reward samples and reducing unwanted object insertion. We also introduce a composite reward mechanism that balances global, local, and semantic constraints, effectively reducing visual artifacts and reward hacking. Additionally, we contribute EcomPaint-100K, a high-quality, large-scale e-commerce inpainting dataset, and a standardized benchmark, EcomPaint-Bench, for fair evaluation. Extensive experiments demonstrate that RePainter significantly outperforms state-of-the-art methods, especially in challenging scenes with intricate compositions. We will release our code and weights upon acceptance.
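GRPO scores each sampled output relative to the other samples drawn for the same input, which removes the need for a learned value function. The sketch below shows that group-relative normalization in isolation; the population standard deviation and the `eps` stabilizer are assumptions, and the rewards are made-up numbers rather than values from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; the choice is an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three inpainting samples generated from the same input.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Samples that beat the group average receive positive advantages and are reinforced, while below-average samples are discouraged.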

TLDR: The paper introduces RePainter, a reinforcement learning framework for removing unwanted objects like watermarks from e-commerce product images, along with a new dataset and benchmark for this task.

Relevance: (5/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (6/10)

Read Paper (PDF)

Authors: Zipeng Guo, Lichen Ma, Xiaolong Fu, Gaojing Zhou, Lan Yang, Yuchen Zhou, Linkai Liu, Yu He, Ximan Liu, Shiping Dong, Jingling Fu, Zhen Chen, Yu Shi, Junshi Huang, Jason Li, Chao Gou

ComGS: Efficient 3D Object-Scene Composition via Surface Octahedral Probes

Gaussian Splatting (GS) enables immersive rendering, but realistic 3D object-scene composition remains challenging. Baked appearance and shadow information in GS radiance fields cause inconsistencies when combining objects and scenes. Addressing this requires relightable object reconstruction and scene lighting estimation. For relightable object reconstruction, existing Gaussian-based inverse rendering methods often rely on ray tracing, leading to low efficiency. We introduce Surface Octahedral Probes (SOPs), which store lighting and occlusion information and allow efficient 3D querying via interpolation, avoiding expensive ray tracing. SOPs provide at least a 2x speedup in reconstruction and enable real-time shadow computation in Gaussian scenes. For lighting estimation, existing Gaussian-based inverse rendering methods struggle to model intricate light transport and often fail in complex scenes, while learning-based methods predict lighting from a single image and are viewpoint-sensitive. We observe that 3D object-scene composition primarily concerns the object's appearance and nearby shadows. Thus, we simplify the challenging task of full scene lighting estimation by focusing on the environment lighting at the object's placement. Specifically, we capture a 360-degree reconstructed radiance field of the scene at the location and fine-tune a diffusion model to complete the lighting. Building on these advances, we propose ComGS, a novel 3D object-scene composition framework. Our method achieves high-quality, real-time rendering at around 28 FPS, produces visually harmonious results with vivid shadows, and requires only 36 seconds for editing. Code and dataset are available at https://nju-3dv.github.io/projects/ComGS/.
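Assuming the "octahedral" in SOPs refers to the standard octahedral mapping, which flattens unit directions onto a square so that directional data can be stored in a 2D grid, the mapping can be sketched as follows. How the paper lays out and interpolates its probes is not stated here, so this shows only the generic encode and decode step.

```python
import math


def _sign(v):
    """Sign that treats zero as positive, as the octahedral fold requires."""
    return 1.0 if v >= 0.0 else -1.0


def oct_encode(x, y, z):
    """Map a nonzero direction to a point in the square [-1, 1] x [-1, 1]."""
    s = abs(x) + abs(y) + abs(z)
    px, py = x / s, y / s
    if z < 0.0:
        # Fold the lower hemisphere outward over the diagonals.
        px, py = (1.0 - abs(py)) * _sign(px), (1.0 - abs(px)) * _sign(py)
    return px, py


def oct_decode(px, py):
    """Invert oct_encode, returning a unit direction."""
    z = 1.0 - abs(px) - abs(py)
    x, y = px, py
    if z < 0.0:
        x, y = (1.0 - abs(py)) * _sign(px), (1.0 - abs(px)) * _sign(py)
    norm = math.sqrt(x * x + y * y + z * z)
    return x / norm, y / norm, z / norm


norm = math.sqrt(0.3 ** 2 + 0.5 ** 2 + 0.8 ** 2)
direction = (0.3 / norm, -0.5 / norm, -0.8 / norm)
recovered = oct_decode(*oct_encode(*direction))
```

Storing values on this square lets a lookup for an arbitrary direction be answered by interpolating neighbouring grid cells, which is the kind of query the abstract contrasts with ray tracing.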

TLDR: The paper introduces ComGS, a framework for efficient 3D object-scene composition using Surface Octahedral Probes (SOPs) for relightable object reconstruction and diffusion models for lighting estimation, enabling real-time rendering with realistic shadows.

Relevance: (3/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (6/10)

Overall: (5/10)

Read Paper (PDF)

Authors: Jian Gao, Mengqi Yuan, Yifei Zeng, Chang Zeng, Zhihao Li, Zhenyu Chen, Weichao Qiu, Xiao-Xiao Long, Hao Zhu, Xun Cao, Yao Yao

Hierarchical Spatial Algorithms for High-Resolution Image Quantization and Feature Extraction

This study introduces a modular framework for spatial image processing, integrating grayscale quantization, color and brightness enhancement, image sharpening, bidirectional transformation pipelines, and geometric feature extraction. A stepwise intensity transformation quantizes grayscale images into eight discrete levels, producing a posterization effect that simplifies representation while preserving structural detail. Color enhancement is achieved via histogram equalization in both RGB and YCrCb color spaces, with the latter improving contrast while maintaining chrominance fidelity. Brightness adjustment is implemented through HSV value-channel manipulation, and image sharpening is performed using a 3 × 3 convolution kernel to enhance high-frequency details. A bidirectional transformation pipeline that integrates unsharp masking, gamma correction, and noise amplification achieved accuracy levels of 76.10% and 74.80% for the forward and reverse processes, respectively. Geometric feature extraction employed Canny edge detection, Hough-based line estimation (e.g., 51.50° for billiard cue alignment), Harris corner detection, and morphological window localization. Cue isolation further yielded 81.87% similarity against ground truth images. Experimental evaluation across diverse datasets demonstrates robust and deterministic performance, highlighting its potential for real-time image analysis and computer vision.
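The eight-level quantization step can be sketched in a few lines. The version below assumes 8-bit input and equal-width bins that map each pixel to the lower edge of its bin; the paper's actual thresholds and output values are not given in the abstract.

```python
def quantize_gray(pixels, levels=8):
    """Posterize an 8-bit grayscale image (list of rows) into `levels` equal-width bins."""
    step = 256 // levels  # bin width: 32 for eight levels
    return [[(p // step) * step for p in row] for row in pixels]


image = [[0, 31, 32, 255],
         [100, 128, 200, 64]]
posterized = quantize_gray(image)
```

Every output value is a multiple of the bin width, so the result contains at most eight distinct intensities, which produces the posterization effect described above.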

TLDR: The paper presents a modular image processing framework encompassing quantization, enhancement, sharpening, bidirectional transformations, and geometric feature extraction, demonstrating robust performance across diverse datasets. It focuses on traditional image processing techniques rather than generation.

Relevance: (2/10)

Novelty: (4/10)

Clarity: (8/10)

Potential Impact: (5/10)

Overall: (3/10)

Read Paper (PDF)

Authors: Noor Islam S. Mohammad
