AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

March 04, 2026

DREAM: Where Visual Understanding Meets Text-to-Image Generation

Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives while learning strong visual representations. DREAM is built on two key techniques. During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, DREAM employs Semantically Aligned Decoding to align partially masked image candidates with the target text and select the best one for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (+6.2% over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, enabling unified multimodal models that excel at both visual understanding and generation.
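
To make the Masking Warmup idea concrete, here is a minimal sketch of a progressive masking schedule in PyTorch. The cosine ramp shape, the warmup length, and the start/end ratios are illustrative assumptions, not the paper's exact settings.

```python
import math
import torch

def mask_ratio(step: int, warmup_steps: int = 10_000,
               start: float = 0.0, end: float = 1.0) -> float:
    """Cosine ramp from minimal masking (contrastive phase) toward full masking."""
    t = min(step / warmup_steps, 1.0)
    return start + (end - start) * 0.5 * (1.0 - math.cos(math.pi * t))

def random_mask(tokens: torch.Tensor, ratio: float):
    """Randomly mask a fraction of image tokens; returns visible tokens and the mask."""
    b, n, d = tokens.shape
    num_mask = int(n * ratio)
    noise = torch.rand(b, n, device=tokens.device)
    ranks = noise.argsort(dim=1).argsort(dim=1)       # per-token rank of its noise value
    mask = ranks < num_mask                           # True = masked out (exactly num_mask per row)
    visible = tokens[~mask].reshape(b, n - num_mask, d)
    return visible, mask

# Early in training almost all tokens stay visible; later nearly all are masked.
tokens = torch.randn(2, 256, 768)
for step in (0, 5_000, 10_000):
    visible, _ = random_mask(tokens, mask_ratio(step))
    print(step, visible.shape)
```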

TLDR: DREAM is a unified framework for visual representation learning and text-to-image generation that jointly optimizes discriminative and generative objectives, achieving state-of-the-art results on ImageNet linear-probing accuracy and FID score.

Relevance: (10/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (9/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Chao Li, Tianhong Li, Sai Vidyaranya Nuthalapati, Hong-You Chen, Satya Narayan Shukla, Yonghuan Yang, Jun Xiao, Xiangjun Fan, Aashu Singh, Dina Katabi, Shlok Kumar Mishra

TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration

Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps. As the interval between steps increases, polynomial-based extrapolators like TaylorSeer suffer from error accumulation and trajectory drift. Meanwhile, conventional caching strategies often overlook the distinct dynamical properties of different denoising phases. To address these challenges, we propose Trajectory-Consistent Padé approximation (TC-Padé), a feature prediction framework grounded in Padé approximation. By modeling feature evolution through rational functions, our approach captures asymptotic and transitional behaviors more accurately than Taylor-based methods. To enable stable and trajectory-consistent sampling under reduced step counts, TC-Padé incorporates (1) adaptive coefficient modulation that leverages historical cached residuals to detect subtle trajectory transitions, and (2) step-aware prediction strategies tailored to the distinct dynamics of early, mid, and late sampling stages. Extensive experiments on DiT-XL/2, FLUX.1-dev, and Wan2.1 across both image and video generation demonstrate the effectiveness of TC-Padé. For instance, TC-Padé achieves 2.88x acceleration on FLUX.1-dev and 1.72x on Wan2.1 while maintaining high quality across FID, CLIP, Aesthetic, and VBench-2.0 metrics, substantially outperforming existing feature caching methods.
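
As a rough illustration of how a rational (Padé-type) extrapolator differs from a Taylor one, the sketch below fits a [1/1] Padé approximant elementwise to three cached feature snapshots and evaluates it at the next timestep. The choice of three cache points and the direct linear solve are illustrative assumptions; the paper's adaptive coefficient modulation and step-aware strategies are not reproduced here.

```python
import torch

def pade_11_extrapolate(ts: torch.Tensor, feats: torch.Tensor, t_new: float) -> torch.Tensor:
    """Fit f(t) ~ (a0 + a1*t) / (1 + b1*t) elementwise from three cached features
    feats[k] observed at timesteps ts[k], then evaluate at t_new."""
    f = feats.reshape(3, -1)                              # (3, D) flattened features
    t = ts.reshape(3, 1)                                  # (3, 1)
    # Linear system per element: a0 + a1*t_k - b1*t_k*f_k = f_k
    A = torch.stack([torch.ones_like(f), t.expand_as(f), -t * f], dim=-1)  # (3, D, 3)
    A = A.permute(1, 0, 2)                                # (D, 3, 3)
    rhs = f.T.unsqueeze(-1)                               # (D, 3, 1)
    coeff = torch.linalg.solve(A, rhs).squeeze(-1)        # (D, 3) -> a0, a1, b1
    a0, a1, b1 = coeff.unbind(dim=-1)
    pred = (a0 + a1 * t_new) / (1.0 + b1 * t_new)
    return pred.reshape(feats.shape[1:])

# Example with dummy cached transformer features at three earlier timesteps.
ts = torch.tensor([0.9, 0.8, 0.7])
feats = torch.randn(3, 16, 64)                            # cached activations
pred = pade_11_extrapolate(ts, feats, t_new=0.6)
print(pred.shape)
```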

TLDR: The paper introduces TC-Padé, a novel feature prediction framework for accelerating diffusion models, particularly in low-step regimes, by using Padé approximation with adaptive coefficient modulation and step-aware prediction strategies. It achieves significant speedups while maintaining generation quality in image and video generation tasks.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Benlei Cui, Shaoxuan He, Bukun Huang, Zhizeng Ye, Yunyun Sun, Longtao Huang, Hui Xue, Yang Yang, Jingqun Tang, Zhou Zhao, Haiwen Hong

Toward Early Quality Assessment of Text-to-Image Diffusion Models

Recent text-to-image (T2I) diffusion and flow-matching models can produce highly realistic images from natural language prompts. In practical scenarios, T2I systems are often run in a "generate-then-select" mode: many seeds are sampled and only a few images are kept for use. However, this pipeline is highly resource-intensive since each candidate requires tens to hundreds of denoising steps, and evaluation metrics such as CLIPScore and ImageReward are post-hoc. In this work, we address this inefficiency by introducing Probe-Select, a plug-in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure (object layout and spatial arrangement) that strongly correlates with final image fidelity. Probe-Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow-matching backbones, our experiments show that early evaluation at only 20% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at https://github.com/Guhuary/ProbeSelect.
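
A minimal sketch of the early-ranking idea, assuming a small learned probe over pooled intermediate activations; the probe architecture, mean pooling, and candidate counts are illustrative stand-ins rather than the paper's exact design (see the released code for the real implementation).

```python
import torch
import torch.nn as nn

class QualityProbe(nn.Module):
    """Predicts a scalar quality score from mean-pooled intermediate activations."""
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, 256), nn.GELU(), nn.Linear(256, 1))

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        pooled = activations.mean(dim=1)              # (B, tokens, dim) -> (B, dim)
        return self.head(pooled).squeeze(-1)          # (B,) predicted quality scores

def select_seeds(probe: QualityProbe, early_acts: torch.Tensor, keep: int) -> torch.Tensor:
    """Rank candidate seeds by predicted quality and keep the best `keep` of them."""
    with torch.no_grad():
        scores = probe(early_acts)
    return scores.topk(keep).indices

# Example: 16 candidate seeds, activations captured at ~20% of the trajectory.
early_acts = torch.randn(16, 64, 1024)                # (seeds, tokens, dim)
probe = QualityProbe(dim=1024)
kept = select_seeds(probe, early_acts, keep=4)
print(kept)                                           # indices of seeds to continue sampling
```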

TLDR: The paper introduces Probe-Select, a method to efficiently evaluate and terminate unpromising text-to-image diffusion model samples early in the generation process by predicting final image quality from early denoising activations, resulting in significant cost savings and improved output quality.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Huanlei Guo, Hongxin Wei, Bingyi Jing

NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing

Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and constitutes a critical bottleneck, especially for local video editing data. Existing workarounds transfer image editing to video through global motion control for pair-free video editing, but such designs struggle with background and temporal consistency. In this paper, we propose NOVA: Sparse Control & Dense Synthesis, a new framework for unpaired video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Our extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.
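
The degradation-simulation idea can be sketched as follows: the clean clip is the training target while an artificially degraded copy plays the role of the input, so no real edited/unedited pairs are needed. The specific degradations (downsample-upsample plus noise) and the clip shape are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def degrade_clip(frames: torch.Tensor, scale: int = 4, noise_std: float = 0.05) -> torch.Tensor:
    """Degrade a clip (T, C, H, W): downsample, upsample back, then add noise."""
    t, c, h, w = frames.shape
    low = F.interpolate(frames, scale_factor=1.0 / scale, mode="bilinear", align_corners=False)
    rec = F.interpolate(low, size=(h, w), mode="bilinear", align_corners=False)
    return rec + noise_std * torch.randn_like(rec)

def make_training_pair(frames: torch.Tensor):
    """Return (degraded input, clean target); edited keyframes can be injected separately."""
    return degrade_clip(frames), frames

clip = torch.rand(16, 3, 256, 256)                    # 16-frame clip in [0, 1]
degraded, target = make_training_pair(clip)
print(degraded.shape, target.shape)
```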

TLDR: NOVA is a new unpaired video editing framework that uses sparse semantic guidance from user-edited keyframes and dense motion/texture information from the original video, trained with a degradation-simulation strategy to eliminate the need for paired data.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Tianlin Pan, Jiayi Dai, Chenpu Yuan, Zhengyao Lv, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu, Caifeng Shan, Chenyang Si

From "What" to "How": Constrained Reasoning for Autoregressive Image Generation

Autoregressive image generation has seen recent improvements with the introduction of chain-of-thought and reinforcement learning. However, current methods merely specify "What" details to depict by rewriting the input prompt, yet fundamentally fail to reason about "How" to structure the overall image. This inherent limitation gives rise to persistent issues, such as spatial ambiguity that directly causes unrealistic object overlaps. To bridge this gap, we propose CoR-Painter, a novel framework that pioneers a "How-to-What" paradigm by introducing Constrained Reasoning to guide the autoregressive generation. Specifically, it first deduces "How to draw" by deriving a set of visual constraints from the input prompt, which explicitly govern spatial relationships, key attributes, and compositional rules. These constraints then steer the generation of a detailed description of "What to draw", providing a structurally sound and coherent basis for accurate visual synthesis. Additionally, we introduce a Dual-Objective GRPO strategy that specifically optimizes the textual constrained reasoning and visual projection processes to ensure the coherence and quality of the entire generation pipeline. Extensive experiments on T2I-CompBench, GenEval, and WISE demonstrate that our method achieves state-of-the-art performance, with significant improvements in spatial metrics (e.g., +5.41% on T2I-CompBench).
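
A toy sketch of the two-stage "How-to-What" ordering: constraints are derived first and then condition the detailed rewrite. The `llm` callable and the prompt wording are hypothetical stand-ins; the paper additionally optimizes this reasoning with its Dual-Objective GRPO strategy, which is not shown.

```python
from typing import Callable

def constrained_rewrite(prompt: str, llm: Callable[[str], str]) -> str:
    """Two-stage rewrite: derive 'How' constraints first, then expand into 'What'."""
    # Stage 1 ("How to draw"): explicit spatial, attribute, and composition constraints.
    constraints = llm(
        "List spatial relationships, key attributes, and compositional rules "
        f"(e.g. no object overlap) for this prompt: {prompt}"
    )
    # Stage 2 ("What to draw"): a detailed description consistent with the constraints.
    return llm(
        f"Constraints:\n{constraints}\n"
        f"Write a detailed image description for: {prompt}"
    )

# Example with a dummy LLM that simply echoes a truncated input.
print(constrained_rewrite("a cat sitting to the left of a dog", lambda p: p[:80]))
```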

TLDR: The paper introduces CoR-Painter, a novel autoregressive image generation framework that uses constrained reasoning to guide image generation based on spatial relationships and compositional rules, achieving state-of-the-art performance.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Ruxue Yan, Xubo Liu, Wenya Guo, Zhengkun Zhang, Ying Zhang, Xiaojie Yuan

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.
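
A minimal sketch of the spatial concatenation step: the four per-agent views are tiled into one composite frame so a single video backbone can model them jointly. The 2x2 layout and tensor shapes are illustrative assumptions.

```python
import torch

def concat_four_views(views: torch.Tensor) -> torch.Tensor:
    """views: (4, T, C, H, W) ordered front, rear, left, right -> (T, C, 2H, 2W)."""
    front, rear, left, right = views
    top = torch.cat([front, rear], dim=-1)            # side by side along width
    bottom = torch.cat([left, right], dim=-1)
    return torch.cat([top, bottom], dim=-2)           # stack the two rows along height

views = torch.rand(4, 49, 3, 128, 128)                # 49-frame clip per view
composite = concat_four_views(views)
print(composite.shape)                                # (49, 3, 256, 256)
```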

TLDR: ShareVerse introduces a multi-agent video generation framework with a novel dataset and architecture to achieve consistent shared world modeling, using multi-view data and cross-agent attention within a video generation model.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan

VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation

Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows, capabilities that are challenging for general models, while workflow-based agents lack the specialized knowledge needed for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work introduces four key contributions: (i) VisGenData-4k and its construction methodology, which uses a metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) the VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition of UTPC capabilities for complex creation tasks; (iii) VisGenBench, a comprehensive benchmark featuring 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities; (iv) remarkably, our VisionCreator-8B/32B models demonstrate superior performance over larger closed-source models across multiple evaluation dimensions. Overall, this work provides a foundation for future research in visual-generation agentic systems.

TLDR: The paper introduces VisionCreator, a visual-generation agentic model with integrated Understanding, Thinking, Planning, and Creation (UTPC) capabilities, trained using a novel data generation and training methodology, and demonstrating superior performance compared to larger closed-source models on visual creation tasks.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Jinxiang Lai, Zexin Lu, Jiajun He, Rongwei Quan, Wenzhe Zhao, Qinyu Yang, Qi Chen, Qin Lin, Chuyue Li, Tao Gao, Yuhao Shan, Shuai Shao, Song Guo, Qinglin Lu

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.
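
A rough sketch of the quadruplet construction described above, assuming a hypothetical `synthesize_reference` image-generation call that stands in for the paper's reference-scaffold synthesis; the field names and shapes are illustrative.

```python
from dataclasses import dataclass
from typing import Callable
import torch

@dataclass
class EditQuadruplet:
    source_video: torch.Tensor     # (T, C, H, W) original clip
    instruction: str               # natural-language edit instruction
    reference_image: torch.Tensor  # (C, H, W) synthesized visual reference
    edited_video: torch.Tensor     # (T, C, H, W) edited target clip

def build_quadruplet(source: torch.Tensor, instruction: str, edited: torch.Tensor,
                     synthesize_reference: Callable[[torch.Tensor, str], torch.Tensor]) -> EditQuadruplet:
    """Derive a reference image from the edited clip's first frame and the instruction."""
    reference = synthesize_reference(edited[0], instruction)
    return EditQuadruplet(source, instruction, reference, edited)

# Example with a dummy synthesizer that just returns the frame unchanged.
src, tgt = torch.rand(16, 3, 128, 128), torch.rand(16, 3, 128, 128)
quad = build_quadruplet(src, "replace the car with a red bicycle", tgt, lambda frame, _: frame)
print(quad.reference_image.shape)
```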

TLDR: The paper introduces Kiwi-Edit, a novel instruction-and-reference-guided video editing framework. It includes a data generation pipeline, a large-scale dataset (RefVIE), and a unified editing architecture achieving state-of-the-art results in controllable video editing.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou

GeoDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis

We introduce GeoDiT, a diffusion transformer designed for text-to-satellite image generation with point-based control. Existing controlled satellite image generative models often require pixel-level maps that are time-consuming to acquire, yet semantically limited. To address this limitation, we introduce a novel point-based conditioning framework that controls the generation process through the spatial location of the points and the textual description associated with each point, providing semantically rich control signals. This approach enables flexible, annotation-friendly, and computationally simple inference for satellite image generation. To this end, we introduce an adaptive local attention mechanism that effectively regularizes the attention scores based on the input point queries. We systematically evaluate various domain-specific design choices for training GeoDiT, including the selection of satellite image representation for alignment and geolocation representation for conditioning. Our experiments demonstrate that GeoDiT achieves impressive generation performance, surpassing the state-of-the-art remote sensing generative models.
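
One simple way to picture point-conditioned local attention is an attention mask that lets each image patch attend to a point condition only when the patch lies near that point. The grid construction and radius below are illustrative assumptions, not the paper's adaptive score regularization.

```python
import torch

def point_local_attn_mask(points_xy: torch.Tensor, grid: int, radius: float) -> torch.Tensor:
    """points_xy: (P, 2) normalized coordinates in [0, 1]. Returns a (grid*grid, P)
    boolean mask where True means the patch token may attend to that point token."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, grid), torch.linspace(0, 1, grid), indexing="ij")
    patch_xy = torch.stack([xs.flatten(), ys.flatten()], dim=-1)      # (grid*grid, 2)
    dist = torch.cdist(patch_xy, points_xy)                           # (grid*grid, P)
    return dist <= radius

points = torch.tensor([[0.25, 0.25], [0.75, 0.60]])                   # two annotated points
mask = point_local_attn_mask(points, grid=16, radius=0.2)
print(mask.shape, mask.sum(dim=0))                                    # attending patches per point
```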

TLDR: GeoDiT is a diffusion transformer for generating satellite images from text, controlled by point locations and associated text. It addresses the limitations of pixel-level map conditioning and achieves state-of-the-art performance.

Relevance: (8/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Srikumar Sastry, Dan Cher, Brian Wei, Aayush Dhakal, Subash Khanal, Dev Gupta, Nathan Jacobs

Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data: we feed it the edited images and use its output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
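
A minimal sketch of aggregating 3D-consistency signals into a scalar reward. The confidence-map and pose-error tensors are treated as already computed; producing them from VGGT is outside this sketch, and the weighting is an illustrative assumption.

```python
import torch

def consistency_reward(confidence_maps: torch.Tensor, pose_error: torch.Tensor,
                       w_conf: float = 1.0, w_pose: float = 1.0) -> torch.Tensor:
    """confidence_maps: (V, H, W) per-view confidences in [0, 1];
    pose_error: (V,) estimated pose errors for the edited views.
    Higher confidence and lower pose error yield a larger reward."""
    return w_conf * confidence_maps.mean() - w_pose * pose_error.mean()

# Example with dummy signals for 4 edited views.
conf = torch.rand(4, 64, 64)
pose_err = torch.rand(4) * 0.1
print(consistency_reward(conf, pose_err).item())
```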

TLDR: The paper introduces RL3DEdit, a reinforcement learning framework for multi-view consistent 3D scene editing that leverages the priors of a 3D foundation model for reward signals, outperforming existing methods in editing quality and efficiency.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan, Xiangxiang Chu, Yunchao Wei, Kang Liao, Guosheng Lin

AWDiff: An à trous wavelet diffusion model for lung ultrasound image synthesis

Lung ultrasound (LUS) is a safe and portable imaging modality, but the scarcity of data limits the development of machine learning methods for image interpretation and disease monitoring. Existing generative augmentation methods, such as Generative Adversarial Networks (GANs) and diffusion models, often lose subtle diagnostic cues, particularly B-lines and pleural irregularities, due to resolution reduction. We propose À trous Wavelet Diffusion (AWDiff), a diffusion-based augmentation framework that integrates the à trous wavelet transform to preserve fine-scale structures while avoiding destructive downsampling. In addition, semantic conditioning with BioMedCLIP, a vision-language foundation model trained on large-scale biomedical corpora, enforces alignment with clinically meaningful labels. On a LUS dataset, AWDiff achieved lower distortion and higher perceptual quality compared to existing methods, demonstrating both structural fidelity and clinical diversity.
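
For reference, a minimal sketch of the à trous (undecimated) wavelet decomposition the framework builds on: each level smooths with an increasingly dilated kernel and keeps the difference as a detail plane, so no downsampling occurs. The B3-spline kernel and three levels are standard choices used here as illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def atrous_decompose(img: torch.Tensor, levels: int = 3):
    """img: (B, C, H, W). Returns [detail_1, ..., detail_levels, coarse_residual]."""
    k1d = torch.tensor([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0      # B3-spline filter
    kernel = torch.outer(k1d, k1d).view(1, 1, 5, 5)
    b, c, _, _ = img.shape
    weight = kernel.repeat(c, 1, 1, 1)                         # depthwise smoothing filter
    planes, current = [], img
    for j in range(levels):
        d = 2 ** j                                             # dilation doubles each level
        padded = F.pad(current, (2 * d,) * 4, mode="reflect")  # same-size output
        smoothed = F.conv2d(padded, weight, dilation=d, groups=c)
        planes.append(current - smoothed)                      # detail plane at scale j
        current = smoothed
    planes.append(current)                                     # low-frequency residual
    return planes

planes = atrous_decompose(torch.rand(1, 1, 128, 128))
print([tuple(p.shape) for p in planes])                        # all planes stay full-resolution
```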

TLDR: The paper introduces AWDiff, a diffusion model incorporating the à trous wavelet transform and BioMedCLIP for lung ultrasound image synthesis, aiming to improve the preservation of fine-scale clinical features compared to existing generative augmentation methods.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Maryam Heidari, Nantheera Anantrasirichai, Steven Walker, Rahul Bhatnagar, Alin Achim

Preconditioned Score and Flow Matching

Flow matching and score-based diffusion train vector fields under intermediate distributions $p_t$, whose geometry can strongly affect their optimization. We show that the covariance $\Sigma_t$ of $p_t$ governs optimization bias: when $\Sigma_t$ is ill-conditioned, gradient-based training rapidly fits high-variance directions while systematically under-optimizing low-variance modes, so learning plateaus at suboptimal weights. We formalize this effect in analytically tractable settings and propose reversible, label-conditional preconditioning maps that reshape the geometry of $p_t$ by improving the conditioning of $\Sigma_t$ without altering the underlying generative model. Rather than accelerating early convergence, preconditioning primarily mitigates optimization stagnation by enabling continued progress along previously suppressed directions. Across MNIST latent flow matching and additional high-resolution datasets, we empirically track conditioning diagnostics and distributional metrics, showing that preconditioning consistently yields better-trained models by avoiding suboptimal plateaus.
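
As a concrete (if simplified) example of such a reversible preconditioning map, the sketch below whitens the data with $\Sigma^{-1/2}$, which directly improves the conditioning of the covariance. The eigendecomposition-based whitening and the toy 2D Gaussian are illustrative choices, not the paper's exact construction.

```python
import torch

class WhiteningPreconditioner:
    """x -> Sigma^{-1/2} (x - mu), with an exact inverse for mapping samples back."""
    def __init__(self, data: torch.Tensor, eps: float = 1e-5):
        self.mu = data.mean(dim=0)
        cov = torch.cov(data.T) + eps * torch.eye(data.shape[1])
        evals, evecs = torch.linalg.eigh(cov)
        self.w = evecs @ torch.diag(evals.rsqrt()) @ evecs.T       # Sigma^{-1/2}
        self.w_inv = evecs @ torch.diag(evals.sqrt()) @ evecs.T    # Sigma^{1/2}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x - self.mu) @ self.w.T

    def inverse(self, z: torch.Tensor) -> torch.Tensor:
        return z @ self.w_inv.T + self.mu

# Example: an ill-conditioned 2D Gaussian becomes near-isotropic after the map.
data = torch.randn(4096, 2) * torch.tensor([10.0, 0.1])
pre = WhiteningPreconditioner(data)
z = pre.forward(data)
print(torch.cov(data.T).diag(), torch.cov(z.T).diag())             # variances before/after
```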

TLDR: This paper introduces a preconditioning method for flow matching and score-based diffusion models to improve training by addressing optimization biases caused by ill-conditioned data covariance, leading to better model training and avoiding suboptimal plateaus.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Shadab Ahamed, Eshed Gal, Simon Ghyselincks, Md Shahriar Rahim Siddiqui, Moshe Eliasof, Eldad Haber