Daily papers related to Image/Video/Multimodal Generation from cs.CV
September 28, 2025
Recent advances in driving-scene generation and reconstruction have demonstrated significant potential for enhancing autonomous driving systems by producing scalable and controllable training data. Existing generation methods primarily focus on synthesizing diverse and high-fidelity driving videos; however, due to limited 3D consistency and sparse viewpoint coverage, they struggle to support convenient and high-quality novel-view synthesis (NVS). Conversely, recent 3D/4D reconstruction approaches have significantly improved NVS for real-world driving scenes, yet inherently lack generative capabilities. To overcome this dilemma between scene generation and reconstruction, we propose WorldSplat, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps: (i) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner. (ii) Subsequently, we refine the novel-view videos rendered from these Gaussians using an enhanced video diffusion model. Extensive experiments conducted on benchmark datasets demonstrate that WorldSplat effectively generates high-fidelity, temporally and spatially consistent multi-track novel-view driving videos.
TLDR: WorldSplat is a novel feed-forward framework for 4D driving-scene generation that combines a 4D-aware latent diffusion model with a video diffusion model to generate consistent multi-track novel view driving videos.
The integration of Reinforcement Learning (RL) into flow matching models for text-to-image (T2I) generation has driven substantial advances in generation quality. However, these gains often come at the cost of exhaustive exploration and inefficient sampling strategies, owing to the limited variation within each sampling group. Building on this insight, we propose Dynamic-TreeRPO, which implements the sliding-window sampling strategy as a tree-structured search with dynamic noise intensities along depth. We perform GRPO-guided optimization and constrained Stochastic Differential Equation (SDE) sampling within this tree structure. By sharing prefix paths of the tree, our design effectively amortizes the computational overhead of trajectory search. With well-designed noise intensities for each tree layer, Dynamic-TreeRPO can enhance the variation of exploration without any extra computational cost. Furthermore, we seamlessly integrate the Supervised Fine-Tuning (SFT) and RL paradigms within Dynamic-TreeRPO to construct our proposed LayerTuning-RL, reformulating the loss function of SFT as a dynamically weighted Progress Reward Model (PRM) rather than a separate pretraining method. By associating this weighted PRM with dynamic-adaptive clipping bounds, we avoid disrupting the exploration process in Dynamic-TreeRPO. Benefiting from the tree-structured sampling and the LayerTuning-RL paradigm, our model dynamically explores a diverse search space along effective directions. Compared to existing baselines, our approach demonstrates significant superiority in terms of semantic consistency, visual fidelity, and human preference alignment on established benchmarks, including HPS-v2.1, PickScore, and ImageReward. In particular, our model outperforms SoTA by 4.9%, 5.91%, and 8.66% on those benchmarks, respectively, while improving the training efficiency by nearly 50%.
TLDR: Dynamic-TreeRPO improves text-to-image generation by using a tree-structured search with dynamic noise intensities, a novel LayerTuning-RL paradigm, and shows significant performance and efficiency gains on standard benchmarks.
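The prefix-sharing argument in the Dynamic-TreeRPO abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the branching factors, the per-layer step counts, the noise values, and the 1-D "state" stand in for details the abstract does not give, and the advantage function is the generic group-relative normalization associated with GRPO rather than the paper's exact objective.

```python
import random
import statistics


def sample_tree(root, branching, noise_per_layer, rng):
    """Toy tree sampler: every node spawns `b` children at each layer.

    Children of the same node share that node's state (the shared prefix),
    and each layer perturbs its children with its own noise scale.
    """
    level = [root]
    for b, sigma in zip(branching, noise_per_layer):
        level = [x + sigma * rng.gauss(0.0, 1.0) for x in level for _ in range(b)]
    return level


def tree_cost(branching, steps_per_layer):
    """Sampler steps when each layer's segment is computed once per node."""
    nodes, cost = 1, 0
    for b, steps in zip(branching, steps_per_layer):
        nodes *= b
        cost += nodes * steps
    return cost


def independent_cost(branching, steps_per_layer):
    """Sampler steps when every leaf trajectory is computed from scratch."""
    leaves = 1
    for b in branching:
        leaves *= b
    return leaves * sum(steps_per_layer)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group (zero mean, unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    leaves = sample_tree(0.0, [2, 2, 2], [1.0, 0.5, 0.25], rng)
    print(len(leaves), tree_cost([2, 2, 2], [4, 4, 4]), independent_cost([2, 2, 2], [4, 4, 4]))
```

Under these toy settings, eight leaves cost 56 sampler steps in the tree versus 96 when sampled independently, which is the kind of amortization the abstract attributes to shared prefix paths.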
Existing image generation models face critical challenges regarding the trade-off between computation and fidelity. Specifically, models relying on a pretrained Variational Autoencoder (VAE) suffer from information loss, limited detail, and the inability to support end-to-end training. In contrast, models operating directly in the pixel space incur prohibitive computational cost. Although cascade models can mitigate computational cost, stage-wise separation prevents effective end-to-end optimization, hampers knowledge sharing, and often results in inaccurate distribution learning within each stage. To address these challenges, we introduce a unified multistage generative framework based on our proposed Conditional Dependent Coupling strategy. It decomposes the generative process into interpolant trajectories at multiple stages, ensuring accurate distribution learning while enabling end-to-end optimization. Importantly, the entire process is modeled as a single unified Diffusion Transformer, eliminating the need for disjoint modules and also enabling knowledge sharing. Extensive experiments demonstrate that our method achieves both high fidelity and efficiency across multiple resolutions.
TLDR: The paper presents a novel multistage generative framework, Stochastic Interpolants via Conditional Dependent Coupling, using a unified Diffusion Transformer for efficient and high-fidelity image generation through end-to-end optimization and knowledge sharing.
Despite their exceptional generative quality, diffusion models have limited applicability to world modeling tasks, such as novel view generation from sparse inputs. This limitation arises because diffusion models generate outputs in a non-causal manner, often leading to distortions or inconsistencies across views, and making it difficult to incrementally adapt accumulated knowledge to new queries. In contrast, autoregressive (AR) models operate in a causal fashion, generating each token based on all previously generated tokens. In this work, we introduce ARSS, a novel framework that leverages a GPT-style decoder-only AR model to generate novel views from a single image, conditioned on a predefined camera trajectory. We employ a video tokenizer to map continuous image sequences into discrete tokens and propose a camera encoder that converts camera trajectories into 3D positional guidance. Then, to enhance generation quality while preserving the autoregressive structure, we propose an autoregressive transformer module that randomly permutes the spatial order of tokens while maintaining their temporal order. Extensive qualitative and quantitative experiments on public datasets demonstrate that our method performs comparably to, or better than, state-of-the-art view synthesis approaches based on diffusion models. Our code will be released upon paper acceptance.
TLDR: The paper presents ARSS, a novel autoregressive framework using a GPT-style decoder to generate novel views from a single image conditioned on a camera trajectory, performing comparably to or better than diffusion-based view synthesis methods.
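The ARSS abstract describes randomly permuting the spatial order of tokens while keeping their temporal order. A minimal sketch of that ordering idea follows. It assumes tokens arrive as a list of frames, each a flat list of spatial token ids, and it returns the original spatial index alongside each token so a model could still be told where a token belongs. The function name and the `share_across_frames` switch are hypothetical; the abstract does not say whether one permutation is reused across frames.

```python
import random


def permute_spatial_keep_temporal(frames, rng, share_across_frames=True):
    """Shuffle token order inside each frame; never reorder the frames.

    Returns one (tokens, positions) pair per frame, where positions[i] is the
    original spatial index of tokens[i].
    """
    n = len(frames[0])
    perm = list(range(n))
    rng.shuffle(perm)
    out = []
    for frame in frames:
        if not share_across_frames:
            # Hypothetical variant: draw a fresh permutation for every frame.
            perm = list(range(n))
            rng.shuffle(perm)
        out.append(([frame[i] for i in perm], list(perm)))
    return out


if __name__ == "__main__":
    frames = [[10, 11, 12, 13], [20, 21, 22, 23]]  # two frames, four tokens each
    for tokens, positions in permute_spatial_keep_temporal(frames, random.Random(0)):
        print(tokens, positions)
```

Flattening the returned frames in order gives a sequence that stays causal in time while the within-frame order varies from one training sample to the next.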
One-step generators distilled from Masked Diffusion Models (MDMs) compress multiple sampling steps into a single forward pass, enabling efficient text and image synthesis. However, they suffer from two key limitations: they inherit modeling bias from the teacher, and their discrete token outputs block gradient flow, preventing post-distillation refinements such as adversarial training, reward-based fine-tuning, and Test-Time Embedding Optimization (TTEO). In this work, we introduce soft embeddings, a simple relaxation that replaces discrete tokens with the expected embeddings under the generator's output distribution. Soft embeddings preserve representation fidelity for one-step discrete generators while providing a fully differentiable continuous surrogate that is compatible with teacher backbones and tokenizer decoders. Integrating soft embeddings into the Di[M]O distillation framework (denoted Soft-Di[M]O) makes one-step generators end-to-end trainable and enables straightforward application of GAN-based refinement, differentiable reward fine-tuning, and TTEO. Empirically, across multiple MDM teachers (e.g., MaskBit, MaskGen), Soft-Di[M]O achieves state-of-the-art one-step results: improved class-to-image performance, a one-step FID of 1.56 on ImageNet-256 with GAN-based refinement, along with higher GenEval and HPS scores on text-to-image with reward fine-tuning, and further gains from TTEO.
TLDR: This paper introduces soft embeddings within the Di[M]O framework (Soft-Di[M]O) to improve one-step discrete image generation by enabling end-to-end training and refinement through GANs and reward fine-tuning, achieving state-of-the-art results.
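The soft-embedding relaxation in the Soft-Di[M]O abstract, replacing a discrete token with the expected embedding under the generator's output distribution, can be sketched for a single token position. The sketch assumes the distribution is a softmax over logits and uses a tiny hand-written embedding table; both are illustrative choices rather than details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(logits, embedding_table):
    """Expected embedding sum_k p_k * E[k], a smooth function of the logits."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table)) for d in range(dim)]


def hard_embedding(logits, embedding_table):
    """Embedding of the argmax token; the argmax step passes no gradient."""
    k = max(range(len(logits)), key=logits.__getitem__)
    return list(embedding_table[k])


if __name__ == "__main__":
    table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 3-token codebook
    print(soft_embedding([0.0, 0.0, 0.0], table))   # uniform: mean of the rows
    print(soft_embedding([0.0, 20.0, 0.0], table))  # peaked: close to row 1
    print(hard_embedding([0.0, 20.0, 0.0], table))
```

As the distribution sharpens, the soft embedding approaches the hard one, which is consistent with the abstract's claim that the relaxation preserves representation fidelity while staying differentiable.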
In recent years, event cameras have gained significant attention due to their bio-inspired properties, such as high temporal resolution and high dynamic range. However, obtaining large-scale labeled ground-truth data for event-based vision tasks remains challenging and costly. In this paper, we present ControlEvents, a diffusion-based generative model designed to synthesize high-quality event data guided by diverse control signals such as class text labels, 2D skeletons, and 3D body poses. Our key insight is to leverage the diffusion prior from foundation models, such as Stable Diffusion, enabling high-quality event data generation with minimal fine-tuning and limited labeled data. Our method streamlines the data generation process and significantly reduces the cost of producing labeled event datasets. We demonstrate the effectiveness of our approach by synthesizing event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation. Our experiments show that the synthesized labeled event data enhances model performance in all tasks. Additionally, our approach can generate events from text labels unseen during training, illustrating the powerful text-based generation capabilities inherited from foundation models.
TLDR: The paper introduces ControlEvents, a diffusion-based generative model that synthesizes event camera data controlled by text, 2D skeletons, and 3D poses, leveraging priors from foundation models like Stable Diffusion to generate high-quality labeled event data with minimal fine-tuning.
This paper investigates image inpainting with preference alignment. Instead of introducing a novel method, we go back to basics and revisit fundamental problems in achieving such alignment. We leverage the prominent direct preference optimization approach for alignment training and employ public reward models to construct preference training datasets. Experiments are conducted across nine reward models, two benchmarks, and two baseline models with varying structures and generative algorithms. Our key findings are as follows: (1) Most reward models deliver valid reward scores for constructing preference data, even if some of them are not reliable evaluators. (2) Preference data demonstrates robust trends in both candidate scaling and sample scaling across models and benchmarks. (3) Observable biases in reward models, particularly in brightness, composition, and color scheme, make them prone to reward hacking. (4) A simple ensemble of these models yields robust and generalizable results by mitigating such biases. Built upon these observations, our alignment models significantly outperform prior models across standard metrics, GPT-4 assessments, and human evaluations, without any changes to model structures or the use of new datasets. We hope our work can set a simple yet solid baseline, pushing this promising frontier. Our code is open-sourced at: https://github.com/shenytzzz/Follow-Your-Preference.
TLDR: This paper explores preference-aligned image inpainting using direct preference optimization and public reward models. It identifies biases in reward models and proposes a simple ensemble to improve performance, achieving state-of-the-art results without modifying model structures.
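Finding (4) in the inpainting abstract, that a simple ensemble of reward models mitigates individual biases, can be sketched as follows. The abstract does not say how the ensemble is formed, so this example assumes one plausible recipe: z-score each model's scores over the candidates for a prompt, average them, and take the best and worst candidates as the chosen/rejected pair for direct preference optimization. The model names and scores are made up.

```python
import statistics


def zscore(values):
    """Standardize one model's scores so that different scales are comparable."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def ensemble_scores(scores_by_model):
    """Average the per-model z-scores for each candidate."""
    normalized = [zscore(scores) for scores in scores_by_model.values()]
    n = len(normalized[0])
    return [statistics.fmean([col[i] for col in normalized]) for i in range(n)]


def preference_pair(scores_by_model):
    """Return (chosen_index, rejected_index) under the ensemble score."""
    combined = ensemble_scores(scores_by_model)
    chosen = max(range(len(combined)), key=combined.__getitem__)
    rejected = min(range(len(combined)), key=combined.__getitem__)
    return chosen, rejected


if __name__ == "__main__":
    # Hypothetical scores for three inpainting candidates from three reward models.
    scores = {
        "model_a": [0.1, 0.9, 0.5],
        "model_b": [10.0, 30.0, 20.0],
        "model_c": [0.2, 0.6, 0.7],  # alone, this model would prefer candidate 2
    }
    print(preference_pair(scores))
```

In this toy case the ensemble selects candidate 1 even though `model_c` alone would pick candidate 2, which shows how averaging can dampen a single model's bias.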
Generative AI has established the opportunity to readily transform content from one medium to another. This capability is especially powerful for storytelling, where visual illustrations can illuminate a story originally expressed in text. In this paper, we focus on the task of narrative scene illustration, which involves automatically generating an image depicting a scene in a story. Motivated by recent progress on text-to-image models, we consider a pipeline that uses LLMs as an interface for prompting text-to-image models to generate scene illustrations given raw story text. We apply variations of this pipeline to a prominent story corpus in order to synthesize illustrations for scenes in these stories. We conduct a human annotation task to obtain pairwise quality judgments for these illustrations. The outcome of this process is the SceneIllustrations dataset, which we release as a new resource for future work on cross-modal narrative transformation. Through our analysis of this dataset and experiments modeling illustration quality, we demonstrate that LLMs can effectively verbalize scene knowledge implicitly evoked by story text. Moreover, this capability is impactful for generating and evaluating illustrations.
TLDR: This paper explores using LLMs to prompt text-to-image models for narrative scene illustration, introduces the SceneIllustrations dataset, and demonstrates LLMs' effectiveness in verbalizing scene knowledge for better illustration generation and evaluation.