Daily papers related to Image/Video/Multimodal Generation from cs.CV
November 25, 2025
Group Relative Policy Optimization (GRPO) has emerged as a powerful reinforcement learning paradigm for post-training video generation models. However, existing GRPO pipelines rely on static, fixed-capacity reward models whose evaluation behavior is frozen during training. Such rigid rewards introduce distributional bias, saturate quickly as the generator improves, and ultimately limit the stability and effectiveness of reinforcement-based alignment. We propose Self-Paced GRPO, a competence-aware GRPO framework in which reward feedback co-evolves with the generator. Our method introduces a progressive reward mechanism that automatically shifts its emphasis from coarse visual fidelity to temporal coherence and fine-grained text-video semantic alignment as generation quality increases. This self-paced curriculum alleviates reward-policy mismatch, mitigates reward exploitation, and yields more stable optimization. Experiments on VBench across multiple video generation backbones demonstrate consistent improvements in both visual quality and semantic alignment over GRPO baselines with static rewards, validating the effectiveness and generality of Self-Paced GRPO.
TLDR: This paper introduces Self-Paced GRPO, a novel reinforcement learning framework for video generation where the reward model adapts to the generator's competence, progressively emphasizing different aspects of video quality. It demonstrates improved visual quality and semantic alignment compared to static reward approaches.
TLDR: 本文介绍了一种名为Self-Paced GRPO的视频生成强化学习框架,其中奖励模型会适应生成器的能力,逐步强调视频质量的不同方面。实验表明,与静态奖励方法相比,该方法在视觉质量和语义对齐方面均有改进。
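As a rough illustration of the two ideas in this abstract, the sketch below pairs the standard group-relative advantage used in GRPO with a reward whose weights shift as the generator improves. The `competence` scalar, the three reward keys, and the linear schedule are assumptions made for illustration; the abstract does not specify how Self-Paced GRPO measures competence or mixes its rewards.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def self_paced_weights(competence):
    """Hypothetical schedule: as competence goes from 0 to 1, weight moves
    from coarse visual fidelity to temporal coherence and text-video alignment."""
    c = min(max(competence, 0.0), 1.0)  # clamp to [0, 1]
    return {"fidelity": 1.0 - c, "temporal": 0.5 * c, "alignment": 0.5 * c}

def combined_reward(scores, competence):
    """Weighted sum of per-aspect scores under the current schedule."""
    weights = self_paced_weights(competence)
    return sum(weights[k] * scores[k] for k in weights)
```

Early in training (`competence` near 0) the combined reward is dominated by the fidelity score; later the same sample is judged mostly on temporal and semantic scores, which is one simple way to keep the reward from saturating.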
We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.
TLDR: HunyuanVideo 1.5 is an open-source video generation model with 8.3B parameters that achieves state-of-the-art performance and motion coherence, enabling efficient inference on consumer GPUs.
TLDR: HunyuanVideo 1.5是一个拥有83亿参数的开源视频生成模型,它实现了最先进的性能和运动连贯性,并且可以在消费级GPU上进行高效推理。
Recent progress in video generative models has enabled the creation of high-quality videos from multimodal prompts that combine text and images. While these systems offer enhanced controllability, they also introduce new safety risks, as harmful content can emerge from individual modalities or their interaction. Existing safety methods are often text-only, require prior knowledge of the risk category, or operate as post-generation auditors, struggling to proactively mitigate such compositional, multimodal risks. To address this challenge, we present ConceptGuard, a unified safeguard framework for proactively detecting and mitigating unsafe semantics in multimodal video generation. ConceptGuard operates in two stages: first, a contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space; second, a semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt's multimodal conditioning. To support the development and rigorous evaluation of this framework, we introduce two novel benchmarks: ConceptRisk, a large-scale dataset for training on multimodal risks, and T2VSafetyBench-TI2V, the first benchmark adapted from T2VSafetyBench for the Text-and-Image-to-Video (TI2V) safety setting. Comprehensive experiments on both benchmarks show that ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation.
TLDR: ConceptGuard proactively detects and mitigates safety risks in text-and-image-to-video generation using a contrastive detection module and semantic suppression, outperforming existing baselines on novel safety benchmarks.
TLDR: ConceptGuard采用对比检测模块和语义抑制,主动检测并减轻文本-图像到视频生成中的安全风险,并在新的安全基准测试中优于现有基线。
Reinforcement learning (RL) has become a powerful tool for post-training visual generative models, with Group Relative Policy Optimization (GRPO) increasingly used to align generators with human preferences. However, existing GRPO pipelines rely on a single scalar reward per sample, treating each image or video as a holistic entity and ignoring the rich spatial and temporal structure of visual content. This coarse supervision hinders the correction of localized artifacts and the modeling of fine-grained perceptual cues. We introduce Visual Preference Policy Optimization (ViPO), a GRPO variant that lifts scalar feedback into structured, pixel-level advantages. ViPO employs a Perceptual Structuring Module that uses pretrained vision backbones to construct spatially and temporally aware advantage maps, redistributing optimization pressure toward perceptually important regions while preserving the stability of standard GRPO. Across both image and video benchmarks, ViPO consistently outperforms vanilla GRPO, improving in-domain alignment with human-preference rewards and enhancing generalization on out-of-domain evaluations. The method is architecture-agnostic, lightweight, and fully compatible with existing GRPO training pipelines, providing a more expressive and informative learning signal for visual generation.
TLDR: The paper introduces Visual Preference Policy Optimization (ViPO), a reinforcement learning method for visual generation that uses pixel-level advantages derived from a perceptual structuring module to improve alignment with human preferences and generalization capabilities compared to standard GRPO.
TLDR: 该论文介绍了一种视觉偏好策略优化 (ViPO) 方法,用于视觉生成。该方法使用从感知结构模块导出的像素级优势,与标准的GRPO 相比,提高了与人类偏好的一致性和泛化能力。
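One way to picture "lifting" a scalar advantage into a pixel-level map is to spread it over pixels in proportion to a saliency weight, rescaled so the weights average to 1. This is only a sketch of the general idea: the `saliency` grid here is a hand-written stand-in for whatever ViPO's Perceptual Structuring Module derives from pretrained vision backbones, and the normalization rule is an assumption.

```python
def advantage_map(scalar_advantage, saliency):
    """Spread one scalar advantage over a 2D grid in proportion to saliency.

    Weights are rescaled to have mean 1, so the mean of the returned map
    equals the original scalar advantage.
    """
    flat = [s for row in saliency for s in row]
    total = sum(flat)
    if total <= 0:
        # No usable saliency: fall back to a uniform map (plain scalar GRPO).
        return [[scalar_advantage for _ in row] for row in saliency]
    scale = len(flat) / total
    return [[scalar_advantage * s * scale for s in row] for row in saliency]
```

Keeping the mean of the map equal to the scalar is one simple way to redistribute optimization pressure without changing the overall update magnitude.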
Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of the VAE in two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework (DeCo). To decouple the generation of high- and low-frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Code is publicly available at https://github.com/Zehong-Ma/DeCo.
TLDR: The paper introduces DeCo, a frequency-decoupled pixel diffusion framework that improves the efficiency and performance of image generation by separating the modeling of high-frequency details and low-frequency semantics, achieving state-of-the-art results among pixel diffusion models.
TLDR: 该论文介绍了DeCo,一种频率解耦的像素扩散框架,通过分离高频细节和低频语义的建模,提高了图像生成的效率和性能,并在像素扩散模型中取得了最先进的结果。
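To make the low-frequency versus high-frequency split concrete, here is a minimal 1D sketch: a moving-average filter gives the smooth component and the residual carries the detail. It only illustrates what "decoupling frequencies" means for a signal; DeCo's actual decoder, conditioning, and loss are not described in enough detail here to reproduce, and the function names are made up for this example.

```python
def low_pass(signal, radius=1):
    """Moving-average low-pass filter; the window is truncated at the edges."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out

def split_frequencies(signal, radius=1):
    """Return (low, high) so that low[i] + high[i] reconstructs signal[i]."""
    low = low_pass(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high
```

In this picture the large model would be responsible for something like `low`, while a lightweight module fills in `high` given the smooth component as guidance.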
Reinforcement learning (RL) methods (e.g., GRPO) for improving the perception ability of multimodal LLMs (MLLMs) have attracted wide research interest owing to their remarkable generalization ability. Nevertheless, existing RL methods still face the problem of low data quality, where data samples cannot elicit diverse responses from MLLMs, thus restricting the exploration scope for MLLM reinforcement learning. Some methods attempt to mitigate this problem by imposing constraints on entropy, but none address it at its root. To tackle this problem, this work proposes Syn-GRPO (Synthesis-GRPO), which employs an online data generator to synthesize high-quality training data with diverse responses in GRPO training. Specifically, Syn-GRPO consists of two components: (1) a data server and (2) a GRPO workflow. The data server synthesizes new samples from existing ones using an image generation model, featuring a decoupled and asynchronous scheme to achieve high generation efficiency. The GRPO workflow provides the data server with the new image descriptions, and it leverages a diversity reward to supervise the MLLM to predict image descriptions for synthesizing samples with diverse responses. Experimental results across three visual perception tasks demonstrate that Syn-GRPO improves data quality by a large margin, achieving significantly superior performance to existing MLLM perception methods, and that Syn-GRPO presents promising potential for scaling long-term self-evolving RL. Our code is available at https://github.com/hqhQAQ/Syn-GRPO.
TLDR: The paper introduces Syn-GRPO, a method that uses online data synthesis to improve the quality and diversity of training data for MLLM perception, demonstrating superior performance in visual perception tasks via a self-evolving RL framework.
TLDR: 该论文介绍了Syn-GRPO,一种利用在线数据合成来提高MLLM感知训练数据质量和多样性的方法。实验表明,该方法在视觉感知任务中表现优异,并通过自演化RL框架展现了潜力。
Diffusion Transformer (DiT)-based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies to show that DiT can be steered via interventions on its hidden states, and that simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose a learnable memory encoder, DiT-Mem, composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as the memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen and only optimize the memory encoder. This yields an efficient training process with few trainable parameters (150M) and 10K data samples, and enables plug-and-play usage at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical rule following and video fidelity. Our code and data are publicly released here: https://thrcle421.github.io/DiT-Mem-Web/.
TLDR: The paper introduces DiT-Mem, a plug-and-play memory module for Diffusion Transformer-based video generation models, enhancing their ability to adhere to physical laws and commonsense dynamics by injecting world knowledge. It uses a learned memory encoder trained on a small dataset while keeping the diffusion backbone frozen and shows improved video fidelity and rule following.
TLDR: 该论文介绍了一种即插即用的记忆模块DiT-Mem,用于基于Diffusion Transformer的视频生成模型,通过注入世界知识来提高其遵守物理规律和常识动态的能力。它使用一个学习到的记忆编码器,在保持diffusion骨干冻结的情况下,在小数据集上进行训练,并显示出改进的视频保真度和规则遵循。
MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: with the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
TLDR: This paper analyzes the training dynamics of MeanFlow, a few-step generative model, identifies key factors for its performance, and proposes an improved training scheme that achieves faster convergence and better few-step image generation results.
TLDR: 本文分析了 MeanFlow(一种少步生成模型)的训练动态,指出了其性能的关键因素,并提出了一种改进的训练方案,该方案实现了更快的收敛速度和更好的少步图像生成结果。
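For readers unfamiliar with the two velocity fields, the relations below restate the usual MeanFlow definitions. The symbols ($z_t$ for the state at time $t$, $v$ for instantaneous velocity, $u$ for average velocity over $[r, t]$) are not defined in the abstract and are assumed here from the standard formulation; the "temporal gap" in findings (ii) and (iii) corresponds to $t - r$.

```latex
% Average velocity over the interval [r, t]
u(z_t, r, t) = \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau

% Small-gap limit: the average velocity reduces to the instantaneous one
\lim_{r \to t} u(z_t, r, t) = v(z_t, t)

% MeanFlow identity linking the two fields (used as the training target)
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)

% Sampling with a single jump from t to r (one-step generation uses r = 0, t = 1)
z_r = z_t - (t - r)\, u(z_t, r, t)
```

The second line is why small-gap average velocities behave much like instantaneous velocity, while the large-gap case needed for one-step generation is the harder target.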
Direct Preference Optimization (DPO) has shown promising results in aligning generative outputs with human preferences by distinguishing between chosen and rejected samples. However, a critical limitation of DPO is likelihood displacement, where the probabilities of chosen samples paradoxically decrease during training, undermining the quality of generation. Although this issue has been investigated in autoregressive models, its impact within diffusion-based models remains largely unexplored. This gap leads to suboptimal performance in tasks involving video generation. To address this, we conduct a formal analysis of DPO loss through updating policy within the diffusion framework, which describes how the updating of specific training samples influences the model's predictions on other samples. Using this tool, we identify two main failure modes: (1) Optimization Conflict, which arises from small reward margins between chosen and rejected samples, and (2) Suboptimal Maximization, caused by large reward margins. Informed by these insights, we introduce a novel solution named Policy-Guided DPO (PG-DPO), combining Adaptive Rejection Scaling (ARS) and Implicit Preference Regularization (IPR) to effectively mitigate likelihood displacement. Experiments show that PG-DPO outperforms existing methods in both quantitative metrics and qualitative evaluations, offering a robust solution for improving preference alignment in video generation tasks.
TLDR: This paper introduces Policy-Guided DPO (PG-DPO) to address likelihood displacement in Diffusion Models, specifically for video generation, by analyzing DPO loss and mitigating optimization conflicts and suboptimal maximization with Adaptive Rejection Scaling (ARS) and Implicit Preference Regularization (IPR). The method outperforms existing approaches in preference alignment.
TLDR: 本文提出了一种策略引导的DPO(PG-DPO),旨在解决扩散模型中(特别是视频生成任务)的似然位移问题。该方法通过分析DPO损失,并使用自适应拒绝缩放(ARS)和隐式偏好正则化(IPR)来缓解优化冲突和次优最大化。实验表明,该方法在偏好对齐方面优于现有方法。
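Likelihood displacement is easiest to see in the original log-likelihood form of the DPO loss, sketched below. This is the standard autoregressive-style objective, not the diffusion variant analyzed in the paper and not PG-DPO itself; the argument names and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin)).

    margin = beta * [(chosen log-ratio) - (rejected log-ratio)], where each
    log-ratio is policy log-probability minus reference log-probability.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Because the loss depends only on the difference of the two log-ratios, it can fall even when the chosen sample becomes less likely, as long as the rejected one drops faster. For example, `dpo_loss(-2.0, -5.0, -1.0, -1.0)` is lower than `dpo_loss(-1.0, -1.0, -1.0, -1.0)` although the chosen log-probability went from -1 to -2.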
Video generation models have made significant progress in generating realistic content, enabling applications in simulation, gaming, and filmmaking. However, current generated videos still contain visual artifacts arising from 3D inconsistencies, e.g., objects and structures deforming under changes in camera pose, which can undermine user experience and simulation fidelity. Motivated by recent findings on representation alignment for diffusion models, we hypothesize that improving the multi-view consistency of video diffusion representations will yield more 3D-consistent video generation. Through detailed analysis of multiple recent camera-controlled video diffusion models, we reveal strong correlations between 3D-consistent representations and videos. We also propose ViCoDR, a new approach for improving the 3D consistency of video models by learning multi-view consistent diffusion representations. We evaluate ViCoDR on camera-controlled image-to-video, text-to-video, and multi-view generation models, demonstrating significant improvements in the 3D consistency of the generated videos. Project page: https://danier97.github.io/ViCoDR.
TLDR: The paper introduces ViCoDR, a novel approach to improve 3D consistency in video generation models by learning multi-view consistent diffusion representations, demonstrated through experiments on various generation tasks.
TLDR: 该论文介绍了ViCoDR,一种通过学习多视角一致的扩散表示来提高视频生成模型中3D一致性的新方法,并通过在各种生成任务上的实验证明了其有效性。
Flow Matching (FM) has recently emerged as a principled and efficient alternative to diffusion models. Standard FM encourages the learned velocity field to follow a target direction; however, it may accumulate errors along the trajectory and drive samples off the data manifold, leading to perceptual degradation, especially in lightweight or low-step configurations. To enhance stability and generalization, we extend FM into a balanced attract-repel scheme that provides explicit guidance on both "where to go" and "where not to go." Formally, we propose Velocity Contrastive Regularization (VeCoR), a complementary training scheme for flow-based generative modeling that augments the standard FM objective with contrastive, two-sided supervision. VeCoR not only aligns the predicted velocity with a stable reference direction (positive supervision) but also pushes it away from inconsistent, off-manifold directions (negative supervision). This contrastive formulation transforms FM from a purely attractive, one-sided objective into a two-sided training signal, regularizing trajectory evolution and improving perceptual fidelity across datasets and backbones. On ImageNet-1K 256x256, VeCoR yields 22% and 35% relative FID reductions on SiT-XL/2 and REPA-SiT-XL/2 backbones, respectively, and achieves further FID gains (32% relative) on MS-COCO text-to-image generation, demonstrating consistent improvements in stability, convergence, and image quality, particularly in low-step and lightweight settings. Project page: https://p458732.github.io/VeCoR_Project_Page/
TLDR: The paper introduces Velocity Contrastive Regularization (VeCoR), a novel training scheme for Flow Matching that uses both positive and negative supervision to improve stability and image quality, especially in low-step configurations.
TLDR: 该论文介绍了速度对比正则化(VeCoR),一种用于Flow Matching的新型训练方案,它使用正向和负向监督来提高稳定性和图像质量,尤其是在低步数配置中。
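The attract-repel idea can be sketched on top of the usual linear-path flow matching target. In the code below, `x0` is a noise sample and `x1` a data sample (a common convention, assumed here), and the loss subtracts a weighted distance to a negative direction. The exact form of the VeCoR objective, the weight `lam`, and how negative directions are constructed are not given in the abstract, so this is an assumed stand-in rather than the paper's loss.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Flow matching target for the straight path: constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def attract_repel_loss(v_pred, v_pos, v_neg, lam=0.05):
    """Pull v_pred toward the positive direction, push it away from the negative.

    lam is a hypothetical weight; with lam = 0 this is the plain FM loss.
    """
    return sq_dist(v_pred, v_pos) - lam * sq_dist(v_pred, v_neg)
```

With `lam = 0` the function reduces to the one-sided FM regression loss; a small positive `lam` adds the "where not to go" signal while keeping the attractive term dominant.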
We present One4D, a unified framework for 4D generation and reconstruction that produces dynamic 4D content as synchronized RGB frames and pointmaps. By consistently handling varying sparsities of conditioning frames through a Unified Masked Conditioning (UMC) mechanism, One4D can seamlessly transition between 4D generation from a single image, 4D reconstruction from a full video, and mixed generation and reconstruction from sparse frames. Our framework adapts a powerful video generation model for joint RGB and pointmap generation, with carefully designed network architectures. The commonly used diffusion finetuning strategies for depthmap or pointmap reconstruction often fail on joint RGB and pointmap generation, quickly degrading the base video model. To address this challenge, we introduce Decoupled LoRA Control (DLC), which employs two modality-specific LoRA adapters to form decoupled computation branches for RGB frames and pointmaps, connected by lightweight, zero-initialized control links that gradually learn mutual pixel-level consistency. Trained on a mixture of synthetic and real 4D datasets under modest computational budgets, One4D produces high-quality RGB frames and accurate pointmaps across both generation and reconstruction tasks. This work represents a step toward general, high-quality geometry-based 4D world modeling using video diffusion models. Project page: https://mizhenxing.github.io/One4D
TLDR: One4D is a unified framework for 4D generation and reconstruction, producing synchronized RGB frames and pointmaps, using a unified masked conditioning mechanism and Decoupled LoRA Control to handle joint RGB and pointmap generation.
TLDR: One4D是一个统一的4D生成和重建框架,它使用统一的掩码条件机制和解耦LoRA控制,生成同步的RGB帧和点云贴图,从而处理联合的RGB和点云贴图的生成。
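The point of zero initialization can be shown with a toy linear layer: a LoRA adapter adds a low-rank update `B @ A` to a frozen weight `W`, and a control link adds a projection of the other branch's features. When `B` or the link starts at zero, the adapted model initially behaves exactly like the base model. The helper names and the purely linear form of the link are assumptions for illustration; One4D's actual DLC architecture is not detailed in the abstract.

```python
def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def lora_forward(W, A, B, x, scale=1.0):
    """Frozen base layer W plus low-rank update: W x + scale * B (A x)."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

def add_control(h_target, h_source, link):
    """Hypothetical control link: add a linear projection of the other
    branch's features; a zero-initialized link leaves h_target unchanged."""
    return [h + c for h, c in zip(h_target, matvec(link, h_source))]
```

Starting from zero means fine-tuning begins at the pretrained video model's behavior and the cross-branch coupling is learned gradually, which is consistent with the abstract's concern about quickly degrading the base model.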
Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of text-visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many-to-many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit to noisy feedback. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores. Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.
TLDR: This paper introduces Bayesian Prior-Guided Optimization (BPGO), an extension of Group Relative Policy Optimization (GRPO), to improve visual generation by explicitly modeling reward uncertainty using a semantic prior anchor, leading to better semantic alignment and faster convergence.
TLDR: 本文介绍了贝叶斯先验引导优化(BPGO),它是群体相对策略优化(GRPO)的扩展,通过使用语义先验锚显式地建模奖励不确定性来改进视觉生成,从而实现更好的语义对齐和更快的收敛。
Visual autoregressive models achieve remarkable generation quality through next-scale predictions across multi-scale token pyramids. However, the conventional method uses uniform scale downsampling to build these pyramids, leading to aliasing artifacts that compromise fine details and introduce unwanted jaggies and moiré patterns. To tackle this issue, we present FVAR, which reframes the paradigm from next-scale prediction to next-focus prediction, mimicking the natural process of camera focusing from blur to clarity. Our approach introduces three key innovations: 1) a Next-Focus Prediction Paradigm that transforms multi-scale autoregression by progressively reducing blur rather than simply downsampling; 2) Progressive Refocusing Pyramid Construction that uses physics-consistent defocus kernels to build clean, alias-free multi-scale representations; and 3) High-Frequency Residual Learning that employs a specialized residual teacher network to effectively incorporate alias information during training while maintaining deployment simplicity. Specifically, we construct optical low-pass views using defocus point spread function (PSF) kernels with decreasing radius, creating smooth blur-to-clarity transitions that eliminate aliasing at its source. To further enhance detail generation, we introduce a High-Frequency Residual Teacher that learns from both clean structure and alias residuals, distilling this knowledge to a vanilla VAR deployment network for seamless inference. Extensive experiments on ImageNet demonstrate that FVAR substantially reduces aliasing artifacts, improves fine detail preservation, and enhances text readability, achieving superior performance with perfect compatibility to existing VAR frameworks.
TLDR: The paper introduces FVAR, a visual autoregressive model that addresses aliasing artifacts in multi-scale image generation by using physics-consistent defocus kernels for progressive refocusing and high-frequency residual learning, leading to improved detail preservation and readability.
TLDR: 该论文介绍了FVAR,一种视觉自回归模型,通过使用物理一致的散焦核进行渐进式重聚焦和高频残差学习,解决了多尺度图像生成中的混叠伪影问题,从而提高了细节保留和可读性。
With the success of flow matching in visual generation, sampling efficiency remains a critical bottleneck for its practical application. Among acceleration methods for flow models, ReFlow has been somewhat overlooked even though it is theoretically consistent with flow matching. This is primarily due to its suboptimal performance in practical scenarios compared to consistency distillation and score distillation. In this work, we investigate this issue within the ReFlow framework and propose FlowSteer, a method that unlocks the potential of ReFlow-based distillation by guiding the student along the teacher's authentic generation trajectories. We first identify that Piecewised ReFlow's performance is hampered by a critical distribution mismatch during training and propose Online Trajectory Alignment (OTA) to resolve it. Then, we introduce an adversarial distillation objective applied directly on the ODE trajectory, improving the student's adherence to the teacher's generation trajectory. Furthermore, we find and fix a previously undiscovered flaw in the widely used FlowMatchEulerDiscreteScheduler that largely degrades few-step inference quality. Our experimental results on SD3 demonstrate our method's efficacy.
TLDR: The paper introduces FlowSteer, a method to improve the sampling efficiency of flow matching models by guiding the student model along the teacher's authentic generation trajectories using online trajectory alignment and adversarial distillation, also fixing a scheduler flaw.
TLDR: 该论文介绍了FlowSteer,一种通过在线轨迹对齐和对抗蒸馏引导学生模型沿着教师模型的真实生成轨迹,从而提高流匹配模型的采样效率的方法,同时也修复了一个调度器的缺陷。
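Why trajectory shape matters for few-step sampling can be seen with a plain Euler integrator. The sketch below is a generic ODE solver written for illustration; it does not reproduce FlowMatchEulerDiscreteScheduler or the flaw the authors report, and the `velocity` callables used with it are toy stand-ins for a learned model.

```python
def euler_sample(x, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with uniform Euler steps."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = [xi + dt * vi for xi, vi in zip(x, velocity(x, t))]
    return x
```

With a constant velocity field (a perfectly straight trajectory), one step already lands on the exact endpoint. With a curved field such as `velocity = lambda x, t: x`, one step from `[1.0]` gives `2.0` and four steps give about `2.44`, while the true endpoint is `e ≈ 2.718`. ReFlow-style methods aim to straighten trajectories so that the few-step case behaves more like the first example.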
Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT-reward correspondence enables multidimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose Fast-GRPO, which employs hybrid ODE-SDE sampling that dramatically reduces the training overhead compared to existing GRPO implementations. We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets, with 300 single-event classes and 501 multi-event samples. Experimental results demonstrate that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and out-of-domain AudioCanvas benchmark. The project page is available at https://PrismAudio-Project.github.io.
TLDR: PrismAudio introduces a Reinforcement Learning based Video-to-Audio generation framework with decomposed Chain-of-Thought reasoning and multi-dimensional rewards, addressing objective entanglement and improving performance across semantic, temporal, aesthetic, and spatial dimensions, benchmarked on a new dataset, AudioCanvas.
TLDR: PrismAudio 提出了一个基于强化学习的视频到音频生成框架,该框架具有分解的思维链推理和多维奖励,解决了目标缠结问题,并在语义、时间、美学和空间维度上提高了性能,并在一个新的数据集 AudioCanvas 上进行了基准测试。
Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP achieves up to 10x faster inference than previous methods while increasing the total number of parameters by only 0.3%, and attains a 1.90 FID score on ImageNet 256x256.
TLDR: The paper proposes DiP, an efficient pixel space diffusion framework using a two-stage approach (global structure and local detail refinement) to achieve high-quality image generation at speeds comparable to LDMs without VAEs.
TLDR: 该论文提出了一种高效的像素空间扩散框架 DiP,它采用两阶段方法(全局结构和局部细节细化)以实现高质量图像生成,其速度与 LDM 相当,且无需 VAE。
Diffusion models have emerged as a dominant paradigm for generative modeling across a wide range of domains, including prompt-conditional generation. The vast majority of samplers, however, rely on forward discretization of the reverse diffusion process and use score functions that are learned from data. Such forward and explicit discretizations can be slow and unstable, requiring a large number of sampling steps to produce good-quality samples. In this work we develop a text-to-image (T2I) diffusion model based on backward discretizations, dubbed ProxT2I, relying on learned and conditional proximal operators instead of score functions. We further leverage recent advances in reinforcement learning and policy optimization to optimize our samplers for task-specific rewards. Additionally, we develop a new large-scale and open-source dataset comprising 15 million high-quality human images with fine-grained captions, called LAION-Face-T2I-15M, for training and evaluation. Our approach consistently enhances sampling efficiency and human-preference alignment compared to score-based baselines, and achieves results on par with existing state-of-the-art and open-source text-to-image models while requiring lower compute and smaller model size, offering a lightweight yet performant solution for human text-to-image generation.
TLDR: The paper introduces ProxT2I, a text-to-image diffusion model that uses learned proximal operators and reinforcement learning for efficient and human-aligned image generation. It also presents a new large-scale face dataset, LAION-Face-T2I-15M.
TLDR: 该论文介绍了 ProxT2I,一种文本到图像的扩散模型,它使用学习到的近端算子和强化学习来实现高效且与人类偏好对齐的图像生成。该论文还提出了一个新的大型人脸数据集 LAION-Face-T2I-15M。
Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualizes the city through a top-down planning strategy that defines a hierarchical "City-District-Grid" structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a "produce-refine-evaluate" isometric image synthesis loop, followed by image-to-3D generation. To simulate continuous city evolution, Yo'City further introduces a user-interactive, relationship-guided expansion mechanism, which performs scene graph-based distance- and semantics-aware layout optimization, ensuring spatially coherent city growth. To comprehensively evaluate our method, we construct a diverse benchmark dataset and design six multi-dimensional metrics that assess generation quality from the perspectives of semantics, geometry, texture, and layout. Extensive experiments demonstrate that Yo'City consistently outperforms existing state-of-the-art methods across all evaluation aspects.
TLDR: The paper introduces Yo'City, a novel agentic framework that uses off-the-shelf large models to generate personalized and infinitely expandable 3D city scenes through hierarchical planning and an interactive expansion mechanism.
TLDR: 本文介绍了一种名为 Yo'City 的新型代理框架,该框架利用大型语言模型,通过分层规划和交互式扩展机制,生成个性化且无限扩展的 3D 城市场景。
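The hierarchical "City-District-Grid" structure described in the Yo'City abstract can be pictured as a small nested data model. The sketch below is only an illustration: the class names, field names, and the example content are hypothetical and are not taken from the paper, and the planner and designer stages (which the abstract attributes to large models) are not implemented here.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Grid:
    """One grid cell with its grid-level text description (hypothetical fields)."""
    row: int
    col: int
    description: str


@dataclass
class District:
    """A functional district, refined into grids by a local design stage."""
    name: str
    function: str
    grids: List[Grid] = field(default_factory=list)


@dataclass
class City:
    """Top level of the hierarchy, laid out by a global planning stage."""
    name: str
    districts: List[District] = field(default_factory=list)

    def grid_prompts(self) -> Iterator[Tuple[str, Grid]]:
        """Yield (district name, grid) pairs in top-down order."""
        for district in self.districts:
            for grid in district.grids:
                yield district.name, grid


def example_city() -> City:
    """Build a tiny, made-up city to show how the levels nest."""
    harbor = District("Harbor", "logistics", [
        Grid(0, 0, "container yard with cranes"),
        Grid(0, 1, "warehouses along a pier"),
    ])
    old_town = District("Old Town", "residential", [
        Grid(1, 0, "narrow streets with low brick houses"),
    ])
    return City("Example City", [harbor, old_town])


if __name__ == "__main__":
    for district_name, grid in example_city().grid_prompts():
        print(district_name, (grid.row, grid.col), grid.description)
```

Each yielded grid description would then be the input to a per-grid generation step, so the city can grow by appending districts or grids without regenerating existing ones.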
Recent advances in diffusion transformers have shown remarkable generalization in visual synthesis, yet most dense perception methods still rely on text-to-image (T2I) generators designed for stochastic generation. We revisit this paradigm and show that image editing diffusion models are inherently image-to-image consistent, providing a more suitable foundation for dense perception tasks. We introduce Edit2Perceive, a unified diffusion framework that adapts editing models for depth, normal, and matting. Built upon the FLUX.1 Kontext architecture, our approach employs full-parameter fine-tuning and a pixel-space consistency loss to enforce structure-preserving refinement across intermediate denoising states. Moreover, our single-step deterministic inference yields faster runtime while training on relatively small datasets. Extensive experiments demonstrate comprehensive state-of-the-art results across all three tasks, revealing the strong potential of editing-oriented diffusion transformers for geometry-aware perception.
TLDR: The paper introduces Edit2Perceive, a framework that leverages image editing diffusion models for dense perception tasks like depth estimation, normal prediction, and matting, achieving state-of-the-art results with faster inference and smaller datasets.
TLDR: 该论文介绍了Edit2Perceive,一个利用图像编辑扩散模型进行密集感知任务(如深度估计、法线预测和抠图)的框架,通过更快的推理速度和更小的数据集实现了最先进的结果。
Novel View Synthesis (NVS) is the task of generating new images of a scene from viewpoints that were not part of the original input. Diffusion-based NVS can generate high-quality, temporally consistent images but remains computationally prohibitive. Conversely, regression-based NVS offers suboptimal generation quality despite requiring significantly lower compute, leaving the design of a high-quality, inference-efficient NVS framework an open challenge. To close this gap, we present Sphinx, a training-free hybrid inference framework that achieves diffusion-level fidelity at significantly lower compute. Sphinx uses regression-based fast initialization to guide and reduce the denoising workload for the diffusion model. Additionally, it integrates selective refinement with adaptive noise scheduling, allocating more compute to uncertain regions and frames. This enables Sphinx to provide flexible navigation of the performance-quality trade-off, allowing adaptation to latency and fidelity requirements for dynamically changing inference scenarios. Our evaluation shows that Sphinx achieves an average 1.8x speedup over diffusion model inference with negligible perceptual degradation of less than 5%, establishing a new Pareto frontier between quality and latency in NVS serving.
TLDR: The paper introduces Sphinx, a training-free hybrid inference framework for Novel View Synthesis (NVS) that combines regression-based initialization with selective diffusion refinement to achieve diffusion-level quality at significantly lower compute, establishing a new Pareto frontier between quality and latency.
TLDR: 该论文介绍了Sphinx,一种用于新视角合成(NVS)的混合推理框架,它结合了基于回归的初始化和选择性扩散细化,以显著降低计算量实现扩散级别的质量,并在质量和延迟之间提供了帕累托前沿。
Text-to-motion generation, which synthesizes 3D human motions from text inputs, holds immense potential for applications in gaming, film, and robotics. Recently, diffusion-based methods have been shown to generate more diverse and realistic motion. However, there exists a misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent or low-quality motions. To address this limitation, we propose Reward-guided sampling Alignment (ReAlign), comprising a step-aware reward model to assess alignment quality during denoising sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Extensive experiments on both motion generation and retrieval tasks demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
TLDR: This paper introduces ReAlign, a reward-guided diffusion approach for text-to-motion generation that improves text-motion alignment and motion quality by using a step-aware reward model to guide the denoising process.
TLDR: 该论文提出了一种名为 ReAlign 的奖励引导扩散方法,用于文本到动作的生成,通过使用步进感知的奖励模型来引导去噪过程,从而提高文本动作对齐和动作质量。
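The ReAlign abstract describes steering each denoising step with a reward signal so that the sample balances probability density against alignment. A one-dimensional toy sketch of that general idea follows. Everything in it is an assumption for illustration: the linear "denoiser" that pulls toward a prior mean, the quadratic reward around a target, the constant guidance weight, and all names. ReAlign itself uses a learned, step-aware reward model on motion sequences, which this sketch does not reproduce.

```python
# Toy reward-guided sampling in 1D (illustrative assumptions throughout).

def denoise_step(x: float, prior_mean: float = 0.0, eta: float = 0.2) -> float:
    """Stand-in for one denoising update: move x toward the prior mean."""
    return x + eta * (prior_mean - x)


def reward_grad(x: float, target: float = 2.0) -> float:
    """Gradient of the assumed reward r(x) = -0.5 * (x - target)**2."""
    return target - x


def sample(x0: float, steps: int = 50, guidance: float = 0.0) -> float:
    """Run the denoiser, adding a reward-gradient nudge at every step."""
    x = x0
    for _ in range(steps):
        x = denoise_step(x) + guidance * reward_grad(x)
    return x


if __name__ == "__main__":
    print("unguided:", sample(3.0, guidance=0.0))
    print("guided:  ", sample(3.0, guidance=0.2))
```

With these defaults the unguided run settles at the prior mean, while the guided run settles at the point where the pull of the denoiser and the pull of the reward cancel, midway between the prior mean and the reward target when the two weights are equal.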
Film set design plays a pivotal role in cinematic storytelling and in shaping the visual atmosphere. However, the traditional process depends on expert-driven manual modeling, which is labor-intensive and time-consuming. To address this issue, we introduce FilmSceneDesigner, an automated scene generation system that emulates the professional film set design workflow. Given a natural language description, including scene type, historical period, and style, we design an agent-based chaining framework to generate structured parameters aligned with the film set design workflow, guided by prompt strategies that ensure parameter accuracy and coherence. In addition, we propose a procedural generation pipeline that executes a series of dedicated functions with the structured parameters for floorplan and structure generation, material assignment, door and window placement, and object retrieval and layout, ultimately constructing a complete film scene from scratch. Moreover, to enhance cinematic realism and asset diversity, we construct SetDepot-Pro, a curated dataset of 6,862 film-specific 3D assets and 733 materials. Experimental results and human evaluations demonstrate that our system produces structurally sound scenes with strong cinematic fidelity, supporting downstream tasks such as virtual previs, construction drawing, and mood board creation.
TLDR: The paper introduces FilmSceneDesigner, an automated system that generates film scenes from natural language descriptions using an agent-based chaining framework and procedural generation pipeline, along with a new dataset of film-specific 3D assets.
TLDR: 该论文介绍了FilmSceneDesigner,一个自动化系统,可以使用基于代理的链式框架和程序生成管道,从自然语言描述中生成电影场景,并提供了一个新的电影专用3D资产数据集。
Text-to-LiDAR generation can customize 3D data with rich structures and diverse scenes for downstream tasks. However, the scarcity of Text-LiDAR pairs often causes insufficient training priors, leading to overly smooth 3D scenes. Moreover, low-quality text descriptions may degrade generation quality and controllability. In this paper, we propose a Text-to-LiDAR Diffusion Model for scene generation, named T2LDM, with a Self-Conditioned Representation Guidance (SCRG). Specifically, by aligning to real representations, SCRG provides soft supervision with reconstruction details for the Denoising Network (DN) during training and is decoupled at inference. In this way, T2LDM can perceive rich geometric structures from the data distribution, generating detailed objects in scenes. Meanwhile, we construct a content-composable Text-LiDAR benchmark, T2nuScenes, along with a controllability metric. Based on this, we analyze the effects of different text prompts on LiDAR generation quality and controllability, providing practical prompt paradigms and insights. Furthermore, a directional position prior is designed to mitigate street distortion, further improving scene fidelity. Additionally, by learning a conditional encoder via a frozen DN, T2LDM can support multiple conditional tasks, including Sparse-to-Dense, Dense-to-Sparse, and Semantic-to-LiDAR generation. Extensive experiments in unconditional and conditional generation demonstrate that T2LDM outperforms existing methods, achieving state-of-the-art scene generation.
TLDR: This paper introduces T2LDM, a text-to-LiDAR diffusion model with self-conditioned representation guidance, addressing the scarcity of training data and low-quality text descriptions in generating realistic 3D scenes, also offering conditional generation capabilities.
TLDR: 本文介绍了一种名为T2LDM的文本到激光雷达扩散模型,该模型具有自条件表示引导,旨在解决生成逼真3D场景中训练数据稀缺和低质量文本描述的问题,并提供条件生成能力。
Recent works have sought to enhance the controllability and precision of text-driven motion generation. Some approaches leverage large language models (LLMs) to produce more detailed texts, while others incorporate global 3D coordinate sequences as additional control signals. However, the former often introduces misaligned details and lacks explicit temporal cues, and the latter incurs significant computational cost when converting coordinates to standard motion representations. To address these issues, we propose FineXtrol, a novel control framework for efficient motion generation guided by temporally-aware, precise, user-friendly, and fine-grained textual control signals that describe specific body part movements over time. In support of this framework, we design a hierarchical contrastive learning module that encourages the text encoder to produce more discriminative embeddings for our novel control signals, thereby improving motion controllability. Quantitative results show that FineXtrol achieves strong performance in controllable motion generation, while qualitative analysis demonstrates its flexibility in directing specific body part movements.
TLDR: The paper introduces FineXtrol, a framework for controllable motion generation using fine-grained, temporally-aware textual control signals for specific body parts, addressing limitations of previous approaches leveraging LLMs or 3D coordinates.
TLDR: 该论文介绍了FineXtrol,一个通过细粒度、时间感知的文本控制信号来控制身体特定部位运动的框架,解决了先前使用LLM或3D坐标的方法的局限性。
Video editing and synthesis often introduce object inconsistencies, such as frame flicker and identity drift, that degrade perceptual quality. To address these issues, we introduce ObjectAlign, a novel framework that seamlessly blends perceptual metrics with symbolic reasoning to detect, verify, and correct object-level and temporal inconsistencies in edited video sequences. The novel contributions of ObjectAlign are as follows. First, we propose learnable thresholds for metrics characterizing object consistency (i.e., CLIP-based semantic similarity, LPIPS perceptual distance, histogram correlation, and SAM-derived object-mask IoU). Second, we introduce a neuro-symbolic verifier that combines two components: (a) a formal, SMT-based check that operates on masked object embeddings to provably guarantee that object identity does not drift, and (b) a temporal fidelity check that uses a probabilistic model checker to verify the video's formal representation against a temporal logic specification. A frame transition is subsequently deemed "consistent" based on a single logical assertion that requires satisfying both the learned metric thresholds and this unified neuro-symbolic constraint, ensuring both low-level stability and high-level temporal correctness. Finally, for each contiguous block of flagged frames, we propose a neural-network-based interpolation for adaptive frame repair, dynamically choosing the interpolation depth based on the number of frames to be corrected. This enables reconstruction of the corrupted frames from the last valid and next valid keyframes. Our results show up to a 1.4-point improvement in CLIP Score and up to a 6.1-point improvement in warp error compared to SOTA baselines on the DAVIS and Pexels video datasets.
TLDR: ObjectAlign is a neuro-symbolic framework for detecting and correcting object inconsistencies in edited videos, using learned metric thresholds and formal verification for improved perceptual quality and temporal correctness.
TLDR: ObjectAlign是一个神经-符号框架,用于检测和纠正编辑视频中的对象不一致性,它使用学习的度量阈值和形式验证来提高感知质量和时间正确性。
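Two steps from the ObjectAlign abstract lend themselves to a short sketch: the single logical assertion that combines metric thresholds with the verifier outcome, and the grouping of flagged frames into contiguous blocks bounded by the last and next valid keyframes. The code below is a minimal illustration under stated assumptions. The threshold values, the per-frame boolean flags, and all names are hypothetical; the paper learns its thresholds, and its SMT and model-checking verifier is reduced here to a precomputed boolean.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Thresholds:
    """Placeholder threshold values; the paper learns these."""
    clip_sim_min: float = 0.90
    lpips_max: float = 0.20
    hist_corr_min: float = 0.80
    mask_iou_min: float = 0.70


def is_consistent(clip_sim: float, lpips: float, hist_corr: float,
                  mask_iou: float, verifier_ok: bool,
                  th: Thresholds = Thresholds()) -> bool:
    """Single conjunction: every metric passes and the verifier agrees.

    Similarity-style metrics must be high, the LPIPS distance must be low.
    """
    return (clip_sim >= th.clip_sim_min
            and lpips <= th.lpips_max
            and hist_corr >= th.hist_corr_min
            and mask_iou >= th.mask_iou_min
            and verifier_ok)


Block = Tuple[Optional[int], int, int, Optional[int]]


def flagged_blocks(consistent: List[bool]) -> List[Block]:
    """Group inconsistent frames into contiguous blocks.

    Each block is (last_valid, first_flagged, last_flagged, next_valid);
    a keyframe index is None when the block touches the start or end.
    """
    blocks: List[Block] = []
    i, n = 0, len(consistent)
    while i < n:
        if consistent[i]:
            i += 1
            continue
        start = i
        while i < n and not consistent[i]:
            i += 1
        last_valid = start - 1 if start > 0 else None
        next_valid = i if i < n else None
        blocks.append((last_valid, start, i - 1, next_valid))
    return blocks
```

The block length (last_flagged - first_flagged + 1) is the quantity the abstract says drives the choice of interpolation depth; how that mapping is defined is not stated in the abstract and is left out here.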
Surgical planning and training based on machine learning require a large number of 3D anatomical models reconstructed from medical imaging, which is currently one of the major bottlenecks. Obtaining these data from real patients and during surgery is very demanding, if possible at all, due to legal, ethical, and technical challenges. It is especially difficult for soft tissue organs with poor imaging contrast, such as the prostate. To overcome these challenges, we present a novel workflow for automated 3D anatomical data generation using data obtained from physical organ models. We additionally use a 3D Generative Adversarial Network (GAN) to obtain a manifold of 3D models useful for other downstream machine learning tasks that rely on 3D data. We demonstrate our workflow using an artificial prostate model made of biomimetic hydrogels with imaging contrast in multiple zones. This is used to physically simulate endoscopic surgery. For evaluation and 3D data generation, we place it into a customized ultrasound scanner that records the prostate before and after the procedure. A neural network is trained to segment the recorded ultrasound images, which outperforms conventional, non-learning-based computer vision techniques in terms of intersection over union (IoU). Based on the segmentations, a 3D mesh model is reconstructed, and performance feedback is provided.
TLDR: This paper introduces a workflow for generating 3D anatomical models from ultrasound scans of a physical prostate phantom, with a 3D GAN used to obtain a manifold of further models, addressing the scarcity of 3D medical data for surgical planning and training.
TLDR: 本文介绍了一种基于GAN的工作流程,用于从前列腺模型的超声图像中生成3D解剖模型,从而解决了外科手术规划和训练中3D医疗数据稀缺的问题。
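The segmentation comparison in this abstract is reported as intersection over union (IoU). For readers unfamiliar with the metric, a minimal pure-Python sketch for two binary masks follows. The function name and the convention of returning 1.0 when both masks are empty are assumptions of this sketch, not details from the paper.

```python
from typing import Sequence


def iou(pred: Sequence[Sequence[int]], truth: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two equally sized binary masks.

    A pixel counts toward the intersection when it is set in both masks
    and toward the union when it is set in at least one of them.
    """
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0


if __name__ == "__main__":
    predicted = [[1, 1, 0],
                 [0, 1, 0]]
    reference = [[1, 0, 0],
                 [0, 1, 1]]
    print(iou(predicted, reference))  # 2 shared pixels out of 4 set in either mask
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no pixels.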