Daily papers related to Image/Video/Multimodal Generation from cs.CV
October 21, 2025
Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing generation models. However, current approaches struggle to simultaneously deliver strong editing strength while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating text and vision modalities. These advancements pave the way for overcoming challenges that previous methods failed to resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT. ConsistEdit incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without handcrafted step or layer selection, significantly enhancing reliability and consistency, which enables robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer control.
TLDR: The paper introduces ConsistEdit, a training-free attention control method tailored for MM-DiT, enabling consistent and precise text-guided image and video editing with fine-grained control and state-of-the-art performance.
Recent high-performing image-to-video (I2V) models based on variants of the diffusion transformer (DiT) have displayed remarkable inherent world-modeling capabilities by virtue of training on large-scale video datasets. We investigate whether these models can generate realistic pedestrian movement patterns in crowded public scenes. Our framework conditions I2V models on keyframes extracted from pedestrian trajectory benchmarks, then evaluates their trajectory prediction performance using quantitative measures of pedestrian dynamics.
TLDR: The paper explores the ability of image-to-video diffusion transformer models to simulate realistic pedestrian movement in crowded scenes by conditioning them on keyframes from trajectory benchmarks and evaluating their trajectory prediction performance.
While diffusion models achieve state-of-the-art generation quality, they still suffer from computationally expensive sampling. Recent works address this issue with gradient-based optimization methods that distill a few-step ODE diffusion solver from the full sampling process, reducing the number of function evaluations from dozens to just a few. However, these approaches often rely on intricate training techniques and do not explicitly focus on preserving fine-grained details. In this paper, we introduce the Generalized Solver: a simple parameterization of the ODE sampler that does not require additional training tricks and improves quality over existing approaches. We further combine the original distillation loss with adversarial training, which mitigates artifacts and enhances detail fidelity. We call the resulting method the Generalized Adversarial Solver and demonstrate its superior performance compared to existing solver training methods under similar resource constraints. Code is available at https://github.com/3145tttt/GAS.
TLDR: The paper introduces a novel, training-trick-free ODE solver parameterization for diffusion models, named Generalized Adversarial Solver (GAS), which aims to improve sampling quality and reduce artifacts compared to existing methods by combining distillation loss and adversarial training.
Image editing has achieved remarkable progress recently. Modern editing models can already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are key to realistic generation. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimensions (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset, PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with ample room for improvement. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.
TLDR: The paper introduces PICABench, a new benchmark for evaluating the physical realism of image editing models, highlighting the gap between instruction completion and physically plausible results. It also explores solutions and provides a dataset to encourage research in this area.
Human communication combines speech with expressive nonverbal cues such as hand gestures that serve manifold communicative functions. Yet, current gesture generation approaches are restricted to simple, repetitive beat gestures that accompany the rhythm of speaking but do not contribute to communicating semantic meaning. This paper tackles a core challenge in co-speech gesture synthesis: generating iconic or deictic gestures that are semantically coherent with a verbal utterance. Such gestures cannot be derived from language input alone, which inherently lacks the visual meaning that is often carried autonomously by gestures. We therefore introduce a zero-shot system that generates gestures from a given language input and is additionally informed by imagistic input, without manual annotation or human intervention. Our method integrates an image analysis pipeline that extracts key object properties such as shape, symmetry, and alignment, together with a semantic matching module that links these visual details to spoken text. An inverse kinematics engine then synthesizes iconic and deictic gestures and combines them with co-generated natural beat gestures for coherent multimodal communication. A comprehensive user study demonstrates the effectiveness of our approach. In scenarios where speech alone was ambiguous, gestures generated by our system significantly improved participants' ability to identify object properties, confirming their interpretability and communicative value. While challenges remain in representing complex shapes, our results highlight the importance of context-aware semantic gestures for creating expressive and collaborative virtual agents or avatars, marking a substantial step forward towards efficient and robust, embodied human-agent interaction. More information and example videos are available here: https://review-anon-io.github.io/ImaGGen.github.io/
TLDR: The paper introduces ImaGGen, a zero-shot system for generating co-speech iconic and deictic gestures grounded in both language and image inputs, significantly improving object property identification in user studies.
In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling. Details are available at https://github.com/Shopee-MUG/MUG-V.
TLDR: The paper presents an efficient training framework for large video generation models and the resulting model, MUG-V 10B, and open-sources the complete stack, including model weights, training code, and inference pipelines.
Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (actions and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.
TLDR: This paper introduces VideoBiasEval, a framework for analyzing social biases in video diffusion models, and demonstrates how alignment tuning exacerbates these biases, emphasizing the need for bias-aware evaluation and mitigation strategies.
Masked Autoregressive (MAR) models promise better efficiency in visual generation than autoregressive (AR) models thanks to their ability to generate tokens in parallel, yet their acceleration potential remains constrained by the modeling complexity of spatially correlated visual tokens in a single step. To address this limitation, we introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages: structure generation establishing global semantic scaffolding, followed by detail reconstruction efficiently completing remaining tokens. Assuming that it is more difficult to create an image from scratch than to complement images based on a basic image framework, GtR is designed to achieve acceleration by computing the reconstruction stage quickly while maintaining the generation quality by computing the generation stage slowly. Moreover, observing that tokens on the details of an image often carry more semantic information than tokens in the salient regions, we further propose Frequency-Weighted Token Selection (FTS) to offer more computation budget to tokens on image details, which are localized based on the energy of high-frequency information. Extensive experiments on ImageNet class-conditional and text-to-image generation demonstrate 3.72x speedup on MAR-H while maintaining comparable quality (e.g., FID: 1.59, IS: 304.4 vs. original 1.59, 299.1), substantially outperforming existing acceleration methods across various model scales and generation tasks. Our code will be released at https://github.com/feihongyan1/GtR.
TLDR: The paper introduces a training-free two-stage sampling strategy, Generation then Reconstruction (GtR), for accelerating Masked Autoregressive (MAR) models in image generation. It computes structure generation slowly and detail reconstruction quickly, and uses Frequency-Weighted Token Selection to give more computation to detail tokens, reporting a 3.72x speedup on MAR-H at comparable quality.
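The idea of localizing detail tokens by high-frequency energy can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `high_freq_energy_per_patch` and `top_k_patches`, the patch size, and the circular frequency cutoff are hypothetical choices made only for illustration. The sketch high-pass filters a grayscale image with a 2D FFT, sums the remaining energy per patch, and ranks the patches.

```python
import numpy as np

def high_freq_energy_per_patch(img, patch=8, cutoff=4):
    """Return a (H/patch, W/patch) map of high-frequency energy.

    img: 2D float array whose height and width are multiples of `patch`.
    cutoff: radius (in frequency bins) of the low-frequency disc to discard;
            an illustrative assumption, not a value from the paper.
    """
    h, w = img.shape
    spec = np.fft.fftshift(np.fft.fft2(img))       # zero frequency moved to the centre
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)      # distance from the zero frequency
    spec[dist < cutoff] = 0                        # drop low frequencies (including DC)
    high = np.fft.ifft2(np.fft.ifftshift(spec)).real
    energy = high ** 2
    # Sum the per-pixel energy inside each non-overlapping patch.
    return energy.reshape(h // patch, patch, w // patch, patch).sum(axis=(1, 3))

def top_k_patches(energy_map, k):
    """Flat indices of the k patches with the most high-frequency energy."""
    return np.argsort(energy_map.ravel())[::-1][:k]

# Toy image: flat left half, checkerboard texture on the right half.
img = np.zeros((32, 32))
yy, xx = np.mgrid[:32, 16:32]
img[:, 16:] = np.where((yy + xx) % 2 == 0, 1.0, -1.0)

energy = high_freq_energy_per_patch(img)
selected = top_k_patches(energy, k=8)
```

In the toy image, the selected patches all fall in the textured right half; in a sampler, such patches would be the candidates for a larger share of the computation budget.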
Diffusion and flow models achieve high generative quality but remain computationally expensive due to slow multi-step sampling. Distillation methods accelerate them by training fast student generators, yet most existing objectives lack a unified theoretical foundation. In this work, we propose Di-Bregman, a compact framework that formulates diffusion distillation as Bregman divergence-based density-ratio matching. This convex-analytic view connects several existing objectives through a common lens. Experiments on CIFAR-10 and text-to-image generation demonstrate that Di-Bregman achieves improved one-step FID over reverse-KL distillation and maintains high visual fidelity compared to the teacher model. Our results highlight Bregman density-ratio matching as a practical and theoretically grounded route toward efficient one-step diffusion generation.
TLDR: The paper introduces Di-Bregman, a framework that formulates one-step diffusion distillation as Bregman divergence-based density-ratio matching, achieving improved one-step FID over reverse-KL distillation while maintaining high visual fidelity relative to the teacher model.
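For readers unfamiliar with the term, the Bregman divergence that gives Di-Bregman its name can be sketched numerically. The snippet below shows only the standard pointwise definition for a convex function phi, applied to example density-ratio values. The helper name `bregman` and the two example choices of phi are illustrative assumptions, and the paper's actual training objective is not reproduced here.

```python
import numpy as np

def bregman(phi, dphi, r, s):
    """Pointwise Bregman divergence D_phi(r || s) = phi(r) - phi(s) - phi'(s) * (r - s)."""
    r, s = np.asarray(r, dtype=float), np.asarray(s, dtype=float)
    return phi(r) - phi(s) - dphi(s) * (r - s)

# Example ratio values (arbitrary positive numbers chosen for illustration).
r = np.array([1.0, 2.0])
s = np.array([0.5, 2.5])

# phi(t) = t^2 recovers the squared error (r - s)^2.
sq = bregman(lambda t: t ** 2, lambda t: 2 * t, r, s)

# phi(t) = t*log(t) - t recovers the generalized KL term r*log(r/s) - r + s.
kl = bregman(lambda t: t * np.log(t) - t, np.log, r, s)
```

Different convex choices of phi yield different matching losses between a target ratio and its estimate, which is one way to see how a Bregman view can place several objectives under a common lens.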
The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples, the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.
TLDR: FineVision is a large, meticulously curated, and unified open dataset of 24 million vision-language samples designed to improve VLM training. Models trained on it outperform those trained on existing open mixtures.
Diffusion-based methods, leveraging pre-trained large models like Stable Diffusion via ControlNet, have achieved remarkable performance in several low-level vision tasks. However, Pre-Trained Diffusion-Based (PTDB) methods often sacrifice content fidelity to attain higher perceptual realism. This issue is exacerbated in low-light scenarios, where severely degraded information caused by the darkness limits effective control. We identify two primary causes of fidelity loss: the absence of suitable conditional latent modeling and the lack of bidirectional interaction between the conditional latent and noisy latent in the diffusion process. To address this, we propose a novel optimization strategy for conditioning in pre-trained diffusion models, enhancing fidelity while preserving realism and aesthetics. Our method introduces a mechanism to recover spatial details lost during VAE encoding, i.e., a latent refinement pipeline incorporating generative priors. Additionally, the refined latent condition interacts dynamically with the noisy latent, leading to improved restoration performance. Our approach is plug-and-play, seamlessly integrating into existing diffusion networks to provide more effective control. Extensive experiments demonstrate significant fidelity improvements in PTDB methods.
TLDR: This paper addresses fidelity loss in pre-trained diffusion-based low-light image enhancement by refining the condition latent and introducing bidirectional interaction, improving restoration performance without sacrificing perceptual realism.
Articulated objects, such as laptops and drawers, pose significant challenges for 3D reconstruction and pose estimation due to their multi-part geometries and variable joint configurations, which introduce structural diversity across different states. To address these challenges, we propose KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation, a unified framework for reconstructing diverse articulated instances and estimating their pose from single-view input. Specifically, we first encode complete geometry (SDFs), joint angles, and part segmentation into a structured latent space via a novel Kinematic-Aware VAE (KA-VAE). In addition, we employ two conditional diffusion models: one for regressing global pose (SE(3)) and joint parameters, and another for generating the kinematic-aware latent code from partial observations. Finally, we introduce an iterative optimization module that bidirectionally refines reconstruction accuracy and kinematic parameters via Chamfer-distance minimization while preserving articulation constraints. Experimental results on synthetic, semi-synthetic, and real-world datasets demonstrate the effectiveness of our approach in accurately reconstructing articulated objects and estimating their kinematic properties.
TLDR: The paper introduces KineDiff3D, a diffusion-based framework for reconstructing and estimating the pose of articulated objects from single view input, leveraging a kinematic-aware VAE and conditional diffusion models.
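The Chamfer distance minimized in the refinement step can be illustrated as follows. This is a minimal NumPy sketch of one common symmetric, squared-distance variant. The function name `chamfer_distance` is hypothetical, and the exact variant and the articulation constraints used by KineDiff3D are not specified in the abstract.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    # Average nearest-neighbour squared distance in both directions.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
shifted = pts + np.array([0.0, 0.0, 1.0])
same = chamfer_distance(pts, pts)         # 0.0 for identical sets
moved = chamfer_distance(pts, shifted)    # 2.0: every nearest neighbour is 1 unit away
```

An optimizer would adjust shape and joint parameters to drive this value down between the reconstructed surface points and the observed partial point cloud.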