Daily papers related to Image/Video/Multimodal Generation from cs.CV
September 24, 2025
We propose Lavida-O, a unified multimodal Masked Diffusion Model (MDM) capable of image understanding and generation tasks. Unlike existing multimodal diffusion language models such as MMaDa and Muddit, which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O exhibits many new capabilities such as object grounding, image editing, and high-resolution (1024px) image synthesis. It is also the first unified MDM that uses its understanding capabilities to improve image generation and editing results through planning and iterative self-reflection. To enable effective and efficient training and sampling, Lavida-O introduces several novel techniques, including an Elastic Mixture-of-Transformer architecture, universal text conditioning, and stratified sampling. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks such as RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable speedup at inference.
TLDR: Lavida-O is a novel unified masked diffusion model (MDM) that achieves state-of-the-art performance in image understanding and generation tasks like object grounding, image editing, and high-resolution image synthesis, offering significant speedups compared to existing models.
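A masked diffusion model generates by iteratively filling in masked positions rather than decoding left to right. The loop can be sketched as follows; the predictor, cosine schedule, and vocabulary below are illustrative stand-ins, not Lavida-O's actual components:

```python
import math
import random

MASK = "<mask>"

def toy_predictor(tokens, vocab, rng):
    """Stand-in for the MDM: returns a (token, confidence) guess per
    position. A real model predicts a distribution for each mask."""
    return [(t, 1.0) if t != MASK else (rng.choice(vocab), rng.random())
            for t in tokens]

def mdm_decode(length, vocab, steps=4, seed=0):
    """Iterative unmasking: each step commits the highest-confidence
    predictions, following a cosine schedule for how many positions
    stay masked, until every slot is filled."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        preds = toy_predictor(tokens, vocab, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # fraction of positions that remain masked after this step
        keep = int(length * math.cos(math.pi / 2 * step / steps))
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: max(len(masked) - keep, 1)]:
            tokens[i] = preds[i][0]
    return tokens
```

Because every step fills several positions in parallel, the number of model calls is the (small) step count rather than the sequence length, which is where MDMs gain their inference speed over autoregressive decoding.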
Read Paper (PDF)
Unified multimodal models have recently attracted considerable attention for their remarkable abilities in jointly understanding and generating diverse content. However, as contexts integrate increasingly numerous interleaved multimodal tokens, the iterative processes of diffusion denoising and autoregressive decoding impose significant computational overhead. To address this, we propose Hyper-Bagel, a unified acceleration framework designed to simultaneously speed up both multimodal understanding and generation tasks. Our approach uses a divide-and-conquer strategy, employing speculative decoding for next-token prediction and a multi-stage distillation process for diffusion denoising. The framework delivers substantial performance gains, achieving over a 2x speedup in multimodal understanding. For generative tasks, our resulting lossless 6-NFE model yields a 16.67x speedup in text-to-image generation and a 22x speedup in image editing, all while preserving the high-quality output of the original model. We further develop a highly efficient 1-NFE model that enables near real-time interactive editing and generation. By combining advanced adversarial distillation with human feedback learning, this model achieves ultimate cost-effectiveness and responsiveness, making complex multimodal interactions seamless and instantaneous.
TLDR: The paper introduces Hyper-Bagel, a unified acceleration framework that significantly speeds up both multimodal understanding and generation tasks through speculative decoding and multi-stage distillation, achieving impressive speedups in text-to-image generation and image editing.
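Speculative decoding, one half of Hyper-Bagel's divide-and-conquer strategy, can be sketched with toy greedy models. The `target` and `draft` functions below are placeholders, not the paper's models; the key property is that with greedy acceptance the output is identical to decoding with the target alone:

```python
def speculative_decode(target, draft, prompt, max_new=8, k=3):
    """Greedy speculative decoding: a cheap draft model proposes k tokens,
    the expensive target model verifies them; matching tokens are accepted
    in bulk, so the target runs fewer sequential rounds. With greedy
    (argmax) models the result equals decoding with the target alone."""
    seq = list(prompt)
    end = len(prompt) + max_new
    while len(seq) < end:
        # draft proposes k tokens autoregressively
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # target verifies; stop at the first disagreement
        for t in proposal:
            want = target(seq)   # target's greedy choice at this position
            seq.append(want)     # always keep the target's token
            if want != t or len(seq) >= end:
                break
    return seq[:end]
```

The speedup is "lossless" in exactly this sense: every emitted token is the target model's own choice, and the draft only decides how many of them can be validated per round.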
Read Paper (PDF)
Conditional generative modeling aims to learn a conditional data distribution from samples containing data-condition pairs. For this, diffusion and flow-based methods have attained compelling results. These methods use a learned (flow) model to transport an initial standard Gaussian noise that ignores the condition to the conditional data distribution. The model is hence required to learn both mass transport and conditional injection. To ease the demand on the model, we propose Condition-Aware Reparameterization for Flow Matching (CAR-Flow) -- a lightweight, learned shift that conditions the source, the target, or both distributions. By relocating these distributions, CAR-Flow shortens the probability path the model must learn, leading to faster training in practice. On low-dimensional synthetic data, we visualize and quantify the effects of CAR. On higher-dimensional natural image data (ImageNet-256), equipping SiT-XL/2 with CAR-Flow reduces FID from 2.07 to 1.68, while introducing less than 0.6% additional parameters.
TLDR: The paper introduces CAR-Flow, a condition-aware reparameterization technique for flow matching that shifts source and target distributions to ease the learning process for conditional generative models, resulting in faster training and improved performance on image generation tasks.
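The reparameterization is small enough to state in a few lines. Below is a 1-D sketch under the usual rectified-flow linear path, with `shift_src` and `shift_tgt` standing in for the learned condition-dependent shifts (hypothetical names, not the paper's API):

```python
def car_flow_pair(x0, x1, c, shift_src, shift_tgt, t):
    """Build one flow-matching training pair with condition-aware shifts
    (CAR-Flow idea, sketched in 1-D): the Gaussian source sample x0 and
    the data sample x1 are each relocated by a learned, condition-dependent
    shift before the usual linear interpolation."""
    a = x0 + shift_src(c)            # relocated source
    b = x1 + shift_tgt(c)            # relocated target
    x_t = (1.0 - t) * a + t * b      # point on the straight path
    v_t = b - a                      # constant target velocity on the path
    return x_t, v_t
```

When the shifts move source and target closer together for a given condition, the regression target `v_t` shrinks, which is the sense in which CAR-Flow "shortens the probability path" the model must learn.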
Read Paper (PDF)
Despite steady progress in layout-to-image generation, current methods still struggle with layouts containing significant overlap between bounding boxes. We identify two primary challenges: (1) large overlapping regions and (2) overlapping instances with minimal semantic distinction. Through both qualitative examples and quantitative analysis, we demonstrate how these factors degrade generation quality. To systematically assess this issue, we introduce OverLayScore, a novel metric that quantifies the complexity of overlapping bounding boxes. Our analysis reveals that existing benchmarks are biased toward simpler cases with low OverLayScore values, limiting their effectiveness in evaluating model performance under more challenging conditions. To bridge this gap, we present OverLayBench, a new benchmark featuring high-quality annotations and a balanced distribution across different levels of OverLayScore. As an initial step toward improving performance on complex overlaps, we also propose CreatiLayout-AM, a model fine-tuned on a curated amodal mask dataset. Together, our contributions lay the groundwork for more robust layout-to-image generation under realistic and challenging scenarios. Project link: https://mlpc-ucsd.github.io/OverLayBench.
TLDR: This paper introduces OverLayBench, a new benchmark and OverLayScore metric to specifically evaluate layout-to-image generation models' performance on scenes with dense overlapping bounding boxes, a known weakness of current methods. They also propose a fine-tuned model, CreatiLayout-AM, as a starting point for improvement.
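The abstract does not give OverLayScore's formula, so the sketch below uses a summed pairwise IoU purely to illustrate the idea of scoring a layout by how many boxes overlap and how heavily:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def overlap_complexity(boxes):
    """Illustrative layout-overlap score: the sum of pairwise IoUs.
    (Not the published OverLayScore definition; it only captures the
    intuition that more and heavier overlaps yield a higher score.)"""
    total = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            total += iou(boxes[i], boxes[j])
    return total
```

A benchmark "balanced across levels" of such a score would bin layouts by this value and sample evenly per bin, rather than inheriting the low-overlap bias of existing datasets.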
Read Paper (PDF)
Autonomous vehicles (AVs) are expected to revolutionize transportation by improving efficiency and safety. Their success relies on 3D vision systems that effectively sense the environment and detect traffic agents. Among the sensors AVs use to create a comprehensive view of their surroundings, LiDAR provides high-resolution depth data enabling accurate object detection, safe navigation, and collision avoidance. However, collecting real-world LiDAR data is time-consuming and often affected by noise and sparsity due to adverse weather or sensor limitations. This work applies a denoising diffusion probabilistic model (DDPM), enhanced with novel noise scheduling and time-step embedding techniques, to generate high-quality synthetic data for augmentation, thereby improving performance across a range of computer vision tasks, particularly in AV perception. These modifications impact the denoising process and the model's temporal awareness, allowing it to produce more realistic point clouds based on the projection. The proposed method was extensively evaluated under various configurations using the IAMCV and KITTI-360 datasets, with four performance metrics compared against state-of-the-art (SOTA) methods. The results demonstrate the model's superior performance over most existing baselines and its effectiveness in mitigating the effects of noisy and sparse LiDAR data, producing diverse point clouds with rich spatial relationships and structural detail.
TLDR: This paper presents a DDPM-based approach for generating high-quality synthetic LiDAR point cloud data to address the challenges of noise and sparsity in real-world data, demonstrating improved performance in AV perception tasks.
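The abstract leaves the paper's novel schedule and embedding unspecified; for context, these are the standard cosine noise schedule and sinusoidal time-step embedding that such modifications typically build on (standard formulations, not the paper's variants):

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Cosine cumulative signal level alpha-bar(t), the widely used
    baseline schedule: near 1 early in diffusion, decaying toward 0."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def timestep_embedding(t, dim):
    """Standard sinusoidal time-step embedding of size `dim` (even):
    sin/cos pairs at geometrically spaced frequencies, giving the
    network its 'temporal awareness' of the denoising step."""
    half = dim // 2
    emb = []
    for i in range(half):
        freq = math.exp(-math.log(10000.0) * i / half)
        emb.append(math.sin(t * freq))
        emb.append(math.cos(t * freq))
    return emb
```

Changing either component changes how noise is distributed across steps and how sharply the model can distinguish nearby steps, which is the lever the paper's modifications pull on.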
Read Paper (PDF)
Fusing cross-category objects into a single coherent object has gained increasing attention in text-to-image (T2I) generation due to its broad applications in virtual reality, digital media, film, and gaming. However, existing methods often produce biased, visually chaotic, or semantically inconsistent results due to overlapping artifacts and poor integration. Moreover, progress in this field has been limited by the absence of a comprehensive benchmark dataset. To address these problems, we propose Adaptive Group Swapping (AGSwap), a simple yet highly effective approach comprising two key components: (1) Group-wise Embedding Swapping, which fuses semantic attributes from different concepts through feature manipulation, and (2) Adaptive Group Updating, a dynamic optimization mechanism guided by a balance evaluation score to ensure coherent synthesis. Additionally, we introduce Cross-category Object Fusion (COF), a large-scale, hierarchically structured dataset built upon ImageNet-1K and WordNet. COF includes 95 superclasses, each with 10 subclasses, enabling 451,250 unique fusion pairs. Extensive experiments demonstrate that AGSwap outperforms state-of-the-art compositional T2I methods, including GPT-Image-1, on both simple and complex prompts.
TLDR: The paper introduces AGSwap, a method for fusing cross-category objects in text-to-image generation using group-wise embedding swapping and adaptive group updating, along with a new large-scale dataset, COF, for benchmarking.
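Group-wise embedding swapping can be illustrated schematically: partition the embedding dimensions into groups and exchange one group between the two concepts' embeddings. This is a simplification of what AGSwap does on diffusion-model features, with plain Python lists standing in for tensors:

```python
def group_swap(emb_a, emb_b, group, n_groups):
    """Schematic group-wise embedding swap: split the embedding
    dimensions into `n_groups` contiguous groups and exchange the
    chosen group between the two concept embeddings."""
    assert len(emb_a) == len(emb_b) and len(emb_a) % n_groups == 0
    size = len(emb_a) // n_groups
    lo, hi = group * size, (group + 1) * size
    a = emb_a[:lo] + emb_b[lo:hi] + emb_a[hi:]
    b = emb_b[:lo] + emb_a[lo:hi] + emb_b[hi:]
    return a, b
```

The "Adaptive" part of AGSwap then decides which groups to exchange, guided by a balance evaluation score, rather than swapping a fixed slice as above.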
Read Paper (PDF)
Diffusion models have significantly advanced image manipulation techniques, and their ability to generate photorealistic images is beginning to transform retail workflows, particularly in presale visualization. Beyond artistic style transfer, the capability to perform fine-grained visual feature transfer is becoming increasingly important. Embroidery is a textile art form characterized by intricate interplay of diverse stitch patterns and material properties, which poses unique challenges for existing style transfer methods. To explore the customization for such fine-grained features, we propose a novel contrastive learning framework that disentangles fine-grained style and content features with a single reference image, building on the classic concept of image analogy. We first construct an image pair to define the target style, and then adopt a similarity metric based on the decoupled representations of pretrained diffusion models for style-content separation. Subsequently, we propose a two-stage contrastive LoRA modulation technique to capture fine-grained style features. In the first stage, we iteratively update the whole LoRA and the selected style blocks to initially separate style from content. In the second stage, we design a contrastive learning strategy to further decouple style and content through self-knowledge distillation. Finally, we build an inference pipeline to handle image or text inputs with only the style blocks. To evaluate our method on fine-grained style transfer, we build a benchmark for embroidery customization. Our approach surpasses prior methods on this task and further demonstrates strong generalization to three additional domains: artistic style transfer, sketch colorization, and appearance transfer.
TLDR: This paper introduces a contrastive learning framework with LoRA modulation for fine-grained style transfer, specifically applied to embroidery customization with diffusion models, showing strong generalization to other domains.
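For reference, the LoRA update being modulated has this shape. This is a generic LoRA forward pass, not the paper's two-stage contrastive scheme; `matvec` is a helper defined here for the sketch:

```python
def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0):
    """LoRA-style forward pass: y = W x + (alpha / r) * B (A x).
    W is the frozen pretrained weight; only the low-rank factors
    A (r x d_in) and B (d_out x r) are trained, so 'style blocks'
    can be carried around as small adapter matrices."""
    r = len(A)                       # adapter rank
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]
```

Because the adapter is additive and low-rank, updating "the whole LoRA" versus "only the selected style blocks" (as in the paper's first stage) just means choosing which A/B pairs receive gradients.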
Read Paper (PDF)
The scarcity of annotated surgical data poses a significant challenge for developing deep learning systems in computer-assisted interventions. While diffusion models can synthesize realistic images, they often suffer from data memorization, resulting in inconsistent or non-diverse samples that may fail to improve, or even harm, downstream performance. We introduce Surgical Application-Aligned Diffusion (SAADi), a new framework that aligns diffusion models with samples preferred by downstream models. Our method constructs pairs of preferred and non-preferred synthetic images and employs lightweight fine-tuning of diffusion models to align the image generation process with downstream objectives explicitly. Experiments on three surgical datasets demonstrate consistent gains of 7-9% in classification and 2-10% in segmentation tasks, with considerable improvements observed for underrepresented classes. Iterative refinement of synthetic samples further boosts performance by 4-10%. Unlike baseline approaches, our method overcomes sample degradation and establishes task-aware alignment as a key principle for mitigating data scarcity and advancing surgical vision applications.
TLDR: The paper introduces Surgical Application-Aligned Diffusion (SAADi), a method to fine-tune diffusion models to generate synthetic surgical images preferred by downstream tasks, leading to improved performance in classification and segmentation, especially for underrepresented classes.
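Fine-tuning a generator on (preferred, non-preferred) pairs is reminiscent of DPO-style preference optimization. The loss below is an illustrative stand-in for that family of objectives, not SAADi's published loss:

```python
import math

def preference_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """DPO-style preference objective on one (preferred w, non-preferred l)
    pair: push the fine-tuned model to raise the log-likelihood margin of
    the preferred sample relative to a frozen reference model.
    Returns -log sigmoid(beta * margin), which shrinks as the margin grows."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Here the "preference" would come from the downstream classifier or segmenter (e.g. which synthetic image it handles better), which is what makes the alignment task-aware rather than purely perceptual.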
Read Paper (PDF)
Inverting corrupted images into the latent space of diffusion models is challenging. Current methods, which encode an image into a single latent vector, struggle to balance structural fidelity with semantic accuracy, leading to reconstructions with semantic drift, such as blurred details or incorrect attributes. To overcome this, we introduce Prompt-Guided Dual Latent Steering (PDLS), a novel, training-free framework built upon Rectified Flow models for their stable inversion paths. PDLS decomposes the inversion process into two complementary streams: a structural path to preserve source integrity and a semantic path guided by a prompt. We formulate this dual guidance as an optimal control problem and derive a closed-form solution via a Linear Quadratic Regulator (LQR). This controller dynamically steers the generative trajectory at each step, preventing semantic drift while ensuring the preservation of fine detail without costly, per-image optimization. Extensive experiments on FFHQ-1K and ImageNet-1K under various inversion tasks, including Gaussian deblurring, motion deblurring, super-resolution and freeform inpainting, demonstrate that PDLS produces reconstructions that are both more faithful to the original image and better aligned with the semantic information than single-latent baselines.
TLDR: The paper introduces Prompt-Guided Dual Latent Steering (PDLS), a training-free framework leveraging Rectified Flow models for improved image inversion, balancing structural fidelity and semantic accuracy using a dual-stream approach guided by optimal control.
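The LQR machinery behind such a controller can be sketched in the scalar case: a backward Riccati recursion yields closed-form feedback gains that steer the state at every step without any per-instance optimization. This is toy 1-D dynamics, not the paper's latent-space formulation:

```python
def lqr_gains(a, b, q, r, qT, horizon):
    """Finite-horizon scalar LQR via backward Riccati recursion.
    Dynamics x' = a*x + b*u, stage cost q*x^2 + r*u^2, terminal
    cost qT*x^2. Returns the feedback gains K_t, computed in
    closed form -- the same flavor of solution PDLS derives."""
    P = qT
    gains = []
    for _ in range(horizon):
        K = (b * P * a) / (r + b * P * b)
        P = q + a * P * a - a * P * b * K
        gains.append(K)
    gains.reverse()              # gains[t] is the feedback at step t
    return gains

def rollout(x0, a, b, gains):
    """Apply the optimal control u_t = -K_t * x_t along the trajectory."""
    x = x0
    for K in gains:
        x = a * x + b * (-K * x)
    return x
```

The appeal for inversion is exactly this closed form: once the gains are computed, steering each sampling step is a single multiply, with no costly per-image inner optimization loop.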
Read Paper (PDF)
Towards intelligent image editing, object removal should eliminate both the target object and its causal visual artifacts, such as shadows and reflections. However, existing image appearance-based methods either follow strictly mask-aligned training and fail to remove these causal effects which are not explicitly masked, or adopt loosely mask-aligned strategies that lack controllability and may unintentionally over-erase other objects. We identify that these limitations stem from ignoring the causal relationship between an object's geometry presence and its visual effects. To address this limitation, we propose a geometry-aware two-stage framework that decouples object removal into (1) geometry removal and (2) appearance rendering. In the first stage, we remove the object directly from the geometry (e.g., depth) using strictly mask-aligned supervision, enabling structure-aware editing with strong geometric constraints. In the second stage, we render a photorealistic RGB image conditioned on the updated geometry, where causal visual effects are considered implicitly as a result of the modified 3D geometry. To guide learning in the geometry removal stage, we introduce a preference-driven objective based on positive and negative sample pairs, encouraging the model to remove objects as well as their causal visual artifacts while avoiding new structural insertions. Extensive experiments demonstrate that our method achieves state-of-the-art performance in removing both objects and their associated artifacts on two popular benchmarks. The code is available at https://github.com/buxiangzhiren/GeoRemover.
TLDR: The paper introduces GeoRemover, a geometry-aware framework for removing objects and their causal visual artifacts (shadows, reflections) from images, using a two-stage process of geometry removal followed by appearance rendering based on the updated geometry.
Read Paper (PDF)