Daily papers related to Image/Video/Multimodal Generation from cs.CV
August 23, 2025
Recent advancements in video generation have substantially improved visual quality and temporal coherence, making these models increasingly appealing for applications such as autonomous driving, particularly in the context of driving simulation and so-called "world models". In this work, we investigate the effects of existing fine-tuning video generation approaches on structured driving datasets and uncover a potential trade-off: although visual fidelity improves, spatial accuracy in modeling dynamic elements may degrade. We attribute this degradation to a shift in the alignment between visual quality and dynamic understanding objectives. In datasets whose scene structure varies over time, where objects or perspectives shift in varied ways, these objectives tend to be highly correlated. However, the very regular and repetitive nature of driving scenes allows visual quality to improve by modeling dominant scene motion patterns, without necessarily preserving fine-grained dynamic behavior. As a result, fine-tuning encourages the model to prioritize surface-level realism over dynamic accuracy. To further examine this phenomenon, we show that simple continual learning strategies, such as replay from diverse domains, can offer a balanced alternative by preserving spatial accuracy while maintaining strong visual quality.
TLDR: The paper identifies a trade-off in fine-tuning video generators for driving simulation: improved visual fidelity can degrade spatial accuracy in dynamic elements. They propose continual learning strategies for a better balance.
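As an illustration of the replay strategy mentioned in the abstract, the sketch below mixes driving-domain batches with batches replayed from a diverse source domain during fine-tuning. It is a minimal sketch under stated assumptions, not the paper's implementation; `video_model`, `driving_loader`, `replay_loader`, and the `denoising_loss` call are placeholders.

```python
# Minimal sketch of replay-based continual fine-tuning (illustrative only).
# All names below are placeholders, not the paper's actual code.
import itertools
import torch


def finetune_with_replay(video_model, driving_loader, replay_loader,
                         optimizer, steps=10_000, replay_every=4):
    """Fine-tune on driving data, periodically replaying diverse-domain batches."""
    driving_iter = itertools.cycle(driving_loader)
    replay_iter = itertools.cycle(replay_loader)

    for step in range(steps):
        # Every `replay_every`-th step, train on a batch drawn from the diverse
        # pre-training distribution instead of the driving dataset.
        use_replay = (step % replay_every == replay_every - 1)
        batch = next(replay_iter) if use_replay else next(driving_iter)

        loss = video_model.denoising_loss(batch)  # placeholder training loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```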
Diffusion models have emerged as a powerful paradigm for generative tasks such as image synthesis and video generation, with Transformer architectures further enhancing performance. However, the high computational cost of diffusion Transformers, stemming from a large number of sampling steps and complex per-step computations, presents significant challenges for real-time deployment. In this paper, we introduce OmniCache, a training-free acceleration method that exploits the global redundancy inherent in the denoising process. Unlike existing methods that determine caching strategies based on inter-step similarities and tend to prioritize reusing later sampling steps, our approach originates from the sampling perspective of DiT models. We systematically analyze the model's sampling trajectories and strategically distribute cache reuse across the entire sampling process. This global perspective enables more effective utilization of cached computations throughout the diffusion trajectory, rather than concentrating reuse within limited segments of the sampling procedure. In addition, during cache reuse, we dynamically estimate the corresponding noise and filter it out to reduce its impact on the sampling direction. Extensive experiments demonstrate that our approach accelerates the sampling process while maintaining competitive generative quality, offering a promising and practical solution for efficient deployment of diffusion-based generative models.
TLDR: The paper introduces OmniCache, a training-free caching method for accelerating diffusion transformer models by strategically reusing computations across the entire sampling trajectory, considering a global perspective.
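To make the "globally distributed cache reuse" idea concrete, here is a hedged sketch of a DiT-style sampling loop that spreads reused steps evenly over the trajectory and crudely damps the reused output as a stand-in for the paper's noise filtering. `dit` and the diffusers-style `scheduler` are assumptions, not the OmniCache release.

```python
# Illustrative sketch of globally distributed cache reuse in a DiT-style
# sampling loop (not the OmniCache implementation).
import torch


def sample_with_global_cache(dit, scheduler, latents, num_steps=50, reuse_ratio=0.5):
    # Spread reused steps evenly over the whole trajectory instead of
    # clustering them near the end of sampling.
    num_reused = int(num_steps * reuse_ratio)
    reuse_steps = set(torch.linspace(1, num_steps - 1, num_reused).round().int().tolist())

    cached_out = None
    for step, t in enumerate(scheduler.timesteps[:num_steps]):
        if step in reuse_steps and cached_out is not None:
            # Reuse the cached transformer output; pull it slightly toward its
            # spatial mean as a crude stand-in for the paper's noise filtering.
            out = cached_out - 0.1 * (cached_out - cached_out.mean(dim=(-2, -1), keepdim=True))
        else:
            out = dit(latents, t)          # full forward pass, then cache it
            cached_out = out
        latents = scheduler.step(out, t, latents).prev_sample  # diffusers-style step
    return latents
```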
Diffusion Transformers (DiTs) have demonstrated exceptional performance in high-fidelity image and video generation. To reduce their substantial computational costs, feature caching techniques have been proposed to accelerate inference by reusing hidden representations from previous timesteps. However, current methods often struggle to maintain generation quality at high acceleration ratios, where prediction errors increase sharply due to the inherent instability of long-step forecasting. In this work, we adopt an ordinary differential equation (ODE) perspective on the hidden-feature sequence, modeling layer representations along the trajectory as a feature-ODE. We attribute the degradation of existing caching strategies to their inability to robustly integrate historical features under large skipping intervals. To address this, we propose FoCa (Forecast-then-Calibrate), which treats feature caching as a feature-ODE solving problem. Extensive experiments on image synthesis, video generation, and super-resolution tasks demonstrate the effectiveness of FoCa, especially under aggressive acceleration. Without additional training, FoCa achieves near-lossless speedups of 5.50 times on FLUX, 6.45 times on HunyuanVideo, 3.17 times on Inf-DiT, and maintains high quality with a 4.53 times speedup on DiT.
TLDR: The paper introduces FoCa, a novel feature caching technique for Diffusion Transformers that uses an ODE perspective to improve inference speed while maintaining generation quality, particularly at high acceleration ratios.
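The feature-ODE framing can be sketched as extrapolating cached hidden features across a skipped interval (forecast) and then correcting the extrapolation with a cheap refinement (calibrate). This is an illustration of the general idea only; `cheap_refine` and the blending weight are assumptions, not FoCa's actual solver.

```python
# Minimal forecast-then-calibrate sketch in the spirit of a feature-ODE view
# of cached hidden features; not the FoCa implementation.
import torch


def forecast_then_calibrate(feat_prev, feat_curr, skip, cheap_refine):
    """Extrapolate hidden features across `skip` timesteps, then calibrate.

    feat_prev, feat_curr: hidden features cached at two past timesteps.
    cheap_refine: placeholder for a lightweight correction function.
    """
    # Forecast: first-order extrapolation along the feature trajectory,
    # i.e. an explicit Euler step of the feature-ODE.
    velocity = feat_curr - feat_prev
    forecast = feat_curr + skip * velocity

    # Calibrate: blend the long-step forecast with a cheap correction to
    # keep it from drifting at large skipping intervals.
    correction = cheap_refine(forecast)
    return 0.5 * (forecast + correction)
```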
Although diffusion models have made good progress in the field of image generation, GANs\cite{huang2023adaptive} still have considerable room for development due to their unique advantages, such as WGAN\cite{liu2021comparing}, SSGAN\cite{guibas2021adaptive}\cite{zhang2022vsa}\cite{zhou2024adapt}, and so on. In this paper, we propose a novel two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN). This paper makes four contributions: 1) We propose the two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN), which not only improves image quality and human visual perception while retaining the advantages of existing GAN models, but also simplifies the training process and reduces the training cost of GAN networks. Our experimental results show that MSPG-SEN achieves state-of-the-art generation results on five datasets: INKK (89.7%), AWUN (78.3%), IONJ (85.5%), POKL (88.7%), and OPIN (96.4%). 2) We propose an adaptive perception-behavioral feedback loop (APFL), which effectively improves the robustness and training stability of the model and reduces the training cost. 3) We propose a globally connected two-flow dynamic residual network. Ablation experiments show that it effectively improves training efficiency and greatly improves generalization ability, with stronger flexibility. 4) We propose a new dynamic embedded attention mechanism (DEMA). Experiments show that the attention can be extended to a variety of image processing tasks, effectively capturing global-local information and improving feature separation and feature expression capabilities, while requiring minimal computing resources (88.7% on INJK) and exhibiting strong cross-task capability.
TLDR: This paper introduces a novel two-flow feedback multi-scale progressive GAN (MSPG-SEN) with adaptive perception-behavioral feedback, a globally connected two-flow dynamic residual network, and a dynamic embedded attention mechanism (DEMA), claiming state-of-the-art results on several image datasets. The method purports to improve image quality, training stability, and generalization ability while reducing training cost.
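The abstract describes DEMA only at a high level (attention that captures global-local information with a learned, dynamic weighting), so the sketch below is one plausible, hypothetical reading: a global self-attention branch and a local depthwise-convolution branch mixed by a per-token gate. Every name here is an assumption; the paper does not publish this code.

```python
# Hypothetical global-local attention block with a dynamic gate, offered only
# as an illustration of the kind of mechanism the abstract describes.
import torch
import torch.nn as nn


class GlobalLocalGatedAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_mix = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x):                       # x: (batch, tokens, dim)
        global_out, _ = self.global_attn(x, x, x)               # global context
        local_out = self.local_mix(x.transpose(1, 2)).transpose(1, 2)  # local context
        g = self.gate(x)                        # per-token dynamic weighting
        return g * global_out + (1 - g) * local_out
```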
Quantitative microstructural characterization is fundamental to materials science, where electron micrographs (EMs) provide indispensable high-resolution insights. However, progress in deep learning-based EM characterization has been hampered by the scarcity of large-scale, diverse, and expert-annotated datasets, due to acquisition costs, privacy concerns, and annotation complexity. To address this issue, we introduce UniEM-3M, the first large-scale and multimodal EM dataset for instance-level understanding. It comprises 5,091 high-resolution EMs, about 3 million instance segmentation labels, and image-level attribute-disentangled textual descriptions, a subset of which will be made publicly available. Furthermore, we are also releasing a text-to-image diffusion model trained on the entire collection to serve as both a powerful data augmentation tool and a proxy for the complete data distribution. To establish a rigorous benchmark, we evaluate various representative instance segmentation methods on the complete UniEM-3M and present UniEM-Net as a strong baseline model. Quantitative experiments demonstrate that this flow-based model outperforms other advanced methods on this challenging benchmark. Our multifaceted release of a partial dataset, a generative model, and a comprehensive benchmark -- available on Hugging Face -- will significantly accelerate progress in automated materials analysis.
TLDR: The paper introduces UniEM-3M, a large-scale electron micrograph dataset with instance segmentation labels and textual descriptions, along with a text-to-image diffusion model for data augmentation and a benchmark for instance segmentation methods, aiming to accelerate automated materials analysis.
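If the released text-to-image model is diffusers-compatible, using it as a data augmentation tool might look roughly like the sketch below. The checkpoint id is a placeholder (the abstract only says the release is on Hugging Face), and the prompt merely imitates the attribute-disentangled caption style described above.

```python
# Hedged sketch of EM data augmentation with a text-to-image diffusion model
# via the `diffusers` library. The repository id is hypothetical.
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("UniEM-3M/em-text2img")  # placeholder id
pipe = pipe.to("cuda")

# Attribute-disentangled style prompt, imitating the dataset's descriptions.
prompt = "electron micrograph, dense equiaxed grains, visible grain boundaries"
synthetic_em = pipe(prompt, num_inference_steps=30).images[0]
synthetic_em.save("augmented_em.png")
```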
Multi-modal creative writing (MMCW) aims to produce illustrated articles. Unlike common multi-modal generative (MMG) tasks such as storytelling or caption generation, MMCW is an entirely new and more abstract challenge where textual and visual contexts are not strictly related to each other. Existing methods for related tasks can be forcibly migrated to this track, but they require specific modality inputs or costly training, and often suffer from semantic inconsistencies between modalities. Therefore, the main challenge lies in economically performing MMCW with flexible interactive patterns, where the semantics of the output modalities are better aligned. In this work, we propose FlexMUSE with a T2I module to enable optional visual input. FlexMUSE promotes creativity and emphasizes the unification between modalities through a modality semantic alignment gate (msaGate) that restricts the textual input. Besides, an attention-based cross-modality fusion is proposed to augment the input features for semantic enhancement. The modality semantic creative direct preference optimization (mscDPO) within FlexMUSE extends the rejected samples to facilitate writing creativity. Moreover, to advance MMCW, we release a dataset called ArtMUSE, which contains around 3k calibrated text-image pairs. FlexMUSE achieves promising results, demonstrating its consistency, creativity and coherence.
TLDR: The paper introduces FlexMUSE, a framework for multi-modal creative writing (text and image generation) with flexible interaction, addressing semantic inconsistencies and costly training issues in existing methods. It also includes a new dataset, ArtMUSE.
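One plausible rendering of the msaGate plus attention-based cross-modality fusion is sketched below: each text token is gated by its agreement with the pooled visual context, and the gated text attends over the image features. This is a hypothetical illustration under those assumptions, not the authors' implementation.

```python
# Hypothetical semantic-alignment gate + cross-attention fusion, loosely
# following the abstract's description; not FlexMUSE's released code.
import torch
import torch.nn as nn


class SemanticAlignGateFusion(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.gate_proj = nn.Linear(2 * dim, 1)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_feats, image_feats):
        # Gate each text token by its agreement with the pooled visual context,
        # restricting poorly aligned textual input.
        img_ctx = image_feats.mean(dim=1, keepdim=True).expand_as(text_feats)
        gate = torch.sigmoid(self.gate_proj(torch.cat([text_feats, img_ctx], dim=-1)))
        gated_text = gate * text_feats

        # Attention-based cross-modality fusion for semantic enhancement.
        fused, _ = self.cross_attn(gated_text, image_feats, image_feats)
        return fused + gated_text
```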
The success of diffusion models has enabled effortless, high-quality image modifications that precisely align with users' intentions, thereby raising concerns about their potential misuse by malicious actors. Previous studies have attempted to mitigate such misuse through adversarial attacks. However, these approaches heavily rely on image-level inconsistencies, which pose fundamental limitations in addressing the influence of textual prompts. In this paper, we propose PromptFlare, a novel adversarial protection method designed to protect images from malicious modifications facilitated by diffusion-based inpainting models. Our approach leverages the cross-attention mechanism to exploit the intrinsic properties of prompt embeddings. Specifically, we identify and target a shared prompt token that is invariant and semantically uninformative, injecting adversarial noise to suppress the sampling process. The injected noise acts as a cross-attention decoy, diverting the model's focus away from meaningful prompt-image alignments and thereby neutralizing the effect of the prompt. Extensive experiments on the EditBench dataset demonstrate that our method achieves state-of-the-art performance across various metrics while significantly reducing computational overhead and GPU memory usage. These findings highlight PromptFlare as a robust and efficient protection against unauthorized image manipulations. The code is available at https://github.com/NAHOHYUN-SKKU/PromptFlare.
TLDR: The paper introduces PromptFlare, a novel adversarial defense method against malicious image modifications in diffusion-based inpainting models by injecting adversarial noise targeting uninformative prompt tokens, achieving state-of-the-art performance with reduced computational cost.
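At a high level, this kind of protection can be pictured as a PGD-style optimization of image-space noise that pushes cross-attention mass onto the shared, uninformative token. The sketch below only illustrates that generic objective; the `attn_to_shared_token` hook and the PGD budget are assumptions, not the released PromptFlare code.

```python
# Generic PGD-style sketch of optimizing an imperceptible perturbation so that
# cross-attention focuses on a shared decoy token; placeholder hooks only.
import torch


def protect_image(image, attn_to_shared_token, steps=40, eps=8 / 255, alpha=2 / 255):
    """`attn_to_shared_token(img)` is assumed to return a scalar cross-attention
    score on the shared token (higher = prompt influence more neutralized)."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        score = attn_to_shared_token(image + delta)
        loss = -score                      # maximize attention on the decoy token
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)        # keep the perturbation imperceptible
        delta.grad = None
    return (image + delta).clamp(0, 1).detach()
```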
Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, and finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels at unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model, HandMDM, that is robust across domains, including unseen sign categories from the same sign language, signs from another sign language, and non-sign hand movements. We contribute extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.
TLDR: The paper introduces HandMDM, a text-conditioned diffusion model for generating 3D hand motions from natural language descriptions, trained on a large-scale sign language dataset with LLM-generated labels, and demonstrates its robustness across different sign languages and non-sign hand movements.
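A minimal picture of how text conditioning enters such a motion diffusion model is sketched below as a single training step: noise a motion sequence, encode the description, and regress the noise from the conditioned denoiser. All modules and the simplified noising schedule are placeholders, not the HandMDM code.

```python
# Minimal sketch of a text-conditioned motion-diffusion training step;
# every module below is a placeholder, not the authors' implementation.
import torch
import torch.nn.functional as F


def training_step(denoiser, text_encoder, motion, caption_tokens, num_timesteps=1000):
    """motion: (batch, frames, joint_dims) 3D hand-motion sequence."""
    t = torch.randint(0, num_timesteps, (motion.size(0),), device=motion.device)
    noise = torch.randn_like(motion)

    # Simple linear blend toward noise as a stand-in for the actual
    # forward-diffusion schedule.
    alpha = 1.0 - t.float() / num_timesteps
    noisy = alpha[:, None, None] * motion + (1 - alpha)[:, None, None] * noise

    text_emb = text_encoder(caption_tokens)      # condition on the description
    pred_noise = denoiser(noisy, t, text_emb)
    return F.mse_loss(pred_noise, noise)
```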