AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

May 17, 2025

DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling

Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant, predominantly capturing local patterns, highlighting the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity. This leads to Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules, offering strong generative performance with significant efficiency gains. On class-conditional ImageNet benchmarks, DiCo outperforms previous diffusion models in both image quality and generation speed. Notably, DiCo-XL achieves an FID of 2.05 at 256x256 resolution and 2.53 at 512x512, with a 2.7x and 3.1x speedup over DiT-XL/2, respectively. Furthermore, our largest model, DiCo-H, scaled to 1B parameters, reaches an FID of 1.90 on ImageNet 256x256, without any additional supervision during training. Code: https://github.com/shallowdream204/DiCo.
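The abstract does not specify the compact channel attention module. As a rough illustration only, a squeeze-and-excitation-style per-channel gate (all names and shapes below are assumptions, not the paper's actual design) might look like:

```python
import numpy as np

def compact_channel_attention(x, w1, w2):
    """Illustrative squeeze-and-excitation-style channel gate.

    x:  (C, H, W) feature map
    w1: (C // r, C) bottleneck projection; w2: (C, C // r) expansion
    """
    s = x.mean(axis=(1, 2))                  # squeeze: global average pool -> (C,)
    h = np.maximum(w1 @ s, 0.0)              # bottleneck + ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ h)))   # per-channel sigmoid gate in (0, 1)
    return x * gate[:, None, None]           # reweight channels
```

A per-channel gate of this kind can rebalance underused channels against dominant ones, which is the sort of mechanism the abstract credits with reducing channel redundancy.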

TLDR: This paper introduces Diffusion ConvNet (DiCo), a convolutional neural network architecture for diffusion models that achieves state-of-the-art image generation performance with significant efficiency gains compared to Diffusion Transformers (DiT). It addresses the channel redundancy issue in ConvNets using a compact channel attention mechanism.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (9/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang

A Fourier Space Perspective on Diffusion Models

Diffusion models are state-of-the-art generative models on data modalities such as images, audio, proteins and materials. These modalities share the property of exponentially decaying variance and magnitude in the Fourier domain. Under the standard Denoising Diffusion Probabilistic Models (DDPM) forward process of additive white noise, this property results in high-frequency components being corrupted faster and earlier in terms of their Signal-to-Noise Ratio (SNR) than low-frequency ones. The reverse process then generates low-frequency information before high-frequency details. In this work, we study the inductive bias of the forward process of diffusion models in Fourier space. We theoretically analyse and empirically demonstrate that the faster noising of high-frequency components in DDPM results in violations of the normality assumption in the reverse process. Our experiments show that this leads to degraded generation quality of high-frequency components. We then study an alternate forward process in Fourier space which corrupts all frequencies at the same rate, removing the typical frequency hierarchy during generation, and demonstrate marked performance improvements on datasets where high frequencies are primary, while performing on par with DDPM on standard imaging benchmarks.
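The frequency-dependent SNR argument can be made concrete: under the DDPM forward step x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps with white noise eps, the noise spectrum is flat while natural signals concentrate energy at low frequencies, so per-mode SNR collapses fastest at high frequencies. A minimal NumPy sketch (illustrative, not the paper's code):

```python
import numpy as np

def ddpm_snr_per_frequency(x, alpha_bar):
    """Per-Fourier-mode SNR under the DDPM forward process
    x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, eps white."""
    X = np.fft.fft2(x)
    signal_power = alpha_bar * np.abs(X) ** 2     # signal spectrum scales with |X|^2
    noise_power = (1.0 - alpha_bar) * x.size      # white noise: flat power spectrum
    return signal_power / noise_power
```

The alternate forward process studied in the paper would instead scale the injected noise per mode so that this ratio decays at the same rate across all frequencies.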

TLDR: This paper analyzes the frequency bias in diffusion models, showing that high-frequency components are degraded faster during the forward process, leading to suboptimal generation. They propose and demonstrate an improved forward process that corrupts all frequencies equally, resulting in better performance on datasets with high-frequency importance.

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Fabian Falck, Teodora Pandeva, Kiarash Zahirnia, Rachel Lawrence, Richard Turner, Edward Meeds, Javier Zazo, Sushrut Karmalkar

Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models

Diffusion models have made substantial advances in image generation, yet models trained on large, unfiltered datasets often yield outputs misaligned with human preferences. Numerous methods have been proposed to fine-tune pre-trained diffusion models, achieving notable improvements in aligning generated outputs with human preferences. However, we argue that existing preference alignment methods neglect the critical role of handling unconditional/negative-conditional outputs, leading to a diminished capacity to avoid generating undesirable outcomes. This oversight limits the efficacy of classifier-free guidance (CFG), which relies on the contrast between conditional generation and unconditional/negative-conditional generation to optimize output quality. In response, we propose a straightforward yet effective and versatile approach that involves training a model specifically attuned to negative preferences. This method does not require new training strategies or datasets but rather involves minor modifications to existing techniques. Our approach integrates seamlessly with models such as SD1.5, SDXL, video diffusion models and models that have undergone preference optimization, consistently enhancing their alignment with human preferences.
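CFG's dependence on the unconditional/negative branch, which this paper exploits, is visible in the guidance formula itself. A sketch where `eps_neg` would come from a separately trained negative-preference model (hypothetical names, not the authors' code):

```python
import numpy as np

def cfg_with_npo(eps_cond, eps_neg, scale):
    """Classifier-free guidance: extrapolate away from the negative
    branch's noise prediction toward the conditional one."""
    return eps_neg + scale * (eps_cond - eps_neg)
```

Swapping the vanilla unconditional prediction for one tuned on negative preferences sharpens the direction that guidance pushes away from, which is the intuition behind training a model attuned to negative preferences.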

TLDR: The paper introduces Diffusion-NPO, a method to improve preference alignment in diffusion models by specifically addressing negative preferences, leading to better generation quality and seamless integration with existing models.

Relevance: (9/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Fu-Yun Wang, Yunhao Shui, Jingtan Piao, Keqiang Sun, Hongsheng Li

Towards Self-Improvement of Diffusion Models via Group Preference Optimization

Aligning text-to-image (T2I) diffusion models with Direct Preference Optimization (DPO) has shown notable improvements in generation quality. However, applying DPO to T2I faces two challenges: the sensitivity of DPO to preference pairs and the labor-intensive process of collecting and annotating high-quality data. In this work, we demonstrate that preference pairs with marginal differences can degrade DPO performance. Since DPO relies exclusively on relative ranking while disregarding the absolute difference between pairs, it may misclassify losing samples as wins, or vice versa. We empirically show that extending DPO from pairwise to groupwise and incorporating reward standardization for reweighting leads to performance gains without explicit data selection. Furthermore, we propose Group Preference Optimization (GPO), an effective self-improvement method that enhances performance by leveraging the model's own capabilities without requiring external data. Extensive experiments demonstrate that GPO is effective across various diffusion models and tasks. Specifically, when combined with widely used computer vision models such as YOLO and OCR, GPO improves the accurate counting and text rendering capabilities of Stable Diffusion 3.5 Medium by 20 percentage points. Notably, as a plug-and-play method, it introduces no extra overhead during inference.
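The groupwise reward standardization the abstract describes can be sketched in a few lines (one plausible reading, not the authors' implementation): score a group of samples for one prompt with a reward model, then standardize within the group so near-tied pairs contribute almost nothing:

```python
import numpy as np

def groupwise_advantages(rewards):
    """Standardize rewards within one prompt's group of samples.

    Marginal differences map to near-zero weights, so ambiguous
    win/lose pairs no longer dominate the preference signal.
    """
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std < 1e-8:                 # all samples tied: no usable signal
        return np.zeros_like(r)
    return (r - r.mean()) / std
```

Because the signal is generated from the model's own samples and a reward model, no external preference dataset is needed, matching the self-improvement framing above.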

TLDR: The paper introduces Group Preference Optimization (GPO), a self-improvement method for text-to-image diffusion models that addresses the data sensitivity issue of Direct Preference Optimization (DPO) by using groupwise preferences and reward standardization, leading to improved performance without requiring external data.

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Renjie Chen, Wenfeng Lin, Yichen Zhang, Jiangchuan Wei, Boyuan Liu, Chao Feng, Jiao Ran, Mingyu Guo

DDAE++: Enhancing Diffusion Models Towards Unified Generative and Discriminative Learning

While diffusion models have gained prominence in image synthesis, their generative pre-training has been shown to yield discriminative representations, paving the way towards unified visual generation and understanding. However, two key questions remain: 1) Can these representations be leveraged to improve the training of diffusion models themselves, rather than solely benefiting downstream tasks? 2) Can the feature quality be enhanced to rival or even surpass modern self-supervised learners, without compromising generative capability? This work addresses these questions by introducing self-conditioning, a straightforward yet effective mechanism that internally leverages the rich semantics inherent in the denoising network to guide its own decoding layers, forming a tighter bottleneck that condenses high-level semantics to improve generation. Results are compelling: our method boosts both generation FID and recognition accuracy with 1% computational overhead and generalizes across diverse diffusion architectures. Crucially, self-conditioning facilitates an effective integration of discriminative techniques, such as contrastive self-distillation, directly into diffusion models without sacrificing generation quality. Extensive experiments on pixel-space and latent-space datasets show that in linear evaluations, our enhanced diffusion models, particularly UViT and DiT, serve as strong representation learners, surpassing various self-supervised models.
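The abstract does not spell out how self-conditioning injects the network's own semantics into its decoding layers. One plausible reading, sketched below with hypothetical names and an AdaLN-style modulation (an assumption, not the paper's mechanism), is to pool mid-network features into a semantic vector and use it to scale and shift decoder features:

```python
import numpy as np

def self_condition(decoder_feat, encoder_sem, w_proj):
    """Illustrative self-conditioning via feature modulation.

    decoder_feat: (C, H, W) decoder-layer features
    encoder_sem:  (D, H, W) mid-network features carrying semantics
    w_proj:       (2C, D) projection to per-channel scale and shift
    """
    sem = encoder_sem.mean(axis=(1, 2))           # pool to a semantic vector (D,)
    scale_shift = w_proj @ sem                    # (2C,): scale then shift
    C = decoder_feat.shape[0]
    scale, shift = scale_shift[:C], scale_shift[C:]
    return decoder_feat * (1.0 + scale[:, None, None]) + shift[:, None, None]
```

Routing decoding through a pooled semantic vector is one way to realize the "tighter bottleneck" the abstract mentions, since the decoder then depends on condensed high-level features rather than only on skip connections.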

TLDR: The paper introduces a self-conditioning mechanism (DDAE++) to improve diffusion models for both generative and discriminative tasks, achieving better FID and recognition accuracy with minimal overhead and surpassing existing self-supervised learning methods in representation learning.

Relevance: (8/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Weilai Xiang, Hongyu Yang, Di Huang, Yunhong Wang