AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

June 11, 2025

Diffuse and Disperse: Image Generation with Representation Regularization

The development of diffusion-based generative models over the past decade has largely proceeded independently of progress in representation learning. These diffusion models typically rely on regression-based objectives and generally lack explicit regularization. In this work, we propose Dispersive Loss, a simple plug-and-play regularizer that effectively improves diffusion-based generative models. Our loss function encourages internal representations to disperse in the hidden space, analogous to contrastive self-supervised learning, with the key distinction that it requires no positive sample pairs and therefore does not interfere with the sampling process used for regression. Compared to the recent method of representation alignment (REPA), our approach is self-contained and minimalist, requiring no pre-training, no additional parameters, and no external data. We evaluate Dispersive Loss on the ImageNet dataset across a range of models and report consistent improvements over widely used and strong baselines. We hope our work will help bridge the gap between generative modeling and representation learning.
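The dispersive idea can be sketched as an InfoNCE-style repulsive term over a batch of hidden representations. This is a hedged illustration, not the paper's exact objective: the specific loss form, the temperature `tau`, and the L2 normalization are assumptions.

```python
import numpy as np

def dispersive_loss(z, tau=0.5):
    """Toy dispersive regularizer (a sketch, not the paper's exact form).

    z: (N, D) batch of intermediate representations.
    Minimizing this term pushes representations apart in hidden space;
    no positive pairs are needed, unlike standard contrastive losses.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # normalization is an assumption
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    mask = ~np.eye(z.shape[0], dtype=bool)  # exclude self-pairs
    return np.log(np.mean(np.exp(-d2[mask] / tau)))
```

A fully collapsed batch (identical representations) attains the maximum value 0, while any dispersed batch scores lower, so adding such a term to the diffusion regression loss penalizes representation collapse.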

TLDR: This paper introduces 'Dispersive Loss,' a simple regularization technique for diffusion-based image generation models that encourages dispersed internal representations, improving performance without pre-training or external data.


Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)

Authors: Runqian Wang, Kaiming He

Product of Experts for Visual Generation

Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources -- including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators -- remains under-explored. We propose a Product of Experts (PoE) framework that performs inference-time knowledge composition from heterogeneous models. This training-free approach samples from the product distribution across experts via Annealed Importance Sampling (AIS). Our framework shows practical benefits in image and video synthesis tasks, yielding better controllability than monolithic methods and additionally providing flexible user interfaces for specifying visual generation goals.
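As a toy illustration of the product-of-experts idea: the paper composes heterogeneous models via Annealed Importance Sampling, whereas in this hedged sketch each expert is a 1D Gaussian, so the product density has a closed form.

```python
import numpy as np

def product_of_gaussian_experts(mus, sigmas):
    """Combine 1D Gaussian expert densities p(x) proportional to the
    product of the p_k(x). For Gaussians the product is again Gaussian:
    precisions add, and the mean is the precision-weighted average of
    the expert means."""
    mus, sigmas = np.asarray(mus, float), np.asarray(sigmas, float)
    precisions = 1.0 / sigmas**2
    total_prec = precisions.sum()
    mu = (precisions * mus).sum() / total_prec
    return mu, 1.0 / np.sqrt(total_prec)
```

For example, combining N(0, 1) and N(2, 1) yields N(1, 1/√2): each expert vetoes regions it finds implausible, which is why product composition can give tighter control than any single model.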

TLDR: The paper introduces a training-free Product of Experts (PoE) framework for image and video synthesis, enabling knowledge composition from heterogeneous models (e.g., generative models, language models, physics simulators) via Annealed Importance Sampling, achieving better controllability and flexible user interfaces.


Relevance: (9/10)
Novelty: (7/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Yunzhi Zhang, Carson Murtuza-Lanier, Zizhang Li, Yilun Du, Jiajun Wu

How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models

With the rapid development of text-to-vision diffusion models, classifier-free guidance has emerged as the most prevalent method for conditioning. However, this approach inherently requires twice as many forward passes per step as unconditional generation, resulting in significantly higher costs. While previous work has introduced the concept of adaptive guidance, it lacks solid analysis and empirical results, so it cannot be applied to general diffusion models. In this work, we present another perspective on adaptive guidance and propose Step AG, a simple, universally applicable adaptive guidance strategy. Our evaluations focus on both image quality and image-text alignment. The results indicate that restricting classifier-free guidance to the first several denoising steps is sufficient for generating high-quality, well-conditioned images, achieving an average speedup of 20% to 30%. This improvement is consistent across different settings, such as the number of inference steps, and across various models, including video generation models, highlighting the superiority of our method.
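The Step AG idea can be sketched as follows. This is a hedged sketch: the function name, the `guided_fraction` parameter, and its default value are assumptions rather than the paper's API, and in a real sampler the unconditional forward pass would simply be skipped once guidance is off.

```python
def cfg_step(eps_cond, eps_uncond, scale, step, total_steps, guided_fraction=0.3):
    """Classifier-free guidance restricted to the first denoising steps.

    Early steps: the standard CFG combination of conditional and
    unconditional noise predictions. Later steps: the conditional
    prediction alone, halving the model forwards for those steps.
    """
    if step < guided_fraction * total_steps:
        return eps_uncond + scale * (eps_cond - eps_uncond)
    return eps_cond
```

Since only the first `guided_fraction` of steps run both forward passes, the total forward count drops from 2T toward (1 + guided_fraction)·T, consistent with the reported 20-30% speedup.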

TLDR: This paper introduces Step AG, an adaptive classifier-free guidance strategy for diffusion models that improves inference speed by 20-30% by restricting guidance to the initial denoising steps without sacrificing image quality or text alignment, applicable to both image and video generation.


Relevance: (9/10)
Novelty: (7/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Huixuan Zhang, Junzhe Zhang, Xiaojun Wan

Highly Compressed Tokenizer Can Generate Without Training

Commonly used image tokenizers produce a 2D grid of spatially arranged tokens. In contrast, so-called 1D image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities through heuristic manipulation of tokens, demonstrating that even very crude manipulations -- such as copying and replacing tokens between latent representations of images -- enable fine-grained image editing by transferring appearance and semantic attributes. Motivated by the expressivity of the 1D tokenizer's latent space, we construct an image generation pipeline leveraging gradient-based test-time optimization of tokens with plug-and-play loss functions such as reconstruction or CLIP similarity. Our approach is demonstrated for inpainting and text-guided image editing use cases, and can generate diverse and realistic samples without requiring training of any generative model.
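The "crude manipulation" described above can be sketched as a token copy between two 1D latent sequences. This is a hedged sketch; the function name and the choice of positions are illustrative, not the paper's interface.

```python
import numpy as np

def transfer_tokens(src_tokens, dst_tokens, positions):
    """Copy discrete tokens from a source image's 1D latent sequence
    into a target's at the given positions. Because each of the ~32
    tokens is highly compressed, such swaps can transfer appearance
    or semantic attributes between images once the result is decoded."""
    out = np.asarray(dst_tokens).copy()
    out[list(positions)] = np.asarray(src_tokens)[list(positions)]
    return out
```

The same latent space also supports the paper's generation pipeline: instead of copying tokens heuristically, the tokens are optimized by gradient descent at test time against a plug-and-play loss (e.g., reconstruction on unmasked regions for inpainting, or CLIP similarity for text guidance).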

TLDR: This paper introduces a method for image editing and generation using a highly compressed 1D image tokenizer combined with test-time optimization, achieving results without training a generative model.


Relevance: (8/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: L. Lao Beyer, T. Li, X. Chen, S. Karaman, K. He