ArXiv CS.CV Papers (Image/Video Generation)

Context-Aware Autoregressive Models for Multi-Conditional Image Generation

Autoregressive transformers have recently shown impressive image generation quality and efficiency on par with state-of-the-art diffusion models. Unlike diffusion architectures, autoregressive models can naturally incorporate arbitrary modalities into a single, unified token sequence, offering a concise solution for multi-conditional image generation tasks. In this work, we propose ContextAR, a flexible and effective framework for multi-conditional image generation. ContextAR embeds diverse conditions (e.g., canny edges, depth maps, poses) directly into the token sequence, preserving modality-specific semantics. To maintain spatial alignment while enhancing discrimination among different condition types, we introduce hybrid positional encodings that fuse Rotary Position Embedding with Learnable Positional Embedding. We design Conditional Context-aware Attention to reduce computational complexity while preserving effective intra-condition perception. Without any fine-tuning, ContextAR supports arbitrary combinations of conditions at inference time. Experimental results demonstrate the controllability and versatility of our approach, and show that it performs competitively against diffusion-based multi-conditional control approaches and the existing autoregressive baseline across diverse multi-condition driven scenarios. Project page: https://context-ar.github.io/
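The abstract does not spell out how Conditional Context-aware Attention restricts attention, so the following is only a minimal sketch of one plausible reading: condition tokens attend within their own condition block ("intra-condition perception"), while image tokens see every condition token and attend causally to earlier image tokens. The token layout, the function name `build_attention_mask`, and the choice of bidirectional attention inside a condition block are all assumptions for illustration, not the authors' implementation.

```python
def build_attention_mask(cond_lengths, num_image_tokens):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed (hypothetical) layout of the unified sequence:
    [cond_0 tokens | cond_1 tokens | ... | image tokens]
    """
    total = sum(cond_lengths) + num_image_tokens
    mask = [[False] * total for _ in range(total)]

    # Condition tokens: attend only inside their own condition block
    # (assumed bidirectional; the paper may use a different rule).
    start = 0
    for length in cond_lengths:
        for q in range(start, start + length):
            for k in range(start, start + length):
                mask[q][k] = True
        start += length

    # Image tokens: see all condition tokens (which precede them in the
    # sequence) and, causally, image tokens up to and including themselves.
    image_start = start
    for q in range(image_start, total):
        for k in range(q + 1):
            mask[q][k] = True
    return mask


mask = build_attention_mask(cond_lengths=[2, 2], num_image_tokens=3)
allowed = sum(sum(row) for row in mask)
print(allowed, "of", len(mask) ** 2, "query-key pairs allowed")
```

With two 2-token conditions and 3 image tokens, this mask allows 26 of the 49 query-key pairs. Under this assumed scheme, skipping the cross-condition pairs is where the computational saving would come from.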

TLDR: The paper introduces ContextAR, a flexible autoregressive framework for multi-conditional image generation, which uses hybrid positional encodings and conditional context-aware attention to effectively integrate diverse conditions into the image generation process.

Relevance: (10/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (9/10)

Authors: Yixiao Chen, Zhiyuan Ma, Guoli Jia, Che Jiang, Jianjun Li, Bowen Zhou

Video-GPT via Next Clip Diffusion

GPT has shown remarkable success in natural language processing. However, language sequences are not sufficient to describe the spatial-temporal details of the visual world, whereas video sequences capture such details well. Motivated by this fact, we propose a concise Video-GPT in this paper by treating video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Unlike previous works, this paradigm allows Video-GPT to tackle both short-term generation and long-term prediction by autoregressively denoising the noisy clip conditioned on the clean clips in the history. Extensive experiments show that Video-GPT achieves state-of-the-art performance on video prediction, which is the key factor towards world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it adapts well to 6 mainstream video tasks spanning both video generation and understanding, showing strong generalization to downstream tasks. The project page is at https://Video-GPT.github.io.
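The control flow of "next clip diffusion" as described above (an autoregressive outer loop over clips, with a denoising inner loop per clip conditioned on clean history) can be illustrated with a toy sketch. Everything here is a stand-in: clips are flat lists of floats, and `toy_denoise_step` is a hypothetical placeholder that simply pulls the noisy clip toward the most recent clean clip. The real model uses a learned network, and none of these names come from the paper.

```python
import random


def toy_denoise_step(noisy_clip, history, step, num_steps):
    """Hypothetical stand-in for a learned denoiser.

    Nudges the noisy clip toward the most recent clean clip; a real
    model would predict the update from the whole history.
    """
    target = history[-1]
    remaining = num_steps - step  # steps left, including this one
    return [x + (t - x) / remaining for x, t in zip(noisy_clip, target)]


def generate_clips(history, num_new_clips, num_steps=4, seed=0):
    """Autoregressive loop over clips, denoising loop inside each clip."""
    rng = random.Random(seed)
    history = [list(clip) for clip in history]  # leave the caller's data untouched
    clip_len = len(history[-1])
    for _ in range(num_new_clips):
        # Each new clip starts from pure Gaussian noise.
        clip = [rng.gauss(0.0, 1.0) for _ in range(clip_len)]
        for step in range(num_steps):
            clip = toy_denoise_step(clip, history, step, num_steps)
        # The denoised clip joins the clean history for the next clip.
        history.append(clip)
    return history


clips = generate_clips(history=[[0.0, 1.0, 2.0]], num_new_clips=2)
print(len(clips), "clips in total")
```

The point of the sketch is only the loop structure: noise is injected per clip rather than per token, and each finished clip is treated as clean context for the next one.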

TLDR: The paper introduces Video-GPT, a novel approach for video understanding and generation based on a next clip diffusion paradigm, achieving state-of-the-art results on video prediction and demonstrating strong generalization across various downstream tasks.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Shaobin Zhuang, Zhipeng Huang, Ying Zhang, Fangyikang Wang, Canmiao Fu, Binxin Yang, Chong Sun, Chen Li, Yali Wang

Guiding Diffusion with Deep Geometric Moments: Balancing Fidelity and Variation

Text-to-image generation models have achieved remarkable capabilities in synthesizing images, but often struggle to provide fine-grained control over the output. Existing guidance approaches, such as segmentation maps and depth maps, introduce spatial rigidity that restricts the inherent diversity of diffusion models. In this work, we introduce Deep Geometric Moments (DGM) as a novel form of guidance that encapsulates the subject's visual features and nuances through a learned geometric prior. Compared to DINO or CLIP features, which overemphasize global image features or semantics, DGMs focus specifically on the subject itself. Unlike ResNets, which are sensitive to pixel-wise perturbations, DGMs rely on robust geometric moments. Our experiments demonstrate that DGMs effectively balance control and diversity in diffusion-based image generation, providing a flexible control mechanism for steering the diffusion process.
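For readers unfamiliar with geometric moments, the sketch below computes the classical, hand-crafted version: the raw moment M_pq = sum of x^p * y^q * I(x, y) over all pixels, and central moments taken about the intensity centroid. This is background only. The paper's DGMs are learned features, and nothing here reproduces that model; the function names are illustrative.

```python
def raw_moment(image, p, q):
    """M_pq = sum over pixels of x^p * y^q * I(x, y); image is a list of rows."""
    return sum((x ** p) * (y ** q) * value
               for y, row in enumerate(image)
               for x, value in enumerate(row))


def centroid(image):
    """Intensity-weighted centre (cx, cy) of the image."""
    m00 = raw_moment(image, 0, 0)
    return raw_moment(image, 1, 0) / m00, raw_moment(image, 0, 1) / m00


def central_moment(image, p, q):
    """Moment about the centroid, so it does not change under translation."""
    cx, cy = centroid(image)
    return sum(((x - cx) ** p) * ((y - cy) ** q) * value
               for y, row in enumerate(image)
               for x, value in enumerate(row))


# The same 2x2 blob at two different positions.
blob = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0]]
shifted = [[0, 0, 0, 0],
           [0, 0, 1, 1],
           [0, 0, 1, 1]]

print(centroid(blob), centroid(shifted))
print(central_moment(blob, 2, 0), central_moment(shifted, 2, 0))
```

The centroid moves with the blob, while the second-order central moment stays the same. That invariance to where the subject sits is one reason moment-style descriptors describe a subject's shape without pinning it to fixed pixel locations.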

TLDR: This paper introduces Deep Geometric Moments (DGM), a novel guidance method for text-to-image diffusion models that balances control and diversity by leveraging learned geometric priors of the subject.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Sangmin Jung, Utkarsh Nath, Yezhou Yang, Giulia Pedrielli, Joydeep Biswas, Amy Zhang, Hassan Ghasemzadeh, Pavan Turaga

NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation

Diffusion models provide a strong tool for text-to-image (T2I) and image-to-image (I2I) generation. Recently, topology and texture control have become popular research directions, e.g., ControlNet, IP-Adapter, Ctrl-X, and DSG. These methods explicitly target high-fidelity controllable editing based on external signals or diffusion feature manipulations. For diversity, they simply sample different noise latents. However, the diffused noise can implicitly represent the topological and textural manifold of the corresponding image. Moreover, it is an effective workbench for trading off content preservation against controllable variation. Previous T2I and I2I diffusion works do not explore the information within this compressed contextual latent. In this paper, we propose, for the first time, a plug-and-play noise fine-tuning module, NOFT, that is used with Stable Diffusion to generate highly correlated and diverse images. We fine-tune seed noise or inverse noise through an optimal-transported (OT) information bottleneck (IB) with only around 14K trainable parameters and 10 minutes of training. Our test-time NOFT produces high-fidelity image variations that respect topology and texture alignment. Comprehensive experiments demonstrate that NOFT is a powerful, general re-imagining approach for efficiently fine-tuning 2D/3D AIGC assets with text or image guidance.

TLDR: This paper introduces NOFT, a plug-and-play module for Stable Diffusion that fine-tunes seed noise or inverse noise through an information bottleneck to generate diverse yet highly correlated images with topology and texture alignment, using around 14K trainable parameters and about 10 minutes of training.

Relevance: (8/10)

Novelty: (7/10)

Clarity: (8/10)

Potential Impact: (7/10)

Overall: (7/10)

Authors: Jia Li, Nan Gao, Huaibo Huang, Ran He
