AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

July 08, 2025

DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer

We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Because of tokenizer limitations, prior masked AR models have lagged behind diffusion models in quality or efficiency. We overcome this limitation by introducing DC-HT, a deep compression hybrid tokenizer for AR models that achieves a 32x spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of 5.49 on MJHQ-30K and an overall score of 0.69 on GenEval, while offering 1.5-7.9x higher throughput and 2.0-3.5x lower latency compared to prior leading diffusion and autoregressive models.
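
To give a sense of what a 32x spatial compression ratio means for sequence length, the short sketch below counts the tokens an AR model would have to predict per image. The 512x512 resolution and the 16x comparison point are illustrative assumptions, not figures from the abstract, and `token_grid` and `token_count` are hypothetical helper names.

```python
def token_grid(height, width, compression):
    """Return the (rows, cols) token grid for a given spatial compression ratio."""
    if height % compression or width % compression:
        raise ValueError("image size must be divisible by the compression ratio")
    return height // compression, width // compression


def token_count(height, width, compression):
    """Number of spatial positions the generator has to fill for one image."""
    rows, cols = token_grid(height, width, compression)
    return rows * cols


# Assumed example: a 512x512 image at 32x versus a 16x tokenizer.
tokens_32x = token_count(512, 512, 32)  # 16 * 16 = 256
tokens_16x = token_count(512, 512, 16)  # 32 * 32 = 1024
print(tokens_32x, tokens_16x, tokens_16x // tokens_32x)  # 256 1024 4
```

Doubling the spatial compression ratio cuts the number of positions by a factor of four, which is one plausible reason a higher-compression tokenizer can improve throughput and latency.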

TLDR: The paper introduces DC-AR, a novel masked autoregressive text-to-image generation framework using a deep compression hybrid tokenizer (DC-HT) that achieves state-of-the-art results with improved computational efficiency compared to prior diffusion and autoregressive models.

Relevance: (9/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (9/10)

Authors: Yecheng Wu, Junyu Chen, Zhuoyang Zhang, Enze Xie, Jincheng Yu, Junsong Chen, Jinyi Hu, Yao Lu, Song Han, Han Cai

QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation

Existing text-to-image models often rely on parameter fine-tuning techniques such as Low-Rank Adaptation (LoRA) to customize visual attributes. However, when combining multiple LoRA models for content-style fusion tasks, unstructured modifications of weight matrices often lead to undesired feature entanglement between content and style attributes. We propose QR-LoRA, a novel fine-tuning framework leveraging QR decomposition for structured parameter updates that effectively separate visual attributes. Our key insight is that the orthogonal Q matrix naturally minimizes interference between different visual features, while the upper triangular R matrix efficiently encodes attribute-specific transformations. Our approach fixes both Q and R matrices while only training an additional task-specific $\Delta R$ matrix. This structured design reduces trainable parameters to half of conventional LoRA methods and supports effective merging of multiple adaptations without cross-contamination due to the strong disentanglement properties between $\Delta R$ matrices. Experiments demonstrate that QR-LoRA achieves superior disentanglement in content-style fusion tasks, establishing a new paradigm for parameter-efficient, disentangled fine-tuning in generative models.
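
The update structure described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the layer shape, the rank, the truncation of a reduced QR factorization to `rank` columns, and the helper name `merge` are all hypothetical. Q and R stay frozen, and each task contributes only a $\Delta R$ matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, rank = 8, 8, 2  # hypothetical layer shape and adapter rank

W = rng.standard_normal((d, k))  # stand-in for a pretrained weight matrix

# Reduced QR: Q_full has orthonormal columns, R_full is upper triangular.
Q_full, R_full = np.linalg.qr(W)
Q = Q_full[:, :rank]  # (d, rank), frozen
R = R_full[:rank, :]  # (rank, k), frozen


def merge(weight, q, delta_rs):
    """Apply one or more Delta R updates: W + Q @ (sum of Delta R)."""
    total = np.zeros((q.shape[1], weight.shape[1]))
    for delta in delta_rs:
        total += delta
    return weight + q @ total


# Two independently trained adaptations (random values stand in for training).
delta_content = 0.01 * rng.standard_normal(R.shape)
delta_style = 0.01 * rng.standard_normal(R.shape)
W_merged = merge(W, Q, [delta_content, delta_style])

# Trainable parameters: LoRA trains both factors, this sketch trains Delta R only.
lora_params = rank * (d + k)
qr_lora_params = rank * k
print(lora_params, qr_lora_params)  # 32 16
```

In this sketch the parameter count is exactly half of LoRA's only because the layer is square (d equals k). Merging is a plain sum of the $\Delta R$ matrices because every adaptation shares the same frozen Q.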

TLDR: This paper introduces QR-LoRA, a parameter-efficient fine-tuning method for text-to-image models that uses QR decomposition to disentangle content and style attributes, enabling better content-style fusion.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Jiahui Yang, Yongjia Ma, Donglin Di, Hao Li, Wei Chen, Yan Xie, Jianxun Cui, Xun Yang, Wangmeng Zuo