AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

January 21, 2026

Dynamic Differential Linear Attention: Enhancing Linear Diffusion Transformer for High-Quality Image Generation

Diffusion transformers (DiTs) have emerged as a powerful architecture for high-fidelity image generation, yet the quadratic cost of self-attention poses a major scalability bottleneck. To address this, linear attention mechanisms have been adopted to reduce computational cost; unfortunately, the resulting linear diffusion transformers (LiTs) often sacrifice generative performance, frequently producing over-smoothed attention weights that limit expressiveness. In this work, we introduce Dynamic Differential Linear Attention (DyDiLA), a novel linear attention formulation that enhances the effectiveness of LiTs by mitigating the over-smoothing issue and improving generation quality. Specifically, the novelty of DyDiLA lies in three key designs: (i) a dynamic projection module, which decouples token representations by learning with dynamically assigned knowledge; (ii) a dynamic measure kernel, which provides a more discriminative similarity measure, capturing fine-grained semantic distinctions between tokens by dynamically assigning kernel functions for token processing; and (iii) a token differential operator, which enables more robust query-to-key retrieval by computing the difference between each token and the information redundancy produced for it by the dynamic measure kernel. To capitalize on DyDiLA, we introduce a refined LiT, termed DyDi-LiT, that systematically incorporates these advancements. Extensive experiments show that DyDi-LiT consistently outperforms current state-of-the-art (SOTA) models across multiple metrics, underscoring its strong practical potential.
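
To make the attention mechanism concrete, below is a minimal PyTorch sketch of a differential linear attention block in the spirit of DyDiLA. The gated two-way projection (standing in for the dynamic projection module), the ELU+1 feature map (standing in for the dynamic measure kernel), and the mean-pooled redundancy subtraction (standing in for the token differential operator) are illustrative assumptions, not the authors' implementation; the paper's components are learned and dynamic in ways this sketch does not capture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentialLinearAttention(nn.Module):
    """Hedged sketch: simplified stand-ins for DyDiLA's three components."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        # Stand-in for the dynamic projection module: a per-token gate
        # mixing two learned projections of each token.
        self.proj_a = nn.Linear(dim, dim, bias=False)
        self.proj_b = nn.Linear(dim, dim, bias=False)
        self.gate = nn.Linear(dim, 1)
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        # Stand-in for the token differential operator: a learned estimate
        # of redundant content, subtracted from the keys before retrieval.
        self.redundancy = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, dim)

    @staticmethod
    def feature_map(x):
        # ELU+1 keeps features positive, a common linear-attention kernel
        # (stand-in for the dynamic measure kernel).
        return F.elu(x) + 1.0

    def forward(self, x):  # x: (batch, tokens, dim)
        b, n, d = x.shape
        g = torch.sigmoid(self.gate(x))                     # (b, n, 1)
        x = g * self.proj_a(x) + (1 - g) * self.proj_b(x)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # Differential step: remove mean-pooled "redundant" signal from keys.
        k = k - self.redundancy(k.mean(dim=1, keepdim=True))
        q, k = self.feature_map(q), self.feature_map(k)
        h = self.heads
        q, k, v = (t.reshape(b, n, h, d // h).transpose(1, 2)
                   for t in (q, k, v))                      # (b, h, n, d/h)
        # Linear attention: O(n) via the (K^T V) contraction instead of
        # materializing an n x n attention matrix.
        kv = torch.einsum("bhnd,bhne->bhde", k, v)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        return self.out(out.transpose(1, 2).reshape(b, n, d))
```

For example, DifferentialLinearAttention(384)(torch.randn(2, 256, 384)) returns a (2, 256, 384) tensor at a cost linear in the 256 tokens, which is the scaling property that makes LiTs attractive in the first place.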

TLDR: This paper introduces Dynamic Differential Linear Attention (DyDiLA), a novel linear attention mechanism for diffusion transformers (DiTs) that addresses the performance degradation seen in linear DiTs (LiTs) by improving attention weight expressiveness and image generation quality. Experiments show DyDi-LiT achieves state-of-the-art (SOTA) results.

Relevance: (10/10)
Novelty: (9/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Boyuan Cao, Xingbo Yao, Chenhui Wang, Jiaxin Ye, Yujie Wei, Hongming Shan

POCI-Diff: Position Objects Consistently and Interactively with 3D-Layout Guided Diffusion

We propose a diffusion-based approach to Text-to-Image (T2I) generation with consistent and interactive 3D layout control and editing. While prior methods improve spatial adherence using 2D cues or iterative copy-warp-paste strategies, they often distort object geometry and fail to preserve consistency across edits. To address these limitations, we introduce POCI-Diff, a framework for Positioning Objects Consistently and Interactively that jointly enforces 3D geometric constraints and instance-level semantic binding within a unified diffusion process. Our method enables explicit per-object semantic control by binding individual text descriptions to specific 3D bounding boxes through Blended Latent Diffusion, allowing one-shot synthesis of complex multi-object scenes. We further propose a warping-free generative editing pipeline that supports object insertion, removal, and transformation via regeneration rather than pixel deformation. To preserve object identity across edits, we condition the diffusion process on reference images using IP-Adapter, enabling coherent object appearance throughout interactive 3D editing while maintaining global scene coherence. Experimental results demonstrate that POCI-Diff produces high-quality images consistent with the specified 3D layouts and edits, outperforming state-of-the-art methods in both visual fidelity and layout adherence while eliminating warping-induced geometric artifacts.
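
The instance-level semantic binding can be illustrated with a single denoising step in the style of Blended Latent Diffusion: each object's text embedding drives its own noise prediction, and the predictions are composited in latent space using masks derived from the projected 3D boxes. The sketch below assumes a standard diffusers-style unet/scheduler interface and precomputed binary masks; it illustrates only the blending idea, not the full POCI-Diff pipeline (which additionally conditions on reference images via IP-Adapter).

```python
import torch

@torch.no_grad()
def blended_denoise_step(unet, scheduler, latents, t,
                         scene_emb, object_embs, object_masks):
    """One blended step. latents: (1, C, H, W); masks: (1, 1, H, W) in {0, 1}.

    Hedged sketch: unet(sample, t, encoder_hidden_states=...) and
    scheduler.step(...) follow the common diffusers interface.
    """
    # Global pass conditioned on the full-scene prompt (background).
    noise = unet(latents, t, encoder_hidden_states=scene_emb).sample
    blended = scheduler.step(noise, t, latents).prev_sample
    # One extra pass per object, each bound to its own text embedding.
    for emb, mask in zip(object_embs, object_masks):
        obj_noise = unet(latents, t, encoder_hidden_states=emb).sample
        obj_latents = scheduler.step(obj_noise, t, latents).prev_sample
        # Composite in latent space: object prediction inside its box,
        # previous blend everywhere else.
        blended = mask * obj_latents + (1 - mask) * blended
    return blended
```

Because editing operates by regenerating masked latent regions rather than warping pixels, inserting, removing, or moving a box simply changes which masks and embeddings enter this loop.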

TLDR: This paper introduces POCI-Diff, a diffusion-based approach for text-to-image generation that allows for consistent and interactive 3D layout control and editing of objects, addressing limitations of prior methods regarding geometric distortion and edit consistency.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Andrea Rigo, Luca Stornaiuolo, Weijie Wang, Mauro Martino, Bruno Lepri, Nicu Sebe

Spherical Geometry Diffusion: Generating High-quality 3D Face Geometry via Sphere-anchored Representations

A fundamental challenge in text-to-3D face generation is achieving high-quality geometry. The core difficulty lies in the arbitrary and intricate distribution of vertices in 3D space, which makes it hard for existing models to establish clean connectivity and results in suboptimal geometry. To address this, our core insight is to simplify the underlying geometric structure by constraining the vertex distribution to a simple, regular manifold: a topological sphere. Building on this, we first propose the Spherical Geometry Representation, a novel face representation that anchors geometric signals to uniform spherical coordinates. This guarantees a regular point distribution, from which the mesh connectivity can be robustly reconstructed. Critically, this canonical sphere can be seamlessly unwrapped into a 2D map, creating a natural synergy with powerful 2D generative models. We then introduce Spherical Geometry Diffusion, a conditional diffusion framework built upon this 2D map. It enables diverse and controllable generation by jointly modeling geometry and texture, where the geometry explicitly conditions the texture synthesis process. Our method's effectiveness is demonstrated across a wide range of tasks: text-to-3D generation, face reconstruction, and text-based 3D editing. Extensive experiments show that our approach substantially outperforms existing methods in geometric quality, textual fidelity, and inference efficiency.
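
The sphere-to-2D unwrapping that enables the synergy with 2D generative models can be sketched with an equirectangular parameterization: a point on the unit sphere maps to (longitude, colatitude), which indexes a regular 2D grid where per-point geometric signals (e.g., offsets from the sphere) are stored. The nearest-pixel scatter and the equirectangular layout below are assumptions for illustration; the paper's exact parameterization may differ.

```python
import numpy as np

def sphere_to_uv(points):
    """Map unit-sphere points (N, 3) to (u, v) coordinates in [0, 1)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    u = (np.arctan2(y, x) / (2 * np.pi)) % 1.0        # longitude
    v = np.arccos(np.clip(z, -1.0, 1.0)) / np.pi      # colatitude
    return u, v

def unwrap_to_map(points, signals, height=256, width=512):
    """Scatter per-point signals (N, C) into an (H, W, C) 2D map."""
    u, v = sphere_to_uv(points)
    cols = np.minimum((u * width).astype(int), width - 1)
    rows = np.minimum((v * height).astype(int), height - 1)
    image = np.zeros((height, width, signals.shape[1]), dtype=signals.dtype)
    # Nearest-pixel scatter; last write wins. A real pipeline would
    # rasterize or interpolate to fill the map densely.
    image[rows, cols] = signals
    return image
```

Because the spherical anchoring is uniform, the inverse mapping is trivial: every pixel of the 2D map corresponds to a known point on the sphere, so mesh connectivity follows directly from the grid structure.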

TLDR: This paper introduces Spherical Geometry Diffusion, a method that uses a sphere-anchored representation to generate high-quality 3D face geometry from text, demonstrating advancements in geometry quality, textual fidelity, and inference efficiency compared to existing methods.

Relevance: (7/10)
Novelty: (9/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Junyi Zhang, Yiming Wang, Yunhong Lu, Qichao Wang, Wenzhe Qian, Xiaoyin Xu, David Gu, Min Zhang