AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

February 20, 2026

Unified Latents (UL): How to train your latents

We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder's output noise to the prior's minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves a competitive FID of 1.4 with high reconstruction quality (PSNR), while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.
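The key linkage can be sketched as a Gaussian encoder whose injected noise standard deviation equals the prior's minimum noise level, so the latent already sits on the lowest rung of the diffusion prior's noise schedule. This is a minimal illustrative sketch; the function name, the linear "encoder", and the value of `sigma_min` are assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_with_floor_noise(x, weight, sigma_min=0.05):
    """Toy linear 'encoder' whose output noise is tied to the prior's
    minimum noise level sigma_min: z = E(x) + sigma_min * eps, so the
    latent matches the diffusion prior at its lowest noise level."""
    mean = x @ weight                      # deterministic encoder output
    eps = rng.standard_normal(mean.shape)  # injected Gaussian noise
    return mean + sigma_min * eps

# Toy data: 4 "images" of 8 features mapped to 3 latent dims.
x = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 3)) / np.sqrt(8)
z = encode_with_floor_noise(x, w)
print(z.shape)  # (4, 3)
```

Because the encoder's output distribution is Gaussian with a known variance floor, its KL against the prior at that noise level gives the bitrate bound the abstract refers to.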

TLDR: The paper introduces Unified Latents (UL), a method for learning jointly regularized latent representations using diffusion priors and models. It achieves competitive image generation results on ImageNet-512 and sets a new state-of-the-art FVD on Kinetics-600 for video generation.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Jonathan Heek, Emiel Hoogeboom, Thomas Mensink, Tim Salimans

GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation

Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice, but also risks amplifying societal biases. In this work, we enhance T2I diversity through a geometric lens. Unlike most existing methods that rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
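The decomposition at the heart of GASS can be sketched in a few lines: project image embeddings onto the text-embedding direction (prompt-dependent axis) and measure the spread of the orthogonal residual (prompt-independent axis). This is a toy rendering under assumed names; the real method operates on CLIP embeddings during guided sampling, not on random vectors.

```python
import numpy as np

def diversity_spread(image_embs, text_emb):
    """Split unit-norm image embeddings into the component along the
    text embedding and the orthogonal residual, then measure the
    projection spread (variance) along each axis -- a toy version of
    the prompt-dependent / prompt-independent decomposition."""
    t = text_emb / np.linalg.norm(text_emb)
    along = image_embs @ t                      # prompt-dependent coordinates
    residual = image_embs - np.outer(along, t)  # prompt-independent part
    # Dominant orthogonal direction via the residual's top singular vector.
    _, _, vt = np.linalg.svd(residual, full_matrices=False)
    ortho = residual @ vt[0]
    return along.var(), ortho.var()

rng = np.random.default_rng(0)
embs = rng.standard_normal((16, 32))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
text = rng.standard_normal(32)
s_text, s_ortho = diversity_spread(embs, text)
print(s_text >= 0 and s_ortho >= 0)  # True
```

Increasing both variances during sampling, as GASS does via expanded predictions, pushes samples apart along semantic and non-semantic axes separately.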

TLDR: The paper introduces Geometry-Aware Spherical Sampling (GASS) to enhance diversity in text-to-image generation by explicitly controlling prompt-dependent and prompt-independent variations within CLIP embeddings, demonstrating effectiveness across various models.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Ye Zhu, Kaleb S. Newman, Johannes F. Lutzeyer, Adriana Romero-Soriano, Michal Drozdzal, Olga Russakovsky

Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers

Diffusion Transformer (DiT) architectures have significantly advanced Text-to-Image (T2I) generation but suffer from prohibitive computational costs and deployment barriers. To address these challenges, we propose an efficient compression framework that transforms the 60-layer dual-stream MMDiT-based Qwen-Image into lightweight models without training from scratch. Leveraging this framework, we introduce Amber-Image, a series of streamlined T2I models. We first derive Amber-Image-10B using a timestep-sensitive depth pruning strategy, where retained layers are reinitialized via local weight averaging and optimized through layer-wise distillation and full-parameter fine-tuning. Building on this, we develop Amber-Image-6B by introducing a hybrid-stream architecture that converts deep-layer dual streams into a single stream initialized from the image branch, further refined via progressive distillation and lightweight fine-tuning. Our approach reduces parameters by 70% and eliminates the need for large-scale data engineering. Notably, the entire compression and training pipeline, from the 10B to the 6B variant, requires fewer than 2,000 GPU hours, demonstrating exceptional cost-efficiency compared to training from scratch. Extensive evaluations on benchmarks such as DPG-Bench and LongText-Bench show that Amber-Image achieves high-fidelity synthesis and superior text rendering, matching much larger models.
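The "local weight averaging" reinitialization can be sketched as follows: after depth pruning, each retained layer is reset to the average of its own weights and those of the dropped layers it absorbs. The grouping rule (each retained layer absorbs the dropped layers up to the next retained one) is an assumption for illustration, not the paper's exact procedure.

```python
import numpy as np

def prune_with_local_averaging(layer_weights, keep_idx):
    """Depth-prune a stack of layers, reinitializing each retained
    layer as the average of its own weights and those of the dropped
    layers between it and the next retained one -- a toy rendering of
    local-weight-averaging reinitialization."""
    keep_idx = sorted(keep_idx)
    bounds = keep_idx[1:] + [len(layer_weights)]
    merged = []
    for start, end in zip(keep_idx, bounds):
        group = layer_weights[start:end]  # retained layer + dropped successors
        merged.append(np.mean(group, axis=0))
    return merged

# 6 toy "layers" of shape (4, 4), keeping layers 0, 2 and 4.
layers = [np.full((4, 4), float(i)) for i in range(6)]
compact = prune_with_local_averaging(layers, [0, 2, 4])
print(len(compact), compact[0][0, 0])  # 3 0.5
```

In the actual pipeline this averaging only provides the starting point; layer-wise distillation and fine-tuning then recover quality.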

TLDR: The paper introduces Amber-Image, a framework for efficiently compressing large-scale Diffusion Transformer models, significantly reducing computational costs and deployment barriers while maintaining high-fidelity image generation and superior text rendering.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Chaojie Yang, Tian Li, Yue Zhang, Jun Gao

DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to $3.52\times$ and $3.2\times$ speedup on FLUX-1.Dev and Wan $2.1$, respectively, without compromising the generation quality and prompt adherence.
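The coarse-to-fine intuition can be sketched as a schedule that maps each denoising step to a patch size: large patches early (global structure), small patches late (local detail). The concrete sizes and the equal-thirds phase split below are illustrative assumptions, not the schedule used in the paper.

```python
def patch_schedule(num_steps, sizes=(4, 2, 1)):
    """Assign a patch size to each denoising step, moving from coarse
    patches (early steps, global structure) to fine patches (late
    steps, local detail)."""
    per_phase = num_steps // len(sizes)
    schedule = []
    for i, size in enumerate(sizes):
        # The last phase takes whatever steps remain.
        count = per_phase if i < len(sizes) - 1 else num_steps - len(schedule)
        schedule.extend([size] * count)
    return schedule

print(patch_schedule(10))  # [4, 4, 4, 2, 2, 2, 1, 1, 1, 1]
```

Since token count scales inversely with the square of the patch size, running most early steps at a 4x-coarser patching is where the reported speedups would come from.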

TLDR: The paper proposes a dynamic patch scheduling method (DDiT) for Diffusion Transformers that adapts patch sizes based on content complexity and denoising timestep, leading to significant speedups in image and video generation without sacrificing quality.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Dahye Kim, Deepti Ghadiyaram, Raghudeep Gadde

Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge

Adversarial diffusion and diffusion-inversion methods have advanced unpaired image-to-image translation, but each faces key limitations. Adversarial approaches require target-domain adversarial loss during training, which can limit generalization to unseen data, while diffusion-inversion methods often produce low-fidelity translations due to imperfect inversion into noise-latent representations. In this work, we propose the Self-Supervised Semantic Bridge (SSB), a versatile framework that integrates external semantic priors into diffusion bridge models to enable spatially faithful translation without cross-domain supervision. Our key idea is to leverage self-supervised visual encoders to learn representations that are invariant to appearance changes but capture geometric structure, forming a shared latent space that conditions the diffusion bridges. Extensive experiments show that SSB outperforms strong prior methods for challenging medical image synthesis in both in-domain and out-of-domain settings, and extends easily to high-quality text-guided editing.
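The conditioning idea can be sketched as a bridge sampler whose drift is conditioned on an appearance-invariant feature of the source image (the role SSB assigns to a frozen self-supervised encoder). Everything here is schematic: the Euler-Maruyama form, the toy drift, and all names are assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def bridge_step(x_t, cond, dt, predict):
    """One toy Euler-Maruyama update of a diffusion bridge whose drift
    is conditioned on a semantic feature `cond` of the source image.
    `predict` stands in for the trained network."""
    drift = predict(x_t, cond)
    noise = rng.standard_normal(x_t.shape)
    return x_t + drift * dt + np.sqrt(dt) * noise

feat = rng.standard_normal(8)  # appearance-invariant source feature (toy)
x = rng.standard_normal(8)     # current bridge state
for _ in range(5):
    # Toy drift: pull the state toward the conditioning feature.
    x = bridge_step(x, feat, 0.1, lambda x_t, c: c - x_t)
print(x.shape)  # (8,)
```

The point of the shared latent space is that the same conditioning feature is meaningful on both sides of the bridge, which is what permits translation without cross-domain supervision.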

TLDR: The paper introduces Self-Supervised Semantic Bridge (SSB), a novel framework for unpaired image-to-image translation that leverages self-supervised visual encoders and diffusion bridge models to achieve spatially faithful and high-fidelity translations, particularly in medical image synthesis and text-guided editing.

Relevance: (8/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Jiaming Liu, Felix Petersen, Yunhe Gao, Yabin Zhang, Hyojin Kim, Akshay S. Chaudhari, Yu Sun, Stefano Ermon, Sergios Gatidis

RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. The reward model then provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.

TLDR: The paper introduces RetouchIQ, a framework using MLLM agents and reinforcement learning with a generalist reward model for instruction-based image retouching, demonstrating improved semantic consistency and perceptual quality over existing methods.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Qiucheng Wu, Jing Shi, Simon Jenni, Kushal Kafle, Tianyu Wang, Shiyu Chang, Handong Zhao

PartRAG: Retrieval-Augmented Part-Level 3D Generation and Editing

Single-image 3D generation with part-level structure remains challenging: learned priors struggle to cover the long tail of part geometries and maintain multi-view consistency, and existing systems provide limited support for precise, localized edits. We present PartRAG, a retrieval-augmented framework that integrates an external part database with a diffusion transformer to couple generation with an editable representation. To overcome the first challenge, we introduce a Hierarchical Contrastive Retrieval module that aligns dense image patches with 3D part latents at both part and object granularity, retrieving from a curated bank of 1,236 part-annotated assets to inject diverse, physically plausible exemplars into denoising. To overcome the second challenge, we add a masked, part-level editor that operates in a shared canonical space, enabling swaps, attribute refinements, and compositional updates without regenerating the whole object while preserving non-target parts and multi-view consistency. PartRAG achieves competitive results on Objaverse, ShapeNet, and ABO, reducing Chamfer Distance from 0.1726 to 0.1528 and raising F-Score from 0.7472 to 0.844 on Objaverse, with inference in 38s and interactive edits in 5-8s. Qualitatively, PartRAG produces sharper part boundaries, better thin-structure fidelity, and robust behavior on articulated objects. Code: https://github.com/AIGeeksGroup/PartRAG. Website: https://aigeeksgroup.github.io/PartRAG.
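The part-granularity half of the retrieval can be sketched as nearest-neighbor lookup by cosine similarity between a patch embedding and the bank of part latents. This is a minimal stand-in: the real module is trained contrastively and also matches at object granularity, and the embedding dimension here is an arbitrary assumption.

```python
import numpy as np

def retrieve_parts(patch_emb, part_bank, k=3):
    """Return the indices and scores of the k part latents most
    similar to an image-patch embedding, by cosine similarity."""
    q = patch_emb / np.linalg.norm(patch_emb)
    bank = part_bank / np.linalg.norm(part_bank, axis=1, keepdims=True)
    scores = bank @ q
    top = np.argsort(scores)[::-1][:k]
    return top, scores[top]

rng = np.random.default_rng(0)
bank = rng.standard_normal((1236, 64))  # 1,236 part-annotated assets, as in the paper
query = bank[42] + 0.01 * rng.standard_normal(64)
idx, _ = retrieve_parts(query, bank)
print(idx[0])  # 42
```

The retrieved latents then serve as exemplars injected into the denoising process, which is how the bank compensates for the long tail of part geometries.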

TLDR: PartRAG improves single-image 3D generation and editing by using a retrieval-augmented framework that combines an external part database with a diffusion transformer, allowing for part-level editing and better handling of part geometries.

Relevance: (6/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Peize Li, Zeyu Zhang, Hao Tang