AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

October 14, 2025

Diffusion Transformers with Representation Autoencoders

Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.

TLDR: This paper introduces Representation Autoencoders (RAEs) for Diffusion Transformers, replacing traditional VAEs with pretrained representation encoders to improve image generation quality and achieve state-of-the-art results on ImageNet.

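A minimal sketch of the core swap described above: a frozen pretrained representation encoder replaces the VAE encoder, and the diffusion model operates on its token space. The torch.hub DINOv2 entry point, the 224x224 input size, and the comments on training are assumptions for illustration; the pixel decoder (trained separately in the paper) is omitted.

```python
# Sketch, not the authors' code: frozen DINOv2 patch tokens as diffusion latents.
import torch

encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def encode_to_latents(images):
    # images: (B, 3, 224, 224), ImageNet-normalized pixels
    feats = encoder.forward_features(images)["x_norm_patchtokens"]  # (B, 256, 768)
    return feats  # semantically rich, high-dimensional latents

# A DiT would then be trained to denoise these tokens instead of VAE latents, e.g.
#   x_t = alpha_t.sqrt() * feats + (1 - alpha_t).sqrt() * torch.randn_like(feats)
# and a separately trained decoder (not shown) maps denoised tokens back to pixels.
```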

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (9/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie

InfiniHuman: Infinite 3D Human Creation with Precise Control

Generating realistic and controllable 3D human avatars is a long-standing challenge, particularly when covering broad attribute ranges such as ethnicity, age, clothing styles, and detailed body shapes. Capturing and annotating large-scale human datasets for training generative models is prohibitively expensive and limited in scale and diversity. The central question we address in this paper is: Can existing foundation models be distilled to generate theoretically unbounded, richly annotated 3D human data? We introduce InfiniHuman, a framework that synergistically distills these models to produce richly annotated human data at minimal cost and with theoretically unlimited scalability. We propose InfiniHumanData, a fully automatic pipeline that leverages vision-language and image generation models to create a large-scale multi-modal dataset. A user study shows that our automatically generated identities are indistinguishable from scan renderings. InfiniHumanData contains 111K identities spanning unprecedented diversity. Each identity is annotated with multi-granularity text descriptions, multi-view RGB images, detailed clothing images, and SMPL body-shape parameters. Building on this dataset, we propose InfiniHumanGen, a diffusion-based generative pipeline conditioned on text, body shape, and clothing assets. InfiniHumanGen enables fast, realistic, and precisely controllable avatar generation. Extensive experiments demonstrate significant improvements over state-of-the-art methods in visual quality, generation speed, and controllability. Our approach enables high-quality avatar generation with fine-grained control at effectively unbounded scale through a practical and affordable solution. We will publicly release the automatic data generation pipeline, the comprehensive InfiniHumanData dataset, and the InfiniHumanGen models at https://yuxuan-xue.com/infini-human.

TLDR: The paper introduces InfiniHuman, a framework for generating diverse and controllable 3D human avatars by distilling existing foundation models to create a large-scale, richly annotated dataset and a diffusion-based generative pipeline, achieving state-of-the-art results in quality, speed, and controllability.


Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (9/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Yuxuan Xue, Xianghui Xie, Margaret Kostyrko, Gerard Pons-Moll

Scaling Language-Centric Omnimodal Representation Learning

Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities is an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.

TLDR: The paper introduces LCO-Emb, a language-centric omnimodal embedding framework, and demonstrates that generative pretraining in MLLMs leads to better cross-modal alignment and representation quality, formalized by a Generation-Representation Scaling Law (GRSL). Further generative pretraining before contrastive learning enhances embedding capabilities, as validated on a low-resource visual-document retrieval task.

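The "lightweight refinement" stage described above is standard contrastive learning over paired embeddings. Below is a minimal sketch of a symmetric in-batch InfoNCE objective; it is generic, not the authors' exact training code, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, target_emb: torch.Tensor, temperature: float = 0.05):
    """Symmetric InfoNCE with in-batch negatives over paired (query, target)
    embeddings pooled from an MLLM; matched pairs sit on the diagonal."""
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.t() / temperature                    # (B, B) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```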

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong

Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers

Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for visual generation. Recent observations reveal Massive Activations (MAs) in their internal feature maps, yet their function remains poorly understood. In this work, we systematically investigate these activations to elucidate their role in visual generation. We find that these massive activations occur across all spatial tokens, and that their distribution is modulated by the input timestep embeddings. Importantly, our investigations further demonstrate that these massive activations play a key role in local detail synthesis, while having minimal impact on the overall semantic content of the output. Building on these insights, we propose Detail Guidance (DG), an MAs-driven, training-free self-guidance strategy to explicitly enhance local detail fidelity for DiTs. Specifically, DG constructs a degraded "detail-deficient" model by disrupting MAs and leverages it to guide the original network toward higher-quality detail synthesis. Our DG can seamlessly integrate with Classifier-Free Guidance (CFG), enabling further refinement of fine-grained details. Extensive experiments demonstrate that our DG consistently improves fine-grained detail quality across various pre-trained DiTs (e.g., SD3, SD3.5, and Flux).

TLDR: This paper investigates massive activations within Diffusion Transformers (DiTs), revealing their crucial role in synthesizing local details, and proposes a training-free guidance strategy, Detail Guidance (DG), to enhance fine-grained detail quality in generated images.

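Read literally from the abstract, DG adds a guidance term that contrasts the normal conditional prediction against one from a "detail-deficient" copy of the network (massive activations disrupted). A hedged sketch of how such a term could combine with CFG at each denoising step follows; the additive form, the scale values, and the MA-disruption hook are assumptions, not the paper's exact formula.

```python
def guided_noise_pred(eps_uncond, eps_cond, eps_detail_deficient,
                      cfg_scale=5.0, dg_scale=2.0):
    """Combine Classifier-Free Guidance with a DG-style detail term.
    All eps_* tensors are noise predictions from the same DiT at one step;
    eps_detail_deficient would come from a forward pass with massive
    activations suppressed (hook not shown)."""
    eps_cfg = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
    return eps_cfg + dg_scale * (eps_cond - eps_detail_deficient)
```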

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Chaofan Gan, Zicheng Zhao, Yuanpeng Tu, Xi Chen, Ziran Qin, Tieyuan Chen, Mehrtash Harandi, Weiyao Lin

InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models

General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes than previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmarks confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.

TLDR: The paper introduces InternSVG, a multimodal large language model for unified SVG understanding, editing, and generation, along with a new dataset (SAgoge) and benchmark (SArena). Results show it outperforms existing methods.

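One concrete ingredient mentioned above is subword-based embedding initialization for the SVG-specific special tokens. A minimal sketch with Hugging Face transformers follows; GPT-2 is only a stand-in backbone and the token names are illustrative, neither is from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # stand-in backbone
model = AutoModelForCausalLM.from_pretrained("gpt2")

svg_tokens = ["<svg_path>", "<svg_rect>", "<svg_animate>"]   # illustrative names
tok.add_special_tokens({"additional_special_tokens": svg_tokens})
model.resize_token_embeddings(len(tok))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for t in svg_tokens:
        desc = t.strip("<>").replace("_", " ")           # e.g. "svg path"
        sub_ids = tok(desc, add_special_tokens=False)["input_ids"]
        # initialize the new special token as the mean of its subword embeddings
        emb[tok.convert_tokens_to_ids(t)] = emb[sub_ids].mean(dim=0)
```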

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, Yanwen Guo, Wenhai Wang, Kai Chen, Yu Qiao, Hongjie Zhang

Demystifying Numerosity in Diffusion Models -- Limitations and Remedies

Numerosity remains a challenge for state-of-the-art text-to-image generation models like FLUX and GPT-4o, which often fail to accurately follow counting instructions in text prompts. In this paper, we aim to study a fundamental yet often overlooked question: Can diffusion models inherently generate the correct number of objects specified by a textual prompt simply by scaling up the dataset and model size? To enable rigorous and reproducible evaluation, we construct a clean synthetic numerosity benchmark comprising two complementary datasets: GrayCount250 for controlled scaling studies, and NaturalCount6 featuring complex naturalistic scenes. We then empirically show that the scaling hypothesis does not hold: larger models and datasets alone fail to improve counting accuracy on our benchmark. Our analysis identifies a key reason: diffusion models tend to rely heavily on the noise initialization rather than the explicit numerosity specified in the prompt. We observe that noise priors exhibit biases toward specific object counts. In addition, we propose an effective strategy for controlling numerosity by injecting count-aware layout information into the noise prior. Our method achieves significant gains, improving accuracy on GrayCount250 from 20.0% to 85.3% and on NaturalCount6 from 74.8% to 86.3%, demonstrating effective generalization across settings.

TLDR: This paper investigates the limitations of diffusion models in accurately generating a specified number of objects and proposes a count-aware layout injection method to improve numerosity control, demonstrating significant gains in counting accuracy.

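The remedy described above, injecting count-aware layout information into the noise prior, can be illustrated with a toy sketch that biases the initial latent noise with one shared "object seed" patch per requested object. The patch size, grid layout, and blending strength are assumptions made for illustration, not the paper's method.

```python
import torch

def count_aware_init_noise(n_objects, latent_shape=(4, 64, 64),
                           patch=12, strength=0.5, seed=0):
    """Toy count-aware noise prior: blend a shared object-seed patch into the
    initial Gaussian noise at n_objects grid positions."""
    g = torch.Generator().manual_seed(seed)
    c, h, w = latent_shape
    noise = torch.randn(latent_shape, generator=g)
    obj_seed = torch.randn(c, patch, patch, generator=g)   # shared per-object seed
    cols = max(1, int(n_objects ** 0.5))
    for i in range(n_objects):
        row, col = divmod(i, cols)
        y = min(row * (patch + 4), h - patch)
        x = min(col * (patch + 4), w - patch)
        noise[:, y:y+patch, x:x+patch] = (
            (1 - strength) * noise[:, y:y+patch, x:x+patch] + strength * obj_seed
        )
    return noise  # use as the starting latent for the diffusion sampler
```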

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yaqi Zhao, Xiaochen Wang, Li Dong, Wentao Zhang, Yuhui Yuan

MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps

This paper addresses the challenge of learning semantically and functionally meaningful 3D motion priors from real-world videos, in order to enable prediction of future 3D scene motion from a single input image. We propose a novel pixel-aligned Motion Map (MoMap) representation for 3D scene motion, which can be generated from existing generative image models to facilitate efficient and effective motion prediction. To learn meaningful distributions over motion, we create a large-scale database of MoMaps from over 50,000 real videos and train a diffusion model on these representations. Our motion generation not only synthesizes trajectories in 3D but also suggests a new pipeline for 2D video synthesis: first generate a MoMap, then warp an image accordingly and complete the warped point-based renderings. Experimental results demonstrate that our approach generates plausible and semantically consistent 3D scene motion.

TLDR: The paper proposes a Motion Map (MoMap) representation and a diffusion model trained on a large dataset of MoMaps to generate semantically consistent 3D scene motion from a single image, offering a novel approach to 2D video synthesis.

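The "generate a MoMap, then warp an image" pipeline can be illustrated with a simplified warp. Here the motion map is reduced to per-pixel 2D pixel displacements and applied as a backward warp with grid_sample, which only approximates the paper's 3D, point-based rendering; the 2D reduction is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def warp_with_momap(image: torch.Tensor, momap_xy: torch.Tensor) -> torch.Tensor:
    """Backward-warp an image (B, C, H, W) by a pixel-aligned motion map
    momap_xy (B, 2, H, W), given as (dx, dy) displacements in pixels."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().unsqueeze(0).to(image)  # (1, 2, H, W)
    tgt = base + momap_xy                                               # sampling coords
    # normalize to [-1, 1] for grid_sample, x first then y
    grid = torch.stack([2 * tgt[:, 0] / (w - 1) - 1,
                        2 * tgt[:, 1] / (h - 1) - 1], dim=-1)           # (B, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)
```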

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Jiahui Lei, Kyle Genova, George Kopanas, Noah Snavely, Leonidas Guibas

Zero-shot Face Editing via ID-Attribute Decoupled Inversion

Recent advancements in text-guided diffusion models have shown promise for general image editing via inversion techniques, but often struggle to maintain ID and structural consistency in real face editing tasks. To address this limitation, we propose a zero-shot face editing method based on ID-Attribute Decoupled Inversion. Specifically, we decompose the face representation into ID and attribute features, using them as joint conditions to guide both the inversion and the reverse diffusion processes. This allows independent control over ID and attributes, ensuring strong ID preservation and structural consistency while enabling precise facial attribute manipulation. Our method supports a wide range of complex multi-attribute face editing tasks using only text prompts, without requiring region-specific input, and operates at a speed comparable to DDIM inversion. Comprehensive experiments demonstrate its practicality and effectiveness.

TLDR: This paper introduces a zero-shot face editing method using ID-attribute decoupled inversion to improve ID preservation and structural consistency during text-guided face editing, operating at comparable speed to DDIM inversion.


Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yang Hou, Minggu Wang, Jianjun Zhao

GIR-Bench: Versatile Benchmark for Generating Images with Reasoning

Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce GIR-Bench, a comprehensive benchmark that evaluates unified models from three complementary perspectives. First, we investigate understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Second, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Third, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design task-specific evaluation pipelines tailored to each task. This enables fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive ablations over various unified models and generation-only systems show that, although unified models are more capable on reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at https://hkust-longgroup.github.io/GIR-Bench.

TLDR: The paper introduces GIR-Bench, a new benchmark for evaluating the reasoning capabilities of unified multimodal models in image generation tasks, focusing on understanding-generation consistency, text-to-image generation with reasoning, and multi-step reasoning in editing.


Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Hongxiang Li, Yaowei Li, Bin Lin, Yuwei Niu, Yuhang Yang, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Long Chen

ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation

Multi-instance image generation (MIG) remains a significant challenge for modern diffusion models due to key limitations in achieving precise control over object layout and preserving the identity of multiple distinct subjects. To address these limitations, we introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation that is guided by both layout and reference images. Our approach integrates two key technical contributions: a Contextual Layout Anchoring (CLA) mechanism that incorporates the composite layout image into the generation context to robustly anchor the objects in their desired positions, and Identity Consistency Attention (ICA), an innovative attention mechanism that leverages contextual reference images to ensure the identity consistency of multiple instances. Recognizing the lack of large-scale, hierarchically-structured datasets for this task, we introduce IMIG-100K, the first dataset with detailed layout and identity annotations. Extensive experiments demonstrate that ContextGen sets a new state-of-the-art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.

TLDR: The paper introduces ContextGen, a Diffusion Transformer framework for multi-instance image generation that uses Contextual Layout Anchoring (CLA) and Identity Consistency Attention (ICA) to improve control precision and identity fidelity. They also introduce a new dataset, IMIG-100K.


Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Ruihang Xu, Dewei Zhou, Fan Ma, Yi Yang

IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation

Existing vision language models (VLMs), including GPT-4 and DALL-E, often struggle to preserve logic, object identity, and style in multimodal image-text generation. This limitation significantly hinders the generalization capability of VLMs in complex image-text input-output scenarios. To address this issue, we propose IUT-Plug, a module grounded in an Image Understanding Tree (IUT), which enhances existing interleaved VLMs through explicit structured reasoning, thereby mitigating context drift in logic, entity identity, and style. The proposed framework operates in two stages. (1) A dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures. (2) A coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency. To evaluate our approach, we construct a novel benchmark based on 3,000 real human-generated question-answer pairs over fine-tuned large models, introducing a dynamic evaluation protocol for quantifying context drift in interleaved VLMs. Experimental results demonstrate that IUT-Plug not only improves accuracy on established benchmarks but also effectively alleviates the three critical forms of context drift across diverse multimodal question answering (QA) scenarios.

TLDR: The paper introduces IUT-Plug, a module that leverages an Image Understanding Tree (IUT) to improve logic, object identity, and style consistency in interleaved image-text generation within VLMs. It also introduces a new benchmark and evaluation protocol.

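A hierarchical symbolic structure like the Image Understanding Tree can be sketched as a small recursive schema that is serialized back into text for the VLM. The field names, example values, and serialization below are hypothetical illustrations, not the paper's actual format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IUTNode:
    """Hypothetical schema for an Image Understanding Tree node."""
    label: str                                   # e.g. "scene", "object", "attribute"
    value: str                                   # e.g. "kitchen", "red mug", "ceramic"
    children: List["IUTNode"] = field(default_factory=list)

    def flatten(self, depth: int = 0) -> List[str]:
        """Serialize the tree into indented text usable as explicit
        structured-reasoning context for an interleaved VLM."""
        lines = ["  " * depth + f"{self.label}: {self.value}"]
        for child in self.children:
            lines.extend(child.flatten(depth + 1))
        return lines

# Example: a tiny tree for a single scene
tree = IUTNode("scene", "kitchen", [IUTNode("object", "red mug",
                                            [IUTNode("attribute", "ceramic")])])
print("\n".join(tree.flatten()))
```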

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Zeteng Lin, Xingxing Li, Wen You, Xiaoyang Li, Zehan Lu, Yujun Cai, Jing Tang

DreamMakeup: Face Makeup Customization using Latent Diffusion Models

The exponential growth of the global makeup market has paralleled advancements in virtual makeup simulation technology. Despite the progress led by GANs, their application still encounters significant challenges, including training instability and limited customization capabilities. Addressing these challenges, we introduce DreamMakeup, a novel training-free, diffusion-based makeup customization method that leverages the inherent advantages of diffusion models for superior controllability and precise real-image editing. DreamMakeup employs early-stopped DDIM inversion to preserve facial structure and identity while enabling extensive customization through various conditioning inputs such as reference images, specific RGB colors, and textual descriptions. Our model demonstrates notable improvements over existing GAN-based and recent diffusion-based frameworks: better customization, color matching, and identity preservation, plus compatibility with textual descriptions and LLMs, all at affordable computational cost.

TLDR: DreamMakeup introduces a training-free diffusion-based method for face makeup customization, offering improved controllability, precision, and compatibility with textual descriptions compared to GAN-based approaches.

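Early-stopped DDIM inversion, the identity-preserving ingredient described above, can be sketched as a standard deterministic DDIM inversion loop that halts partway through the schedule. eps_model is a placeholder noise predictor and the stopping fraction is an assumption, not the paper's setting.

```python
import torch

@torch.no_grad()
def early_stopped_ddim_inversion(x0, eps_model, alphas_cumprod, stop_frac=0.6):
    """Invert only the first `stop_frac` of the timesteps so coarse facial
    structure and identity stay anchored; editing then denoises from the
    returned latent under the makeup conditions.
    alphas_cumprod: 1D tensor, ~1 at t=0 and decreasing with t."""
    T = len(alphas_cumprod)
    x = x0
    for t in range(int(stop_frac * T) - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_model(x, t)                                    # placeholder predictor
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # DDIM inversion step
    return x
```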

Relevance: (8/10)
Novelty: (9/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Geon Yeong Park, Inhwa Han, Serin Yang, Yeobin Hong, Seongmin Jeong, Heechan Jeon, Myeongjin Goh, Sung Won Yi, Jin Nam, Jong Chul Ye

DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training

In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. We attribute the difficulty of maintaining geometric fidelity and photorealism primarily to the lack of large-scale, high-quality, real-world panoramic data; this data-centric view differs from prior methods that focus on model design. DiT360 comprises several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules, including circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics. Our code is available at https://github.com/Insta360-Research-Team/DiT360.

TLDR: DiT360 is a DiT-based framework using hybrid training on perspective and panoramic data to generate high-fidelity panoramic images, addressing geometric fidelity and photorealism through inter-domain transformation and intra-domain augmentation.

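Two of the token-level supervision ideas named above, circular padding for boundary continuity and a yaw loss for rotational robustness, can be sketched generically. The padding width, rolling fraction, and MSE form are assumptions about the idea, not the released implementation.

```python
import torch
import torch.nn.functional as F

def circular_pad_width(x: torch.Tensor, pad: int = 8) -> torch.Tensor:
    """Pad a panoramic feature map circularly along width so the left and
    right borders share context (boundary continuity)."""
    return torch.cat([x[..., -pad:], x, x[..., :pad]], dim=-1)

def yaw_consistency_loss(model, x: torch.Tensor, shift_frac: float = 0.25) -> torch.Tensor:
    """A horizontal roll of an equirectangular panorama is a yaw rotation, so
    the model's output on the rolled input should match the rolled output."""
    shift = int(shift_frac * x.shape[-1])
    out = model(x)
    out_rolled_input = model(torch.roll(x, shifts=shift, dims=-1))
    return F.mse_loss(torch.roll(out, shifts=shift, dims=-1), out_rolled_input)
```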

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Haoran Feng, Dizhe Zhang, Xiangtai Li, Bo Du, Lu Qi

Uncertainty-Aware ControlNet: Bridging Domain Gaps with Synthetic Image Generation

Generative models are a valuable tool for the controlled creation of high-quality image data. Controlled diffusion models such as ControlNet allow the creation of labeled distributions. Such synthetic datasets can augment the original training distribution when discriminative models, such as semantic segmentation networks, are trained. However, this augmentation effect is limited, since ControlNets tend to reproduce the original training distribution. This work introduces a method to utilize data from unlabeled domains to train ControlNets by introducing the concept of uncertainty into the control mechanism. The uncertainty indicates that a given image was not part of the training distribution of a downstream task, e.g., segmentation. Thus, two types of control are engaged in the final network: an uncertainty control from an unlabeled dataset and a semantic control from the labeled dataset. The resulting ControlNet allows us to create annotated data with high uncertainty from the target domain, i.e., synthetic data from the unlabeled distribution with labels. In our scenario, we consider retinal OCTs, where high-quality Spectralis images are typically available with ground-truth segmentations, enabling the training of segmentation networks. Recent developments in Home-OCT devices, however, yield retinal OCTs with lower quality and a large domain shift, such that off-the-shelf segmentation networks cannot be applied to this type of data. Synthesizing annotated images from the Home-OCT domain using the proposed approach closes this gap and leads to significantly improved segmentation results without adding any further supervision. The advantage of uncertainty guidance becomes obvious when compared to style transfer: it enables arbitrary domain shifts without any strict learning of an image style. This is also demonstrated in a traffic-scene experiment.

TLDR: This paper introduces Uncertainty-Aware ControlNet, a method that leverages uncertainty from unlabeled data domains to train ControlNets for synthetic data generation, improving performance on downstream tasks like segmentation in target domains where domain shift is a significant issue. They demonstrate this on retinal OCT images and traffic scenes.

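One simple way to realize the "uncertainty control" signal from an unlabeled domain is the per-pixel predictive entropy of the downstream segmentation network, normalized to [0, 1] and supplied as an extra control image. The entropy definition below is a common choice and an assumption here, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def uncertainty_map(seg_logits: torch.Tensor) -> torch.Tensor:
    """Per-pixel normalized predictive entropy of a segmentation network.
    seg_logits: (B, C, H, W) raw class scores; returns (B, H, W) in [0, 1]."""
    p = F.softmax(seg_logits, dim=1)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=1)
    return entropy / torch.log(torch.tensor(float(seg_logits.shape[1])))
```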

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Joshua Niemeijer, Jan Ehrhardt, Heinz Handels, Hristina Uzunova

SceneTextStylizer: A Training-Free Scene Text Style Transfer Framework with Diffusion Model

With the rapid development of diffusion models, style transfer has made remarkable progress. However, flexible and localized style editing for scene text remains an unsolved challenge. Although existing scene text editing methods have achieved text region editing, they are typically limited to content replacement and simple styles and lack the ability to perform free-style transfer. In this paper, we introduce SceneTextStylizer, a novel training-free diffusion-based framework for flexible and high-fidelity style transfer of text in scene images. Unlike prior approaches that either perform global style transfer or focus solely on textual content modification, our method enables prompt-guided style transformation specifically for text regions, while preserving both text readability and stylistic consistency. To achieve this, we design a feature injection module that leverages diffusion model inversion and self-attention to transfer style features effectively. Additionally, a region control mechanism is introduced by applying a distance-based changing mask at each denoising step, enabling precise spatial control. To further enhance visual quality, we incorporate a style enhancement module based on the Fourier transform to reinforce stylistic richness. Extensive experiments demonstrate that our method achieves superior performance in scene text style transformation, outperforming existing state-of-the-art methods in both visual fidelity and text preservation.

TLDR: The paper introduces SceneTextStylizer, a training-free diffusion-based framework for style transfer of text in scene images, enabling prompt-guided style transformation specifically for text regions with high fidelity and readability.

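The "distance-based changing mask" applied at each denoising step can be sketched as a distance transform from the text mask whose allowed radius shrinks over the schedule. The linear schedule and maximum radius below are assumptions for illustration, not the paper's exact design.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_based_mask(text_mask: np.ndarray, step: int, total_steps: int,
                        max_radius: float = 24.0) -> np.ndarray:
    """text_mask: (H, W) binary, 1 inside text regions. Early denoising steps
    return a loose mask that lets style spread around the text; later steps
    tighten it toward the text itself (1 = editable)."""
    dist = distance_transform_edt(1 - text_mask.astype(np.uint8))  # pixels from text
    radius = max_radius * (1.0 - step / max(total_steps - 1, 1))
    return (dist <= radius).astype(np.float32)
```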

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Honghui Yuan, Keiji Yanai

DISC-GAN: Disentangling Style and Content for Cluster-Specific Synthetic Underwater Image Generation

In this paper, we propose a novel framework, Disentangled Style-Content GAN (DISC-GAN), which integrates style-content disentanglement with a cluster-specific training strategy for photorealistic underwater image synthesis. The quality of synthetic underwater images is challenged by optical phenomena such as color attenuation and turbidity, which manifest as distinct stylistic variations across different waterbodies, such as changes in tint and haze. While generative models are well-suited to capture complex patterns, they often lack the ability to model the non-uniform conditions of diverse underwater environments. To address these challenges, we employ K-means clustering to partition a dataset into style-specific domains. We use separate encoders to obtain latent spaces for style and content, integrate these latent representations via Adaptive Instance Normalization (AdaIN), and decode the result to produce the final synthetic image. The model is trained independently on each style cluster to preserve domain-specific characteristics. Our framework demonstrates state-of-the-art performance, obtaining a Structural Similarity Index (SSIM) of 0.9012, an average Peak Signal-to-Noise Ratio (PSNR) of 32.5118 dB, and a Frechet Inception Distance (FID) of 13.3728.

TLDR: The paper introduces DISC-GAN, a novel GAN-based framework that disentangles style and content with cluster-specific training to generate photorealistic underwater images, achieving state-of-the-art results in SSIM, PSNR, and FID.

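The style-content fusion step named above is standard Adaptive Instance Normalization. A minimal reference implementation follows; the surrounding encoders, K-means clustering, and decoder are not shown.

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5):
    """AdaIN: re-normalize content features (B, C, H, W) to the channel-wise
    mean/std of the style features, fusing the two latent spaces."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean
```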

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Sneha Varur, Anirudh R Hanchinamani, Tarun S Bagewadi, Uma Mudenagudi, Chaitra D Desai, Sujata C, Padmashree Desai, Sumit Meharwade