AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

September 23, 2025

ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation

Generating high-fidelity images of humans with fine-grained control over attributes such as hairstyle and clothing remains a core challenge in personalized text-to-image synthesis. While prior methods emphasize identity preservation from a reference image, they lack modularity and fail to provide disentangled control over specific visual attributes. We introduce a new paradigm for attribute-specific image prompting, in which distinct sets of reference images are used to guide the generation of individual aspects of human appearance, such as hair, clothing, and identity. Our method encodes these inputs into attribute-specific tokens, which are injected into a pre-trained text-to-image diffusion model. This enables compositional and disentangled control over multiple visual factors, even across multiple people within a single image. To promote natural composition and robust disentanglement, we curate a cross-reference training dataset featuring subjects in diverse poses and expressions, and propose a multi-attribute cross-reference training strategy that encourages the model to generate faithful outputs from misaligned attribute inputs while adhering to both identity and textual conditioning. Extensive experiments show that our method achieves state-of-the-art performance in accurately following both visual and textual prompts. Our framework paves the way for more configurable human image synthesis by combining visual prompting with text-driven generation. Webpage is available at: https://snap-research.github.io/composeme/.
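
As a rough illustration of the mechanism described above, the sketch below shows how per-attribute reference features might be mapped to attribute-specific tokens and appended to the text context that a frozen text-to-image model's cross-attention attends to. The encoder, token counts, and dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttributeTokenizer(nn.Module):
    """Maps reference-image features for one attribute (identity, hair, clothing, ...)
    to a small set of conditioning tokens in the diffusion model's context space.
    Dimensions and the pooling scheme are illustrative assumptions."""
    def __init__(self, feat_dim=1024, ctx_dim=768, num_tokens=16):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_tokens * ctx_dim)
        self.num_tokens = num_tokens
        self.ctx_dim = ctx_dim

    def forward(self, image_features):            # (B, feat_dim) pooled reference features
        tokens = self.proj(image_features)         # (B, num_tokens * ctx_dim)
        return tokens.view(-1, self.num_tokens, self.ctx_dim)

def build_context(text_tokens, attribute_features, tokenizers):
    """Concatenate attribute-specific tokens after the text tokens so the denoiser's
    cross-attention can attend to each attribute separately."""
    parts = [text_tokens]                          # (B, T, ctx_dim)
    for name, feats in attribute_features.items():
        parts.append(tokenizers[name](feats))      # (B, num_tokens, ctx_dim)
    return torch.cat(parts, dim=1)

# Example: identity / hair / clothing references for a batch of 2 prompts.
tokenizers = nn.ModuleDict({k: AttributeTokenizer() for k in ["identity", "hair", "clothing"]})
text_tokens = torch.randn(2, 77, 768)
attr_feats = {k: torch.randn(2, 1024) for k in tokenizers}
context = build_context(text_tokens, attr_feats, tokenizers)
print(context.shape)  # torch.Size([2, 125, 768]) -> 77 text tokens + 3 * 16 attribute tokens
```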

TLDR: The paper introduces a novel attribute-specific image prompting method for controllable human image generation, utilizing distinct reference images for individual attributes like hair and clothing within a text-to-image diffusion model.

TLDR: The paper introduces a new attribute-specific image prompting method for controllable human image generation, using distinct reference images to control individual attributes such as hair and clothing within a text-to-image diffusion model.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Guocheng Gordon Qian, Daniil Ostashev, Egor Nemchinov, Avihay Assouline, Sergey Tulyakov, Kuan-Chieh Jackson Wang, Kfir Aberman

Semantic and Visual Crop-Guided Diffusion Models for Heterogeneous Tissue Synthesis in Histopathology

Synthetic data generation in histopathology faces unique challenges: preserving tissue heterogeneity, capturing subtle morphological features, and scaling to unannotated datasets. We present a latent diffusion model that generates realistic heterogeneous histopathology images through a novel dual-conditioning approach combining semantic segmentation maps with tissue-specific visual crops. Unlike existing methods that rely on text prompts or abstract visual embeddings, our approach preserves critical morphological details by directly incorporating raw tissue crops from corresponding semantic regions. For annotated datasets (i.e., Camelyon16, Panda), we extract patches ensuring 20-80% tissue heterogeneity. For unannotated data (i.e., TCGA), we introduce a self-supervised extension that clusters whole-slide images into 100 tissue types using foundation model embeddings, automatically generating pseudo-semantic maps for training. Our method synthesizes high-fidelity images with precise region-wise annotations, achieving superior performance on downstream segmentation tasks. When evaluated on annotated datasets, models trained on our synthetic data achieve performance competitive with those trained on real data, demonstrating the utility of controlled heterogeneous tissue generation. In quantitative evaluation, prompt-guided synthesis reduces Fréchet Distance by up to 6x on Camelyon16 (from 430.1 to 72.0) and yields 2-3x lower FD across Panda and TCGA. Downstream DeepLabv3+ models trained solely on synthetic data attain test IoU of 0.71 and 0.95 on Camelyon16 and Panda, within 1-2% of real-data baselines (0.72 and 0.96). By scaling to 11,765 TCGA whole-slide images without manual annotations, our framework offers a practical solution to the urgent need for diverse, annotated histopathology data, addressing a critical bottleneck in computational pathology.
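
The dual conditioning can be pictured as two complementary pathways: the semantic map is injected spatially alongside the latent, while raw tissue crops from the corresponding regions are embedded as cross-attention tokens. The sketch below is a minimal illustration under assumed sizes and a 1x1-convolution projection; the abstract does not specify the actual conditioning architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualConditioner(nn.Module):
    """Illustrative sketch of dual conditioning: a semantic map becomes extra latent
    channels, and raw tissue crops from each semantic region are embedded as
    cross-attention tokens. Encoder choice and sizes are assumptions."""
    def __init__(self, num_classes=5, latent_ch=4, crop_dim=512, ctx_dim=768):
        super().__init__()
        self.seg_proj = nn.Conv2d(num_classes, latent_ch, kernel_size=1)
        self.crop_proj = nn.Linear(crop_dim, ctx_dim)

    def forward(self, latent, seg_map, crop_features):
        # latent: (B, C, h, w); seg_map: (B, H, W) integer labels; crop_features: (B, K, crop_dim)
        one_hot = F.one_hot(seg_map, self.seg_proj.in_channels).permute(0, 3, 1, 2).float()
        one_hot = F.interpolate(one_hot, size=latent.shape[-2:], mode="nearest")
        latent_in = torch.cat([latent, self.seg_proj(one_hot)], dim=1)  # channel concat
        context = self.crop_proj(crop_features)                          # cross-attention tokens
        return latent_in, context

cond = DualConditioner()
latent = torch.randn(1, 4, 32, 32)
seg_map = torch.randint(0, 5, (1, 256, 256))
crops = torch.randn(1, 8, 512)   # 8 tissue crops drawn from the semantic regions
latent_in, context = cond(latent, seg_map, crops)
print(latent_in.shape, context.shape)  # torch.Size([1, 8, 32, 32]) torch.Size([1, 8, 768])
```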

TLDR: This paper presents a novel diffusion model for generating realistic and diverse histopathology images by conditioning on semantic segmentation maps and tissue-specific visual crops, achieving significant improvements in downstream segmentation tasks and reducing the need for manual annotations.

TLDR: This paper proposes a novel diffusion model that generates realistic and diverse histopathology images by conditioning on semantic segmentation maps and tissue-specific visual crops, achieving notable improvements in downstream segmentation tasks and reducing the need for manual annotation.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (9/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Saghir Alfasly, Wataru Uegami, MD Enamul Hoq, Ghazal Alabtah, H. R. Tizhoosh

ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment

Training-free video object editing aims to achieve precise object-level manipulation, including object insertion, swapping, and deletion. However, it faces significant challenges in maintaining fidelity and temporal consistency. Existing methods, often designed for U-Net architectures, suffer from two primary limitations: inaccurate inversion due to first-order solvers, and contextual conflicts caused by crude "hard" feature replacement. These issues are more pronounced in Diffusion Transformers (DiTs), where prior layer-selection heuristics are unsuitable, making effective guidance difficult. To address these limitations, we introduce ContextFlow, a novel training-free framework for DiT-based video object editing. Concretely, we first employ a high-order Rectified Flow solver to establish a robust editing foundation. The core of our framework is Adaptive Context Enrichment (for specifying what to edit), a mechanism that addresses contextual conflicts. Instead of replacing features, it enriches the self-attention context by concatenating Key-Value pairs from parallel reconstruction and editing paths, empowering the model to dynamically fuse information. Additionally, to determine where to apply this enrichment (for specifying where to edit), we propose a systematic, data-driven analysis to identify task-specific vital layers. Based on a novel Guidance Responsiveness Metric, our method pinpoints the most influential DiT blocks for different tasks (e.g., insertion, swapping), enabling targeted and highly effective guidance. Extensive experiments show that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity results.
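
As described, Adaptive Context Enrichment amounts to concatenating Key/Value pairs from the parallel reconstruction path into the editing path's self-attention rather than overwriting features. A minimal sketch of that attention step, with multi-head handling and layer selection left out as simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def enriched_self_attention(q_edit, k_edit, v_edit, k_recon, v_recon):
    """Sketch of context enrichment as described in the abstract: rather than
    replacing the editing path's features, the Key/Value pairs from the parallel
    reconstruction path are concatenated along the token axis, letting attention
    fuse both contexts. Gating and per-layer selection are omitted here."""
    k = torch.cat([k_edit, k_recon], dim=-2)   # (B, heads, N_edit + N_recon, d)
    v = torch.cat([v_edit, v_recon], dim=-2)
    return F.scaled_dot_product_attention(q_edit, k, v)

# Example shapes: batch 1, 8 heads, 1024 tokens per path, head dim 64.
q = torch.randn(1, 8, 1024, 64)
k_e, v_e = torch.randn_like(q), torch.randn_like(q)
k_r, v_r = torch.randn_like(q), torch.randn_like(q)
out = enriched_self_attention(q, k_e, v_e, k_r, v_r)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```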

TLDR: The paper introduces ContextFlow, a training-free framework for video object editing using Diffusion Transformers, which leverages a high-order solver and adaptive context enrichment to significantly improve fidelity and temporal consistency compared to existing methods.

TLDR: The paper introduces ContextFlow, a training-free video object editing framework for Diffusion Transformers that uses a high-order solver and adaptive context enrichment to significantly improve fidelity and temporal consistency over existing methods.

Relevance: (8/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yiyang Chen, Xuanhua He, Xiujun Ma, Yue Ma

Qwen3-Omni Technical Report

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
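
The streaming claim rests on two pieces: the Talker emits discrete codec tokens frame by frame, and a lightweight causal ConvNet turns them into audio without waiting for a full block. The toy sketch below only illustrates the causal-decoding idea under assumed codebook counts and layer sizes; it is not the reported architecture.

```python
import torch
import torch.nn as nn

class CausalCodecDecoder(nn.Module):
    """Toy sketch of replacing block-wise diffusion with a causal ConvNet for
    streaming: each output chunk depends only on current and past codec frames,
    so audio can be emitted from the first frame. Layer sizes, codebook handling,
    and upsampling are illustrative assumptions."""
    def __init__(self, codebooks=4, codebook_size=1024, dim=256, samples_per_frame=320):
        super().__init__()
        self.embed = nn.ModuleList(nn.Embedding(codebook_size, dim) for _ in range(codebooks))
        self.conv = nn.Conv1d(dim, dim, kernel_size=3)   # made causal via left-only padding
        self.head = nn.Linear(dim, samples_per_frame)

    def forward(self, codes):                            # codes: (B, codebooks, T) int64
        x = sum(e(codes[:, i]) for i, e in enumerate(self.embed))  # (B, T, dim)
        x = x.transpose(1, 2)                            # (B, dim, T)
        x = nn.functional.pad(x, (2, 0))                 # pad past only -> causal
        x = torch.relu(self.conv(x)).transpose(1, 2)
        return self.head(x)                              # (B, T, samples_per_frame) waveform chunks

dec = CausalCodecDecoder()
codes = torch.randint(0, 1024, (1, 4, 5))                # 5 codec frames, 4 codebooks
audio = dec(codes)
print(audio.shape)  # torch.Size([1, 5, 320]); the first chunk is available after frame 1
```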

TLDR: Qwen3-Omni is a novel multimodal model achieving state-of-the-art performance across text, image, audio, and video, with notable success in audio tasks and real-time speech synthesis, and public availability under Apache 2.0.

TLDR: Qwen3-Omni is a new multimodal model that achieves state-of-the-art performance across text, image, audio, and video, excels at audio tasks and real-time speech synthesis, and is publicly released under the Apache 2.0 license.

Relevance: (8/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (9/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin

Multi-Agent Amodal Completion: Direct Synthesis with Fine-Grained Semantic Guidance

Amodal completion, generating invisible parts of occluded objects, is vital for applications like image editing and AR. Prior methods face challenges with data needs, generalization, or error accumulation in progressive pipelines. We propose a Collaborative Multi-Agent Reasoning Framework based on upfront collaborative reasoning to overcome these issues. Our framework uses multiple agents to collaboratively analyze occlusion relationships and determine necessary boundary expansion, yielding a precise mask for inpainting. Concurrently, an agent generates fine-grained textual descriptions, enabling Fine-Grained Semantic Guidance. This ensures accurate object synthesis and prevents the regeneration of occluders or other unwanted elements, especially within large inpainting areas. Furthermore, our method directly produces layered RGBA outputs guided by visible masks and attention maps from a Diffusion Transformer, eliminating extra segmentation. Extensive evaluations demonstrate our framework achieves state-of-the-art visual quality.
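
Conceptually, the pipeline front-loads the reasoning: agents first agree on what is occluded, how far the visible boundary must expand, and a fine-grained description, and only then is a single inpainting pass run. The skeleton below sketches that orchestration with hypothetical agent callables standing in for multimodal-LLM calls; the names, prompts, and return formats are assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class CompletionPlan:
    inpaint_mask: Any           # region to synthesize (visible mask grown by the agents)
    expand_ratio: float         # how far the visible-object boundary should be expanded
    description: str            # fine-grained text guiding the inpainting model

def plan_amodal_completion(image: Any, visible_mask: Any,
                           occlusion_agent: Callable, boundary_agent: Callable,
                           caption_agent: Callable) -> CompletionPlan:
    """Skeleton of the upfront collaborative reasoning stage described in the abstract.
    The three agent callables are hypothetical stand-ins for multimodal-LLM calls."""
    occlusion = occlusion_agent(image, visible_mask)                   # who occludes whom
    expand_ratio, mask = boundary_agent(image, visible_mask, occlusion)
    description = caption_agent(image, visible_mask)                   # fine-grained semantics
    return CompletionPlan(inpaint_mask=mask, expand_ratio=expand_ratio, description=description)

# Toy run with dummy agents; in practice the plan would condition a DiT inpainting
# pass whose visible-mask attention maps yield the layered RGBA output.
plan = plan_amodal_completion(
    image=None, visible_mask="mask",
    occlusion_agent=lambda img, m: "cup occluded by hand",
    boundary_agent=lambda img, m, occ: (1.3, m),
    caption_agent=lambda img, m: "a white ceramic cup with a curved handle",
)
print(plan.expand_ratio, plan.description)
```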

TLDR: This paper introduces a collaborative multi-agent framework for amodal completion that leverages fine-grained semantic guidance and a Diffusion Transformer to directly generate layered RGBA outputs, achieving state-of-the-art visual quality.

TLDR: This paper proposes a collaborative multi-agent framework for amodal completion that uses fine-grained semantic guidance and a Diffusion Transformer to directly generate layered RGBA outputs, achieving state-of-the-art visual quality.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Hongxing Fan, Lipeng Wang, Haohua Chen, Zehuan Huang, Jiangtao Wu, Lu Sheng

OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

Recent advances in video insertion based on diffusion models are impressive. However, existing methods rely on complex control signals but struggle with subject consistency, limiting their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address the data scarcity, we propose a new data pipeline InsertPipe, constructing diverse cross-pair data automatically. Building upon our data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism to distinctly inject multi-source conditions and propose a novel Progressive Training strategy that enables the model to balance feature injection from subjects and source video. Meanwhile, we design the Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology to optimize the model by simulating human preferences, and incorporate a Context-Aware Rephraser module during inference to seamlessly integrate the subject into the original scenes. To address the lack of a benchmark for the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.
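
The abstract names a Subject-Focused Loss but not its exact form; a plausible reading is a denoising loss upweighted inside the subject region, as in the hedged sketch below. The weighting scheme, shapes, and the assumption of a latent-video mask are all illustrative, not the paper's stated formulation.

```python
import torch

def subject_focused_loss(noise_pred, noise_target, subject_mask, subject_weight=2.0):
    """Sketch of a subject-focused diffusion loss: the per-element denoising error
    is upweighted inside the subject mask so the model spends more capacity on the
    inserted subject's appearance. The weighting form is an assumption."""
    err = (noise_pred - noise_target) ** 2                      # (B, C, T, H, W)
    weight = 1.0 + (subject_weight - 1.0) * subject_mask        # 1 outside, subject_weight inside
    return (weight * err).mean()

pred = torch.randn(1, 4, 8, 32, 32)            # latent video: 4 channels, 8 frames
target = torch.randn_like(pred)
mask = torch.zeros(1, 1, 8, 32, 32)
mask[..., 8:24, 8:24] = 1.0                    # subject occupies a central region
print(subject_focused_loss(pred, target, mask).item())
```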

TLDR: The paper introduces OmniInsert, a mask-free video insertion framework based on diffusion transformer models, addressing data scarcity, subject-scene equilibrium, and insertion harmonization through a new data pipeline, condition-specific feature injection, progressive training, and insertive preference optimization. It outperforms state-of-the-art methods on a newly introduced benchmark.

TLDR: This paper introduces OmniInsert, a mask-free video insertion framework built on diffusion transformer models. It addresses data scarcity, subject-scene equilibrium, and insertion harmonization through a new data pipeline, condition-specific feature injection, progressive training, and insertive preference optimization, and it outperforms state-of-the-art methods on a newly released benchmark.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Jinshu Chen, Xinghui Li, Xu Bai, Tianxiang Ma, Pengze Zhang, Zhuowei Chen, Gen Li, Lijie Liu, Songtao Zhao, Bingchuan Li, Qian He

Guided and Unguided Conditional Diffusion Mechanisms for Structured and Semantically-Aware 3D Point Cloud Generation

Generating realistic 3D point clouds is a fundamental problem in computer vision with applications in remote sensing, robotics, and digital object modeling. Existing generative approaches primarily capture geometry, and when semantics are considered, they are typically imposed post hoc through external segmentation or clustering rather than integrated into the generative process itself. We propose a diffusion-based framework that embeds per-point semantic conditioning directly within generation. Each point is associated with a conditional variable corresponding to its semantic label, which guides the diffusion dynamics and enables the joint synthesis of geometry and semantics. This design produces point clouds that are both structurally coherent and segmentation-aware, with object parts explicitly represented during synthesis. Through a comparative analysis of guided and unguided diffusion processes, we demonstrate the significant impact of conditional variables on diffusion dynamics and generation quality. Extensive experiments validate the efficacy of our approach, producing detailed and accurate 3D point clouds tailored to specific parts and features.
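
The core design is that each point carries a semantic label whose embedding conditions the denoiser jointly with the noisy coordinates, so geometry and semantics are synthesized together. A minimal sketch of such a per-point conditioned denoiser, with an assumed MLP backbone and illustrative sizes:

```python
import torch
import torch.nn as nn

class SemanticPointDenoiser(nn.Module):
    """Minimal sketch of per-point semantic conditioning: every noisy point is paired
    with an embedding of its part label, which guides the denoising prediction.
    The MLP backbone and dimensions are illustrative assumptions."""
    def __init__(self, num_labels=8, label_dim=32, time_dim=32, hidden=128):
        super().__init__()
        self.label_emb = nn.Embedding(num_labels, label_dim)
        self.time_emb = nn.Linear(1, time_dim)
        self.net = nn.Sequential(
            nn.Linear(3 + label_dim + time_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 3),                      # predicted noise per point
        )

    def forward(self, noisy_points, labels, t):
        # noisy_points: (B, N, 3); labels: (B, N) int64 part labels; t: (B,) in [0, 1]
        lab = self.label_emb(labels)                                   # (B, N, label_dim)
        tim = self.time_emb(t[:, None, None].float()).expand(-1, noisy_points.size(1), -1)
        return self.net(torch.cat([noisy_points, lab, tim], dim=-1))

model = SemanticPointDenoiser()
pts = torch.randn(2, 1024, 3)
labels = torch.randint(0, 8, (2, 1024))
t = torch.rand(2)
print(model(pts, labels, t).shape)  # torch.Size([2, 1024, 3])
```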

TLDR: This paper introduces a diffusion-based method for generating 3D point clouds that incorporates semantic information directly into the generative process, resulting in structurally coherent and segmentation-aware outputs.

TLDR: This paper proposes a diffusion-based 3D point cloud generation method that integrates semantic information directly into the generative process, producing structurally coherent and segmentation-aware outputs.

Relevance: (6/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Gunner Stone, Sushmita Sarker, Alireza Tavakkoli

4D-MoDe: Towards Editable and Scalable Volumetric Streaming via Motion-Decoupled 4D Gaussian Compression

Volumetric video has emerged as a key medium for immersive telepresence and augmented/virtual reality, enabling six-degrees-of-freedom (6DoF) navigation and realistic spatial interactions. However, delivering high-quality dynamic volumetric content at scale remains challenging due to massive data volume, complex motion, and limited editability of existing representations. In this paper, we present 4D-MoDe, a motion-decoupled 4D Gaussian compression framework designed for scalable and editable volumetric video streaming. Our method introduces a layered representation that explicitly separates static backgrounds from dynamic foregrounds using a lookahead-based motion decomposition strategy, significantly reducing temporal redundancy and enabling selective background/foreground streaming. To capture continuous motion trajectories, we employ a multi-resolution motion estimation grid and a lightweight shared MLP, complemented by a dynamic Gaussian compensation mechanism to model emergent content. An adaptive grouping scheme dynamically inserts background keyframes to balance temporal consistency and compression efficiency. Furthermore, an entropy-aware training pipeline jointly optimizes the motion fields and Gaussian parameters under a rate-distortion (RD) objective, while employing range-based and KD-tree compression to minimize storage overhead. Extensive experiments on multiple datasets demonstrate that 4D-MoDe consistently achieves competitive reconstruction quality with an order of magnitude lower storage cost (e.g., as low as 11.4 KB/frame) compared to state-of-the-art methods, while supporting practical applications such as background replacement and foreground-only streaming.
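
For intuition, the motion-field component can be pictured as multi-resolution feature grids sampled at each Gaussian center and decoded by a small shared MLP into a time-dependent offset. The sketch below illustrates only that idea; grid resolutions, feature widths, and the time encoding are assumptions, and the layering, Gaussian compensation, and entropy-aware compression are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionField(nn.Module):
    """Sketch of a multi-resolution motion grid plus a lightweight shared MLP:
    features sampled at each Gaussian center, together with a time value, are mapped
    to a position offset. All sizes and the time encoding are illustrative."""
    def __init__(self, resolutions=(16, 32), feat=8, hidden=64):
        super().__init__()
        self.grids = nn.ParameterList(
            nn.Parameter(torch.zeros(1, feat, r, r, r)) for r in resolutions)
        self.mlp = nn.Sequential(
            nn.Linear(feat * len(resolutions) + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, 3))                       # xyz offset per Gaussian

    def forward(self, xyz, t):
        # xyz: (N, 3) Gaussian centers normalized to [-1, 1]; t: scalar normalized time
        grid = xyz.view(1, -1, 1, 1, 3)
        feats = [F.grid_sample(g, grid, align_corners=True).view(g.shape[1], -1).T
                 for g in self.grids]                    # each (N, feat)
        t_col = torch.full((xyz.shape[0], 1), float(t))
        return self.mlp(torch.cat(feats + [t_col], dim=-1))  # (N, 3) offsets

field = MotionField()
centers = torch.rand(1000, 3) * 2 - 1
offsets = field(centers, t=0.5)
print(offsets.shape)  # torch.Size([1000, 3])
```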

TLDR: The paper introduces 4D-MoDe, a motion-decoupled 4D Gaussian compression framework for scalable and editable volumetric video streaming, achieving significant compression gains and supporting editing functionalities like background replacement.

TLDR: The paper introduces 4D-MoDe, a motion-decoupled 4D Gaussian compression framework for scalable and editable volumetric video streaming that achieves significant compression gains and supports editing features such as background replacement.

Relevance: (3/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (5/10)
Read Paper (PDF)

Authors: Houqiang Zhong, Zihan Zheng, Qiang Hu, Yuan Tian, Ning Cao, Lan Xu, Xiaoyun Zhang, Zhengxue Cheng, Li Song, Wenjun Zhang