ArXiv CS.CV Papers (Image/Video Generation)

T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation

We propose T2I-ReasonBench, a benchmark evaluating the reasoning capabilities of text-to-image (T2I) models. It consists of four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning, and Scientific-Reasoning. We propose a two-stage evaluation protocol to assess reasoning accuracy and image quality. We benchmark various T2I generation models and provide a comprehensive analysis of their performance.

TLDR: The paper introduces T2I-ReasonBench, a new benchmark to evaluate the reasoning capabilities of text-to-image models across four dimensions, and benchmarks existing models using a two-stage evaluation protocol.

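TLDR: This paper introduces T2I-ReasonBench, a new benchmark for evaluating the reasoning capabilities of text-to-image models; it covers four dimensions and benchmarks existing models with a two-stage evaluation protocol.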

Relevance: (9/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Kaiyue Sun, Rongyao Fang, Chengqi Duan, Xian Liu, Xihui Liu

An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing

Despite the remarkable capabilities of text-to-image (T2I) generation models, real-world applications often demand fine-grained, iterative image editing that existing methods struggle to provide. Key challenges include granular instruction understanding, robust context preservation during modifications, and the lack of intelligent feedback mechanisms for iterative refinement. This paper introduces RefineEdit-Agent, a novel, training-free intelligent agent framework designed to address these limitations by enabling complex, iterative, and context-aware image editing. RefineEdit-Agent leverages the powerful planning capabilities of Large Language Models (LLMs) and the advanced visual understanding and evaluation prowess of Large Vision-Language Models (LVLMs) within a closed-loop system. Our framework comprises an LVLM-driven instruction parser and scene understanding module, a multi-level LLM-driven editing planner for goal decomposition, tool selection, and sequence generation, an iterative image editing module, and a crucial LVLM-driven feedback and evaluation loop. To rigorously evaluate RefineEdit-Agent, we propose LongBench-T2I-Edit, a new benchmark featuring 500 initial images with complex, multi-turn editing instructions across nine visual dimensions. Extensive experiments demonstrate that RefineEdit-Agent significantly outperforms state-of-the-art baselines, achieving an average score of 3.67 on LongBench-T2I-Edit, compared to 2.29 for Direct Re-Prompting, 2.91 for InstructPix2Pix, 3.16 for GLIGEN-based Edit, and 3.39 for ControlNet-XL. Ablation studies, human evaluations, and analyses of iterative refinement, backbone choices, tool usage, and robustness to instruction complexity further validate the efficacy of our agentic design in delivering superior edit fidelity and context preservation.
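The closed-loop idea in the abstract above (plan, edit, evaluate, repeat) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names `refine_loop`, `toy_plan`, `toy_apply`, and `toy_evaluate`, the `max_rounds` and `pass_score` parameters, and the toy representation of an image as a set of attribute strings are all hypothetical and are not taken from the paper, which would use an LLM planner, real editing tools, and an LVLM evaluator in these roles.

```python
def refine_loop(image, instruction, plan, apply_edit, evaluate,
                max_rounds=3, pass_score=1.0):
    """Plan, edit, and evaluate until the evaluator is satisfied or rounds run out."""
    feedback = None
    history = []  # evaluator score after each round
    for _ in range(max_rounds):
        # Planner proposes edit steps given the goal, current image, and last feedback.
        for step in plan(instruction, image, feedback):
            image = apply_edit(image, step)
        # Evaluator scores the result and returns feedback for the next round.
        score, feedback = evaluate(image, instruction)
        history.append(score)
        if score >= pass_score:
            break
    return image, history


# Toy stand-ins (hypothetical): an "image" is a set of attribute strings,
# and the instruction is the set of attributes the user wants to see.
def toy_plan(instruction, image, feedback):
    missing = sorted(instruction - image)
    return missing[:1]  # one edit per round, so the loop has to iterate


def toy_apply(image, step):
    return image | {step}


def toy_evaluate(image, instruction):
    satisfied = len(instruction & image) / len(instruction)
    return satisfied, sorted(instruction - image)


final, history = refine_loop(
    {"sky"}, {"sky", "red car", "rain"}, toy_plan, toy_apply, toy_evaluate
)
```

In this toy run the loop stops as soon as the evaluator reports that every requested attribute is present, which is the role the abstract assigns to the feedback and evaluation stage.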

TLDR: The paper introduces RefineEdit-Agent, a novel LLM-LVLM driven agent for iterative and fine-grained image editing, outperforming SOTA methods on a new benchmark, LongBench-T2I-Edit.

TLDR: This paper introduces RefineEdit-Agent, a novel LLM-LVLM driven agent for iterative and fine-grained image editing, which outperforms SOTA methods on a new benchmark called LongBench-T2I-Edit.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Zihan Liang, Jiahao Sun, Haoran Ma

MoCo: Motion-Consistent Human Video Generation via Structure-Appearance Decoupling

Generating human videos with consistent motion from text prompts remains a significant challenge, particularly for whole-body or long-range motion. Existing video generation models prioritize appearance fidelity, resulting in unrealistic or physically implausible human movements with poor structural coherence. Additionally, most existing human video datasets primarily focus on facial or upper-body motions, or consist of vertically oriented dance videos, limiting the scope of corresponding generation methods to simple movements. To overcome these challenges, we propose MoCo, which decouples the process of human video generation into two components: structure generation and appearance generation. Specifically, our method first employs an efficient 3D structure generator to produce a human motion sequence from a text prompt. The remaining video appearance is then synthesized under the guidance of the generated structural sequence. To improve fine-grained control over sparse human structures, we introduce Human-Aware Dynamic Control modules and integrate dense tracking constraints during training. Furthermore, recognizing the limitations of existing datasets, we construct a large-scale whole-body human video dataset featuring complex and diverse motions. Extensive experiments demonstrate that MoCo outperforms existing approaches in generating realistic and structurally coherent human videos.

TLDR: The paper introduces MoCo, a method for generating human videos from text prompts by decoupling structure and appearance generation, using a 3D structure generator and a new large-scale whole-body motion dataset.

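TLDR: This paper introduces MoCo, a method that generates human videos from text prompts by decoupling structure and appearance generation, using a 3D structure generator and a new large-scale whole-body motion dataset.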

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Haoyu Wang, Hao Tang, Donglin Di, Zhilu Zhang, Wangmeng Zuo, Feng Gao, Siwei Ma, Shiliang Zhang

ShaLa: Multimodal Shared Latent Space Modelling

This paper presents a novel generative framework for learning shared latent representations across multimodal data. Many advanced multimodal methods focus on capturing all combinations of modality-specific details across inputs, which can inadvertently obscure the high-level semantic concepts that are shared across modalities. Notably, Multimodal VAEs with low-dimensional latent variables are designed to capture shared representations, enabling various tasks such as joint multimodal synthesis and cross-modal inference. However, multimodal VAEs often struggle to design expressive joint variational posteriors and suffer from low-quality synthesis. In this work, ShaLa addresses these challenges by integrating a novel architectural inference model and a second-stage expressive diffusion prior, which not only facilitates effective inference of shared latent representation but also significantly improves the quality of downstream multimodal synthesis. We validate ShaLa extensively across multiple benchmarks, demonstrating superior coherence and synthesis quality compared to state-of-the-art multimodal VAEs. Furthermore, ShaLa scales to many more modalities while prior multimodal VAEs have fallen short in capturing the increasing complexity of the shared latent space.

TLDR: The paper introduces ShaLa, a novel generative framework that uses a new architectural inference model and diffusion prior to learn shared latent representations across multiple modalities, improving both inference and synthesis quality compared to multimodal VAEs.

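TLDR: This paper introduces ShaLa, a novel generative framework that uses a new architectural inference model and a diffusion prior to learn shared latent representations across multiple modalities, improving inference and synthesis quality compared with multimodal VAEs.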

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Jiali Cui, Yan-Ying Chen, Yanxia Zhang, Matthew Klenk

Condition Weaving Meets Expert Modulation: Towards Universal and Controllable Image Generation

The image-to-image generation task aims to produce controllable images by leveraging conditional inputs and prompt instructions. However, existing methods often train separate control branches for each type of condition, leading to redundant model structures and inefficient use of computational resources. To address this, we propose a Unified image-to-image Generation (UniGen) framework that supports diverse conditional inputs while enhancing generation efficiency and expressiveness. Specifically, to tackle the widely existing parameter redundancy and computational inefficiency in controllable conditional generation architectures, we propose the Condition Modulated Expert (CoMoE) module. This module aggregates semantically similar patch features and assigns them to dedicated expert modules for visual representation and conditional modeling. By enabling independent modeling of foreground features under different conditions, CoMoE effectively mitigates feature entanglement and redundant computation in multi-condition scenarios. Furthermore, to bridge the information gap between the backbone and control branches, we propose WeaveNet, a dynamic, snake-like connection mechanism that enables effective interaction between global text-level control from the backbone and fine-grained control from conditional branches. Extensive experiments on the Subjects-200K and MultiGen-20M datasets across various conditional image generation tasks demonstrate that our method consistently achieves state-of-the-art performance, validating its advantages in both versatility and effectiveness. The code has been uploaded to https://github.com/gavin-gqzhang/UniGen.

TLDR: The paper introduces UniGen, a new image-to-image generation framework with a Condition Modulated Expert (CoMoE) module and WeaveNet to improve efficiency and control in multi-conditional image generation, achieving state-of-the-art results.

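TLDR: This paper introduces UniGen, a new image-to-image generation framework with a Condition Modulated Expert (CoMoE) module and WeaveNet, which improves the efficiency and controllability of multi-condition image generation and achieves state-of-the-art results.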

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Guoqing Zhang, Xingtong Ge, Lu Shi, Xin Zhang, Muqing Xue, Wanru Xu, Yigang Cen

DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions

Generating coherent and diverse human dances from music signals has made tremendous progress in animating virtual avatars. While existing methods support direct dance synthesis, they fail to recognize that enabling users to edit dance movements is far more practical in real-world choreography scenarios. Moreover, the lack of high-quality dance datasets incorporating iterative editing also limits progress on this challenge. To address this, we first construct DanceRemix, a large-scale multi-turn editable dance dataset with prompts, featuring over 25.3M dance frames and 84.5K pairs. In addition, we propose DanceEditor, a novel framework for iterative and editable dance generation coherently aligned with given music signals. Considering that the dance motion should be both rhythmically aligned with the music and iteratively editable through user descriptions, our framework is built upon a prediction-then-editing paradigm unifying multi-modal conditions. At the initial prediction stage, our framework improves the authority of generated results by directly modeling dance movements from tailored, aligned music. At the subsequent iterative editing stages, we incorporate text descriptions as conditioning information to produce the edited results through a specifically designed Cross-modality Editing Module (CEM). Specifically, CEM adaptively integrates the initial prediction with music and text prompts as temporal motion cues to guide the synthesized sequences. As a result, the outputs stay in harmony with the music while preserving fine-grained semantic alignment with text descriptions. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected DanceRemix dataset. Code is available at https://lzvsdy.github.io/DanceEditor/.

TLDR: The paper introduces DanceEditor, a framework for iteratively editing music-driven dance generation using user-defined text descriptions, along with a new large-scale dataset, DanceRemix, for training and evaluation.

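TLDR: This paper introduces DanceEditor, a framework that uses user-defined text descriptions to iteratively edit music-driven dance generation, together with a new large-scale dataset, DanceRemix, for training and evaluation.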

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Hengyuan Zhang, Zhe Li, Xingqun Qi, Mengze Li, Muyi Sun, Man Zhang, Sirui Han

PosBridge: Multi-View Positional Embedding Transplant for Identity-Aware Image Editing

Localized subject-driven image editing aims to seamlessly integrate user-specified objects into target scenes. As generative models continue to scale, training becomes increasingly costly in terms of memory and computation, highlighting the need for training-free and scalable editing frameworks. To this end, we propose PosBridge, an efficient and flexible framework for inserting custom objects. A key component of our method is positional embedding transplant, which guides the diffusion model to faithfully replicate the structural characteristics of reference objects. Meanwhile, we introduce the Corner Centered Layout, which concatenates reference images and the background image as input to the FLUX.1-Fill model. During progressive denoising, positional embedding transplant is applied to guide the noise distribution in the target region toward that of the reference object. In this way, Corner Centered Layout effectively directs the FLUX.1-Fill model to synthesize identity-consistent content at the desired location. Extensive experiments demonstrate that PosBridge outperforms mainstream baselines in structural consistency, appearance fidelity, and computational efficiency, showcasing its practical value and potential for broad adoption.

TLDR: The paper introduces PosBridge, a training-free and efficient framework for subject-driven image editing using positional embedding transplant to insert custom objects into scenes while maintaining identity and structural consistency.

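TLDR: This paper introduces PosBridge, a training-free and efficient subject-driven image editing framework that uses positional embedding transplant to insert custom objects into scenes while preserving identity and structural consistency.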

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Peilin Xiong, Junwen Chen, Honghui Yuan, Keiji Yanai

MMCIG: Multimodal Cover Image Generation for Text-only Documents and Its Dataset Construction via Pseudo-labeling

In this study, we introduce a novel cover image generation task that produces both a concise summary and a visually corresponding image from a given text-only document. Because no existing datasets are available for this task, we propose a multimodal pseudo-labeling method to construct high-quality datasets at low cost. We first collect documents that contain multiple images with captions, together with their summaries, and exclude factually inconsistent instances. Our approach then selects one image from the multiple images accompanying each document. Using the gold summary, we independently rank both the images and their captions. We then assign a pseudo-label to an image when both the image and its corresponding caption are ranked first in their respective rankings. Finally, we remove documents that contain direct image references within the text. Experimental results demonstrate that the proposed multimodal pseudo-labeling method constructs more precise datasets and generates higher quality images than text-only and image-only pseudo-labeling methods, which consider captions and images separately. We release our code at: https://github.com/HyeyeeonKim/MMCIG
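The agreement rule described above (label an image only when it and its caption both rank first against the gold summary) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names `rank_first` and `pseudo_label` are hypothetical, and the abstract does not say how images and captions are scored against the summary, so the per-item similarity scores below are made-up placeholders.

```python
def rank_first(scores):
    """Index of the top-ranked item, given one similarity score per item."""
    return max(range(len(scores)), key=scores.__getitem__)


def pseudo_label(image_scores, caption_scores):
    """Return the index of the image to pseudo-label, or None.

    image_scores[i] and caption_scores[i] are assumed to be the similarity of
    image i and of its caption to the gold summary (scoring method is an
    assumption; any ranking function could stand in here).
    """
    if not image_scores or len(image_scores) != len(caption_scores):
        return None
    top_image = rank_first(image_scores)
    top_caption = rank_first(caption_scores)
    # Label only when the image ranking and the caption ranking agree.
    return top_image if top_image == top_caption else None


# Both rankings put item 1 first, so image 1 receives the pseudo-label.
agree = pseudo_label([0.2, 0.9, 0.4], [0.1, 0.8, 0.3])
# The rankings disagree (image 1 vs. caption 0), so no label is assigned.
disagree = pseudo_label([0.2, 0.9, 0.4], [0.7, 0.5, 0.3])
```

Requiring agreement between the two rankings trades dataset size for precision: documents where the rankings disagree simply yield no labeled image.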

TLDR: The paper introduces a novel multimodal cover image generation task for text-only documents by creating a dataset using a multimodal pseudo-labeling approach, outperforming unimodal baselines.

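TLDR: This paper introduces a new multimodal cover image generation task for text-only documents. The method builds a dataset through multimodal pseudo-labeling and outperforms unimodal baselines.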

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Hyeyeon Kim, Sungwoo Han, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
