ArXiv CS.CV Papers (Image/Video Generation)

Emu3.5: Native Multimodal Models are World Learners

We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.

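TLDR: Emu3.5 is a large-scale multimodal model trained on over 10 trillion tokens, drawn primarily from internet videos. It natively predicts interleaved vision-language outputs, supports long-horizon generation, and performs strongly in image generation and editing, world exploration, and embodied manipulation.

TLDR (translated from Chinese): Emu3.5 is a large-scale multimodal model trained on trillions of tokens from internet videos. It supports native multimodal prediction and long-horizon generation, and performs strongly in image generation and embodied manipulation.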

Relevance: (10/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang

LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation

Recently, text-to-video generation has made impressive progress in producing short, high-quality clips, but evaluating long-form outputs remains a major challenge, especially when processing complex prompts. Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression. To address these gaps, we propose LoCoT2V-Bench, a benchmark specifically designed for long video generation (LVG) under complex input conditions. Based on various real-world videos, LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating elements like scene transitions and event dynamics. Moreover, it constructs a multi-dimensional evaluation framework that includes our newly proposed metrics such as event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD), which focuses on more abstract attributes like narrative flow, emotional response, and character development. Using this framework, we conduct a comprehensive evaluation of nine representative LVG models, finding that while current methods perform well on basic visual and temporal aspects, they struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence. Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form complex text-to-video generation and highlights critical directions for future method improvement.

TLDR: The paper introduces LoCoT2V-Bench, a new benchmark for evaluating long-form text-to-video generation with complex prompts, addressing limitations in existing benchmarks by focusing on fine-grained alignment, narrative coherence, and thematic expression.

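TLDR (translated from Chinese): The paper introduces LoCoT2V-Bench, a new benchmark for evaluating long-video generation models under complex prompts. It addresses the limitations of existing benchmarks by focusing on fine-grained alignment, narrative coherence, and thematic expression.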

Relevance: (10/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (9/10)

Authors: Xiangqing Zheng, Chengyue Wu, Kehai Chen, Min Zhang

OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes

There are two prevalent ways of constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.

TLDR: The paper introduces OmniX, a framework that uses 2D generative priors for panoramic perception, generation, and completion to create graphics-ready 3D scenes suitable for PBR and simulation.

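TLDR (translated from Chinese): The paper introduces OmniX, a framework that uses 2D generative priors for panoramic perception, generation, and completion, with the goal of creating graphics-ready 3D scenes suitable for physically based rendering (PBR) and simulation.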

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Yukun Huang, Jiwen Yu, Yanning Zhou, Jianan Wang, Xintao Wang, Pengfei Wan, Xihui Liu

The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.

TLDR: This paper introduces ViMoGen, a framework for improving 3D human motion generation by transferring knowledge from video generation, including a large-scale dataset (ViMoGen-228K), a flow-matching-based diffusion transformer, and a hierarchical benchmark (MBench). The framework demonstrates improved generalization and performance compared to existing methods.

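TLDR (translated from Chinese): This paper introduces ViMoGen, a framework that improves 3D human motion generation by transferring knowledge from video generation. It comprises a large-scale dataset (ViMoGen-228K), a flow-matching-based diffusion transformer, and a hierarchical benchmark (MBench), and shows better generalization and performance than existing methods.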

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu

Dynamic VLM-Guided Negative Prompting for Diffusion Models

We propose a novel approach for dynamic negative prompting in diffusion models that leverages Vision-Language Models (VLMs) to adaptively generate negative prompts during the denoising process. Unlike traditional negative prompting methods that use fixed negative prompts, our method generates intermediate image predictions at specific denoising steps and queries a VLM to produce contextually appropriate negative prompts. We evaluate our approach on various benchmark datasets and demonstrate the trade-offs between negative guidance strength and text-image alignment.
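
As a rough illustration of the mechanism described above (a sketch under stated assumptions, not the authors' implementation): negative prompting is commonly realized by using the negative-prompt prediction as the baseline branch of classifier-free guidance. In the toy code below, plain Python lists stand in for noise predictions. The names `predict_noise` and `query_vlm`, the refresh steps, and the default guidance scale are hypothetical placeholders, and the actual sampler update is omitted.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def guided_noise(eps_pos: Sequence[float], eps_neg: Sequence[float], scale: float) -> Vector:
    """Classifier-free guidance with the negative prompt as the baseline branch.

    eps = eps_neg + scale * (eps_pos - eps_neg)
    """
    return [n + scale * (p - n) for p, n in zip(eps_pos, eps_neg)]


def run_dynamic_negative_prompting(
    prompt: str,
    num_steps: int,
    refresh_steps: Sequence[int],
    predict_noise: Callable[[str, int], Vector],  # hypothetical denoiser: (prompt, step) -> noise
    query_vlm: Callable[[int], str],              # hypothetical VLM call: step -> negative prompt
    scale: float = 7.5,                           # assumed guidance scale, not from the paper
) -> List[Tuple[str, Vector]]:
    """Return (negative prompt, guided noise) per step; the sampler update is omitted."""
    negative = ""  # empty negative prompt until the first refresh
    trace = []
    for step in range(num_steps):
        if step in refresh_steps:
            # In the described method, the VLM would inspect an intermediate
            # image prediction here; this stand-in only receives the step index.
            negative = query_vlm(step)
        eps = guided_noise(predict_noise(prompt, step), predict_noise(negative, step), scale)
        trace.append((negative, eps))
    return trace
```

With a fixed negative prompt, `refresh_steps` would be empty and `negative` would be set once up front; the dynamic variant differs only in refreshing it at the chosen steps.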

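TLDR: This paper introduces a dynamic negative prompting technique for diffusion models that queries a VLM during denoising to generate context-aware negative prompts, and examines the trade-off between negative guidance strength and text-image alignment.

TLDR (translated from Chinese): The paper proposes a dynamic negative prompting technique for diffusion models. It uses a vision-language model to generate context-relevant negative prompts during the denoising process, with text-image alignment as the quantity of interest.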

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Hoyeon Chang, Seungjin Kim, Yoonseok Choi

SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing

Rectified flow models have become a de facto standard in image generation due to their stable sampling trajectories and high-fidelity outputs. Despite their strong generative capabilities, they face critical limitations in image editing tasks: inaccurate inversion processes for mapping real images back into the latent space, and gradient entanglement issues during editing often result in outputs that do not faithfully reflect the target prompt. Recent efforts have attempted to directly map source and target distributions via ODE-based approaches without inversion; however, these methods still yield suboptimal editing quality. In this work, we propose a flow decomposition-and-aggregation framework built upon an inversion-free formulation to address these limitations. Specifically, we semantically decompose the target prompt into multiple sub-prompts, compute an independent flow for each, and aggregate them to form a unified editing trajectory. While we empirically observe that decomposing the original flow enhances diversity in the target space, generating semantically aligned outputs still requires consistent guidance toward the full target prompt. To this end, we design a projection and soft-aggregation mechanism for flow, inspired by gradient conflict resolution in multi-task learning. This approach adaptively weights the sub-target velocity fields, suppressing semantic redundancy while emphasizing distinct directions, thereby preserving both diversity and consistency in the final edited output. Experimental results demonstrate that our method outperforms existing zero-shot editing approaches in terms of semantic fidelity and attribute disentanglement. The code is available at https://github.com/Harvard-AI-and-Robotics-Lab/SplitFlow.
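
To make the "projection and soft-aggregation" idea concrete, here is a minimal sketch that borrows a PCGrad-style projection from multi-task learning. It is an assumption-laden illustration, not the SplitFlow implementation: the abstract does not specify the projection rule or the weighting scheme, so the conflict test (negative dot product with the full-prompt velocity), the softmax weighting, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def project_out_conflict(v: Sequence[float], ref: Sequence[float]) -> Vector:
    """If v conflicts with ref (negative dot product), remove v's component along ref."""
    d = dot(v, ref)
    ref_sq = dot(ref, ref)
    if d >= 0.0 or ref_sq == 0.0:
        return list(v)
    return [x - (d / ref_sq) * r for x, r in zip(v, ref)]


def soft_aggregate(
    sub_velocities: Sequence[Sequence[float]],
    full_velocity: Sequence[float],
    temperature: float = 1.0,  # assumed hyperparameter
) -> Vector:
    """Project each sub-prompt velocity against the full-prompt velocity,
    then combine them with softmax weights (weighting rule is an assumption)."""
    projected = [project_out_conflict(v, full_velocity) for v in sub_velocities]
    scores = [dot(v, full_velocity) / temperature for v in projected]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(full_velocity)
    return [sum(w * v[i] for w, v in zip(weights, projected)) for i in range(dim)]
```

After projection, no sub-prompt velocity points against the full-prompt direction, so the aggregated velocity cannot undo progress toward the full target prompt; how the real method balances redundancy against distinct directions is described in the paper, not here.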

TLDR: The paper introduces SplitFlow, an inversion-free text-to-image editing framework that decomposes the target prompt into sub-prompts, computes independent flows for each, and aggregates them using a novel projection and soft-aggregation mechanism to improve semantic fidelity and attribute disentanglement.

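TLDR (translated from Chinese): The paper introduces SplitFlow, an inversion-free text-to-image editing framework. It decomposes the target prompt into multiple sub-prompts, computes an independent flow for each, and aggregates them with a novel projection and soft-aggregation mechanism to improve semantic fidelity and attribute disentanglement.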

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Sung-Hoon Yoon, Minghan Li, Gaspard Beaudouin, Congcong Wen, Muhammad Rafay Azhar, Mengyu Wang

MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency

Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. This discarding of informative data, together with optimizing for a single reward, tends to harm diversity, semantic fidelity, and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but also significantly speeds up training. Our proposed method, called MIRO, achieves state-of-the-art performance on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).

TLDR: The paper proposes MIRO, a method for pretraining text-to-image models by conditioning on multiple reward models during training, improving image quality and training efficiency compared to post-hoc reward optimization.

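TLDR (translated from Chinese): The paper proposes MIRO, a method that pretrains text-to-image models conditioned on multiple reward models during training, improving image quality and training efficiency relative to post-hoc reward optimization.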

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Nicolas Dufour, Lucas Degeorge, Arijit Ghosh, Vicky Kalogeiton, David Picard

Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws

Existing infrared and visible image fusion methods often face the dilemma of balancing modal information. Generative fusion methods reconstruct fused images by learning from data distributions, but their generative capabilities remain limited. Moreover, the lack of interpretability in modal information selection further affects the reliability and consistency of fusion results in complex scenarios. This manuscript revisits the essence of generative image fusion under the inspiration of human cognitive laws and proposes a novel infrared and visible image fusion method, termed HCLFuse. First, HCLFuse investigates the quantification theory of information mapping in unsupervised fusion networks, which leads to the design of a multi-scale mask-regulated variational bottleneck encoder. This encoder applies posterior probability modeling and information decomposition to extract accurate and concise low-level modal information, thereby supporting the generation of high-fidelity structural details. Furthermore, the probabilistic generative capability of the diffusion model is integrated with physical laws, forming a time-varying physical guidance mechanism that adaptively regulates the generation process at different stages, thereby enhancing the ability of the model to perceive the intrinsic structure of data and reducing dependence on data quality. Experimental results show that the proposed method achieves state-of-the-art fusion performance in qualitative and quantitative evaluations across multiple datasets and significantly improves semantic segmentation metrics. This fully demonstrates the advantages of this generative image fusion method, drawing inspiration from human cognition, in enhancing structural consistency and detail quality.

TLDR: The paper proposes a new infrared and visible image fusion method (HCLFuse) inspired by human cognitive laws, utilizing a multi-scale variational bottleneck encoder and a diffusion model with physical guidance to achieve state-of-the-art fusion performance and improved semantic segmentation metrics.

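TLDR (translated from Chinese): The paper proposes HCLFuse, a new infrared and visible image fusion method inspired by human cognitive laws. It uses a multi-scale variational bottleneck encoder and a physically guided diffusion model to achieve state-of-the-art fusion performance and improved semantic segmentation metrics.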

Relevance: (7/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (7/10)

Overall: (7/10)

Authors: Lin Guo, Xiaoqing Luo, Wei Xie, Zhancheng Zhang, Hui Li, Rui Wang, Zhenhua Feng, Xiaoning Song

FullPart: Generating each 3D Part at Full Resolution

Part-based 3D generation holds great potential for various applications. Previous part generators that represent parts using implicit vector-set tokens often suffer from insufficient geometric details. Another line of work adopts an explicit voxel representation but shares a global voxel grid among all parts; this often causes small parts to occupy too few voxels, leading to degraded quality. In this paper, we propose FullPart, a novel framework that combines both implicit and explicit paradigms. It first derives the bounding box layout through an implicit box vector-set diffusion process, a task that implicit diffusion handles effectively since box tokens contain little geometric detail. Then, it generates detailed parts, each within its own fixed full-resolution voxel grid. Instead of sharing a global low-resolution space, each part in our method, even a small one, is generated at full resolution, enabling the synthesis of intricate details. We further introduce a center-point encoding strategy to address the misalignment issue when exchanging information between parts of different actual sizes, thereby maintaining global coherence. Moreover, to tackle the scarcity of reliable part data, we present PartVerse-XL, the largest human-annotated 3D part dataset to date with 40K objects and 320K parts. Extensive experiments demonstrate that FullPart achieves state-of-the-art results in 3D part generation. We will release all code, data, and models to benefit future research in 3D part generation.
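
A small worked example may help show why a shared global grid starves small parts. The numbers below are illustrative assumptions (a 64-voxel grid resolution and a part spanning one eighth of the object's extent per axis); neither value comes from the paper, and the helper name is hypothetical.

```python
def voxels_per_axis_in_shared_grid(
    part_extent: float, object_extent: float, grid_resolution: int
) -> int:
    """Voxels along one axis that a part spans in a grid covering the whole object."""
    return max(1, round(grid_resolution * part_extent / object_extent))


GRID_RESOLUTION = 64   # assumed resolution, for illustration only
OBJECT_EXTENT = 1.0    # whole object normalized to unit size
PART_EXTENT = 0.125    # a small part: one eighth of the object per axis

# Shared global grid: the part only gets its proportional share of voxels.
shared_axis = voxels_per_axis_in_shared_grid(PART_EXTENT, OBJECT_EXTENT, GRID_RESOLUTION)
shared_voxels = shared_axis ** 3

# Per-part grid: the part's bounding box is mapped to a full-resolution grid of its own.
own_voxels = GRID_RESOLUTION ** 3

print(shared_axis)                  # 8
print(shared_voxels)                # 512
print(own_voxels)                   # 262144
print(own_voxels // shared_voxels)  # 512
```

Under these assumptions the small part is represented by 512 times as many voxels when it gets its own grid, which is the intuition behind generating each part at full resolution.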

TLDR: The paper introduces FullPart, a novel framework for generating high-resolution 3D parts by combining implicit and explicit representations, and a new large-scale part dataset (PartVerse-XL).

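TLDR (translated from Chinese): The paper introduces FullPart, a new framework that generates high-resolution 3D parts by combining implicit and explicit representations, and builds a new large-scale part dataset (PartVerse-XL).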

Relevance: (3/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (5/10)

Authors: Lihe Ding, Shaocong Dong, Yaokun Li, Chenjian Gao, Xiao Chen, Rui Han, Yihao Kuang, Hong Zhang, Bo Huang, Zhanpeng Huang, Zibin Wang, Dan Xu, Tianfan Xue
