ArXiv CS.CV Papers (Image/Video Generation)

MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models

Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and Gemini 2.5 Pro excel at following complex instructions, editing images and maintaining concept consistency. However, they are still evaluated by disjoint toolkits: text-to-image (T2I) benchmarks that lacks multi-modal conditioning, and customized image generation benchmarks that overlook compositional semantics and common knowledge. We propose MMIG-Bench, a comprehensive Multi-Modal Image Generation Benchmark that unifies these tasks by pairing 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects, spanning humans, animals, objects, and artistic styles. MMIG-Bench is equipped with a three-level evaluation framework: (1) low-level metrics for visual artifacts and identity preservation of objects; (2) novel Aspect Matching Score (AMS): a VQA-based mid-level metric that delivers fine-grained prompt-image alignment and shows strong correlation with human judgments; and (3) high-level metrics for aesthetics and human preference. Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5 Pro, FLUX, DreamBooth, and IP-Adapter, and validate our metrics with 32k human ratings, yielding in-depth insights into architecture and data design. We will release the dataset and evaluation code to foster rigorous, unified evaluation and accelerate future innovations in multi-modal image generation.

TLDR: The paper introduces MMIG-Bench, a comprehensive benchmark for evaluating multi-modal image generation models, addressing the limitations of existing toolkits by incorporating diverse tasks and a three-level evaluation framework. They benchmarked 17 SOTA models and will release the dataset and evaluation code.

TLDR: 该论文介绍了MMIG-Bench，一个综合性的多模态图像生成模型评估基准，通过整合多样化的任务和一个三级评估框架，解决了现有工具包的局限性。他们对17个最先进的模型进行了基准测试，并将发布数据集和评估代码。

Relevance: (10/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (9/10)

Overall: (9/10)

Read Paper (PDF)

Authors: Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel Aliaga, Wei Xiong, Jiebo Luo

FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.

TLDR: This paper introduces FUDOKI, a novel unified multimodal model based on discrete flow matching, offering an alternative to autoregressive models for visual understanding and image generation, achieving comparable performance with potential for future enhancement.

TLDR: 本文介绍了FUDOKI，一种基于离散流匹配的新型统一多模态模型，为视觉理解和图像生成提供了除自回归模型之外的替代方案，实现了可比的性能，并具有未来增强的潜力。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Jin Wang, Yao Lai, Aoxue Li, Shifeng Zhang, Jiacheng Sun, Ning Kang, Chengyue Wu, Zhenguo Li, Ping Luo

Dynamic-I2V: Exploring Image-to-Video Generaion Models via Multimodal LLM

Recent advancements in image-to-video (I2V) generation have shown promising performance in conventional scenarios. However, these methods still encounter significant challenges when dealing with complex scenes that require a deep understanding of nuanced motion and intricate object-action relationships. To address these challenges, we present Dynamic-I2V, an innovative framework that integrates Multimodal Large Language Models (MLLMs) to jointly encode visual and textual conditions for a diffusion transformer (DiT) architecture. By leveraging the advanced multimodal understanding capabilities of MLLMs, our model significantly improves motion controllability and temporal coherence in synthesized videos. The inherent multimodality of Dynamic-I2V further enables flexible support for diverse conditional inputs, extending its applicability to various downstream generation tasks. Through systematic analysis, we identify a critical limitation in current I2V benchmarks: a significant bias towards favoring low-dynamic videos, stemming from an inadequate balance between motion complexity and visual quality metrics. To resolve this evaluation gap, we propose DIVE - a novel assessment benchmark specifically designed for comprehensive dynamic quality measurement in I2V generation. In conclusion, extensive quantitative and qualitative experiments confirm that Dynamic-I2V attains state-of-the-art performance in image-to-video generation, particularly revealing significant improvements of 42.5%, 7.9%, and 11.8% in dynamic range, controllability, and quality, respectively, as assessed by the DIVE metric in comparison to existing methods.

TLDR: The paper introduces Dynamic-I2V, an image-to-video generation framework leveraging MLLMs and a custom benchmark, DIVE, to address limitations in generating complex, dynamic videos, achieving SotA results.

TLDR: 该论文介绍了Dynamic-I2V，一个利用多模态LLM和自定义基准DIVE的图像到视频生成框架，旨在解决生成复杂动态视频的局限性，并取得了最先进的结果。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Peng Liu, Xiaoming Ren, Fengkai Liu, Qingsong Xie, Quanlong Zheng, Yanhao Zhang, Haonan Lu, Yujiu Yang

ReDDiT: Rehashing Noise for Discrete Visual Generation

Discrete diffusion models are gaining traction in the visual generative area for their efficiency and compatibility. However, the pioneered attempts still fall behind the continuous counterparts, which we attribute to the noise (absorbing state) design and sampling heuristics. In this study, we propose the rehashing noise framework for discrete diffusion transformer, termed ReDDiT, to extend absorbing states and improve expressive capacity of discrete diffusion models. ReDDiT enriches the potential paths that latent variables can traverse during training with randomized multi-index corruption. The derived rehash sampler, which reverses the randomized absorbing paths, guarantees the diversity and low discrepancy of the generation process. These reformulations lead to more consistent and competitive generation quality, mitigating the need for heavily tuned randomness. Experiments show that ReDDiT significantly outperforms the baseline (reducing gFID from 6.18 to 1.61) and is on par with the continuous counterparts with higher efficiency.

TLDR: The paper introduces ReDDiT, a new discrete diffusion transformer model that uses a rehashing noise framework to improve the efficiency and generation quality of discrete visual generation, achieving performance on par with continuous models.

TLDR: 该论文介绍了ReDDiT，一种新的离散扩散Transformer模型，它使用重新哈希噪声框架来提高离散视觉生成的效率和生成质量，从而达到与连续模型相当的性能。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Tianren Ma, Xiaosong Zhang, Boyu Yang, Junlan Feng, Qixiang Ye

Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning

Current text-to-image diffusion generation typically employs complete-text conditioning. Due to the intricate syntax, diffusion transformers (DiTs) inherently suffer from a comprehension defect of complete-text captions. One-fly complete-text input either overlooks critical semantic details or causes semantic confusion by simultaneously modeling diverse semantic primitive types. To mitigate this defect of DiTs, we propose a novel split-text conditioning framework named DiT-ST. This framework converts a complete-text caption into a split-text caption, a collection of simplified sentences, to explicitly express various semantic primitives and their interconnections. The split-text caption is then injected into different denoising stages of DiT-ST in a hierarchical and incremental manner. Specifically, DiT-ST leverages Large Language Models to parse captions, extracting diverse primitives and hierarchically sorting out and constructing these primitives into a split-text input. Moreover, we partition the diffusion denoising process according to its differential sensitivities to diverse semantic primitive types and determine the appropriate timesteps to incrementally inject tokens of diverse semantic primitive types into input tokens via cross-attention. In this way, DiT-ST enhances the representation learning of specific semantic primitive types across different stages. Extensive experiments validate the effectiveness of our proposed DiT-ST in mitigating the complete-text comprehension defect.

TLDR: The paper introduces DiT-ST, a novel split-text conditioning framework for Diffusion Transformers (DiTs) that converts complete-text captions into simplified sentences to improve semantic comprehension and ultimately enhance text-to-image generation.

TLDR: 该论文介绍了一种名为DiT-ST的新型分割文本调节框架，用于扩散Transformer (DiTs)，该框架将完整的文本字幕转换为简化的句子，以提高语义理解，并最终增强文本到图像的生成。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (7/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Yu Zhang, Jialei Zhou, Xinchen Li, Qi Zhang, Zhongwei Wan, Tianyu Wang, Duoqian Miao, Changwei Wang, Longbing Cao

AIGC Daily Papers

MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models

FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

Dynamic-I2V: Exploring Image-to-Video Generaion Models via Multimodal LLM

ReDDiT: Rehashing Noise for Discrete Visual Generation

Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning