ArXiv CS.CV Papers (Image/Video Generation)

DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

Video generation models, as one form of world models, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real-world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset curated from both driving datasets and internet-scale video sources, spanning varied weather, time of day, geographic regions, and complex maneuvers, with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making.

TLDR: The paper introduces DrivingGen, a new benchmark for generative video world models in autonomous driving, featuring a diverse dataset and comprehensive metrics to evaluate visual realism, trajectory plausibility, temporal coherence, and controllability.

TLDR: 该论文介绍了DrivingGen，一个用于自动驾驶中生成式视频世界模型的新基准，它具有多样化的数据集和综合指标，用于评估视觉真实感、轨迹合理性、时间连贯性和可控性。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (9/10)

Overall: (9/10)

Read Paper (PDF)

Authors: Yang Zhou, Hao Shao, Letian Wang, Zhuofan Zong, Hongsheng Li, Steven L. Waslander

Guiding Token-Sparse Diffusion Models

Diffusion models deliver high quality in image synthesis but remain expensive during training and inference. Recent works have leveraged the inherent redundancy in visual content to make training more affordable by training only on a subset of visual information. While these methods were successful in providing cheaper and more effective training, sparsely trained diffusion models struggle in inference. This is due to their lacking response to Classifier-free Guidance (CFG) leading to underwhelming performance during inference. To overcome this, we propose Sparse Guidance (SG). Instead of using conditional dropout as a signal to guide diffusion models, SG uses token-level sparsity. As a result, SG preserves the high-variance of the conditional prediction better, achieving good quality and high variance outputs. Leveraging token-level sparsity at inference, SG improves fidelity at lower compute, achieving 1.58 FID on the commonly used ImageNet-256 benchmark with 25% fewer FLOPs, and yields up to 58% FLOP savings at matched baseline quality. To demonstrate the effectiveness of Sparse Guidance, we train a 2.5B text-to-image diffusion model using training time sparsity and leverage SG during inference. SG achieves improvements in composition and human preference score while increasing throughput at the same time.

TLDR: This paper introduces Sparse Guidance (SG) to improve the inference quality of token-sparse diffusion models by leveraging token-level sparsity for conditional guidance, leading to improved fidelity and reduced FLOPs.

TLDR: 本文介绍了稀疏引导 (SG)，通过利用令牌级稀疏性进行条件引导，提高了令牌稀疏扩散模型的推理质量，从而提高了保真度并减少了 FLOP。

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Felix Krause, Stefan Andreas Baumann, Johannes Schusterbauer, Olga Grebenkova, Ming Gui, Vincent Tao Hu, Björn Ommer

Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

Manual font design is an intricate process that transforms a stylistic visual concept into a coherent glyph set. This challenge persists in automated Few-shot Font Generation (FFG), where models often struggle to preserve both the structural integrity and stylistic fidelity from limited references. While autoregressive (AR) models have demonstrated impressive generative capabilities, their application to FFG is constrained by conventional patch-level tokenization, which neglects global dependencies crucial for coherent font synthesis. Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. GAR-Font introduces a global-aware tokenizer that effectively captures both local structures and global stylistic patterns, a multimodal style encoder offering flexible style control through a lightweight language-style adapter without requiring intensive multimodal pretraining, and a post-refinement pipeline that further enhances structural fidelity and style coherence. Extensive experiments show that GAR-Font outperforms existing FFG methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance.

TLDR: This paper introduces GAR-Font, a global-aware autoregressive model for multimodal few-shot font generation that leverages a novel tokenizer and a language-style adapter to improve structural integrity and stylistic fidelity with textual guidance.

TLDR: 本文介绍了一种名为GAR-Font的全局感知自回归模型，用于多模态小样本字体生成。该模型利用一种新颖的tokenizer和一个语言-风格适配器，以提升结构完整性和风格保真度，并支持文本风格引导。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Haonan Cai, Yuxuan Luo, Zhouhui Lian

MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning

Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the limitations of standard classifier-free guidance in multimodal settings, we propose a noise-based negative conditioning strategy that utilizes natural noise priors to significantly enhance acoustic fidelity. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.

TLDR: MM-Sonate is a multimodal flow-matching framework that achieves state-of-the-art controllable audio-video generation with zero-shot voice cloning by using a unified instruction-phoneme input and a novel noise-based negative conditioning strategy.

TLDR: MM-Sonate是一个多模态流匹配框架，通过统一的指令-音素输入和一种新的基于噪声的负条件策略，实现了最先进的可控音频-视频生成和零样本语音克隆。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Chunyu Qiang, Jun Wang, Xiaopeng Wang, Kang Yin, Yuxin Guo, Xijuan Zeng, Nan Li, Zihan Li, Yuzhe Liang, Ziyu Zhang, Teng Ma, Yushen Chen, Zhongliang Liu, Feng Deng, Chen Zhang, Pengfei Wan

Improving Flexible Image Tokenizers for Autoregressive Image Generation

Flexible image tokenizers aim to represent an image using an ordered 1D variable-length token sequence. This flexible tokenization is typically achieved through nested dropout, where a portion of trailing tokens is randomly truncated during training, and the image is reconstructed using the remaining preceding sequence. However, this tail-truncation strategy inherently concentrates the image information in the early tokens, limiting the effectiveness of downstream AutoRegressive (AR) image generation as the token length increases. To overcome these limitations, we propose \textbf{ReToK}, a flexible tokenizer with \underline{Re}dundant \underline{Tok}en Padding and Hierarchical Semantic Regularization, designed to fully exploit all tokens for enhanced latent modeling. Specifically, we introduce \textbf{Redundant Token Padding} to activate tail tokens more frequently, thereby alleviating information over-concentration in the early tokens. In addition, we apply \textbf{Hierarchical Semantic Regularization} to align the decoding features of earlier tokens with those from a pre-trained vision foundation model, while progressively reducing the regularization strength toward the tail to allow finer low-level detail reconstruction. Extensive experiments demonstrate the effectiveness of ReTok: on ImageNet 256$\times$256, our method achieves superior generation performance compared with both flexible and fixed-length tokenizers. Code will be available at: \href{https://github.com/zfu006/ReTok}{https://github.com/zfu006/ReTok}

TLDR: This paper introduces ReTok, a flexible image tokenizer with redundant token padding and hierarchical semantic regularization, designed to improve autoregressive image generation performance by addressing information concentration in early tokens.

TLDR: 该论文介绍了ReTok，一种灵活的图像分词器，具有冗余令牌填充和分层语义正则化，旨在通过解决早期令牌中的信息集中问题来提高自回归图像生成性能。

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Zixuan Fu, Lanqing Guo, Chong Wang, Binbin Song, Ding Liu, Bihan Wen

Slot-ID: Identity-Preserving Video Generation from Reference Videos via Slot-Based Temporal Identity Encoding

Producing prompt-faithful videos that preserve a user-specified identity remains challenging: models need to extrapolate facial dynamics from sparse reference while balancing the tension between identity preservation and motion naturalness. Conditioning on a single image completely ignores the temporal signature, which leads to pose-locked motions, unnatural warping, and "average" faces when viewpoints and expressions change. To this end, we introduce an identity-conditioned variant of a diffusion-transformer video generator which uses a short reference video rather than a single portrait. Our key idea is to incorporate the dynamics in the reference. A short clip reveals subject-specific patterns, e.g., how smiles form, across poses and lighting. From this clip, a Sinkhorn-routed encoder learns compact identity tokens that capture characteristic dynamics while remaining pretrained backbone-compatible. Despite adding only lightweight conditioning, the approach consistently improves identity retention under large pose changes and expressive facial behavior, while maintaining prompt faithfulness and visual realism across diverse subjects and prompts.

TLDR: The paper introduces Slot-ID, a method using a short reference video and slot-based temporal identity encoding to improve identity preservation in video generation, addressing issues like pose-locked motions and unnatural warping by capturing subject-specific dynamics.

TLDR: 该论文介绍了Slot-ID，一种利用短参考视频和基于槽的时序身份编码的方法来改善视频生成中的身份保留，通过捕获特定主体的动态来解决姿势锁定动作和不自然扭曲等问题。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Yixuan Lai, He Wang, Kun Zhou, Tianjia Shao

DeepInv: A Novel Self-supervised Learning Approach for Fast and Accurate Diffusion Inversion

Diffusion inversion is a task of recovering the noise of an image in a diffusion model, which is vital for controllable diffusion image editing. At present, diffusion inversion still remains a challenging task due to the lack of viable supervision signals. Thus, most existing methods resort to approximation-based solutions, which however are often at the cost of performance or efficiency. To remedy these shortcomings, we propose a novel self-supervised diffusion inversion approach in this paper, termed Deep Inversion (DeepInv). Instead of requiring ground-truth noise annotations, we introduce a self-supervised objective as well as a data augmentation strategy to generate high-quality pseudo noises from real images without manual intervention. Based on these two innovative designs, DeepInv is also equipped with an iterative and multi-scale training regime to train a parameterized inversion solver, thereby achieving the fast and accurate image-to-noise mapping. To the best of our knowledge, this is the first attempt of presenting a trainable solver to predict inversion noise step by step. The extensive experiments show that our DeepInv can achieve much better performance and inference speed than the compared methods, e.g., +40.435% SSIM than EasyInv and +9887.5% speed than ReNoise on COCO dataset. Moreover, our careful designs of trainable solvers can also provide insights to the community. Codes and model parameters will be released in https://github.com/potato-kitty/DeepInv.

TLDR: The paper proposes a self-supervised learning approach called DeepInv for fast and accurate diffusion inversion, a crucial step for controllable diffusion image editing, by training a parametric inversion solver with pseudo noises generated from real images.

TLDR: 本文提出了一种名为DeepInv的自监督学习方法，用于快速准确的扩散反演。扩散反演是可控扩散图像编辑的关键步骤。DeepInv通过训练参数化的反演求解器，并使用从真实图像生成的伪噪声进行训练。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Ziyue Zhang, Luxi Lin, Xiaolin Hu, Chao Chang, HuaiXi Wang, Yiyi Zhou, Rongrong Ji

Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization

Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward improving verification capability and a decoupled optimization mechanism enabling synergistic optimization of generation and verification. Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness. Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to +34.1% higher verification AUC and -53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.

TLDR: This paper proposes Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that improves both the generation and self-verification capabilities of vision-language models, achieving significant performance gains and reduced inference time across several tasks.

TLDR: 该论文提出了优势解耦偏好优化 (ADPO)，这是一个统一的强化学习框架，可以提高视觉-语言模型的生成和自我验证能力，在多个任务中实现显著的性能提升并减少推理时间。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Xinyu Qiu, Heng Jia, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Yi Yang, Linchao Zhu

Image Synthesis Using Spintronic Deep Convolutional Generative Adversarial Network

The computational requirements of generative adversarial networks (GANs) exceed the limit of conventional Von Neumann architectures, necessitating energy efficient alternatives such as neuromorphic spintronics. This work presents a hybrid CMOS-spintronic deep convolutional generative adversarial network (DCGAN) architecture for synthetic image generation. The proposed generative vision model approach follows the standard framework, leveraging generator and discriminators adversarial training with our designed spintronics hardware for deconvolution, convolution, and activation layers of the DCGAN architecture. To enable hardware aware spintronic implementation, the generator's deconvolution layers are restructured as zero padded convolution, allowing seamless integration with a 6-bit skyrmion based synapse in a crossbar, without compromising training performance. Nonlinear activation functions are implemented using a hybrid CMOS domain wall based Rectified linear unit (ReLU) and Leaky ReLU units. Our proposed tunable Leaky ReLU employs domain wall position coded, continuous resistance states and a piecewise uniaxial parabolic anisotropy profile with a parallel MTJ readout, exhibiting energy consumption of 0.192 pJ. Our spintronic DCGAN model demonstrates adaptability across both grayscale and colored datasets, achieving Fr'echet Inception Distances (FID) of 27.5 for the Fashion MNIST and 45.4 for Anime Face datasets, with testing energy (training energy) of 4.9 nJ (14.97~nJ/image) and 24.72 nJ (74.7 nJ/image).

TLDR: This paper presents a hybrid CMOS-spintronic deep convolutional generative adversarial network (DCGAN) architecture for energy-efficient synthetic image generation, achieving promising FID scores on Fashion MNIST and Anime Face datasets.

TLDR: 本文提出了一种混合CMOS-自旋电子深度卷积生成对抗网络（DCGAN）架构，用于实现节能的合成图像生成，并在Fashion MNIST和Anime Face数据集上取得了良好的FID分数。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Saumya Gupta, Abhinandan, Venkatesh vadde, Bhaskaran Muralidharan, Abhishek Sharma

AIGC Daily Papers

DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

Guiding Token-Sparse Diffusion Models

Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning

Improving Flexible Image Tokenizers for Autoregressive Image Generation

Slot-ID: Identity-Preserving Video Generation from Reference Videos via Slot-Based Temporal Identity Encoding

DeepInv: A Novel Self-supervised Learning Approach for Fast and Accurate Diffusion Inversion

Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization

Image Synthesis Using Spintronic Deep Convolutional Generative Adversarial Network