ArXiv CS.CV Papers (Image/Video Generation)

Advancing Open-source World Models

We present LingBot-World, an open-sourced world simulator stemming from video generation. Positioned as a top-tier world model, LingBot-World offers the following features. (1) It maintains high fidelity and robust dynamics in a broad spectrum of environments, including realism, scientific contexts, cartoon styles, and beyond. (2) It enables a minute-level horizon while preserving contextual consistency over time, which is also known as "long-term memory". (3) It supports real-time interactivity, achieving a latency of under 1 second when producing 16 frames per second. We provide public access to the code and model in an effort to narrow the divide between open-source and closed-source technologies. We believe our release will empower the community with practical applications across areas like content creation, gaming, and robot learning.

TLDR: The paper introduces LingBot-World, an open-source world simulator based on video generation boasting high fidelity, long-term memory, and real-time interactivity across various environments, aimed at bridging the gap between open and closed-source AI.

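TLDR (translated from Chinese): This paper introduces LingBot-World, an open-source world simulator built on video generation that offers high fidelity, long-term memory, and real-time interactivity, works across many kinds of environments, and aims to bridge the gap between open-source and closed-source AI.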

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, Hao Ouyang

Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits

As social media platforms proliferate, users increasingly demand intuitive ways to create diverse, high-quality portrait collections. In this work, we introduce Portrait Collection Generation (PCG), a novel task that generates coherent portrait collections by editing a reference portrait image through natural language instructions. This task poses two unique challenges to existing methods: (1) complex multi-attribute modifications such as pose, spatial layout, and camera viewpoint; and (2) high-fidelity detail preservation including identity, clothing, and accessories. To address these challenges, we propose CHEESE, the first large-scale PCG dataset containing 24K portrait collections and 573K samples with high-quality modification text annotations, constructed through a Large Vision-Language Model-based pipeline with inversion-based verification. We further propose SCheese, a framework that combines text-guided generation with hierarchical identity and detail preservation. SCheese employs an adaptive feature fusion mechanism to maintain identity consistency, and ConsistencyNet to inject fine-grained features for detail consistency. Comprehensive experiments validate the effectiveness of CHEESE in advancing PCG, with SCheese achieving state-of-the-art performance.

TLDR: The paper introduces Portrait Collection Generation (PCG), a novel task for generating coherent portrait collections via natural language editing, along with a new dataset (CHEESE) and a corresponding method (SCheese) for addressing the challenges of multi-attribute modification and detail preservation.

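TLDR (translated from Chinese): This paper introduces Portrait Collection Generation (PCG), a new task that generates coherent portrait collections through natural language edits, and proposes a new dataset (CHEESE) and a corresponding method (SCheese) to address the challenges of multi-attribute modification and detail preservation.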

Relevance: (9/10)

Novelty: (9/10)

Clarity: (10/10)

Potential Impact: (8/10)

Overall: (9/10)

Authors: Zelong Sun, Jiahui Wu, Ying Ba, Dong Jing, Zhiwu Lu

Compression Tells Intelligence: Visual Coding, Visual Token Technology, and the Unification

The claim that "Compression Tells Intelligence" is supported by research in artificial intelligence, particularly on (multimodal) large language models (LLMs/MLLMs), where compression efficiency often correlates with improved model performance and capabilities. For compression, classical visual coding based on traditional information theory has developed over decades, achieving great success with numerous international industrial standards widely applied in multimedia (e.g., image/video) systems. Beyond that, the recently emerging visual token technology of generative multimodal large models shares a similar fundamental objective with visual coding: maximizing semantic information fidelity during representation learning while minimizing computational cost. This paper therefore first provides a comprehensive overview of two dominant technique families, Visual Coding and Visual Token Technology, and then unifies them from the perspective of optimization, discussing the essence of the trade-off between compression efficiency and model performance. Next, based on the proposed unified formulation bridging visual coding and visual token technology, we synthesize bidirectional insights between the two and forecast next-generation visual codec and token techniques. Last but not least, we experimentally show the large potential of task-oriented token developments in more practical tasks such as multimodal LLMs (MLLMs), AI-generated content (AIGC), and embodied AI, and shed light on the future possibility of standardizing a general token technology, like traditional codecs (e.g., H.264/265), that serves a wide range of intelligent tasks efficiently in a unified and effective manner.

TLDR: This paper explores the connection between visual coding and visual token technology in multimodal LLMs, unifying them under an optimization framework and suggesting future directions for task-oriented token development and standardization.

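TLDR (translated from Chinese): This paper examines the connection between visual coding and visual token technology in multimodal LLMs, unifies them under a single optimization framework, and proposes future directions for task-oriented token development and standardization.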

Relevance: (8/10)

Novelty: (7/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Xin Jin, Jinming Liu, Yuntao Wei, Junyan Lin, Zhicheng Wang, Jianguo Huang, Xudong Yang, Yanxiao Liu, Wenjun Zeng
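
The abstract above does not spell out its unified formulation. As a point of reference only, classical codecs commonly pick an operating point by minimizing a Lagrangian cost J = D + lambda * R over rate R and distortion D. The sketch below shows that textbook trade-off; the function name, candidate labels, and numbers are hypothetical and are not taken from the paper.

```python
def best_operating_point(candidates, lam):
    """Pick the candidate minimizing the Lagrangian cost D + lam * R.

    candidates: list of (name, rate_bits, distortion) tuples (illustrative values).
    lam: Lagrange multiplier; larger values penalize rate more heavily.
    """
    return min(candidates, key=lambda c: c[2] + lam * c[1])


# Hypothetical operating points: more bits buy lower distortion.
points = [("coarse", 100, 9.0), ("medium", 400, 4.0), ("fine", 1600, 1.0)]

cheap_rate = best_operating_point(points, lam=0.001)   # rate is almost free
costly_rate = best_operating_point(points, lam=0.1)    # rate is expensive
```

Sweeping lambda traces the trade-off curve: small values favor fidelity and large values favor compactness. A token budget for a multimodal model could stand in for rate in the same cost, though that mapping is an assumption here.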

Detecting and Mitigating Memorization in Diffusion Models through Anisotropy of the Log-Probability

Diffusion-based image generative models produce high-fidelity images through iterative denoising but remain vulnerable to memorization, where they unintentionally reproduce exact copies or parts of training images. Recent memorization detection methods rely primarily on the norm of the score difference as an indicator of memorization. We prove that such norm-based metrics are mainly effective under the assumption of isotropic log-probability distributions, which generally holds at high or medium noise levels. In contrast, analyzing the anisotropic regime reveals that memorized samples exhibit strong angular alignment between the guidance vector and unconditional scores in the low-noise setting. Through these insights, we develop a memorization detection metric by integrating isotropic norm and anisotropic alignment. Our detection metric can be computed directly on pure noise inputs via two conditional and unconditional forward passes, eliminating the need for costly denoising steps. Detection experiments on Stable Diffusion v1.4 and v2 show that our metric outperforms existing denoising-free detection methods while being at least approximately 5x faster than the previous best approach. Finally, we demonstrate the effectiveness of our approach by utilizing a mitigation strategy that adapts memorized prompts based on our developed metric.

TLDR: This paper introduces a new metric for detecting memorization in diffusion models, leveraging the anisotropy of log-probability distributions. It outperforms existing methods in speed and accuracy and demonstrates a mitigation strategy.

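TLDR (translated from Chinese): This paper proposes a new memorization detection metric for diffusion models that exploits the anisotropy of the log-probability distribution. The method is faster and more accurate than existing methods, and the paper also demonstrates a mitigation strategy.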

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Rohan Asthana, Vasileios Belagiannis
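
The two signals named in the abstract above, the norm of the conditional-minus-unconditional difference and its angular alignment with the unconditional prediction, can be sketched as follows. This is a minimal illustration, not the paper's metric: the inputs are assumed to be flattened model outputs from one conditional and one unconditional pass, and the way the two signals are combined (`combined_score` with weight `lam`) is a placeholder, since the abstract does not give the actual rule.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def memorization_signals(pred_cond, pred_uncond):
    """Return (guidance norm, cosine alignment with the unconditional prediction).

    pred_cond / pred_uncond: flattened model outputs for the same noisy input,
    with and without the text prompt (hypothetical inputs).
    """
    guidance = [c - u for c, u in zip(pred_cond, pred_uncond)]
    g_norm = norm(guidance)
    u_norm = norm(pred_uncond)
    if g_norm == 0.0 or u_norm == 0.0:
        return g_norm, 0.0  # alignment is undefined for a zero vector
    return g_norm, dot(guidance, pred_uncond) / (g_norm * u_norm)


def combined_score(pred_cond, pred_uncond, lam=1.0):
    """Placeholder combination: norm plus weighted absolute alignment (an assumption)."""
    g_norm, cos = memorization_signals(pred_cond, pred_uncond)
    return g_norm + lam * abs(cos)
```

The cosine is unchanged if both vectors are rescaled or negated together, so it does not matter whether the inputs are noise predictions or scores, which differ by a negative scale factor at a fixed noise level.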

Latent Temporal Discrepancy as Motion Prior: A Loss-Weighting Strategy for Dynamic Fidelity in T2V

Video generation models have achieved notable progress in static scenarios, yet their performance in motion video generation remains limited, with quality degrading under drastic dynamic changes. This is due to noise disrupting temporal coherence and increasing the difficulty of learning dynamic regions. Unfortunately, existing diffusion models rely on static loss for all scenarios, constraining their ability to capture complex dynamics. To address this issue, we introduce Latent Temporal Discrepancy (LTD) as a motion prior to guide loss weighting. LTD measures frame-to-frame variation in the latent space, assigning larger penalties to regions with higher discrepancy while maintaining regular optimization for stable regions. This motion-aware strategy stabilizes training and enables the model to better reconstruct high-frequency dynamics. Extensive experiments on the general benchmark VBench and the motion-focused VMBench show consistent gains, with our method outperforming strong baselines by 3.31% on VBench and 3.58% on VMBench, achieving significant improvements in motion quality.

TLDR: The paper introduces Latent Temporal Discrepancy (LTD), a motion-aware loss weighting strategy in the latent space of video generation models, to improve the generation of dynamic motion by assigning larger penalties to highly dynamic regions. Experiments on VBench and VMBench demonstrate performance improvements.

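TLDR (translated from Chinese): This paper introduces Latent Temporal Discrepancy (LTD), a motion-aware loss-weighting strategy in the latent space of video generation models that improves the quality of generated motion by assigning larger penalties to highly dynamic regions. Experiments on VBench and VMBench show improved performance.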

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Meiqi Wu, Bingze Song, Ruimin Lin, Chen Zhu, Xiaokun Feng, Jiahong Wu, Xiangxiang Chu, Kaiqi Huang
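
The loss-weighting idea in the abstract above can be sketched roughly as below: measure the absolute difference between consecutive latent frames and raise the loss weight where that difference is large. The normalization by the peak discrepancy, the `alpha` scale, and the weight of 1.0 for stable regions are all assumptions made for this illustration; the abstract does not specify the actual weighting function.

```python
def ltd_weights(latents, alpha=1.0):
    """Per-element loss weights from frame-to-frame latent differences.

    latents: list of frames, each a flat list of floats (hypothetical layout).
    Returns weights of the same shape; stable regions get weight 1.0.
    """
    prev = latents[0]
    disc = [[0.0] * len(prev)]  # the first frame has no predecessor
    for frame in latents[1:]:
        disc.append([abs(a - b) for a, b in zip(frame, prev)])
        prev = frame
    peak = max(max(row) for row in disc)
    if peak == 0.0:
        return [[1.0] * len(row) for row in disc]  # fully static clip
    return [[1.0 + alpha * d / peak for d in row] for row in disc]


def weighted_mse(pred, target, weights):
    """Mean of weight * squared error over all elements."""
    total, count = 0.0, 0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            total += w * (p - t) ** 2
            count += 1
    return total / count


# Toy clip: the second latent position moves, the first stays still.
clip = [[0.0, 0.0], [0.0, 2.0], [0.0, 3.0]]
weights = ltd_weights(clip)
```

In this toy clip only the moving position receives weights above 1.0, so errors there contribute more to the loss than errors in the static position.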

Efficient Autoregressive Video Diffusion with Dummy Head

The autoregressive video diffusion model has recently gained considerable research interest due to its causal modeling and iterative denoising. In this work, we identify that the multi-head self-attention in these models under-utilizes historical frames: approximately 25% of heads attend almost exclusively to the current frame, and discarding their KV caches incurs only minor performance degradation. Building upon this, we propose Dummy Forcing, a simple yet effective method to control context accessibility across different heads. Specifically, the proposed heterogeneous memory allocation reduces head-wise context redundancy, accompanied by dynamic head programming to adaptively classify head types. Moreover, we develop a context packing technique to achieve more aggressive cache compression. Without additional training, our Dummy Forcing delivers up to 2.0x speedup over the baseline, supporting video generation at 24.3 FPS with less than 0.5% quality drop. Project page is available at https://csguoh.github.io/project/DummyForcing/.

TLDR: This paper proposes an optimization for autoregressive video diffusion models by identifying and mitigating redundancy in multi-head self-attention, leading to significant speedups (up to 2x) with minimal quality loss.

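TLDR (translated from Chinese): This paper proposes an optimization for autoregressive video diffusion models that identifies and reduces redundancy in multi-head self-attention, yielding a significant speedup (up to 2x) with minimal quality loss.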

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Hang Guo, Zhaoyang Jia, Jiahao Li, Bin Li, Yuanhao Cai, Jiangshan Wang, Yawei Li, Yan Lu
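
The observation in the abstract above, that some heads attend almost exclusively to the current frame, suggests a simple diagnostic: measure how much attention mass each head places on current-frame keys. The sketch below does that with a fixed threshold. It is only an illustration of the observation; the threshold value, the label names, and the assumption that current-frame keys sit at the end of the key axis are hypothetical, and the paper's dynamic head programming is not reproduced here.

```python
def current_frame_mass(attn, num_current):
    """Average attention mass that queries place on current-frame keys.

    attn: list of rows, one per query, each a probability distribution over keys.
    The last num_current keys are assumed to belong to the current frame.
    """
    if num_current < 1:
        raise ValueError("num_current must be at least 1")
    per_query = [sum(row[-num_current:]) for row in attn]
    return sum(per_query) / len(per_query)


def classify_heads(attn_per_head, num_current, threshold=0.95):
    """Label each head 'current-only' or 'context' (hypothetical labels)."""
    return [
        "current-only" if current_frame_mass(attn, num_current) >= threshold else "context"
        for attn in attn_per_head
    ]


# Two toy heads over 2 history keys followed by 2 current-frame keys.
head_a = [[0.01, 0.01, 0.49, 0.49], [0.0, 0.02, 0.5, 0.48]]
head_b = [[0.4, 0.3, 0.2, 0.1]]
labels = classify_heads([head_a, head_b], num_current=2)
```

Heads labeled "current-only" are the ones whose historical KV cache could plausibly be dropped or shrunk, which is the kind of head-wise memory allocation the abstract describes.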

StreamFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs

Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for network topologies on modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present StreamFusion, a topology-aware efficient DiT serving engine. StreamFusion incorporates three key innovations: (1) a topology-aware sequence parallelism technique that accounts for inter- and intra-machine bandwidth differences, (2) Torus Attention, a novel SP technique enabling overlapping of inter-machine all-to-all operations with computation, and (3) a one-sided communication implementation that minimizes GPU sender-receiver synchronization and computation overheads. Our experiments demonstrate that StreamFusion outperforms the state-of-the-art approach by an average of 1.35x (up to 1.77x).

TLDR: The paper introduces StreamFusion, a new DiT serving engine that improves inference speed and efficiency on multi-GPU systems by addressing communication bottlenecks inherent in sequence parallelism strategies. It achieves this through topology-aware parallelism, a novel Torus Attention mechanism, and optimized one-sided communication, demonstrating significant performance gains.

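TLDR (translated from Chinese): This paper introduces StreamFusion, a new DiT serving engine that improves inference speed and efficiency on multi-GPU systems by addressing the communication bottlenecks inherent in sequence parallelism strategies. It does so through topology-aware parallelism, a novel Torus Attention mechanism, and optimized one-sided communication, and it demonstrates significant performance gains.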

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Jiacheng Yang, Jun Wu, Yaoyao Ding, Zhiying Xu, Yida Wang, Gennady Pekhimenko
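
For context on the all-to-all step that the abstract above identifies as a bottleneck, the sketch below simulates the data movement of Ulysses-style sequence parallelism in a single process: each rank starts with a slice of the sequence for all heads and ends with the full sequence for a slice of the heads. It illustrates the baseline pattern only, not StreamFusion's Torus Attention or its one-sided communication; the nested-list layout and function name are assumptions, and no real communication library is involved.

```python
def ulysses_all_to_all(seq_shards, num_ranks):
    """Simulate the sequence-to-head resharding of Ulysses-style attention.

    seq_shards[r][t][h]: value held by rank r for its local token t and head h.
    Returns head_shards[r][t][h_local]: rank r holds every token, but only
    its own contiguous block of heads.
    """
    num_heads = len(seq_shards[0][0])
    if num_heads % num_ranks != 0:
        raise ValueError("number of heads must be divisible by number of ranks")
    heads_per_rank = num_heads // num_ranks

    head_shards = []
    for dst in range(num_ranks):
        lo = dst * heads_per_rank
        full_sequence = []
        # Visiting source ranks in order restores the original token order.
        for src in range(num_ranks):
            for token in seq_shards[src]:
                full_sequence.append(token[lo:lo + heads_per_rank])
        head_shards.append(full_sequence)
    return head_shards


# 2 ranks, 4 tokens, 2 heads; strings label (token, head) for readability.
shards = [
    [["t0h0", "t0h1"], ["t1h0", "t1h1"]],  # rank 0 holds tokens 0-1
    [["t2h0", "t2h1"], ["t3h0", "t3h1"]],  # rank 1 holds tokens 2-3
]
resharded = ulysses_all_to_all(shards, num_ranks=2)
```

Every rank exchanges data with every other rank in this step, which is why its cost depends heavily on whether the ranks share a machine or communicate across machines.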

DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment

Recent GRPO-based approaches built on flow matching models have shown remarkable improvements in human preference alignment for text-to-image generation. Nevertheless, they still suffer from the sparse reward problem: the terminal reward of the entire denoising trajectory is applied to all intermediate steps, resulting in a mismatch between the global feedback signals and the exact fine-grained contributions at intermediate denoising steps. To address this issue, we introduce DenseGRPO, a novel framework that aligns human preference with dense rewards, which evaluate the fine-grained contribution of each denoising step. Specifically, our approach includes two key components: (1) we propose to predict the step-wise reward gain as the dense reward of each denoising step, applying a reward model to the intermediate clean images via an ODE-based approach. This ensures alignment between feedback signals and the contributions of individual steps, facilitating effective training; and (2) based on the estimated dense rewards, we reveal a mismatch between the uniform exploration setting and the time-varying noise intensity in existing GRPO-based methods, which leads to an inappropriate exploration space. We therefore propose a reward-aware scheme to calibrate the exploration space by adaptively adjusting a timestep-specific stochasticity injection in the SDE sampler, ensuring a suitable exploration space at all timesteps. Extensive experiments on multiple standard benchmarks demonstrate the effectiveness of the proposed DenseGRPO and highlight the critical role of the valid dense rewards in flow matching model alignment.

TLDR: The paper introduces DenseGRPO, a novel framework that aligns human preference in text-to-image generation with dense rewards by predicting step-wise reward gains for each denoising step and calibrating the exploration space with a reward-aware scheme.

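TLDR (translated from Chinese): This paper introduces DenseGRPO, a new framework that aligns human preference in text-to-image generation with dense rewards by predicting a step-wise reward gain for each denoising step and calibrating the exploration space with a reward-aware scheme.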

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Haoyou Deng, Keyu Yan, Chaojie Mao, Xiang Wang, Yu Liu, Changxin Gao, Nong Sang
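
The step-wise reward gain described in the abstract above can be illustrated with a few lines of arithmetic: score the estimated clean image at each point along the trajectory, then credit each denoising step with the change in score it produced. The sketch below assumes the reward values are already available (the reward model and the ODE-based clean-image estimate are out of scope), and the per-step normalization across a group of trajectories is an assumption modeled on the usual GRPO group-relative advantage, not a detail confirmed by the abstract.

```python
import math


def stepwise_gains(rewards):
    """Dense reward per step: change in reward between consecutive estimates.

    rewards: rewards of the intermediate clean images along one trajectory,
    starting with the estimate before the first step (hypothetical inputs).
    """
    return [after - before for before, after in zip(rewards, rewards[1:])]


def group_relative_advantages(group_gains, eps=1e-8):
    """Normalize gains at each step across a group of trajectories (assumed scheme).

    group_gains[i][t]: gain of trajectory i at step t.
    """
    num_steps = len(group_gains[0])
    advantages = [[0.0] * num_steps for _ in group_gains]
    for t in range(num_steps):
        column = [gains[t] for gains in group_gains]
        mean = sum(column) / len(column)
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
        for i, x in enumerate(column):
            advantages[i][t] = (x - mean) / (std + eps)
    return advantages


trajectory_rewards = [0.1, 0.4, 0.3, 0.9]
gains = stepwise_gains(trajectory_rewards)
```

The gains telescope: their sum equals the terminal reward minus the initial one, so the dense signal redistributes the sparse terminal reward across steps rather than changing its total.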

TeleStyle: Content-Preserving Style Transfer in Images and Videos

Content-preserving style transfer, generating stylized outputs based on content and style references, remains a significant challenge for Diffusion Transformers (DiTs) due to the inherent entanglement of content and style features in their internal representations. In this technical report, we present TeleStyle, a lightweight yet effective model for both image and video stylization. Built upon Qwen-Image-Edit, TeleStyle leverages the base model's robust capabilities in content preservation and style customization. To facilitate effective training, we curated a high-quality dataset of distinct specific styles and further synthesized triplets using thousands of diverse, in-the-wild style categories. We introduce a Curriculum Continual Learning framework to train TeleStyle on this hybrid dataset of clean (curated) and noisy (synthetic) triplets. This approach enables the model to generalize to unseen styles without compromising precise content fidelity. Additionally, we introduce a video-to-video stylization module to enhance temporal consistency and visual quality. TeleStyle achieves state-of-the-art performance across three core evaluation metrics: style similarity, content consistency, and aesthetic quality. Code and pre-trained models are available at https://github.com/Tele-AI/TeleStyle

TLDR: TeleStyle is a new Diffusion Transformer-based model for content-preserving image and video style transfer, leveraging a curated dataset and curriculum continual learning for improved generalization and temporal consistency.

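TLDR (translated from Chinese): TeleStyle is a new Diffusion Transformer-based model for content-preserving image and video style transfer; it uses a curated dataset and curriculum continual learning to improve generalization and temporal consistency.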

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Shiwen Zhang, Xiaoyan Yang, Bojia Zi, Haibin Huang, Chi Zhang, Xuelong Li

Efficient Token Pruning for LLaDA-V

Diffusion-based large multimodal models, such as LLaDA-V, have demonstrated impressive capabilities in vision-language understanding and generation. However, their bidirectional attention mechanism and diffusion-style iterative denoising paradigm introduce significant computational overhead, as visual tokens are repeatedly processed across all layers and denoising steps. In this work, we conduct an in-depth attention analysis and reveal that, unlike autoregressive decoders, LLaDA-V aggregates cross-modal information predominantly in middle-to-late layers, leading to delayed semantic alignment. Motivated by this observation, we propose a structured token pruning strategy inspired by FastV, selectively removing a proportion of visual tokens at designated layers to reduce FLOPs while preserving critical semantic information. To the best of our knowledge, this is the first work to investigate structured token pruning in diffusion-based large multimodal models. Unlike FastV, which focuses on shallow-layer pruning, our method targets the middle-to-late layers of the first denoising step to align with LLaDA-V's delayed attention aggregation to maintain output quality, and the first-step pruning strategy reduces the computation across all subsequent steps. Our framework provides an empirical basis for efficient LLaDA-V inference and highlights the potential of vision-aware pruning in diffusion-based multimodal models. Across multiple benchmarks, our best configuration reduces computational cost by up to 65% while preserving an average of 95% task performance.

TLDR: This paper introduces a token pruning strategy specifically designed for LLaDA-V, a diffusion-based large multimodal model, to reduce computational costs by selectively removing visual tokens in middle-to-late layers, achieving significant speedup with minimal performance loss.

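TLDR (translated from Chinese): This paper proposes a token pruning strategy designed specifically for LLaDA-V that lowers computational cost by selectively removing visual tokens in the middle-to-late layers, achieving a significant speedup with minimal performance loss.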

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Zhewen Wan, Tianchen Song, Chen Lin, Zhiyong Zhao, Xianpeng Lang
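
A FastV-style pruning step of the kind the abstract above builds on can be sketched as follows: at a chosen layer, rank visual tokens by the average attention they receive and keep only the top fraction, leaving non-visual tokens untouched. The function name, the head-averaged attention input, and the keep ratio are assumptions for illustration; which layers to prune at (middle-to-late layers of the first denoising step, per the abstract) is decided outside this function.

```python
def prune_visual_tokens(attn, visual_idx, keep_ratio):
    """Return indices of tokens to keep after pruning low-attention visual tokens.

    attn: attn[q][k] is the head-averaged attention from query q to key k.
    visual_idx: positions of visual tokens in the sequence.
    keep_ratio: fraction of visual tokens to retain (at least one is kept).
    """
    visual_set = set(visual_idx)
    # Average attention each visual token receives across all queries.
    received = {k: sum(row[k] for row in attn) / len(attn) for k in visual_idx}
    num_keep = max(1, int(len(visual_idx) * keep_ratio))
    ranked = sorted(visual_idx, key=lambda k: received[k], reverse=True)
    kept = set(ranked[:num_keep])
    # Preserve original order; text tokens are never pruned.
    return [i for i in range(len(attn[0])) if i not in visual_set or i in kept]


# Toy example: tokens 0-3 are visual, token 4 is text.
attention = [
    [0.10, 0.40, 0.05, 0.25, 0.20],
    [0.20, 0.30, 0.10, 0.30, 0.10],
]
kept_indices = prune_visual_tokens(attention, visual_idx=[0, 1, 2, 3], keep_ratio=0.5)
```

Because the pruned positions are dropped from the sequence, every later layer, and in the paper's setting every later denoising step, processes fewer tokens.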
