ArXiv CS.CV Papers (Image/Video Generation)

Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose Light Forcing, the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism to quantitatively estimate the contribution of each chunk, which determines its sparsity allocation. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge from earlier chunks during generation. Additionally, we introduce Hierarchical Sparse Attention to capture informative historical and local context in a coarse-to-fine manner. This two-level mask selection strategy (i.e., frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention in quality (e.g., 84.5 on VBench) and efficiency (e.g., 1.2-1.3× end-to-end speedup). Combined with FP8 quantization and LightVAE, Light Forcing further achieves a 2.3× speedup and 19.7 FPS on an RTX 5090 GPU. Code will be released at https://github.com/chengtao-lv/LightForcing.
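The coarse-to-fine idea of picking frames first and blocks second can be illustrated with a small Python sketch. This is not the paper's algorithm: the abstract does not say how scores are computed or how many frames and blocks are kept, so the function name `select_blocks`, the mean-score frame ranking, and the score values below are all assumptions made for illustration.

```python
def select_blocks(block_scores, top_frames, top_blocks):
    """Two-level top-k selection over past context.

    block_scores[f][b] is a (hypothetical) relevance score of block b in
    past frame f. Returns {frame_index: [kept block indices]}.
    """
    # Coarse level: rank past frames by their mean block score (an assumption).
    frame_scores = [sum(blocks) / len(blocks) for blocks in block_scores]
    kept_frames = sorted(
        range(len(block_scores)), key=lambda f: frame_scores[f], reverse=True
    )[:top_frames]

    # Fine level: inside each kept frame, keep only the highest-scoring blocks.
    selected = {}
    for f in sorted(kept_frames):
        ranked = sorted(
            range(len(block_scores[f])),
            key=lambda b: block_scores[f][b],
            reverse=True,
        )
        selected[f] = sorted(ranked[:top_blocks])
    return selected


# Illustrative scores for 3 past frames with 3 blocks each (made-up values).
scores = [
    [0.1, 0.2, 0.1],
    [0.9, 0.1, 0.8],
    [0.3, 0.7, 0.2],
]
mask = select_blocks(scores, top_frames=2, top_blocks=2)
print(mask)
```

Only the blocks listed in `mask` would take part in attention for the current chunk; everything else is skipped, which is where the savings of a sparse scheme come from.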

TLDR: The paper introduces Light Forcing, a novel sparse attention method tailored for autoregressive video generation, which improves both the quality and efficiency of video generation by using chunk-aware growth and hierarchical sparse attention.

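TLDR: This paper presents Light Forcing, a new sparse attention method built specifically for autoregressive video generation; by using chunk-aware growth and hierarchical sparse attention, it raises both the quality and the efficiency of video generation.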

Relevance: (9/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (9/10)

Authors: Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong, Shen Ren, Wenya Wang

Generative Modeling via Drifting

Generative modeling can be formulated as learning a mapping f such that its pushforward distribution matches the data distribution. The pushforward behavior can be carried out iteratively at inference time, for example in diffusion and flow-based models. In this paper, we propose a new paradigm called Drifting Models, which evolve the pushforward distribution during training and naturally admit one-step inference. We introduce a drifting field that governs the sample movement and achieves equilibrium when the distributions match. This leads to a training objective that allows the neural network optimizer to evolve the distribution. In experiments, our one-step generator achieves state-of-the-art results on ImageNet at 256 x 256 resolution, with an FID of 1.54 in latent space and 1.61 in pixel space. We hope that our work opens up new opportunities for high-quality one-step generation.

TLDR: The paper introduces 'Drifting Models,' a new generative modeling paradigm achieving state-of-the-art one-step image generation on ImageNet at 256x256 resolution.

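TLDR: This paper proposes Drifting Models, a new generative modeling paradigm that achieves state-of-the-art one-step image generation on ImageNet at 256x256 resolution.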

Relevance: (9/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Mingyang Deng, He Li, Tianhong Li, Yilun Du, Kaiming He

SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration

Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior accelerations often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality. To address these problems, we present SparVAR, a training-free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross-scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a sparse decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism, enabling high-efficiency sparse attention computation at large scales. Furthermore, we propose cross-scale local sparse attention and implement an efficient block-wise sparse kernel, which achieves more than 5× faster forward speed than FlashAttention. Extensive experiments demonstrate that the proposed SparVAR can reduce the generation time of an 8B model producing 1024×1024 high-resolution images to the 1 s level, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a 1.57× speed-up while preserving almost all high-frequency details. When combined with existing scale-skipping strategies, SparVAR attains up to a 2.28× acceleration, while maintaining competitive visual generation quality. Code is available at https://github.com/CAS-CLab/SparVAR.
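The quartic growth follows from simple counting: a token map with side length r holds r² tokens, and dense attention cost grows with the square of the token count. The Python sketch below counts query-key pairs only. The scale schedule is a made-up example rather than the one used in the paper, and `attention_pairs` is a hypothetical helper name.

```python
def attention_pairs(side_lengths):
    """Count query-key pairs per scale when each scale attends to all
    tokens of the current and all earlier scales (dense attention)."""
    pairs = []
    history = 0
    for side in side_lengths:
        tokens = side * side      # tokens in the current scale
        history += tokens         # keys: every scale generated so far
        pairs.append(tokens * history)
    return pairs


# Hypothetical schedule of token-map side lengths (illustrative only).
schedule = [1, 2, 4, 8, 16, 32, 64]
for side, cost in zip(schedule, attention_pairs(schedule)):
    print(f"side={side:3d}  query-key pairs={cost}")

# For a single scale, doubling the side length multiplies the cost by 2**4.
ratio = attention_pairs([8])[0] // attention_pairs([4])[0]
print(ratio)
```

Under this counting the last, largest scales dominate total attention cost, which is consistent with the abstract's point that the late high-resolution scales are the ones worth making sparse rather than skipping.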

TLDR: The paper introduces SparVAR, a training-free acceleration framework for Visual AutoRegressive modeling that leverages sparsity to significantly speed up image generation while preserving high-frequency details.

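TLDR: This paper presents SparVAR, a training-free acceleration framework for Visual AutoRegressive modeling that exploits sparsity to substantially speed up image generation while retaining high-frequency details.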

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Zekun Li, Ning Wang, Tongxin Bai, Changwang Mei, Peisong Wang, Shuang Qiu, Jian Cheng

SkeletonGaussian: Editable 4D Generation through Gaussian Skeletonization

4D generation has made remarkable progress in synthesizing dynamic 3D objects from input text, images, or videos. However, existing methods often represent motion as an implicit deformation field, which limits direct control and editability. To address this issue, we propose SkeletonGaussian, a novel framework for generating editable dynamic 3D Gaussians from monocular video input. Our approach introduces a hierarchical articulated representation that decomposes motion into sparse rigid motion explicitly driven by a skeleton and fine-grained non-rigid motion. Concretely, we extract a robust skeleton and drive rigid motion via linear blend skinning, followed by a hexplane-based refinement for non-rigid deformations, enhancing interpretability and editability. Experimental results demonstrate that SkeletonGaussian surpasses existing methods in generation quality while enabling intuitive motion editing, establishing a new paradigm for editable 4D generation. Project page: https://wusar.github.io/projects/skeletongaussian/
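Linear blend skinning, which the abstract names as the driver of the rigid motion, moves a point by a weighted average of per-bone rigid transforms. The Python sketch below uses 2D points for brevity. How the paper extracts the skeleton, assigns weights, or applies the transforms to 3D Gaussians is not stated in the abstract, so the helper names and the example bones are assumptions.

```python
import math


def rigid_transform_2d(angle_rad, tx, ty):
    """Build a 2D rigid transform as (rotation matrix, translation)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return ((c, -s), (s, c)), (tx, ty)


def linear_blend_skinning(point, weights, bone_transforms):
    """Blend the per-bone transformed positions of `point`.

    `weights` holds one non-negative skinning weight per bone and is
    expected to sum to 1.
    """
    x, y = point
    out_x = out_y = 0.0
    for w, (rot, trans) in zip(weights, bone_transforms):
        bx = rot[0][0] * x + rot[0][1] * y + trans[0]
        by = rot[1][0] * x + rot[1][1] * y + trans[1]
        out_x += w * bx
        out_y += w * by
    return out_x, out_y


# Two hypothetical bones: one at rest, one rotated by 90 degrees.
bones = [
    rigid_transform_2d(0.0, 0.0, 0.0),
    rigid_transform_2d(math.pi / 2, 0.0, 0.0),
]
moved = linear_blend_skinning((1.0, 0.0), [0.5, 0.5], bones)
print(moved)  # approximately (0.5, 0.5)
```

Because the motion is expressed through explicit bone transforms, editing a pose amounts to changing a few transforms rather than an implicit deformation field, which matches the editability argument in the abstract.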

TLDR: The paper introduces SkeletonGaussian, a framework for generating editable dynamic 3D Gaussians from monocular video by using a hierarchical articulated representation; it claims improvements in generation quality and motion editing capabilities.

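TLDR: This paper presents SkeletonGaussian, a framework that generates editable dynamic 3D Gaussians from monocular video through a hierarchical articulated representation. It claims improvements in generation quality and motion editing capability.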

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Lifan Wu, Ruijie Zhu, Yubo Ai, Tianzhu Zhang

Adaptive 1D Video Diffusion Autoencoder

Recent video generation models largely rely on video autoencoders that compress pixel-space videos into latent representations. However, existing video autoencoders suffer from three major limitations: (1) fixed-rate compression that wastes tokens on simple videos, (2) inflexible CNN architectures that prevent variable-length latent modeling, and (3) deterministic decoders that struggle to recover appropriate details from compressed latents. To address these issues, we propose One-Dimensional Diffusion Video Autoencoder (One-DVA), a transformer-based framework for adaptive 1D encoding and diffusion-based decoding. The encoder employs query-based vision transformers to extract spatiotemporal features and produce latent representations, while a variable-length dropout mechanism dynamically adjusts the latent length. The decoder is a pixel-space diffusion transformer that reconstructs videos with the latents as input conditions. With a two-stage training strategy, One-DVA achieves performance comparable to 3D-CNN VAEs on reconstruction metrics at identical compression ratios. More importantly, it supports adaptive compression and thus can achieve higher compression ratios. To better support downstream latent generation, we further regularize the One-DVA latent distribution for generative modeling and fine-tune its decoder to mitigate artifacts caused by the generation process.

TLDR: The paper introduces One-DVA, a transformer-based video autoencoder employing adaptive 1D encoding and diffusion-based decoding to overcome limitations of fixed-rate compression and inflexible CNN architectures in existing video autoencoders, achieving comparable or better reconstruction performance and supporting downstream generative modeling.

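TLDR: This paper proposes One-DVA, a transformer-based video autoencoder that uses adaptive 1D encoding and diffusion-based decoding to overcome the fixed-rate compression and inflexible CNN architectures of existing video autoencoders, achieving comparable or better reconstruction performance and supporting downstream generative modeling.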

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Yao Teng, Minxuan Lin, Xian Liu, Shuai Wang, Xiao Yang, Xihui Liu

VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents

This work presents VTok, a unified video tokenization framework that can be used for both generation and understanding tasks. Unlike the leading vision-language systems that tokenize videos through a naive frame-sampling strategy, we propose to decouple the spatial and temporal representations of videos by retaining the spatial features of a single key frame while encoding each subsequent frame into a single residual token, achieving compact yet expressive video tokenization. Our experiments suggest that VTok effectively reduces the complexity of video representation from the product of frame count and per-frame token count to their sum, while the residual tokens sufficiently capture viewpoint and motion changes relative to the key frame. Extensive evaluations demonstrate the efficacy and efficiency of VTok: it achieves notably higher performance on a range of video understanding and text-to-video generation benchmarks compared with baselines using naive tokenization, all with shorter token sequences per video (e.g., 3.4% higher accuracy on our TV-Align benchmark and 1.9% higher VBench score). Remarkably, VTok produces more coherent motion and stronger guidance following in text-to-video generation, owing to its more consistent temporal encoding. We hope VTok can serve as a standardized video tokenization paradigm for future research in video understanding and generation.
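The product-versus-sum claim can be checked with a few lines of arithmetic. The Python sketch below reads the abstract literally: one key frame keeps all of its spatial tokens and every later frame contributes a single residual token. The frame count and per-frame token count are made-up values, and both function names are hypothetical.

```python
def naive_token_count(num_frames, tokens_per_frame):
    """Every sampled frame is tokenized in full."""
    return num_frames * tokens_per_frame


def decoupled_token_count(num_frames, tokens_per_frame):
    """One fully tokenized key frame plus one residual token for each
    subsequent frame."""
    return tokens_per_frame + (num_frames - 1)


# Illustrative setting: 16 frames, 256 spatial tokens per frame.
naive = naive_token_count(16, 256)
decoupled = decoupled_token_count(16, 256)
print(naive, decoupled)
```

In this setting the sequence shrinks from 4096 tokens to 271, and the gap widens as more frames are added because each extra frame costs one token instead of 256.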

TLDR: The paper introduces VTok, a novel video tokenization framework that decouples spatial and temporal representations, achieving higher performance in video understanding and text-to-video generation with shorter token sequences compared to naive methods.

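TLDR: This paper presents VTok, a novel video tokenization framework that decouples spatial and temporal representations; compared with conventional methods, it achieves higher performance in video understanding and text-to-video generation with shorter token sequences.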

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Feng Wang, Yichun Shi, Ceyuan Yang, Qiushan Guo, Jingxiang Sun, Alan Yuille, Peng Wang

Point2Insert: Video Object Insertion via Sparse Point Guidance

This paper introduces Point2Insert, a sparse-point-based framework for flexible and user-friendly object insertion in videos, motivated by the growing popularity of accurate, low-effort object placement. Existing approaches face two major challenges: mask-based insertion methods require labor-intensive mask annotations, while instruction-based methods struggle to place objects at precise locations. Point2Insert addresses these issues by requiring only a small number of sparse points instead of dense masks, eliminating the need for tedious mask drawing. Specifically, it supports both positive and negative points to indicate regions that are suitable or unsuitable for insertion, enabling fine-grained spatial control over object locations. The training of Point2Insert consists of two stages. In Stage 1, we train an insertion model that generates objects in given regions conditioned on either sparse-point prompts or a binary mask. In Stage 2, we further train the model on paired videos synthesized by an object removal model, adapting it to video insertion. Moreover, motivated by the higher insertion success rate of mask-guided editing, we leverage a mask-guided insertion model as a teacher to distill reliable insertion behavior into the point-guided model. Extensive experiments demonstrate that Point2Insert consistently outperforms strong baselines and even surpasses models with 10× more parameters.

TLDR: Point2Insert introduces a user-friendly video object insertion method using sparse point guidance, outperforming existing mask-based and instruction-based methods by providing fine-grained spatial control with fewer annotations.

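TLDR: Point2Insert proposes a user-friendly video object insertion method that uses sparse point guidance and outperforms existing mask-based and instruction-based methods by providing fine-grained spatial control with fewer annotations.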

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Yu Zhou, Xiaoyan Yang, Bojia Zi, Lihan Zhang, Ruijie Sun, Weishi Zheng, Haibin Huang, Chi Zhang, Xuelong Li

AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations

High-quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well-recognized bottleneck in both academia and industry. We present FigureBench, the first large-scale benchmark for generating scientific illustrations from long-form scientific texts. It contains 3,300 high-quality scientific text-figure pairs, covering diverse text-to-illustration tasks from scientific papers, surveys, blogs, and textbooks. Moreover, we propose AutoFigure, the first agentic framework that automatically generates high-quality scientific illustrations based on long-form scientific text. Specifically, before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is both structurally sound and aesthetically refined, outputting a scientific illustration that achieves both structural completeness and aesthetic appeal. Leveraging the high-quality data from FigureBench, we conduct extensive experiments to test the performance of AutoFigure against various baseline methods. The results demonstrate that AutoFigure consistently surpasses all baseline methods, producing publication-ready scientific illustrations. The code, dataset, and Hugging Face Space are released at https://github.com/ResearAI/AutoFigure.

TLDR: The paper introduces AutoFigure, an agentic framework for automatically generating publication-ready scientific illustrations from long-form text, along with FigureBench, a new benchmark dataset for this task. AutoFigure outperforms existing methods in generating high-quality, structurally complete, and aesthetically refined illustrations.

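TLDR: This paper presents AutoFigure, an agentic framework that automatically generates publication-ready scientific illustrations from long-form text, together with FigureBench, a new benchmark dataset for the task. AutoFigure outperforms existing methods at producing high-quality, structurally complete, and visually appealing illustrations.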

Relevance: (8/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Minjun Zhu, Zhen Lin, Yixuan Weng, Panzhong Lu, Qiujie Xie, Yifan Wei, Sifan Liu, Qiyao Sun, Yue Zhang

Progressive Checkerboards for Autoregressive Multiscale Image Generation

A key challenge in autoregressive image generation is to efficiently sample independent locations in parallel, while still modeling mutual dependencies with serial conditioning. Some recent works have addressed this by conditioning between scales in a multiscale pyramid. Others have looked at parallelizing samples in a single image using regular partitions or randomized orders. In this work we examine a flexible, fixed ordering based on progressive checkerboards for multiscale autoregressive image generation. Our ordering draws samples in parallel from evenly spaced regions at each scale, maintaining full balance in all levels of a quadtree subdivision at each step. This enables effective conditioning both between and within scales. Intriguingly, we find evidence that in our balanced setting, a wide range of scale-up factors lead to similar results, so long as the total number of serial steps is constant. On class-conditional ImageNet, our method achieves competitive performance compared to recent state-of-the-art autoregressive systems with like model capacity, using fewer sampling steps.

TLDR: This paper introduces a novel, fixed sampling order based on progressive checkerboards for multiscale autoregressive image generation, enabling efficient parallel sampling while maintaining dependencies, achieving competitive results on ImageNet with fewer sampling steps.

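TLDR: This paper proposes a new fixed sampling order based on progressive checkerboards for multiscale autoregressive image generation, which allows efficient parallel sampling while preserving dependencies and achieves competitive results on ImageNet with fewer sampling steps.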

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: David Eigen

X2HDR: HDR Image Generation in a Perceptually Uniform Space

High-dynamic-range (HDR) formats and displays are becoming increasingly prevalent, yet state-of-the-art image generators (e.g., Stable Diffusion and FLUX) typically remain limited to low-dynamic-range (LDR) output due to the lack of large-scale HDR training data. In this work, we show that existing pretrained diffusion models can be easily adapted to HDR generation without retraining from scratch. A key challenge is that HDR images are natively represented in linear RGB, whose intensity and color statistics differ substantially from those of sRGB-encoded LDR images. This gap, however, can be effectively bridged by converting HDR inputs into perceptually uniform encodings (e.g., using PU21 or PQ). Empirically, we find that LDR-pretrained variational autoencoders (VAEs) reconstruct PU21-encoded HDR inputs with fidelity comparable to LDR data, whereas linear RGB inputs cause severe degradations. Motivated by this finding, we describe an efficient adaptation strategy that freezes the VAE and finetunes only the denoiser via low-rank adaptation in a perceptually uniform space. This results in a unified computational method that supports both text-to-HDR synthesis and single-image RAW-to-HDR reconstruction. Experiments demonstrate that our perceptually encoded adaptation consistently improves perceptual fidelity, text-image alignment, and effective dynamic range, relative to previous techniques.
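To make the notion of a perceptually uniform encoding concrete, here is a Python sketch of the PQ curve (SMPTE ST 2084), one of the two encodings the abstract mentions. The constants come from the public PQ definition. How the paper normalizes luminance or feeds the encoded values to the VAE is not stated in the abstract, so this shows only the curve itself, and the function names are hypothetical.

```python
# PQ (SMPTE ST 2084) constants.
M1 = 2610 / 16384        # 0.1593017578125
M2 = 2523 / 4096 * 128   # 78.84375
C1 = 3424 / 4096         # 0.8359375
C2 = 2413 / 4096 * 32    # 18.8515625
C3 = 2392 / 4096 * 32    # 18.6875
PEAK_NITS = 10000.0      # PQ is defined for 0..10000 cd/m^2


def pq_encode(luminance_nits):
    """Map absolute linear luminance (cd/m^2) to a PQ code value in [0, 1]."""
    y = max(luminance_nits, 0.0) / PEAK_NITS
    y_m1 = y ** M1
    return ((C1 + C2 * y_m1) / (1.0 + C3 * y_m1)) ** M2


def pq_decode(code_value):
    """Invert pq_encode: map a PQ code value back to luminance in cd/m^2."""
    e = code_value ** (1.0 / M2)
    return PEAK_NITS * (max(e - C1, 0.0) / (C2 - C3 * e)) ** (1.0 / M1)


for nits in (0.1, 1.0, 100.0, 1000.0, 10000.0):
    print(f"{nits:8.1f} cd/m^2 -> PQ {pq_encode(nits):.4f}")
```

The curve spends roughly half of its code range on luminances below about 100 cd/m², so dark and mid tones that would be crushed near zero in linear RGB land in a range closer to what an LDR-trained model expects.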

TLDR: The paper presents a method, X2HDR, to adapt existing LDR-pretrained diffusion models for HDR image generation by leveraging perceptually uniform encodings without retraining from scratch, achieving improved perceptual fidelity and dynamic range.

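TLDR: This paper proposes X2HDR, a method that adapts existing LDR-pretrained diffusion models to HDR image generation through perceptually uniform encodings, without retraining from scratch, and thereby improves perceptual fidelity and dynamic range.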

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Authors: Ronghuan Wu, Wanchao Su, Kede Ma, Jing Liao, Rafał K. Mantiuk

DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding

Prior masked modeling motion generation methods predominantly study text-to-motion. We present DiMo, a discrete diffusion-style framework, which extends masked modeling to bidirectional text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs iterative masked token refinement, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally enables a quality-latency trade-off at inference via the number of refinement steps. We further improve motion token fidelity with residual vector quantization (RVQ) and enhance alignment and controllability with Group Relative Policy Optimization (GRPO). Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding under a unified framework. In addition, we demonstrate model ability in text-free motion completion, text-guided motion prediction and motion caption correction without architectural change. Additional qualitative results are available on our project page: https://animotionlab.github.io/DiMo/.
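Residual vector quantization, which the abstract credits for better motion token fidelity, encodes a vector with a stack of codebooks where each level quantizes what the previous level left over. The Python sketch below uses tiny hand-written codebooks. DiMo's actual codebook sizes, dimensions, and training procedure are not given in the abstract, so every value and function name here is an assumption.

```python
def nearest(codebook, vector):
    """Return the index of the codeword closest to `vector` (squared L2)."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize `vector` level by level; each level encodes the residual
    left by the levels before it. Returns one index per codebook."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Two made-up 2D codebooks: a coarse level and a finer correction level.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
codes = rvq_encode([1.2, 1.0], codebooks)
approx = rvq_decode(codes, codebooks)
print(codes, approx)
```

With only the first level the input [1.2, 1.0] is reconstructed as [1.0, 1.0]; adding the second level brings it to [1.25, 1.0], which shows how extra levels reduce quantization error at the cost of one more token index per vector.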

TLDR: DiMo proposes a discrete diffusion framework for motion generation and understanding, unifying text-to-motion, motion-to-text, and motion-to-motion tasks within a single model using iterative masked token refinement.

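TLDR: DiMo proposes a discrete diffusion framework for motion generation and understanding that unifies text-to-motion, motion-to-text, and motion-to-motion tasks in a single model through iterative masked token refinement.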

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Authors: Ning Zhang, Zhengyu Li, Kwong Weng Loh, Mingxi Xu, Qi Wang, Zhengyu Wen, Xiaoyu He, Wei Zhao, Kehong Gong, Mingyuan Zhang
