Daily papers related to Image/Video/Multimodal Generation from cs.CV
May 30, 2025
Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle inference over various subjects, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, outperforming existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF
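As a concrete illustration of the channel-concatenation conditioning described above, here is a minimal PyTorch sketch, assuming a latent-space backbone whose input projection is widened to accept the extra reference channels; the class name, shapes, and layer choices are illustrative and not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelConcatConditioner(nn.Module):
    """Concatenate masked reference latents with noisy video latents along the
    channel dimension (illustrative sketch, not MAGREF's released code)."""

    def __init__(self, latent_channels: int = 16, hidden_dim: int = 1024):
        super().__init__()
        # Input projection widened to take [noisy latents | masked reference latents | mask].
        self.proj = nn.Conv3d(latent_channels * 2 + 1, hidden_dim, kernel_size=1)

    def forward(self, noisy_latents, ref_latents, region_mask):
        # noisy_latents, ref_latents: (B, C, T, H, W); region_mask: (B, 1, T, H, W)
        # The region-aware mask selects which spatial regions each reference conditions.
        x = torch.cat([noisy_latents, ref_latents * region_mask, region_mask], dim=1)
        return self.proj(x)

# Toy shapes only; a real model would operate on VAE latents of full videos.
cond = ChannelConcatConditioner()
out = cond(torch.randn(1, 16, 4, 8, 8), torch.randn(1, 16, 4, 8, 8),
           torch.ones(1, 1, 4, 8, 8))
print(out.shape)  # torch.Size([1, 1024, 4, 8, 8])
```

Concatenating on the channel dimension keeps the reference features spatially aligned with the video latents, which is how the abstract motivates preserving appearance features.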
TLDR: The paper introduces MAGREF, a new framework for generating videos with multiple subjects based on reference images and text prompts, using masked guidance for improved subject consistency and generation quality.
Read Paper (PDF)

Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physically plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvement on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.
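One plausible reading of the Token Relation Distillation idea is to match pairwise token-similarity matrices between the T2V model and a frozen video encoder; the sketch below assumes both feature sets have already been resampled to the same number of tokens, and the exact loss used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def token_relation_distillation(student_tokens, teacher_tokens):
    """Align pairwise token-relation matrices (illustrative form, not the paper's exact loss).

    student_tokens: (B, N, D_s) features from the T2V diffusion model.
    teacher_tokens: (B, N, D_t) features from a frozen video SSL encoder.
    """
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1)
    rel_s = s @ s.transpose(1, 2)          # (B, N, N) student token relations
    rel_t = t @ t.transpose(1, 2)          # (B, N, N) teacher token relations
    return F.smooth_l1_loss(rel_s, rel_t)  # soft guidance, usable while finetuning

loss = token_relation_distillation(torch.randn(2, 64, 384), torch.randn(2, 64, 768))
print(loss.item())
```

Matching relations rather than raw features avoids any dependence on the two models sharing a feature dimension, which is consistent with the "soft guidance" framing in the abstract.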
TLDR: The paper introduces VideoREPA, a novel framework that distills physics understanding from video understanding foundation models into text-to-video diffusion models by aligning token-level relations, leading to more physically plausible video generation.
Read Paper (PDF)

Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high attention sparsity for visual generation models by leveraging the temporal redundancy of diffusion models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference. Further, we measure latency to show that our method can attain over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU at negligible overhead cost. Code available online here: https://github.com/cccrrrccc/Re-ttention
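The core mechanism, reusing softmax statistics from earlier denoising steps to correct the normalization of a sparse attention computation, might be sketched as follows; the mixing rule and the interfaces here are illustrative guesses, not the released Re-ttention code.

```python
import torch

def sparse_attention_with_history(q, k, v, keep_idx, prev_denominator=None):
    """Attention over a sparse key subset, with the softmax denominator corrected
    using statistics cached from the previous diffusion step (rough sketch of the
    idea of reusing softmax history; not Re-ttention's actual algorithm).

    q: (B, H, Nq, D), k/v: (B, H, Nk, D), keep_idx: (S,) indices of kept keys.
    """
    scale = q.shape[-1] ** -0.5
    k_s, v_s = k[:, :, keep_idx], v[:, :, keep_idx]
    scores = torch.exp((q @ k_s.transpose(-2, -1)) * scale)   # (B, H, Nq, S), unnormalized
    denom = scores.sum(dim=-1, keepdim=True)
    if prev_denominator is not None:
        # Crude stand-in for the history-based correction: blend the partial
        # denominator with the one remembered from the previous, redundant step.
        denom = 0.5 * denom + 0.5 * prev_denominator
    out = (scores / denom) @ v_s
    return out, denom.detach()

q = torch.randn(1, 8, 16, 64); k = torch.randn(1, 8, 256, 64); v = torch.randn(1, 8, 256, 64)
out, denom = sparse_attention_with_history(q, k, v, torch.arange(8))
print(out.shape)  # torch.Size([1, 8, 16, 64])
```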
TLDR: The paper introduces Re-ttention, a novel sparse attention mechanism for diffusion transformers that achieves significant latency reduction (up to 92% in self-attention) with minimal visual quality loss by reshaping attention scores based on prior softmax distributions, which improves upon existing sparsity methods.
Read Paper (PDF)

In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes the training complexity and overhead by bridging off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a lightweight transformer-based connector. With a minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality and instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at https://github.com/wusize/OpenUni.
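The stated design, a set of learnable queries plus a light transformer connector between a frozen multimodal LLM and a diffusion decoder, can be sketched roughly as below; dimensions, depth, and the class name are placeholders rather than OpenUni's actual configuration.

```python
import torch
import torch.nn as nn

class QueryConnector(nn.Module):
    """Learnable queries that cross-attend to frozen MLLM hidden states and emit
    conditioning for a diffusion model (illustrative sketch of the stated design)."""

    def __init__(self, num_queries=256, mllm_dim=4096, cond_dim=1024, depth=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, mllm_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=mllm_dim, nhead=8, batch_first=True)
        self.connector = nn.TransformerDecoder(layer, num_layers=depth)
        self.to_cond = nn.Linear(mllm_dim, cond_dim)  # projection into the diffusion model's space

    def forward(self, mllm_hidden_states):
        # mllm_hidden_states: (B, L, mllm_dim) from the frozen multimodal LLM.
        b = mllm_hidden_states.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q = self.connector(tgt=q, memory=mllm_hidden_states)  # queries attend to LLM states
        return self.to_cond(q)  # (B, num_queries, cond_dim) conditioning tokens

conn = QueryConnector(num_queries=8, mllm_dim=64, cond_dim=32, depth=2)
print(conn(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 8, 32])
```

Because only the queries, the connector, and any diffusion-side adapters need gradients, training cost stays small relative to pretraining either backbone, which matches the efficiency claim in the abstract.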
TLDR: OpenUni presents a simple and open-source approach to unified multimodal understanding and generation, leveraging existing LLMs and diffusion models with a lightweight connector, achieving strong performance on benchmarks with relatively few parameters.
Read Paper (PDF)

This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing a tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion properties, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. Therefore, we apply standard next-token prediction on these tokens, without modifying any underlying designs (either causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in a streaming manner. Our pipeline naturally reveals several intriguing properties; for example, it supports consistent previews when generating only a subset of tokens and enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures of visual synthesis, especially with large language models. Code and models will be available at https://github.com/showlab/D-AR
TLDR: The paper introduces D-AR, a method that reframes image diffusion as an autoregressive token prediction task, achieving strong FID scores on ImageNet with a Llama backbone and enabling features like consistent previews and zero-shot layout control.
Read Paper (PDF)

Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
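To make the "fast and parallel generation" claim concrete, below is a generic MaskGIT-style parallel unmasking step of the kind masked discrete diffusion models use; it illustrates the paradigm rather than Muddit's specific sampler or architecture.

```python
import torch

def parallel_unmask_step(logits, token_ids, mask, keep_ratio):
    """One parallel decoding step of a masked discrete diffusion model
    (generic sketch; Muddit's actual sampler may differ).

    logits: (B, N, V) predictions for every position.
    token_ids: (B, N) current tokens; mask: (B, N) True where still masked.
    keep_ratio: fraction of masked positions to commit in this step.
    """
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)                     # per-position confidence and argmax token
    conf = conf.masked_fill(~mask, float("-inf"))      # only compete among masked positions
    num_commit = max(1, int(keep_ratio * mask.sum().item()))
    commit = torch.zeros_like(mask)
    idx = conf.flatten().topk(num_commit).indices      # most confident masked positions
    commit.view(-1)[idx] = True
    token_ids = torch.where(commit, pred, token_ids)   # write many tokens at once, in parallel
    return token_ids, mask & ~commit

ids = torch.zeros(1, 16, dtype=torch.long)
mask = torch.ones(1, 16, dtype=torch.bool)
ids, mask = parallel_unmask_step(torch.randn(1, 16, 100), ids, mask, keep_ratio=0.25)
print(mask.sum().item())  # 12 positions remain masked after one step
```

Committing many tokens per step is what lets discrete diffusion decode in a handful of iterations instead of one token at a time, which is the efficiency contrast the abstract draws against autoregressive unified models.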
TLDR: Muddit, a unified discrete diffusion transformer, achieves fast and parallel text and image generation by incorporating strong visual priors from a pre-trained text-to-image backbone, surpassing autoregressive models in both quality and efficiency.
Read Paper (PDF)

Unified multimodal large language models such as Show-o and Janus have achieved strong performance across both generation and understanding tasks. However, these models typically rely on large-scale datasets and require substantial computation during the pretraining stage. In addition, several post-training methods have been proposed, but they often depend on external data or are limited to task-specific customization. In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance, but also reduces the imbalance between generation and understanding; and (3) it requires only a few additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus. Code and models will be released at https://github.com/showlab/UniRL.
TLDR: UniRL is a self-improving post-training method for multimodal models that generates its own training data to improve both image generation and understanding without external datasets, and reduces performance imbalance between tasks.
Read Paper (PDF)

Fine-tuning pre-trained generative models with Reinforcement Learning (RL) has emerged as an effective approach for aligning outputs more closely with nuanced human preferences. In this paper, we investigate the application of Group Relative Policy Optimization (GRPO) to fine-tune next-scale visual autoregressive (VAR) models. Our empirical results demonstrate that this approach enables alignment to intricate reward signals derived from aesthetic predictors and CLIP embeddings, significantly enhancing image quality and enabling precise control over the generation style. Interestingly, by leveraging CLIP, our method can help VAR models generalize beyond their initial ImageNet distribution: through RL-driven exploration, these models can generate images aligned with prompts referencing image styles that were absent during pre-training. In summary, we show that RL-based fine-tuning is both efficient and effective for VAR models, benefiting particularly from their fast inference speeds, which are advantageous for online sampling, an aspect that poses significant challenges for diffusion-based alternatives.
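For readers unfamiliar with GRPO, the group-relative advantage at the heart of the method can be computed as below; the reward combination (aesthetic score plus CLIP similarity) and its weighting are illustrative stand-ins rather than the paper's exact reward design.

```python
import torch

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sample's reward by the mean and std
    of its group, i.e. the set of samples drawn from the same prompt.

    rewards: (num_prompts, group_size) scalar rewards per sample.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

# Example: 2 prompts, 4 VAR samples each; made-up aesthetic and CLIP rewards.
aesthetic = torch.tensor([[5.1, 4.3, 6.0, 5.5], [4.0, 4.8, 3.9, 5.2]])
clip_sim = torch.tensor([[0.28, 0.22, 0.31, 0.25], [0.20, 0.27, 0.19, 0.30]])
adv = group_relative_advantages(aesthetic + 10.0 * clip_sim)
print(adv)  # per-sample weights for the policy-gradient update of the VAR model
```

Because the advantage only needs a handful of samples per prompt rather than a learned critic, fast VAR sampling translates directly into cheap online exploration, which is the efficiency argument the abstract makes against diffusion-based alternatives.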
TLDR: This paper explores using Group Relative Policy Optimization (GRPO) for fine-tuning visual autoregressive models, demonstrating improved image quality, style control, and generalization beyond the pre-training dataset through RL-driven exploration.
Read Paper (PDF)

Recent advancements in unified vision-language models (VLMs), which integrate both visual understanding and generation capabilities, have attracted significant attention. The underlying hypothesis is that a unified architecture with mixed training on both understanding and generation tasks can enable mutual enhancement between understanding and generation. However, this hypothesis remains underexplored in prior works on unified VLMs. To address this gap, this paper systematically investigates the generalization across understanding and generation tasks in unified VLMs. Specifically, we design a dataset closely aligned with real-world scenarios to facilitate extensive experiments and quantitative evaluations. We evaluate multiple unified VLM architectures to validate our findings. Our key findings are as follows. First, unified VLMs trained with mixed data exhibit mutual benefits in understanding and generation tasks across various architectures, and these mutual benefits can scale up with increased data. Second, better alignment between multimodal input and output spaces leads to better generalization. Third, the knowledge acquired during generation tasks can transfer to understanding tasks, and this cross-task generalization occurs within the base language model, beyond modality adapters. Our findings underscore the critical necessity of unifying understanding and generation in VLMs, offering valuable insights for the design and optimization of unified VLMs.
TLDR: This paper investigates the generalization capabilities of unified vision-language models (VLMs) across understanding and generation tasks, finding mutual benefits and highlighting the importance of aligning input/output spaces.
Read Paper (PDF)

The ability to simulate the world in a spatially consistent manner is a crucial requirement for effective world models. Such a model enables high-quality visual generation, and also ensures the reliability of world models for downstream tasks such as simulation and planning. Designing a memory module is crucial for addressing spatial consistency: such a module must not only retain long-horizon observational information, but also enable the construction of explicit or implicit internal spatial representations. However, no existing dataset is designed to promote the development of memory modules by explicitly enforcing spatial consistency constraints. Furthermore, most existing benchmarks primarily emphasize visual coherence or generation quality, neglecting the requirement of long-range spatial consistency. To bridge this gap, we construct a dataset and corresponding benchmark by sampling 150 distinct locations within the open-world environment of Minecraft, collecting about 250 hours (20 million frames) of loop-based navigation videos with actions. Our dataset follows a curriculum design of sequence lengths, allowing models to learn spatial consistency on increasingly complex navigation trajectories. Furthermore, our data collection pipeline is easily extensible to new Minecraft environments and modules. Four representative world model baselines are evaluated on our benchmark. Dataset, benchmark, and code are open-sourced to support future research.
TLDR: This paper introduces a new Minecraft dataset and benchmark for evaluating spatial consistency in memory-aided world models, addressing a gap in current evaluation methods that primarily focus on visual coherence.
Read Paper (PDF)