Daily papers related to Image/Video/Multimodal Generation from cs.CV
July 18, 2025
Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts -- such as sparse attention and temporally autoregressive models -- offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single-query-token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware mechanisms, our model seamlessly supports prediction, retrodiction, interpolation, and multi-shot generation within a unified paradigm. Extensive experiments across diverse tasks validate the effectiveness and versatility of our approach.
TLDR: LoViC introduces a DiT-based framework with FlexFormer, an autoencoder that efficiently compresses video and text into unified latent representations for generating long, coherent videos.
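The scaling argument in the LoViC abstract can be made concrete with a rough count of attention pairs. The sketch below is only an illustration under stated assumptions: the token counts, the compression rate, and the function names are hypothetical, and the cost model (each segment attends to itself plus a compressed context whose length shrinks linearly with the rate) is a simplification, not the paper's actual architecture.

```python
import math


def compressed_length(num_tokens: int, rate: int) -> int:
    """Latent tokens left after compressing num_tokens at the given rate.

    Assumption: the output length scales linearly with the input length.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    return math.ceil(num_tokens / rate)


def dense_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by one dense self-attention layer."""
    return num_tokens * num_tokens


def segmentwise_attention_pairs(
    num_segments: int, tokens_per_segment: int, context_tokens: int
) -> int:
    """Pairs scored when each segment attends to itself plus a compressed context."""
    window = tokens_per_segment + context_tokens
    return num_segments * window * window


# Hypothetical numbers: 8 segments of 1,000 tokens, context compressed 10x.
context = compressed_length(1_000, rate=10)
dense = dense_attention_pairs(8 * 1_000)
segmented = segmentwise_attention_pairs(8, 1_000, context)
print(context, dense, segmented)
```

Under these assumptions the dense count grows with the square of the total length, while the segment-wise count grows linearly in the number of segments.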
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, and real-time generation is even more challenging. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a KD-guided, sensitivity-aware tri-level pruning strategy to shrink the model size to suit mobile platforms while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve over 10 frames per second (FPS) generation on an iPhone 16 Pro Max, demonstrating the feasibility of real-time, high-quality video generation on mobile devices.
TLDR: This paper introduces optimizations for Diffusion Transformers (DiT) to achieve real-time video generation on mobile devices, using VAE compression, pruning, and distillation techniques.
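Sensitivity-aware pruning can be pictured with a generic ranking step: measure how much a quality metric degrades when each block is removed, then drop the blocks that matter least. The following is a minimal sketch of that generic idea, not the paper's KD-guided tri-level strategy; the block names, the scores, and the `keep_ratio` parameter are hypothetical.

```python
import math
from typing import Dict, List


def blocks_to_prune(sensitivity: Dict[str, float], keep_ratio: float) -> List[str]:
    """Return the least-sensitive blocks to remove.

    sensitivity maps a block name to the loss increase observed when that
    block is removed (higher means more important). keep_ratio is the
    fraction of blocks to retain.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = math.ceil(len(sensitivity) * keep_ratio)
    # Sort from least to most sensitive; the first len - num_keep names are pruned.
    ranked = sorted(sensitivity, key=lambda name: sensitivity[name])
    return ranked[: len(ranked) - num_keep]


# Toy scores, made up for illustration only.
toy_scores = {"block_0": 0.90, "block_1": 0.05, "block_2": 0.40, "block_3": 0.10}
pruned = blocks_to_prune(toy_scores, keep_ratio=0.5)
print(pruned)
```

In a real pipeline the scores would come from evaluating the model with each block ablated, and a distillation loss would typically be used to recover quality after pruning.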
AutoRegressive (AR) models have made notable progress in image generation, with Masked AutoRegressive (MAR) models gaining attention for their efficient parallel decoding. However, MAR models have traditionally underperformed when compared to standard AR models. This study refines the MAR architecture to improve image generation quality. We begin by evaluating various image tokenizers to identify the most effective one. Subsequently, we introduce an improved Bidirectional LLaMA architecture by replacing causal attention with bidirectional attention and incorporating 2D RoPE, which together form our advanced model, MaskGIL. Scaled from 111M to 1.4B parameters, MaskGIL achieves an FID score of 3.71, matching state-of-the-art AR models on the ImageNet 256x256 benchmark, while requiring only 8 inference steps compared to the 256 steps of AR models. Furthermore, we develop a text-driven MaskGIL model with 775M parameters for generating images from text at various resolutions. Beyond image generation, MaskGIL extends to accelerate AR-based generation and enable real-time speech-to-image conversion. Our code and models are available at https://github.com/synbol/MaskGIL.
TLDR: This paper introduces MaskGIL, an improved Masked AutoRegressive model that achieves state-of-the-art image generation quality with significantly fewer inference steps compared to standard AR models and demonstrates its applicability to text-to-image and speech-to-image tasks.
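The gap between 8 and 256 inference steps comes from parallel decoding: a masked model reveals many tokens per step instead of one. The sketch below shows how 256 tokens might be spread over 8 steps. The cosine schedule is an assumption borrowed from MaskGIT-style decoders, since the abstract does not say which schedule MaskGIL uses, and the function name is hypothetical.

```python
import math
from typing import List


def unmask_schedule(num_tokens: int, num_steps: int) -> List[int]:
    """Number of tokens revealed at each step under a cosine masking schedule."""
    if num_steps < 1 or num_tokens < num_steps:
        raise ValueError("need num_tokens >= num_steps >= 1")
    revealed = []
    masked_prev = num_tokens
    for step in range(1, num_steps + 1):
        # Fraction of tokens that should still be masked after this step.
        mask_ratio = math.cos(math.pi / 2 * step / num_steps)
        masked_now = math.floor(num_tokens * mask_ratio) if step < num_steps else 0
        # Reveal at least one token per step.
        masked_now = min(masked_now, masked_prev - 1)
        revealed.append(masked_prev - masked_now)
        masked_prev = masked_now
    return revealed


# A 16x16 token grid (256 tokens) decoded in 8 parallel steps.
schedule = unmask_schedule(256, 8)
print(schedule, sum(schedule))
```

A token-by-token AR decoder corresponds to revealing exactly one token per step, which is where the 256-step figure for a 256-token grid comes from.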
This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but incurs a relatively high data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and re-organizes the pre-training process in an efficient manner. During inference, it includes a fused CUDA kernel to speed up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs, while still maintaining competitive performance with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.
TLDR: The paper introduces Mono-InternVL-1.5, a monolithic multimodal large language model enhancing visual knowledge learning and reducing training/inference costs through a multimodal mixture-of-experts architecture and an improved endogenous visual pre-training method, achieving competitive performance with reduced latency.
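The idea of adding visual experts beside a pre-trained LLM can be sketched as a feed-forward layer that routes each token by modality. This is a toy illustration, assuming hard routing by a per-token visual flag, which the abstract does not spell out; the class name and the two stand-in expert functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class ModalityRoutedFFN:
    """Routes text tokens and visual tokens to separate expert functions."""

    text_expert: Callable[[Vector], Vector]    # stands in for the frozen LLM FFN
    visual_expert: Callable[[Vector], Vector]  # stands in for the added visual FFN

    def forward(self, tokens: Sequence[Vector], is_visual: Sequence[bool]) -> List[Vector]:
        if len(tokens) != len(is_visual):
            raise ValueError("tokens and is_visual must have the same length")
        return [
            self.visual_expert(tok) if flag else self.text_expert(tok)
            for tok, flag in zip(tokens, is_visual)
        ]


# Toy experts: the text path doubles each value, the visual path adds one.
layer = ModalityRoutedFFN(
    text_expert=lambda v: [2.0 * x for x in v],
    visual_expert=lambda v: [x + 1.0 for x in v],
)
outputs = layer.forward([[1.0, 2.0], [3.0, 4.0]], is_visual=[False, True])
print(outputs)
```

Because the text path is untouched, only the visual expert's parameters would need training in such a setup, which is one way to read the delta-tuning idea of learning visual knowledge without disturbing the pre-trained language weights.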