Daily papers related to Image/Video/Multimodal Generation from cs.CV
February 25, 2026
Propagation-based video editing enables precise user control by propagating a single edited frame to subsequent frames while preserving the original context, such as motion and structure. However, training such models requires large-scale paired (source and edited) video datasets, which are costly and complex to acquire. Hence, we propose PropFly, a training pipeline for Propagation-based video editing that relies on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of off-the-shelf or precomputed paired video editing datasets. Specifically, PropFly leverages one-step clean-latent estimates from intermediate noised latents at varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of 'source' (low-CFG) and 'edited' (high-CFG) latents on the fly. The source latent provides the structural information of the video, while the edited latent provides the target transformation for learning propagation. Our pipeline trains an additional adapter attached to the pre-trained VDM to propagate edits via a Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation. This on-the-fly supervision ensures that the model learns temporally consistent and dynamic transformations. Extensive experiments demonstrate that PropFly significantly outperforms state-of-the-art methods on various video editing tasks, producing high-quality editing results.
TLDR: The paper introduces PropFly, a training pipeline for propagation-based video editing that leverages on-the-fly supervision from pre-trained video diffusion models, eliminating the need for paired training data and achieving state-of-the-art performance.
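To make the on-the-fly pairing concrete, here is a minimal sketch of how low-CFG and high-CFG one-step clean-latent estimates could be formed from the same noised latent. The velocity-prediction model, the rectified-flow parameterization, and the guidance scales are placeholders chosen for illustration, not the authors' implementation.

```python
import torch

@torch.no_grad()
def cfg_velocity(model, x_t, t, cond, scale):
    """Classifier-free-guided velocity: v_uncond + scale * (v_cond - v_uncond)."""
    v_cond = model(x_t, t, cond)
    v_uncond = model(x_t, t, None)
    return v_uncond + scale * (v_cond - v_uncond)

@torch.no_grad()
def make_onthefly_pair(model, x_t, t, cond, low_scale=1.0, high_scale=7.5):
    """Form a ('source', 'edited') latent pair from one noised latent x_t.

    Assumes a rectified-flow convention x_t = (1 - t) * x_0 + t * eps, where the
    model predicts the velocity v = eps - x_0, so the one-step clean-latent
    estimate is x_0_hat = x_t - t * v. (Placeholder convention, not the paper's.)
    """
    v_src = cfg_velocity(model, x_t, t, cond, low_scale)    # low CFG -> 'source'
    v_edit = cfg_velocity(model, x_t, t, cond, high_scale)  # high CFG -> 'edited'
    return x_t - t * v_src, x_t - t * v_edit
```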
Read Paper (PDF)
Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reuse or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often degrading quality and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show that our method achieves significant acceleration while maintaining high-fidelity generation: 5.00x acceleration on FLUX.1-dev with minimal quality degradation (a 1.0% drop), a 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary materials and will be released on GitHub.
TLDR: The paper introduces LESA, a learnable stage-aware predictor framework utilizing KANs for accelerating Diffusion Transformers in image and video generation by intelligently caching and forecasting features at different diffusion stages, achieving significant speedups with minimal or even improved quality.
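As a rough illustration of stage-aware feature forecasting, the sketch below routes each denoising step to a per-stage predictor that forecasts a DiT block's features from the last fully computed ones. A small MLP stands in for the paper's KAN predictor, and the stage boundaries, feature dimension, and caching schedule are assumptions.

```python
import torch
import torch.nn as nn

class StageAwareFeaturePredictor(nn.Module):
    """Multi-expert feature forecaster: one predictor per noise-level stage.

    An MLP is used as a stand-in for the KAN predictor described in the paper;
    stage boundaries and dimensions are illustrative choices.
    """
    def __init__(self, dim=1024, stage_bounds=(0.33, 0.66)):
        super().__init__()
        self.stage_bounds = stage_bounds
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim + 1, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(len(stage_bounds) + 1)
        )

    def stage_of(self, t):
        # t in [0, 1]: 0 = clean, 1 = pure noise (convention assumed here).
        return sum(t > b for b in self.stage_bounds)

    def forward(self, cached_feat, t_cached, t_now):
        # Predict the block output at t_now from the feature cached at t_cached.
        expert = self.experts[self.stage_of(t_now)]
        dt = torch.full_like(cached_feat[..., :1], t_now - t_cached)
        return cached_feat + expert(torch.cat([cached_feat, dt], dim=-1))

# Usage sketch: fully recompute the block every `refresh` steps, forecast otherwise.
# if step % refresh == 0: feat, t_cached = block(x, t), t
# else:                   feat = predictor(feat_cached, t_cached, t)
```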
Read Paper (PDF)
Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to mobile devices. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning at minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances visual understanding and generation. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will facilitate future research on real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
TLDR: Mobile-O is a compact vision-language-diffusion model for unified multimodal understanding and generation on mobile devices, achieving competitive performance with significantly improved efficiency compared to existing models. It runs on an iPhone in ~3s per image, enabling real-time on-device multimodal AI.
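A minimal sketch of how depthwise-separable projectors could align vision-language features to a diffusion generator layer by layer. The module names, feature shapes, and number of conditioned layers are assumptions rather than the released MCP implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableProjector(nn.Module):
    """Depthwise conv + pointwise conv mapping VLM features to one generator layer."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.norm = nn.GroupNorm(1, out_ch)

    def forward(self, x):
        return self.norm(self.pointwise(self.depthwise(x)))

class ConditioningProjector(nn.Module):
    """Layerwise alignment: one lightweight projector per conditioned generator layer."""
    def __init__(self, vlm_ch=768, layer_chs=(320, 640, 1280)):
        super().__init__()
        self.projectors = nn.ModuleList(
            DepthwiseSeparableProjector(vlm_ch, ch) for ch in layer_chs
        )

    def forward(self, vlm_feat):
        # vlm_feat: (B, vlm_ch, H, W) spatial map of vision-language features.
        # Returns one conditioning map per generator layer it is injected into.
        return [proj(vlm_feat) for proj in self.projectors]
```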
Read Paper (PDF)
Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones at test time. To this end, we present a multimodal hierarchical network, MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. The proposed method significantly improves long-form audio generation, extending to durations of more than 5 minutes. We also show that training on short clips and testing on long ones is feasible in video-to-audio generation without ever training on the longer durations. Our experiments show that the proposed method achieves remarkable results on long-video-to-audio benchmarks, outperforming prior video-to-audio work. Moreover, we showcase our model's ability to generate more than 5 minutes of audio, whereas prior video-to-audio methods fall short at such durations.
TLDR: This paper introduces MMHNet, a novel multimodal hierarchical network using Mamba to improve length generalization in video-to-audio generation, enabling the generation of long-form audio (up to 5 minutes) from video, even when trained on shorter clips.
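The abstract does not spell out the hierarchy, so the sketch below only illustrates the general "train short, test long" chunking idea: video features are processed in fixed-size local chunks whose summaries feed a coarse global pass. A bidirectional GRU stands in for the non-causal Mamba blocks, and all shapes and module choices are assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalVideoEncoder(nn.Module):
    """Local chunk encoding + coarse global pass over chunk summaries.

    A bidirectional GRU stands in for the paper's non-causal Mamba blocks;
    chunk size and dimensions are illustrative.
    """
    def __init__(self, dim=512, chunk_len=64):
        super().__init__()
        self.chunk_len = chunk_len
        self.local = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.global_ = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, feats):
        # feats: (B, T, dim) frame-level video features; T may exceed training lengths.
        B, T, D = feats.shape
        pad = (-T) % self.chunk_len
        feats = nn.functional.pad(feats, (0, 0, 0, pad))
        chunks = feats.view(B * (T + pad) // self.chunk_len, self.chunk_len, D)
        local, _ = self.local(chunks)                    # per-chunk local context
        summaries = local.mean(dim=1).view(B, -1, D)     # one summary token per chunk
        global_ctx, _ = self.global_(summaries)          # coarse long-range pass
        # Broadcast the global context back to every frame of its chunk.
        global_up = global_ctx.repeat_interleave(self.chunk_len, dim=1)
        return (local.view(B, -1, D) + global_up)[:, :T]
```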
Read Paper (PDF)
Editing images with diffusion models without training remains challenging. While recent optimisation-based methods achieve strong zero-shot edits from text, they struggle to preserve identity or capture details that language alone cannot express: many visual concepts, such as facial structure, material texture, or object geometry, cannot be conveyed through text prompts alone. To address this gap, we introduce a training-free framework for concept-based image editing that unifies Optimised DDS with LoRA-driven concept composition, where the LoRA's training data represent the concept. Our approach enables combining and controlling multiple visual concepts directly within the diffusion process, integrating semantic guidance from text with low-level cues from pretrained concept adapters. We further refine DDS for stability and controllability through ordered timesteps, regularisation, and negative-prompt guidance. Quantitative and qualitative results demonstrate consistent improvements over existing training-free diffusion editing methods on the InstructPix2Pix and ComposLoRA benchmarks. Code will be made publicly available.
TLDR: This paper presents a training-free framework for image editing using diffusion models by combining text prompts with pre-trained concept adapters (LoRA), achieving improved control and preservation of details compared to existing methods.
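For context, Delta Denoising Score (DDS) editing typically optimises the edited image with the difference of two score estimates. The sketch below shows that core update with a frozen, optionally LoRA-augmented noise predictor; the helper alpha_sigma, the step size, and the timestep handling are hypothetical placeholders, not the paper's refined variant.

```python
import torch

def dds_step(eps_model, z_edit, z_src, t, emb_tgt, emb_src, lr=0.1):
    """One Delta Denoising Score update on the edited latent z_edit.

    eps_model is a frozen (optionally LoRA-augmented) noise predictor;
    alpha_sigma is a hypothetical helper returning the noise-schedule
    coefficients at timestep t. Names and step size are illustrative.
    """
    noise = torch.randn_like(z_edit)
    alpha, sigma = alpha_sigma(t)                  # placeholder noise schedule
    zt_edit = alpha * z_edit + sigma * noise
    zt_src = alpha * z_src + sigma * noise         # same noise for both branches
    with torch.no_grad():
        eps_edit = eps_model(zt_edit, t, emb_tgt)  # target prompt (+ concept LoRA)
        eps_src = eps_model(zt_src, t, emb_src)    # source prompt
    grad = eps_edit - eps_src                      # DDS: difference of score estimates
    return z_edit - lr * grad

# The paper's "ordered timesteps" refinement could correspond to iterating
# dds_step over a descending t schedule rather than sampling t at random.
```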
Read Paper (PDF)
World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-free world model that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process and leverages global attention for unified 4D spatio-temporal reasoning. Unlike existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RAYNOVA achieves state-of-the-art multi-view video generation results on nuScenes while offering higher throughput and strong controllability under diverse input conditions, generalizing to novel views and camera configurations without an explicit 3D scene representation. Our code will be released at http://yichen928.github.io/raynova.
TLDR: RAYNOVA is a geometry-free world model for driving scene video generation that uses a dual-causal autoregressive framework and Plücker-ray positional encoding for robust generalization across different camera setups, achieving SOTA results on nuScenes.
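To unpack the Plücker-ray positional encoding, the sketch below computes per-pixel ray directions from camera intrinsics and pose and packs them as Plücker coordinates (d, o x d). How RAYNOVA makes these encodings relative inside attention is not specified in the abstract, so this only covers the absolute ray coordinates, and the tensor conventions are assumptions.

```python
import torch

def plucker_ray_encoding(K, cam_to_world, H, W):
    """Per-pixel Plücker coordinates (d, o x d) for a pinhole camera.

    K: (3, 3) intrinsics; cam_to_world: (4, 4) camera-to-world pose.
    Returns a (H, W, 6) tensor; pixel and axis conventions are illustrative.
    """
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)  # (H, W, 3)
    dirs_cam = pix @ torch.linalg.inv(K).T                                # camera-frame rays
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs = torch.nn.functional.normalize(dirs_cam @ R.T, dim=-1)          # world-frame rays
    origin = t.expand_as(dirs)                                            # camera center per pixel
    moment = torch.cross(origin, dirs, dim=-1)                            # o x d
    return torch.cat([dirs, moment], dim=-1)                              # (H, W, 6)
```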
Read Paper (PDF)
A free-viewpoint, editable, and high-fidelity driving simulator is crucial for training and evaluating end-to-end autonomous driving systems. In this paper, we present GA-Drive, a novel simulation framework capable of generating camera views along user-specified novel trajectories through Geometry-Appearance Decoupling and Diffusion-Based Generation. Given a set of images captured along a recorded trajectory and the corresponding scene geometry, GA-Drive synthesizes novel pseudo-views using the geometry information. These pseudo-views are then transformed into photorealistic views by a trained video diffusion model. In this way, we decouple the geometry and appearance of scenes. An advantage of this decoupling is its support for appearance editing via state-of-the-art video-to-video editing techniques while preserving the underlying geometry, enabling consistent edits across both original and novel trajectories. Extensive experiments demonstrate that GA-Drive substantially outperforms existing methods in terms of NTA-IoU, NTL-IoU, and FID scores.
TLDR: GA-Drive is a novel driving simulation framework that decouples geometry and appearance for free-viewpoint driving scene generation using geometry information and video diffusion models, allowing for editable and high-fidelity simulations.
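The geometry-driven pseudo-view step can be pictured as depth-based reprojection of a recorded frame into the novel camera; the sketch below shows that standard backward warping (the subsequent diffusion-based refinement is omitted), with shapes and conventions chosen for illustration rather than taken from GA-Drive.

```python
import torch
import torch.nn.functional as F

def render_pseudo_view(src_image, novel_depth, K, novel_to_src):
    """Backward-warp a recorded frame into a novel camera using scene geometry.

    src_image: (1, 3, H, W) recorded frame; novel_depth: (1, 1, H, W) depth
    rendered for the novel camera from the given scene geometry; K: (3, 3)
    intrinsics; novel_to_src: (4, 4) relative pose. Conventions are illustrative.
    """
    _, _, H, W = src_image.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)               # (H, W, 3)
    pts = (pix @ torch.linalg.inv(K).T) * novel_depth[0, 0].unsqueeze(-1)  # unproject novel pixels
    pts_h = torch.cat([pts, torch.ones_like(pts[..., :1])], dim=-1)        # homogeneous coords
    pts_src = (pts_h @ novel_to_src.T)[..., :3]                            # into the source camera
    uvw = pts_src @ K.T                                                    # project into source image
    uv = uvw[..., :2] / uvw[..., 2:3].clamp(min=1e-6)
    grid = torch.stack(
        [uv[..., 0] / (W - 1) * 2 - 1, uv[..., 1] / (H - 1) * 2 - 1], dim=-1
    ).unsqueeze(0)                                                         # grid_sample expects [-1, 1]
    pseudo = F.grid_sample(src_image, grid, align_corners=True)
    # A video diffusion model would then turn this geometric pseudo-view photorealistic.
    return pseudo
```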
Read Paper (PDF)
Flow-based generative models have become a strong framework for high-quality generative modeling, yet pretrained models are rarely used in their vanilla conditional form: conditional samples without guidance often appear diffuse and lack fine-grained detail due to the smoothing effects of neural networks. Existing guidance techniques such as classifier-free guidance (CFG) improve fidelity but double the inference cost and typically reduce sample diversity. We introduce Momentum Guidance (MG), a new dimension of guidance that leverages the ODE trajectory itself. MG extrapolates the current velocity using an exponential moving average of past velocities and preserves the standard one-evaluation-per-step cost. It matches the effect of standard guidance without extra computation and can further improve quality when combined with CFG. Experiments demonstrate MG's effectiveness across benchmarks. Specifically, on ImageNet-256, MG achieves average FID improvements of 36.68% without CFG and 25.52% with CFG across various sampling settings, attaining an FID of 1.597 at 64 sampling steps. Evaluations on large flow-based models such as Stable Diffusion 3 and FLUX.1-dev further confirm consistent quality enhancements across standard metrics.
TLDR: This paper introduces Momentum Guidance (MG), a computationally efficient technique to improve the quality of samples from flow-based generative models, showing significant FID improvements on ImageNet and other large models.
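A minimal Euler sampler illustrating the idea of extrapolating the current velocity against an exponential moving average of past velocities. The guidance strength, EMA decay, and time convention are placeholders, not the paper's exact formulation.

```python
import torch

@torch.no_grad()
def sample_with_momentum_guidance(model, x, cond, steps=64, gamma=0.5, beta=0.9):
    """Euler ODE sampling with momentum-style velocity extrapolation.

    Integrates from t = 1 (noise) to t = 0 (data) in uniform steps; gamma
    (guidance strength) and beta (EMA decay) are illustrative values. One
    model evaluation per step, as in plain conditional sampling.
    """
    dt = 1.0 / steps
    ema = None
    for i in range(steps):
        t = 1.0 - i * dt
        v = model(x, t, cond)                 # single conditional velocity evaluation
        if ema is None:
            ema = v
        v_guided = v + gamma * (v - ema)      # extrapolate against the EMA of past velocities
        ema = beta * ema + (1.0 - beta) * v   # update the running average
        x = x - dt * v_guided                 # Euler step toward t = 0
    return x
```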
Read Paper (PDF)