Daily papers related to Image/Video/Multimodal Generation from cs.CV
February 12, 2026
Autoregressive models, often built on Transformer architectures, represent a powerful paradigm for generating ultra-long videos by synthesizing content in sequential chunks. However, this sequential generation process is notoriously slow. While caching strategies have proven effective for accelerating traditional video diffusion models, existing methods assume uniform denoising across all frames, an assumption that breaks down in autoregressive models where different video chunks exhibit varying similarity patterns at identical timesteps. In this paper, we present FlowCache, the first caching framework specifically designed for autoregressive video generation. Our key insight is that each video chunk should maintain independent caching policies, allowing fine-grained control over which chunks require recomputation at each timestep. We introduce a chunkwise caching strategy that dynamically adapts to the unique denoising characteristics of each chunk, complemented by a joint importance-redundancy optimized KV cache compression mechanism that maintains fixed memory bounds while preserving generation quality. Our method achieves remarkable speedups of 2.38 times on MAGI-1 and 6.7 times on SkyReels-V2, with negligible quality degradation (VBench: 0.87 increase and 0.79 decrease respectively). These results demonstrate that FlowCache successfully unlocks the potential of autoregressive models for real-time, ultra-long video generation, establishing a new benchmark for efficient video synthesis at scale. The code is available at https://github.com/mikeallen39/FlowCache.
TLDR: The paper introduces FlowCache, a novel caching framework specifically designed to accelerate autoregressive video generation by using chunk-wise caching policies and a KV cache compression mechanism. It achieves significant speedups on MAGI-1 and SkyReels-V2 while maintaining video quality.
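The idea of giving each chunk its own caching policy can be illustrated with a minimal sketch. This is not the FlowCache implementation: the `ChunkCache` class, the accumulated relative-L1 criterion, and the threshold value are all assumptions chosen for illustration, in the spirit of threshold-based caching used for diffusion models.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]


def relative_l1(current: Vector, previous: Vector) -> float:
    """Relative L1 change of `current` with respect to `previous`."""
    num = sum(abs(c - p) for c, p in zip(current, previous))
    den = sum(abs(p) for p in previous) or 1.0  # avoid division by zero
    return num / den


class ChunkCache:
    """Hypothetical per-chunk cache state (illustrative, not the paper's API)."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.last_input: Optional[Vector] = None
        self.last_output: Optional[Vector] = None
        self.accumulated = 0.0  # input drift accumulated since the last recompute

    def step(self, x: Vector, model: Callable[[Vector], Vector]) -> Vector:
        if self.last_input is not None and self.last_output is not None:
            self.accumulated += relative_l1(x, self.last_input)
            if self.accumulated < self.threshold:
                self.last_input = x
                return self.last_output  # reuse the cached output for this chunk
        # First step, or the chunk has drifted too far: recompute and reset.
        out = model(x)
        self.last_input, self.last_output = x, out
        self.accumulated = 0.0
        return out


def run_demo() -> Dict[str, int]:
    """Two chunks at the same timesteps: one changes slowly, one quickly."""
    calls = {"slow": 0, "fast": 0}

    def make_model(name: str) -> Callable[[Vector], Vector]:
        def model(x: Vector) -> Vector:
            calls[name] += 1  # count how often this chunk is recomputed
            return [2.0 * v for v in x]
        return model

    caches = {name: ChunkCache(threshold=0.1) for name in calls}
    models = {name: make_model(name) for name in calls}
    for t in range(5):
        inputs = {"slow": [1.0 + 0.01 * t, 1.0], "fast": [1.0 + 0.5 * t, 1.0]}
        for name, cache in caches.items():
            cache.step(inputs[name], models[name])
    return calls
```

In this toy run the slowly changing chunk is computed once and then served from cache, while the quickly changing chunk is recomputed at every step, which is the kind of per-chunk decision a single frame-uniform policy cannot express.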
The slow iterative sampling nature remains a major bottleneck for the practical deployment of diffusion and flow-based generative models. While consistency models (CMs) represent a state-of-the-art distillation-based approach for efficient generation, their large-scale application is still limited by two key issues: training instability and inflexible sampling. Existing methods seek to mitigate these problems through architectural adjustments or regularized objectives, yet overlook the critical reliance on trajectory selection. In this work, we first analyze these two limitations: training instability originates from loss divergence induced by an unstable self-supervised term, whereas sampling inflexibility arises from error accumulation. Based on this analysis, we propose the Dual-End Consistency Model (DE-CM), which selects vital sub-trajectory clusters to achieve stable and effective training. DE-CM decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets. Specifically, our approach leverages continuous-time CM objectives to achieve few-step distillation and utilizes flow matching as a boundary regularizer to stabilize the training process. Furthermore, we propose a novel noise-to-noisy (N2N) mapping that can map noise to any point, thereby alleviating the error accumulation in the first step. Extensive experimental results show the effectiveness of our method: it achieves a state-of-the-art FID score of 1.70 in one-step generation on the ImageNet 256x256 dataset, outperforming existing CM-based one-step approaches.
TLDR: The paper introduces Dual-End Consistency Model (DE-CM), a novel approach to improve consistency model training stability and sampling flexibility by selecting vital sub-trajectory clusters and utilizing noise-to-noisy mapping, achieving state-of-the-art one-step image generation results on ImageNet.
The success of text-guided diffusion models has established a new image generation paradigm driven by the iterative refinement of text prompts. However, modifying the original text prompt to achieve the expected semantic adjustments often results in unintended global structure changes that disrupt user intent. Existing methods intervene on empirically selected feature maps, so their performance depends heavily on that selection and their stability suffers. This paper addresses the problem from a frequency perspective, analyzing how the frequency spectrum of noisy latent variables affects the hierarchical emergence of the structural framework and fine-grained textures during the generation process. We find that lower-frequency components are primarily responsible for establishing the structural framework in the early generation stage. Their influence diminishes over time, giving way to higher-frequency components that synthesize fine-grained textures. In light of this, we propose a training-free frequency modulation method utilizing a frequency-dependent weighting function with dynamic decay. This method maintains the consistency of the structural framework while permitting targeted semantic modifications. By directly manipulating the noisy latent variable, the proposed method avoids the empirical selection of internal feature maps. Extensive experiments demonstrate that the proposed method significantly outperforms current state-of-the-art methods, achieving an effective balance between preserving structure and enabling semantic updates.
TLDR: This paper introduces a training-free frequency modulation method for text-driven image generation that dynamically adjusts the frequency components of the latent variable to better control structure preservation and semantic modifications, avoiding empirical feature map selection.
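A frequency-dependent weighting with dynamic decay can be sketched on a one-dimensional toy latent. The abstract does not give the weighting function, so everything here is an assumption made for illustration: the Gaussian low-pass shape, the linear decay `1 - progress`, the `cutoff` value, and the function names. The sketch blends the low-frequency content of a reference latent into an edited latent early in generation and lets the edited latent take over later.

```python
import cmath
import math
from typing import List


def dft(x: List[complex]) -> List[complex]:
    """Naive discrete Fourier transform (fine for tiny toy signals)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
            for k in range(n)]


def idft(spec: List[complex]) -> List[complex]:
    """Naive inverse discrete Fourier transform."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * j / n) for k in range(n)) / n
            for j in range(n)]


def weight(k: int, n: int, progress: float, cutoff: float = 2.0) -> float:
    """Weight on the reference latent for bin k; progress runs from 0 (start) to 1 (end).

    Assumed form: Gaussian low-pass in frequency times a linear decay in time.
    """
    freq = min(k, n - k)  # distance from DC, so mirrored bins get equal weight
    low_pass = math.exp(-((freq / cutoff) ** 2))
    decay = 1.0 - progress
    return low_pass * decay


def modulate(edited: List[float], reference: List[float], progress: float) -> List[float]:
    """Blend the spectra of two real-valued latents with frequency-dependent weights."""
    n = len(edited)
    e_spec = dft([complex(v) for v in edited])
    r_spec = dft([complex(v) for v in reference])
    mixed = []
    for k in range(n):
        w = weight(k, n, progress)
        mixed.append(w * r_spec[k] + (1.0 - w) * e_spec[k])
    # The weights are symmetric in k, so the result is real up to rounding error.
    return [v.real for v in idft(mixed)]
```

At `progress=0.0` the DC component comes entirely from the reference latent, so coarse structure is preserved; at `progress=1.0` the weight is zero everywhere and the edited latent passes through unchanged.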
Despite the remarkable success of sampling-based generative models such as flow matching, they can still produce samples of inconsistent or degraded quality. To assess sample reliability and generate higher-quality outputs, we propose uncertainty-aware flow matching (UA-Flow), a lightweight extension of flow matching that predicts the velocity field together with heteroscedastic uncertainty. UA-Flow estimates per-sample uncertainty by propagating velocity uncertainty through the flow dynamics. These uncertainty estimates act as a reliability signal for individual samples, and we further use them to steer generation via uncertainty-aware classifier guidance and classifier-free guidance. Experiments on image generation show that UA-Flow produces uncertainty signals more highly correlated with sample fidelity than baseline methods, and that uncertainty-guided sampling further improves generation quality.
TLDR: This paper introduces Uncertainty-Aware Flow Matching (UA-Flow), an extension of flow matching that predicts heteroscedastic uncertainty alongside the velocity field, improving sample reliability and generation quality through uncertainty-guided sampling.
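Predicting a velocity together with heteroscedastic uncertainty is commonly trained with a Gaussian negative log-likelihood. The sketch below assumes that form, a linear interpolation path with target velocity `x1 - x0`, and a per-dimension log-variance output; none of these details are confirmed by the abstract, and the function names are hypothetical.

```python
import math
from typing import List


def flow_target(x0: List[float], x1: List[float]) -> List[float]:
    """Target velocity for the linear path x_t = (1 - t) * x0 + t * x1 (assumed path)."""
    return [b - a for a, b in zip(x0, x1)]


def heteroscedastic_nll(v_pred: List[float], log_var: List[float],
                        v_target: List[float]) -> float:
    """Mean Gaussian negative log-likelihood, dropping the constant 0.5 * log(2 * pi).

    Each dimension has its own predicted log-variance, so the model can report
    low confidence where its velocity error is large.
    """
    total = 0.0
    for pred, s, target in zip(v_pred, log_var, v_target):
        total += 0.5 * (s + (target - pred) ** 2 / math.exp(s))
    return total / len(v_pred)
```

With a velocity error of 2 in one dimension, reporting a variance of 4 (`log_var = log 4`) gives a lower loss than claiming unit variance, which is the mechanism that makes the predicted variance track the actual error.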
Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments on ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms existing methods on multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at https://myangwu.github.io/ConsID-Gen.
TLDR: The paper introduces ConsID-Gen, a view-assisted image-to-video generation framework leveraging a new dataset (ConsIDVid) and a dual-stream visual-geometric encoder to improve identity preservation and temporal coherence, outperforming existing models in benchmark evaluations.
Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$Δ$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
TLDR: The paper introduces Olaf-World, a method for pretraining action-conditioned video world models from unlabeled video by aligning latent actions with observed semantic effects, improving zero-shot action transfer and data efficiency.
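Anchoring an integrated latent action to a temporal feature difference can be sketched as a cosine alignment loss. The abstract does not specify the objective, so the summation used for integration, the cosine form, and the assumption that latent actions and encoder features share a dimension (in practice a learned projection would likely sit in between) are all illustrative choices, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def integrate_actions(latent_actions: List[Vector]) -> Vector:
    """Sum per-step latent actions over the sequence (assumed integration rule)."""
    dim = len(latent_actions[0])
    return [sum(step[d] for step in latent_actions) for d in range(dim)]


def feature_delta(first: Vector, last: Vector) -> Vector:
    """Temporal difference of frozen-encoder features between last and first frames."""
    return [b - a for a, b in zip(first, last)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def alignment_loss(latent_actions: List[Vector], feat_first: Vector,
                   feat_last: Vector) -> float:
    """1 - cos(integrated action, feature difference); 0 when perfectly aligned."""
    return 1.0 - cosine(integrate_actions(latent_actions),
                        feature_delta(feat_first, feat_last))
```

Because the feature difference comes from a frozen encoder shared across all clips, a loss of this shape gives every clip the same reference frame for what an action does, which is the cross-context alignment the abstract argues clip-local objectives lack.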
Leveraging representation encoders for generative modeling offers a path for efficient, high-fidelity synthesis. However, standard diffusion transformers fail to converge on these representations directly. While recent work attributes this to a capacity bottleneck, proposing computationally expensive width scaling of diffusion transformers, we demonstrate that the failure is fundamentally geometric. We identify Geometric Interference as the root cause: standard Euclidean flow matching forces probability paths through the low-density interior of the hyperspherical feature space of representation encoders, rather than following the manifold surface. To resolve this, we propose Riemannian Flow Matching with Jacobi Regularization (RJF). By constraining the generative process to the manifold geodesics and correcting for curvature-induced error propagation, RJF enables standard Diffusion Transformer architectures to converge without width scaling. With RJF, the standard DiT-B architecture (131M parameters) converges effectively, achieving an FID of 3.37 where prior methods fail to converge. Code: https://github.com/amandpkr/RJF
TLDR: This paper identifies and addresses the problem of Geometric Interference in diffusion transformers when using representation encoders, proposing Riemannian Flow Matching with Jacobi Regularization (RJF) to improve convergence and performance without expensive scaling.
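The geometric point can be checked numerically: a straight Euclidean path between two points on a unit hypersphere cuts through the interior, while the geodesic stays on the surface. The sketch below compares linear interpolation with spherical linear interpolation (slerp), assuming both endpoints have already been normalized onto the unit sphere. It illustrates the path geometry only and is not the RJF training procedure, and it does not cover the Jacobi regularization term.

```python
import math
from typing import List

Vector = List[float]


def norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def lerp(a: Vector, b: Vector, t: float) -> Vector:
    """Euclidean (straight-line) interpolation."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def slerp(a: Vector, b: Vector, t: float) -> Vector:
    """Geodesic interpolation between two unit vectors on the sphere."""
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between the endpoints
    if omega < 1e-8:
        return lerp(a, b, t)  # endpoints nearly coincide
    if math.pi - omega < 1e-8:
        raise ValueError("antipodal endpoints: the geodesic is not unique")
    s = math.sin(omega)
    wa = math.sin((1.0 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]
```

For the orthogonal endpoints `[1, 0, 0]` and `[0, 1, 0]`, the Euclidean midpoint has norm of about 0.707, well inside the sphere, whereas the slerp midpoint has norm 1.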
Causality -- referring to temporal, uni-directional cause-effect relationships between components -- underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
TLDR: This paper introduces Separable Causal Diffusion (SCD), an architecture that decouples temporal reasoning from frame-wise rendering in video diffusion models, leading to improved throughput and latency while maintaining generation quality.