Daily papers related to Image/Video/Multimodal Generation from cs.CV
February 23, 2026
AIGC has rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task that produces synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced commercial models such as Veo3, existing open-source methods still suffer from limitations in generation quality, temporal synchrony, and alignment with human preferences. To bridge the gap, this paper presents JavisDiT++, a concise yet powerful framework for unified modeling and optimization of JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables effective cross-modal interaction while enhancing single-modal generation quality. Then, we propose a temporal-aligned RoPE (TA-RoPE) strategy to achieve explicit, frame-level synchronization between audio and video tokens. In addition, we develop an audio-video direct preference optimization (AV-DPO) method to align model outputs with human preference across quality, consistency, and synchrony dimensions. Built upon Wan2.1-1.3B-T2V, our model achieves state-of-the-art performance with only around 1M public training entries, significantly outperforming prior approaches in both qualitative and quantitative evaluations. Comprehensive ablation studies validate the effectiveness of the proposed modules. The code, model, and dataset are released at https://JavisVerse.github.io/JavisDiT2-page.
TLDR: JavisDiT++ is a new framework for joint audio-video generation (JAVG) that uses modality-specific mixture-of-experts, temporal-aligned RoPE, and audio-video direct preference optimization to achieve state-of-the-art performance with limited training data.
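TLDR: JavisDiT++ is a new joint audio-video generation (JAVG) framework that combines modality-specific mixture-of-experts, temporally aligned RoPE, and audio-video direct preference optimization, reaching state-of-the-art performance with limited training data.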
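For readers unfamiliar with preference optimization, the sketch below shows the standard DPO loss for a single preference pair. It is only an illustration of the general objective that AV-DPO builds on, not the paper's formulation: the function name, the argument names, and the default beta are assumptions, and the abstract does not say how the quality, consistency, and synchrony dimensions enter the loss.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    All arguments are log-likelihoods of a sample under the trained policy
    (logp_*) or under a frozen reference model (ref_logp_*).
    The names and the default beta are illustrative assumptions.
    """
    # How much more the policy favors the preferred sample over the rejected
    # one, measured relative to the reference model.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss equals log 2 when the policy and the reference agree, and it decreases as the policy shifts probability toward the preferred sample.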
Precise spatial control in diffusion-based style transfer remains challenging. This challenge arises because diffusion models treat style as a global feature and lack explicit spatial grounding of style representations, making it difficult to restrict style application to specific objects or regions. To our knowledge, existing diffusion models are unable to perform true localized style transfer, typically relying on handcrafted masks or multi-stage post-processing that introduce boundary artifacts and limit generalization. To address this, we propose an attention-supervised diffusion framework that explicitly teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training. Two complementary objectives, a Focus loss based on KL divergence and a Cover loss using binary cross-entropy, jointly encourage accurate localization and dense coverage. A modular LoRA-MoE design further enables efficient and scalable multi-style adaptation. To evaluate localized stylization, we introduce the Regional Style Editing Score, which measures Regional Style Matching through CLIP-based similarity within the target region and Identity Preservation via masked LPIPS and pixel-level consistency on unedited areas. Experiments show that our method achieves mask-free, single-object style transfer at inference, producing regionally accurate and visually coherent results that outperform existing diffusion-based editing approaches.
TLDR: The paper introduces RegionRoute, a diffusion-based style transfer method that achieves localized style application without masks by using attention supervision and a LoRA-MoE design, demonstrating superior regional accuracy and visual coherence compared to existing methods.
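TLDR: This paper presents RegionRoute, a diffusion-based style transfer method that uses attention supervision and a LoRA-MoE design to apply style locally without masks. Experimental results show that it outperforms existing methods in regional accuracy and visual coherence.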
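The two training objectives named in the abstract can be sketched as follows. This is a minimal illustration that assumes a flattened attention map with scores in [0, 1] and a non-empty binary object mask. The KL direction, the normalization, and the function names are assumptions, since the abstract does not give the exact formulas.

```python
import math

EPS = 1e-8  # guards against log(0) and division by zero


def focus_loss(attn, mask):
    """KL(mask distribution || attention distribution) over spatial positions.

    Assumed form: both inputs are normalized to sum to 1, so the loss is
    small when the attention mass lies inside the mask.
    """
    attn_sum = sum(attn) + EPS
    mask_sum = sum(mask)  # assumed > 0 (non-empty mask)
    loss = 0.0
    for a, m in zip(attn, mask):
        q = m / mask_sum   # target: uniform inside the mask
        p = a / attn_sum   # attention as a distribution
        if q > 0:
            loss += q * math.log(q / (p + EPS))
    return loss


def cover_loss(attn, mask):
    """Mean binary cross-entropy between attention scores and the mask.

    Penalizes low attention inside the mask and high attention outside it,
    which encourages dense coverage of the object.
    """
    total = 0.0
    for a, m in zip(attn, mask):
        a = min(max(a, EPS), 1.0 - EPS)  # clamp away from 0 and 1
        total += -(m * math.log(a) + (1 - m) * math.log(1 - a))
    return total / len(attn)
```

Both losses are lower for an attention map concentrated on the masked region than for one concentrated outside it. A weighted sum of the two would be one plausible way to combine them.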
Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method to effectively mitigate severe channel redundancy, and (2) a stage-wise dominant operator optimization strategy to address the high inference cost of the widely used causal 3D convolutions in VAE decoders. Based on these innovations, we construct a Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving an approximately 6× speedup while retaining up to 96.9% of the reconstruction performance. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
TLDR: The paper introduces Flash-VAED, an acceleration framework for VAE decoders used in video generation, focusing on reducing latency while maintaining quality through channel pruning, operator optimization, and dynamic distillation. It achieves significant speedups with minimal quality loss.
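TLDR: The paper presents Flash-VAED, an acceleration framework for VAE decoders in video generation that uses channel pruning, operator optimization, and dynamic distillation to cut latency while preserving quality. It achieves significant speedups with minimal quality loss.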
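The gap between the decoder speedup and the end-to-end gain follows from Amdahl's law: only the share of runtime spent in the decoder gets faster. The helper below is a generic illustration. The function name is hypothetical, and the 30% decoder share in the usage note is an assumed value, not a figure from the paper.

```python
def end_to_end_speedup(decoder_fraction, decoder_speedup):
    """Amdahl's law: overall speedup when only the decoder is accelerated.

    decoder_fraction: share of total runtime spent in the VAE decoder (0..1).
    decoder_speedup:  factor by which the decoder alone becomes faster.
    """
    if not 0.0 <= decoder_fraction <= 1.0:
        raise ValueError("decoder_fraction must be in [0, 1]")
    if decoder_speedup <= 0:
        raise ValueError("decoder_speedup must be positive")
    # The undecoded part keeps its cost; the decoder part shrinks.
    remaining = (1.0 - decoder_fraction) + decoder_fraction / decoder_speedup
    return 1.0 / remaining
```

For example, if the decoder hypothetically accounted for 30% of the pipeline runtime, a 6× decoder speedup would give `end_to_end_speedup(0.3, 6.0)`, roughly a 1.33× overall speedup.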
The advent of one-step text-to-image (T2I) models offers unprecedented synthesis speed. However, their application to text-guided image editing remains severely hampered, as forcing existing training-free editors into a single inference step fails. This failure manifests as severe object distortion and a critical loss of consistency in non-edited regions, resulting from the high-energy, erratic trajectories produced by naive vector arithmetic on the models' structured fields. To address this problem, we introduce ChordEdit, a model-agnostic, training-free, and inversion-free method that facilitates high-fidelity one-step editing. We recast editing as a transport problem between the source and target distributions defined by the source and target text prompts. Leveraging dynamic optimal transport theory, we derive a principled, low-energy control strategy. This strategy yields a smoothed, variance-reduced editing field that is inherently stable, allowing the field to be traversed in a single, large integration step. Theoretically grounded and experimentally validated, ChordEdit delivers fast, lightweight, and precise edits, achieving true real-time editing on these challenging models.
TLDR: ChordEdit introduces a novel, training-free method for one-step text-guided image editing that addresses distortion and inconsistency issues in existing training-free editors by employing dynamic optimal transport theory for low-energy editing.
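TLDR: ChordEdit proposes a novel, training-free method for one-step text-guided image editing that uses dynamic optimal transport theory for low-energy editing, resolving the distortion and inconsistency problems of existing training-free editors.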
Diffusion models are a strong backbone for visual generation, but their inherently sequential denoising process leads to slow inference. Previous methods accelerate sampling by caching and reusing intermediate outputs based on feature distances between adjacent timesteps. However, existing caching strategies typically rely on raw feature differences that entangle content and noise. This design overlooks spectral evolution, where low-frequency structure appears early and high-frequency detail is refined later. We introduce Spectral-Evolution-Aware Cache (SeaCache), a training-free cache schedule that bases reuse decisions on a spectrally aligned representation. Through theoretical and empirical analysis, we derive a Spectral-Evolution-Aware (SEA) filter that preserves content-relevant components while suppressing noise. Employing SEA-filtered input features to estimate redundancy leads to dynamic schedules that adapt to content while respecting the spectral priors underlying the diffusion model. Extensive experiments across diverse visual generative models and against existing caching baselines show that SeaCache achieves state-of-the-art latency-quality trade-offs.
TLDR: The paper introduces SeaCache, a novel caching strategy for diffusion models that leverages spectral evolution to improve the latency-quality trade-off during inference by better separating content and noise.
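TLDR: The paper presents SeaCache, a novel caching strategy for diffusion models that exploits spectral evolution to improve the latency-quality trade-off at inference and to better separate content from noise.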
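The distance-based caching that the abstract describes as prior work can be sketched as a simple schedule: accumulate the relative change between input features of adjacent timesteps and recompute only when it crosses a threshold. This is a generic illustration, not SeaCache itself. The names `relative_l1`, `cache_schedule`, and `filter_fn` are hypothetical, and `filter_fn` is only a placeholder for where a filter such as the SEA filter would be applied before measuring distance (identity by default).

```python
def relative_l1(curr, prev):
    """Relative L1 distance between two feature vectors."""
    num = sum(abs(c - p) for c, p in zip(curr, prev))
    den = sum(abs(p) for p in prev) + 1e-8  # avoid division by zero
    return num / den


def cache_schedule(features, threshold, filter_fn=lambda x: x):
    """Return one flag per timestep: True = recompute, False = reuse cache.

    features:  list of per-timestep input feature vectors (lists of floats).
    threshold: accumulated relative change that triggers a recompute.
    filter_fn: hypothetical hook applied to features before comparison.
    """
    decisions = []
    accumulated = 0.0
    prev = None
    for feat in features:
        feat = filter_fn(feat)
        if prev is None:
            decisions.append(True)  # first step: nothing is cached yet
        else:
            accumulated += relative_l1(feat, prev)
            if accumulated >= threshold:
                decisions.append(True)
                accumulated = 0.0  # cache refreshed, restart accumulation
            else:
                decisions.append(False)
        prev = feat
    return decisions
```

With raw features, noise-driven fluctuations contribute to the accumulated distance. Filtering the features first changes which steps are judged redundant, which is the lever the abstract attributes to the SEA filter.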
Personalized image generation requires effectively balancing content fidelity with stylistic consistency when synthesizing images based on text and reference examples. Low-Rank Adaptation (LoRA) offers an efficient personalization approach, with potential for precise control through combining LoRA weights on different concepts. However, existing combination techniques face persistent challenges: entanglement between content and style representations, insufficient guidance for controlling elements' influence, and unstable weight fusion that often requires additional training. We address these limitations through CRAFT-LoRA, which has three complementary components: (1) rank-constrained backbone fine-tuning that injects low-rank projection residuals to encourage learning decoupled content and style subspaces; (2) a prompt-guided approach featuring an expert encoder with specialized branches that enables semantic extension and precise control through selective adapter aggregation; and (3) a training-free, timestep-dependent classifier-free guidance scheme that enhances generation stability by strategically adjusting noise predictions across diffusion steps. Our method significantly improves content-style disentanglement, enables flexible semantic control over LoRA module combinations, and achieves high-fidelity generation without additional retraining overhead.
TLDR: The paper introduces CRAFT-LoRA, a novel approach for personalized image generation that enhances content-style disentanglement and control through rank-constrained fine-tuning, prompt-guided adaptation, and a training-free guidance scheme.
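TLDR: The paper presents CRAFT-LoRA, a novel personalized image generation method that strengthens content-style disentanglement and control through rank-constrained fine-tuning, prompt-guided adaptation, and a training-free guidance scheme.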
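Component (3) builds on classifier-free guidance, where the guided noise prediction is the unconditional prediction plus a scaled difference between the conditional and unconditional predictions. The sketch below makes the scale a function of the timestep. The linear schedule and its endpoint values are purely illustrative assumptions, since the abstract does not specify how CRAFT-LoRA adjusts the predictions.

```python
def guidance_scale(step, num_steps, w_start=7.5, w_end=3.0):
    """Hypothetical linear schedule from w_start (first step) to w_end (last)."""
    if num_steps < 2:
        return w_start
    frac = step / (num_steps - 1)
    return w_start + frac * (w_end - w_start)


def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

A sampler would call `guided_noise(eps_u, eps_c, guidance_scale(step, num_steps))` at each denoising step. With a scale of 1 the result reduces to the conditional prediction.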