Daily papers related to Image/Video/Multimodal Generation from cs.CV
December 16, 2025
Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability, then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.
TLDR: LongVie 2 is a novel end-to-end framework for controllable, long-term, high-quality video generation, achieving state-of-the-art results on a new benchmark (LongVGenBench) and supporting video generation up to five minutes.
TLDR: LongVie 2是一个新颖的端到端框架,用于可控的、长期的、高质量的视频生成。它在新基准测试 (LongVGenBench) 上取得了最先进的结果,并支持长达五分钟的视频生成。
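The clip-by-clip structure that the LongVie 2 abstract describes (an autoregressive framework whose adjacent clips share context) can be illustrated with a toy rollout loop. This is only a sketch under stated assumptions: `autoregressive_rollout`, `toy_generator`, and the integer "frames" are hypothetical stand-ins, and the abstract does not specify how history is actually consumed by the model.

```python
from typing import Callable, List

Frame = int  # stand-in for an image tensor; real frames would be arrays

# Hypothetical signature: (start_frame, history_frames, clip_len) -> new frames
ClipGenerator = Callable[[Frame, List[Frame], int], List[Frame]]


def autoregressive_rollout(
    first_frame: Frame,
    num_clips: int,
    clip_len: int,
    history_len: int,
    generate_clip: ClipGenerator,
) -> List[Frame]:
    """Generate a long video one clip at a time.

    Each clip starts from the last frame generated so far and also receives
    the most recent `history_len` frames as context.
    """
    video: List[Frame] = [first_frame]
    for _ in range(num_clips):
        start = video[-1]
        # Guard against history_len == 0, because video[-0:] is the whole list.
        history = video[-history_len:] if history_len > 0 else []
        video.extend(generate_clip(start, history, clip_len))
    return video


def toy_generator(start: Frame, history: List[Frame], clip_len: int) -> List[Frame]:
    """Toy generator: continues counting from the start frame (ignores history)."""
    return [start + i + 1 for i in range(clip_len)]


video = autoregressive_rollout(0, num_clips=3, clip_len=4, history_len=2,
                               generate_clip=toy_generator)
print(len(video))  # 13 frames: the initial frame plus 3 clips of 4 frames
```

Because every clip is conditioned on frames the model itself produced, errors can accumulate over a long rollout, which is the training/inference gap that the degradation-aware stage is described as addressing.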
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.
TLDR: Seedance 1.5 pro is a new audio-visual generation foundation model achieving exceptional synchronization and quality through a dual-branch Diffusion Transformer, post-training optimizations (SFT, RLHF), a multi-stage data pipeline, and a 10X inference speedup. It enables precise lip-syncing, cinematic camera control, and improved narrative coherence.
TLDR: Seedance 1.5 pro 是一款新的音视频生成基础模型,通过双分支扩散Transformer、后期训练优化(SFT、RLHF)、多阶段数据管道和 10 倍的推理速度提升,实现了卓越的同步和质量。 它可以实现精确的唇形同步、电影摄像机控制和改进的叙事连贯性。
We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed Soul, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE are used to optimize inference efficiency, achieving an 11.4× speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video-text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production. Project page at https://zhangzjn.github.io/projects/Soul/
TLDR: The paper introduces Soul, a multimodal framework for generating high-fidelity, long-term digital human animations from a single portrait, text, and audio, along with a large dataset and benchmark for evaluation.
TLDR: 该论文介绍了Soul,一个多模态框架,用于从单人肖像、文本和音频生成高保真、长期的数字人动画,并提供了一个大型数据集和基准用于评估。
Native 4K (2160×3840) video generation remains a critical challenge due to the quadratic computational explosion of full-attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed T3 (Transform Trained Transformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, T3-Video introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an "attention pattern" transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that T3-Video substantially outperforms existing approaches: while delivering performance improvements (+4.29↑ VQA and +0.08↑ VTC), it accelerates native 4K video generation by more than 10×. Project page at https://zhangzjn.github.io/projects/T3-Video
TLDR: The paper introduces T3-Video, a Transformer retrofit strategy that accelerates native 4K video generation by over 10x without altering the original model architecture, achieving significant performance improvements.
TLDR: 该论文介绍了 T3-Video,一种 Transformer 改造策略,可在不改变原始模型架构的情况下将原生 4K 视频生成速度提高 10 倍以上,并实现显著的性能提升。
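The "quadratic computational explosion" mentioned in the T3-Video abstract can be made concrete with a little arithmetic. The sketch below compares the number of query-key pairs scored by full attention with the number scored by plain non-overlapping window attention. It is a generic illustration, not T3-Video's multi-scale weight-sharing mechanism, and the token count and window sizes are hypothetical.

```python
def full_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs under full self-attention: every token attends to every token."""
    return num_tokens * num_tokens


def window_attention_pairs(num_tokens: int, window_size: int) -> int:
    """Query-key pairs when tokens are split into non-overlapping windows
    and attention is computed only inside each window."""
    if window_size <= 0 or num_tokens % window_size != 0:
        raise ValueError("num_tokens must be a positive multiple of window_size")
    num_windows = num_tokens // window_size
    return num_windows * window_size * window_size


num_tokens = 4096  # hypothetical sequence length, chosen only for illustration
for window_size in (4096, 1024, 256):
    ratio = full_attention_pairs(num_tokens) / window_attention_pairs(num_tokens, window_size)
    print(f"window={window_size}: full attention scores {ratio:.0f}x as many pairs")
```

Full attention grows with the square of the token count, while windowed attention grows linearly in the token count for a fixed window size, which is why restricting the attention pattern can reduce compute so sharply at high resolution.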
Pose-guided video generation refers to controlling the motion of subjects in generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods are limited to accepting only human poses as input, thus generalizing poorly to poses of other subjects. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters, supporting arbitrary skeletal inputs. To enhance consistency preservation during motion, we introduce a Part-aware Temporal Coherence Module, which divides the subject into different parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that, for the first time, enables independent camera movement control in pose-guided video generation, by separately injecting subject and camera motion control information into the positive and negative anchors of CFG. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that PoseAnything significantly outperforms state-of-the-art methods in both effectiveness and generalization.
TLDR: The paper introduces PoseAnything, a universal pose-guided video generation framework, which extends pose-guided video generation to non-human characters with part-aware temporal coherence and independent camera control, and includes a new dataset XPose.
TLDR: 该论文介绍了PoseAnything,一个通用的姿势引导视频生成框架,它将姿势引导视频生成扩展到非人类角色,具有部分感知的时序连贯性和独立的相机控制,并包含一个新的数据集XPose。
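For readers unfamiliar with the "positive and negative anchors" of classifier-free guidance (CFG) that the PoseAnything abstract refers to, the sketch below shows the standard CFG combination step on plain Python lists. It does not reproduce the paper's decoupled variant: the abstract does not give the formula or say which control signal goes into which anchor, so `cfg_combine` and its argument names are illustrative assumptions.

```python
from typing import List


def cfg_combine(
    eps_negative: List[float],
    eps_positive: List[float],
    guidance_scale: float,
) -> List[float]:
    """Standard classifier-free guidance.

    Starts at the negative-anchor prediction and extrapolates toward the
    positive-anchor prediction by `guidance_scale`.
    """
    if len(eps_negative) != len(eps_positive):
        raise ValueError("both predictions must have the same length")
    return [n + guidance_scale * (p - n) for n, p in zip(eps_negative, eps_positive)]


# guidance_scale = 1.0 returns the positive prediction unchanged;
# larger values push further away from the negative anchor.
print(cfg_combine([0.0, 0.0], [1.0, 2.0], 1.0))  # [1.0, 2.0]
print(cfg_combine([0.0, 0.0], [1.0, 2.0], 2.0))  # [2.0, 4.0]
```

Since the guided output depends on the difference between the two anchors, changing what conditioning each anchor receives changes which signal the guidance amplifies. The abstract describes exploiting this to control subject and camera motion separately.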
Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
TLDR: KlingAvatar 2.0 introduces a spatio-temporal cascade framework with multi-modal LLM co-reasoning to efficiently generate long-duration, high-resolution avatar videos with improved temporal coherence, instruction following, and identity preservation.
TLDR: KlingAvatar 2.0 引入了一个时空级联框架,结合多模态LLM协同推理,能够高效生成长时间、高分辨率的化身视频,并提升了时间连贯性、指令遵循和身份保持。
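One plausible reading of the "first-last frame strategy" in the KlingAvatar 2.0 abstract is that each high-resolution sub-clip is bounded by two consecutive blueprint keyframes. The sketch below shows only that bookkeeping. The pairing rule, the function name `first_last_pairs`, and the integer keyframe indices are assumptions for illustration, since the abstract does not describe the actual scheduling.

```python
from typing import List, Tuple


def first_last_pairs(keyframes: List[int]) -> List[Tuple[int, int]]:
    """Pair consecutive keyframes as (first, last) boundaries of each sub-clip."""
    return list(zip(keyframes, keyframes[1:]))


# Hypothetical keyframe positions in the final video timeline.
pairs = first_last_pairs([0, 16, 32, 48])
print(pairs)  # [(0, 16), (16, 32), (32, 48)]
```

Under this reading, adjacent sub-clips share a boundary keyframe, so each sub-clip ends exactly where the next one begins.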
Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they still struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. We argue that these issues stem from the models' failure to learn causal directions and to disentangle causal factors for novel recombination. We introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset to enable diagnostic interventions. This analysis yields three key insights. First, DMs struggle with multi-hop reasoning for elements not explicitly determined in the prompt. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is disproportionately established during the initial, computationally limited denoising steps. Based on these findings, we introduce LINA (Learning INterventions Adaptively), a novel framework that learns to predict prompt-specific interventions, which employs (1) targeted guidance in the prompt and visual latent spaces, and (2) a reallocated, causality-aware denoising schedule. Our approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset. Our project page is at https://opencausalab.github.io/LINA.
TLDR: The paper introduces LINA, a framework that learns adaptive prompt-specific interventions for diffusion models to improve physical alignment and out-of-distribution instruction following by targeting guidance in both the prompt and visual latent spaces with a reallocated denoising schedule, achieving SOTA results in causal generation tasks.
TLDR: 该论文介绍了LINA,一个学习自适应的提示特定干预的框架,旨在改善扩散模型中的物理对齐和超出分布的指令遵循。该框架通过在提示和视觉潜在空间中进行目标引导,并重新分配去噪时间表来实现上述目标,并在因果生成任务中实现了SOTA结果。
Instruction-based image editing with diffusion models has achieved impressive results, yet existing methods struggle with fine-grained instructions specifying precise attributes such as colors, positions, and quantities. While recent approaches employ Group Relative Policy Optimization (GRPO) for alignment, they optimize only at individual sampling steps, providing sparse feedback that limits trajectory-level control. We propose a unified framework, CogniEdit, combining multi-modal reasoning with dense reward optimization that propagates gradients across consecutive denoising steps, enabling trajectory-level gradient flow through the sampling process. Our method comprises three components: (1) Multi-modal Large Language Models for decomposing complex instructions into actionable directives, (2) Dynamic Token Focus Relocation that adaptively emphasizes fine-grained attributes, and (3) Dense GRPO-based optimization that propagates gradients across consecutive steps for trajectory-level supervision. Extensive experiments on benchmark datasets demonstrate that our CogniEdit achieves state-of-the-art performance in balancing fine-grained instruction following with visual quality and editability preservation.
TLDR: CogniEdit improves fine-grained instruction-based image editing with diffusion models by using dense gradient flow optimization across denoising steps, achieving state-of-the-art performance.
TLDR: CogniEdit通过在去噪步骤中使用密集梯度流优化,改进了基于指令的精细图像编辑,并在扩散模型上实现了最先进的性能。
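The CogniEdit abstract builds on Group Relative Policy Optimization (GRPO). The core idea behind the name is that each sample's reward is scored relative to the other samples drawn for the same prompt. The sketch below shows that normalization step only. It is a minimal illustration of the commonly described group-relative advantage, not CogniEdit's dense, trajectory-level variant, and the choice of population standard deviation and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    better than the group average get a positive advantage and worse ones
    a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for three edits sampled from the same instruction.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print([round(a, 3) for a in advantages])
```

Because the baseline is the group mean, no separate value network is needed to estimate how good a sample is. When all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.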
The development of clinical-grade artificial intelligence in pathology is limited by the scarcity of diverse, high-quality annotated datasets. Generative models offer a potential solution but suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability. To address this challenge, we introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS), the first generative foundation model for pathology-specific text-to-image synthesis. By leveraging a dual-stage training strategy on approximately 2.8 million image-caption pairs, CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations. Furthermore, CRAFTS-augmented datasets enhance the performance across various clinical tasks, including classification, cross-modal retrieval, self-supervised learning, and visual question answering. In addition, coupling CRAFTS with ControlNet enables precise control over tissue architecture from inputs such as nuclear segmentation masks and fluorescence images. By overcoming the critical barriers of data scarcity and privacy concerns, CRAFTS provides a limitless source of diverse, annotated histology data, effectively unlocking the creation of robust diagnostic tools for rare and complex cancer phenotypes.
TLDR: The paper introduces CRAFTS, a pathology-specific text-to-image generative foundation model with a novel alignment mechanism that generates diverse and accurate pathological images, enhancing performance in various clinical tasks and overcoming data scarcity.
TLDR: 该论文介绍了一种病理学特定的文本到图像生成基础模型CRAFTS,它具有一种新颖的对齐机制,可以生成多样且准确的病理图像,从而提高各种临床任务的性能并克服数据稀缺性。
Instructional video generation is an emerging task that aims to synthesize coherent demonstrations of procedural activities from textual descriptions. Such capability has broad implications for content creation, education, and human-AI interaction, yet existing video diffusion models struggle to maintain temporal consistency and controllability across long sequences of multiple action steps. We introduce a pipeline for future-driven streaming instructional video generation, dubbed SneakPeek, a diffusion-based autoregressive framework designed to generate precise, stepwise instructional videos conditioned on an initial image and structured textual prompts. Our approach introduces three key innovations to enhance consistency and controllability: (1) predictive causal adaptation, where a causal model learns to perform next-frame prediction and anticipate future keyframes; (2) future-guided self-forcing with a dual-region KV caching scheme to address the exposure bias issue at inference time; (3) multi-prompt conditioning, which provides fine-grained and procedural control over multi-step instructions. Together, these components mitigate temporal drift, preserve motion consistency, and enable interactive video generation where future prompt updates dynamically influence ongoing streaming video generation. Experimental results demonstrate that our method produces temporally coherent and semantically faithful instructional videos that accurately follow complex, multi-step task descriptions.
TLDR: The paper introduces SneakPeek, a novel diffusion-based autoregressive framework for generating instructional videos from text, addressing temporal consistency and controllability issues with predictive causal adaptation, future-guided self-forcing, and multi-prompt conditioning.
TLDR: 该论文介绍了一种名为SneakPeek的新型基于扩散的自回归框架,用于从文本生成教学视频,通过预测因果适应、未来引导的自强制和多提示条件反射来解决时间一致性和可控性问题。
Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement learning (via GRPO) with our proposed causal consistency reward. Experiments demonstrate our model achieves state-of-the-art performance on our benchmark, advancing the capability of generalist multimodal systems to anticipate what happens next.
TLDR: The paper introduces Next Scene Prediction (NSP), a new task for unified video models that requires temporal and causal reasoning, and proposes a unified framework to solve it, achieving state-of-the-art performance on a newly curated dataset.
TLDR: 该论文介绍了下一场景预测 (NSP) 任务,这是一个针对统一视频模型的新任务,需要时序和因果推理。论文提出了一个统一的框架来解决该问题,并在一个新创建的数据集上取得了最先进的性能。
Given the inherently costly and time-intensive nature of pixel-level annotation, the generation of synthetic datasets comprising sufficiently diverse synthetic images paired with ground-truth pixel-level annotations has garnered increasing attention recently for training high-performance semantic segmentation models. However, existing methods must either predict pseudo annotations after image generation or generate images conditioned on manual annotation masks, which leads to image-annotation semantic inconsistency or scalability problems, respectively. To mitigate both problems at once, we present a novel dataset generative diffusion framework for semantic segmentation, termed JoDiffusion. First, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared by images. Then, the diffusion model is tailored to capture the joint distribution of each image and its annotation mask conditioned on a text prompt. In this way, JoDiffusion can simultaneously generate paired images and semantically consistent annotation masks conditioned solely on text prompts, thereby demonstrating superior scalability. Additionally, a mask optimization strategy is developed to mitigate the annotation noise produced during generation. Experiments on Pascal VOC, COCO, and ADE20K datasets show that the annotated dataset generated by JoDiffusion yields substantial performance improvements in semantic segmentation compared to existing methods.
TLDR: JoDiffusion introduces a novel diffusion framework that jointly generates synthetic images and pixel-level annotations conditioned on text prompts, addressing the limitations of existing methods for semantic segmentation data generation. The approach achieves improved performance on standard semantic segmentation benchmarks.
TLDR: JoDiffusion 提出了一种新颖的扩散框架,能够在文本提示的引导下联合生成合成图像和像素级注释,从而解决了现有语义分割数据生成方法的局限性。该方法在标准语义分割基准测试中取得了更好的性能。
Diffusion distillation has dramatically accelerated class-conditional image synthesis, but its applicability to open-ended text-to-image (T2I) generation is still unclear. We present the first systematic study that adapts and compares state-of-the-art distillation techniques on a strong T2I teacher model, FLUX.1-lite. By casting existing methods into a unified framework, we identify the key obstacles that arise when moving from discrete class labels to free-form language prompts. Beyond a thorough methodological analysis, we offer practical guidelines on input scaling, network architecture, and hyperparameters, accompanied by an open-source implementation and pretrained student models. Our findings establish a solid foundation for deploying fast, high-fidelity, and resource-efficient diffusion generators in real-world T2I applications. Code is available on github.com/alibaba-damo-academy/T2I-Distill.
TLDR: This paper systematically studies and adapts diffusion distillation techniques for text-to-image generation, providing practical guidelines and insights for deploying efficient T2I models.
TLDR: 本文系统地研究并调整了用于文本到图像生成的扩散蒸馏技术,为部署高效的 T2I 模型提供了实践指导和见解。
Joint editing of audio and visual content is crucial for precise and controllable content creation. This new task poses challenges due to the limitations of paired audio-visual data before and after targeted edits, and the heterogeneity across modalities. To address the data and modeling challenges in joint audio-visual editing, we introduce SAVEBench, a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning. With SAVEBench, we train the Schrodinger Audio-Visual Editor (SAVE), an end-to-end flow-matching model that edits audio and video in parallel while keeping them aligned throughout processing. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content while preserving the remaining content, with stronger temporal synchronization and audiovisual semantic correspondence compared with pairwise combinations of an audio editor and a video editor.
TLDR: This paper introduces SAVEBench, a paired audiovisual dataset, and SAVE, an end-to-end flow-matching model for object-level audiovisual removal, demonstrating improved temporal synchronization and semantic correspondence compared to separate audio and video editors.
TLDR: 该论文介绍了SAVEBench,一个配对的视听数据集,以及SAVE,一个端到端的流匹配模型,用于对象级别的视听移除,与单独的音频和视频编辑器相比,它展示了更好的时间同步和语义对应性。
Visual tokenizers play a crucial role in diffusion models. The dimensionality of latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models (VFMs) to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction-alignment distillation. Our key insight is to make the forward flow in flow matching semantically rich, which serves as the training space of diffusion transformers, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories in flow matching. We further enhance the semantics by introducing a masked feature reconstruction loss. Our RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art results on the gFID-50K both with and without classifier-free guidance, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model are available at https://shi-qingyu.github.io/rectok.github.io.
TLDR: The paper introduces RecTok, a method for improving high-dimensional visual tokenizers in diffusion models by distilling semantic information into the forward flow trajectory and using a reconstruction-alignment distillation technique. It claims state-of-the-art results in image reconstruction and generation quality.
TLDR: 该论文介绍了RecTok,一种通过将语义信息提炼到前向流轨迹中并使用重构对齐蒸馏技术来改进扩散模型中高维视觉标记器的方法。 它声称在图像重建和生成质量方面达到了最先进的结果。
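The "forward flow" that the RecTok abstract refers to can be pictured with the linear interpolation path often used in flow matching. The sketch below assumes that common linear (rectified-flow-style) path between a data latent and a noise sample. The abstract does not state RecTok's exact parametrization or which endpoint corresponds to t = 0, so the function names and the time convention here are assumptions.

```python
from typing import List


def forward_flow_point(x_data: List[float], x_noise: List[float], t: float) -> List[float]:
    """Point on the linear path x_t = (1 - t) * x_data + t * x_noise, for t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(x_data) != len(x_noise):
        raise ValueError("x_data and x_noise must have the same length")
    return [(1.0 - t) * d + t * n for d, n in zip(x_data, x_noise)]


def velocity_target(x_data: List[float], x_noise: List[float]) -> List[float]:
    """Constant velocity of the linear path, d x_t / d t = x_noise - x_data."""
    return [n - d for d, n in zip(x_data, x_noise)]


latent = [1.0, 2.0]  # hypothetical tokenizer latent
noise = [3.0, 6.0]   # hypothetical noise sample
print(forward_flow_point(latent, noise, 0.5))  # [2.0, 4.0]
print(velocity_target(latent, noise))          # [2.0, 4.0]
```

A diffusion transformer trained with flow matching sees the intermediate points x_t rather than the clean latent alone, which is why the abstract argues for making the states along this path semantically rich, and not only the latent space itself.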
Generating 3D body movements from speech shows great potential in a wide range of downstream applications, but imitating realistic human movements remains challenging. Predominant research efforts focus on end-to-end generation schemes for co-speech gestures, spanning GANs, VQ-VAE, and recent diffusion models. Given that the problem is ill-posed, we argue that these prevailing learning schemes fail to model crucial inter- and intra-correlations across different motion units, i.e., head, body, and hands, thus leading to unnatural movements and poor coordination. To delve into these intrinsic correlations, we propose a unified Hierarchical Implicit Periodicity (HIP) learning approach for audio-inspired 3D gesture generation. Unlike prior work, our approach models this multi-modal implicit relationship through two explicit technical insights: i) To disentangle the complicated gesture movements, we first explore the gesture motion phase manifolds with periodic autoencoders to imitate natural human motion from realistic distributions, while incorporating non-periodic components from current latent states for instance-level diversity. ii) To model the hierarchical relationship of face motions, body gestures, and hand movements, we drive the animation with cascaded guidance during learning. We demonstrate our proposed approach on 3D avatars, and extensive experiments show that our method outperforms state-of-the-art co-speech gesture generation methods in both quantitative and qualitative evaluations. Code and models will be publicly available.
TLDR: This paper proposes a Hierarchical Implicit Periodicity (HIP) learning approach for generating realistic 3D co-speech gestures by modeling inter- and intra-correlations across different motion units using periodic autoencoders and cascaded guidance; experimental results show it outperforms state-of-the-art methods.
TLDR: 本文提出了一种分层隐式周期性(HIP)学习方法,用于生成逼真的3D口语姿势。该方法通过使用周期性自动编码器和级联指导,对不同运动单元之间的相互关系进行建模。实验结果表明,该方法优于最先进的口语姿势生成方法。