Daily papers related to Image/Video/Multimodal Generation from cs.CV
October 07, 2025
Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2- to 10-minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics--Meta Similarity, PresentArena, PresentQuiz, and IP Memory--to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement by a novel effective tree search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.
TLDR: The paper introduces Paper2Video, a benchmark dataset, and PaperTalker, a multi-agent framework, for automatically generating presentation videos from scientific papers, addressing the challenges of multi-modal information integration and efficient generation.
Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.
TLDR: The paper introduces VChain, a framework that uses multimodal models to inject visual reasoning into video generation by generating keyframes to guide a pre-trained video generator, enhancing the quality of generated videos.
Text-to-video (T2V) generation technology holds potential to transform multiple domains such as education, marketing, entertainment, and assistive technologies for individuals with visual or reading comprehension challenges, by creating coherent visual content from natural language prompts. From its inception, the field has advanced from adversarial models to diffusion-based models, yielding higher-fidelity, temporally consistent outputs. Yet challenges persist, such as alignment, long-range coherence, and computational efficiency. Addressing this evolving landscape, we present a comprehensive survey of text-to-video generative models, tracing their development from early GANs and VAEs to hybrid Diffusion-Transformer (DiT) architectures, detailing how these models work, what limitations of their predecessors they addressed, and why shifts toward new architectural paradigms were necessary to overcome challenges in quality, coherence, and control. We provide a systematic account of the datasets on which the surveyed text-to-video models were trained and evaluated, and, to support reproducibility and assess the accessibility of training such models, we detail their training configurations, including their hardware specifications, GPU counts, batch sizes, learning rates, optimizers, epochs, and other key hyperparameters. Further, we outline the evaluation metrics commonly used for such models and present their performance across standard benchmarks, while also discussing the limitations of these metrics and the emerging shift toward more holistic, perception-aligned evaluation strategies. Finally, drawing from our analysis, we outline the current open challenges and propose a few promising future directions, laying out a perspective for future researchers to explore and build upon in advancing T2V research and applications.
TLDR: This paper is a comprehensive survey of text-to-video generation models, covering their evolution, datasets, training configurations, evaluation metrics, and future directions.
Treating human motion and camera trajectory generation separately overlooks a core principle of cinematography: the tight interplay between actor performance and camera work in the screen space. In this paper, we are the first to cast this task as a text-conditioned joint generation, aiming to maintain consistent on-screen framing while producing two heterogeneous, yet intrinsically linked, modalities: human motion and camera trajectories. We propose a simple, model-agnostic framework that enforces multimodal coherence via an auxiliary modality: the on-screen framing induced by projecting human joints onto the camera. This on-screen framing provides a natural and effective bridge between modalities, promoting consistency and leading to a more precise joint distribution. We first design a joint autoencoder that learns a shared latent space, together with a lightweight linear transform from the human and camera latents to a framing latent. We then introduce auxiliary sampling, which exploits this linear transform to steer generation toward a coherent framing modality. To support this task, we also introduce the PulpMotion dataset, a human-motion and camera-trajectory dataset with rich captions and high-quality human motions. Extensive experiments across DiT- and MAR-based architectures show the generality and effectiveness of our method in generating on-frame coherent human-camera motions, while also achieving gains on textual alignment for both modalities. Our qualitative results yield more cinematographically meaningful framings, setting the new state of the art for this task. Code, models and data are available on our project page: https://www.lix.polytechnique.fr/vista/projects/2025_pulpmotion_courant/.
TLDR: This paper introduces a novel framework for joint generation of human motion and camera trajectories conditioned on text, emphasizing coherent on-screen framing using a shared latent space and a new dataset called PulpMotion.
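The "on-screen framing" described above is obtained by projecting human joints onto the camera. A minimal sketch of that projection, assuming a plain pinhole camera with a world-to-camera rotation and translation; the function name `project_joints`, the `focal` parameter, and the sample values are hypothetical and not taken from the paper:

```python
from typing import List, Optional, Sequence, Tuple

Point2D = Tuple[float, float]

def project_joints(
    joints_world: Sequence[Sequence[float]],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
    focal: float = 1.0,
) -> List[Optional[Point2D]]:
    """Project 3D joints to normalized image coordinates (assumed pinhole model).

    `rotation` is a 3x3 world-to-camera matrix and `translation` a 3-vector;
    the camera is assumed to look along +z. Joints behind the camera map to None.
    """
    framing: List[Optional[Point2D]] = []
    for p in joints_world:
        # Transform the joint from world space into camera space.
        cam = [
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ]
        if cam[2] <= 0.0:
            framing.append(None)  # not visible: behind the image plane
            continue
        # Perspective divide gives the on-screen position.
        framing.append((focal * cam[0] / cam[2], focal * cam[1] / cam[2]))
    return framing

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
joints = [(0.0, 1.6, 4.0), (0.5, 1.0, 4.0), (0.0, 0.0, -1.0)]
framing = project_joints(joints, identity, (0.0, 0.0, 0.0), focal=2.0)
```

Moving either the joints or the camera changes `framing`, which is why it can serve as a shared signal linking the two modalities.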
Imagine Mr. Bean stepping into Tom and Jerry--can we generate videos where characters interact naturally across different worlds? We study inter-character interaction in text-to-video generation, where the key challenge is to preserve each character's identity and behaviors while enabling coherent cross-context interaction. This is difficult because characters may never have coexisted and because mixing styles often causes style delusion, where realistic characters appear cartoonish or vice versa. We introduce a framework that tackles these issues with Cross-Character Embedding (CCE), which learns identity and behavioral logic across multimodal sources, and Cross-Character Augmentation (CCA), which enriches training with synthetic co-existence and mixed-style data. Together, these techniques allow natural interactions between characters that never previously coexisted without losing stylistic fidelity. Experiments on a curated benchmark of cartoons and live-action series with 10 characters show clear improvements in identity preservation, interaction quality, and robustness to style delusion, enabling new forms of generative storytelling. Additional results and videos are available on our project page: https://tingtingliao.github.io/mimix/.
TLDR: This paper introduces a framework for text-to-video generation that allows characters from different styles and contexts to interact naturally by preserving their identities and behaviors, using Cross-Character Embedding and Augmentation techniques.
While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle with producing or editing structured visuals like charts, diagrams, and mathematical figures, which demand composition planning, text rendering, and multimodal reasoning for factual fidelity. To address this, we present the first comprehensive, systematic investigation of this domain, encompassing data construction, model training, and an evaluation benchmark. First, we construct a large-scale dataset of 1.3 million high-quality structured image pairs derived from executable drawing programs and augmented with chain-of-thought reasoning annotations. Building on it, we train a unified model that integrates a VLM with FLUX.1 Kontext via a lightweight connector for enhanced multimodal understanding. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation, further boosted by an external reasoner at inference time. Finally, we introduce StructBench, a novel benchmark for generation and editing with over 1,700 challenging instances, and an accompanying evaluation metric, StructScore, which employs a multi-round Q&A protocol to assess fine-grained factual accuracy. Evaluations of 15 models reveal that even leading closed-source systems remain far from satisfactory. Our model attains strong editing performance, and inference-time reasoning yields consistent gains across diverse architectures. By releasing the dataset, model, and benchmark, we aim to advance unified multimodal foundations for structured visuals.
TLDR: This paper introduces a new dataset, model, and benchmark (StructBench) for generating and editing structured visuals with improved factual accuracy, addressing a weakness in current visual generation models.
Large-scale text-to-image diffusion models have become the backbone of modern image editing, yet text prompts alone do not offer adequate control over the editing process. Two properties are especially desirable: disentanglement, where changing one attribute does not unintentionally alter others, and continuous control, where the strength of an edit can be smoothly adjusted. We introduce a method for disentangled and continuous editing through token-level manipulation of text embeddings. The edits are applied by manipulating the embeddings along carefully chosen directions, which control the strength of the target attribute. To identify such directions, we employ a Sparse Autoencoder (SAE), whose sparse latent space exposes semantically isolated dimensions. Our method operates directly on text embeddings without modifying the diffusion process, making it model agnostic and broadly applicable to various image synthesis backbones. Experiments show that it enables intuitive and efficient manipulations with continuous control across diverse attributes and domains.
TLDR: This paper introduces SAEdit, a method for disentangled and continuous image editing via token-level manipulation of text embeddings using a Sparse Autoencoder to control the strength of attributes in text-to-image diffusion models.
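The continuous-control idea above (shifting a token embedding along a chosen direction, with a scalar setting the edit strength) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the direction is taken as given (the paper identifies it with a Sparse Autoencoder, which is not modeled here), and `edit_embedding`, `strength`, and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence

def edit_embedding(
    embedding: Sequence[float],
    direction: Sequence[float],
    strength: float,
) -> List[float]:
    """Shift one token embedding along a unit-normalized direction.

    `direction` is assumed to come from elsewhere (for example, an SAE feature);
    `strength` is a hypothetical scalar that sets how strongly the edit applies.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [e + strength * d / norm for e, d in zip(embedding, direction)]

token = [0.2, -0.1, 0.5]          # toy embedding of the token being edited
direction = [0.0, 3.0, 4.0]       # toy attribute direction (norm 5)
sweep: Dict[float, List[float]] = {
    s: edit_embedding(token, direction, s) for s in (0.0, 1.0, 2.0)
}
```

Because only the chosen token's embedding changes and the diffusion process is untouched, sweeping `strength` gives a smooth family of edits, which matches the model-agnostic claim in the abstract.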
Tokenizers are a key component of state-of-the-art generative image models, extracting the most important features from the signal while reducing data dimension and redundancy. Most current tokenizers are based on KL-regularized variational autoencoders (KL-VAE), trained with reconstruction, perceptual and adversarial losses. Diffusion decoders have been proposed as a more principled alternative to model the distribution over images conditioned on the latent. However, matching the performance of KL-VAE still requires adversarial losses, as well as a higher decoding time due to iterative sampling. To address these limitations, we introduce a new pixel diffusion decoder architecture for improved scaling and training stability, benefiting from transformer components and GAN-free training. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction trained without adversarial losses, reaching higher reconstruction quality and faster sampling than KL-VAE. In particular, SSDD improves reconstruction FID from 0.87 to 0.50 with 1.4× higher throughput and preserves the generation quality of DiTs with 3.8× faster sampling. As such, SSDD can be used as a drop-in replacement for KL-VAE, and for building higher-quality and faster generative models.
TLDR: The paper introduces SSDD, a single-step diffusion decoder for image tokenization, outperforming KL-VAEs in reconstruction quality and sampling speed without adversarial losses. It can be used as a drop-in replacement for KL-VAE tokenizers.
In recent years, multi-concept personalization for text-to-image (T2I) diffusion models, which aims to represent several subjects in a single image, has gained increasing attention. The main challenge of this task is "concept mixing", where multiple learned concepts interfere or blend undesirably in the output image. To address this issue, we present ConceptSplit, a novel framework to split the individual concepts through training and inference. Our framework comprises two key components. First, we introduce Token-wise Value Adaptation (ToVA), a merging-free training method that focuses exclusively on adapting the value projection in cross-attention. Based on our empirical analysis, we found that modifying the key projection, a common approach in existing methods, can disrupt the attention mechanism and lead to concept mixing. Second, we propose Latent Optimization for Disentangled Attention (LODA), which alleviates attention entanglement during inference by optimizing the input latent. Through extensive qualitative and quantitative experiments, we demonstrate that ConceptSplit achieves robust multi-concept personalization, mitigating unintended concept interference. Code is available at https://github.com/KU-VGI/ConceptSplit
TLDR: The paper introduces ConceptSplit, a novel framework for multi-concept personalization in text-to-image diffusion models that aims to mitigate concept mixing by adapting value projection and optimizing the input latent.
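The abstract's argument for adapting only the value projection can be illustrated with a toy single-query cross-attention: attention weights depend on queries and keys alone, so changing one token's value vector alters the output without touching the attention map. The sketch below is an assumption-laden illustration, not ToVA itself; `adapt_values`, the additive `delta`, and all numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Single-query scaled dot-product attention over text tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # depends only on the query and the keys
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

def adapt_values(
    values: Sequence[Sequence[float]],
    token_index: int,
    delta: Sequence[float],
) -> List[List[float]]:
    """Hypothetical token-wise adaptation: shift one token's value vector only."""
    adapted = [list(v) for v in values]
    adapted[token_index] = [v + d for v, d in zip(adapted[token_index], delta)]
    return adapted

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights_before, out_before = cross_attention(query, keys, values)
weights_after, out_after = cross_attention(
    query, keys, adapt_values(values, token_index=2, delta=[0.0, 2.0])
)
```

Here `weights_before` and `weights_after` are identical while the outputs differ, which is the property the abstract contrasts with key-projection changes.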
Deep generative models have made significant advances in generating complex content, yet conditional generation remains a fundamental challenge. Existing conditional generative adversarial networks often struggle to balance the dual objectives of assessing authenticity and conditional alignment of input samples within their conditional discriminators. To address this, we propose a novel discriminator design that integrates three key capabilities: unconditional discrimination, matching-aware supervision to enhance alignment sensitivity, and adaptive weighting to dynamically balance all objectives. Specifically, we introduce Sum of Naturalness and Alignment (SONA), which employs separate projections for naturalness (authenticity) and alignment in the final layer with an inductive bias, supported by dedicated objective functions and an adaptive weighting mechanism. Extensive experiments on class-conditional generation tasks show that SONA achieves superior sample quality and conditional alignment compared to state-of-the-art methods. Furthermore, we demonstrate its effectiveness in text-to-image generation, confirming the versatility and robustness of our approach.
TLDR: This paper introduces SONA, a novel discriminator for conditional GANs that balances authenticity and conditional alignment through unconditional discrimination, matching-aware supervision, and adaptive weighting, demonstrating superior performance in class-conditional and text-to-image generation tasks.
Recent diffusion models achieve state-of-the-art performance in image generation, but often suffer from semantic inconsistencies or hallucinations. While various inference-time guidance methods can enhance generation, they often operate indirectly by relying on external signals or architectural modifications, which introduces additional computational overhead. In this paper, we propose Tangential Amplifying Guidance (TAG), a more efficient and direct guidance method that operates solely on trajectory signals without modifying the underlying diffusion model. TAG leverages an intermediate sample as a projection basis and amplifies the tangential components of the estimated scores with respect to this basis to correct the sampling trajectory. We formalize this guidance process by leveraging a first-order Taylor expansion, which demonstrates that amplifying the tangential component steers the state toward higher-probability regions, thereby reducing inconsistencies and enhancing sample quality. TAG is a plug-and-play, architecture-agnostic module that improves diffusion sampling fidelity with minimal computational addition, offering a new perspective on diffusion guidance.
TLDR: The paper introduces Tangential Amplifying Guidance (TAG), a novel, efficient, and architecture-agnostic guidance method for diffusion models that reduces semantic inconsistencies by amplifying tangential components of estimated scores during sampling.
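The decomposition described above (use an intermediate sample as a projection basis, then amplify the score component tangential to it) can be sketched in a few lines. This is a minimal illustration under assumptions: the gain `eta`, the function name `amplify_tangential`, and the 2-D toy vectors are hypothetical, and how the result feeds into a sampler is not modeled.

```python
from typing import List, Sequence

def amplify_tangential(
    score: Sequence[float],
    sample: Sequence[float],
    eta: float,
) -> List[float]:
    """Rescale the part of `score` orthogonal to `sample` by `eta`.

    The component of `score` parallel to `sample` is left unchanged;
    `eta` is a hypothetical amplification gain (eta = 1 changes nothing).
    """
    norm_sq = sum(x * x for x in sample)
    if norm_sq == 0.0:
        return list(score)  # no basis direction to project onto
    coeff = sum(s * x for s, x in zip(score, sample)) / norm_sq
    parallel = [coeff * x for x in sample]
    tangential = [s - p for s, p in zip(score, parallel)]
    return [p + eta * t for p, t in zip(parallel, tangential)]

sample = [1.0, 0.0]   # toy intermediate sample used as the projection basis
score = [0.5, 0.2]    # toy estimated score
guided = amplify_tangential(score, sample, eta=1.5)
unchanged = amplify_tangential(score, sample, eta=1.0)
```

In this toy case only the second coordinate (orthogonal to `sample`) is scaled, and the operation needs no extra network evaluation, consistent with the "minimal computational addition" claim.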
Diffusion models have achieved impressive results in generating high-quality images. Yet, they often struggle to faithfully align the generated images with the input prompts. This limitation arises from synchronous denoising, where all pixels simultaneously evolve from random noise to clear images. As a result, during generation, the prompt-related regions can only reference the unrelated regions at the same noise level, failing to obtain clear context and ultimately impairing text-to-image alignment. To address this issue, we propose asynchronous diffusion models -- a novel framework that allocates distinct timesteps to different pixels and reformulates the pixel-wise denoising process. By dynamically modulating the timestep schedules of individual pixels, prompt-related regions are denoised more gradually than unrelated regions, thereby allowing them to leverage clearer inter-pixel context. Consequently, these prompt-related regions achieve better alignment in the final images. Extensive experiments demonstrate that our asynchronous diffusion models can significantly improve text-to-image alignment across diverse prompts. The code repository for this work is available at https://github.com/hu-zijing/AsynDM.
TLDR: The paper introduces asynchronous diffusion models to improve text-to-image alignment by dynamically adjusting denoising timesteps for different pixels, allowing prompt-related regions to leverage clearer inter-pixel context.
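One way to picture pixel-wise timesteps is a per-pixel timestep map in which prompt-related pixels follow a slower schedule than the rest. The sketch below is purely illustrative: the paper modulates schedules dynamically, whereas this example uses a fixed, hypothetical rule (linear for prompt-related pixels, a power curve for the others), and `pixel_timesteps`, `max_t`, and `power` are assumed names.

```python
from typing import List, Sequence

def pixel_timesteps(
    mask: Sequence[Sequence[bool]],
    step: int,
    num_steps: int,
    max_t: float = 1000.0,
    power: float = 2.0,
) -> List[List[float]]:
    """Return a per-pixel timestep map for sampling step `step` of `num_steps`.

    mask[i][j] is True for prompt-related pixels. Both schedules start at
    `max_t` and end at 0, but unrelated pixels drop faster (hypothetical rule),
    so prompt-related pixels see cleaner surroundings while still noisy.
    """
    progress = step / num_steps
    slow = max_t * (1.0 - progress)            # prompt-related pixels
    fast = max_t * (1.0 - progress) ** power   # unrelated pixels
    return [[slow if related else fast for related in row] for row in mask]

mask = [[True, False], [False, False]]
start = pixel_timesteps(mask, step=0, num_steps=10)
middle = pixel_timesteps(mask, step=5, num_steps=10)
end = pixel_timesteps(mask, step=10, num_steps=10)
```

At the midpoint the prompt-related pixel sits at a higher timestep than its neighbors, so it is still being denoised while the surrounding context is already clearer.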
Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, yet its performance remains suboptimal against diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization ordering. In this work, we identify a core bottleneck from the perspective of generator-tokenizer inconsistency, i.e., the AR-generated tokens may not be well-decoded by the tokenizer. To address this, we propose reAR, a simple training strategy introducing a token-wise regularization objective: when predicting the next token, the causal transformer is also trained to recover the visual embedding of the current token and predict the embedding of the target token under a noisy context. It requires no changes to the tokenizer, generation order, inference pipeline, or external models. Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
TLDR: The paper introduces reAR, a generator-tokenizer consistency regularization method for visual autoregressive models, improving image generation performance to match state-of-the-art diffusion models with fewer parameters.
World models that support controllable and editable spatiotemporal environments are valuable for robotics, enabling scalable training data, reproducible evaluation, and flexible task design. While recent text-to-video models generate realistic dynamics, they are constrained to 2D views and offer limited interaction. We introduce MorphoSim, a language-guided framework that generates 4D scenes with multi-view consistency and object-level controls. From natural language instructions, MorphoSim produces dynamic environments where objects can be directed, recolored, or removed, and scenes can be observed from arbitrary viewpoints. The framework integrates trajectory-guided generation with feature field distillation, allowing edits to be applied interactively without full re-generation. Experiments show that MorphoSim maintains high scene fidelity while enabling controllability and editability. The code is available at https://github.com/eric-ai-lab/Morph4D.
TLDR: MorphoSim is a language-guided framework for generating controllable and editable 4D scenes with multi-view consistency, enabling object manipulation and scene observation from arbitrary viewpoints. It integrates trajectory-guided generation with feature field distillation for interactive edits.
This paper proposes "3Dify," a procedural 3D computer graphics (3D-CG) generation framework utilizing Large Language Models (LLMs). The framework enables users to generate 3D-CG content solely through natural language instructions. 3Dify is built upon Dify, an open-source platform for AI application development, and incorporates several state-of-the-art LLM-related technologies such as the Model Context Protocol (MCP) and Retrieval-Augmented Generation (RAG). For 3D-CG generation support, 3Dify automates the operation of various Digital Content Creation (DCC) tools via MCP. When DCC tools do not support MCP-based interaction, the framework employs the Computer-Using Agent (CUA) method to automate Graphical User Interface (GUI) operations. Moreover, to enhance image generation quality, 3Dify allows users to provide feedback by selecting preferred images from multiple candidates. The LLM then learns variable patterns from these selections and applies them to subsequent generations. Furthermore, 3Dify supports the integration of locally deployed LLMs, enabling users to utilize custom-developed models and to reduce both time and monetary costs associated with external API calls by leveraging their own computational resources.
TLDR: The paper introduces 3Dify, a framework leveraging LLMs, MCP, and RAG for procedural 3D content generation from natural language instructions, integrating DCC tools and user feedback for improved quality and cost reduction.
Recent advances in image generation and editing technologies have enabled state-of-the-art models to achieve impressive results in general domains. However, when applied to e-commerce scenarios, these general models often encounter consistency limitations. To address this challenge, we introduce TBStar-Edit, a new image editing model tailored for the e-commerce domain. Through rigorous data engineering, model architecture design and training strategy, TBStar-Edit achieves precise and high-fidelity image editing while maintaining the integrity of product appearance and layout. Specifically, for data engineering, we establish a comprehensive data construction pipeline, encompassing data collection, construction, filtering, and augmentation, to acquire high-quality, instruction-following, and strongly consistent editing data to support model training. For model architecture design, we design a hierarchical model framework consisting of a base model, pattern shifting modules, and consistency enhancement modules. For model training, we adopt a two-stage training strategy to enhance consistency preservation: the first stage for editing pattern shifting, and the second stage for consistency enhancement. Each stage involves training different modules with separate datasets. Finally, we conduct extensive evaluations of TBStar-Edit on a self-proposed e-commerce benchmark, and the results demonstrate that TBStar-Edit outperforms existing general-domain editing models in both objective metrics (VIE Score) and subjective user preference.
TLDR: The paper introduces TBStar-Edit, a novel image editing model specifically designed for e-commerce scenarios, addressing the consistency issues encountered by general models. It uses a tailored architecture, data engineering pipeline, and two-stage training strategy to achieve high-fidelity editing with preserved product integrity, outperforming existing models on an e-commerce benchmark.