Daily papers related to Image/Video/Multimodal Generation from cs.CV
December 03, 2025
Diffusion models have achieved remarkable success in image generation, yet their deployment remains constrained by the heavy computational cost and the need for numerous inference steps. Previous efforts on fewer-step distillation attempt to skip redundant steps by training compact student models, yet they often suffer from heavy retraining costs and degraded generalization. In this work, we take a different perspective: we accelerate smartly, not evenly, applying smaller speedups to early semantic stages and larger ones to later redundant phases. We instantiate this phase-aware strategy with two experts that specialize in slow and fast denoising phases. Surprisingly, instead of investing massive effort in retraining student models, we find that simply equipping the base model with lightweight LoRA adapters achieves both efficient acceleration and strong generalization. We refer to these two adapters as Slow-LoRA and Fast-LoRA. Through extensive experiments, our method achieves up to 5× acceleration over the base model while maintaining comparable visual quality across diverse benchmarks. Remarkably, the LoRA experts are trained with only 1 sample on a single V100 within one hour, yet the resulting models generalize strongly on unseen prompts.
TLDR: This paper introduces Glance, a method to accelerate diffusion models by using LoRA adapters specialized for slow and fast denoising phases, achieving significant speedups with minimal retraining and strong generalization using only 1 training sample.
TLDR: 这篇论文介绍了Glance,一种加速扩散模型的方法,它使用专门用于慢速和快速去噪阶段的LoRA适配器,仅用一个训练样本就实现了显著的加速,且只需少量再训练并具有很强的泛化能力。
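The phase-aware allocation can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the 40/60 phase split and the per-phase strides below are assumptions.

```python
# Toy sketch of phase-aware acceleration: dense steps in the early semantic
# phase, sparse steps in the late redundant phase, with a different LoRA
# adapter active per phase. Split point and strides are assumed, not Glance's.

def phase_aware_schedule(total_steps=50, split=0.4, slow_stride=2, fast_stride=8):
    """Small speedup early (semantic phase), large speedup late (redundant phase)."""
    cut = int(total_steps * split)
    early = list(range(0, cut, slow_stride))
    late = list(range(cut, total_steps, fast_stride))
    return early, late

def run_sampler(total_steps=50):
    early, late = phase_aware_schedule(total_steps)
    # Slow-LoRA handles the semantic phase, Fast-LoRA the redundant phase.
    trace = [("slow_lora", t) for t in early] + [("fast_lora", t) for t in late]
    return trace

trace = run_sampler()
print(f"{len(trace)} denoising steps instead of 50")
```

With these toy settings the sampler visits 14 of 50 steps, illustrating how uneven skipping concentrates compute where the abstract says it matters.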
Read Paper (PDF)
Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization efficiency. Moreover, we construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Extensive experiments show that our online RL post-training clearly outperforms SFT baselines across multiple aspects, including camera-control accuracy, geometric consistency, and visual quality, demonstrating its superiority in advancing camera-controlled video generation.
TLDR: This paper introduces an online reinforcement learning (RL) post-training framework with a verifiable geometry reward for camera-controlled video generation, demonstrating improved camera control accuracy, geometric consistency, and visual quality compared to supervised fine-tuning.
TLDR: 该论文介绍了一个在线强化学习(RL)后训练框架,该框架具有可验证的几何奖励,用于相机控制的视频生成。实验表明,与监督微调相比,该框架在相机控制精度、几何一致性和视觉质量方面都有所提高。
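A minimal sketch of the segment-level reward described above, under a simplifying assumption: poses are reduced to 3D positions, so a segment's "relative pose" becomes its displacement vector, whereas the paper compares full segment-wise relative camera poses.

```python
import numpy as np

def segment_alignment_reward(gen_traj, ref_traj, seg_len=4):
    """Dense segment-level reward: compare segment-wise relative motion.

    Simplification (assumption): trajectories are (n, 3) position arrays and
    relative pose is approximated by per-segment displacement.
    """
    assert gen_traj.shape == ref_traj.shape
    scores = []
    for s in range(0, len(gen_traj) - seg_len, seg_len):
        d_gen = gen_traj[s + seg_len] - gen_traj[s]   # generated segment motion
        d_ref = ref_traj[s + seg_len] - ref_traj[s]   # reference segment motion
        scores.append(float(np.exp(-np.linalg.norm(d_gen - d_ref))))
    return float(np.mean(scores)), scores  # scalar reward + dense per-segment

# Identical trajectories -> every segment aligns perfectly.
traj = np.cumsum(np.full((12, 3), 0.1), axis=0)
mean_r, per_seg = segment_alignment_reward(traj, traj)
print(mean_r, len(per_seg))
```

Returning the per-segment scores alongside the mean is what makes the feedback "dense": each short segment carries its own signal rather than one sparse score for the whole video.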
Read Paper (PDF)
Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.
TLDR: The paper introduces TUNA, a native unified multimodal model (UMM) that uses a cascaded VAE and representation encoder to create a unified visual representation space for end-to-end image and video understanding and generation, achieving state-of-the-art results.
TLDR: 该论文介绍了TUNA,一种原生统一多模态模型(UMM),它使用级联的VAE和表示编码器来创建统一的视觉表示空间,以实现端到端的图像和视频理解与生成,并取得了最先进的结果。
Read Paper (PDF)
The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for accurate physical outcome modeling under diverse conditions. However, existing video benchmarks are based on physical commonsense, offering limited insight into video models' scientific reasoning capability. We introduce VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. Each prompt encodes a composite scientific scenario that requires understanding and reasoning across multiple scientific concepts to generate the correct phenomenon. The benchmark comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. We conduct expert-annotated evaluations across seven state-of-the-art video models in T2V and I2V settings along five dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. Using a VLM-as-a-Judge to assess video generations, we observe strong correlation with human assessments. To the best of our knowledge, VideoScience-Bench is the first benchmark to evaluate video models not only as generators but also as reasoners, requiring their generations to demonstrate scientific understanding consistent with expected physical and chemical phenomena. Our data and evaluation code are available at: \href{https://github.com/hao-ai-lab/VideoScience}{github.com/hao-ai-lab/VideoScience}.
TLDR: The paper introduces VideoScience-Bench, a new benchmark for evaluating scientific understanding and reasoning in video generation models, focusing on physics and chemistry concepts.
TLDR: 本文介绍了VideoScience-Bench,这是一个新的基准测试,用于评估视频生成模型中的科学理解和推理能力,重点关注物理和化学概念。
Read Paper (PDF)
Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using an object-aware localization mechanism. Our approach utilizes a unique dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves end-to-end video editing without requiring external control signals during inference. Extensive experiments and human evaluation demonstrate the effectiveness and high-quality performance of LoVoRA.
TLDR: LoVoRA presents a mask-free framework for text-guided video object removal and addition utilizing a learned object-aware localization mechanism and a novel dataset construction pipeline, achieving temporally consistent edits without external control signals during inference.
TLDR: LoVoRA 提出了一个无掩码框架,用于文本引导的视频对象移除和添加,该框架利用学习到的对象感知定位机制和新的数据集构建流程,在推理过程中无需外部控制信号即可实现时间上一致的编辑。
Read Paper (PDF)
In this paper, we investigate the underexplored challenge of sample diversity in autoregressive (AR) generative models with bitwise visual tokenizers. We first analyze the factors that limit diversity in bitwise AR models and identify two key issues: (1) the binary classification nature of bitwise modeling, which restricts the prediction space, and (2) the overly sharp logits distribution, which causes sampling collapse and reduces diversity. Building on these insights, we propose DiverseAR, a principled and effective method that enhances image diversity without sacrificing visual quality. Specifically, we introduce an adaptive logits distribution scaling mechanism that dynamically adjusts the sharpness of the binary output distribution during sampling, resulting in smoother predictions and greater diversity. To mitigate potential fidelity loss caused by distribution smoothing, we further develop an energy-based generation path search algorithm that avoids sampling low-confidence tokens, thereby preserving high visual quality. Extensive experiments demonstrate that DiverseAR substantially improves sample diversity in bitwise autoregressive image generation.
TLDR: DiverseAR addresses the limited diversity in bitwise autoregressive image generation by adaptively scaling logits distributions and employing an energy-based generation path search to maintain visual quality.
TLDR: DiverseAR通过自适应地调整logits分布并采用基于能量的生成路径搜索来解决逐位自回归图像生成中有限的多样性问题,同时保持视觉质量。
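One way to sketch the adaptive-sharpness idea, with a caveat: the scaling below is a simple per-bit temperature grid search toward a target Bernoulli entropy, whereas the paper's mechanism adjusts sharpness dynamically during sampling.

```python
import numpy as np

def bernoulli_entropy(logit):
    """Entropy of a Bernoulli distribution given its logit."""
    p = 1.0 / (1.0 + np.exp(-logit))
    p = min(max(p, 1e-9), 1.0 - 1e-9)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def adaptive_binary_sample(logits, target_entropy=0.5,
                           taus=(0.5, 1.0, 2.0, 4.0), seed=0):
    """Smooth overly sharp binary logits before sampling each bit.

    Assumption: per bit, pick the temperature whose smoothed entropy is
    closest to a target; DiverseAR's learned/dynamic scaling may differ.
    """
    rng = np.random.default_rng(seed)
    bits = []
    for logit in logits:
        tau = min(taus, key=lambda t: abs(bernoulli_entropy(logit / t) - target_entropy))
        p = 1.0 / (1.0 + np.exp(-logit / tau))
        bits.append(int(rng.random() < p))
    return bits

bits = adaptive_binary_sample([8.0, -8.0, 0.1])
print(bits)
```

Sharp logits (±8) still favor their sign after smoothing, but no longer with near-certain probability, which is what restores diversity without flipping confident predictions outright.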
Read Paper (PDF)
We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing the foundational capabilities and generalization ability of MLLMs. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov also demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will soon be open-sourced to support the community's development of MLLMs.
TLDR: MindGPT-4ov introduces a post-training paradigm for MLLMs, achieving state-of-the-art results on multiple benchmarks through innovations in data generation, fine-tuning, and reinforcement learning, with a focus on efficient deployment and vertical domain performance.
TLDR: MindGPT-4ov 提出了一个多模态大语言模型(MLLM)的后训练范式,通过在数据生成、微调和强化学习方面的创新,在多个基准测试中取得了最先进的结果,重点关注高效部署和垂直领域性能。
Read Paper (PDF)
We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block. 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.
TLDR: LumiX is a structured diffusion framework for text-to-intrinsic generation, jointly generating coherent intrinsic maps like albedo, irradiance, and depth using Query-Broadcast Attention and Tensor LoRA for structural consistency and efficient training, achieving state-of-the-art results.
TLDR: LumiX是一个用于文本到内在属性生成的结构化扩散框架,它使用查询广播注意力机制和张量LoRA联合生成连贯的内在属性图(如反照率、辐照度和深度),以实现结构一致性和高效训练,并达到最先进的结果。
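The query-sharing idea can be illustrated with a toy attention block. Deriving the shared query from the mean of the per-map features is an assumption for this sketch; the paper's exact broadcast rule may differ.

```python
import numpy as np

def query_broadcast_attention(x_maps, wq, wk, wv):
    """Self-attention over several intrinsic maps with one shared query.

    Assumption: the shared Q comes from the mean of per-map features,
    while each map keeps its own keys and values.
    """
    q = np.mean(x_maps, axis=0) @ wq              # one Q broadcast to all maps
    outs = []
    for x in x_maps:                              # per-map K and V
        k, v = x @ wk, x @ wv
        att = q @ k.T / np.sqrt(q.shape[-1])
        att = np.exp(att - att.max(axis=-1, keepdims=True))
        att = att / att.sum(axis=-1, keepdims=True)
        outs.append(att @ v)
    return outs

rng = np.random.default_rng(0)
tokens, dim, n_maps = 4, 8, 3                     # toy sizes
maps = [rng.normal(size=(tokens, dim)) for _ in range(n_maps)]
wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))
outs = query_broadcast_attention(maps, wq, wk, wv)
print(len(outs), outs[0].shape)
```

Because every map attends through the same query, the attention patterns are anchored to one spatial structure, which is the mechanism the abstract credits for structural consistency across maps.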
Read Paper (PDF)
Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, like visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms: text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories with 622 high-quality annotated instances. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgements. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insight obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.
TLDR: The paper introduces RULER-Bench, a new benchmark to evaluate rule-based reasoning capabilities of video generation models, revealing significant shortcomings in state-of-the-art models.
TLDR: 该论文介绍了RULER-Bench,一个新的基准,用于评估视频生成模型的基于规则的推理能力,揭示了最先进模型中存在的显著不足。
Read Paper (PDF)
Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either utilize the distance between features extracted from the input audio and those associated with the motions in the graph or embed both the input audio and motion into a shared feature space. However, these techniques may not be optimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping. To alleviate this limitation, we propose a novel framework that initially employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate gestures from input audio sequences. Furthermore, our method extracts both low-level and high-level features from the input audio to enrich the training process of the diffusion model. Subsequently, a meticulously designed motion-based retrieval algorithm is applied to identify the most suitable path within the graph by assessing both global and local similarities in motion. Given that not all nodes in the retrieved path are sequentially continuous, the final step involves seamlessly stitching together these segments to produce a coherent video output. Experimental results substantiate the efficacy of our proposed method, demonstrating a significant improvement over prior approaches in terms of synchronization accuracy and naturalness of generated gestures.
TLDR: This paper presents a novel framework for generating co-speech gesture videos using a diffusion model and a motion-based graph retrieval algorithm, addressing the limitations of previous audio-gesture mapping techniques.
TLDR: 本文提出了一种新的框架,使用扩散模型和基于运动的图检索算法来生成语音同步的手势视频,解决了以前的音频-手势映射技术的局限性。
Read Paper (PDF)
Person re-identification (ReID) suffers from a lack of large-scale high-quality training data due to challenges in data privacy and annotation costs. While previous approaches have explored pedestrian generation for data augmentation, they often fail to ensure identity consistency and suffer from insufficient controllability, thereby limiting their effectiveness in dataset augmentation. To address this, we introduce OmniPerson, the first unified identity-preserving pedestrian generation pipeline for visible/infrared image/video ReID tasks. Our contributions are threefold: 1) We propose OmniPerson, a unified generation model offering holistic and fine-grained control over all key pedestrian attributes. It supports RGB/IR image and video generation conditioned on any number of reference images, two kinds of person poses, and text, and also provides RGB-to-IR transfer and image super-resolution. 2) We design a Multi-Refer Fuser for robust identity preservation with any number of reference images as input, enabling OmniPerson to distill a unified identity from a set of multi-view reference images and ensuring high-fidelity pedestrian generation. 3) We introduce PersonSyn, the first large-scale dataset for multi-reference, controllable pedestrian generation, together with an automated curation pipeline that transforms public, ID-only ReID benchmarks into a richly annotated resource with the dense, multi-modal supervision required for this task. Experimental results demonstrate that OmniPerson achieves SoTA in pedestrian generation, excelling in both visual fidelity and identity consistency. Furthermore, augmenting existing datasets with our generated data consistently improves the performance of ReID models. We will open-source the full codebase, pretrained model, and the PersonSyn dataset.
TLDR: The paper introduces OmniPerson, a novel unified pipeline for generating identity-preserving pedestrian images and videos across RGB/IR modalities, using multiple reference images and textual/pose control, and creates a new large-scale dataset, PersonSyn, for training and evaluation. The proposed model achieves state-of-the-art results in pedestrian generation and improves ReID model performance when used for data augmentation.
TLDR: 该论文介绍了OmniPerson,一种新颖的统一流程,用于生成身份保留的行人图像和视频,涵盖RGB/IR模态,使用多个参考图像和文本/姿势控制,并创建了一个新的大型数据集PersonSyn,用于训练和评估。该模型在行人生成方面取得了最先进的结果,并且在用于数据增强时提高了ReID模型的性能。
Read Paper (PDF)
Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
TLDR: The paper proposes a method, WeMMU, using noisy query tokens and a VAE branch to improve the bridging of VLMs and Diffusion Models, addressing task generalization collapse and enabling stable continual learning in multimodal generation.
TLDR: 该论文提出了一种名为WeMMU的方法,通过使用噪声查询令牌和一个VAE分支来改进VLM和Diffusion模型的桥接,解决了任务泛化崩溃的问题,并实现了多模态生成中的稳定持续学习。
Read Paper (PDF)
Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data's intrinsic structure is key to unlocking superior performance in complex geospatial analysis.
TLDR: GeoDiT, a diffusion-based vision-language model, addresses the limitations of autoregressive models in geospatial understanding by enabling parallel refinement for structured outputs, achieving state-of-the-art results in tasks like image captioning and object detection.
TLDR: GeoDiT是一个基于扩散的视觉语言模型,通过并行细化来实现对地理空间理解,克服了自回归模型在结构化输出方面的局限性,并在图像描述和目标检测等任务中取得了最先进的成果。
Read Paper (PDF)
While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support diverse, high-quality generation. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into latent noise. To enhance continuity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on audio embeddings. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos, and enables precise music-motion-camera synchronization. More videos are available in our project page: https://giantailab.github.io/YingVideo-MV/ .
TLDR: The paper introduces YingVideo-MV, a cascaded framework for music-driven video generation with camera motion control, using audio semantic analysis, interpretable shot planning, temporal-aware diffusion Transformers, and long-sequence consistency modeling.
TLDR: 该论文介绍了YingVideo-MV,一个用于音乐驱动的视频生成级联框架,可控制相机运动,它使用音频语义分析、可解释的镜头规划、时间感知扩散Transformer和长序列一致性建模。
Read Paper (PDF)
Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally- and spatially-consistent, long-term world modeling has been a long-standing problem, unresolved even by recent state-of-the-art models, due to the prohibitively expensive computational costs for long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval; trajectory packing realizes high context efficiency, and memory retrieval maintains the consistency in rollouts and helps long-term generations that require spatial reasoning. Our performance is evaluated with LoopNav, a benchmark on Minecraft, specialized for the evaluation of long-term consistency, and we verify that WorldPack notably outperforms strong state-of-the-art models.
TLDR: The paper introduces WorldPack, a video world model with compressed memory techniques (trajectory packing and memory retrieval) to improve spatial consistency and fidelity in long-term video generation, demonstrating superior performance on the LoopNav Minecraft benchmark.
TLDR: 该论文介绍了 WorldPack,一种具有压缩记忆技术的视频世界模型(轨迹打包和记忆检索),旨在提高长期视频生成中的空间一致性和保真度,并在 LoopNav Minecraft 基准测试中表现出卓越的性能。
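A toy version of the retrieval half of such a memory, reduced to nearest-neighbor lookup on camera positions. This reduction is an assumption for illustration; WorldPack's trajectory packing and retrieval are more involved.

```python
import numpy as np

def retrieve_memory(memory_poses, memory_frames, query_pose, k=3):
    """Pull the k stored frames whose camera poses are nearest the query.

    Assumption: memory entries are (position, frame) pairs and relevance
    is plain Euclidean distance between camera positions.
    """
    d = np.linalg.norm(memory_poses - query_pose, axis=1)
    idx = [int(i) for i in np.argsort(d)[:k]]
    return idx, [memory_frames[i] for i in idx]

# Four stored frames; the agent returns near the origin, so the two
# frames recorded close to (0, 0) are retrieved as context.
poses = np.array([[0.0, 0.0], [5.0, 0.0], [0.5, 0.2], [9.0, 9.0]])
frames = ["f0", "f1", "f2", "f3"]
idx, ctx = retrieve_memory(poses, frames, np.array([0.0, 0.0]), k=2)
print(idx, ctx)
```

The point of retrieval over raw long context: only the few frames relevant to the current viewpoint re-enter the model, keeping rollouts consistent at a fraction of the context length.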
Read Paper (PDF)
Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large and object contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\to$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.
TLDR: This paper investigates whether audio-video joint denoising improves video generation, finding that it does, particularly for videos with large motion and object contact, suggesting cross-modal co-training can enhance world models.
TLDR: 本文研究了音视频联合去噪是否能改善视频生成,发现确实能,尤其是在具有大幅运动和物体接触的视频中,表明跨模态共同训练可以增强世界模型。
Read Paper (PDF)
Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.
TLDR: The paper introduces FineGRAIN, a methodology and dataset for evaluating text-to-image models by testing vision language model (VLM) judges on their ability to identify specific failure modes in generated images, highlighting shortcomings in current metrics and model reliability.
TLDR: 该论文介绍了FineGRAIN,一种用于评估文本到图像模型的方法和数据集。该方法通过测试视觉语言模型(VLM)判断器识别生成图像中特定失败模式的能力,揭示了现有指标和模型可靠性的不足。
Read Paper (PDF)
We introduce Material Coating, a novel image editing task that simulates applying a thin material layer onto an object while preserving its underlying coarse and fine geometry. Material coating is fundamentally different from existing "material transfer" methods, which are designed to replace an object's intrinsic material, often overwriting fine details. To address this new task, we construct a large-scale synthetic dataset (110K images) of 3D objects with varied, physically-based coatings, named DataCoat110K. We then propose CoatFusion, a novel architecture that enables this task by conditioning a diffusion model on both a 2D albedo texture and granular, PBR-style parametric controls, including roughness, metalness, transmission, and a key thickness parameter. Experiments and user studies show CoatFusion produces realistic, controllable coatings and significantly outperforms existing material editing and transfer methods on this new task.
TLDR: The paper introduces a new image editing task, Material Coating, and a novel diffusion-based architecture, CoatFusion, trained on a large-scale synthetic dataset, DataCoat110K, for controllable material coating while preserving underlying geometry.
TLDR: 该论文介绍了一种新的图像编辑任务“材料涂层”,以及一种新的基于扩散的架构CoatFusion,该架构在大规模合成数据集DataCoat110K上训练,用于在保留底层几何形状的同时实现可控的材料涂层。
Read Paper (PDF)
Video generators are increasingly evaluated as potential world models, which requires them to encode and understand physical laws. We investigate their representation of a fundamental law: gravity. Out-of-the-box video generators consistently generate objects falling at an effectively slower acceleration. However, these physical tests are often confounded by ambiguous metric scale. We first investigate if observed physical errors are artifacts of these ambiguities (e.g., incorrect frame rate assumptions). We find that even temporal rescaling cannot correct the high-variance gravity artifacts. To rigorously isolate the underlying physical representation from these confounds, we introduce a unit-free, two-object protocol that tests the timing ratio $t_1^2/t_2^2 = h_1/h_2$, a relationship independent of $g$, focal length, and scale. This relative test reveals violations of Galileo's equivalence principle. We then demonstrate that this physical gap can be partially mitigated with targeted specialization. A lightweight low-rank adaptor fine-tuned on only 100 single-ball clips raises $g_{\mathrm{eff}}$ from $1.81\,\mathrm{m/s^2}$ to $6.43\,\mathrm{m/s^2}$ (reaching $65\%$ of terrestrial gravity). This specialist adaptor also generalizes zero-shot to two-ball drops and inclined planes, offering initial evidence that specific physical laws can be corrected with minimal data.
TLDR: The paper investigates the physical accuracy of video generators, finding that generated objects fall slower than they should, violating the equivalence principle. They show that these violations can be partially mitigated with fine-tuning on limited data.
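The unit-free protocol rests on elementary kinematics: for free fall from rest, $h = \tfrac{1}{2} g t^2$, so the timing ratio $t_1^2/t_2^2 = h_1/h_2$ holds for any value of $g$. A small sketch verifying this (function names and heights are illustrative):

```python
import math

def fall_time(height_m: float, g: float) -> float:
    """Time to fall `height_m` from rest under constant acceleration g."""
    return math.sqrt(2.0 * height_m / g)

def timing_ratio(t1: float, t2: float) -> float:
    return (t1 ** 2) / (t2 ** 2)

# The ratio equals h1/h2 regardless of g, focal length, or metric scale,
# so a generator that violates it breaks the equivalence principle even
# if its effective gravity is unknown.
h1, h2 = 1.0, 4.0
for g in (9.81, 6.43, 1.81):  # terrestrial g and the paper's g_eff values
    t1, t2 = fall_time(h1, g), fall_time(h2, g)
    assert abs(timing_ratio(t1, t2) - h1 / h2) < 1e-9
print("timing ratio:", h1 / h2)  # 0.25 for every g
```

This is why the test isolates the model's physical representation: any consistent rescaling of time or space cancels out of the ratio.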
Read Paper (PDF)
Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation, but offer limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.
TLDR: This paper introduces a track-conditioned video-to-video framework using 3D point tracks to enable precise joint editing of camera and object motion in videos, including motion transfer and non-rigid deformations.
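The abstract's key point about 3D over 2D tracks is that explicit depth disambiguates occlusion order. A minimal sketch of that idea, assuming a pinhole camera model and illustrative function names (not the paper's implementation):

```python
# When two tracked 3D points project to (nearly) the same pixel, the point
# with smaller camera-space depth z occludes the other. A 2D track alone
# cannot make this call; the z coordinate of a 3D track can.

def project(point_3d, focal=1.0):
    """Pinhole projection of a camera-space point (x, y, z) to 2D."""
    x, y, z = point_3d
    return (focal * x / z, focal * y / z)

def front_point(p_a, p_b, pixel_tol=0.05):
    """If the two projections overlap, return the occluding (nearer) point."""
    ua, va = project(p_a)
    ub, vb = project(p_b)
    if abs(ua - ub) < pixel_tol and abs(va - vb) < pixel_tol:
        return p_a if p_a[2] < p_b[2] else p_b  # smaller z is closer
    return None  # projections do not overlap; no occlusion to resolve

a = (0.5, 0.2, 2.0)   # nearer point
b = (1.0, 0.4, 4.0)   # farther point on the same viewing ray
print(front_point(a, b))  # (0.5, 0.2, 2.0)
```

In the actual model this reasoning is presumably learned implicitly from the depth cues in the conditioning tracks rather than applied as an explicit rule.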
Read Paper (PDF)
MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its "fastforward" nature introduces key challenges in both the training objective and the guidance mechanism. First, the original MF's training target depends not only on the underlying ground-truth fields but also on the network itself. To address this issue, we recast the objective as a loss on the instantaneous velocity $v$, re-parameterized by a network that predicts the average velocity $u$. Our reformulation yields a more standard regression problem and improves the training stability. Second, the original MF fixes the classifier-free guidance scale during training, which sacrifices flexibility. We tackle this issue by formulating guidance as explicit conditioning variables, thereby retaining flexibility at test time. The diverse conditions are processed through in-context conditioning, which reduces model size and benefits performance. Overall, our $\textbf{improved MeanFlow}$ ($\textbf{iMF}$) method, trained entirely from scratch, achieves $\textbf{1.72}$ FID with a single function evaluation (1-NFE) on ImageNet 256$\times$256. iMF substantially outperforms prior methods of this kind and closes the gap with multi-step methods while using no distillation. We hope our work will further advance fastforward generative modeling as a stand-alone paradigm.
TLDR: The paper introduces an improved MeanFlow (iMF) method for one-step image generation that addresses training instability and guidance inflexibility, achieving state-of-the-art FID on ImageNet 256x256 without distillation.
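For context, the relation between instantaneous and average velocity that MeanFlow builds on can be written out. These are the standard MeanFlow definitions; how iMF performs the re-parameterization is paraphrased from the abstract, not reproduced from the paper:

```latex
u(z_t, r, t) \;\triangleq\; \frac{1}{t-r}\int_r^t v(z_\tau, \tau)\,\mathrm{d}\tau,
\qquad v(z_t, t) = u(z_t, t, t),
```

and differentiating with respect to $t$ yields the MeanFlow identity

```latex
v(z_t, t) \;=\; u(z_t, r, t) + (t - r)\,\frac{\mathrm{d}}{\mathrm{d}t}\, u(z_t, r, t).
```

The right-hand side involves the network's own output and its derivative, which is why the original training target depends on the network itself; iMF instead places the loss on $v$, expressed through a network that predicts $u$, turning training into a more standard regression.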
Read Paper (PDF)
Physical AI aims to develop models that can perceive and predict real-world dynamics; yet, the extent to which current multi-modal large language models and video generative models support these abilities is insufficiently understood. We introduce Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark that evaluates perception and prediction capabilities across video generation, conditional video generation, and video understanding, comprising 2,808 real-world cases with task-aligned metrics designed to capture physical plausibility and domain-specific reasoning. Our study provides a systematic assessment of recent models and shows that video generative models, despite strong visual fidelity, often struggle to maintain physically coherent dynamics, while multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI. In summary, PAI-Bench establishes a realistic foundation for evaluating Physical AI and highlights key gaps that future systems must address.
TLDR: The paper introduces PAI-Bench, a new benchmark for evaluating Physical AI capabilities in video generation, conditional video generation, and video understanding, revealing limitations in current models' physical coherence and reasoning abilities.
Read Paper (PDF)
Modeling and synthesizing complex hand-object interactions remains a significant challenge, even for state-of-the-art physics engines. Conventional simulation-based approaches rely on explicitly defined rigid object models and pre-scripted hand gestures, making them inadequate for capturing dynamic interactions with non-rigid or articulated entities such as deformable fabrics, elastic materials, hinge-based structures, furry surfaces, or even living creatures. In this paper, we present SpriteHand, an autoregressive video generation framework for real-time synthesis of versatile hand-object interaction videos across a wide range of object types and motion patterns. SpriteHand takes as input a static object image and a video stream in which the hands are imagined to interact with the virtual object embedded in a real-world scene, and generates corresponding hand-object interaction effects in real time. Our model employs a causal inference architecture for autoregressive generation and leverages a hybrid post-training approach to enhance visual realism and temporal coherence. Our 1.3B model supports real-time streaming generation at around 18 FPS and 640x368 resolution with approximately 150 ms latency on a single NVIDIA RTX 5090 GPU, while sustaining more than a minute of continuous output. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared to both generative and engine-based baselines.
TLDR: SpriteHand is an autoregressive video generation framework for real-time synthesis of hand-object interaction videos, even with complex and non-rigid objects, achieving real-time performance and superior visual quality compared to baselines.
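The reported streaming figures imply a per-frame compute budget and an effective pipeline depth. A back-of-envelope check, using only the numbers stated in the abstract:

```python
# 18 FPS with ~150 ms end-to-end latency: how much time per frame,
# and how many frames are "in flight" at once? Pure arithmetic on
# the abstract's figures; the pipeline-depth reading is an inference.

fps = 18.0
latency_ms = 150.0

frame_budget_ms = 1000.0 / fps                    # time available per frame
frames_in_flight = latency_ms / frame_budget_ms   # implied pipeline depth

print(f"per-frame budget: {frame_budget_ms:.1f} ms")  # ~55.6 ms
print(f"frames in flight: {frames_in_flight:.1f}")    # ~2.7
```

So each frame must be produced in roughly 56 ms, and the ~150 ms latency suggests about three frames of buffering between input and output, consistent with an autoregressive, chunked streaming design.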
Read Paper (PDF)