Daily papers related to Image/Video/Multimodal Generation from cs.CV
January 15, 2026
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10x-20x larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
TLDR: STEP3-VL-10B is a new, efficient 10B parameter open-source vision-language model that achieves state-of-the-art performance on several multimodal benchmarks, rivaling models 10-20x larger.
Recent diffusion-based video generation models can synthesize visually plausible videos, yet they often struggle to satisfy physical constraints. A key reason is that most existing approaches remain single-stage: they entangle high-level physical understanding with low-level visual synthesis, making it hard to generate content that requires explicit physical reasoning. To address this limitation, we propose a training-free three-stage pipeline, PhyRPR (PhyReason, PhyPlan, PhyRefine), which decouples physical understanding from visual synthesis. Specifically, PhyReason uses a large multimodal model for physical state reasoning and an image generator for keyframe synthesis; PhyPlan deterministically synthesizes a controllable coarse motion scaffold; and PhyRefine injects this scaffold into diffusion sampling via a latent fusion strategy to refine appearance while preserving the planned dynamics. This staged design enables explicit physical control during generation. Extensive experiments under physics constraints show that our method consistently improves physical plausibility and motion controllability.
TLDR: The paper introduces PhyRPR, a training-free, three-stage pipeline for physics-constrained video generation that decouples physical understanding from visual synthesis, leading to improved physical plausibility and motion control.
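The abstract does not say how PhyPlan builds its coarse motion scaffold. As a rough illustration of what "deterministically synthesizing" a trajectory could mean, the sketch below computes per-frame positions of a point mass under constant gravity. The function name `ballistic_scaffold`, its parameters, and the choice of ballistic motion are all assumptions made for this example, not details from the paper.

```python
def ballistic_scaffold(x0, y0, vx, vy, num_frames, fps=24.0, g=9.81):
    """Return per-frame (x, y) positions of a point mass under gravity.

    Hypothetical stand-in for a motion scaffold: the same inputs always
    give the same track, so the motion is fully controllable.
    """
    dt = 1.0 / fps
    track = []
    for k in range(num_frames):
        t = k * dt
        x = x0 + vx * t                    # constant horizontal velocity
        y = y0 + vy * t - 0.5 * g * t * t  # uniform vertical acceleration
        track.append((x, y))
    return track


# Example: 3 frames at 1 fps with g = 10 for easy arithmetic.
track = ballistic_scaffold(0.0, 0.0, 2.0, 0.0, num_frames=3, fps=1.0, g=10.0)
```

A later refinement stage could then treat such a track as the planned dynamics to preserve while appearance is refined.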
Substation meters play a critical role in monitoring and ensuring the stable operation of power grids, yet their detection of cracks and other physical defects is often hampered by a severe scarcity of annotated samples. To address this few-shot generation challenge, we propose a novel framework that integrates Knowledge Embedding and Hypernetwork-Guided Conditional Control into a Stable Diffusion pipeline, enabling realistic and controllable synthesis of defect images from limited data. First, we bridge the substantial domain gap between natural-image pre-trained models and industrial equipment by fine-tuning a Stable Diffusion backbone using DreamBooth-style knowledge embedding. This process encodes the unique structural and textural priors of substation meters, ensuring generated images retain authentic meter characteristics. Second, we introduce a geometric crack modeling module that parameterizes defect attributes (such as location, length, curvature, and branching pattern) to produce spatially constrained control maps. These maps provide precise, pixel-level guidance during generation. Third, we design a lightweight hypernetwork that dynamically modulates the denoising process of the diffusion model in response to the control maps and high-level defect descriptors, achieving a flexible balance between generation fidelity and controllability. Extensive experiments on a real-world substation meter dataset demonstrate that our method substantially outperforms existing augmentation and generation baselines. It reduces Fréchet Inception Distance (FID) by 32.7%, increases diversity metrics, and, most importantly, boosts the mAP of a downstream defect detector by 15.3% when trained on augmented data. The framework offers a practical, high-quality data synthesis solution for industrial inspection systems where defect samples are rare.
TLDR: The paper introduces a novel Stable Diffusion-based framework using knowledge embedding and hypernetworks for few-shot generation of realistic and controllable substation meter defect images, significantly improving downstream defect detection performance.
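To make the idea of a parameterized crack control map concrete, here is a minimal sketch that rasterizes a single crack centerline from a start location, length, initial heading, and curvature into a binary mask. The function `crack_control_map` and its parameterization are assumptions for illustration. Branching is omitted, and the paper's actual geometric module may differ substantially.

```python
import math


def crack_control_map(height, width, start, length, angle, curvature, step=0.5):
    """Rasterize one crack centerline into a binary height x width mask.

    start:     (row, col) of the crack origin
    length:    arc length of the crack in pixels
    angle:     initial heading in radians (0 points along +col)
    curvature: change in heading per pixel of arc length
    All parameter names are hypothetical, chosen for this example.
    """
    mask = [[0] * width for _ in range(height)]
    r, c = start
    heading = angle
    travelled = 0.0
    while travelled <= length:
        ri, ci = int(round(r)), int(round(c))
        if 0 <= ri < height and 0 <= ci < width:
            mask[ri][ci] = 1  # mark the pixel under the centerline
        r += step * math.sin(heading)
        c += step * math.cos(heading)
        heading += step * curvature  # bend the path as it advances
        travelled += step
    return mask


# Example: a straight horizontal crack across the middle row of a 5x5 map.
straight = crack_control_map(5, 5, start=(2, 0), length=4, angle=0.0, curvature=0.0)
```

A mask like this could serve as the pixel-level conditioning signal; how it is fed to the diffusion model is outside the scope of this sketch.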
Despite significant progress in autoregressive image generation, inference remains slow due to the sequential nature of AR models and the ambiguity of image tokens, even when using speculative decoding. Recent works attempt to address this with relaxed speculative decoding but lack theoretical grounding. In this paper, we establish the theoretical basis of relaxed SD and propose COOL-SD, an annealed relaxation of speculative decoding built on two key insights. The first analyzes the total variation (TV) distance between the target model and relaxed speculative decoding and yields an optimal resampling distribution that minimizes an upper bound of the distance. The second uses perturbation analysis to reveal an annealing behaviour in relaxed speculative decoding, motivating our annealed design. Together, these insights enable COOL-SD to generate images faster with comparable quality, or achieve better quality at similar latency. Experiments validate the effectiveness of COOL-SD, showing consistent improvements over prior methods in speed-quality trade-offs.
TLDR: The paper introduces COOL-SD, an annealed relaxation of speculative decoding for faster autoregressive image generation, with theoretical grounding and demonstrated improvements in speed-quality trade-offs compared to prior methods.
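For readers unfamiliar with the baseline, the sketch below shows the standard speculative decoding verification step for one draft token, plus a simple multiplicative relaxation of the acceptance test. With `relax=1.0` it is the exact rule: accept with probability min(1, p(x)/q(x)), otherwise resample from the normalized positive part of p - q. The `relax` knob is an assumption chosen for illustration. It is not COOL-SD's annealed scheme, and the residual used here is the standard one, not the optimal resampling distribution the paper derives.

```python
import random


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def accept_prob(p_x, q_x, relax=1.0):
    """Acceptance probability for a draft token (relax=1.0 is the exact rule)."""
    if q_x == 0.0:
        return 1.0  # the draft model would never propose this token
    return min(1.0, relax * p_x / q_x)


def residual(p, q):
    """Standard post-rejection distribution: normalize max(0, p - q)."""
    raw = [max(0.0, a - b) for a, b in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        return list(p)  # p equals q, so fall back to the target itself
    return [r / total for r in raw]


def verify_token(p, q, token, relax=1.0, rng=random):
    """Verify one draft token; return (final_token, accepted).

    p: target-model distribution, q: draft-model distribution.
    """
    if rng.random() < accept_prob(p[token], q[token], relax):
        return token, True
    new_token = rng.choices(range(len(p)), weights=residual(p, q), k=1)[0]
    return new_token, False
```

Raising `relax` above 1 accepts more draft tokens, which speeds up decoding but moves the output distribution away from the target. `tv_distance` is the kind of quantity used to measure that gap.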
Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.
TLDR: The paper introduces Motive, a framework for attributing the influence of video data on the motion generated by video generation models. By identifying high-influence clips, Motive enables data curation that improves temporal consistency and physical plausibility in generated videos.
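The abstract mentions motion-weighted loss masks without specifying how the motion signal is obtained. The sketch below is one possible reading: weights come from absolute frame differences, so static pixels contribute nothing to the loss. Frame differencing, the function names, and the uniform fallback are assumptions made for this example. The paper may well use optical flow or another motion estimate.

```python
def motion_weights(prev_frame, next_frame, eps=1e-8):
    """Per-pixel weights proportional to absolute frame difference.

    Frames are 2D lists of grayscale values. Weights sum to 1.
    """
    diffs = [
        [abs(b - a) for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]
    total = sum(sum(row) for row in diffs)
    count = sum(len(row) for row in diffs)
    if total <= eps:
        # No motion detected: fall back to uniform weights.
        return [[1.0 / count] * len(row) for row in diffs]
    return [[d / total for d in row] for row in diffs]


def motion_weighted_loss(per_pixel_loss, weights):
    """Weighted sum of a per-pixel loss map using the motion weights."""
    return sum(
        loss * w
        for loss_row, w_row in zip(per_pixel_loss, weights)
        for loss, w in zip(loss_row, w_row)
    )


# Example: only the right column moves, so only its loss counts.
weights = motion_weights([[0, 0], [0, 0]], [[0, 1], [0, 3]])
loss = motion_weighted_loss([[10.0, 4.0], [10.0, 8.0]], weights)
```

In a gradient-based attribution setting, gradients of such a masked loss would reflect motion regions rather than static appearance, which matches the separation the abstract describes.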