Daily papers related to Image/Video/Multimodal Generation from cs.CV
April 10, 2026
Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of this physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.
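To make the joint-prediction idea concrete, below is a minimal sketch of a model that conditions on observed frames plus an inferred latent physics state and emits both next-frame features and an updated physics state. Every name and the recurrent backbone are illustrative assumptions; Phantom itself builds on a full video generator, not this toy.

```python
import torch
import torch.nn as nn

class JointPhysicsVideoSketch(nn.Module):
    """Hypothetical sketch: jointly predict future frame features and a
    latent physics state from observed frames and the current state."""

    def __init__(self, frame_dim=256, phys_dim=32, hidden_dim=512):
        super().__init__()
        self.frame_encoder = nn.Linear(frame_dim, hidden_dim)
        self.phys_encoder = nn.Linear(phys_dim, hidden_dim)
        self.backbone = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.frame_head = nn.Linear(hidden_dim, frame_dim)  # next-frame features
        self.phys_head = nn.Linear(hidden_dim, phys_dim)    # updated physics state

    def forward(self, frames, phys_state):
        # frames: (B, T, frame_dim); phys_state: (B, phys_dim)
        h = self.frame_encoder(frames) + self.phys_encoder(phys_state).unsqueeze(1)
        out, _ = self.backbone(h)
        last = out[:, -1]
        return self.frame_head(last), self.phys_head(last)
```

Training such a sketch would sum a frame-reconstruction loss and a physics-prediction loss, which is the joint modeling the abstract refers to.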
TLDR: The paper introduces Phantom, a physics-infused video generation model that jointly models visual content and latent physical dynamics to generate physically plausible and visually realistic videos, outperforming existing methods on physics-aware benchmarks.
Read Paper (PDF)

Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.
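For the continuous half of the unified flow method, the standard conditional flow-matching objective looks like the sketch below; the `model(xt, t, cond)` signature is an assumption, and the paper pairs this with a discrete flow-matching analogue for text inside the same process.

```python
import torch

def continuous_flow_matching_loss(model, x1, cond):
    """Standard rectified-flow-style objective for the continuous (video) branch:
    regress the constant velocity field along a straight noise-to-data path."""
    x0 = torch.randn_like(x1)                                # noise endpoint
    t = torch.rand(x1.size(0), *[1] * (x1.dim() - 1), device=x1.device)
    xt = (1 - t) * x0 + t * x1                               # interpolated point
    v_target = x1 - x0                                       # target velocity
    v_pred = model(xt, t.flatten(), cond)
    return torch.nn.functional.mse_loss(v_pred, v_target)
```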
TLDR: The paper introduces Uni-ViGU, a framework unifying video generation and understanding by extending a video generator and using flow matching and a MoE-based Transformer augmentation. It achieves competitive performance on both generation and understanding tasks.
Read Paper (PDF)

Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
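The "guide" step amounts to reweighting cross-attention toward the refined countable layout. A minimal sketch of one such modulation is below; the tensor layout and the boost/suppress rule are assumptions, and the paper modulates only selected discriminative heads rather than all of them.

```python
import torch

def modulate_cross_attention(attn_probs, layout_mask, token_idx, strength=0.5):
    """Boost an object token's attention inside its layout region, suppress it
    elsewhere, then renormalize.
    attn_probs: (B, heads, num_pixels, num_tokens) softmaxed attention.
    layout_mask: (num_pixels,) binary mask of the object's region."""
    mask = layout_mask.view(1, 1, -1)
    probs = attn_probs.clone()
    probs[..., token_idx] = (probs[..., token_idx]
                             * (1 + strength * mask)          # inside the region
                             * (1 - strength * (1 - mask)))   # outside the region
    return probs / probs.sum(dim=-1, keepdim=True)
```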
TLDR: The paper introduces NUMINA, a training-free framework for improving numerical accuracy in text-to-video diffusion models by identifying and refining latent layouts. It demonstrates improved counting accuracy and CLIP alignment while maintaining temporal consistency.
Read Paper (PDF)

Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.
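The multi-granular framework ultimately reduces each prompt to per-dimension scores from two judge families. A trivial sketch of how such scores could be blended (the blend weight and per-dimension dictionaries are illustrative, not from the benchmark):

```python
def aggregate_scores(specialist_scores, mllm_scores, alpha=0.5):
    """Blend lightweight specialist metrics (perceptual quality) with MLLM
    judgments (fine-grained semantic controllability), per dimension."""
    dims = specialist_scores.keys() | mllm_scores.keys()
    return {d: alpha * specialist_scores.get(d, 0.0)
               + (1 - alpha) * mllm_scores.get(d, 0.0)
            for d in dims}
```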
TLDR: The paper introduces AVGen-Bench, a new benchmark for Text-to-Audio-Video generation, highlighting weaknesses in current models' semantic reliability despite strong aesthetics and offering a multi-granular evaluation framework.
Read Paper (PDF)

We introduce RewardFlow, an inversion-free framework that steers pretrained diffusion and flow-matching models at inference time through multi-reward Langevin dynamics. RewardFlow unifies complementary differentiable rewards for semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference, and further introduces a differentiable VQA-based reward that provides fine-grained semantic supervision through language-vision reasoning. To coordinate these heterogeneous objectives, we design a prompt-aware adaptive policy that extracts semantic primitives from the instruction, infers edit intent, and dynamically modulates reward weights and step sizes throughout sampling. Across several image editing and compositional generation benchmarks, RewardFlow delivers state-of-the-art edit fidelity and compositional alignment.
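At its core the sampler takes ordinary Langevin steps on a weighted sum of reward gradients. A minimal sketch, assuming each reward is a differentiable callable on the current latent; in RewardFlow the adaptive policy would set `weights` and `step_size` at every step.

```python
import torch

def langevin_reward_step(x, rewards, weights, step_size=0.05):
    """One Langevin update that ascends a weighted sum of scalar rewards."""
    x = x.detach().requires_grad_(True)
    total = sum(w * r(x) for w, r in zip(weights, rewards))
    grad = torch.autograd.grad(total, x)[0]
    noise = torch.randn_like(x)
    return (x + step_size * grad + (2 * step_size) ** 0.5 * noise).detach()
```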
TLDR: RewardFlow guides pretrained diffusion and flow-matching models using multi-reward Langevin dynamics, unifying various differentiable rewards with a prompt-aware adaptive policy for improved image editing and compositional generation.
Read Paper (PDF)

High-quality training triplets (source-target image pairs with precise editing instructions) are a critical bottleneck for scaling instruction-guided image editing models. Vision-language models (VLMs) are widely used for automated instruction synthesis, but we identify three systematic failure modes in image-pair settings: orientation inconsistency (e.g., left/right confusion), viewpoint ambiguity, and insufficient fine-grained attribute description. Human evaluation shows that over 47% of instructions from strong baseline VLMs contain critical errors unusable for downstream training. We propose EditCaption, a scalable two-stage post-training pipeline for VLM-based instruction synthesis. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) for alignment beyond SFT alone. On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows critical errors falling from 47.75% to 23% and correctness rising from 41.75% to 66%. The work offers a practical path to scalable, human-aligned instruction synthesis for image editing data.
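Stage 2 is standard direct preference optimization over the collected pairs; for reference, the canonical DPO loss on sequence log-probabilities is below (nothing here is specific to EditCaption beyond the data it is applied to).

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Canonical DPO objective: prefer the chosen instruction over the rejected
    one, measured as log-prob ratios against a frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```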
TLDR: The paper introduces EditCaption, a two-stage pipeline using supervised fine-tuning and direct preference optimization to improve the quality of VLM-generated image editing instructions by addressing common errors like orientation inconsistency and insufficient attribute description.
Read Paper (PDF)

Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling towards higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar token latents in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach selectively exchanges and recomposes token latents, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that the proposed Self-Swap Guidance (SSG), when applied to popular diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.
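A minimal sketch of the update: build a perturbed prediction by swapping one pair of maximally dissimilar spatial token latents, then extrapolate from the perturbed toward the clean prediction exactly as CFG extrapolates from unconditional toward conditional. The `model(x, t)` signature, the single-pair swap, and the batch handling are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def self_swap_guidance(model, x, t, guidance_scale=2.0):
    """CFG-like guidance without a text condition: the perturbed prediction
    plays the role of the unconditional branch."""
    pred = model(x, t)                                   # clean prediction
    tokens = x.flatten(2).transpose(1, 2)                # (B, N, C) token latents
    sim = F.cosine_similarity(tokens.unsqueeze(2), tokens.unsqueeze(1), dim=-1)
    i, j = divmod(int(sim[0].argmin()), sim.size(-1))    # most dissimilar pair
    swapped = tokens.clone()
    swapped[:, [i, j]] = swapped[:, [j, i]]              # exchange the pair
    x_swap = swapped.transpose(1, 2).reshape_as(x)
    pred_pert = model(x_swap, t)
    return pred_pert + guidance_scale * (pred - pred_pert)
```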
TLDR: This paper introduces Self-Swap Guidance (SSG), a novel method for guiding diffusion models (both conditional and unconditional) by swapping semantically dissimilar token latents, enhancing image fidelity and prompt alignment, and improving robustness compared to existing condition-free methods.
Read Paper (PDF)

Talking-head generation has advanced rapidly with diffusion-based generative models, but training usually depends on centralized face-video and speech datasets, raising major privacy concerns. The problem is more acute for personalized talking-head generation, where identity-specific data are highly sensitive and often cannot be pooled across users or devices. PrivFedTalk is presented as a privacy-aware federated framework for personalized talking-head generation that combines conditional latent diffusion with parameter-efficient identity adaptation. A shared diffusion backbone is trained across clients, while each client learns lightweight LoRA identity adapters from local private audio-visual data, avoiding raw data sharing and reducing communication cost. To address heterogeneous client distributions, Identity-Stable Federated Aggregation (ISFA) weights client updates using privacy-safe scalar reliability signals computed from on-device identity consistency and temporal stability estimates. Temporal-Denoising Consistency (TDC) regularization is introduced to reduce inter-frame drift, flicker, and identity drift during federated denoising. To limit update-side privacy risk, secure aggregation and client-level differential privacy are applied to adapter updates. The implementation supports both low-memory GPU execution and multi-GPU client-parallel training on heterogeneous shared hardware. Comparative experiments in this setup, spanning multiple training and aggregation conditions with PrivFedTalk, FedAvg, and FedProx, show stable federated optimization and successful end-to-end training and evaluation under constrained resources. The results support the feasibility of privacy-aware personalized talking-head training in federated environments, while suggesting that stronger component-wise, privacy-utility, and qualitative claims need further standardized evaluation.
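ISFA itself reduces to a reliability-weighted average over adapter updates. A minimal sketch, assuming each client ships a dict of LoRA tensors plus one privacy-safe scalar; secure aggregation and the DP noise on updates are omitted here.

```python
import torch

def isfa_aggregate(adapter_updates, reliability):
    """Weight client LoRA updates by scalar reliability signals (identity
    consistency x temporal stability, estimated on-device)."""
    w = torch.tensor(reliability, dtype=torch.float32)
    w = w / w.sum()                                      # convex combination
    aggregated = {}
    for name in adapter_updates[0]:
        stacked = torch.stack([u[name] for u in adapter_updates])
        aggregated[name] = (w.view(-1, *[1] * (stacked.dim() - 1)) * stacked).sum(0)
    return aggregated
```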
TLDR: PrivFedTalk introduces a privacy-aware federated learning framework for personalized talking-head generation using diffusion models, identity-stable adapters, and secure aggregation techniques. It addresses privacy concerns in training talking-head models on decentralized data.
Read Paper (PDF)

Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.
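A sketch of what a lightweight conditioning module can look like: rendered control maps are projected and added to the latent stream of the video diffusion backbone, with a zero-initialized projection so the pretrained model is untouched at the start of training. This is a generic ControlNet-style pattern, not LiVER's published implementation.

```python
import torch
import torch.nn as nn

class ControlConditioning(nn.Module):
    """Inject rendered control signals (layout/lighting/camera maps) into
    video latents via a zero-initialized 3D convolution."""

    def __init__(self, control_channels, latent_channels):
        super().__init__()
        self.proj = nn.Conv3d(control_channels, latent_channels,
                              kernel_size=3, padding=1)
        nn.init.zeros_(self.proj.weight)   # no effect at initialization
        nn.init.zeros_(self.proj.bias)

    def forward(self, latents, control_maps):
        # latents, control_maps: (B, C, T, H, W)
        return latents + self.proj(control_maps)
```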
TLDR: The paper introduces LiVER, a diffusion-based framework for controllable video generation by disentangling and controlling 3D scene properties like layout, lighting, and camera parameters using a renderer-based agent and a novel large-scale dataset.
Read Paper (PDF)

Model editing aims to update knowledge to add new concepts and change relevant information without retraining. Lifelong editing is a challenging task, prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross-modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging techniques address the catastrophic forgetting seen in full fine-tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with other, unrelated concepts. We hypothesize that this instability persists because current methods algorithmically control edits via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA), which by design mitigates this limitation by decomposing the representation space into a set of orthogonal semantic subspaces and proposing edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision-language representations. This process structurally isolates concepts, enabling precise, non-interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi-term loss function for maintaining task fidelity, edit locality, and cross-modal alignment. With the base model frozen, our method achieves 98 percent single-edit success, remains above 95 percent after 1000 sequential edits, lowers hallucination by 3 to 5 percent, and achieves the best backward transfer (BWT) scores on continual instruction tuning benchmarks. Extensive experiments demonstrate DSCA's state-of-the-art stability and knowledge retention in continual lifelong editing across various datasets and benchmarks.
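Structural isolation means an edit only moves representations inside its target concept's subspace. A minimal sketch of that projection, assuming an orthonormal PCA basis per concept cluster (such a basis can be obtained with, e.g., torch.pca_lowrank on the cluster's joint vision-language features):

```python
import torch

def project_edit_to_subspace(delta, basis):
    """Confine an edit vector to one concept's subspace.
    basis: (k, d) orthonormal rows spanning the concept subspace.
    delta: (d,) raw edit direction in the joint representation space."""
    coeffs = basis @ delta        # coordinates inside the subspace
    return basis.T @ coeffs       # component of delta lying in the subspace
```

By construction the returned edit has zero component along every other concept's orthogonal subspace, which is the non-interference property the paper makes architectural.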
TLDR: The paper introduces DSCA, a novel method for lifelong VLM editing that decomposes the representation space into orthogonal semantic subspaces, enabling precise and non-interfering knowledge updates, achieving state-of-the-art stability and knowledge retention.
Read Paper (PDF)

Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient framework that learns video editing capabilities entirely from image pairs. By freezing the pre-trained 3D attention modules and treating images as single-frame videos, we decouple the 2D spatial learning process to help preserve the original temporal dynamics. The core of our approach is a Predict-Update Spatial Difference Attention module that progressively extracts and injects spatial differences. Rather than relying on rigid external masks, we incorporate a Text-Guided Dynamic Semantic Gating mechanism for adaptive and implicit text-driven modifications. Despite training on only 13K image pairs for 5 epochs with exceptionally low computational overhead, ImVideoEdit achieves editing fidelity and temporal consistency comparable to larger models trained on extensive video datasets.
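The decoupling is mechanically simple: freeze the temporal pathway and feed images as one-frame videos so gradients only reach spatial layers. A sketch under an assumed module-naming convention:

```python
import torch

def prepare_image_pair_training(model, image):
    """Freeze temporal (3D) attention and lift an image to a 1-frame video.
    The substring match below is a hypothetical naming convention."""
    for name, param in model.named_parameters():
        if "temporal" in name or "attn_3d" in name:
            param.requires_grad_(False)
    return image.unsqueeze(2)     # (B, C, H, W) -> (B, C, T=1, H, W)
```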
TLDR: ImVideoEdit proposes a framework for video editing that learns from image pairs by freezing pre-trained 3D attention and using a spatial difference attention module for text-guided modifications, achieving comparable results to video-trained models with significantly less data and computation.
Read Paper (PDF)

High-fidelity generative models are increasingly needed in privacy-sensitive scenarios, where access to data is severely restricted due to regulatory and copyright constraints. This scarcity hampers model development, ironically in the very settings where generative models are most needed to compensate for the lack of data. This creates a self-reinforcing challenge: limited data leads to poor generative models, which in turn fail to mitigate data scarcity. To break this cycle, we propose a reinforcement-guided synthetic data generation framework that adapts general-domain generative priors to privacy-sensitive identity recognition tasks. We first perform a cold-start adaptation to align a pretrained generator with the target domain, establishing semantic relevance and initial fidelity. Building on this foundation, we introduce a multi-objective reward that jointly optimizes semantic consistency, coverage diversity, and expression richness, guiding the generator to produce both realistic and task-effective samples. During downstream training, a dynamic sample selection mechanism further prioritizes high-utility synthetic samples, enabling adaptive data scaling and improved domain alignment. Extensive experiments on benchmark datasets demonstrate that our framework significantly improves both generation fidelity and classification accuracy, while also exhibiting strong generalization to novel categories in small-data regimes.
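The multi-objective reward is a scalarization of the three terms; a trivial sketch with illustrative weights (the abstract does not specify the weighting or the underlying reward models):

```python
def multi_objective_reward(sample, r_semantic, r_coverage, r_expression,
                           w=(1.0, 0.5, 0.5)):
    """Combine semantic consistency, coverage diversity, and expression
    richness into the single scalar used to guide the generator."""
    return (w[0] * r_semantic(sample)
            + w[1] * r_coverage(sample)
            + w[2] * r_expression(sample))
```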
TLDR: The paper proposes a reinforcement-guided synthetic data generation framework to address data scarcity in privacy-sensitive identity recognition tasks, improving generation fidelity and classification accuracy.
Read Paper (PDF)

Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformer architecture with zero parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this design allows shared experts to absorb fine-grained visual semantics from generative tasks to enrich textual representations. To optimize this, we propose a Progressive Training Strategy featuring differential learning rates and early-stage gradient shielding. This mechanism not only shields pre-trained knowledge from early volatility but also eventually transforms generative signals into constructive feedback for understanding. Extensive experiments demonstrate that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, boosting inherent understanding with remarkable gains on MMLU and OCRBench.
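Modality-Aware Expert Disentanglement can be sketched as masked top-k routing: a token may only be routed within its task-specific group plus the shared experts. The bookkeeping below is illustrative, not the paper's implementation.

```python
import torch

def modality_aware_route(router_logits, token_modality,
                         und_experts, gen_experts, shared_experts, top_k=2):
    """Mask router logits so understanding tokens (modality 0) and generation
    tokens (modality 1) only see their group plus the shared experts."""
    mask = torch.full_like(router_logits, float("-inf"))
    for t in range(router_logits.size(0)):
        group = und_experts if token_modality[t] == 0 else gen_experts
        mask[t, torch.tensor(group + shared_experts)] = 0.0
    probs = torch.softmax(router_logits + mask, dim=-1)  # disallowed experts -> 0
    weights, experts = torch.topk(probs, top_k, dim=-1)
    return weights, experts
```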
TLDR: The paper introduces Symbiotic-MoE, a multimodal Mixture-of-Experts transformer architecture that mitigates catastrophic forgetting in LMMs during joint generation and understanding tasks by modality-aware expert disentanglement and a progressive training strategy, resulting in improved performance on downstream tasks.
Read Paper (PDF)

The performance of visual anomaly inspection in industrial quality control is often constrained by the scarcity of real anomalous samples. Consequently, anomaly synthesis techniques have been developed to enlarge training sets and enhance downstream inspection. However, existing methods either suffer from poor integration caused by inpainting or fail to provide accurate masks. To address these limitations, we propose GroundingAnomaly, a novel few-shot anomaly image generation framework. Our framework introduces a Spatial Conditioning Module that leverages per-pixel semantic maps to enable precise spatial control over the synthesized anomalies. Furthermore, a Gated Self-Attention Module is designed to inject conditioning tokens into a frozen U-Net via gated attention layers. This carefully preserves pretrained priors while ensuring stable few-shot adaptation. Extensive evaluations on the MVTec AD and VisA datasets demonstrate that GroundingAnomaly generates high-quality anomalies and achieves state-of-the-art performance across multiple downstream tasks, including anomaly detection, segmentation, and instance-level detection.
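The injection mechanism can be sketched as a gated self-attention layer in the spirit of GLIGEN-style gating: conditioning tokens are appended to the visual sequence, and a zero-initialized gate guarantees the frozen U-Net's pretrained behavior is intact at the start of few-shot adaptation. Details here are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Attend jointly over visual and conditioning tokens; blend the result
    back through a learnable gate initialized closed (zero)."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, visual_tokens, cond_tokens):
        x = torch.cat([visual_tokens, cond_tokens], dim=1)
        out, _ = self.attn(x, x, x)
        out = out[:, : visual_tokens.size(1)]   # keep only visual positions
        return visual_tokens + torch.tanh(self.gate) * out
```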
TLDR: The paper introduces GroundingAnomaly, a few-shot anomaly image generation framework that uses spatial conditioning and gated self-attention to improve anomaly synthesis for visual inspection in industrial quality control, achieving state-of-the-art performance on downstream tasks.
Read Paper (PDF)