Daily papers related to Image/Video/Multimodal Generation from cs.CV
January 07, 2026
While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles (Proposer, Solver, and Judge), UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text-to-Image-to-Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and UniCycle, while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.
TLDR: The paper introduces UniCorn, a self-improvement framework for Unified Multimodal Models (UMMs) that uses self-play and cognitive pattern reconstruction to improve text-to-image generation without external data, achieving SOTA results on several benchmarks.
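The Text-to-Image-to-Text loop behind UniCycle can be sketched as a small scoring routine. This is a hypothetical illustration, not the paper's implementation: `generate_image` and `caption_image` are identity-like stubs standing in for a UMM's generation and captioning heads, and a bag-of-words cosine replaces whatever text-similarity metric the benchmark actually uses.

```python
# Hypothetical sketch of a Text-to-Image-to-Text cycle-consistency score
# in the spirit of UniCycle. The generator and captioner are stubs; a
# real evaluation would call a UMM for both steps.

from collections import Counter
import math

def generate_image(prompt: str):
    # Stub: a real system would run the UMM's T2I head here.
    return {"prompt_used": prompt}  # pretend "image"

def caption_image(image) -> str:
    # Stub: a real system would run the UMM's captioner here.
    return image["prompt_used"]

def cosine_bow(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def cycle_consistency(prompt: str) -> float:
    """Score how much meaning survives the text -> image -> text loop."""
    image = generate_image(prompt)
    recovered = caption_image(image)
    return cosine_bow(prompt, recovered)

score = cycle_consistency("a red cube on a blue table")
```

With the identity stubs the recovered caption matches the prompt exactly, so the score is 1.0; a real model's score drops as semantics are lost in either direction of the loop.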
Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning-execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap relative to current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.
TLDR: The paper introduces Unified Thinker, a modular reasoning architecture for image generation that decouples the reasoning process from the generator, enabling improved logic and instruction following via a two-stage training approach involving structured planning and reinforcement learning.
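The Thinker/Generator decoupling can be pictured as a narrow planning interface between two modules. The sketch below is purely illustrative: `PlanStep`, `plan_prompt`, and `render_plan` are invented names, and the trivial split-on-"and" decomposition stands in for a trained Thinker; the abstract does not specify the actual planning schema.

```python
# Illustrative sketch of a decoupled Thinker/Generator interface.
# The Thinker emits a structured plan; any generator that understands
# the plan format can consume it, so either side can be upgraded alone.

from dataclasses import dataclass

@dataclass
class PlanStep:
    region: str       # where in the canvas the step applies (hypothetical)
    instruction: str  # grounded sub-instruction for the generator

def plan_prompt(prompt: str) -> list:
    """Thinker: decompose a high-level intent into verifiable steps."""
    # Stub decomposition: split on " and "; a real Thinker is a trained model.
    parts = [p.strip() for p in prompt.split(" and ")]
    return [PlanStep(region=f"region_{i}", instruction=p)
            for i, p in enumerate(parts)]

def render_plan(steps) -> str:
    """Generator stand-in: consume the plan; here we just serialize it."""
    return "; ".join(f"[{s.region}] {s.instruction}" for s in steps)

steps = plan_prompt("a cat on the left and a dog on the right")
plan_text = render_plan(steps)
```

Because the generator only sees `PlanStep` objects, a better Thinker can be swapped in without retraining the generative model, which is the modularity argument the abstract makes.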
We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images, and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), where multimodal inputs are encoded as interleaved conditioning tokens, and then used to guide the diffusion process. This design supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content, while avoiding modality-specific architectural components. To train such a unified system, we introduce a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator capable of both image and video input and output. Across diverse generation and editing benchmarks, VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits. Our results highlight a practical path toward scalable unified visual generation, and the promise of interleaved, in-context computation as a foundation for general-purpose visual creation.
TLDR: VINO is a unified visual generator for image and video generation/editing using a shared diffusion backbone conditioned on text, images, and videos. It utilizes a VLM and a Multimodal Diffusion Transformer (MMDiT) trained with a multi-stage pipeline, demonstrating strong performance across diverse benchmarks.
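Interleaved conditioning tokens can be sketched as concatenating per-modality token blocks into one sequence. This is a toy illustration under assumed shapes: the encoder, embedding width, and the modality-id channel are all stand-ins, not VINO's actual tokenization.

```python
# Sketch of interleaved multimodal conditioning: text, image, and video
# inputs are each encoded, tagged with a modality id, and concatenated
# into one token sequence that conditions the diffusion transformer.
# Encoder and widths are illustrative stand-ins.

import random

D = 8  # toy embedding width

def embed(n_tokens: int, modality_id: int):
    """Stand-in encoder: random token vectors plus a modality channel."""
    return [[random.random() for _ in range(D)] + [float(modality_id)]
            for _ in range(n_tokens)]

def interleave_conditions(parts):
    """Concatenate per-modality token blocks in input order."""
    return [tok for part in parts for tok in part]

cond = interleave_conditions([
    embed(4, modality_id=0),   # text tokens
    embed(16, modality_id=1),  # reference-image tokens
    embed(32, modality_id=2),  # video tokens
])
```

The appeal of this layout is that adding a new reference image or video clip only lengthens the conditioning sequence; no modality-specific architectural component is needed, matching the abstract's claim.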
Text-to-Image editing using diffusion models faces challenges in balancing content preservation with edit application and handling real-image editing. To address these, we propose LAMS-Edit, leveraging intermediate states from the inversion process (an essential step in real-image editing) during edited image generation. Specifically, latent representations and attention maps from both processes are combined at each step using weighted interpolation, controlled by a scheduler. This technique, Latent and Attention Mixing with Schedulers (LAMS), integrates with Prompt-to-Prompt (P2P) to form LAMS-Edit, an extensible framework that supports precise editing with region masks and enables style transfer via LoRA. Extensive experiments demonstrate that LAMS-Edit effectively balances content preservation and edit application.
TLDR: LAMS-Edit improves content preservation in diffusion-based image editing by mixing latent representations and attention maps during the image generation process, offering better control and style transfer capabilities.
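The scheduler-controlled interpolation at the heart of LAMS can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the cosine scheduler and the 1-D "latents" are assumptions, and the same mixing would also be applied to attention maps in the actual method.

```python
# Minimal sketch of the Latent and Attention Mixing (LAMS) idea: at each
# denoising step, the edited-path latent is interpolated toward the
# stored inversion latent with a scheduler-controlled weight. The cosine
# scheduler below is illustrative; the paper's schedulers may differ.

import math

def mix_weight(step: int, total: int) -> float:
    """Scheduler: strong source mixing early, fading to zero at the end."""
    return 0.5 * (1 + math.cos(math.pi * step / max(total - 1, 1)))

def lams_mix(edit_latent, inv_latent, step, total):
    """Weighted interpolation between inversion and edit latents."""
    w = mix_weight(step, total)
    return [w * a + (1 - w) * b for a, b in zip(inv_latent, edit_latent)]

# Toy 1-D "latents": fully source at step 0, fully edited at the last step.
src, edit = [1.0, 1.0], [0.0, 0.0]
first = lams_mix(edit, src, step=0, total=10)
last = lams_mix(edit, src, step=9, total=10)
```

Early steps stay close to the inversion trajectory (preserving content), while late steps follow the edit trajectory (applying the change), which is exactly the preservation/edit trade-off the scheduler is meant to control.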
Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight A-V-T alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate visual bias of MLLMs, a Junior-Senior Agent Handoff for a fivefold cost reduction, and rigorous Post-hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight A-V-T alignment. Second, we propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities. To resolve the inherent cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and off-screen audio generation faithfulness. Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation, including challenging off-screen tracks. With a standard DiT backbone, Omni2Sound achieves unified SOTA performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions. The project page is at https://swapforward.github.io/Omni2Sound.
TLDR: The paper introduces Omni2Sound, a unified video-text-to-audio diffusion model and a large-scale high-quality dataset (SoundAtlas) to address the challenges of multimodal audio generation, achieving state-of-the-art performance across video-to-audio, text-to-audio, and joint video-text-to-audio tasks.
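One common way a single diffusion model supports V2A, T2A, and VT2A is to replace absent conditions with a learned null embedding and stage the task mix during training. The sketch below is a hypothetical illustration of that pattern: the `NULL` placeholder, the task-sampling stages, and all names are assumptions, not Omni2Sound's actual schedule.

```python
# Hypothetical sketch of multi-task conditioning for a unified
# V2A/T2A/VT2A model: absent modalities are replaced by a "null"
# condition, and a staged schedule controls which tasks are trained.

import random

NULL = "<null>"  # stand-in for a learned null-condition embedding

def make_batch_condition(video_feat, text_feat, task: str):
    """Select which conditions the model sees for a given task."""
    if task == "v2a":
        return (video_feat, NULL)
    if task == "t2a":
        return (NULL, text_feat)
    if task == "vt2a":
        return (video_feat, text_feat)
    raise ValueError(f"unknown task: {task}")

def sample_task(stage: int) -> str:
    """Toy three-stage schedule: single tasks first, joint task last."""
    if stage == 1:
        return "t2a"
    if stage == 2:
        return random.choice(["t2a", "v2a"])
    return random.choice(["t2a", "v2a", "vt2a"])

cond = make_batch_condition("vid_feats", "txt_feats", sample_task(1))
```

Staging the task mix is one plausible way to turn the V2A-T2A trade-off into joint optimization, since each single-condition task is stabilized before the joint VT2A task competes for capacity.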
This paper introduces a diffusion-based framework for universal image segmentation, making agnostic segmentation possible without depending on mask-based frameworks and instead predicting the full segmentation in a holistic manner. We present several key adaptations to diffusion models, which are important in this discrete setting. Notably, we show that a location-aware palette with our 2D Gray code ordering improves performance. Adding a final tanh activation function is crucial for discrete data. On optimizing diffusion parameters, the sigmoid loss weighting consistently outperforms alternatives, regardless of the prediction type used, and we settle on x-prediction. While our current model does not yet surpass leading mask-based architectures, it narrows the performance gap and introduces unique capabilities, such as principled ambiguity modeling, that these models lack. All models were trained from scratch, and we believe that combining our proposed improvements with large-scale pretraining or promptable conditioning could lead to competitive models.
TLDR: This paper introduces a diffusion-based framework for universal image segmentation that predicts the full segmentation holistically, using adaptations like a location-aware palette with 2D Gray code ordering, a tanh activation, and sigmoid loss weighting. While not yet surpassing mask-based methods, it offers unique capabilities like ambiguity modeling.
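The benefit of a Gray-code palette is that adjacent class codes differ in a single bit, so a small regression error in the diffusion output corrupts at most one bit of the class code. The sketch below shows the standard 1-D binary-reflected Gray code as the building block; the paper's specific 2D arrangement is not detailed in the abstract, so this is an assumption-labeled illustration.

```python
# Sketch of a Gray-code class palette in the spirit of the paper's
# 2D Gray code ordering: neighboring class indices map to bit codes
# differing in exactly one position. Only the 1-D binary-reflected
# Gray code is shown; the 2D layout is not specified in the abstract.

def gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def gray_palette(num_classes: int, bits: int):
    """Map each class index to a bit-vector code via Gray ordering."""
    codes = []
    for c in range(num_classes):
        g = gray(c)
        codes.append(tuple((g >> b) & 1 for b in range(bits)))
    return codes

pal = gray_palette(8, 3)
# Hamming distance between codes of neighboring classes.
diffs = [sum(a != b for a, b in zip(pal[i], pal[i + 1]))
         for i in range(len(pal) - 1)]
```

Under a plain binary palette, stepping from class 3 (011) to class 4 (100) flips three bits at once; the Gray ordering removes such cliffs, which is a plausible reason the location-aware palette helps a continuous model predict discrete labels.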