AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

February 04, 2026

LIVE: Long-horizon Interactive Video World Modeling

Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with long-horizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre-trained teacher models and sequence-level distribution matching, which incur additional computational cost and fail to prevent error propagation beyond the training horizon. In this work, we propose LIVE, a Long-horizon Interactive Video world modEl that enforces bounded error accumulation via a novel cycle-consistency objective, thereby eliminating the need for teacher-based distillation. Specifically, LIVE first performs a forward rollout from ground-truth frames and then applies a reverse generation process to reconstruct the initial state. The diffusion loss is subsequently computed on the reconstructed terminal state, providing an explicit constraint on long-horizon error propagation. Moreover, we provide a unified view that encompasses different approaches and introduce a progressive training curriculum to stabilize training. Experiments demonstrate that LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.
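
A minimal sketch of the cycle-consistency signal described above, assuming a flow-matching-style diffusion loss; the callables step_fn, reverse_fn, and denoise_fn are hypothetical stand-ins for the world model's forward rollout, reverse generation, and denoising head, not the authors' API.

import torch

def cycle_consistency_loss(step_fn, reverse_fn, denoise_fn, x0, actions):
    # Forward rollout from the ground-truth initial frame x0: (B, C, H, W).
    state = x0
    for a in actions:
        state = step_fn(state, a)
    # Reverse generation back toward the initial state.
    for a in reversed(actions):
        state = reverse_fn(state, a)
    # Diffusion loss on the reconstructed state, anchored to the true x0,
    # which bounds how far long-horizon errors can drift.
    t = torch.rand(x0.size(0), device=x0.device).view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    noised = (1 - t) * state + t * noise
    return torch.mean((denoise_fn(noised, t) - x0) ** 2)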

TLDR: The paper introduces LIVE, a long-horizon video world model that uses a cycle-consistency objective to ensure bounded error accumulation, leading to improved long-horizon video generation without teacher distillation.

Relevance: (10/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (9/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Junchao Huang, Ziyang Ye, Xinting Hu, Tianyu He, Guiyu Zhang, Shaoshuai Shi, Jiang Bian, Li Jiang

MUSE: A Multi-agent Framework for Unconstrained Story Envisioning via Closed-Loop Cognitive Orchestration

Generating long-form audio-visual stories from a short user prompt remains challenging due to an intent-execution gap, where high-level narrative intent must be preserved across coherent, shot-level multimodal generation over long horizons. Existing approaches typically rely on feed-forward pipelines or prompt-only refinement, which often leads to semantic drift and identity inconsistency as sequences grow longer. We address this challenge by formulating storytelling as a closed-loop constraint enforcement problem and propose MUSE, a multi-agent framework that coordinates generation through an iterative plan-execute-verify-revise loop. MUSE translates narrative intent into explicit, machine-executable controls over identity, spatial composition, and temporal continuity, and applies targeted multimodal feedback to correct violations during generation. To evaluate open-ended storytelling without ground-truth references, we introduce MUSEBench, a reference-free evaluation protocol validated by human judgments. Experiments demonstrate that MUSE substantially improves long-horizon narrative coherence, cross-modal identity consistency, and cinematic quality compared with representative baselines.
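
The plan-execute-verify-revise loop is easy to picture as a small control loop; the sketch below uses hypothetical planner/executor/verifier/reviser callables and is not MUSE's actual interface.

def generate_story(prompt, planner, executor, verifier, reviser, max_rounds=3):
    plan = planner(prompt)                  # narrative intent -> explicit shot-level controls
    shots = executor(plan)                  # render candidate audio-visual shots
    for _ in range(max_rounds):
        violations = verifier(plan, shots)  # identity / spatial / temporal checks
        if not violations:
            break                           # all constraints satisfied
        plan = reviser(plan, violations)    # targeted correction, not full re-planning
        shots = executor(plan)
    return shots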

TLDR: The paper introduces MUSE, a multi-agent framework for long-form audio-visual story generation that uses a closed-loop constraint enforcement approach to improve narrative coherence and consistency, and proposes MUSEBench, a reference-free evaluation protocol.

Relevance: (10/10)
Novelty: (9/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Wenzhang Sun, Zhenyu Wang, Zhangchi Hu, Chunfeng Wang, Hao Li, Wei Chen

UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing

Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through a dual reasoning paradigm. We formulate generation as world knowledge-enhanced planning to inject implicit constraints, and leverage editing capabilities for fine-grained visual refinement to further correct visual errors via self-reflection. This approach unifies generation and editing within a shared representation, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense and physics) for planning, alongside an agent-generated corpus for visual self-correction. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench and UniREditBench, while maintaining superior general synthesis capabilities.

TLDR: The paper introduces UniReason, a unified framework for image generation and editing that uses a dual reasoning paradigm (world knowledge-enhanced planning and editing with self-reflection) and a large-scale reasoning dataset. It reports advanced performance on reasoning-intensive benchmarks while maintaining general synthesis capability.

Relevance: (10/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (9/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Dianyi Wang, Chaofan Ma, Feng Han, Size Wu, Wei Song, Yibin Wang, Zhixiong Zhang, Tianhang Wang, Siyuan Wang, Zhongyu Wei, Jiaqi Wang

3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.
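
The annealed geometric supervision can be summarized in a few lines; the linear schedule and the names geo_weight and smpl_geometry_loss below are assumptions for illustration, not the paper's specification.

def geo_weight(step, anneal_steps=10_000):
    # SMPL-based guidance is active only for early initialization and
    # decays to zero, handing control to the generator's own 3D priors.
    return max(0.0, 1.0 - step / anneal_steps)

# Hypothetical usage inside a training step:
#   total_loss = diffusion_loss + geo_weight(step) * smpl_geometry_loss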

TLDR: The paper introduces 3DiMo, a novel approach to 3D-aware human motion control in video generation, using an implicit, view-agnostic motion representation and view-rich supervision to achieve improved motion fidelity and visual quality with flexible camera control.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Zhixue Fang, Xu He, Songlin Tang, Haoxian Zhang, Qingfeng Li, Xiaoqiang Liu, Pengfei Wan, Kun Gai

BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks

Embodied world models have emerged as a promising paradigm in robotics, most of which leverage large-scale Internet videos or pretrained video generation models to enrich visual and motion priors. However, they still face key challenges: a misalignment between coordinate-space actions and pixel-space videos, sensitivity to camera viewpoint, and non-unified architectures across embodiments. To this end, we present BridgeV2W, which converts coordinate-space actions into pixel-aligned embodiment masks rendered from the URDF and camera parameters. These masks are then injected into a pretrained video generation model via a ControlNet-style pathway, which aligns the action control signals with predicted videos, adds view-specific conditioning to accommodate camera viewpoints, and yields a unified world model architecture across embodiments. To mitigate overfitting to static backgrounds, BridgeV2W further introduces a flow-based motion loss that focuses on learning dynamic and task-relevant regions. Experiments on single-arm (DROID) and dual-arm (AgiBot-G1) datasets, covering diverse and challenging conditions with unseen viewpoints and scenes, show that BridgeV2W improves video generation quality compared to prior state-of-the-art methods. We further demonstrate the potential of BridgeV2W on downstream real-world tasks, including policy evaluation and goal-conditioned planning. More results can be found on our project website at https://BridgeV2W.github.io .
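
The flow-based motion loss can be approximated as a flow-magnitude weighting of the reconstruction error; the exact weighting in the paper may differ, and the tensor shapes below are assumptions.

import torch

def flow_weighted_loss(pred, target, flow, eps=1e-6):
    # pred, target: (B, T, C, H, W) videos; flow: (B, T, 2, H, W) optical flow.
    motion = flow.norm(dim=2, keepdim=True)                            # per-pixel motion magnitude
    weight = motion / (motion.mean(dim=(-1, -2), keepdim=True) + eps)  # normalize per frame
    return (weight * (pred - target) ** 2).mean()                      # emphasize dynamic regions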

TLDR: BridgeV2W addresses the misalignment between actions and videos in embodied world models by converting actions into pixel-aligned embodiment masks and injecting them into a pre-trained video generation model. It shows improved video generation and promising results on downstream real-world tasks.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yixiang Chen, Peiyan Li, Jiabing Yang, Keji He, Xiangnan Wu, Yuan Xu, Kai Wang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang

Test-Time Conditioning with Representation-Aligned Visual Features

While representation alignment with self-supervised models has been shown to improve diffusion model training, its potential for enhancing inference-time conditioning remains largely unexplored. We introduce Representation-Aligned Guidance (REPA-G), a framework that leverages these aligned representations, which carry rich semantic properties, to enable test-time conditioning on visual features during generation. By optimizing a similarity objective (the potential) at inference, we steer the denoising process toward a conditioned representation extracted from a pre-trained feature extractor. Our method provides versatile control at multiple scales, ranging from fine-grained texture matching via single patches to broad semantic guidance using global image feature tokens. We further extend this to multi-concept composition, allowing for the faithful combination of distinct concepts. REPA-G operates entirely at inference time, offering a flexible and precise alternative to often ambiguous text prompts or coarse class labels. We theoretically justify how this guidance enables sampling from the potential-induced tilted distribution. Quantitative results on ImageNet and COCO demonstrate that our approach achieves high-quality, diverse generations. Code is available at https://github.com/valeoai/REPA-G.
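
The core mechanism, steering denoising by ascending a similarity potential in feature space, fits in a few lines. This is a hedged sketch: denoiser and encoder are assumed differentiable callables, and the guidance scale and schedule are not taken from the paper.

import torch

def guided_step(denoiser, encoder, x_t, t, target_feat, scale=1.0):
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)                     # model's current clean-image estimate
    feat = encoder(x0_hat)                        # pretrained feature extractor (e.g. self-supervised)
    sim = torch.cosine_similarity(feat.flatten(1), target_feat.flatten(1)).sum()
    grad = torch.autograd.grad(sim, x_t)[0]       # gradient of the potential
    return x_t.detach() + scale * grad            # nudge x_t toward the conditioned representation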

TLDR: The paper introduces REPA-G, a novel inference-time conditioning method for diffusion models that uses representation alignment with self-supervised features to steer image generation, offering versatile control and improved compositionality compared to text prompts.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Nicolas Sereyjol-Garros, Ellington Kirby, Victor Letzelter, Victor Besnier, Nermin Samet

Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers

Recent DiT-based text-to-image models increasingly adopt LLMs as text encoders, yet text conditioning remains largely static and often utilizes only a single LLM layer, despite pronounced semantic hierarchy across LLM layers and non-stationary denoising dynamics over both diffusion time and network depth. To better match the dynamic process of DiT generation and thereby enhance the diffusion model's generative capability, we introduce a unified normalized convex fusion framework equipped with lightweight gates to systematically organize multi-layer LLM hidden states via time-wise, depth-wise, and joint fusion. Experiments establish Depth-wise Semantic Routing as the superior conditioning strategy, consistently improving text-image alignment and compositional generation (e.g., +9.97 on the GenAI-Bench Counting task). Conversely, we find that purely time-wise fusion can paradoxically degrade visual generation fidelity. We attribute this to a train-inference trajectory mismatch: under classifier-free guidance, nominal timesteps fail to track the effective SNR, causing semantically mistimed feature injection during inference. Overall, our results position depth-wise routing as a strong and effective baseline and highlight the critical need for trajectory-aware signals to enable robust time-dependent conditioning.
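
Depth-wise routing amounts to learning, per DiT block, a convex combination over LLM layers. A minimal sketch, with the gate parameterization assumed:

import torch
import torch.nn as nn

class DepthwiseRouter(nn.Module):
    def __init__(self, num_llm_layers, num_dit_blocks):
        super().__init__()
        # One set of gate logits per DiT block, over LLM layers.
        self.logits = nn.Parameter(torch.zeros(num_dit_blocks, num_llm_layers))

    def forward(self, hidden_states, block_idx):
        # hidden_states: (num_llm_layers, B, L, D) stacked LLM hidden states.
        w = torch.softmax(self.logits[block_idx], dim=-1)  # normalized convex weights
        return torch.einsum("k,kbld->bld", w, hidden_states)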

TLDR: The paper introduces a method called Depth-wise Semantic Routing to improve text-to-image generation by strategically fusing multi-layer LLM hidden states in diffusion transformers, demonstrating improved text-image alignment and compositional generation.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Bozhou Li, Yushuo Guan, Haolin Li, Bohan Zeng, Yiyan Ji, Yue Ding, Pengfei Wan, Kun Gai, Yuanxing Zhang, Wentao Zhang

Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation

Multi-subject image generation aims to synthesize images that faithfully preserve the identities of multiple reference subjects while following textual instructions. However, existing methods often suffer from identity inconsistency and limited compositional control, as they rely on diffusion models to implicitly associate text prompts with reference images. In this work, we propose Hierarchical Concept-to-Appearance Guidance (CAG), a framework that provides explicit, structured supervision from high-level concepts to fine-grained appearances. At the conceptual level, we introduce a VAE dropout training strategy that randomly omits reference VAE features, encouraging the model to rely more on robust semantic signals from a Visual Language Model (VLM) and thereby promoting consistent concept-level generation in the absence of complete appearance cues. At the appearance level, we integrate the VLM-derived correspondences into a correspondence-aware masked attention module within the Diffusion Transformer (DiT). This module restricts each text token to attend only to its matched reference regions, ensuring precise attribute binding and reliable multi-subject composition. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multi-subject image generation, substantially improving prompt following and subject consistency.
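
The correspondence-aware masked attention can be sketched as standard cross-attention with a boolean mask built from the VLM correspondences (mask construction itself is omitted; shapes are assumptions, and each text token is assumed to have at least one allowed region):

import torch

def masked_cross_attention(q, k, v, allowed):
    # q: (B, Lt, D) text queries; k, v: (B, Lr, D) reference tokens;
    # allowed: (B, Lt, Lr) boolean, True where a text token may attend.
    scores = torch.einsum("bqd,bkd->bqk", q, k) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~allowed, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    return torch.einsum("bqk,bkd->bqd", attn, v)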

TLDR: This paper introduces a Hierarchical Concept-to-Appearance Guidance (CAG) framework for multi-subject image generation that improves identity consistency and compositional control by explicitly guiding the diffusion process with structured supervision from a Visual Language Model.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yijia Xu, Zihao Wang, Jinshi Cui

Socratic-Geo: Synthetic Data Generation and Geometric Reasoning via Multi-Agent Interaction

Multimodal Large Language Models (MLLMs) have significantly advanced vision-language understanding. However, even state-of-the-art models struggle with geometric reasoning, revealing a critical bottleneck: the extreme scarcity of high-quality image-text pairs. Human annotation is prohibitively expensive, while automated methods fail to ensure fidelity and training effectiveness. Existing approaches either passively adapt to available images or employ inefficient random exploration with filtering, decoupling generation from learning needs. We propose Socratic-Geo, a fully autonomous framework that dynamically couples data synthesis with model learning through multi-agent interaction. The Teacher agent generates parameterized Python scripts with reflective feedback (Reflect for solvability, RePI for visual validity), ensuring image-text pair purity. The Solver agent optimizes reasoning through preference learning, with failure paths guiding Teacher's targeted augmentation. Independently, the Generator learns image generation capabilities on accumulated "image-code-instruction" triplets, distilling programmatic drawing intelligence into visual generation. Starting from only 108 seed problems, Socratic-Solver achieves 49.11 on six benchmarks using one-quarter of baseline data, surpassing strong baselines by 2.43 points. Socratic-Generator achieves 42.4% on GenExam, establishing new state-of-the-art for open-source models, surpassing Seedream-4.0 (39.8%) and approaching Gemini-2.5-Flash-Image (43.1%).

TLDR: Socratic-Geo introduces a multi-agent framework for generating synthetic image-text pairs for geometric reasoning, achieving state-of-the-art performance on several benchmarks with significantly less data using three agents: Teacher, Solver, and Generator. This approach couples data synthesis with model learning through dynamic interaction.

Relevance: (8/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Zhengbo Jiao, Shaobo Wang, Zifan Zhang, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang

Composable Visual Tokenizers with Generator-Free Diagnostics of Learnability

We introduce CompTok, a training framework for learning visual tokenizers whose tokens are enhanced for compositionality. CompTok uses a token-conditioned diffusion decoder. By employing an InfoGAN-style objective, in which a recognition model is trained to predict the conditioning tokens from the decoded images, we force the decoder not to ignore any of the tokens. To promote compositional control, besides the original images, CompTok also trains on tokens formed by swapping token subsets between images, giving the tokens more compositional control over the decoder. As the swapped tokens do not have ground-truth image targets, we apply a manifold constraint via an adversarial flow regularizer to keep unpaired swap generations on the natural-image distribution. The resulting tokenizer not only achieves state-of-the-art performance on class-conditioned image generation, but also supports operations such as swapping tokens between images to achieve high-level semantic editing. Additionally, we propose two metrics that measure the landscape of the token space, describing not only the compositionality of the tokens but also how easily a generator can be trained on this space. Experiments show that CompTok improves on both metrics while supporting state-of-the-art generators for class-conditioned generation.
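
Two of the ingredients are compact enough to sketch: the InfoGAN-style recognition objective and token-subset swapping. The decoder/recognizer interfaces below are assumptions, not the paper's code.

import torch

def recognition_loss(decoder, recognizer, tokens):
    # The recognizer must recover the conditioning tokens from the decoded
    # image, so the decoder cannot ignore any token.
    image = decoder(tokens)
    return torch.mean((recognizer(image) - tokens) ** 2)

def swap_tokens(tokens_a, tokens_b, swap_mask):
    # Compose a new token set: entries where swap_mask is True come from
    # image B, the rest from image A (no ground-truth image exists for it,
    # hence the adversarial flow regularizer in the paper).
    return torch.where(swap_mask, tokens_b, tokens_a)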

TLDR: The paper introduces CompTok, a framework for learning composable visual tokenizers using a token-conditioned diffusion decoder with an InfoGAN-style objective and adversarial flow regularizer. The approach achieves state-of-the-art performance in class-conditioned image generation and offers mechanisms for semantic image editing via token swapping.

Relevance: (8/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Bingchen Zhao, Qiushan Guo, Ye Wang, Yixuan Huang, Zhonghua Zhai, Yu Tian

R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?

In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data for enhancing MLLMs in solving complex real-world tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach to synthesize high-quality, diverse and challenging multimodal data for MLLMs. The core idea of CADS is to leverage collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples for effectively driving model improvement. Specifically, CADS operates with two cyclic phases, i.e., Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate leverages collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism to optimize the generation context to encourage challenging and high-value data generation. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which demonstrates superior performance on various benchmarks.

TLDR: This paper proposes a method called Collective Adversarial Data Synthesis (CADS) to generate multimodal training data for MLLMs, resulting in a model, R1-SyntheticVL, that performs well on benchmarks.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Jingyi Zhang, Tianyi Lin, Huanjin Yao, Xiang Lan, Shunyu Liu, Jiaxing Huang

InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation

Autonomous driving relies on robust models trained on high-quality, large-scale multi-view driving videos. While world models offer a cost-effective solution for generating realistic driving videos, they struggle to maintain instance-level temporal consistency and spatial geometric fidelity. To address these challenges, we propose InstaDrive, a novel framework that enhances driving video realism through two key advancements: (1) Instance Flow Guider, which extracts and propagates instance features across frames to enforce temporal consistency, preserving instance identity over time. (2) Spatial Geometric Aligner, which improves spatial reasoning, ensures precise instance positioning, and explicitly models occlusion hierarchies. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality and enhances downstream autonomous driving tasks on the nuScenes dataset. Additionally, we utilize CARLA's autopilot to procedurally and stochastically simulate rare but safety-critical driving scenarios across diverse maps and regions, enabling rigorous safety evaluation for autonomous systems. Our project page is https://shanpoyang654.github.io/InstaDrive/page.html.

TLDR: InstaDrive addresses the challenges of temporal consistency and spatial fidelity in driving video generation by introducing instance-aware mechanisms for feature propagation and geometric alignment, resulting in state-of-the-art performance and improved safety evaluation.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Zhuoran Yang, Xi Guo, Chenjing Ding, Chiyu Wang, Wei Wu, Yanyong Zhang

PokeFusion Attention: Enhancing Reference-Free Style-Conditioned Generation

This paper studies reference-free style-conditioned character generation in text-to-image diffusion models, where high-quality synthesis requires both stable character structure and consistent, fine-grained style expression across diverse prompts. Existing approaches primarily rely on text-only prompting, which is often under-specified for visual style and tends to produce noticeable style drift and geometric inconsistency, or introduce reference-based adapters that depend on external images at inference time, increasing architectural complexity and limiting deployment flexibility. We propose PokeFusion Attention, a lightweight decoder-level cross-attention mechanism that fuses textual semantics with learned style embeddings directly inside the diffusion decoder. By decoupling text and style conditioning at the attention level, our method enables effective reference-free stylized generation while keeping the pretrained diffusion backbone fully frozen. PokeFusion Attention trains only decoder cross-attention layers together with a compact style projection module, resulting in a parameter-efficient and plug-and-play control component that can be easily integrated into existing diffusion pipelines and transferred across different backbones. Experiments on a stylized character generation benchmark (Pokemon-style) demonstrate that our method consistently improves style fidelity, semantic alignment, and character shape consistency compared with representative adapter-based baselines, while maintaining low parameter overhead and inference-time simplicity.
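
A sketch of decoder-level fusion of text and learned style embeddings; the style-token count, head count, and residual layout are assumptions.

import torch
import torch.nn as nn

class StyleFusedCrossAttention(nn.Module):
    def __init__(self, dim, num_styles, style_tokens=4, heads=8):
        super().__init__()
        # Learned style embeddings replace reference images at inference.
        self.style = nn.Parameter(torch.randn(num_styles, style_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, text_emb, style_id):
        # Concatenate text and style tokens into one attention context,
        # decoupling the two conditions at the attention level.
        ctx = torch.cat([text_emb, self.style[style_id]], dim=1)
        out, _ = self.attn(x, ctx, ctx)
        return x + out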

TLDR: The paper introduces PokeFusion Attention, a reference-free and parameter-efficient method for style-conditioned image generation in diffusion models by decoupling text and style conditioning at the attention level, demonstrating improved style fidelity and consistency on stylized character generation tasks.

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Jingbang Tang

ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask

Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset. Our project page is https://shanpoyang654.github.io/ConsisDrive/page.html.
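
The Instance-Masked Loss component can be approximated as probabilistic foreground re-weighting; the weighting constants and probability below are assumptions for illustration.

import torch

def instance_masked_loss(pred, target, instance_mask, fg_prob=0.8):
    # pred, target: (B, T, C, H, W); instance_mask: (B, T, 1, H, W) in {0, 1}.
    err = (pred - target) ** 2
    if torch.rand(()) < fg_prob:
        return ((1.0 + instance_mask) * err).mean()  # emphasize instance regions
    return err.mean()                                # keep full-scene fidelity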

TLDR: The paper introduces ConsisDrive, a driving world model for video generation that enforces temporal consistency at the instance level using instance-masked attention and loss, improving video generation quality and downstream autonomous driving task performance.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Zhuoran Yang, Yanyong Zhang

VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers

Replicating In-Context Learning (ICL) in computer vision remains challenging due to task heterogeneity. We propose VIRAL, a framework that elicits visual reasoning from a pre-trained image editing model by formulating ICL as conditional generation via visual analogy ($x_s : x_t :: x_q : y_q$). We adapt a frozen Diffusion Transformer (DiT) using role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA to mitigate gradient interference across diverse tasks. Additionally, to bridge the gaps in current visual context datasets, we curate a large-scale dataset spanning perception, restoration, and editing. Experiments demonstrate that VIRAL outperforms existing methods, validating that a unified V-ICL paradigm can handle the majority of visual tasks, including open-domain editing. Our code is available at https://anonymous.4open.science/r/VIRAL-744A

TLDR: The paper introduces VIRAL, a framework for visual in-context learning using diffusion transformers and visual analogy, achieving strong performance across various image tasks like perception, restoration, and editing, suggesting a unified V-ICL paradigm.

Relevance: (8/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Zhiwen Li, Zhongjie Duan, Jinyan Ye, Cen Chen, Daoyuan Chen, Yaliang Li, Yingda Chen

Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation

Inference-time scaling offers a versatile paradigm for aligning visual generative models with downstream objectives without parameter updates. However, existing approaches that optimize the high-dimensional initial noise suffer from severe inefficiency, as many search directions exert negligible influence on the final generation. We show that this inefficiency is closely related to a spectral bias in generative dynamics: model sensitivity to initial perturbations diminishes rapidly as frequency increases. Building on this insight, we propose Spectral Evolution Search (SES), a plug-and-play framework for initial noise optimization that executes gradient-free evolutionary search within a low-frequency subspace. Theoretically, we derive the Spectral Scaling Prediction from perturbation propagation dynamics, which explains the systematic differences in the impact of perturbations across frequencies. Extensive experiments demonstrate that SES significantly advances the Pareto frontier of generation quality versus computational cost, consistently outperforming strong baselines under equivalent budgets.
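
The key move is restricting a gradient-free search over initial noise to a low-frequency subspace. A rough sketch using an FFT band mask (the basis, cutoff, and evolution strategy are assumptions):

import torch

def spectral_mutate(noise, cutoff=8, sigma=0.5):
    # Perturb only low-frequency components of the initial noise (B, C, H, W).
    freq = torch.fft.fft2(noise)
    pert = torch.fft.fft2(sigma * torch.randn_like(noise))
    mask = torch.zeros_like(noise)
    for rows in (slice(None, cutoff), slice(-cutoff, None)):      # low-frequency corners
        for cols in (slice(None, cutoff), slice(-cutoff, None)):  # of the unshifted FFT
            mask[..., rows, cols] = 1.0
    return torch.fft.ifft2(freq + mask * pert).real

def evolve(noise, score_fn, rounds=10, population=4):
    best, best_score = noise, score_fn(noise)
    for _ in range(rounds):
        for cand in (spectral_mutate(best) for _ in range(population)):
            s = score_fn(cand)  # e.g. a reward model scoring the generation
            if s > best_score:
                best, best_score = cand, s
    return best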

TLDR: The paper introduces Spectral Evolution Search (SES), a method for efficient inference-time optimization of initial noise in image generation by focusing on low-frequency subspaces, leading to improved generation quality and reduced computational cost.

Relevance: (8/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Jinyan Ye, Zhongjie Duan, Zhiwen Li, Cen Chen, Daoyuan Chen, Yaliang Li, Yingda Chen

Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis

Distribution matching distillation (DMD) aligns a multi-step generator with its few-step counterpart to enable high-quality generation under low inference cost. However, DMD tends to suffer from mode collapse, as its reverse-KL formulation inherently encourages mode-seeking behavior, for which existing remedies typically rely on perceptual or adversarial regularization, thereby incurring substantial computational overhead and training instability. In this work, we propose a role-separated distillation framework that explicitly disentangles the roles of distilled steps: the first step is dedicated to preserving sample diversity via a target-prediction (e.g., v-prediction) objective, while subsequent steps focus on quality refinement under the standard DMD loss, with gradients from the DMD objective blocked at the first step. We term this approach Diversity-Preserved DMD (DP-DMD), which, despite its simplicity (no perceptual backbone, no discriminator, no auxiliary networks, and no additional ground-truth images), preserves sample diversity while maintaining visual quality on par with state-of-the-art methods in extensive text-to-image experiments.
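
The role separation reduces to where the gradient is cut. A schematic sketch (the timesteps, targets, and dmd_loss_fn interface are assumptions):

import torch

def dp_dmd_losses(generator, dmd_loss_fn, x_T, t1, t2, v_target):
    x1 = generator(x_T, t1)                      # first distilled step
    loss_div = torch.mean((x1 - v_target) ** 2)  # target-prediction (e.g. v-prediction) keeps diversity
    x2 = generator(x1.detach(), t2)              # detach() blocks DMD gradients at step one
    return loss_div + dmd_loss_fn(x2)            # later steps refine quality under DMD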

TLDR: This paper introduces Diversity-Preserved Distribution Matching Distillation (DP-DMD) for fast visual synthesis, which addresses mode collapse in traditional DMD by disentangling distillation steps to preserve sample diversity and refine quality, achieving state-of-the-art results without complex regularization techniques.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Tianhe Wu, Ruibin Li, Lei Zhang, Kede Ma

HP-GAN: Harnessing pretrained networks for GAN improvement with FakeTwins and discriminator consistency

Generative Adversarial Networks (GANs) have made significant progress in enhancing the quality of image synthesis. Recent methods frequently leverage pretrained networks to calculate perceptual losses or utilize pretrained feature spaces. In this paper, we extend the capabilities of pretrained networks by incorporating innovative self-supervised learning techniques and enforcing consistency between discriminators during GAN training. Our proposed method, named HP-GAN, effectively exploits neural network priors through two primary strategies: FakeTwins and discriminator consistency. FakeTwins leverages pretrained networks as encoders to compute a self-supervised loss and applies this through the generated images to train the generator, thereby enabling the generation of more diverse and high-quality images. Additionally, we introduce a consistency mechanism between discriminators that evaluate feature maps extracted from Convolutional Neural Network (CNN) and Vision Transformer (ViT) feature networks. Discriminator consistency promotes coherent learning among discriminators and enhances training robustness by aligning their assessments of image quality. Our extensive evaluation across seventeen datasets, including scenarios with large, small, and limited data, and covering a variety of image domains, demonstrates that HP-GAN consistently outperforms current state-of-the-art methods in terms of Fréchet Inception Distance (FID), achieving significant improvements in image diversity and quality. Code is available at: https://github.com/higun2/HP-GAN.
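
Reading between the lines, the two strategies might look roughly like the following; this is a speculative sketch (the FakeTwins formulation in particular is inferred from the name and description, not from the released code).

import torch

def faketwins_loss(encoder, fake_a, fake_b):
    # Pull pretrained-encoder features of two views of a generated image
    # together, a self-supervised signal applied through the generator.
    za, zb = encoder(fake_a), encoder(fake_b)
    return 1 - torch.cosine_similarity(za.flatten(1), zb.flatten(1)).mean()

def discriminator_consistency(logits_cnn, logits_vit):
    # Align the quality assessments of the CNN- and ViT-feature discriminators.
    return torch.mean((logits_cnn - logits_vit) ** 2)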

TLDR: HP-GAN improves image generation by using pretrained networks with novel self-supervised learning (FakeTwins) and discriminator consistency techniques, demonstrating state-of-the-art results across various datasets.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Geonhui Son, Jeong Ryong Lee, Dosik Hwang

From Tokens to Numbers: Continuous Number Modeling for SVG Generation

For certain image generation tasks, vector graphics such as Scalable Vector Graphics (SVGs) offer clear benefits such as increased flexibility, size efficiency, and editing ease, but remain less explored than raster-based approaches. A core challenge is that the numerical geometric parameters, which make up a large proportion of SVGs, are inefficiently encoded as long sequences of tokens. This slows training, reduces accuracy, and hurts generalization. To address these problems, we propose Continuous Number Modeling (CNM), an approach that directly models numbers as first-class, continuous values rather than discrete tokens. This formulation restores the mathematical elegance of the representation by aligning the model's inputs with the data's continuous nature, removing discretization artifacts introduced by token-based encoding. We then train a multimodal transformer on 2 million raster-to-SVG samples, followed by fine-tuning via reinforcement learning using perceptual feedback to further improve visual quality. Our approach improves training speed by over 30% while maintaining higher perceptual fidelity compared to alternative approaches. This work establishes CNM as a practical and efficient approach for high-quality vector generation, with potential for broader applications. We make our code available at http://github.com/mikeogezi/CNM.
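
The contrast with token-based encoding is easiest to see in code: a number enters the model as one continuous embedding rather than a string of digit tokens. A sketch with an assumed MLP featurizer:

import torch
import torch.nn as nn

class NumberEmbedder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, values):
        # values: (B, N) raw SVG coordinates; output: (B, N, dim).
        # One embedding per number, with no discretization artifacts.
        return self.mlp(values.unsqueeze(-1))

# A matching regression head (e.g. nn.Linear(dim, 1) with an L1/L2 loss)
# would predict numbers directly instead of decoding digit tokens.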

TLDR: This paper introduces Continuous Number Modeling (CNM), a novel approach for efficient and high-quality SVG generation by directly modeling numerical parameters as continuous values rather than discrete tokens, resulting in faster training and improved perceptual fidelity.

Relevance: (8/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Michael Ogezi, Martin Bell, Freda Shi, Ethan Smith

PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leaving existing pixel diffusion methods lagging behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses to guide diffusion model towards learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling performance on large-scale text-to-image generation with a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm. Codes are publicly available at https://github.com/Zehong-Ma/PixelGen.
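
The training objective is plausibly a weighted sum of a pixel diffusion loss and the two perceptual terms, applied to the model's clean-image estimate; the weights and feature choices below are assumptions (lpips_fn could be the lpips package, dino a frozen DINO encoder).

import torch

def perceptual_diffusion_loss(x0_pred, x0, lpips_fn, dino, w_lpips=1.0, w_dino=1.0):
    loss = torch.mean((x0_pred - x0) ** 2)                              # base pixel-space term
    loss = loss + w_lpips * lpips_fn(x0_pred, x0).mean()                # LPIPS: local patterns
    loss = loss + w_dino * torch.mean((dino(x0_pred) - dino(x0)) ** 2)  # DINO: global semantics
    return loss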

TLDR: PixelGen introduces perceptual loss (LPIPS and DINO) into pixel diffusion models, outperforming latent diffusion models on image generation tasks without VAEs, achieving a strong FID score and favorable scaling performance.

Relevance: (8/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Zehong Ma, Ruihan Xu, Shiliang Zhang

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

TLDR: The paper introduces MentisOculi, a benchmark to evaluate if UMMs can leverage intermediate visualizations for reasoning, finding that current models fail to improve performance even with ground-truth visuals.

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel

SceneLinker: Compositional 3D Scene Generation via Semantic Scene Graph from RGB Sequences

We introduce SceneLinker, a novel framework that generates compositional 3D scenes via semantic scene graph from RGB sequences. To adaptively experience Mixed Reality (MR) content based on each user's space, it is essential to generate a 3D scene that reflects the real-world layout by compactly capturing the semantic cues of the surroundings. Prior works struggled to fully capture the contextual relationship between objects or mainly focused on synthesizing diverse shapes, making it challenging to generate 3D scenes aligned with object arrangements. We address these challenges by designing a graph network with cross-check feature attention for scene graph prediction and constructing a graph-variational autoencoder (graph-VAE), which consists of a joint shape and layout block for 3D scene generation. Experiments on the 3RScan/3DSSG and SG-FRONT datasets demonstrate that our approach outperforms state-of-the-art methods in both quantitative and qualitative evaluations, even in complex indoor environments and under challenging scene graph constraints. Our work enables users to generate consistent 3D spaces from their physical environments via scene graphs, allowing them to create spatial MR content. Project page is https://scenelinker2026.github.io.

TLDR: The paper presents SceneLinker, a framework that generates compositional 3D scenes from RGB sequences using a semantic scene graph, outperforming existing methods in complex indoor environments. It enables the creation of spatial MR content from physical environments.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Seok-Young Kim, Dooyoung Kim, Woojin Cho, Hail Song, Suji Kang, Woontack Woo

SLIM-Diff: Shared Latent Image-Mask Diffusion with Lp loss for Data-Scarce Epilepsy FLAIR MRI

Focal cortical dysplasia (FCD) lesions in epilepsy FLAIR MRI are subtle and scarce, making joint image-mask generative modeling prone to instability and memorization. We propose SLIM-Diff, a compact joint diffusion model whose main contributions are (i) a single shared-bottleneck U-Net that enforces tight coupling between anatomy and lesion geometry from a 2-channel image+mask representation, and (ii) loss-geometry tuning via a tunable $L_p$ objective. As an internal baseline, we include the canonical DDPM-style objective ($\epsilon$-prediction with $L_2$ loss) and isolate the effect of prediction parameterization and $L_p$ geometry under a matched setup. Experiments show that $x_0$-prediction is consistently the strongest choice for joint synthesis, and that fractional sub-quadratic penalties ($L_{1.5}$) improve image fidelity while $L_2$ better preserves lesion mask morphology. Our code and model weights are available at https://github.com/MarioPasc/slim-diff
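
The tunable $L_p$ objective is essentially a one-liner: p=2 recovers the canonical DDPM loss, while the paper's fractional p=1.5 softens the penalty on large residuals (the eps term here is a numerical-stability assumption).

import torch

def lp_loss(pred, target, p=1.5, eps=1e-8):
    return ((pred - target).abs() + eps).pow(p).mean()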

TLDR: The paper introduces SLIM-Diff, a compact joint diffusion model for generating epilepsy FLAIR MRI images and lesion masks in data-scarce scenarios, using a shared-bottleneck U-Net and tunable Lp loss. They find that x0-prediction with fractional sub-quadratic penalties (L1.5) improves image fidelity.

Relevance: (6/10)
Novelty: (7/10)
Clarity: (8/10)
Potential Impact: (6/10)
Overall: (6/10)
Read Paper (PDF)

Authors: Mario Pascual-González, Ariadna Jiménez-Partinen, R. M. Luque-Baena, Fátima Nagib-Raya, Ezequiel López-Rubio