AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

December 17, 2025

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning the memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. Project page and online demo: https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.

TLDR: WorldPlay is a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency using novel techniques like Dual Action Representation, Reconstituted Context Memory, and Context Forcing, achieving 24 FPS at 720p.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (9/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo

OUSAC: Optimized Guidance Scheduling with Adaptive Caching for DiT Acceleration

Diffusion models have emerged as the dominant paradigm for high-quality image generation, yet their computational expense remains substantial due to iterative denoising. Classifier-Free Guidance (CFG) significantly enhances generation quality and controllability but doubles the computation by requiring both conditional and unconditional forward passes at every timestep. We present OUSAC (Optimized gUidance Scheduling with Adaptive Caching), a framework that accelerates diffusion transformers (DiT) through systematic optimization. Our key insight is that variable guidance scales enable sparse computation: adjusting scales at certain timesteps can compensate for skipping CFG at others, enabling both fewer total sampling steps and fewer CFG steps while maintaining quality. However, variable guidance patterns introduce denoising deviations that undermine standard caching methods, which assume constant CFG scales across steps. Moreover, different transformer blocks are affected to different degrees under such dynamic conditions. This paper develops a two-stage approach leveraging these insights. Stage-1 employs evolutionary algorithms to jointly optimize which timesteps to skip and what guidance scale to use, eliminating up to 82% of unconditional passes. Stage-2 introduces adaptive rank allocation that tailors calibration efforts per transformer block, maintaining caching effectiveness under variable guidance. Experiments demonstrate that OUSAC significantly outperforms state-of-the-art acceleration methods, achieving 53% computational savings with 15% quality improvement on DiT-XL/2 (ImageNet 512x512), 60% savings with 16.1% improvement on PixArt-alpha (MSCOCO), and 5x speedup on FLUX while improving CLIP Score over the 50-step baseline.
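
The scale-compensated skipping that Stage-1 searches for can be illustrated with a toy sampler. Everything below is a hypothetical sketch, not the paper's implementation: the schedule is hand-written (the paper finds it with an evolutionary algorithm), and `model_cond`/`model_uncond` stand in for the conditional and unconditional network passes.

```python
def cfg_step(eps_cond, eps_uncond, scale):
    """Standard classifier-free guidance: eps = eps_u + s * (eps_c - eps_u)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def sample_with_skips(model_cond, model_uncond, timesteps, schedule):
    """Run CFG sampling where schedule[t] = (skip_uncond, scale).
    Skipped steps reuse the cached unconditional output, typically with a
    larger guidance scale to compensate for the stale estimate."""
    outputs, cached_uncond = [], None
    for t in timesteps:
        skip, scale = schedule[t]
        eps_c = model_cond(t)
        if skip and cached_uncond is not None:
            eps_u = cached_uncond        # no unconditional forward pass
        else:
            eps_u = model_uncond(t)      # full CFG step; refresh the cache
            cached_uncond = eps_u
        outputs.append(cfg_step(eps_c, eps_u, scale))
    return outputs
```

With a schedule that skips half the steps, half the unconditional passes disappear; the paper reports eliminating up to 82% of them.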

TLDR: The paper introduces OUSAC, a framework that accelerates diffusion transformers (DiT) by optimizing guidance scheduling and adaptive caching, achieving significant computational savings and quality improvements in image generation.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (9/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Ruitong Sun, Tianze Yang, Wei Niu, Jin Sun

Towards Scalable Pre-training of Visual Tokenizers for Generation

The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly into improved generation performance. We identify this as the "pre-training scaling problem" and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) joint pre-training yields much better scaling properties, with generative performance scaling effectively with the compute, parameters, and data allocated to visual tokenizer pre-training. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPs in pre-training VTP achieves a 65.8% FID improvement in downstream generation, while a conventional autoencoder stagnates very early, at 1/10 of the FLOPs. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.
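
The joint objective can be written as a weighted sum of the three losses. The sketch below is a minimal scalar illustration: the weight values and helper names (`info_nce`, `vtp_loss`) are assumptions, and the self-supervised and reconstruction terms are passed in as precomputed scalars rather than computed from images.

```python
import math

def info_nce(sim_pos, sims_all, temperature=0.07):
    """Image-text contrastive term for one sample:
    -log softmax of the positive-pair similarity (numerically stable)."""
    logits = [s / temperature for s in sims_all]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - sim_pos / temperature

def vtp_loss(sim_pos, sims_all, ssl_loss, recon_loss,
             w_con=1.0, w_ssl=1.0, w_rec=1.0):
    """Joint objective: L = w_con*L_contrastive + w_ssl*L_ssl + w_rec*L_recon."""
    return (w_con * info_nce(sim_pos, sims_all)
            + w_ssl * ssl_loss + w_rec * recon_loss)
```

The point of the joint sum, per the abstract, is that the reconstruction term alone biases the latent space toward low-level detail; the contrastive and self-supervised terms inject the high-level semantics that generation needs.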

TLDR: The paper introduces VTP, a visual tokenizer pre-training framework that jointly optimizes image-text contrastive, self-supervised, and reconstruction losses, addressing the "pre-training scaling problem" in visual tokenizers and achieving significantly improved generation performance and faster convergence.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (9/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang

The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy

Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or under-editing. To address this issue, we introduce SynPS, a method that Synergistically leverages Positional embeddings and Semantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modifications and fidelity preservation. By adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach.

TLDR: This paper introduces SynPS, a method to improve the faithfulness of non-rigid image editing using diffusion models by synergistically leveraging positional and semantic information in the attention mechanism, thereby addressing attention collapse issues.

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Zhuo Chen, Fanyue Wei, Runze Xu, Jingjing Li, Lixin Duan, Angela Yao, Wen Li

SS4D: Native 4D Generative Model via Structured Spacetime Latents

We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency. At the core of our method is a compressed set of structured spacetime latents. Specifically, (1) to address the scarcity of 4D training data, we build on a pre-trained single-image-to-3D model, preserving strong spatial consistency. (2) Temporal consistency is enforced by introducing dedicated temporal layers that reason across frames. (3) To support efficient training and inference over long video sequences, we compress the latent sequence along the temporal axis using factorized 4D convolutions and temporal downsampling blocks. In addition, we employ a carefully designed training strategy to enhance robustness against occlusion.

TLDR: SS4D introduces a native 4D generative model that creates dynamic 3D objects from monocular video by directly training on 4D data using structured spacetime latents, addressing temporal coherence and structural consistency.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Zhibing Li, Mengchen Zhang, Tong Wu, Jing Tan, Jiaqi Wang, Dahua Lin

OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving

Autonomous driving has seen remarkable advancements, largely driven by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising solution by synthesizing realistic sensor data. However, existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OmniGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird's Eye View (BEV) space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, to jointly decode LiDAR and multi-view camera data. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Our comprehensive experiments demonstrate that OmniGen achieves strong performance in unified multimodal sensor data generation, with multimodal consistency and flexible sensor adjustments.

TLDR: The paper introduces OmniGen, a unified framework for generating aligned multimodal sensor data (LiDAR and camera) for autonomous driving using a shared BEV space, volume rendering, and a controllable Diffusion Transformer.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Tao Tang, Enhui Ma, xia zhou, Letian Wang, Tianyi Yan, Xueyang Zhang, Kun Zhan, Peng Jia, XianPeng Lang, Jia-Wang Bian, Kaicheng Yu, Xiaodan Liang

ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models

Multi-view image generation from a single image and text description remains challenging due to the difficulty of maintaining geometric consistency across different viewpoints. Existing approaches typically rely on 3D-aware architectures or specialized diffusion models that require extensive multi-view training data and complex geometric priors. In this work, we introduce ViewMask-1-to-3, a pioneering approach to apply discrete diffusion models to multi-view image generation. Unlike continuous diffusion methods that operate in latent spaces, ViewMask-1-to-3 formulates multi-view synthesis as a discrete sequence modeling problem, where each viewpoint is represented as visual tokens obtained through MAGVIT-v2 tokenization. By unifying language and vision through masked token prediction, our approach enables progressive generation of multiple viewpoints through iterative token unmasking with text input. ViewMask-1-to-3 achieves cross-view consistency through simple random masking combined with self-attention, eliminating the requirement for complex 3D geometric constraints or specialized attention architectures. Our approach demonstrates that discrete diffusion provides a viable and simple alternative to existing multi-view generation methods, ranking first on average across GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS, while maintaining architectural simplicity.
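
The progressive unmasking loop can be sketched independently of the model. Below, `predict` is a stand-in for the transformer that scores each masked position (a hypothetical interface), and the per-round unmasking ratio is an arbitrary choice; the real model operates on MAGVIT-v2 token ids conditioned on the input view and text.

```python
MASK = -1  # sentinel for a masked visual-token position (illustrative)

def iterative_unmask(tokens, predict, steps):
    """Fill MASK positions over several rounds, most confident first.
    predict(tokens, i) -> (token_id, confidence) for masked position i."""
    tokens = list(tokens)
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = {i: predict(tokens, i) for i in masked}
        k = max(1, len(masked) // 2)  # unmask roughly half per round
        for i in sorted(masked, key=lambda i: -preds[i][1])[:k]:
            tokens[i] = preds[i][0]
    return tokens
```

Because every position re-attends to the already-committed tokens of all viewpoints in the next round, cross-view consistency can emerge from plain self-attention, which is the simplicity the paper emphasizes.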

TLDR: The paper introduces ViewMask-1-to-3, a novel approach using discrete diffusion models for multi-view consistent image generation from a single image and text, achieving state-of-the-art results without reliance on complex 3D priors or specialized architectures.

Relevance: (9/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li

AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation

Despite significant progress in text-driven 4D human-object interaction (HOI) generation with supervised methods, the scalability remains limited by the scarcity of large-scale 4D HOI datasets. To overcome this, recent approaches attempt zero-shot 4D HOI generation with pre-trained image diffusion models. However, interaction cues are minimally distilled during the generation process, restricting their applicability across diverse scenarios. In this paper, we propose AnchorHOI, a novel framework that thoroughly exploits hybrid priors by incorporating video diffusion models beyond image diffusion models, advancing 4D HOI generation. Nevertheless, directly optimizing high-dimensional 4D HOI with such priors remains challenging, particularly for human pose and compositional motion. To address this challenge, AnchorHOI introduces an anchor-based prior distillation strategy, which constructs interaction-aware anchors and then leverages them to guide generation in a tractable two-step process. Specifically, two tailored anchors are designed for 4D HOI generation: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition, and anchor keypoints for realistic motion synthesis. Extensive experiments demonstrate that AnchorHOI outperforms previous methods with superior diversity and generalization.

TLDR: This paper introduces AnchorHOI, a zero-shot framework for 4D human-object interaction generation using anchor-based prior distillation to overcome limitations in existing methods related to interaction cues and high-dimensional optimization.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Sisi Dai, Kai Xu

FacEDiT: Unified Talking Face Editing and Generation via Facial Motion Infilling

Talking face editing and face generation have often been studied as distinct problems. In this work, we propose viewing both not as separate tasks but as subtasks of a unifying formulation, speech-conditional facial motion infilling. We explore facial motion infilling as a self-supervised pretext task that also serves as a unifying formulation of dynamic talking face synthesis. To instantiate this idea, we propose FacEDiT, a speech-conditional Diffusion Transformer trained with flow matching. Inspired by masked autoencoders, FacEDiT learns to synthesize masked facial motions conditioned on surrounding motions and speech. This formulation enables both localized generation and edits, such as substitution, insertion, and deletion, while ensuring seamless transitions with unedited regions. In addition, biased attention and temporal smoothness constraints enhance boundary continuity and lip synchronization. To address the lack of a standard editing benchmark, we introduce FacEDiTBench, the first dataset for talking face editing, featuring diverse edit types and lengths, along with new evaluation metrics. Extensive experiments validate that talking face editing and generation emerge as subtasks of speech-conditional motion infilling; FacEDiT produces accurate, speech-aligned facial edits with strong identity preservation and smooth visual continuity while generalizing effectively to talking face generation.

TLDR: The paper introduces FacEDiT, a speech-conditional diffusion transformer for unified talking face editing and generation via facial motion infilling, trained on a new benchmark dataset, FacEDiTBench.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Kim Sung-Bin, Joohyun Chang, David Harwath, Tae-Hyun Oh

Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as compact representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.
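
The truncation step itself is simple to sketch. The sentinel ids and register count below are assumptions; in the paper, register tokens are learned embeddings that compactly represent the dropped positions, and a matching attention mask keeps training consistent with this sampling procedure.

```python
def truncate_with_registers(tokens, mask_id=-1, num_registers=4, register_id=-2):
    """Drop masked positions from the sequence and append register tokens.

    Returns (kept, dropped): the shortened input actually fed to the
    transformer, and the positions that must be restored afterwards.
    """
    kept = [t for t in tokens if t != mask_id] + [register_id] * num_registers
    dropped = [i for i, t in enumerate(tokens) if t == mask_id]
    return kept, dropped
```

With mostly-masked sequences (the early sampling steps), the input length shrinks dramatically, which is the source of the intended savings.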

TLDR: Sparse-LaViDa accelerates masked discrete diffusion models (MDMs) for multimodal tasks by dynamically truncating redundant tokens during inference, achieving up to 2x speedup while maintaining generation quality.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen

DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders

Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation, keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4× real-time speed (less than 1 second for a 4-second video) that convey consistent appearance and motion to the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.
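
For intuition on what a "preview" of an intermediate state can be: epsilon-prediction diffusion models admit a standard closed-form estimate of the clean sample at any timestep. DiffusionBrowser instead trains dedicated multi-branch decoders (which also produce scene intrinsics); the formula below is only the classic training-free baseline it improves on.

```python
import math

def x0_preview(x_t, eps_hat, alpha_bar_t):
    """Standard DDPM estimate of the clean sample from a noisy one:
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)."""
    return (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_hat) / math.sqrt(alpha_bar_t)
```

At high noise (small alpha_bar_t) this estimate is blurry and unstable, which is precisely why a learned, lightweight decoder can give far more informative previews.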

TLDR: This paper introduces DiffusionBrowser, a fast decoder framework for video diffusion models that allows interactive preview generation and control during the denoising process, enabling users to guide and probe the generation.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Susung Hong, Chongjian Ge, Zhifei Zhang, Jui-Hsien Wang

I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners

Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene datasets, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are random compositions of objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing the widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: https://luling06.github.io/I-Scene-project/

TLDR: The paper presents I-Scene, a method that reprograms a pre-trained 3D instance generator for generalizable 3D scene generation using model-centric spatial supervision instead of dataset-bounded supervision, achieving better generalization to new layouts and object compositions.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Lu Ling, Yunhao Ge, Yichen Sheng, Aniket Bera

JoVA: Unified Multimodal Learning for Joint Video-Audio Generation

In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which introduce additional architecture design and weaken the model simplicity of the original transformers. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without the need for additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which enhances supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. Our results establish JoVA as an elegant framework for high-quality multimodal generation.
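
The architectural point, a single attention over the concatenated token streams rather than modality-specific fusion modules, can be sketched with single-head attention. This is a toy numpy sketch with identity Q/K/V projections (real layers learn these weights per transformer layer), so only the concatenate-attend-split pattern is meaningful.

```python
import numpy as np

def joint_self_attention(video_tokens, audio_tokens):
    """Single-head self-attention over the concatenation of both modalities,
    letting every video token attend to every audio token and vice versa.
    Toy sketch: identity Q/K/V projections stand in for learned ones."""
    x = np.concatenate([video_tokens, audio_tokens], axis=0)   # (Tv + Ta, d)
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # row-wise softmax
    out = weights @ x
    n_video = len(video_tokens)
    return out[:n_video], out[n_video:]                        # split per modality
```

Because the cross-modal interaction happens inside ordinary self-attention, no extra alignment module or fusion branch is needed, which is the simplicity argument the abstract makes.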

TLDR: The paper introduces JoVA, a unified framework for joint video-audio generation that addresses limitations in lip-sync and model complexity in existing methods by using joint self-attention and a mouth-area loss.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Xiaohu Huang, Hao Zhou, Qiangpeng Yang, Shilei Wen, Kai Han

Directional Textual Inversion for Personalized Text-to-Image Generation

Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI-variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
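
The slerp capability that DTI's hyperspherical parameterization enables is the standard spherical linear interpolation formula. The sketch below operates on plain Python lists and assumes both inputs are already unit-norm, as DTI's fixed-magnitude embeddings are.

```python
import math

def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v:
    slerp(u, v; t) = [sin((1-t)*theta)*u + sin(t*theta)*v] / sin(theta)."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    theta = math.acos(dot)
    if theta < 1e-8:          # nearly parallel: interpolation is trivial
        return list(u)
    s = math.sin(theta)
    return [(math.sin((1 - t) * theta) * a + math.sin(t * theta) * b) / s
            for a, b in zip(u, v)]
```

Unlike a straight line between two embeddings, every intermediate point stays on the unit hypersphere, so interpolated concept tokens keep an in-distribution norm, exactly the property whose violation DTI identifies as the failure mode of standard TI.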

TLDR: The paper introduces Directional Textual Inversion (DTI), a method that improves Textual Inversion by optimizing only the direction of the learned embedding on a unit hypersphere, addressing norm inflation issues and enabling smooth interpolation between concepts.

Relevance: (8/10)
Novelty: (9/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Kunhee Kim, NaHyeon Park, Kibeom Hong, Hyunjung Shim

Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All

This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.

TLDR: The paper introduces CHARGE, a new high-quality, multimodal dataset for novel view synthesis, encompassing dynamic scenes with various camera setups to facilitate 4D scene reconstruction and view generation research.

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Michal Nazarczuk, Thomas Tanay, Arthur Moreau, Zhensong Zhang, Eduardo Pérez-Pellitero

Broadening View Synthesis of Dynamic Scenes from Constrained Monocular Videos

In dynamic Neural Radiance Fields (NeRF) systems, state-of-the-art novel view synthesis methods often fail under significant viewpoint deviations, producing unstable and unrealistic renderings. To address this, we introduce Expanded Dynamic NeRF (ExpanDyNeRF), a monocular NeRF framework that leverages Gaussian splatting priors and a pseudo-ground-truth generation strategy to enable realistic synthesis under large-angle rotations. ExpanDyNeRF optimizes density and color features to improve scene reconstruction from challenging perspectives. We also present the Synthetic Dynamic Multiview (SynDM) dataset, the first synthetic multiview dataset for dynamic scenes with explicit side-view supervision, created using a custom GTA V-based rendering pipeline. Quantitative and qualitative results on SynDM and real-world datasets demonstrate that ExpanDyNeRF significantly outperforms existing dynamic NeRF methods in rendering fidelity under extreme viewpoint shifts. Further details are provided in the supplementary materials.

TLDR: The paper introduces ExpanDyNeRF, a monocular dynamic NeRF framework using Gaussian splatting priors and a pseudo-ground-truth generation for improved novel view synthesis under large viewpoint deviations, along with a new synthetic dynamic multiview dataset.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Le Jiang, Shaotong Zhu, Yedi Luo, Shayda Moezzi, Sarah Ostadabbas

SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing

Sketch editing is central to digital illustration, yet existing image editing systems struggle to preserve the sparse, style-sensitive structure of line art while supporting both high-level semantic changes and precise local redrawing. We present SketchAssist, an interactive sketch drawing assistant that accelerates creation by unifying instruction-guided global edits with line-guided region redrawing, while keeping unrelated regions and overall composition intact. To enable this assistant at scale, we introduce a controllable data generation pipeline that (i) constructs attribute-addition sequences from attribute-free base sketches, (ii) forms multi-step edit chains via cross-sequence sampling, and (iii) expands stylistic coverage with a style-preserving attribute-removal model applied to diverse sketches. Building on this data, SketchAssist employs a unified sketch editing framework with minimal changes to DiT-based editors. We repurpose the RGB channels to encode the inputs, enabling seamless switching between instruction-guided edits and line-guided redrawing within a single input interface. To further specialize behavior across modes, we integrate a task-guided mixture-of-experts into LoRA layers, routing by text and visual cues to improve semantic controllability, structural fidelity, and style preservation. Extensive experiments show state-of-the-art results on both tasks, with superior instruction adherence and style/structure preservation compared to recent baselines. Together, our dataset and SketchAssist provide a practical, controllable assistant for sketch creation and revision.

TLDR: SketchAssist is a novel sketch editing system that unifies instruction-guided editing with line-guided redrawing, enabled by a controllable data generation pipeline and a task-guided mixture-of-experts to improve controllability and style preservation.

TLDR: SketchAssist是一个新的草图编辑系统,它统一了指令引导的编辑和线条引导的重绘,通过可控的数据生成流水线和任务引导的混合专家模型,提高了可控性和风格保留。

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Han Zou, Yan Zhang, Ruiqi Yu, Cong Xie, Jie Huang, Zhenpeng Zhan

Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models

Training object detectors demands extensive, task-specific annotations, yet this requirement becomes impractical in UAV-based human detection due to constantly shifting target distributions and the scarcity of labeled images. As a remedy, synthetic simulators are adopted to generate annotated data at low annotation cost. However, the domain gap between synthetic and real images hinders the model from transferring effectively to the target domain. Accordingly, we introduce Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage diffusion-based framework designed to transform synthetic data for UAV-based human detection, narrowing the domain gap while preserving the original synthetic labels. CFHA explicitly decouples global style and local content domain discrepancies and bridges those gaps using three modules: (1) Global Style Transfer -- a diffusion model aligns the color, illumination, and texture statistics of synthetic images to the realistic style, using only a small real reference set; (2) Local Refinement -- a super-resolution diffusion model adds fine-grained, photorealistic detail to small objects such as human instances while preserving shape and boundary integrity; (3) Hallucination Removal -- a module that filters out human instances whose visual attributes do not align with real-world data, bringing human appearance closer to the target distribution. Extensive experiments on public UAV Sim2Real detection benchmarks demonstrate that our method significantly improves detection accuracy over non-transformed baselines, achieving up to a +14.1 mAP50 improvement on the Semantic-Drone benchmark. Ablation studies confirm the complementary roles of the global and local stages and highlight the importance of hierarchical alignment. The code is released at https://github.com/liwd190019/CFHA.
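The three-stage pipeline can be sketched with simple stand-ins: the actual stages are diffusion models, but statistics alignment, upsampling, and attribute-distance filtering capture their respective roles. All function names, shapes, and thresholds here are illustrative, not from the paper.

```python
import numpy as np

def global_style_transfer(image, ref_mean, ref_std):
    # Stand-in for the diffusion style stage: align first- and
    # second-order intensity statistics to a small real reference set.
    return (image - image.mean()) / (image.std() + 1e-8) * ref_std + ref_mean

def local_refinement(patch, scale=2):
    # Stand-in for the super-resolution diffusion stage: upsample
    # small-object patches while keeping their layout intact.
    return patch.repeat(scale, axis=0).repeat(scale, axis=1)

def hallucination_removal(instances, ref_attrs, max_dist=1.0):
    # Drop synthetic instances whose attributes sit too far from the
    # real-world attribute distribution.
    return [inst for inst in instances
            if np.linalg.norm(inst["attr"] - ref_attrs) <= max_dist]

img = np.array([[0.2, 0.4], [0.6, 0.8]])
styled = global_style_transfer(img, ref_mean=0.5, ref_std=0.1)
refined = local_refinement(styled)
kept = hallucination_removal(
    [{"attr": np.array([0.1, 0.1])}, {"attr": np.array([3.0, 3.0])}],
    ref_attrs=np.zeros(2))
print(refined.shape, len(kept))  # (4, 4) 1
```

The ordering matters: style is aligned globally first, details are refined locally second, and implausible instances are filtered last, mirroring the coarse-to-fine hierarchy.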

TLDR: This paper introduces a Coarse-to-Fine Hierarchical Alignment (CFHA) framework using diffusion models to adapt synthetic data for UAV-based human detection, improving detection accuracy by addressing domain gaps.

TLDR: 本文介绍了一种基于扩散模型的粗到细分层对齐(CFHA)框架,用于调整合成数据以进行基于无人机的行人检测,通过解决领域差距来提高检测精度。

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Wenda Li, Meng Wu, Sungmin Eum, Heesung Kwon, Qing Qu

MoLingo: Motion-Language Alignment for Text-to-Motion Generation

We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. Recent works perform latent space diffusion, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so diffusion becomes more effective, and (2) how to best inject text conditioning so the motion follows the description closely. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text-motion alignment. With semantically aligned latents, auto-regressive generation, and cross-attention text conditioning, our model sets a new state of the art in human motion generation on standard metrics and in a user study. We will release our code and models for further research and downstream usage.
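The multi-token cross-attention conditioning the paper favors over single-token conditioning can be sketched as follows; shapes and names are hypothetical, and the real model would use learned projections and multiple heads.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(motion_latents, text_tokens):
    # Each motion latent attends over all text tokens, rather than
    # conditioning on a single pooled text embedding.
    scores = motion_latents @ text_tokens.T / np.sqrt(motion_latents.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ text_tokens

T, L, d = 4, 3, 8  # motion latents, text tokens, channels (arbitrary)
ctx = cross_attention(rng.normal(size=(T, d)), rng.normal(size=(L, d)))
print(ctx.shape)  # (4, 8)
```

Because each latent gets its own mixture of token embeddings, fine-grained phrases in the description can influence different parts of the motion, which is the intuition behind the reported gain over single-token conditioning.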

TLDR: MoLingo is a text-to-motion model that uses denoising diffusion in a semantically aligned latent space with cross-attention text conditioning to achieve state-of-the-art results in human motion generation.

TLDR: MoLingo是一个文本到动作的模型,它在语义对齐的潜在空间中使用去噪扩散和交叉注意力文本条件,从而在人体动作生成方面实现了最先进的成果。

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Yannan He, Garvita Tiwari, Xiaohan Zhang, Pankaj Bora, Tolga Birdal, Jan Eric Lenssen, Gerard Pons-Moll

Recurrent Video Masked Autoencoders

We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.
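The linear-cost recurrent aggregation can be illustrated with a minimal sketch. RVM's aggregator is transformer-based; the plain recurrence below is only a stand-in to show why the cost is O(T) in sequence length, and all shapes are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def recurrent_aggregate(frame_feats, W_h, W_x):
    # One recurrent update per frame: O(T) cost in sequence length,
    # unlike full spatio-temporal attention, which scales as O(T^2).
    h = np.zeros(W_h.shape[0])
    for x in frame_feats:
        h = np.tanh(W_h @ h + W_x @ x)
    return h

T, d_x, d_h = 16, 32, 8  # frames, feature dim, state dim (arbitrary)
feats = rng.normal(size=(T, d_x))
W_h = rng.normal(size=(d_h, d_h)) * 0.1
W_x = rng.normal(size=(d_h, d_x)) * 0.1
h = recurrent_aggregate(feats, W_h, W_x)
print(h.shape)  # (8,)
```

The fixed-size state `h` is what enables stable propagation over long horizons: memory does not grow with the number of frames processed.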

TLDR: The paper introduces Recurrent Video Masked Autoencoders (RVM), a novel transformer-based recurrent model for video representation learning that achieves strong performance on various video and image tasks with high parameter efficiency and stable feature propagation over long temporal sequences.

TLDR: 该论文介绍了循环视频掩蔽自编码器 (RVM),一种新的基于 Transformer 的循环模型,用于视频表征学习,在各种视频和图像任务上都表现出色,具有高参数效率和在长时间序列上稳定的特征传播。

Relevance: (6/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A Hudson, Joao Carreira, Andrew Zisserman

4D-RaDiff: Latent Diffusion for 4D Radar Point Cloud Generation

Automotive radar has shown promising developments in environment perception due to its cost-effectiveness and robustness in adverse weather conditions. However, the limited availability of annotated radar data poses a significant challenge for advancing radar-based perception systems. To address this limitation, we propose a novel framework to generate 4D radar point clouds for training and evaluating object detectors. Unlike image-based diffusion, our method is designed to account for the sparsity and unique characteristics of radar point clouds by applying diffusion to a latent point cloud representation. Within this latent space, generation is controlled via conditioning at either the object or scene level. The proposed 4D-RaDiff converts unlabeled bounding boxes into high-quality radar annotations and transforms existing LiDAR point cloud data into realistic radar scenes. Experiments demonstrate that incorporating synthetic radar data from 4D-RaDiff as a data augmentation method during training consistently improves object detection performance compared to training on real data only. In addition, pre-training on our synthetic data reduces the amount of required annotated radar data by up to 90% while achieving comparable object detection performance.
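A conditioned latent-diffusion training step of the kind the abstract describes can be sketched generically. The epsilon-prediction objective below is the standard DDPM formulation, not the paper's exact loss, and the latent shape, conditioning vector, and dummy denoiser are all placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_training_step(z0, cond, alphas_bar, denoiser):
    # Standard epsilon-prediction objective applied to a point-cloud
    # latent z0 rather than image pixels; `cond` carries object- or
    # scene-level conditioning (e.g. bounding-box features).
    t = rng.integers(len(alphas_bar))
    eps = rng.normal(size=z0.shape)
    a = alphas_bar[t]
    zt = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps  # noised latent
    eps_hat = denoiser(zt, t, cond)
    return np.mean((eps_hat - eps) ** 2)

z0 = rng.normal(size=(64, 4))        # toy latent: 64 points x 4 channels
cond = rng.normal(size=(8,))         # toy conditioning vector
alphas_bar = np.linspace(0.99, 0.01, 1000)
loss = diffusion_training_step(
    z0, cond, alphas_bar, lambda zt, t, c: np.zeros_like(zt))  # dummy net
print(float(loss) >= 0.0)  # True
```

Working in a latent point representation lets the noise model respect radar sparsity, which pixel-grid diffusion does not.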

TLDR: The paper introduces 4D-RaDiff, a latent diffusion framework for generating synthetic 4D radar point clouds, addressing the limitations of sparse radar data and improving object detection performance through data augmentation and pre-training.

TLDR: 该论文介绍了4D-RaDiff,一种用于生成合成4D雷达点云的潜在扩散框架,旨在解决稀疏雷达数据的问题,并通过数据增强和预训练提高目标检测性能。

Relevance: (5/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (6/10)
Read Paper (PDF)

Authors: Jimmie Kwok, Holger Caesar, Andras Palffy

Establishing Stochastic Object Models from Noisy Data via Ambient Measurement-Integrated Diffusion

Task-based measures of image quality (IQ) are critical for evaluating medical imaging systems, which must account for randomness including anatomical variability. Stochastic object models (SOMs) provide a statistical description of such variability, but conventional mathematical SOMs fail to capture realistic anatomy, while data-driven approaches typically require clean data rarely available in clinical tasks. To address this challenge, we propose AMID, an unsupervised Ambient Measurement-Integrated Diffusion with noise decoupling, which establishes clean SOMs directly from noisy measurements. AMID introduces a measurement-integrated strategy that aligns measurement noise with the diffusion trajectory and explicitly models the coupling between measurement and diffusion noise across steps; an ambient loss is then designed on this basis to learn clean SOMs. Experiments on real CT and mammography datasets show that AMID outperforms existing methods in generation fidelity and yields more reliable task-based IQ evaluation, demonstrating its potential for unsupervised medical imaging analysis.
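The idea of aligning measurement noise with the diffusion trajectory can be made concrete with a small sketch. Under a standard variance-preserving schedule, each step has a noise-to-signal level; one simplified reading of the measurement-integrated strategy is that a noisy measurement can enter the trajectory at the step whose level matches its own noise. The schedule and values below are illustrative only.

```python
import numpy as np

def matching_step(alphas_bar, sigma_meas):
    # For x_t = sqrt(a_t) x0 + sqrt(1 - a_t) eps, the noise-to-signal
    # std is sqrt((1 - a_t) / a_t); pick the step whose level matches
    # the measurement noise so the noisy sample joins the trajectory.
    rel_std = np.sqrt((1.0 - alphas_bar) / alphas_bar)
    return int(np.argmin(np.abs(rel_std - sigma_meas)))

alphas_bar = np.linspace(0.99, 0.05, 20)  # decreasing signal fraction
t_clean = matching_step(alphas_bar, 0.05)  # low measurement noise
t_noisy = matching_step(alphas_bar, 3.0)   # heavy measurement noise
print(t_clean < t_noisy)  # True
```

Noisier measurements map to later (more diffused) steps, so the model only ever has to denoise from a level it was trained for; AMID's full method additionally models the coupling between measurement and diffusion noise, which this sketch omits.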

TLDR: The paper introduces AMID, an unsupervised method for creating stochastic object models (SOMs) from noisy medical imaging data by integrating measurement noise with a diffusion process, enabling more reliable task-based image quality evaluation.

TLDR: 本文介绍了一种名为AMID的无监督方法,该方法通过将测量噪声与扩散过程相结合,从嘈杂的医学图像数据中创建随机对象模型(SOM),从而实现更可靠的基于任务的图像质量评估。

Relevance: (4/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (6/10)
Read Paper (PDF)

Authors: Jianwei Sun, Xiaoning Lei, Wenhao Cai, Xichen Xu, Yanshu Wang, Hu Gao