AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

December 02, 2025

Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling

Spatial reasoning, the ability to understand and interpret the 3D structure of the world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods predominantly rely on verbal descriptive tuning, which suffers from visual illiteracy, i.e., they learn spatial concepts through textual symbols alone, devoid of connection to their visual manifestations. To bridge this gap, this paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like spatial imagination. MILO integrates a visual generator to provide geometry-aware feedback, thereby implicitly grounding the MLLM's symbolic reasoning in perceptual experience. Complementing this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations, offering superior performance over absolute coordinate systems. To support training, we construct GeoGen, a large-scale Geometry-aware Generative dataset with 2,241 videos and 67,827 observation-action-outcome triplets. Experiments demonstrate that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks, offering a more holistic understanding of 3D space.
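
The abstract gives no implementation details for RePE beyond the fact that it encodes relative camera-pose transformations rather than absolute coordinates. As a hedged illustration of the underlying geometry only (function name and pose conventions are assumptions, not the authors' code), the relative transform between two absolute 4x4 camera poses can be computed like this:

```python
import numpy as np

def relative_pose(T_a: np.ndarray, T_b: np.ndarray) -> np.ndarray:
    """Relative camera-pose transformation T_rel with T_b = T_a @ T_rel.

    T_a, T_b are 4x4 rigid transforms [R | t; 0 0 0 1]. For a rigid
    transform the inverse is [R.T | -R.T @ t], so no general matrix
    inversion is needed.
    """
    R, t = T_a[:3, :3], T_a[:3, 3]
    T_a_inv = np.eye(4)
    T_a_inv[:3, :3] = R.T
    T_a_inv[:3, 3] = -R.T @ t
    return T_a_inv @ T_b
```

Encoding such relative transforms makes the representation invariant to the choice of world origin, which is one plausible reason a relative scheme can beat absolute coordinates.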

TLDR: This paper introduces MILO, a method for improving spatial reasoning in MLLMs by grounding symbolic reasoning in perceptual experience using a visual generator and a novel relative positional encoding scheme. They also introduce a large-scale Geometry-aware Generative dataset called GeoGen.

Relevance: (8/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Meng Cao, Haokun Lin, Haoyuan Li, Haoran Tang, Rongtao Xu, Dong An, Xue Liu, Ian Reid, Xiaodan Liang

Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights

Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, while fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To transition evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. Comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified models) uncovers: specialized T2I models demonstrate proficiency in aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming specialized counterparts in causal narrative coherence. However, even these unified architectures remain subordinate to closed-source models and struggle to overcome the core challenge of spatiotemporal consistency. This demonstrates that a focus on causally-isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling, ultimately limiting world knowledge internalization and generation.

TLDR: The paper introduces Envision, a new benchmark for evaluating the causal reasoning abilities of multimodal models in chained text-to-multi-image generation, highlighting current models' limitations in capturing dynamic processes and spatiotemporal consistency.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Juanxi Tian, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan

FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing

Instruction-based image editing through natural language has emerged as a powerful paradigm for intuitive visual manipulation. While recent models achieve impressive results on single edits, they suffer from severe quality degradation under multi-turn editing. Through systematic analysis, we identify progressive loss of high-frequency information as the primary cause of this quality degradation. We present FreqEdit, a training-free framework that enables stable editing across 10+ consecutive iterations. Our approach comprises three synergistic components: (1) high-frequency feature injection from reference velocity fields to preserve fine-grained details, (2) an adaptive injection strategy that spatially modulates injection strength for precise region-specific control, and (3) a path compensation mechanism that periodically recalibrates the editing trajectory to prevent over-constraint. Extensive experiments demonstrate that FreqEdit achieves superior performance in both identity preservation and instruction following compared to seven state-of-the-art baselines.
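
FreqEdit's actual injection operates on reference velocity fields inside the diffusion sampler; the minimal numpy sketch below illustrates only the generic idea the abstract describes: split off a high-frequency component with an FFT high-pass filter, then blend the reference's high frequencies into the edited signal with a spatially varying strength map (all names and the cutoff value are illustrative assumptions):

```python
import numpy as np

def high_frequency(x: np.ndarray, cutoff: float = 0.25) -> np.ndarray:
    """High-frequency component of a 2D map via an FFT high-pass:
    zero out a low-frequency disk around DC, then invert the FFT."""
    F = np.fft.fftshift(np.fft.fft2(x))
    h, w = x.shape
    yy, xx = np.mgrid[:h, :w]
    cy, cx = h // 2, w // 2
    low = ((yy - cy) ** 2 + (xx - cx) ** 2) <= (cutoff * min(h, w)) ** 2
    F[low] = 0.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

def inject(edited: np.ndarray, reference: np.ndarray,
           strength: np.ndarray) -> np.ndarray:
    """Spatially modulated injection: keep the edited map's low
    frequencies, blend in the reference's high frequencies where the
    per-pixel strength map is large."""
    return edited + strength * (high_frequency(reference) - high_frequency(edited))
```

The spatial `strength` map corresponds to the paper's adaptive injection idea: strong injection in detail-critical regions, weak injection where the edit should dominate.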

TLDR: FreqEdit is a training-free framework that preserves high-frequency image details during multi-turn instruction-based image editing, mitigating quality degradation through reference velocity fields, adaptive injection, and path compensation.

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yucheng Liao, Jiajun Liang, Kaiqian Cui, Baoquan Zhao, Haoran Xie, Wei Liu, Qing Li, Xudong Mao

Diffusion Fuzzy System: Fuzzy Rule Guided Latent Multi-Path Diffusion Modeling

Diffusion models have emerged as a leading technique for generating images due to their ability to create high-resolution and realistic images. Despite their strong performance, diffusion models still struggle to manage image collections with significant feature differences. They often fail to capture complex features and produce conflicting results. Research has attempted to address this issue by learning different regions of an image through multiple diffusion paths and then combining them. However, this approach leads to inefficient coordination among multiple paths and high computational costs. To tackle these issues, this paper presents a Diffusion Fuzzy System (DFS), a latent-space multi-path diffusion model guided by fuzzy rules. DFS offers several advantages. First, unlike traditional multi-path diffusion methods, DFS dedicates each of its diffusion paths to learning a specific class of image features. By assigning each path to a different feature type, DFS overcomes the limitations of multi-path models in capturing heterogeneous image features. Second, DFS employs rule-chain-based reasoning to dynamically steer the diffusion process and enable efficient coordination among multiple paths. Finally, DFS introduces a fuzzy membership-based latent-space compression mechanism to reduce the computational costs of multi-path diffusion effectively. We tested our method on three public datasets: LSUN Bedroom, LSUN Church, and MS COCO. The results show that DFS achieves more stable training and faster convergence than existing single-path and multi-path diffusion models. Additionally, DFS surpasses baseline models in both image quality and alignment between text and images, and also shows improved accuracy when comparing generated images to target references.
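
The abstract does not specify DFS's membership functions. As a sketch of how fuzzy memberships could softly assign a latent to multiple diffusion paths, one standard choice is normalized Gaussian memberships around per-path feature prototypes (the centers and bandwidth below are illustrative assumptions, not the paper's design):

```python
import numpy as np

def fuzzy_memberships(z: np.ndarray, centers: np.ndarray,
                      sigma: float = 1.0) -> np.ndarray:
    """Gaussian fuzzy membership of a latent vector z to each diffusion
    path's feature prototype, normalized so the memberships sum to one."""
    d2 = ((z[None, :] - centers) ** 2).sum(axis=1)   # squared distance to each prototype
    mu = np.exp(-d2 / (2 * sigma ** 2))              # unnormalized Gaussian membership
    return mu / mu.sum()
```

A latent near one prototype then routes almost entirely to that path, while ambiguous latents are shared, which is the usual rationale for fuzzy rather than hard assignment.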

TLDR: The paper introduces a Diffusion Fuzzy System (DFS), a fuzzy rule-guided latent multi-path diffusion model for image generation that improves feature capture, coordination, and computational efficiency compared to existing diffusion models.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Hailong Yang, Te Zhang, Kup-sze Choi, Zhaohong Deng

ChronosObserver: Taming 4D World with Hyperspace Diffusion Sampling

Although prevailing camera-controlled video generation models can produce cinematic results, lifting them directly to the generation of 3D-consistent and high-fidelity time-synchronized multi-view videos remains challenging, which is a pivotal capability for taming 4D worlds. Some works resort to data augmentation or test-time optimization, but these strategies are constrained by limited model generalization and scalability issues. To this end, we propose ChronosObserver, a training-free method including World State Hyperspace to represent the spatiotemporal constraints of a 4D world scene, and Hyperspace Guided Sampling to synchronize the diffusion sampling trajectories of multiple views using the hyperspace. Experimental results demonstrate that our method achieves high-fidelity and 3D-consistent time-synchronized multi-view video generation without training or fine-tuning for diffusion models.

TLDR: ChronosObserver introduces a training-free method for generating 3D-consistent, time-synchronized multi-view videos using hyperspace diffusion sampling, addressing limitations in existing camera-controlled video generation models.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Qisen Wang, Yifan Zhao, Peisen Shen, Jialu Li, Jia Li

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model. First, the conventional linear decoder in pixel MVM forces the predictor's output latents to be linearly projectable to, and thus separable in, pixel space, which conflicts with semantic abstraction. Our Stage 1 proposes a conditional diffusion decoder and injects reliable image-level semantic priors to enhance semantics and convergence, thus bridging pixel-level fidelity with high-level semantic abstraction. Stage 2 further learns world knowledge by predicting frozen Stage 1 targets within this space, mitigating shortcut learning. Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and provides a scalable path toward general video representation learning.

TLDR: InternVideo-Next introduces a two-stage masked video modeling pretraining approach using an Encoder-Predictor-Decoder framework with a conditional diffusion decoder to learn general video representations without video-text supervision, achieving state-of-the-art results.

Relevance: (8/10)
Novelty: (9/10)
Clarity: (7/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Chenting Wang, Yuhan Zhu, Yicheng Xu, Jiange Yang, Ziang Yan, Yali Wang, Yi Wang, Limin Wang

DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy

Despite recent text-to-image models achieving high-fidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each to a designated region. To accurately render each segment within their regions while preserving overall image coherence, we introduce two attention masks - Text-Focus and Context-Expansion - applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multi-sentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.
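
As a toy illustration of region-restricted attention in the spirit of the Text-Focus mask (not the authors' implementation; the construction below is an assumption about how such a mask could look), one can build a boolean attention mask over flattened image tokens from a per-token region map, so that tokens inside a text region attend only within that region:

```python
import numpy as np

def text_focus_mask(region: np.ndarray) -> np.ndarray:
    """Boolean attention mask over flattened image tokens.

    region: 2D map of per-pixel region ids, 0 = background. During a
    Text-Focus phase, tokens of a text region attend only to tokens of
    the same region so each segment is rendered in isolation, while
    background tokens attend everywhere to keep global coherence.
    """
    r = region.reshape(-1)
    same = r[:, None] == r[None, :]   # pairs of tokens in the same region
    bg = r[:, None] == 0              # background queries are unrestricted
    return same | bg
```

A Context-Expansion phase would then relax this mask (widen the attended area per region) so the isolated segments re-integrate with the surrounding image.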

TLDR: DCText is a training-free method improving text rendering in text-to-image models, particularly for long and multiple texts, using a divide-and-conquer strategy with attention masking.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Jaewoo Song, Jooyoung Choi, Kanghyun Baek, Sangyub Lee, Daemin Park, Sungroh Yoon

Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe

Recent efforts on Diffusion Mixture-of-Experts (MoE) models have primarily focused on developing more sophisticated routing mechanisms. However, we observe that the underlying architectural configuration space remains markedly under-explored. Inspired by the MoE design paradigms established in large language models (LLMs), we identify a set of crucial architectural factors for building effective Diffusion MoE models--including DeepSeek-style expert modules, alternative intermediate widths, varying expert counts, and enhanced attention positional encodings. Our systematic study reveals that carefully tuning these configurations is essential for unlocking the full potential of Diffusion MoE models, often yielding gains that exceed those achieved by routing innovations alone. Through extensive experiments, we present novel architectures that can be efficiently applied to both latent and pixel-space diffusion frameworks, which provide a practical and efficient training recipe that enables Diffusion MoE models to surpass strong baselines while using equal or fewer activated parameters. All code and models are publicly available at: https://github.com/yhlleo/EfficientMoE.
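
As background for the abstract's "activated parameters" accounting (this is a generic top-k MoE layer, not the paper's specific architecture), a token activates only k of the available expert feed-forward networks, chosen by a router:

```python
import numpy as np

def moe_forward(x: np.ndarray, experts: list, router_w: np.ndarray,
                k: int = 2) -> np.ndarray:
    """Generic top-k MoE forward pass: route each token to its k
    highest-scoring experts and combine their outputs with
    softmax-renormalized router scores. Only k of len(experts) expert
    FFNs are activated per token, which is what keeps the activated
    parameter count below the total parameter count."""
    logits = x @ router_w                        # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # k best experts per token
    out = np.zeros_like(x)
    for i, row in enumerate(topk):
        w = np.exp(logits[i, row] - logits[i, row].max())
        w /= w.sum()                             # softmax over selected experts
        for wj, j in zip(w, row):
            out[i] += wj * experts[j](x[i])
    return out
```

The paper's design factors (expert module style, intermediate width, expert count) all change what each `experts[j]` is, while the routing skeleton above stays the same.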

TLDR: This paper introduces an efficient training recipe for Diffusion Mixture-of-Experts (MoE) models by focusing on architectural configurations rather than solely on routing mechanisms, achieving superior performance with comparable or fewer activated parameters.

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yahui Liu, Yang Yue, Jingyuan Zhang, Chenxi Sun, Yang Zhou, Wencong Zeng, Ruiming Tang, Guorui Zhou

PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards

Personalized generation models for a single subject have demonstrated remarkable effectiveness, highlighting their significant potential. However, when extended to multiple subjects, existing models often exhibit degraded performance, particularly in maintaining subject consistency and adhering to textual prompts. We attribute these limitations to the absence of high-quality multi-subject datasets and refined post-training strategies. To address these challenges, we propose a scalable multi-subject data generation pipeline that leverages powerful single-subject generation models to construct diverse and high-quality multi-subject training data. Through this dataset, we first enable single-subject personalization models to acquire knowledge of synthesizing multi-image and multi-subject scenarios. Furthermore, to enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards and general-purpose rewards, which are incorporated into a refined reinforcement learning stage. To comprehensively evaluate multi-subject personalization, we introduce a new benchmark that assesses model performance using seven subsets across three dimensions. Extensive experiments demonstrate the effectiveness of our approach in advancing multi-subject personalized image generation. GitHub: https://github.com/wang-shulei/PSR
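
The abstract does not spell out the reward definition. One natural reading of a pairwise subject-consistency reward, sketched here purely under that assumption, is the mean pairwise cosine similarity between identity embeddings of the same subject extracted from different generated images:

```python
import numpy as np

def pairwise_consistency_reward(subject_embs: np.ndarray) -> float:
    """Mean pairwise cosine similarity over identity embeddings of one
    subject across generated images (rows of subject_embs). Returns a
    scalar in [-1, 1]; higher means the subject stays more consistent."""
    e = subject_embs / np.linalg.norm(subject_embs, axis=1, keepdims=True)
    sim = e @ e.T                       # cosine similarity matrix
    iu = np.triu_indices(len(e), k=1)   # distinct unordered pairs only
    return float(sim[iu].mean())
```

In an RL stage such a scalar would be combined per subject with the general-purpose rewards (text alignment, aesthetics) the abstract mentions.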

TLDR: The paper introduces a pipeline for scaling personalized image generation to multiple subjects by creating a multi-subject dataset using single-subject models and refining the results with pairwise subject-consistency rewards via reinforcement learning, demonstrating improved performance on a new multi-subject benchmark.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Shulei Wang, Longhui Wei, Xin He, Jianbo Ouyang, Hui Lu, Zhou Zhao, Qi Tian

Accelerating Inference of Masked Image Generators via Reinforcement Learning

Masked Generative Models (MGMs) demonstrate strong capabilities in generating high-fidelity images. However, they need many sampling steps to create high-quality generations, resulting in slow inference speed. In this work, we propose Speed-RL, a novel paradigm for accelerating pretrained MGMs to generate high-quality images in fewer steps. Unlike conventional distillation methods, which formulate the acceleration problem as a distribution matching problem where a few-step student model is trained to match the distribution generated by a many-step teacher model, we consider this problem as a reinforcement learning problem. Since the goal of acceleration is to generate high-quality images in fewer steps, we can combine a quality reward with a speed reward and finetune the base model using reinforcement learning with the combined reward as the optimization target. Through extensive experiments, we show that the proposed method accelerates the base model by a factor of 3x while maintaining comparable image quality.
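
The combined optimization target can be pictured as a weighted sum of a quality term and a speed term; the exact functional form and weight below are illustrative assumptions, not the paper's definitions:

```python
def combined_reward(quality: float, steps: int, max_steps: int = 32,
                    lam: float = 0.5) -> float:
    """Sketch of a Speed-RL-style objective: a quality reward (e.g. from
    an image scorer, higher is better) plus a speed reward that grows as
    fewer sampling steps are used, traded off by a weight lam."""
    speed = 1.0 - steps / max_steps   # 0 at the full budget, ~1 for very few steps
    return quality + lam * speed
```

Fine-tuning against such a reward pressures the model toward generations that keep the quality term high while shrinking the step count, rather than matching a teacher's distribution as in distillation.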

TLDR: This paper introduces Speed-RL, a reinforcement learning approach to accelerate masked generative models (MGMs) for image generation, achieving a 3x speedup with comparable image quality.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Pranav Subbaraman, Shufan Li, Siyan Zhao, Aditya Grover

StyleYourSmile: Cross-Domain Face Retargeting Without Paired Multi-Style Data

Cross-domain face retargeting requires disentangled control over identity, expressions, and domain-specific stylistic attributes. Existing methods, typically trained on real-world faces, either fail to generalize across domains, need test-time optimizations, or require fine-tuning with carefully curated multi-style datasets to achieve domain-invariant identity representations. In this work, we introduce StyleYourSmile, a novel one-shot cross-domain face retargeting method that eliminates the need for curated multi-style paired data. We propose an efficient data augmentation strategy alongside a dual-encoder framework, for extracting domain-invariant identity cues and capturing domain-specific stylistic variations. Leveraging these disentangled control signals, we condition a diffusion model to retarget facial expressions across domains. Extensive experiments demonstrate that StyleYourSmile achieves superior identity preservation and retargeting fidelity across a wide range of visual domains.

TLDR: The paper introduces StyleYourSmile, a novel cross-domain face retargeting method using a dual-encoder and diffusion model that achieves improved identity preservation and retargeting fidelity without requiring paired multi-style data.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Avirup Dey, Vinay Namboodiri

GRASP: Guided Residual Adapters with Sample-wise Partitioning

Recent advances in text-to-image diffusion models enable high-fidelity generation across diverse prompts. However, these models falter in long-tail settings, such as medical imaging, where rare pathologies comprise a small fraction of data. This results in mode collapse: tail-class outputs lack quality and diversity, undermining the goal of synthetic data augmentation for underrepresented conditions. We pinpoint gradient conflicts between frequent head and rare tail classes as the primary culprit, a factor unaddressed by existing sampling or conditioning methods that mainly steer inference without altering the learned distribution. To resolve this, we propose GRASP: Guided Residual Adapters with Sample-wise Partitioning. GRASP uses external priors to statically partition samples into clusters that minimize intra-group gradient clashes. It then fine-tunes pre-trained models by injecting cluster-specific residual adapters into transformer feedforward layers, bypassing learned gating for stability and efficiency. On the long-tail MIMIC-CXR-LT dataset, GRASP yields superior FID and diversity metrics, especially for rare classes, outperforming baselines like vanilla fine-tuning and Mixture of Experts variants. Downstream classification on NIH-CXR-LT improves considerably for tail labels. Generalization to ImageNet-LT confirms broad applicability. Our method is lightweight, scalable, and readily integrates with diffusion pipelines.
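
A cluster-specific residual adapter on a frozen feed-forward layer, with static cluster selection in place of learned gating, might look like the following sketch (the low-rank shape and zero initialization are common adapter conventions assumed here, not details from the abstract):

```python
import numpy as np

class ResidualAdapter:
    """Low-rank residual adapter added to a frozen feed-forward layer's
    output. One adapter is trained per sample cluster; the cluster id is
    assigned statically from external priors, so there is no learned
    gating to destabilize training."""
    def __init__(self, dim: int, rank: int = 4, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(0.0, 0.02, (dim, rank))
        self.up = np.zeros((rank, dim))   # zero-init: adapter starts as identity

    def __call__(self, h: np.ndarray) -> np.ndarray:
        return h + h @ self.down @ self.up

def forward_ffn(h, ffn, adapters, cluster_id):
    """Frozen FFN output plus the residual of this sample's cluster adapter."""
    return adapters[cluster_id](ffn(h))
```

Because each cluster's gradients only touch its own adapter, head and tail classes in different clusters no longer pull shared weights in conflicting directions, which is the mechanism GRASP targets.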

TLDR: The paper introduces GRASP, a method using guided residual adapters and sample-wise partitioning to improve text-to-image diffusion models in long-tail settings, particularly in medical imaging, by mitigating gradient conflicts between head and tail classes.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Felix Nützel, Mischa Dombrowski, Bernhard Kainz

Reversible Inversion for Training-Free Exemplar-guided Image Editing

Exemplar-guided Image Editing (EIE) aims to modify a source image according to a visual reference. Existing approaches often require large-scale pre-training to learn relationships between the source and reference images, incurring high computational costs. As a training-free alternative, inversion techniques can be used to map the source image into a latent space for manipulation. However, our empirical study reveals that standard inversion is sub-optimal for EIE, leading to poor quality and inefficiency. To tackle this challenge, we introduce Reversible Inversion (ReInversion) for effective and efficient EIE. Specifically, ReInversion operates as a two-stage denoising process, which is first conditioned on the source image and subsequently on the reference. Besides, we introduce a Mask-Guided Selective Denoising (MSD) strategy to constrain edits to target regions, preserving the structural consistency of the background. Both qualitative and quantitative comparisons demonstrate that our ReInversion method achieves state-of-the-art EIE performance with the lowest computational overhead.
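
The MSD strategy can be pictured as a per-step blend of two latents, keeping the edit inside the mask and the source structure outside it. This one-liner is a hedged sketch of that idea only, not the authors' sampler:

```python
import numpy as np

def masked_denoise_step(z_edit: np.ndarray, z_src: np.ndarray,
                        mask: np.ndarray) -> np.ndarray:
    """Mask-guided selective blend at one denoising step: keep the
    freshly denoised (reference-conditioned) latent inside the edit mask
    and re-impose the source-path latent outside it, so the background's
    structure survives the edit unchanged."""
    return mask * z_edit + (1.0 - mask) * z_src
```

Applying this at every step of the second (reference-conditioned) stage is what constrains the edit to the target region.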

TLDR: The paper introduces Reversible Inversion (ReInversion), a training-free method for exemplar-guided image editing that improves upon standard inversion techniques with a two-stage denoising process and mask-guided selective denoising for enhanced quality, efficiency, and structural consistency.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Yuke Li, Lianli Gao, Ji Zhang, Pengpeng Zeng, Lichuan Xiang, Hongkai Wen, Heng Tao Shen, Jingkuan Song

Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation

We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations among a small set of geometric primitives (e.g., parts, symmetry, topology), and is largely invariant to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility (overlap-free, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment as reward signals. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network via search-generated plans. Inspired by the Generative Adversarial Network, we use the generated instances for adversarial reward refinement. Over time, the generation should approach the actual data more closely as the reward model becomes unable to distinguish between generated instances and ground-truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.
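
The abstract does not detail its Gumbel component. In Gumbel-style MCTS (e.g. Gumbel MuZero), root actions are sampled without replacement via the Gumbel-Top-k trick, sketched here as general background rather than as the paper's exact procedure:

```python
import numpy as np

def gumbel_top_k(logits: np.ndarray, k: int, rng=None) -> np.ndarray:
    """Gumbel-Top-k trick: adding i.i.d. Gumbel(0, 1) noise to policy
    logits and keeping the k largest perturbed values samples k distinct
    actions without replacement, with probabilities matching the softmax
    policy. In Gumbel-style MCTS this selects which root actions to
    expand with the search budget."""
    rng = rng or np.random.default_rng(0)
    g = rng.gumbel(size=logits.shape)
    return np.argsort(logits + g)[::-1][:k]
```

Restricting the search to a small sampled action set like this is what makes tree search tractable over the combinatorial placement choices the abstract describes.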

TLDR: This paper introduces a constraint-guided framework, combining geometric reasoning and neural semantics with a Generative Adversarial Gumbel MCTS approach, for abstract visual composition generation, demonstrably superior to diffusion and auto-regressive models in Tangram assembly.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Zirui Zhao, Boye Niu, David Hsu, Wee Sun Lee

TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image

Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI--especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods mainly focus on large-scale scenes, struggling to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen accepts a reference image as input, which can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference to obtain per-instance images. Each instance is reconstructed into a 3D model followed by canonical coordinate alignment. The aligned 3D models then undergo pose and scale estimation before being assembled into a collision-free, simulation-ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top-view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from 2D reference. Extensive experiments and user studies show that TabletopGen achieves state-of-the-art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, capable of generating realistic tabletop scenes with rich stylistic and spatial diversity. Our code will be publicly available.

TLDR: TabletopGen is a training-free framework for generating interactive 3D tabletop scenes from text or images, addressing the limitations of existing methods in capturing the high-density layouts and spatial relationships characteristic of such scenes. It uses a novel two-stage pose and scale alignment approach for accurate 3D reconstruction.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Ziqian Wang, Yonghao He, Licheng Yang, Wei Zou, Hongxuan Ma, Liu Liu, Wei Sui, Yuxin Guo, Hu Su