Daily papers related to Image/Video/Multimodal Generation from cs.CV
November 11, 2025
Text-to-image models have rapidly evolved from casual creative tools to professional-grade systems, achieving unprecedented levels of image quality and realism. Yet, most models are trained to map short prompts into detailed images, creating a gap between sparse textual input and rich visual outputs. This mismatch reduces controllability, as models often fill in missing details arbitrarily, biasing toward average user preferences and limiting precision for professional use. We address this limitation by training the first open-source text-to-image model on long structured captions, where every training sample is annotated with the same set of fine-grained attributes. This design maximizes expressive coverage and enables disentangled control over visual factors. To process long captions efficiently, we propose DimFusion, a fusion mechanism that integrates intermediate tokens from a lightweight LLM without increasing token length. We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol. By assessing how well real images can be reconstructed through a captioning-generation loop, TaBR directly measures controllability and expressiveness, even for very long captions where existing evaluation methods fail. Finally, we demonstrate our contributions by training the large-scale model FIBO, achieving state-of-the-art prompt alignment among open-source models. Model weights are publicly available at https://huggingface.co/briaai/FIBO.
TLDR: This paper introduces FIBO, an open-source text-to-image model trained on long, structured captions to improve controllability and prompt alignment, along with DimFusion for efficient long caption processing and TaBR for controllability evaluation.
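To make the fixed attribute schema concrete, the sketch below serializes a structured caption into a long prompt string; the attribute names and the JSON serialization are hypothetical illustrations, not FIBO's actual schema.

```python
# Minimal sketch of a structured caption with a fixed attribute schema,
# in the spirit of FIBO's long structured captions. Attribute names here
# are hypothetical placeholders, not the paper's actual schema.
from dataclasses import dataclass, asdict
import json

@dataclass
class StructuredCaption:
    subject: str
    style: str
    lighting: str
    camera: str
    color_palette: str
    composition: str

    def to_prompt(self) -> str:
        # Serialize the fixed attribute set into a long caption string
        # that a text encoder / lightweight LLM can consume.
        return json.dumps(asdict(self), indent=2)

caption = StructuredCaption(
    subject="a ceramic teapot on a walnut table",
    style="studio product photograph",
    lighting="soft key light from the left, low-contrast fill",
    camera="85mm lens, shallow depth of field",
    color_palette="warm neutrals with muted teal accents",
    composition="subject centered, negative space on the right",
)
print(caption.to_prompt())
```

Because every sample carries the same attribute slots, individual visual factors can be varied one at a time, which is what enables the disentangled control the abstract describes.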
Read Paper (PDF)

Remarkable advances in recent 2D image and 3D shape generation have spurred significant interest in dynamic 4D content generation. However, previous 4D generation methods commonly struggle to maintain spatial-temporal consistency and adapt poorly to rapid temporal variations, due to the lack of effective spatial-temporal modeling. To address these problems, we propose a novel 4D generation network called 4DSTR, which modulates generative 4D Gaussian Splatting with spatial-temporal rectification. Specifically, temporal correlation across generated 4D sequences is exploited to rectify deformable scales and rotations and guarantee temporal consistency. Furthermore, an adaptive spatial densification and pruning strategy is proposed to address significant temporal variations by dynamically adding or deleting Gaussian points with awareness of their per-frame movements. Extensive experiments demonstrate that 4DSTR achieves state-of-the-art performance in video-to-4D generation, excelling in reconstruction quality, spatial-temporal consistency, and adaptation to rapid temporal movements.
TLDR: The paper introduces 4DSTR, a novel network for 4D content generation that uses spatial-temporal rectification to improve consistency and handle rapid temporal changes in video-to-4D generation tasks.
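As a rough illustration of temporal rectification, the sketch below smooths each Gaussian's per-frame attributes (e.g., log-scales or rotation parameters) over a small temporal window so that adjacent frames stay consistent; the moving-average form and the window size are assumptions made for illustration, not 4DSTR's actual formulation.

```python
# Hedged sketch: smooth per-frame Gaussian attributes over time so that
# neighboring frames agree. This is a generic temporal-smoothing stand-in
# for the paper's spatial-temporal rectification.
import numpy as np

def temporally_rectify(attr: np.ndarray, window: int = 3) -> np.ndarray:
    """attr: (T, N, D) per-frame attributes for N Gaussians over T frames."""
    T = attr.shape[0]
    out = np.empty_like(attr)
    half = window // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        out[t] = attr[lo:hi].mean(axis=0)  # average over a temporal window
    return out

# toy example: 8 frames, 4 Gaussians, 3-D log-scales
scales = np.random.randn(8, 4, 3)
print(temporally_rectify(scales).shape)  # (8, 4, 3)
```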
Read Paper (PDF)

This paper presents Omni-View, which extends unified multimodal understanding and generation to 3D scenes based on multiview images, exploring the principle that "generation facilitates understanding". Consisting of an understanding model, a texture module, and a geometry module, Omni-View jointly models scene understanding, novel view synthesis, and geometry estimation, enabling synergistic interaction between 3D scene understanding and generation tasks. By design, it leverages the spatiotemporal modeling capabilities of its texture module, which is responsible for appearance synthesis, alongside the explicit geometric constraints provided by its dedicated geometry module, thereby enriching the model's holistic understanding of 3D scenes. Trained with a two-stage strategy, Omni-View achieves a state-of-the-art score of 55.4 on the VSI-Bench benchmark, outperforming existing specialized 3D understanding models while simultaneously delivering strong performance in both novel view synthesis and 3D scene generation.
TLDR: The paper introduces Omni-View, a unified 3D model that leverages multiview images for scene understanding, novel view synthesis, and geometry estimation, achieving state-of-the-art results on VSI-Bench by jointly modeling understanding and generation.
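A schematic sketch of the three-module layout described above, with a texture stream and a geometry stream fused into the understanding model; the tensor shapes and the concatenation-based fusion are placeholder assumptions rather than Omni-View's actual interfaces.

```python
# Schematic sketch of an understanding model fed by separate texture
# (appearance) and geometry streams, loosely mirroring the Omni-View
# description. Shapes and fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn

class OmniViewSketch(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.texture = nn.Linear(dim, dim)            # appearance / novel-view features
        self.geometry = nn.Linear(dim, dim)           # explicit geometric features
        self.understanding = nn.Linear(2 * dim, dim)  # fuses both streams

    def forward(self, multiview_feats: torch.Tensor) -> torch.Tensor:
        tex = self.texture(multiview_feats)
        geo = self.geometry(multiview_feats)
        # understanding consumes both appearance and geometric evidence
        return self.understanding(torch.cat([tex, geo], dim=-1))

feats = torch.randn(2, 8, 256)            # batch of 2 scenes, 8 views each
print(OmniViewSketch()(feats).shape)      # torch.Size([2, 8, 256])
```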
Read Paper (PDF)

Text-to-image diffusion models exhibit remarkable generative capabilities, but lack precise control over object counts and spatial arrangements. This work introduces a two-stage system to address these compositional limitations. The first stage employs a Large Language Model (LLM) to generate a structured layout from a list of objects. The second stage uses a layout-conditioned diffusion model to synthesize a photorealistic image adhering to this layout. We find that task decomposition is critical for LLM-based spatial planning; by simplifying the initial generation to core objects and completing the layout with rule-based insertion, we improve object recall from 57.2% to 99.9% for complex scenes. For image synthesis, we compare two leading conditioning methods: ControlNet and GLIGEN. After domain-specific finetuning on table-setting datasets, we identify a key trade-off: ControlNet preserves text-based stylistic control but suffers from object hallucination, while GLIGEN provides superior layout fidelity at the cost of reduced prompt-based controllability. Our end-to-end system successfully generates images with specified object counts and plausible spatial arrangements, demonstrating the viability of a decoupled approach for compositionally controlled synthesis.
TLDR: The paper presents a two-stage system using LLMs and diffusion models for layout-controlled image generation, achieving improved object recall and exploring the trade-offs between ControlNet and GLIGEN for layout conditioning.
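The rule-based layout completion step can be pictured as filling free slots around the LLM-planned core objects; the grid-cell scheme below is a simplified assumption for illustration, not the paper's exact insertion rules.

```python
# Illustrative sketch of rule-based layout completion: the LLM only plans
# core objects, and remaining items are inserted into free slots by simple
# rules. The grid/slot scheme is an assumption made for illustration.
def complete_layout(core_layout, extra_objects, grid=(4, 4)):
    """core_layout: dict name -> (row, col); extra_objects: list of names."""
    occupied = set(core_layout.values())
    free = [(r, c) for r in range(grid[0]) for c in range(grid[1])
            if (r, c) not in occupied]
    layout = dict(core_layout)
    for name, cell in zip(extra_objects, free):  # fill free cells in order
        layout[name] = cell
    return layout

core = {"plate": (1, 1), "wine_glass": (0, 2)}   # from the LLM planner
extras = ["fork", "knife", "napkin"]              # completed by rules
print(complete_layout(core, extras))
```

Because the deterministic rules can never "forget" an object, every requested item ends up in the layout, which is consistent with the near-perfect object recall reported above.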
Read Paper (PDF)

Vector quantization (VQ) transforms continuous image features into discrete representations, providing compressed, tokenized inputs for generative models. However, VQ-based frameworks suffer from several issues, such as non-smooth latent spaces, weak alignment between representations before and after quantization, and poor coherence between the continuous and discrete domains. These issues lead to unstable codeword learning and underutilized codebooks, ultimately degrading the performance of both reconstruction and downstream generation tasks. To address these problems, we propose VAEVQ, which comprises three key components: (1) Variational Latent Quantization (VLQ), which replaces the autoencoder with a VAE for quantization to leverage its structured and smooth latent space, thereby facilitating more effective codeword activation; (2) Representation Coherence Strategy (RCS), which adaptively modulates the alignment strength between pre- and post-quantization features to enhance consistency and prevent overfitting to noise; and (3) Distribution Consistency Regularization (DCR), which aligns the entire codebook distribution with the continuous latent distribution to improve utilization. Extensive experiments on two benchmark datasets demonstrate that VAEVQ outperforms state-of-the-art methods.
TLDR: This paper introduces VAEVQ, a novel vector quantization method that enhances discrete visual tokenization using variational autoencoders and several regularization techniques to improve codebook utilization and performance in generative models.
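A minimal sketch of the setting: sample a latent with the VAE reparameterization trick, then snap it to the nearest codeword with a straight-through estimator. This is generic VQ machinery to make the pre-/post-quantization alignment concrete; it does not implement VAEVQ's RCS or DCR terms.

```python
# Hedged sketch: reparameterized VAE latent followed by nearest-codeword
# lookup with a straight-through gradient. Codebook size and dimensions
# are arbitrary toy values.
import torch

def sample_latent(mu, logvar):
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

def quantize(z, codebook):
    # z: (B, D), codebook: (K, D); pick the nearest codeword per sample
    dists = torch.cdist(z, codebook)   # (B, K) pairwise distances
    idx = dists.argmin(dim=1)
    z_q = codebook[idx]
    # straight-through: values come from the codebook, gradients flow to z
    return z + (z_q - z).detach(), idx

mu, logvar = torch.randn(4, 8), torch.zeros(4, 8)
codebook = torch.randn(16, 8)
z_q, idx = quantize(sample_latent(mu, logvar), codebook)
print(z_q.shape, idx.tolist())
```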
Read Paper (PDF)

Millimeter-wave radar offers a promising sensing modality for autonomous systems thanks to its robustness in adverse conditions and low cost. However, its utility is significantly limited by the sparsity and low resolution of radar point clouds, which poses challenges for tasks requiring dense and accurate 3D perception. Although recent efforts have shown great potential by exploring generative approaches to this problem, they often rely on dense voxel representations that are inefficient and struggle to preserve structural detail. To fill this gap, we make the key observation that latent diffusion models (LDMs), though successful in other modalities, have not been effectively leveraged for radar-based 3D generation due to a lack of compatible representations and conditioning strategies. We introduce RaLD, a framework that bridges this gap by integrating scene-level frustum-based LiDAR autoencoding, order-invariant latent representations, and direct radar spectrum conditioning. These insights lead to a more compact and expressive generation process. Experiments show that RaLD produces dense and accurate 3D point clouds from raw radar spectra, offering a promising solution for robust perception in challenging environments.
TLDR: The paper introduces RaLD, a latent diffusion model framework for generating high-resolution 3D radar point clouds from raw radar data, addressing the limitations of sparse radar data in autonomous systems.
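One way to read "order-invariant latent representations" is a permutation-invariant point-cloud encoder; the PointNet-style max-pooling sketch below demonstrates that property and is an assumption for illustration, not RaLD's actual autoencoder.

```python
# Sketch of an order-invariant point-cloud encoding: a shared per-point MLP
# followed by max pooling, so permuting the input points leaves the latent
# unchanged. Generic PointNet-style construction, not RaLD's architecture.
import torch
import torch.nn as nn

class OrderInvariantEncoder(nn.Module):
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, latent_dim))

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) -> latent: (B, latent_dim), invariant to point order
        return self.point_mlp(points).max(dim=1).values

enc = OrderInvariantEncoder()
pts = torch.randn(1, 100, 3)
perm = pts[:, torch.randperm(100)]
print(torch.allclose(enc(pts), enc(perm), atol=1e-6))  # True: order does not matter
```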
Read Paper (PDF)

In the early stages of semiconductor equipment development, obtaining large quantities of raw optical images poses a significant challenge. This data scarcity hinders the advancement of AI-powered solutions in semiconductor manufacturing. To address this challenge, we introduce SinSEMI, a novel one-shot learning approach that generates diverse and highly realistic images from a single optical image. SinSEMI employs a multi-scale flow-based model enhanced with LPIPS (Learned Perceptual Image Patch Similarity) energy guidance during sampling, ensuring both perceptual realism and output variety. We also introduce a comprehensive evaluation framework tailored for this application, which enables a thorough assessment using just two reference images. Through evaluation against multiple one-shot generation techniques, we demonstrate SinSEMI's superior performance in visual quality, quantitative measures, and downstream tasks. Our experimental results show that SinSEMI-generated images achieve both high fidelity and meaningful diversity, making them suitable as training data for semiconductor AI applications.
TLDR: The paper introduces SinSEMI, a one-shot image generation model for semiconductor inspection using a multi-scale flow-based approach with LPIPS guidance, along with a data-efficient evaluation framework.
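Energy guidance during sampling can be pictured as nudging each intermediate sample along the gradient of an energy term; in the sketch below a simple squared-error energy stands in for LPIPS, and the guidance scale and update loop are illustrative assumptions rather than SinSEMI's sampler.

```python
# Conceptual sketch of energy-guided sampling: after each base sampling
# step, correct the sample along the gradient of an energy w.r.t. the
# sample. A squared-error energy stands in for a perceptual (LPIPS) term.
import torch

def guided_step(x, reference, step_fn, guidance_scale=0.1):
    """One base sampling step followed by an energy-guidance correction."""
    x = step_fn(x)                            # base flow/diffusion update
    x = x.detach().requires_grad_(True)
    energy = ((x - reference) ** 2).mean()    # stand-in for an LPIPS energy
    grad, = torch.autograd.grad(energy, x)
    return (x - guidance_scale * grad).detach()

ref = torch.rand(1, 3, 64, 64)
x = torch.rand(1, 3, 64, 64)
for _ in range(5):
    # toy "sampler" step; a real flow-based model would go here
    x = guided_step(x, ref, step_fn=lambda t: 0.9 * t + 0.1 * torch.rand_like(t))
print(float(((x - ref) ** 2).mean()))
```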
Read Paper (PDF)