Daily papers related to Image/Video/Multimodal Generation from cs.CV
August 30, 2025
Recent advances in diffusion-based generative models have shown significant potential for augmenting scarce object detection datasets. However, most recent approaches rely on resource-intensive full fine-tuning of large-scale diffusion models, requiring enterprise-grade GPUs (e.g., NVIDIA V100) and thousands of synthetic images. To address these limitations, we propose Flux LoRA Augmentation (FLORA), a lightweight synthetic data generation pipeline. Our approach uses the Flux 1.1 Dev diffusion model, fine-tuned exclusively through Low-Rank Adaptation (LoRA). This dramatically reduces computational requirements, enabling synthetic dataset generation on a consumer-grade GPU (e.g., NVIDIA RTX 4090). We empirically evaluate our approach on seven diverse object detection datasets. Object detectors trained with just 500 synthetic images generated by our approach outperform those trained on 5000 synthetic images from the ODGEN baseline, with improvements of up to 21.3% in mAP@.50:.95. FLORA thus surpasses state-of-the-art performance using only 10% of the data and a fraction of the computational cost. These results show that a quality- and efficiency-focused approach is more effective than brute-force generation, making advanced synthetic data creation more practical and accessible for real-world scenarios.
TLDR: The paper introduces FLORA, a lightweight synthetic data generation pipeline for object detection using LoRA fine-tuning of a diffusion model, achieving superior performance with significantly reduced computational cost and data requirements compared to full fine-tuning methods.
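To make the efficiency argument concrete, here is a minimal sketch of the general LoRA idea: a frozen weight matrix W is augmented with a trainable low-rank product B·A, so only r·(d_in + d_out) parameters are updated instead of d_in·d_out. This is not FLORA's code. The class name `LoRALinear`, the layer sizes, the rank, and the `alpha` scaling are illustrative assumptions, and real diffusion models apply the same idea to much larger attention and projection layers.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Toy linear layer with a frozen weight and a low-rank trainable update.

    Hypothetical illustration only: names and hyperparameters are assumptions,
    not taken from the FLORA paper.
    """

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Frozen pretrained weight W (d_out x d_in); never updated.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable down-projection A (rank x d_in), small random init.
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # Trainable up-projection B (d_out x rank), zero init so the
        # adapted layer starts out identical to the frozen layer.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path: W x
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path: B (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def full_params(self):
        return len(self.W) * len(self.W[0])


layer = LoRALinear(d_in=64, d_out=64, rank=4)
print(layer.trainable_params(), "trainable vs", layer.full_params(), "frozen")
```

With these toy sizes the adapter trains 512 parameters against 4096 frozen ones. The gap widens as layers grow, because the adapter scales linearly with the layer dimensions while the full matrix scales with their product.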
Understanding temporal dynamics in medical imaging is crucial for applications such as disease progression modeling, treatment planning, and anatomical development tracking. However, most deep learning methods either consider only single temporal contexts or focus on tasks like classification or regression, limiting their ability to make fine-grained spatial predictions. While some approaches have been explored, they are often limited to single timepoints or specific diseases, or have other technical restrictions. To address this fundamental gap, we introduce Temporal Flow Matching (TFM), a unified generative trajectory method that (i) aims to learn the underlying temporal distribution, (ii) by design can fall back to a nearest image predictor, i.e. predicting the last context image (LCI), as a special case, and (iii) supports 3D volumes, multiple prior scans, and irregular sampling. Extensive benchmarks on three public longitudinal datasets show that TFM consistently surpasses spatio-temporal methods from natural imaging, establishing a new state-of-the-art and robust baseline for 4D medical image prediction.
TLDR: The paper introduces Temporal Flow Matching (TFM) for generating spatio-temporal trajectories in 4D longitudinal medical imaging, outperforming existing methods on multiple datasets and establishing a new state-of-the-art baseline.
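One way to read the "LCI as a special case" property: if the flow starts at the last context image and a learned velocity field transports it toward the future scan, then a zero velocity field leaves the image unchanged, which is exactly the LCI prediction. The sketch below illustrates this with generic flow matching on tiny vectors. The linear interpolation path, the Euler integrator, and all function names are assumptions made for illustration, not details confirmed by the abstract.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t for linear interpolation."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    v_pred = velocity_fn(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x0)


def euler_integrate(x0, velocity_fn, steps=10):
    """Transport x0 along the velocity field from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 3-voxel "images": the last context scan and the future scan.
last_context = [0.2, 0.5, 0.9]
future = [0.3, 0.4, 1.0]


def zero_velocity(x, t):
    """A field that predicts no change at all."""
    return [0.0] * len(x)


def oracle_velocity(x, t):
    """Stand-in for a perfectly trained network on this single pair."""
    return target_velocity(last_context, future)


lci_prediction = euler_integrate(last_context, zero_velocity)
flow_prediction = euler_integrate(last_context, oracle_velocity)
```

Here `lci_prediction` equals `last_context`, while `flow_prediction` reaches `future` up to floating-point error. In practice `oracle_velocity` would be a neural network trained by minimizing `flow_matching_loss` over many scan pairs and sampled times.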
Gaussian splatting typically requires dense observations of the scene and can fail to reconstruct occluded and unobserved areas. We propose a latent diffusion model to reconstruct a complete 3D scene with Gaussian splats, including the occluded parts, from only a single image during inference. Completing the unobserved surfaces of a scene is challenging due to the ambiguity of the plausible surfaces. Conventional methods use a regression-based formulation to predict a single "mode" for occluded and out-of-frustum surfaces, leading to blurriness, implausibility, and failure to capture multiple possible explanations. Thus, they often address this problem only partially: focusing on objects isolated from the background, reconstructing only visible surfaces, or failing to extrapolate far from the input views. In contrast, we propose a generative formulation to learn a distribution of 3D representations of Gaussian splats conditioned on a single input image. To address the lack of ground-truth training data, we propose a Variational AutoReconstructor to learn a latent space only from 2D images in a self-supervised manner, over which a diffusion model is trained. Our method generates faithful reconstructions and diverse samples with the ability to complete the occluded surfaces for high-quality 360-degree renderings.
TLDR: This paper proposes a diffusion model for completing 3D Gaussian splat representations from a single image, enabling the reconstruction of occluded areas and generation of diverse 360-degree renderings.
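The claim that regression yields blurry results for ambiguous surfaces can be illustrated with a toy example. Suppose a hidden surface is equally likely to sit at one of two depths: the prediction that minimizes mean squared error is the average of the two, which matches neither plausible surface, whereas sampling from the distribution returns one valid answer at a time. The depths and sample counts below are made-up values for illustration and have nothing to do with the paper's data or model.

```python
import random


def mse(prediction, targets):
    """Mean squared error of a single constant prediction."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


rng = random.Random(0)

# Hypothetical ambiguity: the occluded surface lies at depth 1.0 or 3.0,
# each equally plausible given the same input image.
plausible_depths = (1.0, 3.0)
targets = [rng.choice(plausible_depths) for _ in range(1000)]

# Regression: the MSE-optimal constant prediction is the mean of the targets,
# which falls between the two modes and corresponds to no real surface.
regression_prediction = sum(targets) / len(targets)

# Generative formulation: draw samples from the (here, empirical) distribution;
# each sample is one of the plausible surfaces.
generative_samples = [rng.choice(targets) for _ in range(5)]

print("regression prediction:", regression_prediction)
print("generative samples:", generative_samples)
```

The averaged depth is the one-dimensional analogue of a blurry rendering. A diffusion model over a learned latent space plays the role of the sampler in the real method.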