AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

May 04, 2025

DualDiff: Dual-branch Diffusion Model for Autonomous Driving with Semantic Fusion

Accurate and high-fidelity driving scene reconstruction relies on fully leveraging scene information as conditioning. However, existing approaches, which primarily use 3D bounding boxes and binary maps for foreground and background control, fall short in capturing the complexity of the scene and integrating multi-modal information. In this paper, we propose DualDiff, a dual-branch conditional diffusion model designed to enhance multi-view driving scene generation. We introduce Occupancy Ray Sampling (ORS), a semantic-rich 3D representation, alongside a numerical driving scene representation, for comprehensive foreground and background control. To improve cross-modal information integration, we propose a Semantic Fusion Attention (SFA) mechanism that aligns and fuses features across modalities. Furthermore, we design a foreground-aware masked (FGM) loss to enhance the generation of tiny objects. DualDiff achieves state-of-the-art performance in FID score, as well as consistently better results in downstream BEV segmentation and 3D object detection tasks.
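One way to picture a foreground-aware masked loss is as a per-pixel error that is reweighted by a foreground mask, so that small objects contribute more to the objective. The sketch below is an illustration only: the function name `foreground_masked_mse`, the `fg_weight` parameter, and the weighted-MSE form are all assumptions, not the formulation used in DualDiff.

```python
def foreground_masked_mse(pred, target, fg_mask, fg_weight=2.0):
    """Weighted MSE: foreground pixels count fg_weight times, background once.

    pred, target: flat lists of floats; fg_mask: flat list of 0/1 flags.
    Hypothetical form for illustration, not taken from the paper.
    """
    if not pred or not (len(pred) == len(target) == len(fg_mask)):
        raise ValueError("inputs must be non-empty and of equal length")
    weighted_sq_err = 0.0
    total_weight = 0.0
    for p, t, m in zip(pred, target, fg_mask):
        w = fg_weight if m else 1.0
        weighted_sq_err += w * (p - t) ** 2
        total_weight += w
    # Normalize by the total weight so the loss stays on the MSE scale.
    return weighted_sq_err / total_weight
```

With `fg_weight` above 1, the same error costs more on a foreground pixel than on a background pixel, which is the general intent the abstract describes.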

TLDR: The paper introduces DualDiff, a dual-branch diffusion model for autonomous driving scene generation that uses a semantic-rich 3D representation and a Semantic Fusion Attention mechanism to achieve state-of-the-art performance in scene reconstruction and downstream tasks.

Relevance: (8/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)

Authors: Haoteng Li, Zhao Yang, Zezhong Qian, Gongpeng Zhao, Yuqi Huang, Jun Yu, Huazheng Zhou, Longjun Liu

PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach

Collecting large-scale crop disease images in the field is labor-intensive and time-consuming. Generative models (GMs) offer an alternative by creating synthetic samples that resemble real-world images. However, existing research primarily relies on Generative Adversarial Network (GAN)-based image-to-image translation and lacks a comprehensive analysis of computational requirements in agriculture. Therefore, this research explores a multi-modal text-to-image approach for generating synthetic crop disease images and is the first to provide computational benchmarking in this context. We trained three Stable Diffusion (SD) variants, namely SDXL, SD3.5M (medium), and SD3.5L (large), and fine-tuned them using DreamBooth and Low-Rank Adaptation (LoRA) fine-tuning techniques to enhance generalization. SD3.5M outperformed the others, with an average memory usage of 18 GB, power consumption of 180 W, and total energy use of 1.02 kWh per 500 images (0.002 kWh per image) during inference. Our results demonstrate SD3.5M's ability to generate 500 synthetic images from just 36 in-field samples in 1.5 hours. We recommend SD3.5M for efficient crop disease data generation.
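The per-image figures follow from simple division of the reported totals. The snippet below reproduces that arithmetic; it assumes the 1.5 hours covers generation of all 500 images, and the variable names are illustrative.

```python
TOTAL_ENERGY_KWH = 1.02  # reported total energy for the inference run
NUM_IMAGES = 500         # reported number of generated images
TOTAL_HOURS = 1.5        # reported generation time (assumed to cover all images)

energy_per_image_kwh = TOTAL_ENERGY_KWH / NUM_IMAGES
seconds_per_image = TOTAL_HOURS * 3600 / NUM_IMAGES

print(f"{energy_per_image_kwh:.5f} kWh per image")  # 0.00204 kWh per image
print(f"{seconds_per_image:.1f} s per image")       # 10.8 s per image
```

The 0.002 kWh per image quoted in the abstract is this value rounded to one significant figure.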

TLDR: This paper explores multi-modal text-to-image Stable Diffusion models, particularly SD3.5M, for generating synthetic crop disease images, provides computational benchmarks for this application, and demonstrates efficient generation from limited real-world samples.

Relevance: (9/10)
Novelty: (7/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (8/10)

Authors: Nitin Rai, Arnold W. Schumann, Nathan Boyd

PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth

Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and auto-regressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.
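The pose-aware warping idea can be illustrated with the textbook reprojection used in self-supervised depth estimation: back-project a pixel using its depth, move it by the relative camera motion, and project it into the second view. The sketch below is a simplification under stated assumptions: a pinhole camera, a translation-only motion (identity rotation), and hypothetical function names. It is not PosePilot's implementation.

```python
def reproject_pixel(u, v, depth, fx, fy, cx, cy, translation):
    """Map pixel (u, v) with known depth into a second camera view.

    Assumes a pinhole camera and a relative motion that is a pure
    translation (tx, ty, tz) of the point into the target camera frame.
    """
    tx, ty, tz = translation
    # Back-project the pixel to a 3D point in the source camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Express the point in the target camera frame (rotation = identity).
    x2, y2, z2 = x + tx, y + ty, z + tz
    if z2 <= 0:
        raise ValueError("point is behind the target camera")
    # Project back to pixel coordinates.
    return fx * x2 / z2 + cx, fy * y2 / z2 + cy


def photometric_l1(source_vals, warped_vals):
    """Mean absolute intensity difference between two equal-length samples."""
    diffs = [abs(a - b) for a, b in zip(source_vals, warped_vals)]
    return sum(diffs) / len(diffs)
```

A photometric warping loss compares intensities sampled at the reprojected locations with the original ones, so errors in depth or pose show up as a larger `photometric_l1` value.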

TLDR: The paper introduces PosePilot, a framework that enhances camera pose controllability in generative world models using self-supervised depth estimation, improving viewpoint synthesis and geometric consistency.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Bu Jin, Weize Li, Baihan Yang, Zhenxin Zhu, Junpeng Jiang, Huan-ang Gao, Haiyang Sun, Kun Zhan, Hengtong Hu, Xueyang Zhang, Peng Jia, Hao Zhao

RAGAR: Retrieval Augment Personalized Image Generation Guided by Recommendation

Personalized image generation is crucial for improving the user experience, as it renders reference images into preferred ones according to user visual preferences. Although effective, existing methods face two main issues. First, they treat all items in the user's historical sequence equally when extracting user preferences, overlooking the varying semantic similarities between historical items and the reference item. Disproportionately high weights for low-similarity items distort the user's visual preferences for the reference item. Second, they rely heavily on consistency between generated and reference images to optimize the generation, which leads to underfitting user preferences and hinders personalization. To address these issues, we propose Retrieval Augment Personalized Image GenerAtion guided by Recommendation (RAGAR). Our approach uses a retrieval mechanism to assign different weights to historical items according to their similarities to the reference item, thereby extracting more refined user visual preferences for the reference item. We then introduce a novel ranking task based on a multi-modal ranking model to optimize the personalization of the generated images instead of forcing dependence on consistency. Extensive experiments and human evaluations on three real-world datasets demonstrate that RAGAR achieves significant improvements in both personalization and semantic metrics compared to five baselines.
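The retrieval-based weighting can be pictured as a softmax over the similarity between each historical item and the reference item, followed by a weighted average of the history embeddings. This is a minimal sketch under assumptions: cosine similarity, a softmax with a `temperature` parameter, and the function names are all illustrative choices, not details confirmed by the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_weights(history, reference, temperature=1.0):
    """Softmax over cosine similarity of each history item to the reference."""
    sims = [cosine(h, reference) / temperature for h in history]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def preference_vector(history, reference, temperature=1.0):
    """Similarity-weighted average of the history embeddings."""
    weights = similarity_weights(history, reference, temperature)
    dim = len(reference)
    return [sum(w * h[i] for w, h in zip(weights, history)) for i in range(dim)]
```

Lowering `temperature` concentrates the weight on the most similar items, while a very large value approaches the equal weighting that the abstract criticizes.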

TLDR: The paper introduces RAGAR, a novel approach to personalized image generation that uses a retrieval mechanism and a ranking task to better incorporate user preferences, addressing the limitations of existing methods that weight all user history items equally and rely heavily on consistency with reference images.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Run Ling, Wenji Wang, Yuting Liu, Guibing Guo, Linying Jiang, Xingwei Wang

Enhancing the Learning Experience: Using Vision-Language Models to Generate Questions for Educational Videos

Web-based educational videos offer flexible learning opportunities and are becoming increasingly popular. However, improving user engagement and knowledge retention remains a challenge. Automatically generated questions can activate learners and support their knowledge acquisition. Further, they can help teachers and learners assess their understanding. While large language and vision-language models have been employed in various tasks, their application to question generation for educational videos remains underexplored. In this paper, we investigate the capabilities of current vision-language models for generating learning-oriented questions for educational video content. We assess (1) out-of-the-box models' performance; (2) fine-tuning effects on content-specific question generation; (3) the impact of different video modalities on question quality; and (4) in a qualitative study, question relevance, answerability, and difficulty levels of generated questions. Our findings delineate the capabilities of current vision-language models, highlighting the need for fine-tuning and addressing challenges in question diversity and relevance. We identify requirements for future multimodal datasets and outline promising research directions.

TLDR: This paper explores the use of vision-language models for automatically generating questions for educational videos, assessing their performance and identifying areas for improvement via fine-tuning and data collection.

Relevance: (7/10)
Novelty: (6/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)

Authors: Markos Stamatakis, Joshua Berger, Christian Wartena, Ralph Ewerth, Anett Hoppe

Vision and Intention Boost Large Language Model in Long-Term Action Anticipation

Long-term action anticipation (LTA) aims to predict future actions over an extended period. Previous approaches primarily focus on learning exclusively from video data but lack prior knowledge. Recent research leverages large language models (LLMs) by utilizing text-based inputs, which suffer from severe information loss. To tackle the limitations that single-modality methods face, we propose a novel Intention-Conditioned Vision-Language (ICVL) model that fully leverages the rich semantic information of visual data and the powerful reasoning capabilities of LLMs. Considering intention as a high-level concept guiding the evolution of actions, we first propose to employ a vision-language model (VLM) to infer behavioral intentions as comprehensive textual features directly from video inputs. The inferred intentions are then fused with visual features through a multi-modality fusion strategy, resulting in intention-enhanced visual representations. These enhanced visual representations, along with textual prompts, are fed into an LLM for future action anticipation. Furthermore, we propose an effective example selection strategy that jointly considers visual and textual similarities, providing more relevant and informative examples for in-context learning. Extensive experiments with state-of-the-art performance on the Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ datasets demonstrate the effectiveness and superiority of the proposed method.
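An example selection strategy that jointly considers visual and textual similarity could be sketched as a blended score followed by a top-k cut. The blend weight `alpha`, the use of cosine similarity, and the dictionary layout below are assumptions for illustration; the abstract does not specify how the two similarities are combined.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_examples(query, candidates, k=2, alpha=0.5):
    """Return indices of the k candidates with the highest blended similarity.

    query and each candidate are dicts with 'visual' and 'text' embeddings.
    alpha weights visual similarity; (1 - alpha) weights textual similarity.
    """
    scored = []
    for idx, cand in enumerate(candidates):
        score = (alpha * cosine(query["visual"], cand["visual"])
                 + (1 - alpha) * cosine(query["text"], cand["text"]))
        scored.append((score, idx))
    # Highest score first; ties keep their original order (stable sort).
    scored.sort(key=lambda item: item[0], reverse=True)
    return [idx for _, idx in scored[:k]]
```

The selected indices would then pick the in-context examples placed in the LLM prompt.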

TLDR: The paper introduces an Intention-Conditioned Vision-Language (ICVL) model for long-term action anticipation, combining visual data and LLMs with an intention-aware fusion strategy and example selection, and demonstrates state-of-the-art results on multiple datasets.

Relevance: (6/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)

Authors: Congqi Cao, Lanshu Hu, Yating Yu, Yanning Zhang

Mitigating Group-Level Fairness Disparities in Federated Visual Language Models

Visual language models (VLMs) have shown remarkable capabilities in multimodal tasks but face challenges in maintaining fairness across demographic groups, particularly when deployed in federated learning (FL) environments. This paper addresses the critical issue of group fairness in federated VLMs by introducing FVL-FP, a novel framework that combines FL with fair prompt tuning techniques. We focus on mitigating demographic biases while preserving model performance through three innovative components: (1) Cross-Layer Demographic Fair Prompting (CDFP), which adjusts potentially biased embeddings through counterfactual regularization; (2) Demographic Subspace Orthogonal Projection (DSOP), which removes demographic bias in image representations by mapping fair prompt text to group subspaces; and (3) Fair-aware Prompt Fusion (FPF), which dynamically balances client contributions based on both performance and fairness metrics. Extensive evaluations across four benchmark datasets demonstrate that our approach reduces demographic disparity by an average of 45% compared to standard FL approaches, while maintaining task performance within 6% of state-of-the-art results. FVL-FP effectively addresses the challenges of non-IID data distributions in federated settings and introduces minimal computational overhead while providing significant fairness benefits. Our work presents a parameter-efficient solution to the critical challenge of ensuring equitable performance across demographic groups in privacy-preserving multimodal systems.
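The orthogonal projection underlying a component such as DSOP can be shown in its simplest form: removing from an embedding the part that lies along a single demographic direction. The sketch assumes a one-dimensional subspace and a hypothetical function name; how FVL-FP actually constructs the group subspaces from prompt text is not described in the abstract.

```python
def remove_direction(embedding, direction):
    """Project embedding onto the orthogonal complement of direction.

    Computes e - ((e . d) / (d . d)) * d, so the result has zero
    component along d. A zero direction leaves the embedding unchanged.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(embedding)
    scale = sum(e * d for e, d in zip(embedding, direction)) / norm_sq
    return [e - scale * d for e, d in zip(embedding, direction)]
```

After the projection, the dot product between the result and `direction` is zero, meaning a linear probe along that direction can no longer separate the groups.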

TLDR: This paper introduces FVL-FP, a novel federated learning framework that uses fair prompt tuning to mitigate group-level fairness disparities in visual language models, achieving significant fairness improvements with minimal performance impact.

Relevance: (4/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (6/10)

Authors: Chaomeng Chen, Zitong Yu, Junhao Dong, Sen Su, Linlin Shen, Shutao Xia, Xiaochun Cao

Seeing Heat with Color -- RGB-Only Wildfire Temperature Inference from SAM-Guided Multimodal Distillation using Radiometric Ground Truth

High-fidelity wildfire monitoring using Unmanned Aerial Vehicles (UAVs) typically requires multimodal sensing, especially RGB and thermal imagery, which increases hardware cost and power consumption. This paper introduces SAM-TIFF, a novel teacher-student distillation framework for pixel-level wildfire temperature prediction and segmentation using RGB input only. A multimodal teacher network trained on paired RGB-Thermal imagery and radiometric TIFF ground truth distills knowledge to a unimodal RGB student network, enabling thermal-sensor-free inference. Segmentation supervision is generated using a hybrid approach that combines Segment Anything Model (SAM)-guided mask generation and selection via TOPSIS with a Canny edge detection and Otsu's thresholding pipeline for automatic point prompt selection. Our method is the first to perform per-pixel temperature regression from RGB UAV data, demonstrating strong generalization on the recent FLAME 3 dataset. This work lays the foundation for lightweight, cost-effective UAV-based wildfire monitoring systems without thermal sensors.
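Otsu's thresholding, one step in the point prompt pipeline mentioned above, picks the intensity threshold that maximizes the variance between the two resulting classes. The sketch below is a plain-Python version of that standard algorithm for integer intensities; how SAM-TIFF combines it with Canny edges to choose prompts is not covered here.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    pixels: iterable of ints in [0, levels). Pixels <= t form class 0,
    the rest form class 1. Returns 0 if no valid split exists.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in class 0
    sum0 = 0  # intensity sum of class 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # class 0 still empty
        w1 = total - w0
        if w1 == 0:
            break     # class 1 empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance up to a constant factor of 1 / total**2.
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t
```

On a clearly bimodal intensity distribution, the returned threshold falls between the two modes, which gives a simple hot-versus-background split.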

TLDR: This paper presents a teacher-student distillation framework (SAM-TIFF) that predicts per-pixel wildfire temperatures from RGB imagery only, using SAM-guided segmentation and radiometric ground truth, enabling cheaper UAV-based wildfire monitoring.

Relevance: (3/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (5/10)

Authors: Michael Marinaccio, Fatemeh Afghah

Multimodal and Multiview Deep Fusion for Autonomous Marine Navigation

We propose a cross-attention transformer-based method for multimodal sensor fusion to build a bird's-eye view of a vessel's surroundings, supporting safer autonomous marine navigation. The model deeply fuses multiview RGB and long-wave infrared images with sparse LiDAR point clouds. Training also integrates X-band radar and electronic chart data to inform predictions. The resulting view provides a detailed, reliable scene representation, improving navigational accuracy and robustness. Real-world sea trials confirm the method's effectiveness even in adverse weather and complex maritime settings.
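Cross-attention lets one set of tokens (for instance bird's-eye-view grid queries) gather information from another set (for instance camera or LiDAR features). The sketch below shows single-head scaled dot-product cross-attention in plain Python; it omits the learned projections and multi-head structure of a real transformer, and the query and sensor token roles are assumptions for illustration.

```python
import math


def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one source and
    keys/values from another. All arguments are lists of float vectors;
    keys and values must have the same number of rows."""
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(dim)])
    return outputs
```

Each output row is a convex combination of the value vectors, so a query that matches one sensor token strongly will mostly copy that token's features.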

TLDR: This paper introduces a cross-attention transformer-based method for multimodal sensor fusion that creates a bird's-eye view for autonomous marine navigation, using RGB, infrared, LiDAR, radar, and electronic chart data. Real-world sea trials demonstrate its effectiveness.

Relevance: (2/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (4/10)

Authors: Dimitrios Dagdilelis, Panagiotis Grigoriadis, Roberto Galeazzi