AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

May 02, 2025

T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation

Text-to-video generative models have made significant strides in recent years, producing high-quality videos that excel in both aesthetic appeal and accurate instruction following, and have become central to digital art creation and user engagement online. Yet, despite these advancements, their ability to respect fundamental physical laws remains largely untested: many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics, resulting in unrealistic or even misleading content. Existing physical-evaluation benchmarks typically rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts, and thus overlook both human judgment and first-principles physics. To fill this gap, we introduce T2VPhysBench, a first-principles benchmark that systematically evaluates whether state-of-the-art text-to-video systems, both open-source and commercial, obey twelve core physical laws, including Newtonian mechanics, conservation principles, and phenomenological effects. Our benchmark employs a rigorous human evaluation protocol and includes three targeted studies: (1) an overall compliance assessment showing that all models score below 0.60 on average in each law category; (2) a prompt-hint ablation revealing that even detailed, law-specific hints fail to remedy physics violations; and (3) a counterfactual robustness test demonstrating that models often generate videos that explicitly break physical rules when so instructed. The results expose persistent limitations in current architectures and offer concrete insights for guiding future research toward truly physics-aware video generation.
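
A minimal sketch of the kind of per-law compliance aggregation the human-evaluation protocol implies; the law names, rating scale, and data layout here are illustrative assumptions, not the authors' released tooling.

```python
from collections import defaultdict
from statistics import mean

# Each record: (model, law_category, human compliance rating in [0, 1]).
ratings = [
    ("model_a", "newtonian_mechanics", 0.5),
    ("model_a", "energy_conservation", 0.4),
    ("model_b", "newtonian_mechanics", 0.7),
    ("model_b", "energy_conservation", 0.3),
]

by_model_law = defaultdict(list)
for model, law, score in ratings:
    by_model_law[(model, law)].append(score)

# Average compliance per (model, law category); the paper reports all models below 0.60 per category.
for (model, law), scores in sorted(by_model_law.items()):
    print(f"{model:8s} {law:22s} compliance={mean(scores):.2f}")
```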

TLDR: The paper introduces T2VPhysBench, a new benchmark to evaluate the physical consistency of text-to-video models, revealing their significant limitations in adhering to fundamental physical laws.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (10/10)
Potential Impact: (8/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, Jiale Zhao

Controllable Weather Synthesis and Removal with Video Diffusion Models

Generating realistic and controllable weather effects in videos is valuable for many applications. Physics-based weather simulation requires precise reconstructions that are hard to scale to in-the-wild videos, while current video editing often lacks realism and control. In this work, we introduce WeatherWeaver, a video diffusion model that synthesizes diverse weather effects -- including rain, snow, fog, and clouds -- directly into any input video without the need for 3D modeling. Our model provides precise control over weather effect intensity and supports blending various weather types, ensuring both realism and adaptability. To overcome the scarcity of paired training data, we propose a novel data strategy combining synthetic videos, generative image editing, and auto-labeled real-world videos. Extensive evaluations show that our method outperforms state-of-the-art methods in weather simulation and removal, providing high-quality, physically plausible, and scene-identity-preserving results over various real-world videos.
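
As a rough illustration of the "precise control over weather effect intensity" and blending of weather types described above, the control signal could be as simple as a per-type intensity vector; the field names and the way such a vector would be consumed by the model are assumptions, not WeatherWeaver's actual interface.

```python
# Toy construction of a weather conditioning vector (hypothetical interface).
WEATHER_TYPES = ["rain", "snow", "fog", "clouds"]

def weather_condition_vector(intensities):
    """Map {weather type: intensity in [0, 1]} to a fixed-order conditioning vector."""
    vec = []
    for w in WEATHER_TYPES:
        v = float(intensities.get(w, 0.0))
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"intensity for {w} must be in [0, 1]")
        vec.append(v)
    return vec

# Blend light rain with heavy fog; zeros mean the effect is absent.
print(weather_condition_vector({"rain": 0.3, "fog": 0.9}))
```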

TLDR: The paper introduces WeatherWeaver, a video diffusion model for synthesizing and removing controllable weather effects in videos, using a novel data strategy to address the lack of paired training data.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Chih-Hao Lin, Zian Wang, Ruofan Liang, Yuxuan Zhang, Sanja Fidler, Shenlong Wang, Zan Gojcic

T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1
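
A minimal, generic sketch of group-relative (GRPO-style) advantages computed from an ensemble of generation rewards, in the spirit of the BiCoT-GRPO description above; the reward functions and numbers below are placeholders, not the paper's setup.

```python
import numpy as np

def ensemble_reward(image, prompt, reward_fns):
    """Average several generation rewards (e.g. prompt alignment, aesthetics)."""
    return float(np.mean([fn(image, prompt) for fn in reward_fns]))

def grpo_advantages(group_rewards, eps=1e-6):
    """Standardize rewards within the group of samples drawn for one prompt."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Toy usage: four samples generated for one prompt, scored by two dummy rewards.
reward_fns = [lambda img, p: img["alignment"], lambda img, p: img["aesthetic"]]
group = [{"alignment": a, "aesthetic": b}
         for a, b in [(0.2, 0.4), (0.9, 0.7), (0.5, 0.5), (0.1, 0.8)]]
rewards = [ensemble_reward(img, "a red cube on a glass table", reward_fns) for img in group]
print(grpo_advantages(rewards))
```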

TLDR: The paper introduces T2I-R1, a novel text-to-image generation model that uses reinforcement learning with a bi-level chain-of-thought (CoT) reasoning process to improve performance, achieving state-of-the-art results on the T2I-CompBench and WISE benchmarks.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li

KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution

Lip synchronization, the task of aligning lip movements in an existing video with new input audio, is typically framed as a simpler variant of audio-driven facial animation. However, in addition to suffering from the usual issues in talking head generation (e.g., temporal consistency), lip synchronization presents significant new challenges, such as expression leakage from the input video and facial occlusions, which can severely impact real-world applications like automated dubbing but are often neglected in existing works. To address these shortcomings, we present KeySync, a two-stage framework that solves the issue of temporal consistency while also incorporating solutions for leakage and occlusions through a carefully designed masking strategy. We show that KeySync achieves state-of-the-art results in lip reconstruction and cross-synchronization, improving visual quality and reducing expression leakage according to LipLeak, our novel leakage metric. Furthermore, we demonstrate the effectiveness of our new masking approach in handling occlusions and validate our architectural choices through several ablation studies. Code and model weights can be found at https://antonibigata.github.io/KeySync.
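
An illustrative sketch of the general idea of masking the mouth region of the input video so its expression cannot leak into the generated lips; the fixed rectangular region and the ratio used below are assumptions, not KeySync's actual mask design.

```python
import numpy as np

def mask_lower_face(frame, face_box, lower_frac=0.45):
    """Zero out the lower part of the detected face box in a copy of the frame.

    frame: (H, W, 3) uint8 array; face_box: (x0, y0, x1, y1) in pixels.
    """
    x0, y0, x1, y1 = face_box
    masked = frame.copy()
    y_start = int(y1 - lower_frac * (y1 - y0))   # start of the masked mouth region
    masked[y_start:y1, x0:x1] = 0
    return masked

frame = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
print(mask_lower_face(frame, (64, 64, 192, 192)).shape)
```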

TLDR: KeySync is a two-stage framework that achieves state-of-the-art lip synchronization by addressing temporal consistency, expression leakage, and occlusions using a novel masking strategy and leakage metric.

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Antoni Bigata, Rodrigo Mira, Stella Bounareli, Michał Stypułkowski, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic

JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefits and outstanding image prior of a state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose: adaptive scheduling weights, which depend on the noise level of each modality, and an unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation, simply by controlling the timestep of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a viable alternative to conditional generation. The project page is available at https://byungki-k.github.io/JointDiT/.
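
A rough sketch of how per-modality timesteps could select the task, as the abstract describes (joint generation, depth estimation, depth-conditioned image generation); the step count, schedules, and sampler are assumptions.

```python
import random

T_MAX = 1000  # assumed number of diffusion steps

def sample_branch_timesteps(task):
    """Return (t_rgb, t_depth); 0 means that branch is conditioned on a clean input."""
    if task == "joint_generation":          # both modalities are denoised from noise
        return random.randint(1, T_MAX), random.randint(1, T_MAX)
    if task == "depth_estimation":          # RGB is given, only depth is generated
        return 0, random.randint(1, T_MAX)
    if task == "depth_conditioned_image":   # depth is given, only RGB is generated
        return random.randint(1, T_MAX), 0
    raise ValueError(f"unknown task: {task}")

print(sample_branch_timesteps("depth_estimation"))
```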

TLDR: JointDiT is a diffusion transformer that jointly models RGB and depth using adaptive scheduling weights and unbalanced timestep sampling, achieving strong performance in joint generation, depth estimation, and depth-conditioned image generation.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Kwon Byung-Ki, Qi Dai, Lee Hyoseok, Chong Luo, Tae-Hyun Oh

Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models

The rapid evolution of social media has provided enhanced communication channels for individuals to create online content, enabling them to express their thoughts and opinions. Multimodal memes, often utilized for playful or humorous expressions with visual and textual elements, are sometimes misused to disseminate hate speech against individuals or groups. While the detection of hateful memes is well-researched, developing effective methods to transform hateful content in memes remains a significant challenge. Leveraging the powerful generation and reasoning capabilities of Vision-Language Models (VLMs), we address the tasks of detecting and mitigating hateful content. This paper presents two key contributions: first, a definition-guided prompting technique for detecting hateful memes, and second, a unified framework for mitigating hateful content in memes, named UnHateMeme, which works by replacing hateful textual and/or visual components. With our definition-guided prompts, VLMs achieve impressive performance on hateful memes detection task. Furthermore, our UnHateMeme framework, integrated with VLMs, demonstrates a strong capability to convert hateful memes into non-hateful forms that meet human-level criteria for hate speech and maintain multimodal coherence between image and text. Through empirical experiments, we show the effectiveness of state-of-the-art pretrained VLMs such as LLaVA, Gemini and GPT-4o on the proposed tasks, providing a comprehensive analysis of their respective strengths and limitations for these tasks. This paper aims to shed light on important applications of VLMs for ensuring safe and respectful online environments.
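
A hypothetical illustration of a definition-guided prompt for hateful-meme detection with a VLM; the definition text and output format are placeholders, not the paper's exact prompt.

```python
HATE_DEFINITION = (
    "Hate speech is content that attacks or demeans a person or group on the "
    "basis of attributes such as race, religion, ethnicity, gender, disability, "
    "or sexual orientation."
)

def build_detection_prompt(meme_text):
    """Assemble a detection prompt that grounds the VLM's decision in a definition."""
    return (
        f"Definition: {HATE_DEFINITION}\n"
        f"The meme contains this overlaid text: \"{meme_text}\"\n"
        "Considering both the image and the text under the definition above, "
        "answer with exactly one word: 'hateful' or 'not-hateful'."
    )

print(build_detection_prompt("example overlay text"))
```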

TLDR: This paper introduces a method for detecting and mitigating hateful content in multimodal memes using vision-language models (VLMs), including a definition-guided prompting technique for detection and a framework (UnHateMeme) for transforming hateful memes into non-hateful ones by replacing textual/visual components.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Minh-Hao Van, Xintao Wu

Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis

The rising popularity of immersive visual experiences has increased interest in stereoscopic 3D video generation. Despite significant advances in video synthesis, creating 3D videos remains challenging due to the relative scarcity of 3D video data. We propose a simple approach for transforming a text-to-video generator into a video-to-stereo generator. Given an input video, our framework automatically produces the video frames from a shifted viewpoint, enabling a compelling 3D effect. Prior and concurrent approaches for this task typically operate in multiple phases, first estimating video disparity or depth, then warping the video accordingly to produce a second view, and finally inpainting the disoccluded regions. This approach inherently fails when the scene involves specular surfaces or transparent objects. In such cases, single-layer disparity estimation is insufficient, resulting in artifacts and incorrect pixel shifts during warping. Our work bypasses these restrictions by directly synthesizing the new viewpoint, avoiding any intermediate steps. This is achieved by leveraging a pre-trained video model's priors on geometry, object materials, optics, and semantics, without relying on external geometry models or manually disentangling geometry from the synthesis process. We demonstrate the advantages of our approach in complex, real-world scenarios featuring diverse object materials and compositions. See videos on https://video-eye2eye.github.io
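
For context, a sketch of the conventional disparity-warp baseline that the abstract contrasts with (not Eye2Eye's method): shift each pixel horizontally by its disparity and leave disoccluded pixels for inpainting. The forward-splatting details are deliberately simplified.

```python
import numpy as np

def warp_with_disparity(frame, disparity):
    """frame: (H, W, 3) uint8; disparity: (H, W) horizontal shift in pixels."""
    h, w = disparity.shape
    out = np.zeros_like(frame)
    hole = np.ones((h, w), dtype=bool)                 # disoccluded pixels to inpaint later
    for y in range(h):
        xs = (np.arange(w) + np.round(disparity[y]).astype(int)).clip(0, w - 1)
        out[y, xs] = frame[y]
        hole[y, xs] = False
    return out, hole

frame = np.random.randint(0, 256, (4, 8, 3), dtype=np.uint8)
disp = np.full((4, 8), 2.0)
warped, holes = warp_with_disparity(frame, disp)
print(holes.sum(), "pixels left for inpainting")
```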

TLDR: The paper proposes a novel "Eye2Eye" approach for monocular-to-stereo video synthesis that directly synthesizes the second viewpoint using a pre-trained video model, bypassing explicit depth or disparity estimation and mitigating artifacts in complex scenes.

Relevance: (8/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Michal Geyer, Omer Tov, Linyi Jin, Richard Tucker, Inbar Mosseri, Tali Dekel, Noah Snavely

RayZer: A Self-supervised Large View Synthesis Model

We present RayZer, a self-supervised multi-view 3D vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. Concretely, RayZer takes unposed and uncalibrated images as input, recovers camera parameters, reconstructs a scene representation, and synthesizes novel views. During training, RayZer relies solely on its self-predicted camera poses to render target views, eliminating the need for any ground-truth camera annotations and allowing RayZer to be trained with 2D image supervision. The emerging 3D awareness of RayZer is attributed to two key factors. First, we design a self-supervised framework, which achieves 3D-aware auto-encoding of input images by disentangling camera and scene representations. Second, we design a transformer-based model in which the only 3D prior is the ray structure, connecting camera, pixel, and scene simultaneously. RayZer demonstrates novel view synthesis performance comparable to, or even better than, "oracle" methods that rely on pose annotations in both training and testing. Project: https://hwjiang1510.github.io/RayZer/
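
A heavily simplified, runnable sketch of the self-supervised signal described above: cameras are predicted from unposed images and the only loss is a photometric error on a re-rendered held-out view. All modules here are tiny stand-ins, not RayZer's transformer or its ray-based representation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 3 * 32 * 32  # flattened toy image size

class TinyRayZer(nn.Module):
    def __init__(self):
        super().__init__()
        self.pose_head = nn.Linear(D, 6)            # 6-DoF pose per view (self-predicted)
        self.scene_head = nn.Linear(D + 6, 64)      # latent scene from posed context views
        self.render_head = nn.Linear(64 + 6, D)     # render a view given scene + pose

    def forward(self, views):                       # views: (B, V, D), unposed
        poses = self.pose_head(views)               # (B, V, 6)
        ctx = torch.cat([views[:, :-1], poses[:, :-1]], dim=-1)
        scene = self.scene_head(ctx).mean(dim=1)    # (B, 64)
        return self.render_head(torch.cat([scene, poses[:, -1]], dim=-1))

model = TinyRayZer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
views = torch.rand(2, 4, D)                         # four unposed views per scene
pred = model(views)
loss = F.mse_loss(pred, views[:, -1])               # 2D supervision only: the held-out view
loss.backward()
opt.step()
print(loss.item())
```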

TLDR: RayZer is a self-supervised model that synthesizes novel views from unposed images, recovering camera parameters and reconstructing scenes without 3D supervision, and achieves comparable or better performance than methods requiring pose annotations.

Relevance: (6/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, Georgios Pavlakos

GuideSR: Rethinking Guidance for One-Step High-Fidelity Diffusion-Based Super-Resolution

In this paper, we propose GuideSR, a novel single-step diffusion-based image super-resolution (SR) model specifically designed to enhance image fidelity. Existing diffusion-based SR approaches typically adapt pre-trained generative models to image restoration tasks by adding extra conditioning on a VAE-downsampled representation of the degraded input, which often compromises structural fidelity. GuideSR addresses this limitation by introducing a dual-branch architecture comprising: (1) a Guidance Branch that preserves high-fidelity structures from the original-resolution degraded input, and (2) a Diffusion Branch, which leverages a pre-trained latent diffusion model to enhance perceptual quality. Unlike conventional conditioning mechanisms, our Guidance Branch features a tailored structure for image restoration tasks, combining Full Resolution Blocks (FRBs) with channel attention and an Image Guidance Network (IGN) with guided attention. By embedding detailed structural information directly into the restoration pipeline, GuideSR produces sharper and more visually consistent results. Extensive experiments on benchmark datasets demonstrate that GuideSR achieves state-of-the-art performance while maintaining the low computational cost of single-step approaches, with up to a 1.39 dB PSNR gain on challenging real-world datasets. Our approach consistently outperforms existing methods across various reference-based metrics including PSNR, SSIM, LPIPS, DISTS and FID, further representing a practical advancement for real-world image restoration.
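
A generic squeeze-and-excitation style channel-attention block, shown only to illustrate the "Full Resolution Blocks with channel attention" mentioned above; GuideSR's actual FRB/IGN design may differ substantially.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)               # squeeze: global spatial average
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.mlp(self.pool(x))                  # excite: per-channel reweighting

x = torch.randn(1, 64, 32, 32)
print(ChannelAttention(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```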

TLDR: GuideSR introduces a novel single-step diffusion-based image super-resolution model with a dual-branch architecture that enhances image fidelity, preserving high-fidelity structures from the original-resolution degraded input while a pre-trained latent diffusion model enhances perceptual quality.

Relevance: (6/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Aditya Arora, Zhengzhong Tu, Yufei Wang, Ruizheng Bai, Jian Wang, Sizhuo Ma

Efficient Neural Video Representation with Temporally Coherent Modulation

Implicit neural representations (INRs) have found successful applications across diverse domains. To employ INRs in real-world settings, it is important to speed up training. In the field of INRs for video, the state-of-the-art approach employs grid-type parametric encoding and achieves a faster encoding speed than its predecessors. However, the grid usage, which does not consider the video's dynamic nature, leads to redundant use of trainable parameters. As a result, it has significantly lower parameter efficiency and a higher bitrate compared to NeRV-style methods that do not use a parametric encoding. To address this problem, we propose Neural Video representation with Temporally coherent Modulation (NVTM), a novel framework that can capture the dynamic characteristics of video. By decomposing the spatio-temporal 3D video data into a set of 2D grids with flow information, NVTM learns video representations rapidly and uses parameters efficiently. Our framework processes temporally corresponding pixels at once, resulting in the fastest encoding speed at a reasonable video quality, with a more than 3x speedup over NeRV-style methods. It also achieves average PSNR/LPIPS improvements of 1.54 dB/0.019 on UVG (Dynamic) (even with 10% fewer parameters) and 1.84 dB/0.013 on MCL-JCV (Dynamic) compared to previous grid-type works. Extending the framework to compression tasks, we demonstrate performance comparable to video compression standards (H.264, HEVC) and recent INR approaches for video compression. Additionally, we perform extensive experiments demonstrating the superior performance of our algorithm across diverse tasks, encompassing super-resolution, frame interpolation, and video inpainting. The project page is https://sujiikim.github.io/NVTM/.
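
A loose illustration of sampling features for temporally corresponding pixels from a shared 2D grid after warping with flow, in the spirit of the decomposition described above; the grid layout, flow source, and interpolation details are assumptions, not NVTM's actual design.

```python
import torch
import torch.nn.functional as F

def sample_aligned_features(grid, coords_t, flow_t):
    """grid: (1, C, H, W) shared 2D feature grid for a temporal segment.
    coords_t: (1, N, 2) normalized pixel coords in [-1, 1] at time t.
    flow_t:   (1, N, 2) normalized flow mapping time-t coords into the grid frame."""
    aligned = (coords_t + flow_t).clamp(-1, 1).unsqueeze(2)   # (1, N, 1, 2)
    feats = F.grid_sample(grid, aligned, align_corners=True)  # (1, C, N, 1)
    return feats.squeeze(-1).transpose(1, 2)                  # (1, N, C)

grid = torch.randn(1, 16, 32, 32)
coords = torch.rand(1, 5, 2) * 2 - 1
flow = torch.zeros(1, 5, 2)
print(sample_aligned_features(grid, coords, flow).shape)      # torch.Size([1, 5, 16])
```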

TLDR: The paper introduces a novel neural video representation framework (NVTM) that speeds up training and improves parameter efficiency by using temporally coherent modulation, outperforming existing methods on video representation tasks.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Seungjun Shin, Suji Kim, Dokwan Oh

Quaternion Wavelet-Conditioned Diffusion Models for Image Super-Resolution

Image super-resolution (SR) is a fundamental problem in computer vision with broad applications spanning from medical imaging to satellite analysis. The ability to reconstruct high-resolution images from low-resolution inputs is crucial for enhancing downstream tasks such as object detection and segmentation. While deep learning has significantly advanced SR, achieving high-quality reconstructions with fine-grained details and realistic textures remains challenging, particularly at high upscaling factors. Recent approaches leveraging diffusion models have demonstrated promising results, yet they often struggle to balance perceptual quality with structural fidelity. In this work, we introduce ResQu, a novel SR framework that integrates a quaternion wavelet preprocessing framework with latent diffusion models, incorporating a new quaternion wavelet- and time-aware encoder. Unlike prior methods that simply apply wavelet transforms within diffusion models, our approach enhances the conditioning process by exploiting quaternion wavelet embeddings, which are dynamically integrated at different stages of denoising. Furthermore, we also leverage the generative priors of foundation models such as Stable Diffusion. Extensive experiments on domain-specific datasets demonstrate that our method achieves outstanding SR results, in many cases outperforming existing approaches in perceptual quality and standard evaluation metrics. The code will be available after the revision process.

TLDR: The paper introduces ResQu, a super-resolution framework using quaternion wavelet-conditioned diffusion models and Stable Diffusion priors to enhance perceptual quality and structural fidelity, with claimed superior performance on domain-specific datasets.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Luigi Sigillo, Christian Bianchi, Danilo Comminiello

Multimodal Masked Autoencoder Pre-training for 3D MRI-Based Brain Tumor Analysis with Missing Modalities

Multimodal magnetic resonance imaging (MRI) constitutes the first line of investigation for clinicians in the care of brain tumors, providing crucial insights for surgery planning, treatment monitoring, and biomarker identification. Pre-training on large datasets has been shown to help models learn transferable representations and adapt with minimal labeled data. This behavior is especially valuable in medical imaging, where annotations are often scarce. However, applying this paradigm to multimodal medical data introduces a challenge: most existing approaches assume that all imaging modalities are available during both pre-training and fine-tuning. In practice, missing modalities often occur due to acquisition issues, specialist unavailability, or specific experimental designs on small in-house datasets. Consequently, a common approach involves training a separate model for each desired modality combination, making the process both resource-intensive and impractical for clinical use. Therefore, we introduce BM-MAE, a masked image modeling pre-training strategy tailored for multimodal MRI data. The same pre-trained model seamlessly adapts to any combination of available modalities, extracting rich representations that capture both intra- and inter-modal information. This allows fine-tuning on any subset of modalities without requiring architectural changes, while still benefiting from a model pre-trained on the full set of modalities. Extensive experiments show that the proposed pre-training strategy outperforms or remains competitive with baselines that require separate pre-training for each modality subset, while substantially surpassing training from scratch on several downstream tasks. Additionally, it can quickly and efficiently reconstruct missing modalities, highlighting its practical value. Code and trained models are available at: https://github.com/Lucas-rbnt/bmmae
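
A minimal sketch of sampling a random subset of MRI modalities per training example, so that one pre-trained model can later be fine-tuned on any available combination; the modality names and the subset-level (rather than patch-level) masking granularity are assumptions about the spirit of the approach, not BM-MAE's released code.

```python
import random

MODALITIES = ["T1", "T1ce", "T2", "FLAIR"]

def sample_modality_subset(min_keep=1):
    """Keep a non-empty random subset of modalities; the rest are treated as missing."""
    k = random.randint(min_keep, len(MODALITIES))
    return sorted(random.sample(MODALITIES, k))

for _ in range(3):
    print(sample_modality_subset())
```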

TLDR: This paper introduces BM-MAE, a masked autoencoder pre-training strategy for 3D MRI brain tumor analysis that addresses the challenge of missing modalities, allowing seamless adaptation and improved performance compared to modality-specific pre-training.

Relevance: (3/10)
Novelty: (7/10)
Clarity: (8/10)
Potential Impact: (6/10)
Overall: (5/10)
Read Paper (PDF)

Authors: Lucas Robinet, Ahmad Berjaoui, Elizabeth Cohen-Jonathan Moyal

Neuroevolution of Self-Attention Over Proto-Objects

Proto-objects - image regions that share common visual properties - offer a promising alternative to traditional attention mechanisms based on rectangular-shaped image patches in neural networks. Although previous work demonstrated that evolving a patch-based hard-attention module alongside a controller network could achieve state-of-the-art performance in visual reinforcement learning tasks, our approach leverages image segmentation to work with higher-level features. By operating on proto-objects rather than fixed patches, we significantly reduce the representational complexity: each image decomposes into fewer proto-objects than regular patches, and each proto-object can be efficiently encoded as a compact feature vector. This enables a substantially smaller self-attention module that processes richer semantic information. Our experiments demonstrate that this proto-object-based approach matches or exceeds the state-of-the-art performance of patch-based implementations with 62% fewer parameters and 2.6 times less training time.
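
A toy numpy self-attention over per-proto-object feature vectors (rather than fixed image patches); the 5-D features (mean RGB plus normalized centroid) and the single-head attention below are illustrative assumptions, not the evolved module from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(features, d_k=8, seed=0):
    """features: (N, F), one compact vector per proto-object."""
    rng = np.random.default_rng(seed)
    n, f = features.shape
    Wq, Wk, Wv = (rng.standard_normal((f, d_k)) for _ in range(3))
    Q, K, V = features @ Wq, features @ Wk, features @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))
    return attn @ V                                     # (N, d_k) attended proto-object features

# Six proto-objects, each encoded as [mean R, mean G, mean B, cx, cy].
protos = np.random.default_rng(1).random((6, 5))
print(self_attention(protos).shape)                    # (6, 8)
```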

TLDR: This paper introduces a neuroevolution approach to self-attention that uses proto-objects (image segments) instead of patches, resulting in a more efficient self-attention module with comparable or better performance.

Relevance: (3/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (6/10)
Overall: (5/10)
Read Paper (PDF)

Authors: Rafael C. Pinto, Anderson R. Tavares

Uncertainty-Aware Multi-Expert Knowledge Distillation for Imbalanced Disease Grading

Automatic disease image grading is a significant application of artificial intelligence for healthcare, enabling faster and more accurate patient assessments. However, domain shifts, which are exacerbated by data imbalance, introduce bias into the model, posing deployment difficulties in clinical applications. To address this problem, we propose a novel Uncertainty-aware Multi-expert Knowledge Distillation (UMKD) framework to transfer knowledge from multiple expert models to a single student model. Specifically, to extract discriminative features, UMKD decouples task-agnostic and task-specific features with shallow and compact feature alignment in the feature space. In the output space, an uncertainty-aware decoupled distillation (UDD) mechanism dynamically adjusts knowledge transfer weights based on expert model uncertainties, ensuring robust and reliable distillation. Additionally, UMKD tackles the problems of model architecture heterogeneity and distribution discrepancies between source and target domains, which previous KD approaches handle inadequately. Extensive experiments on histology prostate grading (SICAPv2) and fundus image grading (APTOS) demonstrate that UMKD achieves a new state-of-the-art in both source-imbalanced and target-imbalanced scenarios, offering a robust and practical solution for real-world disease image grading.
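
A simplified sketch of uncertainty-aware distillation: each expert's soft targets are weighted by a confidence derived from its predictive entropy before the student matches the mixture. The entropy-based weighting is an assumption about the spirit of UDD, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_kd(student_logits, expert_logits_list, T=2.0):
    """student_logits: (B, C); expert_logits_list: list of (B, C) tensors."""
    probs = [F.softmax(e / T, dim=-1) for e in expert_logits_list]
    entropies = torch.stack([-(p * p.clamp_min(1e-8).log()).sum(-1) for p in probs])  # (E, B)
    weights = F.softmax(-entropies, dim=0)              # lower uncertainty -> higher weight
    target = sum(w.unsqueeze(-1) * p for w, p in zip(weights, probs))                 # (B, C)
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1), target,
                    reduction="batchmean") * (T * T)

student = torch.randn(4, 3)
experts = [torch.randn(4, 3), torch.randn(4, 3)]
print(uncertainty_weighted_kd(student, experts).item())
```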

TLDR: The paper introduces an Uncertainty-aware Multi-expert Knowledge Distillation (UMKD) framework to address data imbalance and domain shift in disease image grading, achieving state-of-the-art results on histology and fundus image grading datasets.

Relevance: (2/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (4/10)
Read Paper (PDF)

Authors: Shuo Tong, Shangde Gao, Ke Liu, Zihang Huang, Hongxia Xu, Haochao Ying, Jian Wu

Towards Lightweight Hyperspectral Image Super-Resolution with Depthwise Separable Dilated Convolutional Network

Deep neural networks have demonstrated highly competitive performance in super-resolution (SR) for natural images by learning mappings from low-resolution (LR) to high-resolution (HR) images. However, hyperspectral super-resolution remains an ill-posed problem due to the high spectral dimensionality of the data and the scarcity of available training samples. Moreover, existing methods often rely on large models with a high number of parameters or require fusion with panchromatic or RGB images, both of which are often impractical in real-world scenarios. Inspired by the MobileNet architecture, we introduce a lightweight depthwise separable dilated convolutional network (DSDCN) to address the aforementioned challenges. Specifically, our model leverages multiple depthwise separable convolutions, similar to the MobileNet architecture, and further incorporates a dilated convolution fusion block to make the model more flexible for the extraction of both spatial and spectral features. In addition, we propose a custom loss function that combines mean squared error (MSE), an L2 norm regularization-based constraint, and a spectral angle-based loss, ensuring the preservation of both spectral and spatial details. The proposed model achieves very competitive performance on two publicly available hyperspectral datasets, making it well-suited for hyperspectral image super-resolution tasks. The source code is publicly available at https://github.com/Usman1021/lightweight.
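
A sketch of a combined MSE plus spectral-angle loss of the kind described above; the weighting coefficients are placeholders, and the L2 term on the network weights is shown only schematically via a parameter list.

```python
import torch

def spectral_angle(pred, target, eps=1e-8):
    """Mean spectral angle (radians) between per-pixel spectra of shape (B, C, H, W)."""
    p = pred.flatten(2)                          # (B, C, H*W)
    t = target.flatten(2)
    cos = (p * t).sum(1) / (p.norm(dim=1) * t.norm(dim=1) + eps)
    return torch.acos(cos.clamp(-1 + eps, 1 - eps)).mean()

def sr_loss(pred, target, params, w_sam=0.1, w_l2=1e-4):
    mse = torch.mean((pred - target) ** 2)
    l2 = sum(p.pow(2).sum() for p in params)     # L2 norm regularization on weights
    return mse + w_sam * spectral_angle(pred, target) + w_l2 * l2

pred, target = torch.rand(2, 31, 16, 16), torch.rand(2, 31, 16, 16)
print(sr_loss(pred, target, params=[torch.rand(8, 8)]).item())
```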

TLDR: This paper introduces a lightweight depthwise separable dilated convolutional network (DSDCN) for hyperspectral image super-resolution, addressing limitations of existing methods in terms of model size and reliance on additional data sources.

Relevance: (2/10)
Novelty: (6/10)
Clarity: (8/10)
Potential Impact: (5/10)
Overall: (4/10)
Read Paper (PDF)

Authors: Usman Muhammad, Jorma Laaksonen, Lyudmila Mihaylova

AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care

Chronic diseases, including diabetes, hypertension, asthma, HIV/AIDS, epilepsy, and tuberculosis, necessitate rigorous adherence to medication to avert disease progression, manage symptoms, and decrease mortality rates. Adherence is frequently undermined by factors including patient behavior, caregiver support, elevated medical costs, and insufficient healthcare infrastructure. We propose AdCare-VLM, a specialized Video-LLaVA-based multimodal large vision language model (LVLM) aimed at visual question answering (VQA) concerning medication adherence through patient videos. We employ a private dataset comprising 806 custom-annotated tuberculosis (TB) medication monitoring videos, which have been labeled by clinical experts, to fine-tune the model for adherence pattern detection. We present LLM-TB-VQA, a detailed medical adherence VQA dataset that encompasses positive, negative, and ambiguous adherence cases. Our method identifies correlations between visual features, such as the clear visibility of the patient's face, medication, water intake, and the act of ingestion, and their associated medical concepts in captions. This facilitates the integration of aligned visual-linguistic representations and improves multimodal interactions. Experimental results indicate that our method surpasses parameter-efficient fine-tuning (PEFT)-enabled VLM models, such as LLaVA-V1.5 and Chat-UniVi, with absolute improvements ranging from 3.1% to 3.54% across pre-trained, regular, and low-rank adaptation (LoRA) configurations. Comprehensive ablation studies and attention map visualizations substantiate our approach, enhancing interpretability.

TLDR: The paper introduces AdCare-VLM, a Video-LLaVA-based LVLM fine-tuned on a custom TB medication monitoring video dataset for VQA concerning medication adherence, showing improvements over existing PEFT-enabled VLM models.

Relevance: (2/10)
Novelty: (7/10)
Clarity: (8/10)
Potential Impact: (6/10)
Overall: (4/10)
Read Paper (PDF)

Authors: Md Asaduzzaman Jabin, Hanqi Jiang, Yiwei Li, Patrick Kaggwa, Eugene Douglass, Juliet N. Sekandi, Tianming Liu