Daily papers related to Image/Video/Multimodal Generation from cs.CV
November 06, 2025
This paper proposes VoxStudio, the first unified, end-to-end speech-to-image model that generates expressive images directly from spoken descriptions by jointly aligning linguistic and paralinguistic information. At its core is a speech information bottleneck (SIB) module, which compresses raw speech into compact semantic tokens while preserving prosody and emotional nuance. By operating directly on these tokens, VoxStudio eliminates the need for a separate speech-to-text system, which typically discards details that lie beyond the text itself, e.g., tone or emotion. We also release VoxEmoset, a large-scale paired emotional speech-image dataset built with an advanced TTS engine to affordably generate richly expressive utterances. Comprehensive experiments on the SpokenCOCO, Flickr8kAudio, and VoxEmoset benchmarks demonstrate the feasibility of our method and highlight key challenges, including emotional consistency and linguistic ambiguity, paving the way for future research.
TLDR: The paper introduces VoxStudio, a novel end-to-end speech-to-image model that generates expressive images from speech using a speech information bottleneck to capture prosody and emotion, along with a new paired emotional speech-image dataset, VoxEmoset.
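The abstract describes the SIB as compressing frame-level speech into a short sequence of semantic tokens. A minimal sketch of that compression idea, assuming chunk-wise mean pooling as the bottleneck (the actual VoxStudio module, sizes, and pooling scheme are not specified here and this stand-in is purely illustrative):

```python
# Toy sketch of a speech information bottleneck (SIB): compress many
# per-frame feature vectors into a few compact "semantic tokens".
# The mean-pooling scheme and token count are illustrative assumptions,
# not VoxStudio's actual architecture.

def speech_information_bottleneck(frames, num_tokens):
    """Compress a list of per-frame feature vectors into `num_tokens`
    tokens by mean-pooling contiguous chunks of frames."""
    chunk = max(1, len(frames) // num_tokens)
    tokens = []
    for i in range(0, len(frames), chunk):
        window = frames[i:i + chunk]
        dim = len(window[0])
        # Average each feature dimension over the window of frames.
        tokens.append([sum(f[d] for f in window) / len(window)
                       for d in range(dim)])
    return tokens[:num_tokens]

# e.g. 100 speech frames of dimension 2 compressed to 8 tokens:
compact = speech_information_bottleneck([[1.0, 2.0]] * 100, 8)
```

A learned bottleneck would replace the fixed pooling with trainable layers, but the interface is the same: a long acoustic sequence in, a short token sequence out, which is what lets the downstream image generator skip a speech-to-text stage.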
Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.
TLDR: UniAVGen is a unified audio-video generation framework using Diffusion Transformers and asymmetric cross-modal interaction to improve lip synchronization and semantic consistency while reducing training data requirements. It tackles multiple audio-video tasks within a single model.
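Modality-Aware Classifier-Free Guidance is described as amplifying cross-modal correlation signals at inference. A minimal sketch of that idea, assuming a standard CFG-style decomposition in which the cross-modal term gets its own guidance scale (the exact decomposition, scale names, and values here are illustrative assumptions, not the paper's formula):

```python
# Sketch of classifier-free guidance with a separate cross-modal term,
# in the spirit of UniAVGen's Modality-Aware CFG. The three-way
# decomposition and the scales s_text / s_cross are assumptions.

def modality_aware_cfg(eps_uncond, eps_text, eps_full,
                       s_text=5.0, s_cross=2.0):
    """Combine three denoiser outputs (equal-length score lists):
    eps_uncond: no conditioning;
    eps_text:   text conditioning only;
    eps_full:   text plus the other modality (e.g. audio for the video branch).
    The cross-modal correlation signal (eps_full - eps_text) is amplified
    by its own scale, separately from the ordinary text-guidance signal."""
    return [u + s_text * (t - u) + s_cross * (f - t)
            for u, t, f in zip(eps_uncond, eps_text, eps_full)]
```

With `s_cross > 1` the model's prediction is pushed further in the direction the extra modality contributes, which is one plausible reading of "explicitly amplifies cross-modal correlation signals".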
Personalizing text-to-image diffusion models has traditionally relied on subject-specific fine-tuning approaches such as DreamBooth (Ruiz et al., 2023), which are computationally expensive and slow at inference. Recent adapter- and encoder-based methods attempt to reduce this overhead but still depend on additional fine-tuning or large backbone models for satisfactory results. In this work, we revisit an orthogonal direction: fine-tuning-free personalization via Hypernetworks that predict LoRA-adapted weights directly from subject images. Prior hypernetwork-based approaches, however, suffer from costly data generation or unstable attempts to mimic base model optimization trajectories. We address these limitations with an end-to-end training objective, stabilized by a simple output regularization, yielding reliable and effective hypernetworks. Our method removes the need for per-subject optimization at test time while preserving both subject fidelity and prompt alignment. To further enhance compositional generalization at inference time, we introduce Hybrid-Model Classifier-Free Guidance (HM-CFG), which combines the compositional strengths of the base diffusion model with the subject fidelity of personalized models during sampling. Extensive experiments on CelebA-HQ, AFHQ-v2, and DreamBench demonstrate that our approach achieves strong personalization performance and highlights the promise of hypernetworks as a scalable and effective direction for open-category personalization.
TLDR: This paper introduces a fine-tuning-free approach for personalized text-to-image generation using hypernetworks with a stabilized training objective and Hybrid-Model Classifier-Free Guidance to improve compositional generalization.
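The core mechanism is a hypernetwork mapping a subject image (via some embedding) to LoRA factors, so the personalized layer becomes W + BA with no per-subject optimization. A toy sketch of that plumbing, assuming linear hypernetwork heads and tiny dimensions (all shapes, the helpers, and the linear heads H_A / H_B are illustrative, not the paper's architecture):

```python
# Toy sketch: a "hypernetwork" predicts LoRA factors A (rank x d_in) and
# B (d_out x rank) from a subject embedding, then the layer weight is
# adapted as W' = W + B @ A. The linear heads H_A/H_B and all sizes are
# illustrative assumptions.

def matmul(X, Y):
    """Plain nested-list matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def hypernet_lora(subject_emb, H_A, H_B, rank, d_in, d_out):
    """Map a subject embedding to flattened LoRA factors via fixed
    hypernetwork weight matrices, then reshape into A and B."""
    flat_A = matmul([subject_emb], H_A)[0]   # length rank * d_in
    flat_B = matmul([subject_emb], H_B)[0]   # length d_out * rank
    A = [flat_A[i * d_in:(i + 1) * d_in] for i in range(rank)]
    B = [flat_B[i * rank:(i + 1) * rank] for i in range(d_out)]
    return A, B

def adapt_weight(W, A, B):
    """Personalized weight W + B @ A -- no per-subject fine-tuning step."""
    delta = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

Because A and B come from a single forward pass of the hypernetwork, test-time personalization costs one inference call instead of a DreamBooth-style optimization loop, which is the efficiency argument the abstract makes.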
We introduce ProM3E, a probabilistic masked multimodal embedding model for any-to-any generation of multimodal representations for ecology. ProM3E is based on masked modality reconstruction in the embedding space, learning to infer missing modalities given a few context modalities. By design, our model supports modality inversion in the embedding space. The probabilistic nature of our model allows us to analyse the feasibility of fusing various modalities for given downstream tasks, essentially learning what to fuse. Using these features of our model, we propose a novel cross-modal retrieval approach that mixes inter-modal and intra-modal similarities to achieve superior performance across all retrieval tasks. We further leverage the hidden representation from our model to perform linear probing tasks and demonstrate the superior representation learning capability of our model. All our code, datasets, and model will be released at https://vishu26.github.io/prom3e.
TLDR: The paper introduces ProM3E, a probabilistic multimodal embedding model for ecological data, enabling any-to-any modality generation and cross-modal retrieval through masked modality reconstruction and probabilistic fusion analysis.
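The retrieval approach is described as mixing inter-modal and intra-modal similarities. A minimal sketch of that scoring rule, assuming cosine similarity and a convex mixing weight `alpha` (both are illustrative assumptions; the paper's actual similarity functions and mixing scheme may differ):

```python
# Sketch of cross-modal retrieval that mixes inter-modal and intra-modal
# similarities, in the spirit of ProM3E's retrieval approach. Cosine
# scoring and the convex weight `alpha` are illustrative assumptions.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def mixed_retrieval(query_cross, query_same, gallery, alpha=0.5):
    """Rank gallery embeddings by a mix of similarity to a cross-modal
    query embedding (e.g. inferred from another modality) and to a
    same-modality query embedding. Returns gallery indices, best first."""
    scores = [alpha * cosine(query_cross, g) +
              (1 - alpha) * cosine(query_same, g)
              for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: -scores[i])
```

Here the cross-modal query embedding would come from the model's masked-reconstruction step (inferring a missing modality from context), while the intra-modal term compares within the gallery's own modality; `alpha` trades the two signals off.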