AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

September 08, 2025

UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching of experts (SoE) technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment for both ambient sounds and speech with video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after being finetuned on approximately 7,600 hours of audio-video data, produces results with well-coordinated audio-visuals for ambient sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. In an effort to advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo3, we make our model and code publicly available. We hope this contribution will benefit the broader research community. Project page: https://dorniwang.github.io/UniVerse-1/.

TLDR: UniVerse-1 is a novel audio-video generation model that leverages a stitching of experts (SoE) technique to fuse pre-trained video and music models, achieving coordinated audio-visual generation and strong speech alignment after finetuning, along with a new benchmark dataset, Verse-Bench.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, Gang Yu

Home-made Diffusion Model from Scratch to Hatch

We introduce Home-made Diffusion Model (HDM), an efficient yet powerful text-to-image diffusion model optimized for training (and inference) on consumer-grade hardware. HDM achieves competitive 1024x1024 generation quality while maintaining a remarkably low training cost of $535-620 using four RTX5090 GPUs, representing a significant reduction in computational requirements compared to traditional approaches. Our key contributions include: (1) Cross-U-Transformer (XUT), a novel U-shaped transformer that employs cross-attention for skip connections, providing superior feature integration that leads to remarkable compositional consistency; (2) a comprehensive training recipe that incorporates TREAD acceleration, a novel shifted square crop strategy for efficient arbitrary aspect-ratio training, and progressive resolution scaling; and (3) an empirical demonstration that smaller models (343M parameters) with carefully crafted architectures can achieve high-quality results and emergent capabilities, such as intuitive camera control. Our work provides an alternative paradigm of scaling, demonstrating a viable path toward democratizing high-quality text-to-image generation for individual researchers and smaller organizations with limited computational resources.

TLDR: The paper introduces Home-made Diffusion Model (HDM), a text-to-image diffusion model optimized for training on consumer-grade hardware, achieving competitive image generation quality with significantly reduced computational costs. It also introduces the Cross-U-Transformer (XUT) architecture and a comprehensive training recipe.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Shih-Ying Yeh
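
As a rough illustration of the "cross-attention for skip connections" idea mentioned in the HDM abstract, the sketch below lets decoder tokens attend over encoder tokens instead of concatenating or adding them directly. It is a minimal sketch under strong assumptions, not the XUT architecture itself: the abstract gives no layer details, so the function name `cross_attention_skip`, the absence of learned projection matrices, the single attention head, and the residual addition are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention_skip(decoder_tokens, encoder_tokens):
    """Fuse encoder features into decoder tokens via cross-attention.

    Hypothetical simplification: queries are the raw decoder tokens and
    keys/values are the raw encoder tokens (no learned projections).
    """
    dim = len(decoder_tokens[0])
    fused = []
    for query in decoder_tokens:
        # Scaled dot-product scores of this decoder token against every encoder token.
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in encoder_tokens])
        # Weighted average of the encoder tokens (the "skip" information).
        attended = [
            sum(w * value[d] for w, value in zip(weights, encoder_tokens))
            for d in range(dim)
        ]
        # Residual connection keeps the decoder's own features.
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


# Toy example: 2 decoder tokens attend over 3 encoder tokens of width 2.
decoder_tokens = [[1.0, 0.0], [0.0, 1.0]]
encoder_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention_skip(decoder_tokens, encoder_tokens)
print(len(fused), len(fused[0]))  # 2 2
```

One property of this form is that the decoder and encoder token counts need not match, which a plain additive or concatenating skip connection would require.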

BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

Recent advancements in aligning image and video generative models via GRPO have achieved remarkable gains in enhancing human preference alignment. However, these methods still face high computational costs from on-policy rollouts and excessive SDE sampling steps, as well as training instability due to sparse rewards. In this paper, we propose BranchGRPO, a novel method that introduces a branch sampling policy that updates the SDE sampling process. By sharing computation across common prefixes and pruning low-reward paths and redundant depths, BranchGRPO substantially lowers the per-update compute cost while maintaining or improving exploration diversity. This work makes three main contributions: (1) a branch sampling scheme that reduces rollout and training cost; (2) a tree-based advantage estimator incorporating dense process-level rewards; and (3) pruning strategies exploiting path and depth redundancy to accelerate convergence and boost performance. Experiments on image and video preference alignment show that BranchGRPO improves alignment scores by 16% over strong baselines, while cutting training time by 50%.

TLDR: BranchGRPO introduces a branched sampling approach to GRPO for more efficient and stable alignment of image and video generative models with human preferences, achieving significant speedups and alignment improvements.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang
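
To make the "sharing computation across common prefixes" claim in the BranchGRPO abstract concrete, the sketch below counts denoising-step evaluations for independent rollouts versus a rollout tree that splits at a few steps, and then backs leaf rewards up the tree by averaging. This is a toy sketch under stated assumptions, not the paper's algorithm: the function names, the split schedule `{5, 10, 15}`, the branch factor, the reward values, and the mean-backup rule are all hypothetical, and pruning is not modeled.

```python
def independent_cost(num_leaves, num_steps):
    """Step evaluations when every final sample is rolled out separately."""
    return num_leaves * num_steps


def branched_cost(num_steps, split_steps, branch_factor=2):
    """Step evaluations when paths share a prefix and split at given steps.

    A split at step s multiplies the number of active paths just before
    step s is evaluated. Returns (total step evaluations, number of leaves).
    """
    active_paths = 1
    total_step_evals = 0
    for step in range(num_steps):
        if step in split_steps:
            active_paths *= branch_factor
        total_step_evals += active_paths
    return total_step_evals, active_paths


def backup_rewards(leaf_rewards, branch_factor=2):
    """Propagate leaf rewards to ancestors by averaging sibling groups.

    Returns one list of node values per tree level, root first. Assumed
    rule for illustration: a node's value is the mean of its children.
    """
    levels = [list(leaf_rewards)]
    while len(levels[-1]) > 1:
        children = levels[-1]
        groups = [children[i:i + branch_factor]
                  for i in range(0, len(children), branch_factor)]
        levels.append([sum(group) / len(group) for group in groups])
    return levels[::-1]


# 20 sampling steps, binary splits before steps 5, 10 and 15 -> 8 leaves.
tree_cost, num_leaves = branched_cost(num_steps=20, split_steps={5, 10, 15})
flat_cost = independent_cost(num_leaves, num_steps=20)
print(tree_cost, flat_cost)  # 75 160

# Hypothetical leaf rewards; intermediate nodes receive dense values.
levels = backup_rewards([1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0])
print(levels[0], levels[1])  # [4.5] [4.0, 5.0]
```

In this toy setting the tree produces the same eight final samples with fewer than half the step evaluations, and every internal node gets a value that a tree-based advantage estimator could normalize per depth.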

Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance

Vision-language models have demonstrated impressive capabilities in generating 2D images under various conditions; however, this performance is largely enabled by extensive, readily available pretrained foundation models. Critically, comparable pretrained foundation models do not exist for 3D, significantly limiting progress in this domain. As a result, the potential of vision-language models to produce high-resolution 3D counterfactual medical images conditioned solely on natural language descriptions remains completely unexplored. Addressing this gap would enable powerful clinical and research applications, such as personalized counterfactual explanations, simulation of disease progression scenarios, and enhanced medical training by visualizing hypothetical medical conditions in realistic detail. Our work takes a meaningful step toward addressing this challenge by introducing a framework capable of generating high-resolution 3D counterfactual medical images of synthesized patients guided by free-form language prompts. We adapt state-of-the-art 3D diffusion models with enhancements from Simple Diffusion and incorporate augmented conditioning to improve text alignment and image quality. To our knowledge, this represents the first demonstration of a language-guided native-3D diffusion model applied specifically to neurological imaging data, where faithful three-dimensional modeling is essential to represent the brain's three-dimensional structure. Through results on two distinct neurological MRI datasets, our framework successfully simulates varying counterfactual lesion loads in Multiple Sclerosis (MS) and cognitive states in Alzheimer's disease, generating high-quality images while preserving subject fidelity in synthetically generated medical images. Our results lay the groundwork for prompt-driven disease progression analysis within 3D medical imaging.

TLDR: This paper introduces a framework for generating high-resolution 3D counterfactual medical images from language prompts, applies it to neurological imaging, and demonstrates its use in simulating disease progression in MS and Alzheimer's disease. To the authors' knowledge, it is the first language-guided native-3D diffusion model for neurological imaging.

Relevance: (8/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Mohamed Mohamed, Brennan Nichyporuk, Douglas L. Arnold, Tal Arbel

Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching

Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step for applying online RL methods to Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized by Stochastic Differential Equation (SDE) sampling. Our investigation reveals a significant drawback to this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS.

TLDR: This paper proposes a new sampling method, Coefficients-Preserving Sampling (CPS), for reinforcement learning with Flow Matching to reduce noise artifacts in generated images, leading to faster and more stable convergence for RL-based optimizers.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Feng Wang, Zihao Yu
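
The "excess of stochasticity" described in the CPS abstract can be illustrated with a DDIM-style bookkeeping of noise coefficients. The sketch below is not the paper's formula, which the abstract does not give. It assumes the common flow-matching interpolation x_t = (1 - t) * x_0 + t * eps, assumes the fresh noise is independent of the predicted noise so their variances add, and uses hypothetical names (`naive_noise_level`, `preserving_coefficients`, `preserving_step`). It compares adding fresh noise on top of a step with shrinking the predicted-noise coefficient so that the total noise level stays at the scheduled value.

```python
import math


def naive_noise_level(t_next, sigma):
    """Total noise std when fresh noise of std sigma is simply added on top
    of a sample that already carries noise at level t_next."""
    return math.sqrt(t_next ** 2 + sigma ** 2)


def preserving_coefficients(t_next, sigma):
    """DDIM-style coefficients whose noise terms combine to exactly t_next.

    Returns (signal_coef, predicted_noise_coef, fresh_noise_coef) under the
    assumed interpolation x_t = (1 - t) * x_0 + t * eps.
    """
    if not 0.0 <= sigma <= t_next:
        raise ValueError("sigma must lie in [0, t_next]")
    signal_coef = 1.0 - t_next
    predicted_noise_coef = math.sqrt(t_next ** 2 - sigma ** 2)
    return signal_coef, predicted_noise_coef, sigma


def preserving_step(x0_hat, eps_hat, fresh_noise, t_next, sigma):
    """Build the next sample from the predicted clean signal, the predicted
    noise, and fresh noise, using the coefficients above."""
    a, b, c = preserving_coefficients(t_next, sigma)
    return [a * x + b * e + c * z for x, e, z in zip(x0_hat, eps_hat, fresh_noise)]


t_next, sigma = 0.5, 0.3
_, pred_coef, fresh_coef = preserving_coefficients(t_next, sigma)
preserved = math.sqrt(pred_coef ** 2 + fresh_coef ** 2)
print(preserved)                         # stays at the scheduled level t_next
print(naive_noise_level(t_next, sigma))  # larger than t_next
```

With `sigma = 0` the step reduces to a deterministic update, and increasing `sigma` trades predicted noise for fresh noise without raising the overall noise level, which is the property the naive addition lacks.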