Daily papers related to Image/Video/Multimodal Generation from cs.CV
December 11, 2025
Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
TLDR: This paper introduces Bind & Compose, a method for flexible visual concept composition from images and videos using a hierarchical binder structure and a novel training strategy to improve concept binding accuracy and compatibility between image and video concepts.
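TLDR: This paper introduces Bind & Compose, a flexible visual concept composition method that uses a hierarchical binder structure and a new training strategy to compose visual concepts from images and videos by binding them to the corresponding prompt tokens, improving concept binding accuracy and the compatibility between image and video concepts.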
Personalized Text-to-Image (PT2I) generation aims to produce customized images based on reference images. A prominent interest pertains to the integration of an image prompt adapter to facilitate zero-shot PT2I without test-time fine-tuning. However, current methods grapple with three fundamental challenges: (1) the elusive equilibrium between Concept Preservation (CP) and Prompt Following (PF), (2) the difficulty in retaining fine-grained concept details in reference images, and (3) the restricted scalability to multi-subject personalization. To tackle these challenges, we present Dynamic Image Prompt Adapter (DynaIP), a cutting-edge plugin to enhance the fine-grained concept fidelity, CP-PF balance, and subject scalability of SOTA T2I multimodal diffusion transformers (MM-DiT) for PT2I generation. Our key finding is that MM-DiT inherently exhibits decoupling learning behavior when reference image features are injected into its dual branches via cross-attention. Based on this, we design an innovative Dynamic Decoupling Strategy that removes the interference of concept-agnostic information during inference, significantly enhancing the CP-PF balance and further bolstering the scalability of multi-subject compositions. Moreover, we identify the visual encoder as a key factor affecting fine-grained CP and reveal that the hierarchical features of commonly used CLIP can capture visual information at diverse granularity levels. Therefore, we introduce a novel Hierarchical Mixture-of-Experts Feature Fusion Module to fully leverage the hierarchical features of CLIP, remarkably elevating the fine-grained concept fidelity while also providing flexible control of visual granularity. Extensive experiments across single- and multi-subject PT2I tasks verify that our DynaIP outperforms existing approaches, marking a notable advancement in the field of PT2I generation.
TLDR: The paper introduces DynaIP, a novel image prompt adapter for personalized text-to-image generation that addresses challenges in concept preservation, prompt following, and scalability by using a dynamic decoupling strategy and hierarchical feature fusion.
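TLDR: The paper introduces DynaIP, a new image prompt adapter for personalized text-to-image generation that addresses challenges in concept preservation, prompt following, and scalability through a dynamic decoupling strategy and hierarchical feature fusion.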
Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability in video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: sparse controls like keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depths or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model's ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner. Project page: https://vcai.mpi-inf.mpg.de/projects/vhoi/.
TLDR: The paper presents VHOI, a two-stage framework for controllable video generation of human-object interactions, using motion densification of sparse trajectories and a HOI-aware motion representation to condition a video diffusion model.
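TLDR: The paper proposes VHOI, a framework for controllable human-object interaction video generation that uses motion densification of sparse trajectories and an HOI-aware motion representation to condition a video diffusion model.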
Ultrasound echocardiography is essential for the non-invasive, real-time assessment of cardiac function, but the scarcity of labelled data, driven by privacy restrictions and the complexity of expert annotation, remains a major obstacle for deep learning methods. We propose the Motion Conditioned Diffusion Model (MCDM), a label-free latent diffusion framework that synthesises realistic echocardiography videos conditioned on self-supervised motion features. To extract these features, we design the Motion and Appearance Feature Extractor (MAFE), which disentangles motion and appearance representations from videos. Feature learning is further enhanced by two auxiliary objectives: a re-identification loss guided by pseudo appearance features and an optical flow loss guided by pseudo flow fields. Evaluated on the EchoNet-Dynamic dataset, MCDM achieves competitive video generation performance, producing temporally coherent and clinically realistic sequences without reliance on manual labels. These results demonstrate the potential of self-supervised conditioning for scalable echocardiography synthesis. Our code is available at https://github.com/ZheLi2020/LabelfreeMCDM.
TLDR: This paper introduces a label-free motion-conditioned diffusion model (MCDM) for synthesizing realistic cardiac ultrasound videos, using self-supervised motion features extracted by a novel Motion and Appearance Feature Extractor (MAFE). The method achieves competitive performance on the EchoNet-Dynamic dataset without manual labels.
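TLDR: The paper introduces a label-free Motion Conditioned Diffusion Model (MCDM) for synthesizing realistic cardiac ultrasound videos, using a new Motion and Appearance Feature Extractor (MAFE) for self-supervised motion feature extraction. The method achieves competitive performance on the EchoNet-Dynamic dataset without manual labels.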
Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.
TLDR: The paper introduces VABench, a new benchmark for evaluating audio-video generation models, covering tasks like text/image-to-audio-video and stereo audio-video generation, with a focus on audio-video synchronization and consistency.
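TLDR: The paper introduces VABench, a new benchmark for evaluating audio-video generation models that covers tasks such as text/image-to-audio-video and stereo audio-video generation, with a focus on audio-video synchronization and consistency.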
Recent advances in diffusion models have greatly improved image generation and editing, yet generating or reconstructing layered PSD files with transparent alpha channels remains highly challenging. We propose OmniPSD, a unified diffusion framework built upon the Flux ecosystem that enables both text-to-PSD generation and image-to-PSD decomposition through in-context learning. For text-to-PSD generation, OmniPSD arranges multiple target layers spatially into a single canvas and learns their compositional relationships through spatial attention, producing semantically coherent and hierarchically structured layers. For image-to-PSD decomposition, it performs iterative in-context editing, progressively extracting and erasing textual and foreground components to reconstruct editable PSD layers from a single flattened image. An RGBA-VAE is employed as an auxiliary representation module to preserve transparency without affecting structure learning. Extensive experiments on our new RGBA-layered dataset demonstrate that OmniPSD achieves high-fidelity generation, structural consistency, and transparency awareness, offering a new paradigm for layered design generation and decomposition with diffusion transformers.
TLDR: OmniPSD is a diffusion-based framework for generating and decomposing layered PSD files, enabling both text-to-PSD generation and image-to-PSD decomposition with transparency awareness.
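TLDR: OmniPSD is a diffusion-based framework for generating and decomposing layered PSD files, supporting transparency-aware text-to-PSD generation and image-to-PSD decomposition.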
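The layered RGBA representation at the center of OmniPSD can be made concrete with the inverse of decomposition: flattening layers back into one image. The sketch below uses the standard Porter-Duff "over" operator on straight (non-premultiplied) alpha. It is not taken from the paper; the names `over` and `flatten` and the flat-list pixel layout are assumptions chosen for illustration.

```python
from typing import List, Tuple

# One pixel: red, green, blue, alpha, each in [0, 1], with straight alpha.
RGBA = Tuple[float, float, float, float]


def over(top: RGBA, bottom: RGBA) -> RGBA:
    """Composite one `top` pixel over one `bottom` pixel (Porter-Duff 'over')."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        # Both pixels are fully transparent, so the color is undefined.
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t: float, b: float) -> float:
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def flatten(layers: List[List[RGBA]]) -> List[RGBA]:
    """Flatten layers given bottom-first; each layer is a flat pixel list."""
    if not layers:
        return []
    result = list(layers[0])
    for layer in layers[1:]:
        result = [over(t, b) for t, b in zip(layer, result)]
    return result
```

For instance, a half-transparent red pixel over opaque white flattens to a pink opaque pixel, while a fully transparent pixel leaves the layer beneath unchanged. Image-to-PSD decomposition has to recover layers that reproduce the input under this kind of compositing.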
We present WonderZoom, a novel approach to generating 3D scenes with contents across multiple spatial scales from a single image. Existing 3D world generation models remain limited to single-scale synthesis and cannot produce coherent scene contents at varying granularities. The fundamental challenge is the lack of a scale-aware 3D representation capable of generating and rendering content with largely different spatial sizes. WonderZoom addresses this through two key innovations: (1) scale-adaptive Gaussian surfels for generating and real-time rendering of multi-scale 3D scenes, and (2) a progressive detail synthesizer that iteratively generates finer-scale 3D contents. Our approach enables users to "zoom into" a 3D region and auto-regressively synthesize previously non-existent fine details from landscapes to microscopic features. Experiments demonstrate that WonderZoom significantly outperforms state-of-the-art video and 3D models in both quality and alignment, enabling multi-scale 3D world creation from a single image. We show video results and an interactive viewer of generated multi-scale 3D worlds at https://wonderzoom.github.io/
TLDR: WonderZoom introduces a novel approach for generating multi-scale 3D scenes from a single image by using scale-adaptive Gaussian surfels and a progressive detail synthesizer, enabling users to zoom into 3D regions and generate previously non-existent fine details.
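TLDR: WonderZoom proposes a novel approach that generates multi-scale 3D scenes from a single image using scale-adaptive Gaussian surfels and a progressive detail synthesizer, letting users zoom into 3D regions and generate previously non-existent fine details.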
Recent progress in text-to-video generation has achieved remarkable realism, yet fine-grained control over camera motion and orientation remains elusive. Existing approaches typically encode camera trajectories through relative or ambiguous representations, limiting explicit geometric control. We introduce GimbalDiffusion, a framework that enables camera control grounded in physical-world coordinates, using gravity as a global reference. Instead of describing motion relative to previous frames, our method defines camera trajectories in an absolute coordinate system, allowing precise and interpretable control over camera parameters without requiring an initial reference frame. We leverage panoramic 360-degree videos to construct a wide variety of camera trajectories, well beyond the predominantly straight, forward-facing trajectories seen in conventional video data. To further enhance camera guidance, we introduce null-pitch conditioning, an annotation strategy that reduces the model's reliance on text content when it conflicts with camera specifications (e.g., generating grass while the camera points towards the sky). Finally, we establish a benchmark for camera-aware video generation by rebalancing SpatialVID-HQ for comprehensive evaluation under wide camera pitch variation. Together, these contributions advance the controllability and robustness of text-to-video models, enabling precise, gravity-aligned camera manipulation within generative frameworks.
TLDR: GimbalDiffusion introduces a gravity-aware camera control method for text-to-video generation, enabling precise camera movements in a global coordinate system and improving controllability and robustness.
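TLDR: GimbalDiffusion proposes a gravity-based camera control method for text-to-video generation that enables precise camera motion in a global coordinate system and improves controllability and robustness.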
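The difference between a gravity-referenced camera parameter and a frame-relative one can be illustrated with pitch. The sketch below computes the angle of a viewing direction above the horizon in a world frame whose z-axis points opposite to gravity. The z-up convention and the function name `pitch_degrees` are assumptions made for this example, not details from GimbalDiffusion.

```python
import math
from typing import Sequence


def pitch_degrees(forward: Sequence[float],
                  up: Sequence[float] = (0.0, 0.0, 1.0)) -> float:
    """Angle of the viewing direction above the horizon, in degrees.

    `forward` is the camera's viewing direction in world coordinates.
    `up` is a unit vector opposite to gravity (assumed z-up by default).
    """
    norm = math.sqrt(sum(c * c for c in forward))
    if norm == 0.0:
        raise ValueError("forward must be a non-zero vector")
    # Sine of the pitch is the component of the unit view direction along up.
    sin_pitch = sum(f * u for f, u in zip(forward, up)) / norm
    sin_pitch = max(-1.0, min(1.0, sin_pitch))  # guard against rounding
    return math.degrees(math.asin(sin_pitch))
```

Because the result depends only on the current orientation and the gravity direction, a value such as 90 degrees always means "looking straight up", whatever the previous frames showed.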
Text-to-image generative models have achieved remarkable visual quality but still struggle with compositionality, that is, accurately capturing object relationships, attribute bindings, and fine-grained details in prompts. A key limitation is that models are not explicitly trained to differentiate between compositionally similar prompts and images, resulting in outputs that are close to the intended description yet deviate in fine-grained details. To address this, we propose AgentComp, a framework that explicitly trains models to better differentiate such compositional variations and enhance their reasoning ability. AgentComp leverages the reasoning and tool-use capabilities of large language models equipped with image generation, editing, and VQA tools to autonomously construct compositional datasets. Using these datasets, we apply an agentic preference optimization method to fine-tune text-to-image models, enabling them to better distinguish between compositionally similar samples and resulting in overall stronger compositional generation ability. AgentComp achieves state-of-the-art results on compositionality benchmarks such as T2I-CompBench without compromising image quality (a common drawback in prior approaches), and even generalizes to other capabilities not explicitly trained for, such as text rendering.
TLDR: The paper presents AgentComp, a framework utilizing LLMs and image tools to generate compositional datasets, enabling fine-tuning of text-to-image models for improved compositional reasoning and state-of-the-art results on compositionality benchmarks.
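TLDR: The paper proposes the AgentComp framework, which uses LLMs and image tools to generate compositional datasets for fine-tuning text-to-image models, improving compositional reasoning and achieving state-of-the-art results on compositionality benchmarks.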
Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames and to balance responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically routes heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.
TLDR: The paper introduces Astra, a general interactive world model using an autoregressive denoising architecture, designed for long-horizon video prediction across diverse scenarios with precise action control. It outperforms existing models in fidelity, long-range prediction, and action alignment.
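TLDR: The paper introduces Astra, a general interactive world model that uses an autoregressive denoising architecture for long-horizon video prediction across diverse scenarios with precise action control. It outperforms existing models in fidelity, long-range prediction, and action alignment.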
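The Astra abstract mentions temporal causal attention without giving its exact form. One common reading, assumed here, is a frame-level (block-causal) mask in which every token may attend to tokens of its own frame and of earlier frames but never to later frames. The function name and the boolean-matrix layout below are illustrative choices, not the authors' implementation.

```python
from typing import List


def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Build a boolean attention mask of shape (N, N), N = frames * tokens.

    mask[q][k] is True when query token q may attend to key token k,
    i.e. when k belongs to the same frame as q or to an earlier frame.
    """
    total = num_frames * tokens_per_frame
    return [
        [(k // tokens_per_frame) <= (q // tokens_per_frame) for k in range(total)]
        for q in range(total)
    ]
```

With such a mask, appending a new frame never changes what earlier frames could see, which is the property that makes streaming, frame-by-frame generation possible.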
Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is combining the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages (Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation), EvoScene alternates between 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.
TLDR: EvoScene is a training-free framework for generating complete 3D scenes from a single image by iteratively combining 3D generation and video generation models, demonstrating improved geometric stability and texture consistency.
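TLDR: EvoScene is a training-free framework that generates complete 3D scenes from a single image by iteratively combining 3D generation and video generation models, showing improved geometric stability and texture consistency.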
Image retouching has received significant attention due to its ability to achieve high-quality visual content. Existing approaches mainly rely on uniform pixel-wise color mapping across entire images, neglecting the inherent color variations induced by image content. This limitation hinders existing approaches from achieving adaptive retouching that accommodates both diverse color distributions and user-defined style preferences. To address these challenges, we propose a novel Content-Adaptive image retouching method guided by Attribute-based Text Representation (CA-ATP). Specifically, we propose a content-adaptive curve mapping module, which leverages a series of basis curves to establish multiple color mapping relationships and learns the corresponding weight maps, enabling content-aware color adjustments. The proposed module can capture color diversity within the image content, allowing similar color values to receive distinct transformations based on their spatial context. In addition, we propose an attribute text prediction module that generates text representations from multiple image attributes, which explicitly represent user-defined style preferences. These attribute-based text representations are subsequently integrated with visual features via a multimodal model, providing user-friendly guidance for image retouching. Extensive experiments on several public datasets demonstrate that our method achieves state-of-the-art performance.
TLDR: This paper introduces a content-adaptive image retouching method guided by attribute-based text representation (CA-ATP), which uses a content-adaptive curve mapping and an attribute text prediction module to achieve state-of-the-art performance.
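TLDR: This paper proposes a content-adaptive image retouching method guided by attribute-based text representation (CA-ATP), which uses content-adaptive curve mapping and an attribute text prediction module to achieve state-of-the-art performance.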
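The curve-mapping idea in the CA-ATP abstract (several basis curves combined by per-pixel weight maps) can be sketched on a single-channel image. Everything concrete below is an assumption for illustration: the curves are piecewise-linear with evenly spaced knots, the weight maps are supplied by hand instead of being learned, and the names `eval_curve` and `retouch` are hypothetical.

```python
from typing import List, Sequence

Image = List[List[float]]  # H x W intensities in [0, 1]


def eval_curve(knots: Sequence[float], x: float) -> float:
    """Evaluate a piecewise-linear tone curve at x in [0, 1].

    `knots` holds at least two output values at evenly spaced inputs.
    """
    x = min(max(x, 0.0), 1.0)
    pos = x * (len(knots) - 1)
    i = min(int(pos), len(knots) - 2)
    frac = pos - i
    return knots[i] * (1.0 - frac) + knots[i + 1] * frac


def retouch(image: Image,
            curves: Sequence[Sequence[float]],
            weight_maps: Sequence[Image]) -> Image:
    """Blend the outputs of all basis curves with per-pixel weights."""
    out: Image = []
    for y, row in enumerate(image):
        out_row = []
        for x, value in enumerate(row):
            mapped = sum(wm[y][x] * eval_curve(c, value)
                         for c, wm in zip(curves, weight_maps))
            out_row.append(mapped)
        out.append(out_row)
    return out
```

In a two-pixel image where both pixels hold 0.4, giving the first pixel full weight on an identity curve and the second full weight on a brightening curve yields two different outputs. This is the sense in which similar color values receive distinct transformations depending on where they sit in the image.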
Part-level 3D generation is essential for applications requiring decomposable and structured 3D synthesis. However, existing methods either rely on implicit part segmentation with limited granularity control or depend on strong external segmenters trained on large annotated datasets. In this work, we observe that part awareness emerges naturally during whole-object geometry learning and propose Geom-Seg VecSet, a unified geometry-segmentation latent representation that jointly encodes object geometry and part-level structure. Building on this representation, we introduce UniPart, a two-stage latent diffusion framework for image-guided part-level 3D generation. The first stage performs joint geometry generation and latent part segmentation, while the second stage conditions part-level diffusion on both whole-object and part-specific latents. A dual-space generation scheme further enhances geometric fidelity by predicting part latents in both global and canonical spaces. Extensive experiments demonstrate that UniPart achieves superior segmentation controllability and part-level geometric quality compared with existing approaches.
TLDR: UniPart is a two-stage latent diffusion framework for image-guided part-level 3D generation that uses a unified geometry-segmentation latent representation, achieving superior segmentation controllability and part-level geometric quality.
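TLDR: UniPart is a two-stage latent diffusion framework for image-guided part-level 3D generation that uses a unified geometry-segmentation latent representation to achieve superior segmentation controllability and part-level geometric quality.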
Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered and lack semantic structure, and diffusion-based models disrupt continuity with a random denoising process. In this work, we propose to treat the disease dynamic as a velocity field and leverage Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, it captures the intrinsic dynamic of disease, making the progression more interpretable. However, a key challenge remains: in latent space, Auto-Encoders (AEs) do not guarantee alignment across patients or correlation with clinical-severity indicators (e.g., age and disease conditions). To address this, we propose to learn patient-specific latent alignment, which enforces patient trajectories to lie along a specific axis, with magnitude increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space. Together, we present $Δ$-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, $Δ$-LFM demonstrates strong empirical performance and, more importantly, offers a new framework for interpreting and visualizing disease dynamics.
TLDR: This paper introduces Δ-LFM, a flow-matching-based framework for generating longitudinal imaging data that learns patient-specific disease progression with a semantically meaningful latent space.
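TLDR: This paper introduces Δ-LFM, a flow-matching-based framework for generating longitudinal imaging data that learns patient-specific disease progression in a semantically meaningful latent space.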
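Treating progression as a velocity field, as the Δ-LFM abstract describes, can be illustrated with the simplest form of flow matching. The sketch assumes a straight-line path between two latent codes (a common choice, not confirmed as the paper's) and uses hypothetical names throughout. `x0` and `x1` stand in for the latent codes of an earlier and a later scan of the same patient.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def interpolate(x0: Sequence[float], x1: Sequence[float], t: float) -> Vector:
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Sequence[float], x1: Sequence[float]) -> Vector:
    """Regression target for the velocity model on the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_integrate(velocity_fn: Callable[[Vector, float], Vector],
                    x0: Sequence[float], steps: int = 10) -> Vector:
    """Follow a velocity field from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

During training, a network would be fitted so that its output at `interpolate(x0, x1, t)` matches `target_velocity(x0, x1)`. At inference, integrating the learned field from an early latent moves it continuously toward the later state, in contrast to the stochastic denoising steps of a diffusion sampler.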
Generating realistic food images for categories with multiple nouns is surprisingly challenging. For instance, the prompt "egg noodle" may result in images that incorrectly contain both eggs and noodles as separate entities. Multi-noun food categories are common in real-world datasets and account for a large portion of entries in benchmarks such as UEC-256. These compound names often cause generative models to misinterpret the semantics, producing unintended ingredients or objects. This is due to insufficient knowledge of multi-noun categories in the text encoder and misinterpretation of multi-noun relationships, leading to incorrect spatial layouts. To overcome these challenges, we propose FoCULR (Food Category Understanding and Layout Refinement), which incorporates food domain knowledge and introduces core concepts early in the generation process. Experimental results demonstrate that the integration of these techniques improves image generation performance in the food domain.
TLDR: The paper introduces FoCULR, a method to improve food image generation for multi-noun categories by incorporating food domain knowledge and refining spatial layouts, addressing the problem of generative models misinterpreting compound names.
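TLDR: The paper introduces FoCULR, a method that improves food image generation for multi-noun categories by incorporating food domain knowledge and refining spatial layouts, aiming to solve the problem of generative models misinterpreting compound names.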
White balance (WB) is a key step in the image signal processor (ISP) pipeline that mitigates color casts caused by varying illumination and restores the scene's true colors. Currently, sRGB-based WB editing for post-ISP WB correction is widely used to address color constancy failures in the ISP pipeline when the original camera RAW is unavailable. However, additive color models (e.g., sRGB) are inherently limited by fixed nonlinear transformations and entangled color channels, which often impede their generalization to complex lighting conditions. To address these challenges, we propose a novel framework for WB correction that leverages a perception-inspired Learnable HSI (LHSI) color space. Built upon a cylindrical color model that naturally separates luminance from chromatic components, our framework further introduces dedicated parameters to enhance this disentanglement and learnable mapping to adaptively refine the flexibility. Moreover, a new Mamba-based network is introduced, which is tailored to the characteristics of the proposed LHSI color space. Experimental results on benchmark datasets demonstrate the superiority of our method, highlighting the potential of perception-inspired color space design in computational photography. The source code is available at https://github.com/YangCheng58/WB_Color_Space.
TLDR: The paper introduces a novel white balance correction framework using a perception-inspired learnable HSI color space and a Mamba-based network to address limitations of sRGB-based methods under complex lighting conditions. Results on benchmark datasets demonstrate improved performance.
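TLDR: The paper proposes a novel white balance correction framework that uses a perception-inspired, learnable HSI color space and a Mamba-based network to address the limitations of sRGB methods under complex lighting conditions. Results on benchmark datasets show improved performance.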
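The cylindrical color model that the white balance paper builds on can be shown with the classical, fixed RGB-to-HSI conversion. This is the textbook formula only. The learnable parameters and mappings of the proposed LHSI space are not reproduced here, and the function name `rgb_to_hsi` is chosen for illustration.

```python
import math
from typing import Tuple


def rgb_to_hsi(r: float, g: float, b: float) -> Tuple[float, float, float]:
    """Convert RGB in [0, 1] to (hue in degrees, saturation, intensity)."""
    intensity = (r + g + b) / 3.0
    if intensity == 0.0:
        return (0.0, 0.0, 0.0)  # black: hue and saturation are undefined
    saturation = 1.0 - min(r, g, b) / intensity
    numerator = 0.5 * ((r - g) + (r - b))
    # The radicand is non-negative in exact arithmetic; clamp for rounding.
    denominator = math.sqrt(max(0.0, (r - g) ** 2 + (r - b) * (g - b)))
    if denominator == 0.0:
        hue = 0.0  # achromatic (gray): hue is undefined, report 0
    else:
        cos_h = max(-1.0, min(1.0, numerator / denominator))
        hue = math.degrees(math.acos(cos_h))
        if b > g:
            hue = 360.0 - hue
    return (hue, saturation, intensity)
```

Intensity is a plain average of the three channels, while hue and saturation carry the chromatic information. A color cast therefore shows up mainly in the hue and saturation coordinates, which is the separation of luminance from chroma that the abstract refers to.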