Daily papers related to Image/Video/Multimodal Generation from cs.CV
October 28, 2025
This paper proposes FreeFuse, a novel training-free approach for multi-subject text-to-image generation through automatic fusion of multiple subject LoRAs. In contrast to existing methods that either focus on pre-inference LoRA weight merging or rely on segmentation models and complex techniques like noise blending to isolate LoRA outputs, our key insight is that context-aware dynamic subject masks can be automatically derived from cross-attention layer weights. Mathematical analysis shows that directly applying these masks to LoRA outputs during inference closely approximates the case where the subject LoRA is integrated into the diffusion model and used individually for the masked region. FreeFuse demonstrates superior practicality and efficiency as it requires no additional training, no modification to LoRAs, no auxiliary models, and no user-defined prompt templates or region specifications. Instead, it only requires users to provide the LoRA activation words for seamless integration into standard workflows. Extensive experiments validate that FreeFuse outperforms existing approaches in both generation quality and usability on multi-subject generation tasks. The project page is at https://future-item.github.io/FreeFuse/
TLDR: FreeFuse introduces a training-free method for multi-subject text-to-image generation by automatically fusing multiple subject LoRAs using context-aware masks derived from cross-attention weights, achieving superior performance and usability.
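The core mechanism can be illustrated with a toy sketch. This is an assumption-laden reading of the abstract, not the paper's implementation: cross-attention weights for each subject's activation word are turned into winner-take-all spatial masks, and each subject's LoRA output is added only inside its own mask.

```python
import numpy as np

def fuse_lora_outputs(base_out, lora_outs, attn_maps):
    """Hypothetical sketch: attn_maps[i] holds the cross-attention weights
    of subject i's activation word over an H x W spatial grid. Each spatial
    position is assigned to the subject with the highest weight, and that
    subject's LoRA output is applied only inside its mask."""
    attn = np.stack(attn_maps)            # (num_subjects, H, W)
    assignment = attn.argmax(axis=0)      # (H, W) winner-take-all subject index
    out = base_out.copy()
    for i, lora_out in enumerate(lora_outs):
        mask = (assignment == i)          # boolean mask for subject i
        out[mask] += lora_out[mask]       # add this LoRA's delta only there
    return out
```

In a real pipeline the masks would be derived per denoising step from the diffusion model's own cross-attention layers; here they are supplied directly for clarity.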
Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or a few images of a specific subject (also known as personalization or subject-driven generation) allows generating visual content that aligns with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model to progressively infill the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at https://zsh2000.github.io/track-inpaint-resplat.github.io/.
TLDR: The paper introduces TIRE, a novel method for subject-driven 3D/4D generation that uses video tracking and inpainting to improve identity preservation, surpassing existing state-of-the-art approaches.
Directly modeling the explicit likelihood of the raw data distribution is a key topic in machine learning, and autoregressive modeling of this likelihood underlies the scaling successes of Large Language Models. However, continuous AR modeling over visual pixel data suffers from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
TLDR: FARMER is a novel generative framework unifying Normalizing Flows and Autoregressive models for high-quality image synthesis directly from raw pixels, using dimension reduction and distillation for efficiency.
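The "exact likelihood" property comes from the change-of-variables formula that any normalizing flow obeys. As a toy illustration (not FARMER's architecture), an elementwise affine flow z = (x - shift) / scale mapped to a standard-normal base gives log p(x) = log N(z; 0, I) + log |det dz/dx|:

```python
import numpy as np

def affine_flow_logpx(x, scale, shift):
    """Toy change-of-variables likelihood for an elementwise affine flow.
    This illustrates the exact-likelihood property of normalizing flows,
    not FARMER's invertible autoregressive flow itself."""
    z = (x - shift) / scale
    log_base = -0.5 * (z**2 + np.log(2 * np.pi)).sum()  # log N(z; 0, I)
    log_det = -np.log(np.abs(scale)).sum()              # dz/dx = 1/scale per dim
    return log_base + log_det
```

An invertible autoregressive flow generalizes this by letting scale and shift for each dimension depend on the preceding dimensions, which keeps the Jacobian triangular and its log-determinant cheap to compute.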
The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and a modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints are available at https://github.com/InternLM/JanusCoder.
TLDR: JanusCoder introduces a multimodal code corpus (JanusCode-800K) and models (JanusCoder, JanusCoderV) for generating code from text, visuals, or both, demonstrating strong performance in visual-programmatic interface tasks.
Diffusion-based generative processes, formulated as differential equation solving, frequently balance computational speed with sample quality. Our theoretical investigation of ODE- and SDE-based solvers reveals complementary weaknesses: ODE solvers accumulate irreducible gradient error along deterministic trajectories, while SDE methods suffer from amplified discretization errors when the step budget is limited. Building upon this insight, we introduce AdaSDE, a novel single-step SDE solver that aims to unify the efficiency of ODEs with the error resilience of SDEs. Specifically, we introduce a single per-step learnable coefficient, estimated via lightweight distillation, which dynamically regulates the error correction strength to accelerate diffusion sampling. Notably, our framework can be integrated with existing solvers to enhance their capabilities. Extensive experiments demonstrate state-of-the-art performance: at 5 NFE, AdaSDE achieves FID scores of 4.18 on CIFAR-10, 8.05 on FFHQ and 6.96 on LSUN Bedroom. Code is available at https://github.com/WLU-wry02/AdaSDE.
TLDR: The paper introduces AdaSDE, a novel single-step SDE solver for diffusion models that balances computational speed and sample quality by using a learnable coefficient to dynamically regulate error correction strength. It achieves state-of-the-art FID scores at low NFEs.
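The ODE/SDE trade-off the abstract describes can be sketched with a gamma-modulated reverse step. This is a hedged illustration of the general idea (the coefficient names, schedules, and update rule are assumptions, not AdaSDE's exact formulation): gamma = 0 recovers a deterministic ODE-like Euler step, while gamma = 1 re-injects the full SDE-level noise.

```python
import numpy as np

def gamma_modulated_step(x, score, sigma, sigma_next, gamma, rng):
    """Illustrative reverse step with a per-step coefficient gamma that
    interpolates between ODE-like (gamma=0) and SDE-like (gamma=1)
    behavior. Not the paper's exact solver."""
    d = -sigma * score                       # probability-flow direction
    dt = sigma_next - sigma                  # negative step toward sigma_next
    x_ode = x + d * dt                       # deterministic Euler update
    noise_std = gamma * np.sqrt(max(sigma**2 - sigma_next**2, 0.0))
    return x_ode + noise_std * rng.standard_normal(x.shape)
```

In AdaSDE's setting, such a coefficient would be learned per step via lightweight distillation rather than fixed by hand.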
AutoRegressive (AR) models have demonstrated competitive performance in image generation, achieving results comparable to those of diffusion models. However, their token-by-token image generation mechanism remains computationally intensive, and existing solutions such as VAR often lead to limited sample diversity. In this work, we propose the Nested AutoRegressive (NestAR) model, which employs nested autoregressive architectures for image generation. NestAR designs multi-scale modules in a hierarchical order. These modules of different scales are arranged in an AR architecture, where each larger-scale module is conditioned on the outputs of the previous smaller-scale module. Within each module, NestAR uses another AR structure to generate "patches" of tokens. The proposed nested AR architecture reduces the overall complexity from O(n) to O(log n) in generating n image tokens, and also increases image diversity. NestAR further incorporates a flow matching loss to use continuous tokens, and develops objectives to coordinate the multi-scale modules during training. NestAR achieves competitive image generation performance while significantly lowering computational cost.
TLDR: The paper introduces NestAR, a nested autoregressive model for image generation that reduces computational complexity and increases image diversity by using multi-scale modules in a hierarchical AR architecture and incorporating flow matching loss.
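The O(log n) claim follows from a geometric scale schedule. A minimal sketch, assuming (purely for intuition, not as the paper's exact scheme) that each scale multiplies the token count by a fixed branching factor: the number of sequential scale-level AR steps then grows logarithmically in the final token count.

```python
def nested_ar_schedule(n_tokens, branch=4):
    """Hypothetical scale schedule: token counts grow geometrically across
    scales, so reaching n_tokens takes O(log n) sequential scales rather
    than O(n) token-by-token steps."""
    counts, c = [], 1
    while c < n_tokens:
        counts.append(c)
        c *= branch                 # each scale multiplies the token count
    counts.append(n_tokens)
    return counts                   # length is O(log n_tokens)
```

Within each scale, NestAR still runs an inner AR loop over token patches, but the patch size is bounded, so the scale count dominates the sequential cost.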
While recent text-to-video models excel at generating diverse scenes, they struggle with precise motion control, particularly for complex, multi-subject motions. Although methods for single-motion customization have been developed to address this gap, they fail in compositional scenarios due to two primary challenges: motion-appearance entanglement and ineffective multi-motion blending. This paper introduces CoMo, a novel framework for compositional motion customization in text-to-video generation, enabling the synthesis of multiple, distinct motions within a single video. CoMo addresses these issues through a two-phase approach. First, in the single-motion learning phase, a static-dynamic decoupled tuning paradigm disentangles motion from appearance to learn a motion-specific module. Second, in the multi-motion composition phase, a plug-and-play divide-and-merge strategy composes these learned motions without additional training by spatially isolating their influence during the denoising process. To facilitate research in this new domain, we also introduce a new benchmark and a novel evaluation metric designed to assess multi-motion fidelity and blending. Extensive experiments demonstrate that CoMo achieves state-of-the-art performance, significantly advancing the capabilities of controllable video generation. Our project page is at https://como6.github.io/.
TLDR: The paper introduces CoMo, a framework for compositional motion customization in text-to-video generation that enables the synthesis of multiple, distinct motions within a single video using a static-dynamic decoupled tuning paradigm and a divide-and-merge strategy. It also provides a new benchmark and evaluation metric.
Recent text-to-image models have revolutionized image generation, but they still struggle with maintaining concept consistency across generated images. While existing works focus on character consistency, they often overlook the crucial role of scenes in storytelling, which restricts their creativity in practice. This paper introduces scene-oriented story generation, addressing two key challenges: (i) scene planning, where current methods fail to ensure scene-level narrative coherence because they rely solely on text descriptions, and (ii) scene consistency, i.e., maintaining a consistent scene across multiple stories, which remains largely unexplored. We propose SceneDecorator, a training-free framework that employs VLM-Guided Scene Planning to ensure narrative coherence across different scenes in a "global-to-local" manner, and Long-Term Scene-Sharing Attention to maintain long-term scene consistency and subject diversity across generated stories. Extensive experiments demonstrate the superior performance of SceneDecorator, highlighting its potential to unleash creativity in the fields of arts, films, and games.
TLDR: The paper introduces SceneDecorator, a training-free framework for scene-oriented story generation that addresses scene planning and scene consistency issues in text-to-image models, leading to improved narrative coherence and subject diversity.
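One plausible reading of "scene-sharing attention" is attention over a cache of earlier scene features. The sketch below is an assumption about the mechanism, not the paper's implementation: keys and values from previously generated scenes are concatenated with the current ones, so the current generation can attend to long-term scene context.

```python
import numpy as np

def scene_sharing_attention(q, k, v, bank_k, bank_v):
    """Illustrative scaled dot-product attention where the current keys
    and values are augmented with cached ones from earlier scenes
    (bank_k, bank_v). Names and wiring are assumptions for exposition."""
    K = np.concatenate([k, bank_k], axis=0)     # current + cached keys
    V = np.concatenate([v, bank_v], axis=0)     # current + cached values
    logits = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```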
Driving scene generation is a critical domain for autonomous driving, enabling downstream applications, including perception and planning evaluation. Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities; however, their performance heavily depends on annotated occupancy data, which remains scarce. To overcome this limitation, we curate Nuplan-Occ, the largest semantic occupancy dataset to date, constructed from the widely used Nuplan benchmark. Its scale and diversity facilitate not only large-scale generative modeling but also downstream autonomous driving applications. Based on this dataset, we develop a unified framework that jointly synthesizes high-quality semantic occupancy, multi-view videos, and LiDAR point clouds. Our approach incorporates a spatio-temporal disentangled architecture to support high-fidelity spatial expansion and temporal forecasting of 4D dynamic occupancy. To bridge modal gaps, we further propose two novel techniques: a Gaussian splatting-based sparse point map rendering strategy that enhances multi-view video generation, and a sensor-aware embedding strategy that explicitly models LiDAR sensor properties for realistic multi-LiDAR simulation. Extensive experiments demonstrate that our method achieves superior generation fidelity and scalability compared to existing approaches, and validate its practical value in downstream tasks. Repo: https://github.com/Arlo0o/UniScene-Unified-Occupancy-centric-Driving-Scene-Generation/tree/v2
TLDR: The paper introduces Nuplan-Occ, a large-scale semantic occupancy dataset for autonomous driving, and a unified framework for generating high-quality semantic occupancy, multi-view videos, and LiDAR point clouds, outperforming existing methods.
Recent advances in training-free video editing have enabled lightweight and precise cross-frame generation by leveraging pre-trained text-to-image diffusion models. However, existing methods often rely on heuristic frame selection to maintain temporal consistency during DDIM inversion, which introduces manual bias and reduces the scalability of end-to-end inference. In this paper, we propose VALA (Variational Alignment for Latent Anchors), a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing. To learn meaningful assignments, VALA proposes a variational framework with a contrastive learning objective. It can therefore transform cross-frame latent representations into compressed latent anchors that preserve both content and temporal coherence. Our method can be fully integrated into training-free text-to-image based video editing models. Extensive experiments on real-world video editing benchmarks show that VALA achieves state-of-the-art performance in inversion fidelity, editing quality, and temporal consistency, while offering improved efficiency over prior methods.
TLDR: The paper introduces VALA, a variational alignment module for training-free video editing that adaptively selects key frames and compresses latent features into semantic anchors, improving temporal consistency. It claims state-of-the-art performance in inversion fidelity, editing quality, and temporal consistency.
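A contrastive objective for frame-to-anchor assignment can take a generic InfoNCE form. The sketch below is an assumed formulation for intuition (VALA's actual variational objective differs): each frame latent should score high against its assigned anchor and low against the others.

```python
import numpy as np

def frame_anchor_contrastive_loss(frames, anchors, assign, tau=0.1):
    """Generic InfoNCE-style loss over cosine similarities: pulls each
    frame latent toward its assigned anchor (assign[i]) and pushes it
    away from other anchors. Illustrative, not VALA's exact objective."""
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    logits = f @ a.T / tau                          # scaled cosine similarity
    logZ = np.log(np.exp(logits).sum(axis=1))       # log partition per frame
    return float(np.mean(logZ - logits[np.arange(len(f)), assign]))
```

A correct assignment yields a near-zero loss, while a mismatched assignment is heavily penalized, which is what drives the anchors toward semantically meaningful clusters of frames.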
Training-free video editing (VE) models tend to fall back on gender stereotypes when rendering profession-related prompts. We propose FAME (Fairness-aware Attention-modulated video Editing), which mitigates profession-related gender biases while preserving prompt alignment and temporal consistency for coherent VE. We derive fairness embeddings from existing minority representations by softly injecting debiasing tokens into the text encoder. Simultaneously, FAME integrates fairness modulation into both temporal self-attention and prompt-to-region cross-attention to mitigate the motion corruption and temporal inconsistency caused by directly introducing fairness cues. For temporal self-attention, FAME introduces a region-constrained attention mask combined with time-decay weighting, which enhances intra-region coherence while suppressing irrelevant inter-region interactions. For cross-attention, it reweights token-to-region matching scores by incorporating fairness-sensitive similarity masks derived from debiasing prompt embeddings. Together, these modulations keep fairness-sensitive semantics tied to the correct visual regions and prevent temporal drift across frames. Extensive experiments on the new fairness-oriented VE benchmark FairVE demonstrate that FAME achieves stronger fairness alignment and semantic fidelity, surpassing existing VE baselines.
TLDR: The paper introduces FAME, a training-free video editing method that reduces gender bias in profession-related prompts by modulating attention mechanisms and incorporating fairness embeddings, resulting in improved fairness and semantic quality. They also introduce a new benchmark FairVE.
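The region-constrained mask with time-decay weighting can be sketched concretely. The construction below is an assumption about the general shape (the paper's exact weighting may differ): attention between two positions is allowed only within the same region, and is attenuated exponentially as their frame-time gap grows.

```python
import numpy as np

def region_time_decay_mask(regions, times, decay=0.5):
    """Illustrative attention mask: entry (i, j) is nonzero only when
    positions i and j share a region, scaled by exp(-decay * |t_i - t_j|)
    so that distant frames interact less. Assumed form, for exposition."""
    same_region = (regions[:, None] == regions[None, :]).astype(float)
    time_decay = np.exp(-decay * np.abs(times[:, None] - times[None, :]))
    return same_region * time_decay
```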
Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.
TLDR: The paper introduces LightBagel, a lightweight framework that fuses pre-trained unimodal models for unified multimodal understanding and generation using interleaved multimodal self-attention, achieving strong results on text-to-image and image editing benchmarks with limited training.
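The interleaving idea is simple to express schematically. The wiring below is an assumption for illustration: the original pretrained blocks are kept as-is, and a newly added multimodal self-attention block runs after each of them.

```python
def interleaved_forward(x, base_blocks, fusion_blocks):
    """Schematic of interleaved fusion: each retained base block is
    followed by an inserted multimodal self-attention block. The exact
    placement and block internals are assumptions, not the paper's spec."""
    for base, fuse in zip(base_blocks, fusion_blocks):
        x = base(x)   # original pretrained block, weights retained
        x = fuse(x)   # newly interleaved multimodal fusion block
    return x
```

Because the base blocks are untouched, only the interleaved blocks need training, which is consistent with the abstract's claim that strong results are reachable with a comparatively small token budget.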
We propose a task-agnostic framework for multimodal fusion of time series and single-timestamp images, enabling cross-modal generation and robust downstream performance. Our approach explores deterministic and learned strategies for time series quantization and then leverages a masked correlation learning objective, aligning discrete image and time series tokens in a unified representation space. Instantiated in the Earth observation domain, the pretrained model generates consistent global temperature profiles from satellite imagery and is validated through counterfactual experiments. Across downstream tasks, our task-agnostic pretraining outperforms task-specific fusion by 6% in R² and 2% in RMSE on average, and exceeds baseline methods by 50% in R² and 12% in RMSE. Finally, we analyze gradient sensitivity across modalities, providing insights into model robustness. Code, data, and weights will be released under a permissive license.
TLDR: The paper proposes a task-agnostic multimodal fusion framework for Earth observation, combining time series and single-timestamp imagery using masked correlation learning, demonstrating improved performance in downstream tasks and cross-modal generation of temperature profiles.
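A deterministic quantization strategy of the kind the abstract mentions can be as simple as uniform binning. This is one plausible scheme, not necessarily the paper's: the series is min-max scaled and each value mapped to a uniform-width bin index, yielding discrete tokens that can be aligned with image tokens.

```python
import numpy as np

def quantize_series(ts, n_bins=16):
    """Illustrative deterministic quantizer: min-max scale the series to
    [0, 1], then map each value to one of n_bins uniform-width bin-index
    tokens. Assumed scheme for exposition."""
    lo, hi = ts.min(), ts.max()
    norm = (ts - lo) / (hi - lo + 1e-12)          # avoid divide-by-zero
    return np.clip((norm * n_bins).astype(int), 0, n_bins - 1)
```

Learned alternatives (e.g. a VQ-style codebook) would replace the fixed bin edges with trainable code vectors while producing the same kind of discrete token stream.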
Diffusion bridge models establish probabilistic paths between arbitrary paired distributions and exhibit great potential for universal image restoration. Most existing methods merely treat them as simple variants of stochastic interpolants, lacking a unified analytical perspective. Moreover, they indiscriminately reconstruct images through global noise injection and removal, inevitably distorting undegraded regions due to imperfect reconstruction. To address these challenges, we propose the Residual Diffusion Bridge Model (RDBM). Specifically, we theoretically reformulate the stochastic differential equations of the generalized diffusion bridge and derive the analytical formulas of its forward and reverse processes. Crucially, we leverage the residuals between the given distributions to modulate the noise injection and removal, enabling adaptive restoration of degraded regions while preserving intact ones. Moreover, we unravel the fundamental mathematical essence of existing bridge models, showing that all of them are special cases of RDBM, and empirically demonstrate the optimality of our proposed model. Extensive experiments demonstrate the state-of-the-art performance of our method both qualitatively and quantitatively across diverse image restoration tasks. Code is publicly available at https://github.com/MiliLab/RDBM.
TLDR: This paper introduces the Residual Diffusion Bridge Model (RDBM) for image restoration, reformulating diffusion bridges with a focus on residual-based noise modulation to improve reconstruction while preserving intact image regions. The authors claim state-of-the-art performance across diverse image restoration tasks.
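The residual-modulated idea can be conveyed with a toy forward step. The schedules and exact form below are hypothetical (RDBM derives its processes analytically from reformulated SDEs): the perturbation is scaled by the residual y - x0 between the degraded and clean images, so regions that are already intact receive essentially no noise.

```python
import numpy as np

def residual_bridge_forward(x0, y, t, rng):
    """Toy residual-modulated forward step. alpha_t and sigma_t are
    hypothetical schedules; the key point is that both the drift toward y
    and the injected noise vanish wherever the residual y - x0 is zero."""
    residual = y - x0
    alpha_t, sigma_t = t, 0.1 * t            # placeholder schedules
    noise = rng.standard_normal(x0.shape)
    return x0 + alpha_t * residual + sigma_t * np.abs(residual) * noise
```

This contrasts with global noise injection, where even undegraded pixels are perturbed and must be imperfectly reconstructed on the reverse pass.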