Daily papers related to Image/Video/Multimodal Generation from cs.CV
January 27, 2026
Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this with external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine-grained motion. In this work, we present self-refining video sampling, a simple method that uses a pre-trained video generator, trained on large-scale datasets, as its own self-refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner-loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, preventing artifacts caused by over-refinement. Experiments on state-of-the-art video generators demonstrate significant improvements in motion coherence and physics alignment, achieving over 70% human preference over both the default sampler and a guidance-based sampler.
TLDR: The paper introduces a self-refining video sampling method that leverages a pre-trained video generator as its own self-refiner to improve motion coherence and physics alignment, achieving significant human preference over existing samplers.
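The uncertainty-aware inner loop described above can be sketched as follows. This is a minimal illustration only: the `denoise` stand-in, array shapes, number of stochastic samples `k`, and the variance threshold `tau` are our assumptions, not the paper's actual sampler.

```python
import numpy as np

def denoise(x, rng):
    # Stand-in for a pre-trained video denoiser: a noisy pull toward
    # zero, so the sketch is runnable (hypothetical dynamics).
    return 0.5 * x + 0.05 * rng.standard_normal(x.shape)

def self_refine(x, n_iters=3, k=4, tau=0.01, seed=0):
    """Uncertainty-aware self-refinement sketch: refine only regions
    where repeated denoisings agree (self-consistency)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        # Draw k stochastic denoisings of the current sample.
        samples = np.stack([denoise(x, rng) for _ in range(k)])
        mean = samples.mean(axis=0)
        var = samples.var(axis=0)      # per-element disagreement
        mask = var < tau               # refine only consistent regions
        x = np.where(mask, mean, x)    # leave uncertain regions untouched
    return x
```

The mask is what guards against over-refinement: elements where the denoiser's outputs disagree are simply left alone.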
Talking head generation aims to synthesize natural-looking talking videos from speech and a single portrait image. Previous 3D talking head generation methods have relied on domain-specific heuristics, such as warping-based facial motion representation priors, to animate talking motions, yet still produce inaccurate 3D avatar reconstructions, undermining the realism of the generated animations. We introduce Splat-Portrait, a Gaussian-splatting-based method that addresses the challenges of 3D head reconstruction and lip motion synthesis. Our approach automatically learns to disentangle a single portrait image into a static 3D reconstruction, represented as static Gaussian splatting, and a predicted whole-image 2D background. It then generates natural lip motion conditioned on input audio, without any motion-driven priors. Training is driven purely by 2D reconstruction and score-distillation losses, without 3D supervision or landmarks. Experimental results demonstrate that Splat-Portrait achieves superior performance on talking head generation and novel view synthesis, with better visual quality than previous works. Our project code and supplementary documents are publicly available at https://github.com/stonewalking/Splat-portrait.
TLDR: Splat-Portrait introduces a Gaussian splatting based talking head generation method that disentangles a portrait into a static 3D reconstruction and a 2D background, then generates lip motion conditioned on audio, achieving superior visual quality without 3D supervision.
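As a rough illustration of the foreground/background disentanglement above, a splatted 3D foreground (color plus alpha) can be composed over the predicted whole-image 2D background with standard alpha blending. The function name and array shapes here are our own, not Splat-Portrait's API.

```python
import numpy as np

def composite(fg_rgb, fg_alpha, bg_rgb):
    """Alpha-composite a rendered foreground over a 2D background.

    fg_rgb:   (H, W, 3) color rendered from the Gaussian splats
    fg_alpha: (H, W)    per-pixel foreground coverage in [0, 1]
    bg_rgb:   (H, W, 3) predicted whole-image background
    """
    a = fg_alpha[..., None]            # (H, W, 1), broadcasts over RGB
    return a * fg_rgb + (1.0 - a) * bg_rgb
```

Because both terms are differentiable, a 2D reconstruction loss on the composite can drive both the splats and the background prediction, which is consistent with the abstract's claim of training without 3D supervision.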
Fine-tuning-based adaptation is widely used to customize diffusion-based image generation, leading to large collections of community-created adapters that capture diverse subjects and styles. Adapters derived from the same base model can be merged via weighted combination, enabling the synthesis of new visual results within a vast and continuous design space. To explore this space, current workflows rely on manual slider-based tuning, an approach that scales poorly and makes weight selection difficult even when the candidate set is limited to 20-30 adapters. We propose GimmBO, which supports interactive exploration of adapter merging for image generation through Preferential Bayesian Optimization (PBO). Motivated by observations from real-world usage, including sparsity and constrained weight ranges, we introduce a two-stage BO backend that improves sampling efficiency and convergence in high-dimensional spaces. We evaluate our approach with simulated users and a user study, demonstrating improved convergence, high success rates, and consistent gains over BO and line-search baselines, and further show the flexibility of the framework through several extensions.
TLDR: The paper introduces GimmBO, a Preferential Bayesian Optimization (PBO) based method for interactive exploration of adapter merging in diffusion-based image generation, improving efficiency and convergence in high-dimensional weight spaces.
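To make the preference-query setting concrete, here is a toy preference-driven search over adapter merge weights. It encodes the two observations the abstract mentions, sparsity and a constrained weight range, but it is a naive baseline of the kind GimmBO is compared against, not the paper's two-stage PBO backend; all names and defaults are illustrative.

```python
import numpy as np

def preferential_search(prefer, dim, n_rounds=200, sparsity=0.2,
                        lo=0.0, hi=1.0, seed=0):
    """Toy pairwise-preference search over merge weights in [lo, hi]^dim.

    prefer(a, b) -> True if the user prefers the image rendered with
    weights `a` over the one rendered with weights `b`.
    """
    rng = np.random.default_rng(seed)
    best = np.zeros(dim)               # start from the base model alone
    for _ in range(n_rounds):
        cand = best.copy()
        idx = rng.random(dim) < sparsity          # perturb a sparse subset
        cand[idx] = rng.uniform(lo, hi, idx.sum())
        if prefer(cand, best):                    # one pairwise query
            best = cand
    return best
```

A PBO backend replaces the blind accept/reject step with a preference-based surrogate model that proposes candidates, which is where the sampling-efficiency gains reported in the abstract come from.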
We introduce GenAgent, which unifies visual understanding and generation through an agentic multimodal model. Unlike unified models, which face expensive training costs and understanding-generation trade-offs, GenAgent decouples these capabilities through an agentic framework: understanding is handled by the multimodal model itself, while generation is achieved by treating image generation models as invokable tools. Crucially, unlike existing modular systems constrained by static pipelines, this design enables autonomous multi-turn interactions in which the agent generates multimodal chains-of-thought encompassing reasoning, tool invocation, judgment, and reflection to iteratively refine outputs. We employ a two-stage training strategy: first, a cold start with supervised fine-tuning on high-quality tool invocation and reflection data to bootstrap agent behaviors; second, end-to-end agentic reinforcement learning combining pointwise rewards (final image quality) and pairwise rewards (reflection accuracy), with trajectory resampling for enhanced multi-turn exploration. GenAgent significantly boosts base generator (FLUX.1-dev) performance on GenEval++ (+23.6%) and WISE (+14%). Beyond performance gains, our framework demonstrates three key properties: 1) cross-tool generalization to generators with varying capabilities, 2) test-time scaling with consistent improvements across interaction rounds, and 3) task-adaptive reasoning that automatically adjusts to different tasks. Our code will be available at https://github.com/deep-kaixun/GenAgent.
TLDR: GenAgent introduces an agentic framework for scaling text-to-image generation by decoupling visual understanding and generation, using an agent to refine outputs through multi-turn interactions and reinforcement learning, achieving significant performance improvements and generalization capabilities.
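The multi-turn interaction pattern described above (generate, judge, reflect, retry) can be sketched as a small control loop. The callables `generate`, `judge`, and `reflect` are caller-supplied stand-ins for the image-generation tool, the quality judge, and the reflection step; the stopping threshold and round limit are arbitrary, and none of this reflects GenAgent's trained policy.

```python
def agentic_generate(prompt, generate, judge, reflect,
                     max_rounds=5, threshold=0.9):
    """Generate-judge-reflect loop in the spirit of an agentic framework.

    generate(prompt)              -> image (tool invocation)
    judge(prompt, image)          -> score in [0, 1]
    reflect(prompt, image, score) -> revised prompt
    """
    image = generate(prompt)
    for _ in range(max_rounds):
        score = judge(prompt, image)
        if score >= threshold:                  # good enough: stop early
            break
        prompt = reflect(prompt, image, score)  # revise the instruction
        image = generate(prompt)                # re-invoke the tool
    return image
```

The round limit is also where the test-time-scaling property shows up: allowing more interaction rounds gives the agent more chances to refine the output.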