AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

March 02, 2026

FREE-Edit: Using Editing-aware Injection in Rectified Flow Models for Zero-shot Image-Driven Video Editing

Image-driven video editing aims to propagate edits from a modified first frame to the remaining frames. Existing methods usually invert the source video to noise using a pre-trained image-to-video (I2V) model and then guide the sampling process with the edited first frame. A popular choice for maintaining the motion and layout of the source video is to intervene in the denoising process by injecting attention during reconstruction. However, such injection often yields unsatisfactory results: excessive injection introduces conflicting semantics from the source video, while insufficient injection preserves too little of the source content. Recognizing this, we propose an Editing-awaRE (REE) injection method that modulates the injection intensity of each token. Specifically, we first compute the pixel difference between the source and edited first frames to form a corresponding editing mask. Next, we track the edited area throughout the video by warping the first-frame mask with optical flow. An editing-aware feature injection intensity is then generated for each token accordingly, with no injection applied in edited areas. Building upon REE injection, we further propose a zero-shot image-driven video editing framework with recently emerging rectified-Flow models, dubbed FREE-Edit. Without fine-tuning or training, FREE-Edit proves effective across various image-driven video editing scenarios, producing higher-quality outputs than existing techniques. Project page: https://free-edit.github.io/page/.
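The mask-then-warp pipeline above can be illustrated with a minimal sketch. Everything below (the difference threshold, the nearest-neighbour flow warp, the pooling to token resolution) is a hypothetical simplification for illustration, not the authors' implementation; the optical flow is assumed to come from an external estimator.

```python
import numpy as np

def editing_mask(src_frame, edited_frame, thresh=0.05):
    """Binary mask of edited pixels from the per-pixel difference (assumed uint8 HxWx3 frames)."""
    diff = np.abs(src_frame.astype(np.float32) - edited_frame.astype(np.float32)).mean(axis=-1)
    return (diff > thresh * 255).astype(np.float32)

def warp_mask(mask, flow):
    """Warp an HxW mask to the next frame via backward optical flow (nearest-neighbour sampling)."""
    h, w = mask.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return mask[src_y, src_x]

def injection_weights(mask, token_hw):
    """Per-token injection intensity: 0 on edited tokens, 1 elsewhere (assumes H, W divisible by the token grid)."""
    th, tw = token_hw
    h, w = mask.shape
    pooled = mask.reshape(th, h // th, tw, w // tw).mean(axis=(1, 3))
    return 1.0 - (pooled > 0.5).astype(np.float32)
```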

TLDR: The paper introduces FREE-Edit, a zero-shot image-driven video editing framework that leverages rectified flow models and an Editing-awaRE injection method to improve the quality and consistency of video edits based on a modified first frame.

Relevance: (8/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Maomao Li, Yunfei Liu, Yu Li

Unified Vision-Language Modeling via Concept Space Alignment

We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al., 2024), which operates in SONAR and is trained with English text only, can perform both single- and multi-visual-concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embeddings via V-SONAR and SONAR, and is trained with the same latent-diffusion objective for next-embedding prediction as in the LCM's text-only pre-training. Experiments on a large-scale multilingual and multimodal instruction-tuning data mixture highlight the potential of V-LCM: it matches state-of-the-art vision-language models on image/video captioning and question answering, while significantly outperforming them on 61 of the 62 tested languages, spanning rich- to low-resource.
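A rough sketch of the post-hoc alignment idea (a small adapter pulling frozen vision-encoder outputs toward paired SONAR text embeddings) is shown below; the adapter architecture, dimensions, and MSE objective are assumptions for illustration, not the paper's recipe.

```python
import torch
import torch.nn as nn

class VisionToSonarAdapter(nn.Module):
    """Hypothetical projection head mapping a frozen vision encoder's outputs into the SONAR space."""
    def __init__(self, vision_dim=1024, sonar_dim=1024, hidden=2048):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vision_dim, hidden), nn.GELU(), nn.Linear(hidden, sonar_dim))

    def forward(self, vision_emb):
        return self.proj(vision_emb)

def alignment_step(adapter, optimizer, vision_emb, sonar_text_emb):
    """One training step pulling projected vision embeddings toward paired SONAR text embeddings."""
    loss = nn.functional.mse_loss(adapter(vision_emb), sonar_text_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```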

TLDR: The paper introduces V-SONAR, a vision-language embedding space aligned with the SONAR text embedding space, and V-LCM, a vision-language model built upon this space. V-LCM demonstrates strong multilingual performance across various vision-language tasks, particularly in low-resource languages.

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yifu Qiu, Paul-Ambroise Duquenne, Holger Schwenk

LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model

We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building upon MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.
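The mixture-of-diffusion coupling described above (discrete masked diffusion for text, continuous diffusion for visual latents, one shared backbone) can be sketched roughly as follows; the architecture, corruption rates, and dimensions are illustrative assumptions rather than the released model.

```python
import torch
import torch.nn as nn

class MixtureOfDiffusionStub(nn.Module):
    """Toy routing of text and image streams through one shared transformer backbone."""
    def __init__(self, vocab=32000, dim=512, latent_ch=16, mask_id=0):
        super().__init__()
        self.mask_id = mask_id
        self.text_emb = nn.Embedding(vocab, dim)
        self.img_in = nn.Linear(latent_ch, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # shared across modalities
        self.text_head = nn.Linear(dim, vocab)      # predicts masked tokens (discrete diffusion)
        self.img_head = nn.Linear(dim, latent_ch)   # predicts noise (continuous diffusion)

    def forward(self, text_ids, img_latents, mask_prob=0.3, sigma=0.5):
        # Discrete corruption: randomly replace text tokens with a mask token.
        drop = torch.rand(text_ids.shape, device=text_ids.device) < mask_prob
        masked = torch.where(drop, torch.full_like(text_ids, self.mask_id), text_ids)
        # Continuous corruption: add Gaussian noise to image latents.
        noisy = img_latents + sigma * torch.randn_like(img_latents)
        h = self.backbone(torch.cat([self.text_emb(masked), self.img_in(noisy)], dim=1))
        n_txt = text_ids.shape[1]
        return self.text_head(h[:, :n_txt]), self.img_head(h[:, n_txt:])
```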

TLDR: LLaDA-o is a novel omni-diffusion model using a Mixture of Diffusion framework and a length adaptation strategy, achieving state-of-the-art results in multimodal understanding and generation.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, Ji-Rong Wen

GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis

Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data transitions, which obscure deterministic structures and yield inconsistent view predictions. We propose a Data-to-Data Flow Matching framework that learns deterministic transformations directly between paired views, enhancing view-consistent synthesis through explicit data coupling. To further enhance geometric coherence, we introduce Probability Density Geodesic Flow Matching (PDG-FM), which constrains flow trajectories using geodesic interpolants derived from probability density metrics of pretrained diffusion models. Such alignment with high-density regions of the data manifold promotes more realistic interpolants between samples. Empirically, our method surpasses diffusion-based NVS baselines, demonstrating improved structural coherence and smoother transitions across views. These results highlight the advantages of incorporating data-dependent geometric regularization into deterministic flow matching for consistent novel view generation.
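As a rough illustration of data-to-data flow matching between paired views, here is a minimal loss sketch; the linear interpolant stands in for the paper's probability-density geodesic interpolant, and all names and shapes are assumptions.

```python
import torch

def data_to_data_fm_loss(model, x_src, x_tgt, interpolant=None):
    """Flow-matching loss between paired views; `model(x_t, t)` is an assumed velocity predictor."""
    b = x_src.shape[0]
    t = torch.rand(b, 1, 1, 1, device=x_src.device)    # assumes image-shaped (B, C, H, W) inputs
    if interpolant is None:
        x_t = (1 - t) * x_src + t * x_tgt               # straight-line path between the two views
        target_v = x_tgt - x_src                        # its constant velocity
    else:
        x_t, target_v = interpolant(x_src, x_tgt, t)    # a geodesic interpolant would plug in here
    pred_v = model(x_t, t.flatten())
    return torch.mean((pred_v - target_v) ** 2)
```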

TLDR: This paper introduces GeodesicNVS, a Data-to-Data Flow Matching framework enhanced with Probability Density Geodesic Flow Matching (PDG-FM) for improved view-consistent novel view synthesis, surpassing diffusion-based methods.

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Xuqin Wang, Tao Wu, Yanfeng Zhang, Lu Liu, Mingwei Sun, Yongliang Wang, Niclas Zeller, Daniel Cremers

Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer

Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens only influence their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and progressively propagates them across frames efficiently. Extensive experiments demonstrate that FlexiMMT achieves precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer.
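A bare-bones version of the mask-constrained attention idea (each image-token query may only attend to the motion/text tokens assigned to its own object) might look like the following; shapes, names, and the additive masking trick are assumptions for illustration.

```python
import torch

def mask_constrained_attention(q, k, v, key_object_ids, query_object_ids):
    """q: (B, Nq, D) image-token queries; k, v: (B, Nk, D) motion/text tokens;
    key_object_ids: (B, Nk) and query_object_ids: (B, Nq) give the object each token belongs to.
    Assumes every query region has at least one key with a matching object id."""
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("bqd,bkd->bqk", q, k) * scale
    allowed = query_object_ids.unsqueeze(-1) == key_object_ids.unsqueeze(1)  # (B, Nq, Nk)
    logits = logits.masked_fill(~allowed, float("-inf"))                     # block cross-object attention
    return torch.einsum("bqk,bkd->bqd", logits.softmax(dim=-1), v)
```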

TLDR: The paper introduces FlexiMMT, a novel image-to-video motion transfer framework that enables multi-object, multi-motion transfer by decoupling motion and assigning it to different objects using mask attention mechanisms.

Relevance: (9/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yuze Li, Dong Gong, Xiao Cao, Junchao Yuan, Dongsheng Li, Lei Zhou, Yun Sing Koh, Cheng Yan, Xinyu Zhang

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce ARC (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external reward supervision with an internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes. ARC converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models. Empirically, by reinforcing high-confidence generations, ARC delivers consistent gains in compositional generation, text rendering and text-image alignment over the baseline. We also find that integrating ARC with external rewards results in a complementary improvement, with alleviated reward hacking.
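The self-confidence probe can be sketched as follows: inject known noise into a generated image, ask the model to recover it, and turn the recovery error into a scalar reward. The probe count, noise level, negative-MSE reward, and the `eps_pred_fn` handle are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def self_confidence_reward(eps_pred_fn, x0, num_probes=4, sigma=0.3):
    """Intrinsic reward for a batch of generated images x0 with shape (B, C, H, W)."""
    rewards = []
    for _ in range(num_probes):
        eps = torch.randn_like(x0)                       # injected noise
        eps_hat = eps_pred_fn(x0 + sigma * eps, sigma)   # model's attempt to recover it
        rewards.append(-torch.mean((eps_hat - eps) ** 2, dim=(1, 2, 3)))
    return torch.stack(rewards).mean(dim=0)              # one scalar reward per image
```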

TLDR: The paper introduces ARC, a post-training framework for text-to-image generation that uses an intrinsic self-confidence signal derived from denoising accuracy to improve compositional generation, text rendering, and text-image alignment without external data or reward models.

Relevance: (8/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Seungwook Kim, Minsu Cho

Analyzing and Improving Fast Sampling of Text-to-Image Diffusion Models

Text-to-image diffusion models have achieved unprecedented success but still struggle to produce high-quality results under limited sampling budgets. Existing training-free sampling acceleration methods are typically developed independently, leaving the overall performance and compatibility among these methods unexplored. In this paper, we bridge this gap by systematically elucidating the design space, and our comprehensive experiments identify the sampling time schedule as the most pivotal factor. Inspired by the geometric properties of diffusion models revealed through the Frenet-Serret formulas, we propose constant total rotation schedule (TORS), a scheduling strategy that ensures uniform geometric variation along the sampling trajectory. TORS outperforms previous training-free acceleration methods and produces high-quality images with 10 sampling steps on Flux.1-Dev and Stable Diffusion 3.5. Extensive experiments underscore the adaptability of our method to unseen models, hyperparameters, and downstream applications.
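One simplified reading of a "constant total rotation" schedule is: measure how much the trajectory tangent rotates along a dense reference run, then place the few sampling steps so each covers an equal share of that rotation. The sketch below assumes such per-step angles are already measured; it is an interpretation, not the paper's algorithm.

```python
import numpy as np

def constant_rotation_schedule(dense_ts, step_angles, num_steps):
    """dense_ts: reference timesteps (len N); step_angles: tangent rotation between
    consecutive dense steps (len N-1). Returns num_steps+1 times with equal rotation per step."""
    cum = np.concatenate([[0.0], np.cumsum(step_angles)])   # cumulative rotation at each dense time
    targets = np.linspace(0.0, cum[-1], num_steps + 1)      # equal rotation increments
    return np.interp(targets, cum, dense_ts)                # invert rotation -> time by interpolation
```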

TLDR: This paper analyzes and improves the sampling efficiency of text-to-image diffusion models by proposing a novel sampling schedule (TORS) based on Frenet-Serret formulas, achieving high-quality image generation with fewer sampling steps.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Zhenyu Zhou, Defang Chen, Siwei Lyu, Chun Chen, Can Wang

BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling

Face retouching requires removing subtle imperfections while preserving unique facial identity features, in order to enhance overall aesthetic appeal. However, existing methods suffer from a fundamental trade-off. Supervised learning on labeled data is constrained to pixel-level label mimicry, failing to capture complex subjective human aesthetic preferences. Conversely, while online reinforcement learning (RL) excels at preference alignment, its stochastic exploration paradigm conflicts with the high-fidelity demands of face retouching and often introduces noticeable noise artifacts due to accumulated stochastic drift. To address these limitations, we propose BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences. We construct FRPref-10K, a fine-grained preference dataset covering five key retouching dimensions, and train a specialized reward model capable of evaluating subtle perceptual differences. To reconcile exploration and fidelity, we introduce Dynamic Path Guidance (DPG). DPG stabilizes the stochastic sampling trajectory by dynamically computing an anchor-based ODE path and replanning a guided trajectory at each sampling timestep, effectively correcting stochastic drift while maintaining controlled exploration. Extensive experiments show that BeautyGRPO outperforms both specialized face retouching methods and general image editing models, achieving superior texture quality, more accurate blemish removal, and overall results that better align with human aesthetic preferences.
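One way to picture the Dynamic Path Guidance idea is a per-step blend between a stochastic exploration step and a deterministic step along an anchor ODE path, so drift cannot accumulate unchecked. The step functions and the fixed blend weight below are placeholders, not the paper's anchor-based replanning.

```python
def dpg_step(x_t, sde_step_fn, ode_step_fn, t, dt, guidance=0.5):
    """Hypothetical guided update: blend a stochastic step (exploration) toward a deterministic
    ODE step computed from the same state (fidelity anchor)."""
    x_explore = sde_step_fn(x_t, t, dt)   # stochastic sampling step used for RL exploration
    x_anchor = ode_step_fn(x_t, t, dt)    # deterministic step along the anchor ODE path
    return (1.0 - guidance) * x_explore + guidance * x_anchor
```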

TLDR: BeautyGRPO is a reinforcement learning framework for face retouching that combines dynamic path guidance with fine-grained preference modeling to align results with human aesthetic preferences, overcoming limitations of both supervised learning and standard RL approaches. The paper also introduces FRPref-10K, a new dataset of fine-grained retouching preferences.

Relevance: (4/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (6/10)
Read Paper (PDF)

Authors: Jiachen Yang, Xianhui Lin, Yi Dong, Zebiao Zheng, Xing Liu, Hong Gu, Yanmei Fang