ArXiv CS.CV Papers (Image/Video Generation)

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

This paper focuses on the alignment of flow matching models with human preferences. A promising way is fine-tuning by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early generation steps. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitude, instead of completely removing them as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.

TLDR: LeapAlign is a fine-tuning method for flow matching models that reduces computational costs and enables direct gradient propagation from reward to early generation steps, improving image quality and image-text alignment.

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Zhanhao Liang, Tao Yang, Jie Wu, Chengjian Feng, Liang Zheng
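
The two-leap idea in the LeapAlign abstract above can be illustrated with a toy sketch. It is not the authors' implementation: the scalar latent, the Euler-style `leap`, the ascending time direction, the toy velocity fields, and the exponential `consistency_weight` are all assumptions made for illustration. It shows that two leaps reproduce a long trajectory exactly when the flow is straight, and drift from it when the flow is curved, which is the kind of mismatch a consistency-based training weight could penalize.

```python
import math
import random

def leap(x, t_from, t_to, velocity):
    """Jump from t_from to t_to in a single Euler-style step."""
    return x + (t_to - t_from) * velocity(x, t_from)

def multi_step(x, t_from, t_to, velocity, n_steps=200):
    """Reference long trajectory: the same interval in many small Euler steps."""
    dt = (t_to - t_from) / n_steps
    t = t_from
    for _ in range(n_steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x

def two_leaps(x, times, velocity):
    """Connect (t_start, t_mid, t_end) with two consecutive leaps."""
    t_start, t_mid, t_end = times
    return leap(leap(x, t_start, t_mid, velocity), t_mid, t_end, velocity)

def consistency_weight(x_leap, x_long):
    """Hypothetical weight in (0, 1]: closer agreement with the long path scores higher."""
    return math.exp(-abs(x_leap - x_long))

# Randomized start, middle, and end timesteps (seeded for repeatability).
rng = random.Random(0)
times = sorted(rng.random() for _ in range(3))

straight = lambda x, t: 2.0   # toy constant velocity: leaps are exact
curved = lambda x, t: -x      # toy state-dependent velocity: leaps are approximate

w_straight = consistency_weight(
    two_leaps(1.0, times, straight),
    multi_step(1.0, times[0], times[2], straight),
)
w_curved = consistency_weight(
    two_leaps(1.0, times, curved),
    multi_step(1.0, times[0], times[2], curved),
)
print(w_straight, w_curved)
```

In this toy setting the straight flow gets a weight of essentially 1, while the curved flow gets a smaller one because its two-leap endpoint deviates from the long path.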

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.

TLDR: The paper introduces MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that addresses style inconsistency and poor global coherence issues. It outperforms existing methods, especially in multimodal element generation and integration.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Yan Li, Zezi Zeng, Yifan Yang, Yuqing Yang, Ning Liao, Weiwei Guo, Lili Qiu, Mingxi Cheng, Qi Dai, Zhendong Wang, Zhengyuan Yang, Xue Yang, Ji Li, Lijuan Wang, Chong Luo

AnimationBench: Are Video Models Good at Character-Centric Animation?

Video generation has advanced rapidly, with recent methods producing increasingly convincing animated results. However, existing benchmarks, largely designed for realistic videos, struggle to evaluate animation-style generation with its stylized appearance, exaggerated motion, and character-centric consistency. They also rely on fixed prompt sets and rigid pipelines, offering limited flexibility for open-domain content and custom evaluation needs. To address this gap, we introduce AnimationBench, the first systematic benchmark for evaluating animation image-to-video generation. AnimationBench operationalizes the Twelve Basic Principles of Animation and IP Preservation into measurable evaluation dimensions, together with Broader Quality Dimensions including semantic consistency, motion rationality, and camera motion consistency. The benchmark supports both a standardized close-set evaluation for reproducible comparison and a flexible open-set evaluation for diagnostic analysis, and leverages visual-language models for scalable assessment. Extensive experiments show that AnimationBench aligns well with human judgment and exposes animation-specific quality differences overlooked by realism-oriented benchmarks, leading to more informative and discriminative evaluation of state-of-the-art I2V models.

TLDR: AnimationBench is introduced as a new benchmark tailored for evaluating animation image-to-video generation, addressing the limitations of existing benchmarks designed for realistic videos by incorporating animation principles and flexible evaluation setups.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Leyi Wu, Pengjun Fang, Kai Sun, Yazhou Xing, Yinwei Wu, Songsong Wang, Ziqi Huang, Dan Zhou, Yingqing He, Ying-Cong Chen, Qifeng Chen

Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation

Gesture recognition research, unlike NLP, continues to face acute data scarcity, with progress constrained by the need for costly human recordings or image processing approaches that cannot generate authentic variability in the gestures themselves. Recent advancements in image-to-video foundation models have enabled the generation of photorealistic, semantically rich videos guided by natural language. These capabilities open up new possibilities for creating effort-free synthetic data, raising the critical question of whether video Generative AI models can augment and complement traditional human-generated gesture data. In this paper, we introduce and analyze prompt-based video generation to construct a realistic deictic gestures dataset and rigorously evaluate its effectiveness for downstream tasks. We propose a data generation pipeline that produces deictic gestures from a small number of reference samples collected from human participants, providing an accessible approach that can be leveraged both within and beyond the machine learning community. Our results demonstrate that the synthetic gestures not only align closely with real ones in terms of visual fidelity but also introduce meaningful variability and novelty that enrich the original data, further supported by superior performance of various deep models using a mixed dataset. These findings highlight that image-to-video techniques, even in their early stages, offer a powerful zero-shot approach to gesture synthesis with clear benefits for downstream tasks.

TLDR: This paper explores using image-to-video generation to create synthetic deictic gesture datasets, demonstrating their effectiveness in improving downstream task performance and addressing data scarcity in gesture recognition.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Hassan Ali, Doreen Jirak, Luca Müller, Stefan Wermter

Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

Text-driven inversion of generative models is a core paradigm for manipulating 2D or 3D content, unlocking numerous applications such as text-based editing, style transfer, or inverse problems. However, it relies on the assumption that generative models remain sensitive to natural language prompts. We demonstrate that for state-of-the-art native text-to-3D generative models, this assumption often collapses. We identify a critical failure mode where generation trajectories are drawn into latent "sink traps": regions where the model becomes insensitive to prompt modifications. In these regimes, changes to the input text fail to alter internal representations in a way that alters the output geometry. Crucially, we observe that this is not a limitation of the model's geometric expressivity; the same generative models possess the ability to produce a vast diversity of shapes but, as we demonstrate, become insensitive to out-of-distribution text guidance. We investigate this behavior by analyzing the sampling trajectories of the generative model, and find that complex geometries can still be represented and produced by leveraging the model's unconditional generative prior. This leads to a more robust framework for text-based 3D shape editing that bypasses latent sinks by decoupling a model's geometric representation power from its linguistic sensitivity. Our approach addresses the limitations of current 3D pipelines and enables high-fidelity semantic manipulation of out-of-distribution 3D shapes. Project webpage: https://daidedou.sorpi.fr/publication/beyondprompts

TLDR: The paper identifies and addresses a failure mode in text-to-3D generative models where they become insensitive to text prompts for out-of-distribution shapes, proposing a method to decouple geometric representation from linguistic sensitivity for more robust editing.

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Victoria Yue Chen, Emery Pierson, Léopold Maillard, Maks Ovsjanikov

Reward-Aware Trajectory Shaping for Few-step Visual Generation

Achieving high-fidelity generation in extremely few sampling steps has long been a central goal of generative modeling. Existing approaches largely rely on distillation-based frameworks to compress the original multi-step denoising process into a few-step generator. However, such methods inherently constrain the student to imitate a stronger multi-step teacher, imposing the teacher as an upper bound on student performance. We argue that introducing preference alignment awareness enables the student to optimize toward reward-preferred generation quality, potentially surpassing the teacher instead of being restricted to rigid teacher imitation. To this end, we propose Reward-Aware Trajectory Shaping (RATS), a lightweight framework for preference-aligned few-step generation. Specifically, teacher and student latent trajectories are aligned at key denoising stages through horizon matching, while a reward-aware gate is introduced to adaptively regulate teacher guidance based on their relative reward performance. Trajectory shaping is strengthened when the teacher achieves higher rewards, and relaxed when the student matches or surpasses the teacher, thereby enabling continued reward-driven improvement. By seamlessly integrating trajectory distillation, reward-aware gating, and preference alignment, RATS effectively transfers preference-relevant knowledge from high-step generators without incurring additional test-time computational overhead. Experimental results demonstrate that RATS substantially improves the efficiency-quality trade-off in few-step visual generation, significantly narrowing the gap between few-step students and stronger multi-step generators.

TLDR: The paper introduces Reward-Aware Trajectory Shaping (RATS), a novel framework for few-step visual generation that surpasses teacher-student imitation limitations by incorporating preference alignment and reward-aware gating for improved efficiency and quality.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Rui Li, Bingyu Li, Yuanzhi Liang, HuangHai Bin, Chi Zhang, XueLong Li
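
The reward-aware gate described in the RATS abstract above can be pictured with a small sketch. The abstract does not give the gate's functional form, so the sigmoid, the `temperature` parameter, the mean-squared matching loss, and the way the terms are combined in `shaped_loss` are all hypothetical choices, not the paper's definitions. The sketch only demonstrates the stated behavior: teacher guidance is weighted more when the teacher earns the higher reward and less when the student catches up or overtakes it.

```python
import math

def reward_aware_gate(reward_teacher, reward_student, temperature=0.1):
    """Hypothetical gate in (0, 1): larger when the teacher out-scores the student."""
    return 1.0 / (1.0 + math.exp(-(reward_teacher - reward_student) / temperature))

def trajectory_matching_loss(student_latents, teacher_latents):
    """Mean squared distance between latents at matched denoising stages."""
    pairs = list(zip(student_latents, teacher_latents))
    return sum((s - t) ** 2 for s, t in pairs) / len(pairs)

def shaped_loss(student_latents, teacher_latents, reward_teacher, reward_student):
    """Reward term plus gated teacher guidance (the combination rule is an assumption)."""
    gate = reward_aware_gate(reward_teacher, reward_student)
    matching = trajectory_matching_loss(student_latents, teacher_latents)
    return -reward_student + gate * matching

# Toy scalar latents at three matched stages.
student = [0.2, 0.5, 0.9]
teacher = [0.0, 0.6, 1.0]

gate_teacher_better = reward_aware_gate(0.9, 0.4)
gate_student_better = reward_aware_gate(0.4, 0.9)
loss_example = shaped_loss(student, teacher, reward_teacher=0.9, reward_student=0.4)
print(gate_teacher_better, gate_student_better, loss_example)
```

A hard gate such as `max(0, reward_teacher - reward_student)` would switch guidance off entirely once the student matches the teacher; the smooth sigmoid used here is simply one alternative.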

Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models

We address the problem of prompt-guided image editing in visual autoregressive models. Given a source image and a target text prompt, we aim to modify the source image according to the target prompt, while preserving all regions which are unrelated to the requested edit. To this end, we present Masked Logit Nudging, which uses the source image token maps to introduce a guidance step that aligns the model's predictions under the target prompt with these source token maps. Specifically, we convert the fixed source encodings into logits using the VAR encoding, nudging the model's predicted logits towards the targets along a semantic trajectory defined by the source-target prompts. Edits are applied only within spatial masks obtained through a dedicated masking scheme that leverages cross-attention differences between the source and edited prompts. Then, we introduce a refinement to correct quantization errors and improve reconstruction quality. Our approach achieves the best image editing performance on the PIE benchmark at 512px and 1024px resolutions. Beyond editing, our method delivers faithful reconstructions and outperforms previous methods on COCO at 512px and OpenImages at 1024px. Overall, our method outperforms VAR-related approaches and achieves comparable or even better performance than diffusion models, while being much faster. Code is available at 'https://github.com/AmirMaEl/MLN'.

TLDR: The paper introduces Masked Logit Nudging, a novel method for prompt-guided image editing using visual autoregressive models, achieving state-of-the-art performance and faster inference compared to diffusion models.

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Amir El-Ghoussani, Marc Hölle, Gustavo Carneiro, Vasileios Belagiannis

Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive execution into two task-aligned pathways: a streamlined "Direct-Decide" path for direct perception and a structured "Reason-Decide" path for analytical auditing. Operating in either a training-free or a data-efficient SFT setting, CoM achieves robust and consistent generalization across diverse benchmarks.

TLDR: The paper introduces Chain of Modality (CoM), a framework that uses dynamic orchestration of input modalities to improve the performance of Omni-MLLMs by addressing issues of static fusion and biases in current models.

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Ziyang Luo, Nian Liu, Junwei Han

Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimization. We address three compounding challenges (absent training data, the need for geometric reasoning from distorted 2D input, and cross-view consistency) through three mutually reinforcing contributions: (i) a curated dataset of approximately 9k sketch-to-multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions. Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization. Our approach significantly outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to a 3.7x inference speedup.

TLDR: This paper introduces a novel framework for generating geometrically consistent multi-view scenes from single freehand sketches, addressing the challenges of limited training data, geometric reasoning from distorted inputs, and cross-view consistency, achieving superior performance compared to existing methods.

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Ahmed Bourouis, Savas Ozkan, Andrea Maracani, Yi-Zhe Song, Mete Ozay

An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation

Recent work has shown that diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker-Planck (FP) equation that governs the evolution of the true data density. Directly penalizing these deviations in the objective function reduces their magnitude but introduces a significant computational overhead. It is also observed that enforcing strict adherence to the FP equation does not necessarily lead to improvements in the quality of the generated samples, as often the best results are obtained with weaker FP regularization. In this paper, we investigate whether simpler penalty terms can provide similar benefits. We empirically analyze several lightweight regularizers, study their effect on FP residuals and generation quality, and show that the benefits of FP regularization are available at substantially lower computational cost. Our code is available at https://github.com/OnnoNiemann/fp_diffusion_analysis.

TLDR: This paper analyzes lightweight regularization methods in diffusion models to reduce Fokker-Planck residuals, achieving similar generation quality gains to more computationally expensive methods.

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Authors: Onno Niemann, Gonzalo Martínez Muñoz, Alberto Suárez Gonzalez
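
To make the notion of a Fokker-Planck residual in the entry above concrete, here is a minimal one-dimensional sketch. It works at the density level for the simplest possible SDE, dx = dW (zero drift, unit diffusion), whose FP equation is dp/dt = 0.5 * d2p/dx2. This is not the score-level residual or any regularizer from the paper; the densities, the finite-difference step `h`, and the evaluation point are illustrative assumptions. A density that truly follows the SDE gives a residual near zero, while a deliberately wrong one does not.

```python
import math

def gaussian_density(x, variance):
    """Zero-mean Gaussian density with the given variance."""
    return math.exp(-x * x / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def fp_residual(density, x, t, h=1e-3):
    """Central finite-difference residual of dp/dt - 0.5 * d2p/dx2."""
    dp_dt = (density(x, t + h) - density(x, t - h)) / (2.0 * h)
    d2p_dx2 = (density(x + h, t) - 2.0 * density(x, t) + density(x - h, t)) / (h * h)
    return dp_dt - 0.5 * d2p_dx2

# Exact solution for dx = dW started from N(0, 1): variance grows as 1 + t.
exact = lambda x, t: gaussian_density(x, 1.0 + t)
# Deliberately wrong model: variance grows twice as fast as the SDE allows.
wrong = lambda x, t: gaussian_density(x, 1.0 + 2.0 * t)

residual_exact = fp_residual(exact, x=0.3, t=0.5)
residual_wrong = fp_residual(wrong, x=0.3, t=0.5)
print(residual_exact, residual_wrong)
```

Penalizing the squared residual over sampled points is the general shape of an FP regularizer; its cost comes from the derivatives involved, which is the overhead the abstract refers to.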

Generative Modeling of Complex-Valued Brain MRI Data

Objective. Standard Magnetic Resonance Imaging (MRI) reconstruction pipelines discard phase information captured during acquisition, despite evidence that it encodes tissue properties relevant to tumor diagnosis. Current machine learning approaches inherit this limitation by operating exclusively on reconstructed magnitude images. The aim of this study is to build a generative framework which is capable of jointly modeling magnitude and phase information of complex-valued MRI scans. Approach. The proposed generative framework combines a conditional variational autoencoder, which compresses complex-valued MRI scans into compact latent representations while preserving phase coherence, with a flow-matching-based generative model. Synthetic sample quality is assessed via a real-versus-synthetic classifier and by training downstream classifiers on synthetic data for abnormal tissue detection. Main results. The autoencoder preserves phase coherence above 0.997. Real-versus-synthetic classification yields low AUROC values between 0.50 and 0.66 across all acquisition sequences, indicating generated samples are nearly indistinguishable from real data. In downstream normal-versus-abnormal classification, classifiers trained entirely on synthetic data achieve an AUROC of 0.880, surpassing the real-data baseline of 0.842 on a publicly available dataset (fastMRI). This advantage persists on an independent external test set from a different institution with biopsy-confirmed labels. Significance. The proposed framework demonstrates the feasibility of jointly modeling magnitude and phase information for normal and abnormal complex-valued brain MRI data. Beyond synthetic data generation, it establishes a foundation for the usage of complete brain MRI information in future diagnostic applications and enables systematic investigation of how magnitude and phase jointly encode pathology-specific features.

TLDR: The paper introduces a generative framework combining a conditional VAE and flow-matching to jointly model magnitude and phase information in complex-valued MRI. Results suggest successful synthesis of realistic MRI data and improved downstream classification of abnormal tissue.

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Authors: Marco Schlimbach, Moritz Rempe, Jessica Mnischek, Lukas T. Rotkopf, Jens Weingarten, Jens Kleesiek, Kevin Kröninger
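
The MRI entry above reads a real-versus-synthetic AUROC close to 0.5 as a sign that generated scans are hard to tell from real ones. The sketch below shows why, using the pairwise definition of AUROC: the probability that a randomly chosen positive is scored above a randomly chosen negative. The scores are made-up numbers for illustration, not results from the paper.

```python
def auroc(scores_positive, scores_negative):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))

# Hypothetical classifier scores for "this scan is real".
real_scores = [0.52, 0.47, 0.61, 0.50]
synthetic_scores = [0.49, 0.53, 0.58, 0.46]

near_chance = auroc(real_scores, synthetic_scores)   # overlapping scores
separable = auroc([0.9, 0.8, 0.95], [0.1, 0.2, 0.3])  # cleanly separated scores
print(near_chance, separable)
```

Here the overlapping scores give 9 correctly ranked pairs out of 16, an AUROC of 0.5625, while the cleanly separated scores give 1.0. A value of 0.5 corresponds to a classifier that cannot separate the two groups at all.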
