AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

November 18, 2025

MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation

As Vision-Language Models (VLMs) increasingly gain traction in medical applications, clinicians are progressively expecting AI systems not only to generate textual diagnoses but also to produce corresponding medical images that integrate seamlessly into authentic clinical workflows. Despite the growing interest, existing medical visual benchmarks present notable limitations. They often rely on ambiguous queries that lack sufficient relevance to image content, oversimplify complex diagnostic reasoning into closed-ended shortcuts, and adopt a text-centric evaluation paradigm that overlooks the importance of image generation capabilities. To address these challenges, we introduce MedGEN-Bench, a comprehensive multimodal benchmark designed to advance medical AI research. MedGEN-Bench comprises 6,422 expert-validated image-text pairs spanning six imaging modalities, 16 clinical tasks, and 28 subtasks. It is structured into three distinct formats: Visual Question Answering, Image Editing, and Contextual Multimodal Generation. What sets MedGEN-Bench apart is its focus on contextually intertwined instructions that necessitate sophisticated cross-modal reasoning and open-ended generative outputs, moving beyond the constraints of multiple-choice formats. To evaluate the performance of existing systems, we employ a novel three-tier assessment framework that integrates pixel-level metrics, semantic text analysis, and expert-guided clinical relevance scoring. Using this framework, we systematically assess 10 compositional frameworks, 3 unified models, and 5 VLMs.
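
As an illustration of how the three assessment tiers might be aggregated, the sketch below combines a pixel-level metric (SSIM), a sentence-embedding similarity for the generated report, and an expert-provided clinical relevance rating. The weights, the sentence encoder, and the function name are assumptions for illustration only, not the benchmark's actual evaluation code.

```python
# Illustrative three-tier score aggregation (not the official MedGEN-Bench code).
# Assumes single-channel reference/generated images, a sentence-transformers
# encoder for text similarity, and a clinician rating in [0, 5].
import numpy as np
from skimage.metrics import structural_similarity as ssim
from sentence_transformers import SentenceTransformer, util

_text_model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do

def three_tier_score(gen_img: np.ndarray, ref_img: np.ndarray,
                     gen_text: str, ref_text: str,
                     clinical_score: float,              # expert rating in [0, 5]
                     weights=(0.3, 0.3, 0.4)) -> float:
    # Tier 1: pixel-level agreement between generated and reference images.
    pixel = ssim(gen_img, ref_img, data_range=float(ref_img.max() - ref_img.min()))
    # Tier 2: semantic similarity between generated and reference text.
    emb = _text_model.encode([gen_text, ref_text], convert_to_tensor=True)
    semantic = util.cos_sim(emb[0], emb[1]).item()
    # Tier 3: expert-guided clinical relevance, rescaled to [0, 1].
    clinical = clinical_score / 5.0
    w1, w2, w3 = weights
    return w1 * pixel + w2 * semantic + w3 * clinical
```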

TLDR: The paper introduces MedGEN-Bench, a new multimodal medical benchmark dataset that focuses on contextually intertwined instructions necessitating cross-modal reasoning and open-ended generative outputs in image generation, image editing, and visual question answering, to advance medical AI research.

TLDR: The paper introduces MedGEN-Bench, a new multimodal medical benchmark dataset focused on contextually intertwined instructions that require cross-modal reasoning and open-ended generative outputs across image generation, image editing, and visual question answering, aiming to advance medical AI research.

Relevance: (10/10)
Novelty: (9/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Junjie Yang, Yuhao Yan, Gang Wu, Yuxuan Wang, Ruoyu Liang, Xinjie Jiang, Xiang Wan, Fenglei Fan, Yongquan Zhang, Feiwei Qin, Changmiao Wan

Distribution Matching Distillation Meets Reinforcement Learning

Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model into a few-step one to improve inference efficiency. However, the performance of the latter is often capped by the former. To circumvent this dilemma, we propose DMDR, a novel framework that incorporates Reinforcement Learning (RL) techniques into the distillation process. We show that for RL of the few-step generator, the DMD loss itself is a more effective regularization than traditional ones. In turn, RL can help guide the mode coverage process in DMD more effectively. Together, these allow us to unlock the capacity of the few-step generator by conducting distillation and RL simultaneously. Meanwhile, we design dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. Experiments demonstrate that DMDR achieves leading visual quality and prompt coherence among few-step methods, and even exhibits performance that exceeds the multi-step teacher.
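
A minimal sketch of the central idea: one generator update combines an RL objective with the DMD loss acting as a regularizer. The reward model, the direct reward backpropagation, and the loss weight are assumptions standing in for the paper's actual RL procedure, which may differ.

```python
# Hypothetical training step combining an RL reward term with a DMD loss used
# as a regularizer (a sketch of the DMDR idea, not the authors' code).
import torch

def dmdr_step(generator, reward_model, dmd_loss_fn, optimizer,
              noise, prompts, lambda_dmd: float = 1.0):
    # The few-step generator maps noise (and prompts) to images in a few passes.
    images = generator(noise, prompts)

    # RL term: maximize a learned reward (e.g., a human-preference score).
    reward = reward_model(images, prompts).mean()
    rl_loss = -reward

    # DMD term: distribution-matching loss against the multi-step teacher,
    # keeping the generator close to the teacher's distribution while RL
    # pushes it toward higher reward.
    dmd_loss = dmd_loss_fn(images, prompts)

    loss = rl_loss + lambda_dmd * dmd_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return {"reward": reward.item(), "dmd": dmd_loss.item()}
```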

TLDR: The paper introduces DMDR, a framework combining Distribution Matching Distillation and Reinforcement Learning to improve the performance of few-step diffusion models, surpassing the original multi-step teacher model through simultaneous distillation and RL.

TLDR: The paper proposes DMDR, a framework combining Distribution Matching Distillation and Reinforcement Learning to improve few-step diffusion models; by performing distillation and RL simultaneously, the distilled model can surpass the original multi-step teacher.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Xin Jin, David Liu, Zhen Li, Mengmeng Wang, Peng Gao, Harry Yang

PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image

3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.
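
The sketch below is a toy illustration of the general idea of compressing geometry into a short, plain-text token sequence that fits a VLM's token budget; the coarse voxel grid and run-length encoding are made-up stand-ins, not the paper's actual representation or its 193x reduction.

```python
# Toy geometry-to-text tokenization (illustrative only; resolution and encoding
# are hypothetical, not PhysX-Anything's representation).
import numpy as np

def voxelize(points: np.ndarray, res: int = 16) -> np.ndarray:
    """Map an (N, 3) point cloud to a coarse res^3 occupancy grid."""
    pts = (points - points.min(0)) / (np.ptp(points, axis=0) + 1e-8)  # normalize to [0, 1]
    idx = np.clip((pts * res).astype(int), 0, res - 1)
    grid = np.zeros((res, res, res), dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

def grid_to_text(grid: np.ndarray) -> str:
    """Run-length encode the flattened occupancy so geometry becomes a short
    string of ordinary text tokens (e.g. '0x412 1x37 ...') instead of
    thousands of per-voxel tokens."""
    flat = grid.reshape(-1)
    runs, prev, count = [], int(flat[0]), 0
    for v in flat:
        if int(v) == prev:
            count += 1
        else:
            runs.append(f"{prev}x{count}")
            prev, count = int(v), 1
    runs.append(f"{prev}x{count}")
    return " ".join(runs)
```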

TLDR: PhysX-Anything is a new framework for generating simulation-ready 3D assets with physical properties from single images, using a novel VLM-based approach and a new dataset for physical 3D objects.

TLDR: PhysX-Anything is a new framework that generates simulation-ready 3D assets with physical attributes from a single image, using a new VLM-based approach and a new physical 3D object dataset.

Relevance: (7/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu

VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping

Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has been proven effective for accelerating visual AR models, its "draft one step, then verify one step" paradigm prevents a direct reduction of forward passes, restricting the acceleration potential. Motivated by the interchangeability of visual tokens, we are the first to explore verification skipping in the SD process of visual AR generation, explicitly cutting the number of target-model forward passes and thereby reducing inference latency. Based on an analysis of the drafting stage's characteristics, we observe that verification redundancy and stale-feature reusability are the key factors for retaining generation quality and speedup in verification-free steps. Inspired by these two observations, we propose VVS, a novel SD framework that accelerates visual AR generation via partial verification skipping and integrates three complementary modules: (1) a verification-free token selector with dynamic truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped-step scheduling. Consequently, VVS reduces the number of target-model forward passes by a factor of $2.8\times$ relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed-quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm.
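
A schematic of a decoding loop with partial verification skipping: on scheduled steps the draft tokens are accepted without a target-model forward pass. The propose/verify methods and the skip schedule are placeholders; VVS's token selector, feature caching, and scheduling policy are more elaborate than shown.

```python
# Simplified speculative decoding with partial verification skipping
# (a sketch of the idea, not the VVS implementation).
def generate(draft_model, target_model, prompt_tokens, total_steps: int,
             skip_schedule):  # skip_schedule(step) -> True if verification is skipped
    tokens = list(prompt_tokens)
    for step in range(total_steps):
        draft = draft_model.propose(tokens)          # cheap draft token(s)
        if skip_schedule(step):
            # Verification-free step: accept the draft directly, saving a
            # target-model forward pass (relies on visual tokens being largely
            # interchangeable and on reusing stale/cached features).
            tokens.extend(draft)
        else:
            # Standard SD step: run the target model once to verify the draft
            # and fall back to its own prediction where the draft is rejected.
            verified = target_model.verify(tokens, draft)
            tokens.extend(verified)
    return tokens
```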

TLDR: The paper introduces VVS, a novel speculative decoding framework for visual autoregressive generation that accelerates inference by strategically skipping verification steps, achieving a significant speedup while maintaining competitive generation quality.

TLDR: The paper introduces VVS, a novel speculative decoding framework for visual autoregressive generation that accelerates inference by strategically skipping verification steps, achieving significant speedup while maintaining competitive generation quality.

Relevance: (8/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Haotian Dong, Ye Li, Rongwei Lu, Chen Tang, Shu-Tao Xia, Zhi Wang

DriveLiDAR4D: Sequential and Controllable LiDAR Scene Generation for Autonomous Driving

The generation of realistic LiDAR point clouds plays a crucial role in the development and evaluation of autonomous driving systems. Although recent methods for 3D LiDAR point cloud generation have shown significant improvements, they still face notable limitations, including the lack of sequential generation capabilities and the inability to produce accurately positioned foreground objects and realistic backgrounds. These shortcomings hinder their practical applicability. In this paper, we introduce DriveLiDAR4D, a novel LiDAR generation pipeline consisting of multimodal conditions and a novel sequential noise prediction model, LiDAR4DNet, capable of producing temporally consistent LiDAR scenes with highly controllable foreground objects and realistic backgrounds. To the best of our knowledge, this is the first work to address the sequential generation of LiDAR scenes with full scene manipulation capability in an end-to-end manner. We evaluated DriveLiDAR4D on the nuScenes and KITTI datasets, achieving an FRD score of 743.13 and an FVD score of 16.96 on nuScenes and surpassing the current state-of-the-art (SOTA) method, UniScene, with performance boosts of 37.2% in FRD and 24.1% in FVD, respectively.

TLDR: The paper introduces DriveLiDAR4D, a novel sequential LiDAR scene generation pipeline that addresses limitations in existing methods by enabling controllable foreground objects and realistic backgrounds, achieving state-of-the-art results on nuScenes and KITTI datasets.

TLDR: The paper introduces DriveLiDAR4D, a novel sequential LiDAR scene generation pipeline that addresses the limitations of existing methods by enabling controllable foreground objects and realistic backgrounds, achieving state-of-the-art results on the nuScenes and KITTI datasets.

Relevance: (7/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Kaiwen Cai, Xinze Liu, Xia Zhou, Hengtong Hu, Jie Xiang, Luyao Zhang, Xueyang Zhang, Kun Zhan, Yifei Zhan, Xianpeng Lang

MeanFlow Transformers with Representation Autoencoders

MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE) for high-dimensional data modeling. However, MF training remains computationally demanding and is often unstable. During inference, the SD-VAE decoder dominates the generation cost, and MF depends on complex guidance hyperparameters for class-conditional generation. In this work, we develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE), where a pre-trained vision encoder (e.g., DINO) provides semantically rich latents paired with a lightweight decoder. We observe that naive MF training in the RAE latent space suffers from severe gradient explosion. To stabilize and accelerate training, we adopt Consistency Mid-Training for trajectory-aware initialization and use a two-stage scheme: distillation from a pre-trained flow matching teacher to speed convergence and reduce variance, followed by an optional bootstrapping stage with a one-point velocity estimator to further reduce deviation from the oracle mean flow. This design removes the need for guidance, simplifies training configurations, and reduces computation in both training and sampling. Empirically, our method achieves a 1-step FID of 2.03, outperforming vanilla MF's 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256. We further scale our approach to ImageNet 512, achieving a competitive 1-step FID of 3.23 with the lowest GFLOPS among all baselines. Code is available at https://github.com/sony/mf-rae.
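
A minimal sketch of one-step sampling in an RAE latent space, assuming a MeanFlow network that takes interval endpoints (r, t) and a class label. Component names, the call signature, and the sign convention of the jump are assumptions, and the two-stage training recipe (distillation then bootstrapping) is not shown.

```python
# One-step sampling in a Representation Autoencoder latent space (a sketch;
# names, shapes, and conventions are placeholders, not the mf-rae code).
import torch

@torch.no_grad()
def sample_one_step(mf_model, rae_decoder, class_label, latent_shape, device="cuda"):
    # Start from Gaussian noise in the RAE latent space (e.g., DINO-feature-shaped).
    z_noise = torch.randn(latent_shape, device=device)
    # MeanFlow predicts the average velocity over the whole interval, so a
    # single "long jump" maps noise directly to a data latent.
    u = mf_model(z_noise,
                 r=torch.zeros(1, device=device),
                 t=torch.ones(1, device=device),
                 y=class_label)
    z_data = z_noise - u          # sign/convention-dependent; shown for t=1 -> r=0
    # The lightweight RAE decoder maps the semantic latent back to pixels,
    # avoiding the heavy SD-VAE decoder at inference time.
    return rae_decoder(z_data)
```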

TLDR: The paper introduces an efficient training and sampling scheme for MeanFlow generative models using Representation Autoencoders (RAE), achieving state-of-the-art few-step image generation performance with significantly reduced computational costs.

TLDR: The paper presents an efficient training and sampling scheme for MeanFlow generative models using Representation Autoencoders (RAE), achieving state-of-the-art few-step image generation performance at significantly reduced computational cost.

Relevance: (8/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Zheyuan Hu, Chieh-Hsin Lai, Ge Wu, Yuki Mitsufuji, Stefano Ermon

Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention

Recent advancements in video generation have demonstrated the potential of using video diffusion models as world models, with autoregressive generation of infinitely long videos through masked conditioning. However, such models, which usually rely on local full attention, lack effective memory compression and retrieval for long-term generation beyond the window size, leading to forgetting and spatiotemporal inconsistencies. To enhance the retention of historical information within a fixed memory budget, we introduce a recurrent neural network (RNN) into the diffusion transformer framework. Specifically, a diffusion model incorporating an LSTM with attention achieves performance comparable to state-of-the-art RNN blocks such as TTT and Mamba2. Moreover, existing diffusion-RNN approaches often suffer from performance degradation due to the training-inference gap or the lack of overlap across windows. To address these limitations, we propose a novel Recurrent Autoregressive Diffusion (RAD) framework, which executes frame-wise autoregression for memory update and retrieval, consistently across training and inference. Experiments on the Memory Maze and Minecraft datasets demonstrate the superiority of RAD for long video generation, highlighting the efficiency of LSTM in sequence modeling.
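
A sketch of the frame-wise autoregression loop: a recurrent memory is updated with each generated frame's features and then retrieved as conditioning for the next frame's denoiser. The denoiser interface is a placeholder, and RAD's attention-augmented LSTM is simplified to a plain LSTMCell.

```python
# Frame-wise autoregressive generation with a recurrent global memory
# (a simplified sketch of the RAD idea, not the authors' implementation).
import torch
import torch.nn as nn

class RecurrentVideoGenerator(nn.Module):
    def __init__(self, denoiser, feat_dim: int, mem_dim: int):
        super().__init__()
        self.denoiser = denoiser                  # per-frame diffusion transformer
        self.memory = nn.LSTMCell(feat_dim, mem_dim)

    @torch.no_grad()
    def generate(self, first_frame_feat, num_frames: int):
        h = torch.zeros(first_frame_feat.size(0), self.memory.hidden_size,
                        device=first_frame_feat.device)
        c = torch.zeros_like(h)
        frames, feat = [], first_frame_feat
        for _ in range(num_frames):
            # Memory update: fold the latest frame's features into (h, c).
            h, c = self.memory(feat, (h, c))
            # Memory retrieval: condition the per-frame denoiser on the global
            # memory h, so history beyond the local attention window persists.
            frame, feat = self.denoiser.sample(cond=h)   # placeholder interface
            frames.append(frame)
        return torch.stack(frames, dim=1)
```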

TLDR: The paper introduces Recurrent Autoregressive Diffusion (RAD), a novel framework integrating RNNs (specifically LSTM) within a diffusion transformer to improve long video generation by addressing forgetting and spatiotemporal inconsistency issues through frame-wise autoregression for memory update and retrieval.

TLDR: The paper introduces the Recurrent Autoregressive Diffusion (RAD) framework, which integrates RNNs (specifically LSTMs) into a diffusion transformer and performs frame-wise autoregression for memory update and retrieval, improving long video generation and addressing forgetting and spatiotemporal inconsistency.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Taiye Chen, Zihan Ding, Anjian Li, Christina Zhang, Zeqi Xiao, Yisen Wang, Chi Jin

Text2Traffic: A Text-to-Image Generation and Editing Method for Traffic Scenes

With the rapid advancement of intelligent transportation systems, text-driven image generation and editing techniques have demonstrated significant potential in providing rich, controllable visual scene data for applications such as traffic monitoring and autonomous driving. However, several challenges remain, including insufficient semantic richness of generated traffic elements, limited camera viewpoints, low visual fidelity of synthesized images, and poor alignment between textual descriptions and generated content. To address these issues, we propose a unified text-driven framework for both image generation and editing, leveraging a controllable mask mechanism to seamlessly integrate the two tasks. Furthermore, we incorporate both vehicle-side and roadside multi-view data to enhance the geometric diversity of traffic scenes. Our training strategy follows a two-stage paradigm: first, we perform conceptual learning using large-scale coarse-grained text-image data; then, we fine-tune with fine-grained descriptive data to enhance text-image alignment and detail quality. Additionally, we introduce a mask-region-weighted loss that dynamically emphasizes small yet critical regions during training, thereby substantially enhancing the generation fidelity of small-scale traffic elements. Extensive experiments demonstrate that our method achieves leading performance in text-based image generation and editing within traffic scenes.
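
A minimal sketch of a mask-region-weighted denoising loss that up-weights masked regions more strongly when they are small; the specific weighting rule below is an assumption, not the paper's exact dynamic scheme.

```python
# Mask-region-weighted denoising loss (illustrative; the boost rule is hypothetical).
import torch
import torch.nn.functional as F

def mask_weighted_loss(pred_noise, target_noise, region_mask,
                       base_weight: float = 1.0, boost: float = 4.0):
    """region_mask: (B, 1, H, W) binary mask marking small traffic elements."""
    per_pixel = F.mse_loss(pred_noise, target_noise, reduction="none")
    # Smaller masked regions receive a larger boost so tiny elements (signs,
    # lights, distant vehicles) are not drowned out by the background loss.
    area = region_mask.flatten(1).mean(dim=1).clamp(min=1e-4)          # (B,)
    region_boost = (boost / area.sqrt()).clamp(max=50.0).view(-1, 1, 1, 1)
    weight = base_weight + region_mask * region_boost
    return (weight * per_pixel).mean()
```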

TLDR: The paper introduces Text2Traffic, a text-driven image generation and editing framework for traffic scenes, addressing challenges like semantic richness, viewpoint diversity, and visual fidelity through a controllable mask mechanism, multi-view data, and a two-stage training strategy.

TLDR: The paper introduces Text2Traffic, a text-driven image generation and editing framework for traffic scenes that addresses challenges in semantic richness, viewpoint diversity, and visual fidelity through a controllable mask mechanism, multi-view data, and a two-stage training strategy.

Relevance: (9/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Feng Lv, Haoxuan Feng, Zilu Zhang, Chunlong Xia, Yanfeng Li

Generative Photographic Control for Scene-Consistent Video Cinematic Editing

Cinematic storytelling is profoundly shaped by the artful manipulation of photographic elements such as depth of field and exposure. These effects are crucial in conveying mood and creating aesthetic appeal. However, controlling these effects in generative video models remains highly challenging, as most existing methods are restricted to camera motion control. In this paper, we propose CineCtrl, the first video cinematic editing framework that provides fine control over professional camera parameters (e.g., bokeh, shutter speed). We introduce a decoupled cross-attention mechanism to disentangle camera motion from photographic inputs, allowing fine-grained, independent control without compromising scene consistency. To overcome the shortage of training data, we develop a comprehensive data generation strategy that combines simulated photographic effects with a dedicated real-world collection pipeline, enabling the construction of a large-scale dataset for robust model training. Extensive experiments demonstrate that our model generates high-fidelity videos with precisely controlled, user-specified photographic camera effects.
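
A minimal sketch of a decoupled cross-attention block in which camera-motion tokens and photographic-parameter tokens attend to the video hidden states through separate branches; the dimensions, gating, and module layout are assumptions rather than the paper's architecture.

```python
# Decoupled cross-attention for motion vs. photographic conditions (a sketch).
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.motion_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.photo_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.photo_gate = nn.Parameter(torch.zeros(1))   # start with no photo influence

    def forward(self, hidden, motion_tokens, photo_tokens):
        # Camera motion and photographic parameters attend to the video tokens
        # through independent branches, so changing bokeh or shutter speed does
        # not perturb the motion pathway (and vice versa).
        m, _ = self.motion_attn(hidden, motion_tokens, motion_tokens)
        p, _ = self.photo_attn(hidden, photo_tokens, photo_tokens)
        return hidden + m + torch.tanh(self.photo_gate) * p
```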

TLDR: The paper introduces CineCtrl, a generative video editing framework that allows for fine-grained control over photographic camera parameters like bokeh and shutter speed while maintaining scene consistency, using a decoupled cross-attention mechanism and a large-scale training dataset.

TLDR: The paper introduces CineCtrl, a generative video editing framework that enables fine-grained control over photographic camera parameters (such as bokeh and shutter speed) while maintaining scene consistency, using a decoupled cross-attention mechanism and a large-scale training dataset.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Huiqiang Sun, Liao Shen, Zhan Peng, Kun Wang, Size Wu, Yuhang Zang, Tianqi Liu, Zihao Huang, Xingyu Zeng, Zhiguo Cao, Wei Li, Chen Change Loy

ActVAR: Activating Mixtures of Weights and Tokens for Efficient Visual Autoregressive Generation

Visual Autoregressive (VAR) models enable efficient image generation via next-scale prediction but face escalating computational costs as sequence length grows. Existing static pruning methods degrade performance by permanently removing weights or tokens, disrupting pretrained dependencies. To address this, we propose ActVAR, a dynamic activation framework that introduces dual sparsity across model weights and token sequences to enhance efficiency without sacrificing capacity. ActVAR decomposes feedforward networks (FFNs) into lightweight expert sub-networks and employs a learnable router to dynamically select token-specific expert subsets based on content. Simultaneously, a gated token selector identifies high-update-potential tokens for computation while reconstructing unselected tokens to preserve global context and sequence alignment. Training employs a two-stage knowledge distillation strategy, where the original VAR model supervises the learning of routing and gating policies to align with pretrained knowledge. Experiments on the ImageNet $256\times 256$ benchmark demonstrate that ActVAR achieves up to $21.2\%$ FLOPs reduction with minimal performance degradation.
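
A simplified sketch of the dual-sparsity idea: a token gate skips low-update-potential tokens (here they simply keep their input, standing in for ActVAR's token reconstruction), and a router sends each kept token to a small subset of FFN experts. The expert decomposition and the two-stage distillation training are not shown.

```python
# Dual sparsity over tokens and FFN experts (illustrative sketch, inference-style).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseExpertFFN(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 1):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.token_gate = nn.Linear(dim, 1)
        self.top_k = top_k

    def forward(self, x):                      # x: (num_tokens, dim)
        out = x.clone()                        # skipped tokens keep their input
        keep = torch.sigmoid(self.token_gate(x)).squeeze(-1) > 0.5
        if keep.any():
            xs = x[keep]
            scores = F.softmax(self.router(xs), dim=-1)
            topv, topi = scores.topk(self.top_k, dim=-1)
            y = torch.zeros_like(xs)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    sel = topi[:, k] == e
                    if sel.any():
                        y[sel] += topv[sel, k].unsqueeze(-1) * expert(xs[sel])
            out[keep] = xs + y                 # residual update only for kept tokens
        return out
```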

TLDR: ActVAR introduces a dynamic activation framework for visual autoregressive models, using weight and token sparsity to improve efficiency with minimal performance degradation, achieving up to 21.2% FLOPs reduction on ImageNet.

TLDR: ActVAR introduces a dynamic activation framework for visual autoregressive models that improves efficiency through weight and token sparsity, achieving up to a 21.2% FLOPs reduction on ImageNet with minimal performance loss.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Kaixin Zhang, Ruiqing Yang, Yuan Zhang, Shan You, Tao Huang

InterMoE: Individual-Specific 3D Human Interaction Generation via Dynamic Temporal-Selective MoE

Generating high-quality human interactions holds significant value for applications like virtual reality and robotics. However, existing methods often fail to preserve unique individual characteristics or fully adhere to textual descriptions. To address these challenges, we introduce InterMoE, a novel framework built on a Dynamic Temporal-Selective Mixture of Experts. The core of InterMoE is a routing mechanism that synergistically uses both high-level text semantics and low-level motion context to dispatch temporal motion features to specialized experts. This allows experts to dynamically determine their selection capacity and focus on critical temporal features, thereby preserving each individual's specific characteristics while ensuring high semantic fidelity. Extensive experiments show that InterMoE achieves state-of-the-art performance in individual-specific, high-fidelity 3D human interaction generation, reducing FID scores by 9% on the InterHuman dataset and 22% on InterX.
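
A sketch of a temporal-selective router that scores each frame's motion feature with both the text embedding and the local motion context, and gives each expert a dynamic (threshold-based) rather than fixed number of frames; shapes and the thresholding rule are assumptions, not InterMoE's exact mechanism.

```python
# Temporal-selective routing with dynamic per-expert capacity (a sketch).
import torch
import torch.nn as nn

class TemporalSelectiveRouter(nn.Module):
    def __init__(self, motion_dim: int, text_dim: int, num_experts: int):
        super().__init__()
        self.score = nn.Linear(motion_dim + text_dim, num_experts)

    def forward(self, motion_feats, text_emb, threshold: float = 0.5):
        # motion_feats: (T, motion_dim) per-frame features; text_emb: (text_dim,)
        text = text_emb.unsqueeze(0).expand(motion_feats.size(0), -1)
        logits = self.score(torch.cat([motion_feats, text], dim=-1))   # (T, E)
        probs = torch.sigmoid(logits)
        # Dynamic capacity: each expert receives only the frames whose routing
        # probability clears the threshold, instead of a fixed top-k per expert.
        return [torch.nonzero(probs[:, e] > threshold).flatten()
                for e in range(probs.size(1))]
```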

TLDR: InterMoE is a novel framework for generating individual-specific 3D human interactions from text using a dynamic temporal-selective mixture of experts, achieving state-of-the-art results.

TLDR: InterMoE is a novel framework that generates individual-specific 3D human interactions from text using a dynamic temporal-selective mixture of experts, achieving state-of-the-art results.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Lipeng Wang, Hongxing Fan, Haohua Chen, Zehuan Huang, Lu Sheng

TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing

Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency, the decisive factors of which can be divided into three parts: text style, text content, and background. Previous methods have struggled with incomplete disentanglement of editable attributes, typically addressing only one aspect (such as editing text content), thus limiting controllability and visual consistency. To overcome these limitations, we propose TripleFDS, a novel framework for STE with disentangled modular attributes, and an accompanying dataset called SCB Synthesis. SCB Synthesis provides robust training data for triple feature disentanglement by utilizing the "SCB Group", a novel construct that combines three attributes per image to generate diverse, disentangled training groups. Leveraging this construct as a basic training unit, TripleFDS first disentangles the three features, ensuring semantic accuracy through inter-group contrastive regularization and reducing redundancy through intra-sample multi-feature orthogonality. In the synthesis phase, TripleFDS performs feature remapping to prevent "shortcut" phenomena during reconstruction and to mitigate potential feature leakage. Trained on 125,000 SCB Groups, TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on mainstream STE benchmarks. Beyond superior performance, the more flexible editing of TripleFDS supports new operations such as style replacement and background transfer. Code: https://github.com/yusenbao01/TripleFDS
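
A minimal sketch of the intra-sample multi-feature orthogonality idea: the style, content, and background features of one image are penalized for mutual cosine similarity so each factor carries distinct information. The inter-group contrastive term and loss weights are omitted, and this is not the authors' implementation.

```python
# Intra-sample orthogonality penalty between disentangled factors (a sketch).
import torch
import torch.nn.functional as F

def orthogonality_loss(style, content, background):
    """Each argument: (B, D) feature for one disentangled factor."""
    feats = [F.normalize(f, dim=-1) for f in (style, content, background)]
    loss = 0.0
    for i in range(3):
        for j in range(i + 1, 3):
            # Penalize cosine similarity between different factors of the same sample.
            loss = loss + (feats[i] * feats[j]).sum(dim=-1).pow(2).mean()
    return loss
```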

TLDR: The paper introduces TripleFDS, a novel scene text editing framework with disentangled features and a new dataset, SCB Synthesis, achieving state-of-the-art results in image fidelity and text accuracy and enabling more flexible editing operations.

TLDR: The paper presents TripleFDS, a novel scene text editing framework with disentangled features, together with a new dataset, SCB Synthesis; it achieves state-of-the-art image fidelity and text accuracy and supports more flexible editing operations.

Relevance: (6/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Yuchen Bao, Yiting Wang, Wenjian Huang, Haowei Wang, Shen Chen, Taiping Yao, Shouhong Ding, Jianguo Zhang

Uncovering and Mitigating Transient Blindness in Multimodal Model Editing

Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, overstate success by relying on low-similarity or random inputs, which obscures overfitting. We propose a comprehensive locality evaluation framework covering three key dimensions: random-image locality, no-image locality, and consistent-image locality, operationalized through seven distinct data types, enabling a detailed and structured analysis of multimodal edits. We introduce De-VQA, a dynamic evaluation for visual question answering, which uncovers a phenomenon we term transient blindness: overfitting to edit-similar text while ignoring visual inputs. Token analysis shows that edits disproportionately affect textual tokens. We propose locality-aware adversarial losses to balance cross-modal representations. Empirical results demonstrate that our approach consistently outperforms existing baselines, reducing transient blindness and improving locality by 17% on average.
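
A minimal sketch of a locality-aware editing objective: the edit loss is combined with KL terms that keep the edited model close to the pre-edit model on visual-locality probes (random-image, no-image, consistent-image inputs). Probe construction and the adversarial component are simplified away, and the model call signatures are placeholders.

```python
# Edit loss plus locality regularization against the frozen pre-edit model (a sketch).
import torch
import torch.nn.functional as F

def locality_aware_loss(edited_model, frozen_model, edit_batch, probe_batches,
                        lam: float = 1.0):
    # Edit term: make the edited model produce the corrected answer.
    edit_logits = edited_model(**edit_batch["inputs"])
    loss = F.cross_entropy(edit_logits, edit_batch["target"])
    # Locality terms: on probes that should be unaffected by the edit, stay
    # close to the original model so the edit does not override visual evidence.
    for probe in probe_batches:
        with torch.no_grad():
            ref = F.log_softmax(frozen_model(**probe), dim=-1)
        cur = F.log_softmax(edited_model(**probe), dim=-1)
        loss = loss + lam * F.kl_div(cur, ref, log_target=True, reduction="batchmean")
    return loss
```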

TLDR: This paper identifies and mitigates transient blindness in multimodal model editing, where edits overfit to the textual modality and ignore visuals; the authors introduce a comprehensive evaluation framework and locality-aware adversarial losses to reduce this issue.

TLDR: The paper identifies the problem of "transient blindness" in multimodal model editing, where edits overfit to the textual modality while ignoring visual information, and introduces a comprehensive evaluation framework and locality-aware adversarial losses to address it.

Relevance: (6/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Xiaoqi Han, Ru Li, Ran Yi, Hongye Tan, Zhuomin Liang, Víctor Gutiérrez-Basulto, Jeff Z. Pan

Functional Mean Flow in Hilbert Space

We present Functional Mean Flow (FMF) as a one-step generative model defined in infinite-dimensional Hilbert space. FMF extends the one-step Mean Flow framework to functional domains by providing a theoretical formulation for Functional Flow Matching and a practical implementation for efficient training and sampling. We also introduce an $x_1$-prediction variant that improves stability over the original $u$-prediction form. The resulting framework is a practical one-step Flow Matching method applicable to a wide range of functional data generation tasks such as time series, images, PDEs, and 3D geometry.
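
For reference, the core quantity is the average velocity along the flow, which enables the one-step jump from noise to data; the restatement below uses finite-dimensional notation (the Hilbert-space version replaces states with functions), and the $x_1$-prediction variant parameterizes the endpoint instead of the average velocity.

```latex
% Average velocity of a flow with instantaneous velocity v over [r, t],
% and the resulting one-step update (finite-dimensional notation):
u(x_t, r, t) = \frac{1}{t - r} \int_r^t v(x_\tau, \tau)\, \mathrm{d}\tau,
\qquad
x_r = x_t - (t - r)\, u(x_t, r, t).
```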

TLDR: The paper introduces Functional Mean Flow (FMF), a one-step generative model in Hilbert space applicable to various functional data generation tasks, including images and time series, and proposes an improved training variant.

TLDR: The paper introduces Functional Mean Flow (FMF), a one-step generative model operating in Hilbert space that is applicable to a variety of functional data generation tasks, including images and time series, and proposes an improved training variant.

Relevance: (7/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Zhiqi Li, Yuchen Sun, Greg Turk, Bo Zhu