ArXiv CS.CV Papers (Image/Video Generation)

A Reason-then-Describe Instruction Interpreter for Controllable Video Generation

Diffusion Transformers have significantly improved video fidelity and temporal coherence; however, practical controllability remains limited. Concise, ambiguous, and compositionally complex user inputs contrast with the detailed prompts used in training, yielding an intent-output mismatch. We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation. We train ReaDe via a two-stage optimization: (i) reasoning-augmented supervision imparts analytic parsing with stepwise traces and dense captions, and (ii) a multi-dimensional reward assigner enables stable, feedback-driven refinement for natural-style captions. Experiments across single- and multi-condition scenarios show consistent gains in instruction fidelity, caption accuracy, and downstream video quality, with strong generalization to reasoning-intensive and unseen inputs. ReaDe offers a practical route to aligning controllable video generation with accurately interpreted user intent. Project Page: https://sqwu.top/ReaDe/.

TLDR: This paper introduces ReaDe, a novel instruction interpreter that enhances the controllability of video generation by first reasoning about user requests and then generating detailed specifications for downstream video generators, resulting in improved instruction fidelity and video quality.

TLDR: 本文介绍了一种名为ReaDe的新型指令解释器，通过首先推理用户请求，然后为下游视频生成器生成详细规范，从而增强视频生成的可控性，从而提高指令保真度和视频质量。

Relevance: (10/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Shengqiong Wu, Weicai Ye, Yuanxing Zhang, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Kun Gai, Hao Fei, Tat-Seng Chua

PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding

While recent video generation models have achieved significant visual fidelity, they often lack explicit physical controllability and plausibility. To address this, some recent studies have attempted to guide video generation with physics-based rendering. However, these methods face inherent challenges in accurately modeling complex physical properties and effectively controlling the resulting physical behavior over extended temporal sequences. In this work, we introduce PhysChoreo, a novel framework that can generate videos with diverse controllability and physical realism from a single image. Our method consists of two stages: first, it estimates the static initial physical properties of all objects in the image through part-aware physical property reconstruction. Then, through temporally instructed and physically editable simulation, it synthesizes high-quality videos with rich dynamic behaviors and physical realism. Experimental results show that PhysChoreo can generate videos with rich behaviors and physical realism, outperforming state-of-the-art methods on multiple evaluation metrics.

TLDR: PhysChoreo generates physically plausible and controllable videos from a single image by estimating initial physical properties and using physics simulation for video synthesis.

TLDR: PhysChoreo通过估计初始物理属性并使用物理模拟进行视频合成，从单个图像生成物理上合理且可控的视频。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (9/10)

Authors: Haoze Zhang, Tianyu Huang, Zichen Wan, Xiaowei Jin, Hongzhi Zhang, Hui Li, Wangmeng Zuo

STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow

Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.

TLDR: The paper introduces STARFlow-V, a normalizing flow-based video generator that achieves strong visual fidelity and temporal consistency, comparable to diffusion-based models, while offering end-to-end learning and native likelihood estimation.

TLDR: 该论文介绍了STARFlow-V，一种基于归一化流的视频生成器，它实现了强大的视觉保真度和时间一致性，与基于扩散的模型相当，同时提供端到端学习和原生似然估计。

Relevance: (10/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (9/10)

Authors: Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Angel Bautista, David Berthelot, Josh Susskind, Shuangfei Zhai

GigaWorld-0: World Models as Data Engine to Empower Embodied AI

World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8-precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.

TLDR: The paper introduces GigaWorld-0, a unified world model framework for generating high-quality, diverse, and controllable Vision-Language-Action (VLA) data. Training VLA models on this synthetic data achieves strong zero-shot real-world performance on physical robots.

TLDR: 该论文介绍了 GigaWorld-0，一个统一的世界模型框架，用于生成高质量、多样化和可控的视觉-语言-动作（VLA）数据。在此合成数据上训练的 VLA 模型在物理机器人上实现了强大的零样本真实世界性能。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, Zheng Zhu

Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation

Diffusion Transformers dominate video generation, but the quadratic complexity of attention computation introduces substantial latency. Attention sparsity reduces computational costs by focusing on critical tokens while ignoring non-critical tokens. However, existing methods suffer from severe performance degradation. In this paper, we revisit attention sparsity and reveal that existing methods induce systematic biases in attention allocation: (1) excessive focus on critical tokens amplifies their attention weights; (2) complete neglect of non-critical tokens causes the loss of relevant attention weights. To address these issues, we propose Rectified SpaAttn, which rectifies attention allocation with implicit full attention reference, thereby enhancing the alignment between sparse and full attention maps. Specifically: (1) for critical tokens, we show that their bias is proportional to the sparse attention weights, with the ratio governed by the amplified weights. Accordingly, we propose Isolated-Pooling Attention Reallocation, which calculates accurate rectification factors by reallocating multimodal pooled weights. (2) for non-critical tokens, recovering attention weights from the pooled query-key yields attention gains but also introduces pooling errors. Therefore, we propose Gain-Aware Pooling Rectification, which ensures that the rectified gain consistently surpasses the induced error. Moreover, we customize and integrate the Rectified SpaAttn kernel using Triton, achieving up to 3.33 and 2.08 times speedups on HunyuanVideo and Wan 2.1, respectively, while maintaining high generation quality. We release Rectified SpaAttn as open-source at https://github.com/BienLuky/Rectified-SpaAttn .
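
The first bias named in the abstract can be reproduced with plain top-k softmax, independent of any particular model. The sketch below is only an illustration under assumed toy logits (the `scores` values and `k` are hypothetical) and is not the Rectified SpaAttn method or its Triton kernel. It shows that restricting softmax to the kept keys scales every kept weight by the same factor, and that the resulting bias equals the sparse weight times the attention mass that was dropped.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_sparse_softmax(scores, k):
    """Softmax over only the k highest-scoring keys; all other keys get weight 0."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(keep)
    kept_weights = softmax([scores[i] for i in keep])
    out = [0.0] * len(scores)
    for i, w in zip(keep, kept_weights):
        out[i] = w
    return out


# Hypothetical query-key logits for one query, already sorted for readability.
scores = [2.0, 1.5, 0.5, 0.2, 0.1, 0.0]
k = 2

full = softmax(scores)                   # reference: full attention
sparse = topk_sparse_softmax(scores, k)  # sparse attention over the top-k keys

# Attention mass that full attention assigns to the dropped ("non-critical") keys.
lost_mass = sum(full[k:])

# Every kept weight is inflated by the same factor 1 / (1 - lost_mass).
amplification = [s / f for s, f in zip(sparse[:k], full[:k])]

# Bias on each kept key is proportional to its sparse weight.
bias = [s - f for s, f in zip(sparse[:k], full[:k])]
```

Here `bias[i]` equals `sparse[i] * lost_mass` for each kept key, which is one simple way to see why a correction factor for critical tokens needs an estimate of the mass on the ignored tokens.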

TLDR: This paper introduces Rectified SpaAttn, a novel attention sparsity method for efficient video generation that addresses biases in existing sparse attention mechanisms by rectifying attention allocation with implicit full attention reference, achieving significant speedups while maintaining generation quality.

TLDR: 该论文介绍了Rectified SpaAttn，一种用于高效视频生成的新型注意力稀疏方法，通过使用隐式完整注意力参考来纠正注意力分配，从而解决现有稀疏注意力机制中的偏差，并在保持生成质量的同时实现了显着的加速。

Relevance: (10/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Xuewen Liu, Zhikai Li, Jing Zhang, Mengjuan Chen, Qingyi Gu

Terminal Velocity Matching

We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the $2$-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.

TLDR: The paper introduces Terminal Velocity Matching (TVM), a generalization of flow matching for high-fidelity few-step generative modeling, achieving state-of-the-art FID scores in image generation via architectural changes and an efficient fused attention kernel.

TLDR: 该论文提出了终端速度匹配（TVM），这是一种流量匹配的泛化方法，用于高保真度的少步生成建模，通过架构改进和高效的融合注意力内核在图像生成中实现了最先进的FID分数。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Linqi Zhou, Mathias Parger, Ayaan Haque, Jiaming Song

One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer

We identify a core failure mode that occurs when using the usual linear interpolation on rotary positional embeddings (RoPE) for mixed-resolution denoising with Diffusion Transformers. When tokens from different spatial grids are mixed, the attention mechanism collapses. The issue is structural. Linear coordinate remapping forces a single attention head to compare RoPE phases sampled at incompatible rates, creating phase aliasing that destabilizes the score landscape. Pretrained DiTs are especially brittle: many heads exhibit extremely sharp, periodic phase selectivity, so even tiny cross-rate inconsistencies reliably cause blur, artifacts, or full collapse. To address this, our main contribution is Cross-Resolution Phase-Aligned Attention (CRPA), a training-free drop-in fix that eliminates this failure at its source. CRPA modifies only the RoPE index map for each attention call: all Q/K positions are expressed on the query's stride so that equal physical distances always induce identical phase increments. This restores the precise phase patterns that DiTs rely on. CRPA is fully compatible with pretrained DiTs and stabilizes all heads and layers uniformly. We demonstrate that CRPA enables high-fidelity and efficient mixed-resolution generation, outperforming previous state-of-the-art methods on image and video generation.
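
The index-map idea can be pictured in one dimension. The sketch below is only one reading of the abstract's sentence about expressing positions "on the query's stride"; it is not the authors' implementation, and the helper names (`grid_coords`, `rope_indices_on_query_stride`), the grid layout, and the strides are all hypothetical. Tokens carry a physical coordinate, and for each attention call those coordinates are divided by the stride of the query's grid before being used as RoPE indices.

```python
def grid_coords(num_tokens, stride, origin=0.0):
    """Physical coordinates of a 1D token grid: token i sits at origin + i * stride."""
    return [origin + i * stride for i in range(num_tokens)]


def rope_indices_on_query_stride(key_coords, query_stride):
    """Express physical key coordinates as (possibly fractional) RoPE indices
    measured in units of the query grid's stride."""
    return [c / query_stride for c in key_coords]


# Hypothetical mixed-resolution layout: a fine region followed by a coarse region.
fine = grid_coords(4, stride=1.0)                 # coordinates 0, 1, 2, 3
coarse = grid_coords(3, stride=2.0, origin=4.0)   # coordinates 4, 6, 8
keys = fine + coarse

# Index map used when the query comes from the fine grid (stride 1).
idx_for_fine_query = rope_indices_on_query_stride(keys, query_stride=1.0)

# Index map used when the query comes from the coarse grid (stride 2).
idx_for_coarse_query = rope_indices_on_query_stride(keys, query_stride=2.0)
```

In the coarse query's frame, neighbouring coarse tokens are one index apart, and a physical gap of 2 yields the same index increment whether it separates two fine tokens or two coarse tokens.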

TLDR: The paper introduces Cross-Resolution Phase-Aligned Attention (CRPA), a training-free fix for Diffusion Transformers' attention collapse in mixed-resolution denoising by ensuring consistent phase increments in RoPE, improving image and video generation fidelity.

TLDR: 该论文介绍了一种名为Cross-Resolution Phase-Aligned Attention (CRPA)的免训练修复方法，用于解决Diffusion Transformers在混合分辨率去噪中出现的注意力机制崩溃问题，通过确保RoPE中一致的相位增量，从而提高图像和视频生成的保真度。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Haoyu Wu, Jingyi Xu, Qiaomu Miao, Dimitris Samaras, Hieu Le

DINO-Tok: Adapting DINO for Visual Tokenizers

Recent advances in visual generation have highlighted the rise of Latent Generative Models (LGMs), which rely on effective visual tokenizers to bridge pixels and semantics. However, existing tokenizers are typically trained from scratch and struggle to balance semantic representation and reconstruction fidelity, particularly in high-dimensional latent spaces. In this work, we introduce DINO-Tok, a DINO-based visual tokenizer that unifies hierarchical representations into an information-complete latent space. By integrating shallow features that retain fine-grained details with deep features encoding global semantics, DINO-Tok effectively bridges pretrained representations and visual generation. We further analyze the challenges of vector quantization (VQ) in this high-dimensional space, where key information is often lost and codebook collapse occurs. We thus propose a global PCA reweighting mechanism to stabilize VQ and preserve essential information across dimensions. On ImageNet 256$\times$256, DINO-Tok achieves state-of-the-art reconstruction performance, reaching 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling, significantly outperforming prior tokenizers and performing comparably to models trained on billion-scale data (such as Hunyuan and Wan). These results demonstrate that adapting powerful pretrained vision models like DINO for tokenization yields semantically aligned and high-fidelity latent representations, enabling next-generation visual generative models. Code will be publicly available at https://github.com/MKJia/DINO-Tok.

TLDR: DINO-Tok is a novel DINO-based visual tokenizer that improves reconstruction fidelity and semantic representation in latent generative models by unifying hierarchical features and using a PCA reweighting mechanism for VQ to prevent codebook collapse, achieving SOTA results on ImageNet.

TLDR: DINO-Tok 是一种新颖的基于 DINO 的视觉 tokenizer，通过统一分层特征并使用 PCA 重新加权机制来防止 VQ 的码本崩溃，从而提高了潜在生成模型中的重建保真度和语义表示，并在 ImageNet 上实现了 SOTA 结果。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Mingkai Jia, Mingxiao Li, Liaoyuan Fan, Tianxing Shi, Jiaxin Guo, Zeming Li, Xiaoyang Guo, Xiao-Xiao Long, Qian Zhang, Ping Tan, Wei Yin

Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox

TLDR: This paper introduces UniSandbox, a framework to evaluate if understanding truly informs generation in unified multimodal models. The authors identify and address an understanding-generation gap, particularly in reasoning and knowledge transfer, using Chain-of-Thought techniques.

TLDR: 本文介绍了一个名为UniSandbox的框架，用于评估统一多模态模型中理解是否真正影响生成。作者发现并解决了理解与生成之间的差距，特别是在推理和知识迁移方面，并使用了思维链（Chain-of-Thought）技术。

Relevance: (7/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, Li Yuan

Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning

Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only $2.1\%$ of its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Code is coming soon.

TLDR: Flash-DMD introduces a novel framework for fast and high-fidelity image generation through efficient timestep distillation and joint reinforcement learning, achieving state-of-the-art results with reduced training costs and improved stability.

TLDR: Flash-DMD 提出了一种新的框架，通过高效的时间步长蒸馏和联合强化学习实现快速、高保真的图像生成，以更低的训练成本和更高的稳定性实现了最先进的结果。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Guanjie Chen, Shirui Huang, Kai Liu, Jianchen Zhu, Xiaoye Qu, Peng Chen, Yu Cheng, Yifu Sun

Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models

Visual Language Models (VLMs) are powerful generative tools but often produce factually inaccurate outputs due to a lack of robust reasoning capabilities. While extensive research has been conducted on integrating external knowledge for reasoning in large language models (LLMs), such efforts remain underexplored in VLMs, where the challenge is compounded by the need to bridge multiple modalities seamlessly. This work introduces a framework for knowledge-guided reasoning in VLMs, leveraging structured knowledge graphs for multi-hop verification and using an image-captioning task to illustrate the framework. Our approach enables systematic reasoning across multiple steps, including visual entity recognition, knowledge graph traversal, and fact-based caption refinement. We evaluate the framework using hierarchical, triple-based, and bullet-point-based knowledge representations, analyzing their effectiveness in factual accuracy and logical inference. Empirical results show that our approach improves factual accuracy by approximately 31% in preliminary experiments on a curated dataset mixing Google Landmarks v2, Conceptual Captions, and COCO Captions, revealing key insights into reasoning patterns and failure modes. This work demonstrates the potential of integrating external knowledge for advancing reasoning in VLMs, paving the way for more reliable and knowledgeable multimodal systems.

TLDR: The paper introduces a framework for improving factual accuracy in Vision-Language Models (VLMs) by integrating structured knowledge graphs for multi-hop reasoning, showcasing improved accuracy in image captioning tasks.

TLDR: 该论文介绍了一个框架，通过整合结构化知识图谱进行多跳推理，以提高视觉语言模型（VLMs）的事实准确性，并在图像描述任务中展示了显著的准确性提升。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Shamima Hossain

HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation

Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing over 40% attention sharing, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.

TLDR: The paper introduces HBridge, an asymmetric H-shaped architecture for unified multimodal understanding and generation that selectively bridges heterogeneous experts (LLMs and diffusion models) to improve efficiency and generation quality by decoupling modality-specific layers and promoting semantic alignment.

TLDR: 该论文介绍了 HBridge，一种用于统一多模态理解和生成的非对称 H 形架构，它选择性地桥接异构专家（LLM 和扩散模型），通过解耦特定模态层并促进语义对齐，从而提高效率和生成质量。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Xiang Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yuqian Zhou, Qing Liu, Shiwei Zhang, Yijun Li, Shaoteng Liu, Haitian Zheng, Jason Kuen, Yuehuan Wang, Changxin Gao, Nong Sang

Learning to Generate Human-Human-Object Interactions from Textual Descriptions

The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we present a novel research problem to model the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOI dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediary, we obtain individual human-object interactions (HOIs) and human-human interactions (HHIs) from the HHOIs, and with these data, we train text-to-HOI and text-to-HHI models using a score-based diffusion model. Finally, we present a unified generative framework that integrates the two individual models and is capable of synthesizing complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.

TLDR: This paper introduces a new problem of generating Human-Human-Object Interactions (HHOIs) from text, presenting a new dataset and a score-based diffusion model for synthesizing realistic HHOIs, even extending to multi-human scenarios.

TLDR: 本文提出了一个从文本生成人-人-物交互 (HHOI) 的新问题，提出了一个新的数据集和一个基于分数的扩散模型来合成逼真的 HHOI，甚至扩展到多人场景。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Jeonghyeon Na, Sangwon Baik, Inhee Lee, Junyoung Lee, Hanbyul Joo

Block Cascading: Training Free Acceleration of Block-Causal Video Models

Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades where multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates overhead from KV-recaching (of ~200ms) during context switches for interactive generation. Extensive evaluations validated against multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading pipelines for inference. Project Page: https://hmrishavbandy.github.io/block_cascading_page/
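
The "key insight" in the abstract is essentially a scheduling change, which a toy timeline makes concrete. The sketch below is an assumption-laden illustration rather than the authors' pipeline: it treats every denoising step as one unit of time, assumes one device per active block, and uses hypothetical parameters (`num_blocks`, `steps_per_block`, `start_lag`). A block is allowed to start once its predecessor has finished `start_lag` steps instead of all of them.

```python
def sequential_makespan(num_blocks, steps_per_block):
    """Total time units when each block waits for the previous one to fully denoise."""
    return num_blocks * steps_per_block


def cascaded_schedule(num_blocks, steps_per_block, start_lag):
    """Return (start, end) time units per block when block b starts
    start_lag steps after block b-1 started (i.e. on partially denoised context)."""
    if not 1 <= start_lag <= steps_per_block:
        raise ValueError("start_lag must be between 1 and steps_per_block")
    return [(b * start_lag, b * start_lag + steps_per_block) for b in range(num_blocks)]


def peak_concurrency(schedule):
    """Largest number of blocks denoising at the same time (devices needed)."""
    last_end = max(end for _, end in schedule)
    return max(
        sum(1 for start, end in schedule if start <= t < end)
        for t in range(last_end)
    )


# Hypothetical toy configuration.
num_blocks, steps_per_block, start_lag = 6, 4, 2

schedule = cascaded_schedule(num_blocks, steps_per_block, start_lag)
cascaded_time = max(end for _, end in schedule)
sequential_time = sequential_makespan(num_blocks, steps_per_block)
devices_needed = peak_concurrency(schedule)
speedup = sequential_time / cascaded_time
```

With these toy numbers the cascade finishes in 14 time units instead of 24 while keeping two blocks in flight at once. Setting `start_lag` equal to `steps_per_block` recovers the sequential schedule.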

TLDR: This paper introduces Block Cascading, a training-free method to significantly accelerate block-causal video generation by parallelizing the denoising process across multiple GPUs without a drop in video quality, addressing the speed-quality trade-off.

TLDR: 本文介绍了Block Cascading，一种无需训练的方法，通过在多个GPU上并行化去噪过程，显著加速了块因果视频生成，且不降低视频质量，有效解决了速度与质量的权衡问题。

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Hmrishav Bandyopadhyay, Nikhil Pinnaparaju, Rahim Entezari, Jim Scott, Yi-Zhe Song, Varun Jampani

FREE: Uncertainty-Aware Autoregression for Parallel Diffusion Transformers

Diffusion Transformers (DiTs) achieve state-of-the-art generation quality but require long sequential denoising trajectories, leading to high inference latency. Recent speculative inference methods enable lossless parallel sampling in U-Net-based diffusion models via a drafter-verifier scheme, but their acceleration is limited on DiTs due to insufficient draft accuracy during verification. To address this limitation, we analyze the DiTs' feature dynamics and find the features of the final transformer layer (top-block) exhibit strong temporal consistency and rich semantic abstraction. Based on this insight, we propose FREE, a novel framework that employs a lightweight drafter to perform feature-level autoregression with parallel verification, guaranteeing lossless acceleration with theoretical and empirical support. Meanwhile, prediction variance (uncertainty) of DiTs naturally increases in later denoising steps, reducing acceptance rates under speculative sampling. To mitigate this effect, we further introduce an uncertainty-guided relaxation strategy, forming FREE (relax), which dynamically adjusts the acceptance probability in response to uncertainty levels. Experiments on ImageNet-$512^2$ show that FREE achieves up to $1.86 \times$ acceleration, and FREE (relax) further reaches $2.25 \times$ speedup while maintaining high perceptual and quantitative fidelity in generation quality.

TLDR: This paper introduces FREE, a method to accelerate Diffusion Transformers by using a lightweight autoregressive drafter at the feature level and an uncertainty-guided relaxation strategy for parallel verification, achieving up to 2.25x speedup in image generation.

TLDR: 本文提出了FREE，一种通过在特征层使用轻量级自回归起草器和不确定性引导的松弛策略进行并行验证来加速扩散Transformer的方法，从而在图像生成中实现了高达2.25倍的加速。

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Xinwan Wen, Bowen Li, Jiajun Luo, Ye Li, Zhi Wang

TReFT: Taming Rectified Flow Models For One-Step Image Translation

Rectified Flow (RF) models have advanced high-quality image and video synthesis via optimal transport theory. However, when applied to image-to-image translation, they still depend on costly multi-step denoising, hindering real-time applications. Although the recent adversarial training paradigm, CycleGAN-Turbo, works with pretrained diffusion models for one-step image translation, we find that directly applying it to RF models leads to severe convergence issues. In this paper, we analyze these challenges and propose TReFT, a novel method to Tame Rectified Flow models for one-step image Translation. Unlike previous works, TReFT directly uses the velocity predicted by a pretrained DiT or UNet as output, a simple yet effective design that tackles the convergence issues under adversarial training with one-step inference. This design is mainly motivated by a novel observation that, near the end of the denoising process, the velocity predicted by pretrained RF models converges to the vector from the origin to the final clean image, a property we further justify through theoretical analysis. When applying TReFT to large pretrained RF models such as SD3.5 and FLUX, we introduce memory-efficient latent cycle-consistency and identity losses during training, as well as lightweight architectural simplifications for faster inference. Pretrained RF models finetuned with TReFT achieve performance comparable to state-of-the-art methods across multiple image translation datasets while enabling real-time inference.

TLDR: The paper introduces TReFT, a novel method to enable one-step image translation using Rectified Flow models by directly using the velocity predicted by pre-trained DiT or UNet as output, addressing convergence issues in adversarial training and achieving real-time inference.

TLDR: 该论文介绍了TReFT，一种新颖的方法，通过直接使用预训练的DiT或UNet预测的速度作为输出，来实现使用矫正流模型进行一步图像翻译。该方法解决了对抗训练中的收敛问题，并实现了实时推理。

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Shengqian Li, Ming Gao, Yi Liu, Zuzeng Lin, Feng Wang, Feng Dai

Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement

Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.
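
The refinement loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate`, `critique`, and `rewrite` are hypothetical callables standing in for the video generator, the vision-language model that lists physical inconsistencies, and the language model that rewrites the prompt.

```python
from typing import Callable, List, Tuple


def refine_prompt(
    prompt: str,
    generate: Callable[[str], object],
    critique: Callable[[object, str], List[str]],
    rewrite: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, object, int]:
    """Regenerate until the critic reports no issues or the round budget is spent.

    All three callables are hypothetical placeholders, not a real API.
    Returns the final prompt, the final video, and the number of rewrites applied.
    """
    video = generate(prompt)
    for round_idx in range(max_rounds):
        issues = critique(video, prompt)  # VLM role: list physics violations
        if not issues:
            return prompt, video, round_idx
        prompt = rewrite(prompt, issues)  # LLM role: fold the feedback into the prompt
        video = generate(prompt)
    return prompt, video, max_rounds
```

Because the loop only touches the prompt, it leaves the generator's weights alone, which is what makes this kind of scheme training-free and plug-and-play.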

TLDR: This paper introduces a training-free, plug-and-play method for improving the physics consistency of video generation by using large language models and vision-language models to provide iterative feedback and prompt refinement.

TLDR: 本文介绍了一种无需训练、即插即用的方法，通过使用大型语言模型和视觉-语言模型提供迭代反馈和提示改进，从而提高视频生成在物理上的一致性。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Yang Liu, Xilin Zhao, Peisong Wen, Siran Dai, Qingming Huang

The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models have been released.

TLDR: The paper introduces Adv-GRPO, a reinforcement learning framework for image generation that uses an adversarial reward model trained with reference images and foundation model features to improve image quality and aesthetics, outperforming existing RL methods.

TLDR: 该论文介绍了Adv-GRPO，一种用于图像生成的强化学习框架，它使用对抗性奖励模型，该模型通过参考图像和基础模型特征进行训练，以提高图像质量和美学效果，并且优于现有的强化学习方法。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Weijia Mao, Hao Chen, Zhenheng Yang, Mike Zheng Shou

PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling

Recent advances in text-to-image (T2I) generation have achieved remarkable visual outcomes through large-scale rectified flow models. However, how these models behave under long prompts remains underexplored. Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but often suppresses diversity, leading to repetitive and less creative outputs. In this work, we systematically study this fidelity-diversity dilemma and reveal that state-of-the-art models exhibit a clear drop in diversity as prompt length increases. To enable consistent evaluation, we introduce LPD-Bench, a benchmark designed for assessing both fidelity and diversity in long-prompt generation. Building on our analysis, we develop a theoretical framework that increases sampling entropy through prompt reformulation and propose a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Extensive experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drifting.
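
As a rough illustration of sampling prompt embeddings from a Mixture-of-Gaussians, the sketch below places a few component means near the original embedding and draws noisy samples around them. How the mixture is actually constructed is not stated in the abstract, so the component placement, the parameters `center_scale` and `sigma`, and the flat-list embedding format are all assumptions made for this example.

```python
import random
from typing import List, Sequence


def sample_mog_embeddings(
    embedding: Sequence[float],
    num_components: int = 4,
    center_scale: float = 0.05,
    sigma: float = 0.01,
    num_samples: int = 8,
    seed: int = 0,
) -> List[List[float]]:
    """Draw perturbed copies of a prompt embedding from a Gaussian mixture.

    Hypothetical construction: component means are small random offsets
    around the original embedding, with equal mixture weights.
    """
    rng = random.Random(seed)
    centers = [
        [x + center_scale * rng.gauss(0.0, 1.0) for x in embedding]
        for _ in range(num_components)
    ]
    samples = []
    for _ in range(num_samples):
        center = rng.choice(centers)  # pick a mixture component uniformly
        samples.append([c + sigma * rng.gauss(0.0, 1.0) for c in center])
    return samples
```

Each sample would then be passed to the generator in place of the original embedding. Keeping `center_scale` and `sigma` small is the knob that, in this toy version, trades diversity against staying close to the prompt's meaning.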

TLDR: The paper identifies a diversity problem in long-prompt image generation and introduces PromptMoG, a training-free method based on Mixture-of-Gaussians sampling in the prompt embedding space, to enhance diversity without sacrificing semantics.

TLDR: 该论文指出了长提示图像生成中的多样性问题，并引入 PromptMoG，一种基于提示嵌入空间中混合高斯抽样的无训练方法，以增强多样性而不牺牲语义。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Bo-Kai Ruan, Teng-Fang Hsiao, Ling Lo, Yi-Lun Wu, Hong-Han Shuai

Text-guided Controllable Diffusion for Realistic Camouflage Images Generation

Camouflage Images Generation (CIG) is an emerging research area that focuses on synthesizing images in which objects are harmoniously blended and exhibit high visual consistency with their surroundings. Existing methods perform CIG by either fusing objects into specific backgrounds or outpainting the surroundings via foreground object-guided diffusion. However, they often fail to obtain natural results because they overlook the logical relationship between camouflaged objects and background environments. To address this issue, we propose CT-CIG, a Controllable Text-guided Camouflage Images Generation method that produces realistic and logically plausible camouflage images. Leveraging Large Visual Language Models (VLM), we design a Camouflage-Revealing Dialogue Mechanism (CRDM) to annotate existing camouflage datasets with high-quality text prompts. Subsequently, the constructed image-prompt pairs are utilized to finetune Stable Diffusion, incorporating a lightweight controller to guide the location and shape of camouflaged objects for enhanced camouflage scene fitness. Moreover, we design a Frequency Interaction Refinement Module (FIRM) to capture high-frequency texture features, facilitating the learning of complex camouflage patterns. Extensive experiments, including CLIPScore evaluation and camouflage effectiveness assessment, demonstrate the semantic alignment of our generated text prompts and CT-CIG's ability to produce photorealistic camouflage images.

TLDR: The paper introduces CT-CIG, a text-guided controllable diffusion method for generating realistic camouflage images by leveraging VLMs to create annotated image-prompt pairs that are then used to finetune Stable Diffusion with a controller and a frequency interaction refinement module.

TLDR: 该论文介绍了 CT-CIG，一种文本引导的可控扩散方法，通过利用 VLM 创建带注释的图像-提示对来生成逼真的伪装图像，然后使用控制器和频率交互细化模块对 Stable Diffusion 进行微调。

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Yuhang Qian, Haiyan Chen, Wentong Li, Ningzhong Liu, Jie Qin

Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis

Foundation video generation models such as WAN 2.2 exhibit strong text- and image-conditioned synthesis abilities but remain constrained to the same-view generation setting. In this work, we introduce Exo2EgoSyn, an adaptation of WAN 2.2 that unlocks Exocentric-to-Egocentric (Exo2Ego) cross-view video synthesis. Our framework consists of three key modules. Ego-Exo View Alignment (EgoExo-Align) enforces latent-space alignment between exocentric and egocentric first-frame representations, reorienting the generative space from the given exo view toward the ego view. Multi-view Exocentric Video Conditioning (MultiExoCon) aggregates multi-view exocentric videos into a unified conditioning signal, extending WAN 2.2 beyond its vanilla single-image or text conditioning. Furthermore, Pose-Aware Latent Injection (PoseInj) injects relative exo-to-ego camera pose information into the latent state, guiding geometry-aware synthesis across viewpoints. Together, these modules enable high-fidelity ego view video generation from third-person observations without retraining from scratch. Experiments on ExoEgo4D validate that Exo2EgoSyn significantly improves Exo2Ego synthesis, paving the way for scalable cross-view video generation with foundation models. Source code and models will be released publicly.

TLDR: The paper introduces Exo2EgoSyn, a framework that adapts a foundation video generation model (WAN 2.2) for exocentric-to-egocentric video synthesis by aligning views, aggregating multi-view exocentric videos, and injecting pose information, without retraining from scratch.

TLDR: 该论文介绍了Exo2EgoSyn，一个将基础视频生成模型(WAN 2.2)适配于从以外部视角到以自我为中心的视频合成的框架。该框架通过对齐视角、聚合多视角外部视频并注入姿态信息，实现了无需重新训练即可进行视角转换。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Mohammad Mahdi, Yuqian Fu, Nedko Savov, Jiancheng Pan, Danda Pani Paudel, Luc Van Gool

OmniRefiner: Reinforcement-Guided Local Diffusion Refinement

Reference-guided image generation has progressed rapidly, yet current diffusion models still struggle to preserve fine-grained visual details when refining a generated image using a reference. This limitation arises because VAE-based latent compression inherently discards subtle texture information, causing identity- and attribute-specific cues to vanish. Moreover, post-editing approaches that amplify local details based on existing methods often produce results inconsistent with the original image in terms of lighting, texture, or shape. To address this, we introduce OmniRefiner, a detail-aware refinement framework that performs two consecutive stages of reference-driven correction to enhance pixel-level consistency. We first adapt a single-image diffusion editor by fine-tuning it to jointly ingest the draft image and the reference image, enabling globally coherent refinement while maintaining structural fidelity. We then apply reinforcement learning to further strengthen localized editing capability, explicitly optimizing for detail accuracy and semantic consistency. Extensive experiments demonstrate that OmniRefiner significantly improves reference alignment and fine-grained detail preservation, producing faithful and visually coherent edits that surpass both open-source and commercial models on challenging reference-guided restoration benchmarks.

TLDR: OmniRefiner introduces a novel reinforcement-guided diffusion refinement framework to improve fine-grained detail preservation and consistency in reference-guided image generation, outperforming existing methods.

TLDR: OmniRefiner 提出了一种新的强化引导的扩散细化框架，旨在提高参考引导图像生成中细粒度细节的保存和一致性，性能优于现有方法。

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Yaoli Liu, Ziheng Ouyang, Shengtao Lou, Yiren Song

HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning

Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.

TLDR: The paper introduces HiCoGen, a hierarchical compositional text-to-image generation framework using diffusion models and reinforcement learning to address issues with complex prompts by iteratively synthesizing semantic units and optimizing exploration through a novel stochasticity schedule.

TLDR: 本文介绍了HiCoGen，一个分层组合文本到图像生成框架，它使用扩散模型和强化学习来解决复杂提示的问题。该框架通过迭代合成语义单元，并通过一种新颖的随机性调度来优化探索。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Hongji Yang, Yucheng Zhou, Wencheng Han, Runzhou Tao, Zhongying Qiu, Jianfei Yang, Jianbing Shen

Low-Resolution Editing is All You Need for High-Resolution Editing

High-resolution content creation is rapidly emerging as a central challenge in both the vision and graphics communities. While images serve as the most fundamental modality for visual expression, content generation that aligns with the user intent requires effective, controllable high-resolution image manipulation mechanisms. However, existing approaches remain limited to low-resolution settings, typically supporting only up to 1K resolution. In this work, we introduce the task of high-resolution image editing and propose a test-time optimization framework to address it. Our method performs patch-wise optimization on high-resolution source images, followed by a fine-grained detail transfer module and a novel synchronization strategy to maintain consistency across patches. Extensive experiments show that our method produces high-quality edits, paving the way toward high-resolution content creation.

TLDR: The paper introduces a test-time optimization framework for high-resolution image editing, addressing the limitations of existing methods that are constrained to low-resolution settings. The proposed method uses patch-wise optimization with detail transfer and synchronization to produce high-quality edits.

TLDR: 该论文介绍了一个用于高分辨率图像编辑的测试时优化框架，解决了现有方法仅限于低分辨率设置的局限性。该方法采用分块优化，并具有细节迁移和同步机制，以产生高质量的编辑。

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Junsung Lee, Hyunsoo Lee, Yong Jae Lee, Bohyung Han

Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks

Large Multimodal Models (LMMs) have achieved remarkable progress in aligning and generating content across text and image modalities. However, the potential of using non-visual, continuous sequential data as a conditioning signal for high-fidelity image generation remains largely unexplored. Furthermore, existing methods that convert series into "pseudo-images" for temporal forecasting fail to establish semantic-level alignment. In this paper, we propose TimeArtist, a temporal-visual conversion framework that pioneers semantic-level alignment between time series fluctuations and visual concepts. It follows a "warmup-align" paradigm: first, a dual-autoencoder and shared quantizer are trained in a self-supervised manner on large-scale datasets to learn modality-shared representations. Then, the encoders and quantizer are frozen, and a projection is introduced to align temporal and visual samples at the representation level. TimeArtist establishes a versatile cross-modal framework, enabling high-quality, diverse image generation directly from time series, while capturing temporal fluctuation patterns to render images as a form of style transfer. Extensive experiments show that TimeArtist achieves satisfactory performance in image generation metrics, while also attaining superior results in zero-shot temporal tasks. Our work establishes a new paradigm for cross-modal generation, bridging the gap between temporal dynamics and visual semantics.

TLDR: The paper introduces TimeArtist, a framework for aligning time series data with visual concepts to enable high-quality image generation from temporal data and improve zero-shot temporal task performance by transferring spatial priors from vision models.

TLDR: 该论文介绍了TimeArtist，一个将时间序列数据与视觉概念对齐的框架，旨在通过从视觉模型迁移空间先验来，实现根据时间数据高质量图像生成并提高零样本时间任务的性能。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Xiangkai Ma, Han Zhang, Wenzhong Li, Sanglu Lu

4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models

World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, embodied intelligence, and content creation. However, prior benchmarks emphasize different evaluation dimensions and lack a unified assessment of world-realism capability. To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as Image-to-3D/4D, Video-to-4D, Text-to-3D/4D. Beyond these, we innovatively introduce adaptive conditioning across multiple modalities, which not only integrates but also extends traditional evaluation paradigms. To accommodate different modality-conditioned inputs, we map all modality conditions into a unified textual space during evaluation, and further integrate LLM-as-judge, MLLM-as-judge, and traditional network-based methods. This unified and adaptive design enables more comprehensive and consistent evaluation of alignment, physical realism, and cross-modal coherence. Preliminary human studies further demonstrate that our adaptive tool selection achieves closer agreement with subjective human judgments. We hope this benchmark will serve as a foundation for objective comparisons and improvements, accelerating the transition from "visual generation" to "world generation." Our project can be found at https://yeppp27.github.io/4DWorldBench.github.io/.

TLDR: The paper introduces 4DWorldBench, a novel benchmark for evaluating 3D/4D world generation models, covering perceptual quality, physical realism, 4D consistency, and condition alignment across multiple modalities via a unified text-based evaluation.

TLDR: 本文介绍了一个新的基准测试4DWorldBench，用于评估3D/4D世界生成模型，涵盖感知质量、物理真实感、4D一致性和条件对齐等多个维度。该基准使用统一的基于文本的评估方法来处理跨模态输入。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Yiting Lu, Wei Luo, Peiyan Tu, Haoran Li, Hanxin Zhu, Zihao Yu, Xingrui Wang, Xinyi Chen, Xinge Peng, Xin Li, Zhibo Chen

ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding

We present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous works by aligning the spatiotemporal positions of the input video and the target retake. Moreover, we introduce Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that captures and integrates multi-view relationships within and across the input and target videos. By integrating camera conditions into RoPE, our method generalizes to out-of-distribution camera trajectories and video lengths, yielding improved dynamic object localization and static background preservation. Extensive experiments further demonstrate significant improvements in camera controllability, geometric consistency, and video quality across various trajectories and lengths.

TLDR: ReDirector introduces a novel method, Rotary Camera Encoding (RoCE), to generate variable-length video retakes by integrating camera conditions into RoPE for improved camera controllability, geometric consistency, and video quality, even with out-of-distribution camera trajectories.

TLDR: ReDirector 提出了一种名为旋转相机编码（RoCE）的新方法，通过将相机条件集成到 RoPE 中，生成可变长度的视频重拍，从而提高了相机可控性、几何一致性和视频质量，即使在非分布相机轨迹下也是如此。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Byeongjun Park, Byung-Hoon Kim, Hyungjin Chung, Jong Chul Ye

Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization

Image diversity remains a fundamental challenge for text-to-image diffusion models. Low-diversity models tend to generate repetitive outputs, increasing sampling redundancy and hindering both creative exploration and downstream applications. A primary cause is that generation often collapses toward a strong mode in the learned distribution. Existing attempts to improve diversity, such as noise resampling, prompt rewriting, or steering-based guidance, often still collapse to dominant modes or introduce distortions that degrade image quality. In light of this, we propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. At the same time, the prompt-level space provides a global semantic constraint that regulates distribution shifts, preventing quality degradation while maintaining high fidelity. Extensive experiments on MS-COCO and three diffusion backbones show that TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality. Code will be released upon acceptance.

TLDR: The paper introduces TPSO, a training-free module for text-to-image diffusion models, designed to improve image diversity without sacrificing fidelity by optimizing prompt embeddings to explore underrepresented regions of the semantic space.

TLDR: 该论文介绍了一种名为 TPSO 的训练自由模块，用于文本到图像扩散模型，旨在通过优化提示嵌入来探索语义空间中未充分表示的区域，从而提高图像多样性，同时不牺牲保真度。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Debin Meng, Chen Jin, Zheng Gao, Yanran Li, Ioannis Patras, Georgios Tzimiropoulos

Efficient Transferable Optimal Transport via Min-Sliced Transport Plans

Optimal Transport (OT) offers a powerful framework for finding correspondences between distributions and addressing matching and alignment problems in various areas of computer vision, including shape analysis, image generation, and multimodal tasks. The computation cost of OT, however, hinders its scalability. Slice-based transport plans have recently shown promise for reducing the computational cost by leveraging the closed-form solutions of 1D OT problems. These methods optimize a one-dimensional projection (slice) to obtain a conditional transport plan that minimizes the transport cost in the ambient space. While efficient, these methods leave open the question of whether learned optimal slicers can transfer to new distribution pairs under distributional shift. Understanding this transferability is crucial in settings with evolving data or repeated OT computations across closely related distributions. In this paper, we study the min-Sliced Transport Plan (min-STP) framework and investigate the transferability of optimized slicers: can a slicer trained on one distribution pair yield effective transport plans for new, unseen pairs? Theoretically, we show that optimized slicers remain close under slight perturbations of the data distributions, enabling efficient transfer across related tasks. To further improve scalability, we introduce a minibatch formulation of min-STP and provide statistical guarantees on its accuracy. Empirically, we demonstrate that the transferable min-STP achieves strong one-shot matching performance and facilitates amortized training for point cloud alignment and flow-based generative modeling.
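
The idea of a sliced transport plan can be sketched for two equal-size 2D point sets with uniform weights: project both sets onto a direction, match points by sorted order (the closed-form 1D solution), and score that matching by its cost in the original space. The sketch below picks the best direction from a fixed grid of angles, which is a simple stand-in for the slicer optimization studied in the paper; the function names are illustrative, not the authors' code.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]
Plan = List[Tuple[int, int]]


def sliced_plan(xs: Sequence[Point], ys: Sequence[Point], theta: Point) -> Plan:
    """Match points by the sorted order of their 1D projections onto theta."""
    def proj(p: Point) -> float:
        return sum(a * b for a, b in zip(p, theta))

    x_order = sorted(range(len(xs)), key=lambda i: proj(xs[i]))
    y_order = sorted(range(len(ys)), key=lambda j: proj(ys[j]))
    return list(zip(x_order, y_order))


def plan_cost(xs: Sequence[Point], ys: Sequence[Point], plan: Plan) -> float:
    """Mean squared Euclidean cost of the matching, measured in the ambient space."""
    total = sum(
        sum((a - b) ** 2 for a, b in zip(xs[i], ys[j])) for i, j in plan
    )
    return total / len(plan)


def min_sliced_plan(
    xs: Sequence[Point], ys: Sequence[Point], num_angles: int = 180
) -> Tuple[float, Tuple[float, float], Plan]:
    """Grid search over 2D directions (assumption: replaces learned slicers)."""
    best = None
    for k in range(num_angles):
        angle = math.pi * k / num_angles
        theta = (math.cos(angle), math.sin(angle))
        plan = sliced_plan(xs, ys, theta)
        cost = plan_cost(xs, ys, plan)
        if best is None or cost < best[0]:
            best = (cost, theta, plan)
    return best
```

Each candidate slice costs only two sorts plus one pass over the matched pairs, which is where the efficiency over solving the full OT problem comes from. Transferability would correspond to reusing the chosen `theta` on a new, related pair of point sets instead of searching again.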

TLDR: This paper introduces an efficient, transferable version of Optimal Transport (OT) using min-Sliced Transport Plans (min-STP), demonstrating theoretical guarantees and empirical performance in point cloud alignment and flow-based generative modeling. It addresses the scalability issues of OT by exploring the transferability of optimal slicers across different distribution pairs.

TLDR: 本文介绍了一种高效、可迁移的最优传输(OT)方法，使用最小切片传输计划(min-STP)，并在点云对齐和基于流的生成模型中展示了理论保证和经验性能。它通过探索在不同分布对之间最优切片器的可迁移性，解决了OT的可扩展性问题。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Xinran Liu, Elaheh Akbari, Rocio Diaz Martin, Navid NaderiAlizadeh, Soheil Kolouri

Are Image-to-Video Models Good Zero-Shot Image Editors?

Large-scale video diffusion models show strong world simulation and temporal reasoning abilities, but their use as zero-shot image editors remains underexplored. We introduce IF-Edit, a tuning-free framework that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three key challenges: prompt misalignment, redundant temporal latents, and blurry late-stage frames. It includes (1) a chain-of-thought prompt enhancement module that transforms static editing instructions into temporally grounded reasoning prompts; (2) a temporal latent dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving semantic and temporal coherence; and (3) a self-consistent post-refinement step that sharpens late-stage frames using a short still-video trajectory. Experiments on four public benchmarks, covering non-rigid editing, physical and temporal reasoning, and general instruction edits, show that IF-Edit performs strongly on reasoning-centric tasks while remaining competitive on general-purpose edits. Our study provides a systematic view of video diffusion models as image editors and highlights a simple recipe for unified video-image generative reasoning.

TLDR: This paper introduces IF-Edit, a tuning-free framework that leverages pretrained image-to-video diffusion models for instruction-driven image editing, addressing challenges like prompt misalignment and blurry frames.

TLDR: 本文介绍了一个名为IF-Edit的免调优框架，该框架利用预训练的图像到视频扩散模型进行指令驱动的图像编辑，解决了诸如提示不对齐和帧模糊等问题。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Zechuan Zhang, Zhenyuan Chen, Zongxin Yang, Yi Yang

Breaking the Likelihood-Quality Trade-off in Diffusion Models by Merging Pretrained Experts

Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives emphasizing high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pretrained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning -- only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results demonstrate that expert switching across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.
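
The switching rule can be sketched as a sampling loop that calls one expert before a chosen step and the other afterwards. This is a minimal illustration under assumed interfaces: each expert is a hypothetical callable that takes the current sample and a noise level and returns the next sample, and `noise_levels` is assumed to run from high noise to low noise.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Denoiser = Callable[[T, float], T]  # hypothetical interface: (sample, noise_level) -> next sample


def sample_with_switch(
    x_init: T,
    noise_levels: Sequence[float],
    quality_expert: Denoiser,
    likelihood_expert: Denoiser,
    switch_step: int,
) -> Tuple[T, List[str]]:
    """Run the quality expert for the first `switch_step` steps, then the likelihood expert."""
    x = x_init
    used: List[str] = []
    for step, level in enumerate(noise_levels):
        if step < switch_step:
            x = quality_expert(x, level)  # high noise: shape global structure
            used.append("quality")
        else:
            x = likelihood_expert(x, level)  # low noise: refine pixel statistics
            used.append("likelihood")
    return x, used
```

In this sketch `switch_step` is the only hyperparameter, mirroring the abstract's point that the method needs just the choice of an intermediate switching step.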

TLDR: This paper introduces a simple method to improve both image quality and likelihood in diffusion models by switching between two pre-trained experts at different noise levels during the sampling process, without requiring any retraining.

TLDR: 本文介绍了一种简单的方法，通过在采样过程中在不同噪声水平下切换两个预训练专家模型，从而在扩散模型中提高图像质量和似然性，而无需任何重新训练。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Yasin Esfandiari, Stefan Bauer, Sebastian U. Stich, Andrea Dittadi

Flow Map Distillation Without Data

State-of-the-art flow models achieve remarkable quality but require slow, iterative sampling. To accelerate this, flow maps can be distilled from pre-trained teachers, a procedure that conventionally requires sampling from an external dataset. We argue that this data-dependency introduces a fundamental risk of Teacher-Data Mismatch, as a static dataset may provide an incomplete or even misaligned representation of the teacher's full generative capabilities. This leads us to question whether this reliance on data is truly necessary for successful flow map distillation. In this work, we explore a data-free alternative that samples only from the prior distribution, a distribution the teacher is guaranteed to follow by construction, thereby circumventing the mismatch risk entirely. To demonstrate the practical viability of this philosophy, we introduce a principled framework that learns to predict the teacher's sampling path while actively correcting for its own compounding errors to ensure high fidelity. Our approach surpasses all data-based counterparts and establishes a new state-of-the-art by a significant margin. Specifically, distilling from SiT-XL/2+REPA, our method reaches an impressive FID of 1.45 on ImageNet 256x256, and 1.49 on ImageNet 512x512, both with only 1 sampling step. We hope our work establishes a more robust paradigm for accelerating generative models and motivates the broader adoption of flow map distillation without data.

TLDR: This paper introduces a data-free flow map distillation method for accelerating generative models, achieving state-of-the-art FID scores on ImageNet using only prior samples and a novel error-correction framework. It addresses the Teacher-Data Mismatch problem.

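TLDR: This paper proposes a data-free flow map distillation method for accelerating generative models. Using only prior samples and a novel error-correction framework, it achieves state-of-the-art FID scores on ImageNet and addresses the Teacher-Data Mismatch problem.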

Relevance: (8/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Shangyuan Tong, Nanye Ma, Saining Xie, Tommi Jaakkola

In-Video Instructions: Visual Signals as Generative Control

Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.

TLDR: The paper introduces "In-Video Instruction," a method for controllable image-to-video generation where visual signals within frames (e.g., arrows, text) serve as instructions, offering more spatial control compared to text prompts, and demonstrates its effectiveness on state-of-the-art video generators.

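TLDR: The paper introduces "In-Video Instruction," a controllable image-to-video generation method in which visual signals within the frames (such as arrows and text) act as instructions. It offers stronger spatial control than text prompts and is shown to be effective on state-of-the-art video generators.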

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Gongfan Fang, Xinyin Ma, Xinchao Wang

DesignPref: Capturing Personal Preferences in Visual Design Generation

Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preferences vary widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of UI design generation annotated by 20 professional designers with multi-level preference ratings. We found that among trained designers, substantial levels of disagreement exist (Krippendorff's alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of the importance of various design aspects and from individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning or incorporating designer-specific annotations into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers' preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and support future research into modeling individual design taste.
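For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of Krippendorff's alpha for nominal (for example, binary) labels, computed from the standard coincidence-matrix definition. The function name and the toy ratings are illustrative assumptions; they are not taken from DesignPref, and the abstract does not say how the authors computed their value.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: each inner list holds the labels that one
    item received from its raters. Items with fewer than two labels cannot
    be paired and are skipped.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # Every ordered pair of labels from different raters counts 1/(m-1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected_sum = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    )
    if n <= 1 or expected_sum == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    # alpha = 1 - D_o / D_e, with the common factor 1/n cancelled.
    return 1.0 - observed * (n - 1) / expected_sum


# Four items, two raters each; the raters disagree on one item.
toy_ratings = [[0, 0], [0, 1], [1, 1], [1, 1]]
alpha = krippendorff_alpha_nominal(toy_ratings)
```

An alpha of 1 means perfect agreement and 0 means agreement no better than chance, so the 0.25 reported in the abstract sits much closer to chance than to consensus.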

TLDR: The paper introduces DesignPref, a dataset of UI design preferences annotated by professional designers, and explores personalization strategies to address disagreement among designers, demonstrating the benefits of personalized models over aggregated ones.

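TLDR: The paper introduces DesignPref, a dataset of UI design preferences annotated by professional designers, and explores personalization strategies for handling disagreement among designers, showing that personalized models outperform aggregated ones.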

Relevance: (6/10)

Novelty: (7/10)

Clarity: (8/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Yi-Hao Peng, Jeffrey P. Bigham, Jason Wu

Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs

Timestep distillation is an effective approach for improving the generation efficiency of diffusion models. The Consistency Model (CM), as a trajectory-based framework, demonstrates significant potential due to its strong theoretical foundation and high-quality few-step generation. Nevertheless, current continuous-time consistency distillation methods still rely heavily on training data and computational resources, hindering their deployment in resource-constrained scenarios and limiting their scalability to diverse domains. To address this issue, we propose Trajectory-Backward Consistency Model (TBCM), which eliminates the dependence on external training data by extracting latent representations directly from the teacher model's generation trajectory. Unlike conventional methods that require VAE encoding and large-scale datasets, our self-contained distillation paradigm significantly improves both efficiency and simplicity. Moreover, the trajectory-extracted samples naturally bridge the distribution gap between training and inference, thereby enabling more effective knowledge transfer. Empirically, TBCM achieves 6.52 FID and 28.08 CLIP scores on MJHQ-30k under one-step generation, while reducing training time by approximately 40% compared to Sana-Sprint and saving a substantial amount of GPU memory, demonstrating superior efficiency without sacrificing quality. We further reveal the diffusion-generation space discrepancy in continuous-time consistency distillation and analyze how sampling strategies affect distillation performance, offering insights for future distillation research. GitHub Link: https://github.com/hustvl/TBCM.

TLDR: The paper proposes a new consistency model distillation method (TBCM) that eliminates the need for external training data by extracting latent representations directly from the teacher model's generation trajectory, achieving improved efficiency and comparable quality in image generation.

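TLDR: The paper proposes a new consistency model distillation method (TBCM) that removes the dependence on external training data by extracting latent representations directly from the teacher model's generation trajectory, achieving higher efficiency and comparable quality in image generation.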

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Bao Tang, Shuai Zhang, Yueting Zhu, Jijun Xiang, Xin Yang, Li Yu, Wenyu Liu, Xinggang Wang

CREward: A Type-Specific Creativity Reward Model

Creativity is a complex phenomenon. When it comes to representing and assessing creativity, treating it as a single undifferentiated quantity would appear naive and underwhelming. In this work, we learn the first type-specific creativity reward model, coined CREward, which spans three creativity "axes," geometry, material, and texture, to allow us to view creativity through the lens of the image formation pipeline. To build our reward model, we first conduct a human benchmark evaluation to capture human perception of creativity for each type across various creative images. We then analyze the correlation between human judgments and predictions by large vision-language models (LVLMs), confirming that LVLMs exhibit strong alignment with human perception. Building on this observation, we collect LVLM-generated labels to train our CREward model that is applicable to both evaluation and generation of creative images. We explore three applications of CREward: creativity assessment, explainable creativity, and creative sample acquisition for both human design inspiration and guiding creative generation through low-rank adaptation.

TLDR: The paper introduces CREward, a type-specific creativity reward model for images, focusing on geometry, material, and texture, trained using LVLM-generated labels and validated against human perception, for creativity assessment, explanation, and sample acquisition.

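TLDR: The paper introduces CREward, a type-specific creativity reward model for images that focuses on geometry, material, and texture. It is trained on LVLM-generated labels, validated against human perception, and used for creativity assessment, explanation, and sample acquisition.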

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Jiyeon Han, Ali Mahdavi-Amiri, Hao Zhang, Haedong Jeong

Multiscale Vector-Quantized Variational Autoencoder for Endoscopic Image Synthesis

Gastrointestinal (GI) imaging via Wireless Capsule Endoscopy (WCE) generates a large number of images requiring manual screening. Deep learning-based Clinical Decision Support (CDS) systems can assist screening, yet their performance relies on the existence of large, diverse medical training datasets. However, the scarcity of such data, due to privacy constraints and annotation costs, hinders CDS development. Generative machine learning offers a viable solution to combat this limitation. While current Synthetic Data Generation (SDG) methods, such as Generative Adversarial Networks and Variational Autoencoders, have been explored, they often face challenges with training stability and capturing sufficient visual diversity, especially when synthesizing abnormal findings. This work introduces a novel VAE-based methodology for medical image synthesis and presents its application for the generation of WCE images. The novel contributions of this work are: a) a multiscale extension of the Vector Quantized VAE model, named Multiscale Vector Quantized Variational Autoencoder (MSVQ-VAE); b) unlike other VAE-based SDG models for WCE image generation, MSVQ-VAE is used to seamlessly introduce abnormalities into normal WCE images; c) it supports conditional generation of synthetic images, enabling the introduction of different types of abnormalities into the normal WCE images; d) experiments are performed with a variety of abnormality types, including polyps, vascular and inflammatory conditions. The utility of the generated images for CDS is assessed via image classification. Comparative experiments demonstrate that training a CDS classifier using the abnormal images generated by the proposed methodology yields results comparable to those of a classifier trained with only real data. The generality of the proposed methodology promises its applicability to various domains related to medical multimedia.
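As background for the "Vector Quantized" part of MSVQ-VAE, the core lookup in a standard VQ-VAE replaces each encoder output vector with its nearest codebook entry. The sketch below shows only that generic step in plain Python; the function name, the toy codebook, and the toy vectors are illustrative assumptions, and the paper's multiscale and conditional components are not represented.

```python
def quantize(vectors, codebook):
    """Map each vector to its nearest codebook entry (squared Euclidean).

    Returns the chosen codebook indices and the quantized vectors.
    """
    indices = []
    quantized = []
    for v in vectors:
        best = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(v, codebook[k])),
        )
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Toy 2D codebook with three entries and three encoder outputs.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
toy_vectors = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4]]
indices, quantized = quantize(toy_vectors, toy_codebook)
```

In a trained model the indices form the discrete latent representation, and the decoder reconstructs the image from the quantized vectors.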

TLDR: This paper introduces a Multiscale Vector Quantized VAE (MSVQ-VAE) for generating synthetic endoscopic images with abnormalities, demonstrating its utility in training classifiers for clinical decision support systems.

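TLDR: This paper introduces a Multiscale Vector Quantized VAE (MSVQ-VAE) for generating synthetic endoscopic images containing abnormalities and demonstrates its utility for training classifiers in clinical decision support systems.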

Relevance: (8/10)

Novelty: (7/10)

Clarity: (8/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Dimitrios E. Diamantis, Dimitris K. Iakovidis

AIGC Daily Papers

A Reason-then-Describe Instruction Interpreter for Controllable Video Generation

PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding

STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow

GigaWorld-0: World Models as Data Engine to Empower Embodied AI

Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation

Terminal Velocity Matching

One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer

DINO-Tok: Adapting DINO for Visual Tokenizers

Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning

Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models

HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation

Learning to Generate Human-Human-Object Interactions from Textual Descriptions

Block Cascading: Training Free Acceleration of Block-Causal Video Models

FREE: Uncertainty-Aware Autoregression for Parallel Diffusion Transformers

TReFT: Taming Rectified Flow Models For One-Step Image Translation

Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement

The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling

Text-guided Controllable Diffusion for Realistic Camouflage Images Generation

Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis

OmniRefiner: Reinforcement-Guided Local Diffusion Refinement

HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning

Low-Resolution Editing is All You Need for High-Resolution Editing

Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks

4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models

ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding

Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization

Efficient Transferable Optimal Transport via Min-Sliced Transport Plans

Are Image-to-Video Models Good Zero-Shot Image Editors?

Breaking the Likelihood-Quality Trade-off in Diffusion Models by Merging Pretrained Experts

Flow Map Distillation Without Data

In-Video Instructions: Visual Signals as Generative Control

DesignPref: Capturing Personal Preferences in Visual Design Generation

Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs

CREward: A Type-Specific Creativity Reward Model

Multiscale Vector-Quantized Variational Autoencoder for Endoscopic Image Synthesis