ArXiv CS.CV Papers (Image/Video Generation)

Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing

Recent advances in diffusion-based video generation have substantially improved visual fidelity and temporal coherence. However, most existing approaches remain task-specific and rely primarily on textual instructions, limiting their ability to handle multimodal inputs, contextual references, and diverse video generation and editing scenarios within a unified framework. Moreover, many video editing methods depend on carefully engineered pipelines tailored to individual operations, which hinders scalability and composability. In this paper, we propose Tele-Omni, a unified multimodal framework for video generation and editing that follows multimodal instructions, including text, images, and reference videos, within a single model. Tele-Omni leverages pretrained multimodal large language models to parse heterogeneous instructions and infer structured generation or editing intents, while diffusion-based generators perform high-quality video synthesis conditioned on these structured signals. To enable joint training across heterogeneous video tasks, we introduce a task-aware data processing pipeline that unifies multimodal inputs into a structured instruction format while preserving task-specific constraints. Tele-Omni supports a wide range of video-centric tasks, including text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing. By decoupling instruction parsing from video synthesis and combining it with task-aware data design, Tele-Omni achieves flexible multimodal control while maintaining strong temporal coherence and visual consistency. Experimental results demonstrate that Tele-Omni achieves competitive performance across multiple tasks.

TLDR: The paper introduces Tele-Omni, a unified multimodal framework for video generation and editing that uses multimodal instructions (text, images, video) with a single diffusion model. It shows how to handle heterogeneous video tasks by instruction parsing and task-aware data processing.

TLDR: 该论文介绍了一种统一的多模态视频生成和编辑框架 Tele-Omni，它使用多模态指令（文本、图像、视频）并利用单个扩散模型。它展示了如何通过指令解析和任务感知的数据处理来处理异构视频任务。

Relevance: (10/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (9/10)

Overall: (9/10)

Read Paper (PDF)

Authors: Jialun Liu, Yukuo Ma, Xiao Cao, Tian Li, Gonghu Shang, Haibin Huang, Chi Zhang, Xuelong Li, Cong Liu, Junqi Liu, Jiakui Hu, Robby T. Tan, Shiwen Zhang, Liying Yang, Xiaoyan Yang, Qizhen Weng, Xiangzhen Chang, Yuanzhi Liang, Yifan Xu, Zhiyong Huang, Zuoxin Li, Xuelong Li

Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning

High-quality and open datasets remain a major bottleneck for text-to-image (T2I) fine-tuning. Despite rapid progress in model architectures and training pipelines, most publicly available fine-tuning datasets suffer from low resolution, poor text-image alignment, or limited diversity, resulting in a clear performance gap between open research models and enterprise-grade models. In this work, we present Fine-T2I, a large-scale, high-quality, and fully open dataset for T2I fine-tuning. Fine-T2I spans 10 task combinations, 32 prompt categories, 11 visual styles, and 5 prompt templates, and combines synthetic images generated by strong modern models with carefully curated real images from professional photographers. All samples are rigorously filtered for text-image alignment, visual fidelity, and prompt quality, with over 95% of initial candidates removed. The final dataset contains over 6 million text-image pairs, around 2 TB on disk, approaching the scale of pretraining datasets while maintaining fine-tuning-level quality. Across a diverse set of pretrained diffusion and autoregressive models, fine-tuning on Fine-T2I consistently improves both generation quality and instruction adherence, as validated by human evaluation, visual comparison, and automatic metrics. We release Fine-T2I under an open license to help close the data gap in T2I fine-tuning in the open community.

TLDR: The paper introduces Fine-T2I, a large-scale, high-quality, and open dataset of 6 million text-image pairs for fine-tuning text-to-image models, addressing limitations in existing fine-tuning datasets.

TLDR: 该论文介绍了一个大规模、高质量且开放的文本到图像微调数据集Fine-T2I，包含600万个文本图像对，旨在解决现有微调数据集的局限性。

Relevance: (10/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (9/10)

Overall: (9/10)

Read Paper (PDF)

Authors: Xu Ma, Yitian Zhang, Qihua Dong, Yun Fu

Autoregressive Image Generation with Masked Bit Modeling

This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that discrete tokenizers are intrinsically inferior, we demonstrate that the disparity arises primarily from the total number of bits allocated in the latent space (i.e., the compression ratio). We show that scaling up the codebook size effectively bridges this gap, allowing discrete tokenizers to match or surpass their continuous counterparts. However, existing discrete generation methods struggle to capitalize on this insight, suffering from performance degradation or prohibitive training costs with scaled codebook. To address this, we propose masked Bit AutoRegressive modeling (BAR), a scalable framework that supports arbitrary codebook sizes. By equipping an autoregressive transformer with a masked bit modeling head, BAR predicts discrete tokens through progressively generating their constituent bits. BAR achieves a new state-of-the-art gFID of 0.99 on ImageNet-256, outperforming leading methods across both continuous and discrete paradigms, while significantly reducing sampling costs and converging faster than prior continuous approaches. Project page is available at https://bar-gen.github.io/

TLDR: The paper introduces Masked Bit AutoRegressive modeling (BAR), a novel and scalable autoregressive framework for image generation that achieves state-of-the-art results on ImageNet-256 using discrete tokenizers and reduces sampling costs.

TLDR: 本文介绍了一种新的可扩展自回归框架——掩码比特自回归建模 (BAR)，用于图像生成，该框架使用离散分词器在 ImageNet-256 上实现了最先进的结果，并降低了采样成本。

Relevance: (9/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (9/10)

Overall: (9/10)

Read Paper (PDF)

Authors: Qihang Yu, Qihao Liu, Ju He, Xinyang Zhang, Yang Liu, Liang-Chieh Chen, Xi Chen

ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation

Diffusion models have achieved remarkable generation quality, but they suffer from significant inference cost due to their reliance on multiple sequential denoising steps, motivating recent efforts to distill this inference process into a few-step regime. However, existing distillation methods typically approximate the teacher trajectory by using linear shortcuts, which makes it difficult to match its constantly changing tangent directions as velocities evolve across timesteps, thereby leading to quality degradation. To address this limitation, we propose ArcFlow, a few-step distillation framework that explicitly employs non-linear flow trajectories to approximate pre-trained teacher trajectories. Concretely, ArcFlow parameterizes the velocity field underlying the inference trajectory as a mixture of continuous momentum processes. This enables ArcFlow to capture velocity evolution and extrapolate coherent velocities to form a continuous non-linear trajectory within each denoising step. Importantly, this parameterization admits an analytical integration of this non-linear trajectory, which circumvents numerical discretization errors and results in high-precision approximation of the teacher trajectory. To train this parameterization into a few-step generator, we implement ArcFlow via trajectory distillation on pre-trained teacher models using lightweight adapters. This strategy ensures fast, stable convergence while preserving generative diversity and quality. Built on large-scale models (Qwen-Image-20B and FLUX.1-dev), ArcFlow only fine-tunes on less than 5% of original parameters and achieves a 40x speedup with 2 NFEs over the original multi-step teachers without significant quality degradation. Experiments on benchmarks show the effectiveness of ArcFlow both qualitatively and quantitatively.

TLDR: ArcFlow introduces a non-linear flow distillation method for few-step text-to-image generation, achieving a 40x speedup over diffusion models with minimal quality loss by using trajectory distillation and lightweight adapters.

TLDR: ArcFlow 提出了一种用于少步文本到图像生成的非线性流蒸馏方法，通过使用轨迹蒸馏和轻量级适配器，实现了比扩散模型快 40 倍的速度，且质量损失极小。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (9/10)

Read Paper (PDF)

Authors: Zihan Yang, Shuyuan Tu, Licheng Zhang, Qi Dai, Yu-Gang Jiang, Zuxuan Wu

AdaTSQ: Pushing the Pareto Frontier of Diffusion Transformers via Temporal-Sensitivity Quantization

Diffusion Transformers (DiTs) have emerged as the state-of-the-art backbone for high-fidelity image and video generation. However, their massive computational cost and memory footprint hinder deployment on edge devices. While post-training quantization (PTQ) has proven effective for large language models (LLMs), directly applying existing methods to DiTs yields suboptimal results due to the neglect of the unique temporal dynamics inherent in diffusion processes. In this paper, we propose AdaTSQ, a novel PTQ framework that pushes the Pareto frontier of efficiency and quality by exploiting the temporal sensitivity of DiTs. First, we propose a Pareto-aware timestep-dynamic bit-width allocation strategy. We model the quantization policy search as a constrained pathfinding problem. We utilize a beam search algorithm guided by end-to-end reconstruction error to dynamically assign layer-wise bit-widths across different timesteps. Second, we propose a Fisher-guided temporal calibration mechanism. It leverages temporal Fisher information to prioritize calibration data from highly sensitive timesteps, seamlessly integrating with Hessian-based weight optimization. Extensive experiments on four advanced DiTs (e.g., Flux-Dev, Flux-Schnell, Z-Image, and Wan2.1) demonstrate that AdaTSQ significantly outperforms state-of-the-art methods like SVDQuant and ViDiT-Q. Our code will be released at https://github.com/Qiushao-E/AdaTSQ.

TLDR: AdaTSQ is a novel post-training quantization framework for Diffusion Transformers that achieves better efficiency and quality by exploiting the temporal sensitivity of diffusion processes through timestep-dynamic bit-width allocation and Fisher-guided temporal calibration.

TLDR: AdaTSQ 是一种新的扩散Transformer后训练量化框架，通过利用扩散过程的时间敏感性，实现更好的效率和质量，具体方法包括 timestep-dynamic bit-width 分配和 Fisher 引导的时间校准。

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Shaoqiu Zhang, Zizhong Ding, Kaicheng Yang, Junyi Wu, Xianglong Yan, Xi Li, Bingnan Duan, Jianping Fang, Yulun Zhang

Kelix Technique Report

Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.

TLDR: The paper introduces Kelix, a fully discrete autoregressive model that aims to bridge the performance gap between discrete and continuous visual representations in vision-language models, potentially enabling better multimodal understanding and generation.

TLDR: 该论文介绍了Kelix，一种全离散自回归模型，旨在弥合视觉语言模型中离散和连续视觉表示之间的性能差距，从而潜在地实现更好的多模态理解和生成。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Boyang Ding, Chenglong Chu, Dunju Zang, Han Li, Jiangxia Cao, Kun Gai, Muhao Wei, Ruiming Tang, Shiyao Wang, Siyang Mao, Xinchen Luo, Yahui Liu, Zhixin Ling, Zhuoran Yang, Ziming Li, Chengru Song, Guorui Zhou, Guowang Zhang, Hao Peng, Hao Wang, Jiaxin Deng, Jin Ouyang, Jinghao Zhang, Lejian Ren, Qianqian Wang, Qigen Hu, Tao Wang, Xingmei Wang, Yiping Yang, Zixing Zhang, Ziqi Wang

Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures

Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation. We present Hand2World, a unified autoregressive framework that addresses these challenges through occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing visibility and occlusion to be inferred from scene context rather than encoded in the control signal. To stabilize egocentric viewpoint changes, we inject explicit camera geometry via per-pixel Plücker-ray embeddings, disentangling camera motion from hand motion and preventing background drift. We further develop a fully automated monocular annotation pipeline and distill a bidirectional diffusion model into a causal generator, enabling arbitrary-length synthesis. Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation.

TLDR: The paper introduces Hand2World, an autoregressive framework for generating egocentric videos of hand-object interactions, addressing challenges like distribution shift and camera motion ambiguity through occlusion-invariant hand conditioning and Plücker-ray embeddings.

TLDR: 本文介绍了一个名为Hand2World的自回归框架，用于生成手部与物体交互的以自我为中心的视频。它通过不依赖于遮挡的手部调节和Plücker-ray嵌入，解决了诸如分布偏移和相机运动模糊等挑战。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Yuxi Wang, Wenqi Ouyang, Tianyi Wei, Yi Dong, Zhiqi Shen, Xingang Pan

MieDB-100k: A Comprehensive Dataset for Medical Image Editing

The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding and inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality and diverse dataset for text-guided medical image editing. It categorizes editing tasks into perspectives of Perception, Modification and Transformation, considering both understanding and generation abilities. We construct MieDB-100k via a data curation pipeline leveraging both modality-specific expert models and rule-based data synthetic methods, followed by rigorous manual inspection to ensure clinical fidelity. Extensive experiments demonstrate that model trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability. We anticipate that this dataset will serve as a cornerstone for future advancements in specialized medical image editing.

TLDR: MieDB-100k is a new large-scale, high-quality dataset for text-guided medical image editing, addressing the scarcity of suitable data and improving the performance and generalization of trained models.

TLDR: MieDB-100k是一个新的大规模、高质量的医疗图像编辑数据集，通过文本指导图像编辑，解决了相关数据稀缺的问题，并提升了训练模型的性能和泛化能力。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Yongfan Lai, Wen Qian, Bo Liu, Hongyan Li, Hao Luo, Fan Wang, Bohan Zhuang, Shenda Hong

AUHead: Realistic Emotional Talking Head Generation via Action Units Control

Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) to disentangle fine-grained emotion control, i.e. , Action Units (AUs), from audio and achieve controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs), by spatial-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism. It aims to disentangle AUs from raw speech, effectively capturing subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into the structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To achieve flexible AU-quality trade-off control, we introduce an AU disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves competitive performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques. Our implementation is available at https://github.com/laura990501/AUHead_ICLR

TLDR: The paper introduces AUHead, a two-stage method for realistic emotional talking head video generation with fine-grained emotion control using Action Units (AUs). It employs an audio-language model for AU generation and an AU-driven diffusion model for video synthesis.

TLDR: 该论文介绍了一种名为AUHead的两阶段方法，用于生成逼真的情感说话头像视频，该方法利用动作单元（AU）进行细粒度的情感控制。它使用音频语言模型生成AU，并使用AU驱动的扩散模型合成视频。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Jiayi Lyu, Leigang Qu, Wenjing Zhang, Hanyu Jiang, Kai Liu, Zhenglin Zhou, Xiaobo Xia, Jian Xue, Tat-Seng Chua

Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions

Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment, neglecting the dynamic token interactions, which embed essential capabilities for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components: IVA enables the student model to imitate the teacher's instruction-relevant visual information extract capability by aligning on salient visual regions. TPA captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI's superiority. Notably, our approach achieves $2.6\%$ relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by $7.0\%$, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs. Code is available at https://github.com/lchen1019/Align-TI.

TLDR: The paper introduces Align-TI, a knowledge distillation framework for multimodal large language models (MLLMs) that considers token interactions, achieving state-of-the-art performance in distilling smaller MLLMs.

TLDR: 该论文介绍了一个名为Align-TI的多模态大语言模型（MLLM）知识蒸馏框架，该框架考虑了token之间的交互，并在蒸馏更小的MLLM方面实现了最先进的性能。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Lin Chen, Xiaoke Zhao, Kun Ding, Weiwei Feng, Changtao Miao, Zili Wang, Wenxuan Guo, Ying Wang, Kaiyuan Zheng, Bo Zhang, Zhe Li, Shiming Xiang

K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge

The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. While the crowdsourced Arena platforms offer human preference assessments by collecting human votes, they are costly and time-consuming, inherently limiting their scalability. Leveraging vision-language model (VLMs) as substitutes for manual judgments presents a promising solution. However, the inherent hallucinations and biases of VLMs hinder alignment with human preferences, thus compromising evaluation reliability. Additionally, the static evaluation approach lead to low efficiency. In this paper, we propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Specifically, we curate a high-quality dataset from thousands of human votes in K-Sort Arena, with each instance containing the outputs and rankings of K models. When evaluating a new model, it undergoes (K+1)-wise free-for-all comparisons with existing models, and the VLM provide the rankings. To enhance alignment and reliability, we propose a posterior correction method, which adaptively corrects the posterior probability in Bayesian updating based on the consistency between the VLM prediction and human supervision. Moreover, we propose a dynamic matching strategy, which balances uncertainty and diversity to maximize the expected benefit of each comparison, thus ensuring more efficient evaluation. Extensive experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs, demonstrating both its efficiency and reliability.

TLDR: The paper introduces K-Sort Eval, a VLM-based evaluation framework for visual generative models that uses posterior correction and dynamic matching for efficient and human-aligned preference evaluation, achieving comparable results to human evaluation but with significantly fewer model runs.

TLDR: 该论文介绍了K-Sort Eval，一种基于VLM的视觉生成模型评估框架，它使用后验校正和动态匹配来实现高效且与人类偏好对齐的评估，与人类评估相比，能以更少的模型运行次数获得可比的结果。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Zhikai Li, Jiatong Li, Xuewen Liu, Wangbo Zhao, Pan Du, Kaicheng Zhou, Qingyi Gu, Yang You, Zhen Dong, Kurt Keutzer

Rethinking Global Text Conditioning in Diffusion Transformers

Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective-serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead, and can be applied to various diffusion models, bringing improvements across diverse tasks, including text-to-image/video generation and image editing.

TLDR: This paper investigates the necessity of modulation-based text conditioning in diffusion transformers, finding it largely redundant in conventional usage but beneficial when used as a guidance mechanism for controllable generation.

TLDR: 本文研究了扩散Transformer中基于调制的文本条件作用的必要性，发现其在传统用法中基本是多余的，但当用作可控生成的指导机制时是有益的。

Relevance: (9/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Nikita Starodubcev, Daniil Pakhomov, Zongze Wu, Ilya Drobyshevskiy, Yuchen Liu, Zhonghao Wang, Yuqian Zhou, Zhe Lin, Dmitry Baranchuk

All-in-One Conditioning for Text-to-Image Synthesis

Accurate interpretation and visual representation of complex prompts involving multiple objects, attributes, and spatial relationships is a critical challenge in text-to-image synthesis. Despite recent advancements in generating photorealistic outputs, current models often struggle with maintaining semantic fidelity and structural coherence when processing intricate textual inputs. We propose a novel approach that grounds text-to-image synthesis within the framework of scene graph structures, aiming to enhance the compositional abilities of existing models. Eventhough, prior approaches have attempted to address this by using pre-defined layout maps derived from prompts, such rigid constraints often limit compositional flexibility and diversity. In contrast, we introduce a zero-shot, scene graph-based conditioning mechanism that generates soft visual guidance during inference. At the core of our method is the Attribute-Size-Quantity-Location (ASQL) Conditioner, which produces visual conditions via a lightweight language model and guides diffusion-based generation through inference-time optimization. This enables the model to maintain text-image alignment while supporting lightweight, coherent, and diverse image synthesis.

TLDR: This paper introduces a novel, scene graph-based conditioning mechanism called ASQL Conditioner for text-to-image synthesis, aiming to improve compositional abilities and semantic fidelity in generated images by generating soft visual guidance during inference.

TLDR: 本文介绍了一种新颖的基于场景图的调节机制，称为ASQL Conditioner，用于文本到图像的合成。该方法旨在通过在推理过程中生成软视觉指导，来提高生成图像的组合能力和语义保真度。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Hirunima Jayasekara, Chuong Huynh, Yixuan Ren, Christabel Acquaye, Abhinav Shrivastava

Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling

We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user's intent; (ii) existing models are largely single-turn, while multi-turn edits can alter object faithfulness; and (iii) evaluation at around 1K resolution is misaligned with real workflows that often operate on ultra high-definition images (e.g., 4K). We propose Agent Banana, a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing. Agent Banana introduces two key mechanisms: (1) Context Folding, which compresses long interaction histories into structured memory for stable long-horizon control; and (2) Image Layer Decomposition, which performs localized layer-based edits to preserve non-target regions while enabling native-resolution outputs. To support rigorous evaluation, we build HDD-Bench, a high-definition, dialogue-based benchmark featuring verifiable stepwise targets and native 4K images (11.8M pixels) for diagnosing long-horizon failures. On HDD-Bench, Agent Banana achieves the best multi-turn consistency and background fidelity (e.g., IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12) while remaining competitive on instruction following, and also attains strong performance on standard single-turn editing benchmarks. We hope this work advances reliable, professional-grade agentic image editing and its integration into real workflows.

TLDR: The paper introduces Agent Banana, a hierarchical agentic framework for high-fidelity image editing, addressing challenges in multi-turn editing and high-resolution processing, along with a new benchmark, HDD-Bench, for evaluation.

TLDR: 该论文介绍了 Agent Banana，一个用于高保真图像编辑的分层代理框架，解决了多轮编辑和高分辨率处理中的挑战，并提出了一个新的基准 HDD-Bench 用于评估。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Ruijie Ye, Jiayi Zhang, Zhuoxin Liu, Zihao Zhu, Siyuan Yang, Li Li, Tianfu Fu, Franck Dernoncourt, Yue Zhao, Jiacheng Zhu, Ryan Rossi, Wenhao Chai, Zhengzhong Tu

SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing

Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.

TLDR: The paper introduces SciFlow-Bench, a benchmark for evaluating scientific diagram generation by inverse-parsing generated images back into structured graphs, highlighting the challenge of preserving structural correctness in generated diagrams.

TLDR: 该论文介绍了SciFlow-Bench，这是一个通过将生成的图像反向解析为结构化图来评估科学图表生成的基准，强调了在生成的图表中保持结构正确性的挑战。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Tong Zhang, Honglin Lin, Zhou Liu, Chong Chen, Wentao Zhang

Look-Ahead and Look-Back Flows: Training-Free Image Generation with Trajectory Smoothing

Recent advances have reformulated diffusion models as deterministic ordinary differential equations (ODEs) through the framework of flow matching, providing a unified formulation for the noise-to-data generative process. Various training-free flow matching approaches have been developed to improve image generation through flow velocity field adjustment, eliminating the need for costly retraining. However, Modifying the velocity field $v$ introduces errors that propagate through the full generation path, whereas adjustments to the latent trajectory $z$ are naturally corrected by the pretrained velocity network, reducing error accumulation. In this paper, we propose two complementary training-free latent-trajectory adjustment approaches based on future and past velocity $v$ and latent trajectory $z$ information that refine the generative path directly in latent space. We propose two training-free trajectory smoothing schemes: \emph{Look-Ahead}, which averages the current and next-step latents using a curvature-gated weight, and \emph{Look-Back}, which smoothes latents using an exponential moving average with decay. We demonstrate through extensive experiments and comprehensive evaluation metrics that the proposed training-free trajectory smoothing models substantially outperform various state-of-the-art models across multiple datasets including COCO17, CUB-200, and Flickr30K.

TLDR: The paper introduces two training-free methods, Look-Ahead and Look-Back, to smooth the latent trajectory in flow matching models for improved image generation, demonstrating state-of-the-art performance on multiple datasets.

TLDR: 本文提出了两种免训练方法，前瞻（Look-Ahead）和回顾（Look-Back），用于平滑flow matching模型中的潜在轨迹，从而改进图像生成，并在多个数据集上展示了最佳性能。

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Yan Luo, Henry Huang, Todd Y. Zhou, Mengyu Wang

SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL

Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative ``diagnose-and-act'' loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.

TLDR: The paper introduces SceneReVis, a vision-grounded self-reflection framework for 3D indoor scene synthesis using multi-turn reinforcement learning to address spatial hallucinations, along with a new dataset SceneChain-12k and a two-stage training process.

TLDR: 该论文介绍了 SceneReVis，一个基于视觉的自反思框架，通过多轮强化学习进行 3D 室内场景合成，以解决空间幻觉问题。此外，还提出了一个新的数据集 SceneChain-12k 和一个两阶段训练过程。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Yang Zhao, Shizhao Sun, Meisheng Zhang, Yingdong Shi, Xubo Yang, Jiang Bian

AIGC Daily Papers

Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing

Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning

Autoregressive Image Generation with Masked Bit Modeling

ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation

AdaTSQ: Pushing the Pareto Frontier of Diffusion Transformers via Temporal-Sensitivity Quantization

Kelix Technique Report

Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures

MieDB-100k: A Comprehensive Dataset for Medical Image Editing

AUHead: Realistic Emotional Talking Head Generation via Action Units Control

Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions

K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge

Rethinking Global Text Conditioning in Diffusion Transformers

All-in-One Conditioning for Text-to-Image Synthesis

Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling

SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing

Look-Ahead and Look-Back Flows: Training-Free Image Generation with Trajectory Smoothing

SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL