AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

April 14, 2026

LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct LottieAnimation-660K, the largest and most diverse vector animation dataset to date, consisting of 660k real-world Lottie animation and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we finetune Qwen-VL to create LottieGPT, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT exhibits strong generalization across diverse animation styles and outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).

TLDR: The paper introduces LottieGPT, a novel framework for tokenizing and autoregressively generating Lottie vector animations from natural language or visual prompts, supported by a large-scale Lottie animation dataset.

TLDR: 本文介绍了 LottieGPT,一个用于从自然语言或视觉提示中标记和自回归生成 Lottie 矢量动画的新框架,并由一个大规模 Lottie 动画数据集支持。

Relevance: (9/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Junhao Chen, Kejun Gao, Yuehan Cui, Mingze Sun, Mingjin Chen, Shaohui Wang, Xiaoxiao Long, Fei Ma, Qi Tian, Ruqi Huang, Hao Zhao

LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision--language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into four major themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. We further summarize the key modeling paradigms, learning strategies, and evaluation protocols that support these capabilities. Finally, we discuss open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift. We hope this paper provides a structured perspective on the development of scalable, precise, and trustworthy object-centric multimodal systems.

TLDR: This paper reviews the intersection of Large Multimodal Models (LMMs) and object-centric vision, covering understanding, segmentation, editing, and generation, highlighting challenges and future directions for precise object-level multimodal systems.

TLDR: 本文综述了大型多模态模型(LMMs)与以对象为中心的视觉的交叉领域,涵盖了理解、分割、编辑和生成,并强调了精确对象级多模态系统所面临的挑战和未来方向。

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yuqian Yuan, Wenqiao Zhang, Juekai Lin, Yu Zhong, Mingjian Gao, Binhe Yu, Yunqi Cao, Wentong Li, Yueting Zhuang, Beng Chin Ooi

HDR Video Generation via Latent Alignment with Logarithmic Encoding

High dynamic range (HDR) imagery offers a rich and faithful representation of scene radiance, but remains challenging for generative models due to its mismatch with the bounded, perceptually compressed data on which these models are trained. A natural solution is to learn new representations for HDR, which introduces additional complexity and data requirements. In this work, we show that HDR generation can be achieved in a much simpler way by leveraging the strong visual priors already captured by pretrained generative models. We observe that a logarithmic encoding widely used in cinematic pipelines maps HDR imagery into a distribution that is naturally aligned with the latent space of these models, enabling direct adaptation via lightweight fine-tuning without retraining an encoder. To recover details that are not directly observable in the input, we further introduce a training strategy based on camera-mimicking degradations that encourages the model to infer missing high dynamic range content from its learned priors. Combining these insights, we demonstrate high-quality HDR video generation using a pretrained video model with minimal adaptation, achieving strong results across diverse scenes and challenging lighting conditions. Our results indicate that HDR, despite representing a fundamentally different image formation regime, can be handled effectively without redesigning generative models, provided that the representation is chosen to align with their learned priors.

TLDR: This paper presents a simple and effective approach for HDR video generation by leveraging the strong visual priors of pre-trained generative models and a logarithmic encoding to align HDR imagery with the latent space, requiring minimal fine-tuning.

TLDR: 本文提出了一种简单有效的HDR视频生成方法,通过利用预训练生成模型的强大视觉先验和对数编码将HDR图像与潜在空间对齐,只需少量微调即可。

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Naomi Ken Korem, Mohamed Oumoumad, Harel Cain, Matan Ben Yosef, Urska Jelercic, Ofir Bibi, Yaron Inger, Or Patashnik, Daniel Cohen-Or

Learning Long-term Motion Embeddings for Efficient Kinematics Generation

Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.

TLDR: This paper introduces a method for efficiently generating long, realistic motions by learning a compressed motion embedding and using a conditional flow-matching model, outperforming existing video models and task-specific approaches.

TLDR: 该论文介绍了一种高效生成长期逼真运动的方法,通过学习压缩的运动嵌入并使用条件流匹配模型,性能优于现有的视频模型和特定任务方法。

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Nick Stracke, Kolja Bauer, Stefan Andreas Baumann, Miguel Angel Bautista, Josh Susskind, Björn Ommer

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. We provide the implementation code at https://github.com/Sta8is/Re2Pix

TLDR: The paper introduces Re2Pix, a hierarchical video prediction framework that first forecasts scene semantics in a frozen vision foundation model's feature space and then renders photorealistic frames using a latent diffusion model, improving temporal consistency and perceptual quality in complex dynamic environments.

TLDR: 该论文介绍了一种分层视频预测框架Re2Pix,该框架首先在冻结的视觉基础模型的特征空间中预测场景语义,然后使用潜在扩散模型渲染逼真的帧,从而提高了复杂动态环境中的时间一致性和感知质量。

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Efstathios Karypidis, Spyros Gidaris, Nikos Komodakis

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences--especially in long-video and streaming scenarios--poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40-1/10th of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.

TLDR: POINTS-Long introduces a dual-mode MLLM with adaptive visual token scaling for efficient long-form video and streaming visual understanding, dynamically trading off accuracy and efficiency.

TLDR: POINTS-Long 提出了一个双模式多模态大语言模型,具有自适应视觉 token 缩放功能,用于高效地处理长视频和流式视觉理解,可以在精度和效率之间动态权衡。

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Haicheng Wang, Yuan Liu, Yikun Liu, Zhemeng Yu, Zhongyin Zhao, Yangxiu You, Zilin Yu, Le Tian, Xiao Zhou, Jie Zhou, Weidi Xie, Yanfeng Wang

Continuous Adversarial Flow Models

We propose continuous adversarial flow models, a type of continuous-time flow model trained with an adversarial objective. Unlike flow matching, which uses a fixed mean-squared-error criterion, our approach introduces a learned discriminator to guide training. This change in objective induces a different generalized distribution, which empirically produces samples that are better aligned with the target data distribution. Our method is primarily proposed for post-training existing flow-matching models, although it can also train models from scratch. On the ImageNet 256px generation task, our post-training substantially improves the guidance-free FID of latent-space SiT from 8.26 to 3.63 and of pixel-space JiT from 7.17 to 3.57. It also improves guided generation, reducing FID from 2.06 to 1.53 for SiT and from 1.86 to 1.80 for JiT. We further evaluate our approach on text-to-image generation, where it achieves improved results on both the GenEval and DPG benchmarks.

TLDR: This paper introduces continuous adversarial flow models, which improve the sample quality of existing flow-matching models through adversarial training, demonstrating significant gains on ImageNet and text-to-image generation tasks.

TLDR: 本文提出了连续对抗流模型,通过对抗训练来提高现有流匹配模型的样本质量,并在 ImageNet 和文本到图像生成任务上展示了显著提升。

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, Haoqi Fan

HuiYanEarth-SAR: A Foundation Model for High-Fidelity and Low-Cost Global Remote Sensing Imagery Generation

Synthetic Aperture Radar (SAR) imagery generation is essential for deepening the study of scattering mechanisms, establishing trustworthy electromagnetic scene models, and fundamentally alleviating the data scarcity bottleneck that constrains development in this field. However, existing methods find it difficult to simultaneously ensure high fidelity in both global geospatial semantics and microscopic scattering mechanisms, resulting in severe challenges for global generation. To address this, we propose \textbf{HuiYanEarth-SAR}, the first foundational SAR imagery generation model based on AlphaEarth and integrated scattering mechanisms. By injecting geospatial priors to control macroscopic structures and utilizing implicit scattering characteristic modeling to ensure the authenticity of microscopic textures, we achieve the capability of generating high-fidelity SAR images for global locations solely based on geographic coordinates. This study not only constructs an efficient SAR scene simulator but also establishes a bridge connecting geography, scatter mechanism, and artificial intelligence from a methodological standpoint. It advances SAR research by expanding the paradigm from perception and understanding to simulation and creation, providing key technical support for constructing a high-confidence digital twin of the Earth.

TLDR: The paper introduces HuiYanEarth-SAR, a foundation model for high-fidelity SAR imagery generation from geographic coordinates, aiming to address data scarcity and improve simulation capabilities in remote sensing.

TLDR: 该论文介绍了HuiYanEarth-SAR,这是一个基于地理坐标的高保真合成孔径雷达(SAR)图像生成的基础模型,旨在解决数据稀缺问题并提高遥感领域的仿真能力。

Relevance: (7/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yongxiang Liu, Jie Zhou, Yafei Song, Tianpeng Liu, Li Liu

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists in the form of multi-view images or videos, which are naturally compatible with 2D diffusion architectures. Typically, these 2D-based approaches degrade 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views leads to significant representation redundancy, and (ii) latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes. In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from arbitrary numbers of views--at any resolution and aspect ratio--with fixed complexity and rich semantics. Then we introduce 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations. Moreover, since our approach directly generates a 3D scene representation, it can be decoded to images and optional point maps along arbitrary camera trajectories without requiring per-trajectory diffusion sampling pass, which is common in 2D-based approaches.

TLDR: The paper introduces a novel approach to 3D scene generation by leveraging a 3D latent space and diffusion transformer to address limitations of existing 2D-based methods, achieving efficient and spatially consistent results.

TLDR: 该论文介绍了一种新颖的3D场景生成方法,通过利用3D潜在空间和扩散Transformer来解决现有基于2D方法的局限性,实现了高效且空间一致的结果。

Relevance: (8/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Dongxu Wei, Qi Xu, Zhiqi Li, Hangning Zhou, Cong Qiu, Hailong Qin, Mu Yang, Zhaopeng Cui, Peidong Liu

Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

Advances in Multimodal Large Language Models (MLLMs) are transforming video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the dominant paradigm still casts videos as monolithic narrative paragraphs that entangle visual, auditory, and identity information. This dense coupling not only compromises representational fidelity but also limits scalability, since even local edits can trigger global rewrites. To address this structural bottleneck, we propose Multi-Stream Scene Script (MTSS), a novel paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. MTSS is built on two core principles: Stream Factorization, which decouples a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding, which reconnects these isolated streams through explicit identity and temporal links to maintain holistic video consistency. Extensive experiments demonstrate that MTSS consistently enhances video understanding across various models, achieving an average reduction of 25% in the total error rate on Video-SALMONN-2 and an average performance gain of 67% on the Daily-Omni reasoning benchmark. It also narrows the performance gap between smaller and larger MLLMs, indicating a substantially more learnable caption interface. Finally, even without architectural adaptation, replacing monolithic prompts with MTSS in multi-shot video generation yields substantial human-rated improvements: a 45% boost in cross-shot identity consistency, a 56% boost in audio-visual alignment, and a 71% boost in temporal controllability.

TLDR: The paper introduces Multi-Stream Scene Script (MTSS), a novel video captioning paradigm that decomposes video into factorized streams with relational grounding, leading to improvements in video understanding and generation, particularly in identity consistency, audio-visual alignment, and temporal controllability.

TLDR: 本文介绍了一种新的视频字幕范式 Multi-Stream Scene Script (MTSS),它将视频分解为具有关系基础的分解流,从而改进了视频理解和生成,尤其是在身份一致性、视听对齐和时间可控性方面。

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Tencent Hunyuan Team

Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization

Image tokenizers are central to modern vision models as they often operate in latent spaces. An ideal latent space must be simultaneously compact and generation-friendly: it should capture image's essential content compactly while remaining easy to model with generative approaches. In this work, we introduce a novel regularizer to align latent spaces with these two objectives. The key idea is to guide tokenizers to mimic the hidden state dynamics of state-space models (SSMs), thereby transferring their critical property, frequency awareness, to latent features. Grounded in a theoretical analysis of SSMs, our regularizer enforces encoding of fine spatial structures and frequency-domain cues into compact latent features; leading to more effective use of representation capacity and improved generative modelability. Experiments demonstrate that our method improves generation quality in diffusion models while incurring only minimal loss in reconstruction fidelity.

TLDR: This paper introduces a novel regularization method for image tokenizers, leveraging state-space models (SSMs) to achieve compact and generation-friendly latent spaces, improving generation quality in diffusion models with minimal reconstruction loss.

TLDR: 本文提出了一种新的图像标记器正则化方法,利用状态空间模型(SSM)来实现紧凑且对生成友好的潜在空间,从而在最小重建损失的情况下提高扩散模型的生成质量。

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Jinsung Lee, Jaemin Oh, Namhun Kim, Dongwon Kim, Byung-Jun Yoon, Suha Kwak

FlowCoMotion: Text-to-Motion Generation via Token-Latent Flow Modeling

Text-to-motion generation is driven by learning motion representations for semantic alignment with language. Existing methods rely on either continuous or discrete motion representations. However, continuous representations entangle semantics with dynamics, while discrete representations lose fine-grained motion details. In this context, we propose FlowCoMotion, a novel motion generation framework that unifies both treatments from a modeling perspective. Specifically, FlowCoMotion employs token-latent coupling to capture both semantic content and high-fidelity motion details. In the latent branch, we apply multi-view distillation to regularize the continuous latent space, while in the token branch we use discrete temporal resolution quantization to extract high-level semantic cues. The motion latent is then obtained by combining the representations from the two branches through a token-latent coupling network. Subsequently, a velocity field is predicted based on the textual conditions. An ODE solver integrates this velocity field from a simple prior, thereby guiding the sample to the potential state of the target motion. Extensive experiments show that FlowCoMotion achieves competitive performance on text-to-motion benchmarks, including HumanML3D and SnapMoGen.

TLDR: The paper introduces FlowCoMotion, a text-to-motion generation framework that combines continuous and discrete motion representations using token-latent coupling and velocity field prediction for improved performance on text-to-motion benchmarks.

TLDR: 该论文提出了FlowCoMotion,一种文本到运动生成框架,它结合了连续和离散的运动表示,利用token-latent耦合和速度场预测,以提高文本到运动基准测试的性能。

Relevance: (8/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Dawei Guan, Di Yang, Chengjie Jin, Jiangtao Wang

Pair2Scene: Learning Local Object Relations for Procedural Scene Generation

Generating high-fidelity 3D indoor scenes remains a significant challenge due to data scarcity and the complexity of modeling intricate spatial relations. Current methods often struggle to scale beyond training distribution to dense scenes or rely on LLMs/VLMs that lack the ability for precise spatial reasoning. Building on top of the observation that object placement relies mainly on local dependencies instead of information-redundant global distributions, in this paper, we propose Pair2Scene, a novel procedural generation framework that integrates learned local rules with scene hierarchies and physics-based algorithms. These rules mainly capture two types of inter-object relations, namely support relations that follow physical hierarchies, and functional relations that reflect semantic links. We model these rules through a network, which estimates spatial position distributions of dependent objects conditioned on position and geometry of the anchor ones. Accordingly, we curate a dataset 3D-Pairs from existing scene data to train the model. During inference, our framework can generate scenes by recursively applying our model within a hierarchical structure, leveraging collision-aware rejection sampling to align local rules into coherent global layouts. Extensive experiments demonstrate that our framework outperforms existing methods in generating complex environments that go beyond training data while maintaining physical and semantic plausibility.

TLDR: The paper introduces Pair2Scene, a procedural scene generation framework that learns local object relationships to create realistic 3D indoor scenes, outperforming existing methods in complexity, physical plausibility, and semantic consistency.

TLDR: 该论文介绍了一种名为Pair2Scene的程序化场景生成框架,该框架学习局部对象关系以创建逼真的3D室内场景,在复杂性、物理合理性和语义一致性方面优于现有方法。

Relevance: (6/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Xingjian Ran, Shujie Zhang, Weipeng Zhong, Li Luo, Bo Dai

FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

Diffusion-based image editing models have achieved significant progress in real world applications. However, conventional models typically rely on natural language prompts, which often lack the precision required to localize target objects. Consequently, these models struggle to maintain background consistency due to their global image regeneration paradigm. Recognizing that visual cues provide an intuitive means for users to highlight specific areas of interest, we utilize bounding boxes as guidance to explicitly define the editing target. This approach ensures that the diffusion model can accurately localize the target while preserving background consistency. To achieve this, we propose FineEdit, a multi-level bounding box injection method that enables the model to utilize spatial conditions more effectively. To support this high precision guidance, we present FineEdit-1.2M, a large scale, fine-grained dataset comprising 1.2 million image editing pairs with precise bounding box annotations. Furthermore, we construct a comprehensive benchmark, termed FineEdit-Bench, which includes 1,000 images across 10 subjects to effectively evaluate region based editing capabilities. Evaluations on FineEdit-Bench demonstrate that our model significantly outperforms state-of-the-art open-source models (e.g., Qwen-Image-Edit and LongCat-Image-Edit) in instruction compliance and background preservation. Further assessments on open benchmarks (GEdit and ImgEdit Bench) confirm its superior generalization and robustness.

TLDR: The paper introduces FineEdit, a diffusion-based image editing model that uses bounding box guidance for more precise and background-consistent edits, supported by a new large-scale dataset and benchmark.

TLDR: 该论文介绍了 FineEdit,一种基于扩散的图像编辑模型,它使用边界框引导来实现更精确和背景一致的编辑,并由一个新的大规模数据集和基准支持。

Relevance: (7/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Haohang Xu, Lin Liu, Zhibo Zhang, Rong Cong, Xiaopeng Zhang, Qi Tian