ArXiv CS.CV Papers (Image/Video Generation)

OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines a hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset achieves significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.

TLDR: The paper introduces OpenGPT-4o-Image, a large-scale dataset for training multimodal models in image generation and editing, constructed using a novel hierarchical taxonomy and automated data generation pipeline, demonstrating significant performance gains when fine-tuning leading models.

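TLDR: This paper introduces OpenGPT-4o-Image, a large-scale dataset for training multimodal image generation and editing models. The dataset is built with a new hierarchical taxonomy and an automated data generation pipeline, and experiments show that fine-tuning leading models on it significantly improves performance.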

Relevance: (10/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on an RTX 5090 GPU. Two core designs enable efficient, effective, and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design a block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. At this low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
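
The constant-memory idea in point (2) can be illustrated with a minimal NumPy sketch. This is not the SANA-Video implementation: the ReLU feature map, the block-causal masking (each block attends to all earlier blocks plus itself), and the function names are assumptions made for illustration. The point it shows is that the running state `S` and normalizer `z` keep the same size no matter how many blocks have been processed.

```python
import numpy as np

def feature_map(x):
    # Non-negative feature map (ReLU); an assumed choice for this sketch.
    return np.maximum(x, 0.0)

def block_linear_attention(q_blocks, k_blocks, v_blocks, eps=1e-6):
    """Process blocks in order, carrying a fixed-size state between them."""
    d_k = q_blocks[0].shape[-1]
    d_v = v_blocks[0].shape[-1]
    S = np.zeros((d_k, d_v))  # running sum of phi(k)^T v over all blocks seen
    z = np.zeros(d_k)         # running sum of phi(k) over all blocks seen
    outputs = []
    for q, k, v in zip(q_blocks, k_blocks, v_blocks):
        phi_q, phi_k = feature_map(q), feature_map(k)
        # Fold the current block into the state before attending, so each
        # token sees every earlier block plus its own block.
        S = S + phi_k.T @ v
        z = z + phi_k.sum(axis=0)
        out = (phi_q @ S) / (phi_q @ z + eps)[:, None]
        outputs.append(out)
    return np.concatenate(outputs, axis=0), S, z
```

Because only `S` (size d_k x d_v) and `z` (size d_k) are carried forward, memory for past context does not grow with video length, unlike a softmax KV cache that stores every past key and value.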

TLDR: SANA-Video is a small diffusion model that efficiently generates high-resolution, minute-long videos with strong text-video alignment, leveraging linear attention and a constant-memory KV cache, and demonstrating significant speedups and reduced training costs compared to existing models.

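TLDR: SANA-Video is a small diffusion model that uses linear attention and a constant-memory KV cache to efficiently generate high-resolution videos up to one minute long with strong text-video alignment. Compared with existing models, it is significantly faster and cheaper to train.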

Relevance: (10/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie

Learning Object-Centric Representations Based on Slots in Real World Scenarios

A central goal in AI is to represent scenes as compositions of discrete objects, enabling fine-grained, controllable image and video generation. Yet leading diffusion models treat images holistically and rely on text conditioning, creating a mismatch for object-level editing. This thesis introduces a framework that adapts powerful pretrained diffusion models for object-centric synthesis while retaining their generative capacity. We identify a core challenge: balancing global scene coherence with disentangled object control. Our method integrates lightweight, slot-based conditioning into pretrained models, preserving their visual priors while providing object-specific manipulation. For images, SlotAdapt augments diffusion models with a register token for background/style and slot-conditioned modules for objects, reducing text-conditioning bias and achieving state-of-the-art results in object discovery, segmentation, compositional editing, and controllable image generation. We further extend the framework to video. Using Invariant Slot Attention (ISA) to separate object identity from pose and a Transformer-based temporal aggregator, our approach maintains consistent object representations and dynamics across frames. This yields new benchmarks in unsupervised video object segmentation and reconstruction, and supports advanced editing tasks such as object removal, replacement, and insertion without explicit supervision. Overall, this work establishes a general and scalable approach to object-centric generative modeling for images and videos. By bridging human object-based perception and machine learning, it expands the design space for interactive, structured, and user-driven generative tools in creative, scientific, and practical domains.

TLDR: This paper presents a framework for object-centric image and video generation by integrating slot-based conditioning into pretrained diffusion models, achieving state-of-the-art results in object discovery, segmentation, and editing tasks.

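TLDR: This work proposes an object-centric image and video generation framework built on pretrained diffusion models; by integrating slot-based conditioning into the model, it achieves state-of-the-art results on object discovery, segmentation, and editing tasks.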

Relevance: (10/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Adil Kaan Akan

CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models

Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning the long jump of the ODE solution of diffusion models, yet training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion model helps, but still requires converting infinitesimal steps into a long-jump map, leaving instability unresolved. We introduce mid-training, the first concept and practical method that inserts a lightweight intermediate stage between the (diffusion) pre-training and the final flow map training (i.e., post-training) for vision generation. Concretely, Consistency Mid-Training (CMT) is a compact and principled stage that trains a model to map points along a solver trajectory from a pre-trained model, starting from a prior sample, directly to the solver-generated clean sample. It yields a trajectory-consistent and stable initialization. This initializer outperforms random and diffusion-based baselines and enables fast, robust convergence without heuristics. Initializing post-training with CMT weights further simplifies flow map learning. Empirically, CMT achieves state-of-the-art two-step FIDs: 1.97 on CIFAR-10, 1.32 on ImageNet 64x64, and 1.84 on ImageNet 512x512, while using up to 98% less training data and GPU time compared to CMs. On ImageNet 256x256, CMT reaches 1-step FID 3.34 while cutting total training time by about 50% compared to MF from scratch (FID 3.43). This establishes CMT as a principled, efficient, and general framework for training flow map models.

TLDR: This paper introduces Consistency Mid-Training (CMT), a novel intermediate training stage for flow map models used in image generation. It significantly improves training efficiency and stability, achieving state-of-the-art FID scores with reduced training costs.

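TLDR: This paper introduces Consistency Mid-Training (CMT), a new intermediate training stage for flow map models in image generation. It significantly improves training efficiency and stability and reaches state-of-the-art FID scores at lower training cost.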

Relevance: (9/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Zheyuan Hu, Chieh-Hsin Lai, Yuki Mitsufuji, Stefano Ermon

UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark

Generative diffusion models are developing rapidly and attracting increasing attention due to their wide range of applications. Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. However, existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency, while largely overlooking the model's ability to understand the semantics of specific subjects in the input image or to ensure that the generated video aligns with physical laws and human commonsense. To address this gap, we propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning. It introduces four primary evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. To assess these dimensions, we design two evaluation methods based on Multimodal Large Language Models (MLLMs): an instance-level pipeline for fine-grained semantic understanding, and a feedback-based reasoning pipeline that enables step-by-step causal assessment for more accurate evaluation. UI2V-Bench includes approximately 500 carefully constructed text-image pairs and evaluates a range of both open-source and closed-source I2V models across all defined dimensions. We further incorporate human evaluations, which show strong alignment with the proposed MLLM-based metrics. Overall, UI2V-Bench fills a critical gap in I2V evaluation by emphasizing semantic comprehension and reasoning ability, offering a robust framework and dataset to support future research and model development in the field.

TLDR: The paper introduces UI2V-Bench, a new benchmark for Image-to-Video (I2V) generation models that focuses on evaluating semantic understanding and reasoning, addressing a gap in existing benchmarks.

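TLDR: This paper introduces UI2V-Bench, a new benchmark for image-to-video (I2V) generation models that focuses on evaluating semantic understanding and reasoning, filling a gap in existing benchmarks.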

Relevance: (10/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (9/10)

Authors: Ailing Zhang, Lina Lei, Dehong Kong, Zhixin Wang, Jiaqi Xu, Fenglong Song, Chun-Le Guo, Chang Liu, Fan Li, Jie Chen

NeRV-Diffusion: Diffuse Implicit Neural Representations for Video Synthesis

We present NeRV-Diffusion, an implicit latent video diffusion model that synthesizes videos by generating neural network weights. The generated weights can be rearranged as the parameters of a convolutional neural network, which forms an implicit neural representation (INR) and decodes into videos with frame indices as the input. Our framework consists of two stages: 1) A hypernetwork-based tokenizer that encodes raw videos from pixel space to neural parameter space, where the bottleneck latent serves as INR weights to decode. 2) An implicit diffusion transformer that denoises on the latent INR weights. In contrast to traditional video tokenizers that encode videos into frame-wise feature maps, NeRV-Diffusion compresses and generates a video holistically as a unified neural network. This enables efficient and high-quality video synthesis by obviating temporal cross-frame attentions in the denoiser and decoding the video latent with dedicated decoders. To achieve Gaussian-distributed INR weights with high expressiveness, we reuse the bottleneck latent across all NeRV layers and reform its weight assignment, upsampling connection and input coordinates. We also introduce SNR-adaptive loss weighting and scheduled sampling for effective training of the implicit diffusion model. NeRV-Diffusion reaches superior video generation quality over previous INR-based models and comparable performance to most recent state-of-the-art non-implicit models on real-world video benchmarks including UCF-101 and Kinetics-600. It also brings a smooth INR weight space that facilitates seamless interpolations between frames or videos.
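
To make the "video as network weights" idea concrete, here is a toy sketch in which a flat latent vector is rearranged into the weights of a tiny network that maps a frame index to a frame. Everything here is an assumption for illustration: the abstract describes a convolutional decoder, whereas this sketch uses a two-layer MLP, and the sinusoidal index encoding, layer shapes, and function names are hypothetical.

```python
import numpy as np

def unpack_weights(flat, shapes):
    """Split a flat latent vector into per-layer weight matrices."""
    mats, start = [], 0
    for rows, cols in shapes:
        size = rows * cols
        mats.append(flat[start:start + size].reshape(rows, cols))
        start += size
    assert start == flat.size, "latent length must match the layer shapes"
    return mats

def decode_frame(flat, shapes, t, height, width):
    """Decode frame index t (normalized to [0, 1]) into an H x W image."""
    w1, w2 = unpack_weights(flat, shapes)
    # Sinusoidal encoding of the frame index (a common INR input choice).
    freqs = 2.0 ** np.arange(shapes[0][0] // 2)
    enc = np.concatenate([np.sin(np.pi * freqs * t), np.cos(np.pi * freqs * t)])
    hidden = np.tanh(enc @ w1)
    return (hidden @ w2).reshape(height, width)
```

In this picture a generative model only has to produce `flat`; every frame is then obtained by querying the same weights with a different index, which is why no per-frame feature maps are needed.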

TLDR: NeRV-Diffusion is a video synthesis method that uses a diffusion model operating on implicit neural representation (INR) weights, achieving high-quality video generation with efficient encoding and decoding.

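TLDR: NeRV-Diffusion is a video synthesis method that runs a diffusion model on implicit neural representation (INR) weights, achieving high-quality video generation through efficient encoding and decoding.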

Relevance: (10/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (9/10)

Authors: Yixuan Ren, Hanyu Wang, Hao Chen, Bo He, Abhinav Shrivastava

Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

Autoregressive (AR) models are promising for image generation, yet continuous-token AR variants often trail latent diffusion and masked-generation models. The core issue is heterogeneous variance in VAE latents, which is amplified during AR decoding, especially under classifier-free guidance (CFG), and can cause variance collapse. We propose SphereAR to address this issue. Its core design is to constrain all AR inputs and outputs -- including after CFG -- to lie on a fixed-radius hypersphere (constant $\ell_2$ norm), leveraging hyperspherical VAEs. Our theoretical analysis shows that hyperspherical constraint removes the scale component (the primary cause of variance collapse), thereby stabilizing AR decoding. Empirically, on ImageNet generation, SphereAR-H (943M) sets a new state of the art for AR models, achieving FID 1.34. Even at smaller scales, SphereAR-L (479M) reaches FID 1.54 and SphereAR-B (208M) reaches 1.92, matching or surpassing much larger baselines such as MAR-H (943M, 1.55) and VAR-d30 (2B, 1.92). To our knowledge, this is the first time a pure next-token AR image generator with raster order surpasses diffusion and masked-generation models at comparable parameter scales.
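
The fixed-radius constraint, including after CFG, can be sketched as follows. This is a minimal illustration rather than the SphereAR code: the function names and the default radius are assumptions, and the guidance formula is the standard classifier-free guidance extrapolation.

```python
import numpy as np

def project_to_sphere(x, radius=1.0, eps=1e-8):
    """Rescale each vector to a constant L2 norm (the hypersphere radius)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return radius * x / np.maximum(norm, eps)

def guided_token(cond, uncond, scale, radius=1.0):
    """Apply classifier-free guidance, then project back onto the sphere."""
    # Standard CFG extrapolation; for scale > 1 this usually changes the norm.
    mixed = uncond + scale * (cond - uncond)
    return project_to_sphere(mixed, radius)
```

Without the final projection, a larger guidance scale changes the norm of the token that is fed back into the AR model; with it, only the direction changes, which is the scale-removal effect the abstract describes.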

TLDR: The paper proposes SphereAR, a continuous-token autoregressive model for image generation that utilizes hyperspherical VAEs to address variance collapse, achieving state-of-the-art FID scores compared to other AR, diffusion, and masked-generation models.

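TLDR: This paper proposes SphereAR, a continuous-token autoregressive model for image generation that uses hyperspherical VAEs to address variance collapse, achieving state-of-the-art FID scores relative to other AR, diffusion, and masked-generation models.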

Relevance: (9/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (9/10)

Authors: Guolin Ke, Hui Xue

UniVid: The Open-Source Unified Video Model

Unified video modeling that combines generation and understanding capabilities is increasingly important but faces two key challenges: maintaining semantic faithfulness during flow-based generation due to text-visual token imbalance and the limitations of uniform cross-modal attention across the flow trajectory, and efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance, achieving a 2.2% improvement on VBench-Long total score compared to EasyAnimateV5.1, and 1.0% and 3.3% accuracy gains on MSVD-QA and ActivityNet-QA, respectively, compared with the best prior 7B baselines.

TLDR: UniVid is a unified video model combining generation and understanding via MLLM and diffusion, improving performance on video benchmarks with techniques like Temperature Modality Alignment and Pyramid Reflection.

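TLDR: UniVid is a unified video model that combines generation and understanding through an MLLM and a diffusion decoder, and improves video benchmark performance with techniques such as Temperature Modality Alignment and Pyramid Reflection.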

Relevance: (10/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Jiabin Luo, Junhui Lin, Zeyu Zhang, Biao Wu, Meng Fang, Ling Chen, Hao Tang

Score Distillation of Flow Matching Models

Diffusion models achieve high-quality image generation but are limited by slow iterative sampling. Distillation methods alleviate this by enabling one- or few-step generation. Flow matching, originally introduced as a distinct framework, has since been shown to be theoretically equivalent to diffusion under Gaussian assumptions, raising the question of whether distillation techniques such as score distillation transfer directly. We provide a simple derivation -- based on Bayes' rule and conditional expectations -- that unifies Gaussian diffusion and flow matching without relying on ODE/SDE formulations. Building on this view, we extend Score identity Distillation (SiD) to pretrained text-to-image flow-matching models, including SANA, SD3-Medium, SD3.5-Medium/Large, and FLUX.1-dev, all with DiT backbones. Experiments show that, with only modest flow-matching- and DiT-specific adjustments, SiD works out of the box across these models, in both data-free and data-aided settings, without requiring teacher finetuning or architectural changes. This provides the first systematic evidence that score distillation applies broadly to text-to-image flow matching models, resolving prior concerns about stability and soundness and unifying acceleration techniques across diffusion- and flow-based generators. We will make the PyTorch implementation publicly available.
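
The link between a flow-matching velocity and the quantities score distillation works with can be written out in a few lines. This sketch assumes the linear interpolation x_t = (1 - t) x0 + t eps with velocity target v = eps - x0; other models may use a different convention, and the function names are illustrative rather than taken from the paper. For a trained model the identities hold for conditional expectations given x_t.

```python
import numpy as np

def x0_from_velocity(x_t, v, t):
    """Clean-sample estimate implied by a velocity prediction."""
    return x_t - t * v

def eps_from_velocity(x_t, v, t):
    """Noise estimate implied by a velocity prediction."""
    return x_t + (1.0 - t) * v

def score_from_velocity(x_t, v, t):
    """Score estimate via Tweedie's formula: score = -E[eps | x_t] / sigma_t.

    Under the assumed interpolation the noise scale is sigma_t = t.
    """
    return -eps_from_velocity(x_t, v, t) / t
```

So a velocity-predicting network already provides a score estimate by pure algebra, which is the practical reason a score-based distillation loss can be applied to a flow-matching teacher.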

TLDR: This paper extends score distillation techniques, specifically Score identity Distillation (SiD), to various pre-trained text-to-image flow-matching models, demonstrating its broad applicability without significant model-specific adjustments. This unifies acceleration techniques across diffusion- and flow-based generative models.

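TLDR: This paper extends score distillation, in particular Score identity Distillation (SiD), to a range of pretrained text-to-image flow-matching models, showing that it applies broadly without major model-specific adjustments. This unifies acceleration techniques for diffusion-based and flow-based generative models.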

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Mingyuan Zhou, Yi Gu, Huangjie Zheng, Liangchen Song, Guande He, Yizhe Zhang, Wenze Hu, Yinfei Yang

STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation

Reinforcement learning has recently been explored to improve text-to-image generation, yet applying existing GRPO algorithms to autoregressive (AR) image models remains challenging. The instability of the training process easily disrupts the pretrained model capability during long runs, resulting in marginal gains, degraded image quality, and poor generalization. In this work, we revisit GRPO for AR image generation and identify two key issues: contradictory gradients from unnecessary tokens and unstable policy entropy dynamics. To address these, we introduce STAGE, a stable and generalizable framework that leverages two targeted solutions: 1) Advantage/KL reweighting: similarity-aware reweighting to alleviate conflicting updates; and 2) Entropy reward: an entropy-based reward corresponding to the reference model to stabilize learning. By alleviating conflicts between tokens and stabilizing training with an entropy reward, we reduce disruption of the pretrained distribution and mitigate reward hacking, which in turn improves generalization and transfers better to other benchmarks. Experiments across multiple benchmarks show that STAGE consistently improves visual quality, stability, and cross-task generalization compared to baseline GRPO.
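
For readers unfamiliar with the baseline, the two quantities the abstract refers to can be sketched as below: the group-relative advantage used by GRPO, and the per-token policy entropy whose dynamics are said to be unstable. This shows only the generic baseline quantities; STAGE's similarity-aware reweighting and entropy reward are not specified in the abstract and are not reproduced here. The function names are illustrative.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def token_entropy(logits):
    """Entropy of the next-token distribution at each position."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)
```

In plain GRPO every token of a sampled image inherits the same sequence-level advantage, which is one way tokens shared across high- and low-reward samples can receive opposing updates.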

TLDR: The paper introduces STAGE, a stable and generalizable GRPO framework for autoregressive image generation that addresses contradictory gradients and unstable policy entropy dynamics using advantage/KL reweighting and an entropy reward.

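TLDR: This paper introduces STAGE, a stable and generalizable GRPO framework for autoregressive image generation that resolves contradictory gradients and unstable policy entropy dynamics through advantage/KL reweighting and an entropy reward.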

Relevance: (9/10)

Novelty: (7/10)

Clarity: (8/10)

Potential Impact: (7/10)

Overall: (8/10)

Authors: Xiaoxiao Ma, Haibo Qiu, Guohui Zhang, Zhixiong Zeng, Siqi Yang, Lin Ma, Feng Zhao

PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion

Generating a complete and explorable 360-degree visual world enables a wide range of downstream applications. While prior works have advanced the field, they remain constrained by either narrow field-of-view limitations, which hinder the synthesis of continuous and holistic scenes, or insufficient camera controllability that restricts free exploration by users or autonomous agents. To address this, we propose PanoWorld-X, a novel framework for high-fidelity and controllable panoramic video generation with diverse camera trajectories. Specifically, we first construct a large-scale dataset of panoramic video-exploration route pairs by simulating camera trajectories in virtual 3D environments via Unreal Engine. As the spherical geometry of panoramic data misaligns with the inductive priors from conventional video diffusion, we then introduce a Sphere-Aware Diffusion Transformer architecture that reprojects equirectangular features onto the spherical surface to model geometric adjacency in latent space, significantly enhancing visual fidelity and spatiotemporal continuity. Extensive experiments demonstrate that our PanoWorld-X achieves superior performance in various aspects, including motion range, control precision, and visual quality, underscoring its potential for real-world applications.
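
The geometric mismatch mentioned above can be seen with a small sketch that maps equirectangular pixel coordinates to points on the unit sphere. The axis convention and pixel-center offset are assumptions for illustration, and this is not the paper's reprojection module.

```python
import numpy as np

def equirect_to_sphere(u, v, width, height):
    """Map pixel (u, v) of an equirectangular image to a unit-sphere point."""
    # Longitude spans [-pi, pi) across the width; latitude spans
    # [pi/2, -pi/2] from the top row to the bottom row.
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi
    return np.stack([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ], axis=-1)
```

Pixels in the first and last columns of a row are as far apart as possible in the image grid yet are direct neighbors on the sphere, which is the kind of adjacency a grid-based video diffusion prior does not capture.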

TLDR: PanoWorld-X introduces a sphere-aware diffusion transformer for generating high-fidelity, controllable panoramic videos with diverse camera trajectories, addressing limitations in existing panoramic video generation methods.

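TLDR: PanoWorld-X proposes a sphere-aware diffusion transformer for generating high-fidelity, controllable panoramic videos with diverse camera trajectories, addressing the limitations of existing panoramic video generation methods.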

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Yuyang Yin, HaoXiang Guo, Fangfu Liu, Mengyu Wang, Hanwen Liang, Eric Li, Yikai Wang, Xiaojie Jin, Yao Zhao, Yunchao Wei

Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel

RGBA video generation, which includes an alpha channel to represent transparency, is gaining increasing attention across a wide range of applications. However, existing methods often neglect visual quality, limiting their practical usability. In this paper, we propose Wan-Alpha, a new framework that generates transparent videos by learning both RGB and alpha channels jointly. We design an effective variational autoencoder (VAE) that encodes the alpha channel into the RGB latent space. Then, to support the training of our diffusion transformer, we construct a high-quality and diverse RGBA video dataset. Compared with state-of-the-art methods, our model demonstrates superior performance in visual quality, motion realism, and transparency rendering. Notably, our model can generate a wide variety of semi-transparent objects, glowing effects, and fine-grained details such as hair strands. The released model is available on our website: https://donghaotian123.github.io/Wan-Alpha/.
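
As a reminder of what the alpha channel is used for downstream, the sketch below composites one RGBA frame over a background with the standard "over" operation. It assumes straight (non-premultiplied) alpha with values in [0, 1]; the function name is illustrative and unrelated to the Wan-Alpha code.

```python
import numpy as np

def composite_over(rgba, background):
    """Blend an RGBA frame (H x W x 4) over an RGB background (H x W x 3)."""
    rgb, alpha = rgba[..., :3], rgba[..., 3:4]
    # alpha = 1 keeps the foreground, alpha = 0 shows the background;
    # values in between give semi-transparency (glass, smoke, hair edges).
    return alpha * rgb + (1.0 - alpha) * background
```

Errors in intermediate alpha values show up directly as halos or hard edges after compositing, which is why semi-transparent objects and hair strands are demanding test cases.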

TLDR: The paper introduces Wan-Alpha, a new text-to-RGBA video generation framework with a focus on high visual quality and realistic transparency rendering, achieved through a VAE-enhanced diffusion transformer trained on a newly constructed high-quality dataset.

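TLDR: This paper introduces Wan-Alpha, a new text-to-RGBA video generation framework focused on high visual quality and realistic transparency rendering, using a VAE-enhanced diffusion transformer trained on a newly built high-quality dataset.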

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Haotian Dong, Wenjing Wang, Chen Li, Di Lin

Scalable GANs with Transformers

Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investigate the scalability of Generative Adversarial Networks (GANs) through two design choices that have proven to be effective in other types of generative models: training in a compact Variational Autoencoder latent space and adopting purely transformer-based generators and discriminators. Training in latent space enables efficient computation while preserving perceptual fidelity, and this efficiency pairs naturally with plain transformers, whose performance scales with computational budget. Building on these choices, we analyze failure modes that emerge when naively scaling GANs. Specifically, we find issues such as underutilization of early layers in the generator and optimization instability as the network scales. Accordingly, we provide simple and scale-friendly solutions such as lightweight intermediate supervision and width-aware learning-rate adjustment. Our experiments show that GAT, a purely transformer-based, latent-space GAN, can be trained reliably across a wide range of capacities (S through XL). Moreover, GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance (FID of 2.96) on ImageNet-256 in just 40 epochs, 6x fewer epochs than strong baselines.

TLDR: This paper introduces a scalable GAN architecture (GAT) using transformers in a VAE latent space, addressing scaling issues and achieving state-of-the-art results on ImageNet-256 with improved training efficiency.

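TLDR: This paper introduces a scalable GAN architecture (GAT) that uses transformers in a VAE latent space, resolves scaling issues, and achieves state-of-the-art results on ImageNet-256 with improved training efficiency.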

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer

Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, prior attempts fail to match the expressiveness of softmax attention without costly retraining. We introduce Attention Surgery, an efficient framework for linearizing or hybridizing attention in pretrained VDMs without training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism (mixing softmax and linear tokens) with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art DiT-based VDM, Attention Surgery achieves the first competitive sub-quadratic attention video diffusion models, reducing attention cost by up to 40% in terms of FLOPs, while maintaining generation quality as measured on the standard VBench and VBench-2.0 benchmarks.
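
The quadratic versus linear cost gap can be made concrete with a back-of-the-envelope FLOP count per attention head. This is a rough sketch that counts only the two main matrix products (2 FLOPs per multiply-add) and ignores softmax, feature maps, normalization, and projections; the function names and the example token count are illustrative, and the hybrid scheme from the paper is not modeled.

```python
def softmax_attention_flops(n_tokens, head_dim):
    """Q K^T (n x n scores) plus scores @ V: both scale with n^2 * d."""
    return 4 * n_tokens**2 * head_dim

def linear_attention_flops(n_tokens, head_dim):
    """K^T V (d x d state) plus Q @ state: both scale with n * d^2."""
    return 4 * n_tokens * head_dim**2

# Example with a hypothetical video token count and head size.
n, d = 32768, 64
ratio = softmax_attention_flops(n, d) / linear_attention_flops(n, d)
```

Under this simplified count the ratio is exactly n / d, so the advantage of linear attention grows with sequence length, which is why long, high-resolution videos are where linearization pays off.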

TLDR: The paper introduces "Attention Surgery," a method to efficiently linearize or hybridize attention in pretrained video diffusion models without training from scratch, reducing attention FLOPs by up to 40% while maintaining generation quality.

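TLDR: This paper introduces a method called "Attention Surgery" that efficiently linearizes the attention mechanisms in video diffusion models without extensive retraining, cutting attention FLOPs by up to 40% while maintaining generation quality.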

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Mohsen Ghafoorian, Denis Korzhenkov, Amirhossein Habibian

Environment-Aware Satellite Image Generation with Diffusion Models

Diffusion-based foundation models have recently garnered much attention in the field of generative modeling due to their ability to generate images of high quality and fidelity. Although not straightforward, their recent application to the field of remote sensing signaled the first successful trials towards harnessing the large volume of publicly available datasets containing multimodal information. Despite their success, existing methods face considerable limitations: they rely on limited environmental context, struggle with missing or corrupted data, and often fail to reliably reflect user intentions in generated outputs. In this work, we propose a novel diffusion model conditioned on environmental context that is able to generate satellite images by conditioning on any combination of three different control signals: a) text, b) metadata, and c) visual data. In contrast to previous works, the proposed method i) is, to our knowledge, the first of its kind to condition satellite image generation on dynamic environmental conditions as part of its control signals, and ii) incorporates a metadata fusion strategy that models attribute embedding interactions to account for partially corrupt and/or missing observations. Our method outperforms previous methods both qualitatively (robustness to missing metadata, higher responsiveness to control inputs) and quantitatively (higher fidelity, accuracy, and quality of generations measured using 6 different metrics) in the trials of single-image and temporal generation. The reported results support our hypothesis that conditioning on environmental context can improve the performance of foundation models for satellite imagery, and render our model a promising candidate for usage in downstream tasks. The collected 3-modal dataset is, to our knowledge, the first publicly available dataset to combine data from these three different mediums.

TLDR: This paper introduces a novel diffusion model for satellite image generation conditioned on environmental context using text, metadata, and visual data, outperforming existing methods in robustness and generation quality.

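TLDR: This paper introduces a new diffusion model that generates satellite images conditioned on environmental context, using text, metadata, and visual data, and outperforms existing methods in robustness and generation quality.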

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Nikos Kostagiolas, Pantelis Georgiades, Yannis Panagakis, Mihalis A. Nicolaou

Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. Our method enables causal interventions on target attributes, consistently propagating their effects to causal dependents without altering the core identity of the image. In contrast to prior approaches that rely on prompt engineering without explicit causal structure, Causal-Adapter leverages structural causal modeling augmented with two attribute regularization strategies: prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and a conditioned token contrastive loss to disentangle attribute factors and reduce spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, with up to 91% MAE reduction on Pendulum for accurate attribute control and 87% FID reduction on ADNI for high-fidelity MRI image generation. These results show that our approach enables robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation.

TLDR: The paper introduces Causal-Adapter, a novel framework for counterfactual image generation using text-to-image diffusion models, achieving state-of-the-art performance by leveraging structural causal modeling and attribute regularization strategies.

TLDR: 该论文介绍了Causal-Adapter，一种用于反事实图像生成的新框架，它使用文本到图像的扩散模型，通过利用结构因果建模和属性正则化策略实现了最先进的性能。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Lei Tong, Zhihua Liu, Chaochao Lu, Dino Oglic, Tom Diethe, Philip Teare, Sotirios A. Tsaftaris, Chen Jin

VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

Video-conditioned sound and speech generation, encompassing video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks, is conventionally addressed as two separate tasks, with limited exploration of unifying them within a single framework. Recent attempts to unify V2S and VisualTTS face challenges in handling distinct condition types (e.g., heterogeneous video and transcript conditions) and require complex training stages. Unifying these two tasks remains an open problem. To bridge this gap, we present VSSFlow, which seamlessly integrates both V2S and VisualTTS tasks into a unified flow-matching framework. VSSFlow uses a novel condition aggregation mechanism to handle distinct input signals. We find that cross-attention and self-attention layers exhibit different inductive biases when introducing conditions. Therefore, VSSFlow leverages these inductive biases to effectively handle different representations: cross-attention for ambiguous video conditions and self-attention for more deterministic speech transcripts. Furthermore, contrary to the prevailing belief that joint training on the two tasks requires complex training strategies and may degrade performance, we find that VSSFlow benefits from the end-to-end joint learning process for sound and speech generation without extra designs on training stages. Detailed analysis attributes this to the learned general audio prior shared between tasks, which accelerates convergence, enhances conditional generation, and stabilizes the classifier-free guidance process. Extensive experiments demonstrate that VSSFlow surpasses the state-of-the-art domain-specific baselines on both V2S and VisualTTS benchmarks, underscoring the critical potential of unified generative models.

TLDR: The paper introduces VSSFlow, a unified flow-matching framework for video-conditioned sound and speech generation tasks (V2S and VisualTTS), demonstrating improved performance through joint learning and a novel condition aggregation mechanism.

TLDR: 该论文介绍了VSSFlow，一个统一的流匹配框架，用于视频条件下的声音和语音生成任务（V2S和VisualTTS），通过联合学习和一种新颖的条件聚合机制，展示了性能的提升。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song

Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

Diffusion models can generate realistic videos, but existing methods rely on implicitly learning physical reasoning from large-scale text-video datasets, which is costly, difficult to scale, and still prone to producing implausible motions that violate fundamental physical laws. We introduce a training-free framework that improves physical plausibility at inference time by explicitly reasoning about implausibility and guiding the generation away from it. Specifically, we employ a lightweight physics-aware reasoning pipeline to construct counterfactual prompts that deliberately encode physics-violating behaviors. Then, we propose a novel Synchronized Decoupled Guidance (SDG) strategy, which leverages these prompts through synchronized directional normalization to counteract lagged suppression and trajectory-decoupled denoising to mitigate cumulative trajectory bias, ensuring that implausible content is suppressed immediately and consistently throughout denoising. Experiments across different physical domains show that our approach substantially enhances physical fidelity while maintaining photorealism, despite requiring no additional training. Ablation studies confirm the complementary effectiveness of both the physics-aware reasoning component and SDG. In particular, the aforementioned two designs of SDG are also individually validated to contribute critically to the suppression of implausible content and the overall gains in physical plausibility. This establishes a new and plug-and-play physics-aware paradigm for video generation.

TLDR: This paper introduces a training-free method that improves the physical plausibility of generated videos by explicitly reasoning about and suppressing implausible motions using physics-aware reasoning and a novel Synchronized Decoupled Guidance strategy.

TLDR: 本文介绍了一种无需训练的方法，通过显式推理和抑制不合理的运动，利用物理感知推理和一种新的同步解耦指导策略来提高生成视频的物理合理性。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Yutong Hao, Chen Chen, Ajmal Saeed Mian, Chang Xu, Daochang Liu

Instruction Guided Multi Object Image Editing with Quantity and Layout Consistency

Instruction-driven image editing with standard CLIP text encoders often fails in complex scenes with many objects. We present QL-Adapter, a framework for multiple object editing that tackles two challenges: enforcing object counts and spatial layouts, and accommodating diverse categories. QL-Adapter consists of two core modules: the Image-Layout Fusion Module (ILFM) and the Cross-Modal Augmentation Module (CMAM). ILFM fuses layout priors with ViT patch tokens from the CLIP image encoder to strengthen spatial structure understanding. CMAM injects image features into the text branch to enrich textual embeddings and improve instruction following. We further build QL-Dataset, a benchmark that spans broad category, layout, and count variations, and define the task of quantity and layout consistent image editing (QL-Edit). Extensive experiments show that QL-Adapter achieves state-of-the-art performance on QL-Edit and significantly outperforms existing models.

TLDR: The paper introduces QL-Adapter, a novel framework for instruction-guided multi-object image editing that addresses challenges in enforcing object counts and spatial layouts. It outperforms existing models on the QL-Edit task, evaluated with the new QL-Dataset benchmark.

TLDR: 该论文介绍了QL-Adapter，一种用于指令引导的多对象图像编辑的新框架，解决了在执行对象计数和空间布局方面的挑战。它在QL-Edit任务上优于现有模型，评测基于新构建的基准QL-Dataset。

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Jiaqi Tan, Fangyu Li, Yang Liu

Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs

Multimodal Large Language Models (MLLMs) have significantly improved performance on various tasks but continue to suffer from visual hallucinations, a critical issue where generated responses contradict visual evidence. While Direct Preference Optimization (DPO) is widely used for alignment, its application to MLLMs often fails to capture fine-grained semantic differences and encourages shortcut learning. To address these challenges, we propose Semantic Curriculum Preference Optimization (SCPO), a novel framework for MLLM alignment. SCPO employs a progressive, easy-to-hard curriculum built upon our Semantic Curriculum Preference Pairs dataset, which provides fine-grained semantic contrasts sorted by difficulty. This curriculum is trained with a dynamic reference model and a novel symmetric, bidirectional objective to facilitate simultaneous learning from both textual and visual preferences. To our knowledge, SCPO is the first framework to unify semantics, symmetry, and curriculum for MLLM alignment, effectively mitigating visual hallucinations. Extensive experiments on LLaVA models across various scales and versions validate that SCPO demonstrates superior performance compared to baseline models on multiple hallucination benchmarks, reducing the hallucination rate by up to 62.9%. Moreover, evaluations on generalized benchmarks show that SCPO improves factuality while preserving general capabilities, with its performance remaining stable across general vision-language benchmarks.
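
For reference, the standard DPO objective that the abstract takes as its starting point can be sketched for a single preference pair as below. This is the generic DPO loss, not SCPO's symmetric, bidirectional objective; the function name, argument names, and the default beta are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is a summed sequence log-probability. `beta` is the usual
    temperature on the implicit reward; 0.1 is only an illustrative default.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the frozen reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy equals the reference, the logits are zero and the loss is log 2; it falls below that as the policy shifts probability toward the chosen response relative to the reference.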

TLDR: This paper introduces Semantic Curriculum Preference Optimization (SCPO), a novel framework for mitigating visual hallucinations in Multimodal Large Language Models (MLLMs) using a curriculum learning approach based on semantic preference pairs. It achieves significant reduction in hallucination rates while preserving general capabilities.

TLDR: 本文介绍了一种名为语义课程偏好优化 (SCPO) 的新型框架，该框架通过基于语义偏好对的课程学习方法来减轻多模态大型语言模型 (MLLM) 中的视觉幻觉。它在保持一般能力的同时，显著降低了幻觉率。

Relevance: (7/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Yuanshuai Li, Yuping Yan, Junfeng Tang, Yunxuan Li, Zeqi Zheng, Yaochu Jin

NeoWorld: Neural Simulation of Explorable Virtual Worlds via Progressive 3D Unfolding

We introduce NeoWorld, a deep learning framework for generating interactive 3D virtual worlds from a single input image. Inspired by the on-demand worldbuilding concept in the science fiction novel Simulacron-3 (1964), our system constructs expansive environments where only the regions actively explored by the user are rendered with high visual realism through object-centric 3D representations. Unlike previous approaches that rely on global world generation or 2D hallucination, NeoWorld models key foreground objects in full 3D, while synthesizing backgrounds and non-interacted regions in 2D to ensure efficiency. This hybrid scene structure, implemented with cutting-edge representation learning and object-to-3D techniques, enables flexible viewpoint manipulation and physically plausible scene animation, allowing users to control object appearance and dynamics using natural language commands. As users interact with the environment, the virtual world progressively unfolds with increasing 3D detail, delivering a dynamic, immersive, and visually coherent exploration experience. NeoWorld significantly outperforms existing 2D and depth-layered 2.5D methods on the WorldScore benchmark.

TLDR: NeoWorld generates interactive 3D virtual worlds from a single image, using a hybrid 2D/3D approach for efficiency and progressive unfolding as the user explores. It outperforms existing methods on the WorldScore benchmark.

TLDR: NeoWorld从单张图像生成交互式3D虚拟世界，采用混合2D/3D方法提高效率，并随着用户探索逐步展开。它在WorldScore基准测试中优于现有方法。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Yanpeng Zhao, Shanyan Guan, Yunbo Wang, Yanhao Ge, Wei Li, Xiaokang Yang

CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers

Visual generation quality has been greatly improved by the rapid advances in diffusion transformers (DiTs), which is attributed to the scaling of model size and complexity. However, this scaling also hinders the practical deployment of DiTs on edge devices, limiting their development and application. Serving as an efficient model compression technique, post-training quantization (PTQ) can reduce memory consumption and speed up inference, at the cost of inevitable performance degradation. To alleviate the degradation, we propose CLQ, a cross-layer guided orthogonal-based quantization method for DiTs. Specifically, CLQ consists of three key designs. First, we observe that the calibration data used by most PTQ methods cannot faithfully represent the distribution of the activations. Therefore, we propose cross-block calibration (CBC) to obtain accurate calibration data, with which the quantization can be better guided. Second, we propose orthogonal-based smoothing (OBS), which quantifies the outlier score of each channel and leverages a block Hadamard matrix to smooth the outliers with negligible overhead. Third, we propose cross-layer parameter searching (CLPS). We evaluate CLQ on both image generation and video generation models and successfully compress the models to W4A4 with negligible degradation in visual quality and metrics. CLQ achieves 3.98x memory saving and 3.95x speedup. Our code is available at https://github.com/Kai-Liu001/CLQ.
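
To illustrate why a Hadamard rotation helps with activation outliers, here is a minimal sketch using a plain (non-block) Hadamard matrix on a made-up activation vector. It is not CLQ's OBS module, which additionally scores outliers per channel and uses a block Hadamard matrix; the helper names and the sample values are assumptions for illustration.

```python
import math


def hadamard(n):
    """Return the n x n orthonormal Hadamard matrix (Sylvester construction)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # [[H, H], [H, -H]] doubles the size at each step.
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def rotate(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# Hypothetical activation vector with a single outlier channel (index 3).
activations = [0.1, -0.2, 0.15, 8.0, 0.05, -0.1, 0.2, -0.15]
H = hadamard(len(activations))
smoothed = rotate(H, activations)   # outlier energy is spread over all channels
restored = rotate(H, smoothed)      # H is symmetric and orthogonal, so H @ H = I
```

The rotation shrinks the largest magnitude a quantizer has to cover while remaining exactly invertible, so the inverse rotation can be folded into the adjacent weights.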

TLDR: The paper proposes CLQ, a post-training quantization method for diffusion transformers that uses cross-layer guidance and orthogonal-based techniques to achieve high compression (W4A4) with minimal performance degradation, resulting in significant memory savings and speedup.

TLDR: 该论文提出了一种用于扩散Transformer的后训练量化方法CLQ，该方法利用跨层指导和基于正交的技术实现高压缩率（W4A4），同时将性能下降降到最低，从而显著节省内存并提升推理速度。

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Kai Liu, Shaoqiu Zhang, Linghe Kong, Yulun Zhang

RapidMV: Leveraging Spatio-Angular Representations for Efficient and Consistent Text-to-Multi-View Synthesis

Generating synthetic multi-view images from a text prompt is an essential bridge to generating synthetic 3D assets. In this work, we introduce RapidMV, a novel text-to-multi-view generative model that can produce 32 multi-view synthetic images in just around 5 seconds. In essence, we propose a novel spatio-angular latent space, encoding both the spatial appearance and angular viewpoint deviations into a single latent for improved efficiency and multi-view consistency. We achieve effective training of RapidMV by strategically decomposing our training process into multiple steps. We demonstrate that RapidMV outperforms existing methods in terms of consistency and latency, with competitive quality and text-image alignment.

TLDR: RapidMV is a new text-to-multi-view image generation model that leverages a spatio-angular latent space to generate 32 consistent multi-view images in approximately 5 seconds, outperforming existing methods in consistency and latency.

TLDR: RapidMV是一种新的文本到多视角图像生成模型，它利用空间-角度潜在空间在约5秒内生成32个一致的多视角图像，在一致性和延迟方面优于现有方法。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Seungwook Kim, Yichun Shi, Kejie Li, Minsu Cho, Peng Wang

Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models

Unified Multimodal Models (UMMs) built on shared autoregressive (AR) transformers are attractive for their architectural simplicity. However, we identify a critical limitation: when trained on multimodal inputs, modality-shared transformers suffer from severe gradient conflicts between vision and text, particularly in shallow and deep layers. We trace this issue to the fundamentally different low-level statistical properties of images and text, while noting that conflicts diminish in middle layers where representations become more abstract and semantically aligned. To overcome this challenge, we propose Uni-X, a two-end-separated, middle-shared architecture. Uni-X dedicates its initial and final layers to modality-specific processing, while maintaining shared parameters in the middle layers for high-level semantic fusion. This X-shaped design not only eliminates gradient conflicts at both ends but also further alleviates residual conflicts in the shared layers. Extensive experiments validate the effectiveness of Uni-X. Under identical training conditions, Uni-X achieves superior training efficiency compared to strong baselines. When scaled to 3B parameters with larger training data, Uni-X matches or surpasses 7B AR-based UMMs, achieving a GenEval score of 82 for image generation alongside strong performance in text and vision understanding tasks. These results establish Uni-X as a parameter-efficient and scalable foundation for future unified multimodal modeling. Our code is available at https://github.com/CURRENTF/Uni-X

TLDR: The paper introduces Uni-X, a two-end-separated architecture for unified multimodal modeling, addressing gradient conflicts between vision and text modalities, achieving better training efficiency and performance compared to existing models, especially in image generation.

TLDR: 该论文介绍了Uni-X，一种用于统一多模态建模的两端分离架构，解决了视觉和文本模态之间的梯度冲突，与现有模型相比，实现了更好的训练效率和性能，尤其是在图像生成方面。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Jitai Hao, Hao Liu, Xinyan Xiao, Qiang Huang, Jun Yu

Semantic Editing with Coupled Stochastic Differential Equations

Editing the content of an image with a pretrained text-to-image model remains challenging. Existing methods often distort fine details or introduce unintended artifacts. We propose using coupled stochastic differential equations (coupled SDEs) to guide the sampling process of any pre-trained generative model that can be sampled by solving an SDE, including diffusion and rectified flow models. By driving both the source image and the edited image with the same correlated noise, our approach steers new samples toward the desired semantics while preserving visual similarity to the source. The method works out of the box, without retraining or auxiliary networks, and achieves high prompt fidelity along with near-pixel-level consistency. These results position coupled SDEs as a simple yet powerful tool for controlled generative AI.
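
The effect of driving two trajectories with the same noise can be seen in a toy one-dimensional example. The sketch below integrates two Ornstein-Uhlenbeck-style SDEs with Euler-Maruyama; the linear drifts toward `mu_src` and `mu_edit` are stand-ins for a pretrained model's source-prompt and edit-prompt dynamics, and every name and parameter here is an illustrative assumption rather than the paper's algorithm.

```python
import random


def simulate_pair(mu_src, mu_edit, shared_noise,
                  steps=200, dt=0.01, sigma=1.0, seed=0):
    """Euler-Maruyama integration of two toy SDEs dx = -(x - mu) dt + sigma dW."""
    rng = random.Random(seed)
    x_src = x_edit = 0.0
    for _ in range(steps):
        z_src = rng.gauss(0.0, 1.0)
        # Coupling: reuse the same Gaussian increment for the edited trajectory.
        z_edit = z_src if shared_noise else rng.gauss(0.0, 1.0)
        x_src += -(x_src - mu_src) * dt + sigma * (dt ** 0.5) * z_src
        x_edit += -(x_edit - mu_edit) * dt + sigma * (dt ** 0.5) * z_edit
    return x_src, x_edit


coupled_src, coupled_edit = simulate_pair(0.0, 1.0, shared_noise=True)
free_src, free_edit = simulate_pair(0.0, 1.0, shared_noise=False)
```

With shared noise the stochastic terms cancel in the difference between the two trajectories, so the gap depends only on the two drifts. With independent noise the gap also carries the accumulated randomness of both paths, which is the kind of unintended drift from the source that coupling is meant to suppress.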

TLDR: This paper presents a novel approach using coupled Stochastic Differential Equations (SDEs) to improve semantic image editing with pre-trained text-to-image models, achieving high prompt fidelity and near-pixel-level consistency without retraining.

TLDR: 本文提出了一种使用耦合随机微分方程（SDE）的新方法，以改进使用预训练文本到图像模型的语义图像编辑，无需重新训练即可实现高prompt保真度和近像素级一致性。

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Jianxin Zhang, Clayton Scott

Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow

Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a central challenge in computer graphics, animation, and human-computer interaction. We introduce DualFlow, a unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion synthesis on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it achieves deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. We use a contrastive objective that further strengthens alignment with conditioning signals and introduce a synchronization loss that improves inter-person coordination. Extensive evaluations across text-to-motion, music-to-motion, and multi-modal interactive benchmarks show consistent gains in motion quality, responsiveness, and efficiency. DualFlow produces temporally coherent and rhythmically synchronized motions, setting a new state of the art in multi-modal human motion generation.
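
The straight-line sampling paths mentioned above come from the rectified flow formulation, which can be sketched in a few lines. The code below shows the linear interpolation used to build training points, the constant velocity target, and an Euler sampler; the `oracle` velocity function stands in for a trained network and, like the other names, is an assumption for illustration, not DualFlow's model.

```python
def interpolate(x0, x1, t):
    """Point on the straight line between noise x0 (t=0) and data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for the velocity network: constant along the line."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.5, -1.0]
data = [2.0, 3.0]
# Stand-in for a learned network that has fit this single pair perfectly.
oracle = lambda x, t: velocity_target(noise, data)
one_step = euler_sample(noise, oracle, steps=1)
ten_steps = euler_sample(noise, oracle, steps=10)
```

Because the ideal velocity is constant along each path, a single Euler step lands on the same point as many small steps, which is where the reduction in inference time comes from. A learned velocity field is only approximately straight, so in practice a few steps are typically still used.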

TLDR: The paper introduces DualFlow, a rectified flow-based framework for generating realistic two-person motion conditioned on text, music, and prior motion, using RAG and synchronization loss for enhanced quality and efficiency.

TLDR: 该论文介绍了DualFlow，一个基于校正流的框架，用于生成逼真的双人运动，以文本、音乐和先前的运动为条件，使用RAG和同步损失来提高质量和效率。

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Prerit Gupta, Shourya Verma, Ananth Grama, Aniket Bera

Autoregressive Video Generation beyond Next Frames Prediction

Autoregressive models for video generation typically operate frame-by-frame, extending next-token prediction from language to video's temporal dimension. Whereas the word is universally agreed upon as the token in language, we question whether the frame is an appropriate prediction unit for video. To address this, we present VideoAR, a unified framework that supports a spectrum of prediction units including full frames, key-detail frames, multiscale refinements, and spatiotemporal cubes. Among these designs, we find that modeling video generation with spatiotemporal cubes as prediction units allows autoregressive models to operate across both spatial and temporal dimensions simultaneously. This approach eliminates the assumption that frames are the natural atomic units for video autoregression. We evaluate VideoAR across diverse prediction strategies, finding that cube-based prediction consistently delivers superior quality, speed, and temporal coherence. By removing the frame-by-frame constraint, our video generator surpasses state-of-the-art baselines on VBench while achieving faster inference and enabling seamless scaling to minute-long sequences. We hope this work will motivate rethinking sequence decomposition in video and other spatiotemporal domains.
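
To make the notion of a prediction unit concrete, the sketch below partitions a video grid into spatiotemporal cubes and lists them in one possible generation order. The function name, the time-major raster ordering, and the example sizes are assumptions for illustration; the abstract does not specify VideoAR's actual cube sizes or ordering.

```python
def split_into_cubes(frames, height, width, cube):
    """Return the (t, y, x) origin of every cube, in time-major raster order.

    `cube` is a (frames, height, width) tuple giving the size of one unit.
    """
    ct, ch, cw = cube
    if frames % ct or height % ch or width % cw:
        raise ValueError("video dimensions must be divisible by the cube size")
    return [(t, y, x)
            for t in range(0, frames, ct)
            for y in range(0, height, ch)
            for x in range(0, width, cw)]


# An 8-frame, 4x4 token grid split into 4x2x2 cubes: 8 autoregressive steps.
cube_units = split_into_cubes(8, 4, 4, cube=(4, 2, 2))
# Frame-by-frame prediction is the special case where a cube is one full frame.
frame_units = split_into_cubes(8, 4, 4, cube=(1, 4, 4))
```

Each entry is one autoregressive step, so the cube size controls how many steps a clip needs and how much spatial and temporal context each step spans.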

TLDR: The paper introduces VideoAR, a video generation framework that moves beyond frame-by-frame prediction by using spatiotemporal cubes as prediction units, achieving improved quality, speed, and temporal coherence compared to existing methods.

TLDR: 该论文介绍了VideoAR，一种视频生成框架，它通过使用时空立方体作为预测单元，超越了逐帧预测，与现有方法相比，实现了更高的质量、速度和时间连贯性。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Sucheng Ren, Chen Chen, Zhenbang Wang, Liangchen Song, Xiangxin Zhu, Alan Yuille, Yinfei Yang, Jiasen Lu

UniLat3D: Geometry-Appearance Unified Latents for Single-Stage 3D Generation

High-fidelity 3D asset generation is crucial for various industries. While recent 3D pretrained models show strong capability in producing realistic content, most are built upon diffusion models and follow a two-stage pipeline that first generates geometry and then synthesizes appearance. Such a decoupled design tends to produce geometry-texture misalignment and non-negligible cost. In this paper, we propose UniLat3D, a unified framework that encodes geometry and appearance in a single latent space, enabling direct single-stage generation. Our key contribution is a geometry-appearance Unified VAE, which compresses high-resolution sparse features into a compact latent representation -- UniLat. UniLat integrates structural and visual information into a dense low-resolution latent, which can be efficiently decoded into diverse 3D formats, e.g., 3D Gaussians and meshes. Based on this unified representation, we train a single flow-matching model to map Gaussian noise directly into UniLat, eliminating redundant stages. Trained solely on public datasets, UniLat3D produces high-quality 3D assets in seconds from a single image, achieving superior appearance fidelity and geometric quality. More demos & code are available at https://unilat3d.github.io/

TLDR: UniLat3D proposes a single-stage 3D asset generation framework that unifies geometry and appearance into a single latent space using a novel VAE and flow-matching model, achieving high-quality results from a single image.

TLDR: UniLat3D 提出了一个单阶段3D资产生成框架，该框架使用一种新的VAE和流匹配模型将几何和外观统一到一个潜在空间中，从而从单个图像中获得高质量的结果。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Guanjun Wu, Jiemin Fang, Chen Yang, Sikuang Li, Taoran Yi, Jia Lu, Zanwei Zhou, Jiazhong Cen, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Xinggang Wang, Qi Tian

CharGen: Fast and Fluent Portrait Modification

Interactive editing of character images with diffusion models remains challenging due to the inherent trade-off between fine-grained control, generation speed, and visual fidelity. We introduce CharGen, a character-focused editor that combines attribute-specific Concept Sliders, trained to isolate and manipulate attributes such as facial feature size, expression, and decoration, with the StreamDiffusion sampling pipeline for more interactive performance. To counteract the loss of detail that often accompanies accelerated sampling, we propose a lightweight Repair Step that reinstates fine textures without compromising structural consistency. Across extensive ablation studies, comparisons with the open-source InstructPix2Pix and the closed-source Google Gemini, and a comprehensive user study, CharGen achieves two-to-four-fold faster edit turnaround with precise editing control and identity-consistent results. Project page: https://chargen.jdihlmann.com/

TLDR: CharGen is a character image editor combining attribute-specific Concept Sliders with StreamDiffusion for faster, more controlled edits, and a Repair Step to maintain fine details. It outperforms existing methods regarding speed, control, and identity consistency.

TLDR: CharGen 是一款人物图像编辑器，结合了特定属性的概念滑块和 StreamDiffusion 以实现更快、更可控的编辑，并使用修复步骤来保持精细细节。在速度、控制和身份一致性方面，它优于现有方法。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Jan-Niklas Dihlmann, Arnela Killguss, Hendrik P. A. Lensch

Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation

Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, thus limiting their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset comprising 1,567,062 paired CT-PET images and 2,757 corresponding full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLM training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly Vietnamese, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs' learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks, including medical report generation and visual question answering. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, particularly in low-resource languages, and improving their clinical relevance in Vietnamese healthcare.

TLDR: This paper introduces a new Vietnamese-language PET/CT imaging dataset paired with clinical reports for training vision-language foundation models, addressing the lack of such data and low-resource language representation in medical AI.

TLDR: 本文介绍了一个新的越南语 PET/CT 图像数据集，其中包含临床报告，用于训练视觉语言基础模型，解决了医学人工智能中此类数据和低资源语言表示的缺失问题。

Relevance: (6/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Huu Tien Nguyen, Dac Thai Nguyen, The Minh Duc Nguyen, Trung Thanh Nguyen, Thao Nguyen Truong, Huy Hieu Pham, Johan Barthelemy, Minh Quan Tran, Thanh Tam Nguyen, Quoc Viet Hung Nguyen, Quynh Anh Chau, Hong Son Mai, Thanh Trung Nguyen, Phi Le Nguyen

GANji: A Framework for Introductory AI Image Generation

The comparative study of generative models often requires significant computational resources, creating a barrier for researchers and practitioners. This paper introduces GANji, a lightweight framework for benchmarking foundational AI image generation techniques using a dataset of 10,314 Japanese Kanji characters. It systematically compares the performance of a Variational Autoencoder (VAE), a Generative Adversarial Network (GAN), and a Denoising Diffusion Probabilistic Model (DDPM). The results demonstrate that while the DDPM achieves the highest image fidelity, with a Fréchet Inception Distance (FID) score of 26.2, its sampling is over 2,000 times slower than that of the other models. The GANji framework is an effective and accessible tool for revealing the fundamental trade-offs between model architecture, computational cost, and visual quality, making it ideal for both educational and research purposes.
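
As background on the metric quoted above, the Fréchet distance between two Gaussians has a closed form, and its one-dimensional case is easy to compute by hand. The sketch below implements only that univariate special case with hypothetical inputs; the actual FID fits multivariate Gaussians to Inception network features and needs a matrix square root of the covariance product, which is not shown here.

```python
import math


def frechet_distance_1d(mean_a, var_a, mean_b, var_b):
    """Squared Fréchet distance between two univariate Gaussians.

    General form: |mu_a - mu_b|^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2));
    in one dimension the trace term reduces to (sigma_a - sigma_b)^2.
    """
    return (mean_a - mean_b) ** 2 + var_a + var_b - 2.0 * math.sqrt(var_a * var_b)
```

Identical distributions give zero, and the score grows with both the shift in means and the mismatch in spread, which is why a lower FID indicates generated features that are statistically closer to the real ones.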

TLDR: The paper introduces GANji, a lightweight benchmarking framework for VAEs, GANs, and DDPMs on a Kanji character dataset, highlighting the trade-offs between image quality and computational cost. It is suitable for educational and research purposes.

TLDR: 该论文介绍了GANji，一个轻量级的基准测试框架，用于在汉字数据集上评估VAE、GAN和DDPM，突出了图像质量和计算成本之间的权衡。它适用于教育和研究目的。

Relevance: (7/10)

Novelty: (6/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Chandon Hamel, Mike Busch

LayerD: Decomposing Raster Graphic Designs into Layers

Designers craft and edit graphic designs in a layer representation, but layer-based editing becomes impossible once a design is composited into a raster image. In this work, we propose LayerD, a method to decompose raster graphic designs into layers for a re-editable creative workflow. LayerD addresses the decomposition task by iteratively extracting unoccluded foreground layers. We propose a simple yet effective refinement approach taking advantage of the assumption that layers often exhibit uniform appearance in graphic designs. As decomposition is ill-posed and the ground-truth layer structure may not be reliable, we develop a quality metric that addresses this difficulty. In experiments, we show that LayerD successfully achieves high-quality decomposition and outperforms baselines. We also demonstrate the use of LayerD with state-of-the-art image generators and layer-based editing.

TLDR: LayerD is a method to decompose raster images into editable layers, facilitating creative workflows by enabling layer-based editing on rasterized designs.

TLDR: LayerD是一种将栅格图像分解为可编辑图层的方法，通过对栅格化设计启用基于图层的编辑，从而促进创意工作流程。

Relevance: (5/10)

Novelty: (7/10)

Clarity: (8/10)

Potential Impact: (6/10)

Overall: (6/10)

Read Paper (PDF)

Authors: Tomoyuki Suzuki, Kang-Jun Liu, Naoto Inoue, Kota Yamaguchi

Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers

Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable off-the-shelf video representations by predicting masked regions in latent space with an exponential moving average (EMA)-updated teacher. While EMA prevents representation collapse, it complicates scalable model selection and couples teacher and student architectures. We revisit masked-latent prediction and show that a frozen teacher suffices. Concretely, we (i) train a target encoder with a simple pixel-reconstruction objective under V-JEPA masking, then (ii) freeze it and train a student to predict the teacher's latents on masked regions. This leads to a two-stage, unregularized scheme that we refer to as SALT (Static-teacher Asymmetric Latent Training). SALT decouples optimization into pixel reconstruction (teacher) and masked latent prediction (student), increasing transparency, efficiency, and scalability while preserving the representation's ability to generalize under frozen evaluation. Empirically, our student models outperform recently proposed V-JEPA 2 encoders under frozen backbone evaluation across diverse benchmarks. They are also more compute-optimal: at matched pretraining FLOPs, our method achieves higher probing accuracy, and its scaling curves dominate V-JEPA's accuracy-FLOPs Pareto frontier. Finally, we find that student quality is remarkably robust to teacher quality: high-performing students emerge even with small, sub-optimal teachers. This points to a compute budget allocation that should overwhelmingly favor the student. These results position SALT as a simple, scalable, and compute-efficient alternative to EMA-based self-distillation for video representation learning.

TLDR: The paper introduces SALT, a compute-efficient video SSL method that replaces the EMA-updated teacher with a frozen teacher trained by pixel reconstruction; its students outperform V-JEPA 2 encoders under frozen-backbone evaluation, scale better per pretraining FLOP, and remain robust to teacher quality.

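TLDR: This paper introduces SALT, a compute-efficient video self-supervised learning method that uses a frozen teacher network to improve video representation learning, outperforming V-JEPA in scalability and in robustness to teacher quality.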

Relevance: (4/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (6/10)

Authors: Xianhang Li, Chen Huang, Chun-Liang Li, Eran Malach, Josh Susskind, Vimal Thilak, Etai Littwin

AIGC Daily Papers

OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Learning Object-Centric Representations Based on Slots in Real World Scenarios

CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models

UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark

NeRV-Diffusion: Diffuse Implicit Neural Representations for Video Synthesis

Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

UniVid: The Open-Source Unified Video Model

Score Distillation of Flow Matching Models

STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation

PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion

Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel

Scalable GANs with Transformers

Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer

Environment-Aware Satellite Image Generation with Diffusion Models

Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

Instruction Guided Multi Object Image Editing with Quantity and Layout Consistency

Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs

NeoWorld: Neural Simulation of Explorable Virtual Worlds via Progressive 3D Unfolding

CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers

RapidMV: Leveraging Spatio-Angular Representations for Efficient and Consistent Text-to-Multi-View Synthesis

Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models

Semantic Editing with Coupled Stochastic Differential Equations

Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow

Autoregressive Video Generation beyond Next Frames Prediction

UniLat3D: Geometry-Appearance Unified Latents for Single-Stage 3D Generation

CharGen: Fast and Fluent Portrait Modification

Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation

GANji: A Framework for Introductory AI Image Generation

LayerD: Decomposing Raster Graphic Designs into Layers

Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers