AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

September 20, 2025

Lynx: Towards High-Fidelity Personalized Video Generation

We present Lynx, a high-fidelity model for personalized video synthesis from a single input image. Built on an open-source Diffusion Transformer (DiT) foundation model, Lynx introduces two lightweight adapters to ensure identity fidelity. The ID-adapter employs a Perceiver Resampler to convert ArcFace-derived facial embeddings into compact identity tokens for conditioning, while the Ref-adapter integrates dense VAE features from a frozen reference pathway, injecting fine-grained details across all transformer layers through cross-attention. These modules collectively enable robust identity preservation while maintaining temporal coherence and visual realism. On a curated benchmark of 40 subjects and 20 unbiased prompts (800 test cases in total), Lynx demonstrates superior face resemblance, competitive prompt following, and strong video quality, advancing the state of personalized video generation.
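The abstract does not include code, but the ID-adapter's conditioning path can be illustrated. Below is a minimal sketch, assuming PyTorch and made-up dimensions: a Perceiver-Resampler-style module compresses an ArcFace face embedding into a handful of identity tokens, which DiT blocks could then attend to via cross-attention; this is not the authors' implementation.

import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compress one ArcFace embedding into a few identity tokens (illustrative only)."""
    def __init__(self, face_dim=512, dim=1024, num_tokens=16, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.proj_in = nn.Linear(face_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, face_emb):                      # face_emb: (B, face_dim)
        kv = self.proj_in(face_emb).unsqueeze(1)      # (B, 1, dim)
        q = self.latents.unsqueeze(0).expand(face_emb.shape[0], -1, -1)
        tokens, _ = self.attn(q, kv, kv)              # learned latents attend to the face embedding
        return tokens + self.ff(tokens)               # (B, num_tokens, dim) identity tokens

# Usage: the identity tokens act as extra cross-attention context for the video DiT.
id_tokens = PerceiverResampler()(torch.randn(2, 512))
print(id_tokens.shape)  # torch.Size([2, 16, 1024])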

TLDR: The paper introduces Lynx, a high-fidelity personalized video generation model leveraging Diffusion Transformers and lightweight adapters for enhanced identity preservation and video quality with demonstrated improvement over existing methods.

Relevance: (10/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (9/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Shen Sang, Tiancheng Zhi, Tianpei Gu, Jing Liu, Linjie Luo

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
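As a rough illustration of the hybrid tokenizer described above (one shared vision encoder feeding a continuous adapter for understanding and a discrete adapter for generation), here is a hedged sketch; the encoder depth, dimensions, and nearest-codebook quantization are placeholder choices, not the paper's components.

import torch
import torch.nn as nn

class HybridVisionTokenizer(nn.Module):
    """Shared encoder with continuous and discrete adapters (illustrative sketch)."""
    def __init__(self, dim=1024, llm_dim=2048, codebook_size=8192):
        super().__init__()
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.cont_adapter = nn.Linear(dim, llm_dim)            # continuous embeddings for understanding
        self.disc_adapter = nn.Linear(dim, llm_dim)            # features to be quantized for generation
        self.codebook = nn.Embedding(codebook_size, llm_dim)   # discrete image-token vocabulary

    def forward(self, patches):                                # patches: (B, N, dim)
        feats = self.shared_encoder(patches)
        cont = self.cont_adapter(feats)                        # fed to the LLM for image-to-text
        q = self.disc_adapter(feats)                           # nearest-codebook lookup gives token ids
        ids = torch.cdist(q, self.codebook.weight.unsqueeze(0)).argmin(dim=-1)
        return cont, ids                                       # (B, N, llm_dim) and (B, N)

cont, ids = HybridVisionTokenizer()(torch.randn(1, 64, 1024))
print(cont.shape, ids.shape)  # torch.Size([1, 64, 2048]) torch.Size([1, 64])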

TLDR: The paper introduces Manzano, a unified multimodal LLM that uses a hybrid vision tokenizer and a tailored training recipe to achieve state-of-the-art performance in both image understanding and generation, while mitigating performance trade-offs.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen

AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models

Text-to-Image (T2I) models have recently achieved remarkable success in generating images from textual descriptions. However, challenges still persist in accurately rendering complex scenes where actions and interactions form the primary semantic focus. Our key observation in this work is that T2I models frequently struggle to capture the nuanced and often implicit attributes inherent in action depiction, leading to images that lack key contextual details. To enable systematic evaluation, we introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We further hypothesize that this shortcoming arises from the incomplete representation of the inherent attributes and contextual dependencies in the training corpora of existing T2I models. We build upon this by developing a training-free knowledge distillation technique that utilizes Large Language Models to address this limitation. Specifically, we enhance prompts by incorporating dense information across three dimensions, observing that injecting prompts with temporal details significantly improves image generation accuracy, with our best model achieving an increase of 72%. Our findings highlight the limitations of current T2I methods in generating images that require complex reasoning and demonstrate that integrating linguistic knowledge in a systematic way can notably advance the generation of nuanced and contextually accurate images.
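The training-free enhancement step (an LLM densifies an action-centric prompt before it reaches the T2I model) can be sketched as below; the instruction wording and the query_llm callable are placeholders, not the paper's actual prompts or distillation pipeline.

def enrich_action_prompt(prompt: str, query_llm) -> str:
    """Ask an LLM to densify an action-centric prompt with implicit details (illustrative).
    query_llm is any callable that sends an instruction to a chat LLM and returns its reply."""
    instruction = (
        "Rewrite the following text-to-image prompt so the action is unambiguous. "
        "Add the implicit temporal context (what happens just before and after the depicted "
        "instant), body pose, and object states, in one concise sentence.\n\n"
        f"Prompt: {prompt}"
    )
    return query_llm(instruction)

# e.g. "a man throwing a frisbee" could become a denser prompt describing the extended arm,
# the frisbee leaving the hand, and the follow-through; the T2I model is then conditioned
# on the enriched prompt instead of the original one.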

TLDR: The paper introduces AcT2I, a benchmark for evaluating Text-to-Image models on action depiction, and proposes a knowledge distillation technique using LLMs to improve performance, achieving significant accuracy gains by incorporating temporal information in prompts.

Relevance: (9/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Vatsal Malaviya, Agneet Chatterjee, Maitreya Patel, Yezhou Yang, Chitta Baral

Dynamic Classifier-Free Diffusion Guidance via Online Feedback

Classifier-free guidance (CFG) is a cornerstone of text-to-image diffusion models, yet its effectiveness is limited by the use of static guidance scales. This "one-size-fits-all" approach fails to adapt to the diverse requirements of different prompts; moreover, prior solutions like gradient-based correction or fixed heuristic schedules introduce additional complexities and fail to generalize. In this work, we challenge this static paradigm by introducing a framework for dynamic CFG scheduling. Our method leverages online feedback from a suite of general-purpose and specialized small-scale latent-space evaluators, such as CLIP for alignment, a discriminator for fidelity, and a human preference reward model, to assess generation quality at each step of the reverse diffusion process. Based on this feedback, we perform a greedy search to select the optimal CFG scale for each timestep, creating a unique guidance schedule tailored to every prompt and sample. We demonstrate the effectiveness of our approach on both small-scale models and the state-of-the-art Imagen 3, showing significant improvements in text alignment, visual quality, text rendering, and numerical reasoning. Notably, when compared against the default Imagen 3 baseline, our method achieves up to a 53.8% human preference win-rate for overall preference, a figure that increases to 55.5% on prompts targeting specific capabilities like text rendering. Our work establishes that the optimal guidance schedule is inherently dynamic and prompt-dependent, and provides an efficient and generalizable framework to achieve it.
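The greedy per-timestep search might look roughly like the sketch below, assuming a generic latent diffusion sampler; model, step_fn, and the scorers (standing in for CLIP alignment, a latent discriminator, and a preference reward model) are user-supplied placeholders, and the candidate scale set is arbitrary.

def sample_with_dynamic_cfg(model, step_fn, latent, timesteps, prompt_emb, null_emb,
                            scorers, candidate_scales=(1.0, 3.0, 5.0, 7.5, 10.0)):
    """Greedy per-timestep CFG scale selection from online feedback (illustrative sketch)."""
    for t in timesteps:
        eps_c = model(latent, t, prompt_emb)                   # conditional noise prediction
        eps_u = model(latent, t, null_emb)                     # unconditional noise prediction
        best = None
        for w in candidate_scales:
            eps = eps_u + w * (eps_c - eps_u)                  # classifier-free guidance at scale w
            cand = step_fn(latent, eps, t)                     # one reverse-diffusion update
            score = sum(s(cand, prompt_emb) for s in scorers)  # small latent-space evaluators
            if best is None or score > best[0]:
                best = (score, cand)
        latent = best[1]                                       # keep the best-scoring candidate
    return latent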

TLDR: The paper introduces a dynamic classifier-free guidance (CFG) method for text-to-image diffusion models that uses online feedback from latent-space evaluations to optimize the CFG scale at each timestep, resulting in improved generation quality and human preference win-rates.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Pinelopi Papalampidi, Olivia Wiles, Ira Ktena, Aleksandar Shtedritski, Emanuele Bugliarello, Ivana Kajic, Isabela Albuquerque, Aida Nematzadeh

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward-reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to 25× more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO achieves 0.95 with over 5k steps and additional CFG employment. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.
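The central idea, contrasting positive and negative generations inside a flow-matching objective on the forward process, might be paraphrased as in the sketch below; the interpolation, velocity target, and opposite-sign weighting are assumptions for illustration, not the paper's exact objective.

import torch

def nft_style_loss(model, pos_imgs, neg_imgs, cond, beta=1.0):
    """Flow-matching loss that pulls toward rewarded samples and away from penalized ones
    (an assumption-laden paraphrase of negative-aware fine-tuning, not the official loss)."""
    def fm_loss(x0):
        b = x0.shape[0]
        t = torch.rand(b, 1, 1, 1, device=x0.device)   # random time in [0, 1]
        noise = torch.randn_like(x0)
        x_t = (1 - t) * x0 + t * noise                 # rectified-flow interpolation
        v_target = noise - x0                          # forward-process velocity target
        v_pred = model(x_t, t.flatten(), cond)
        return ((v_pred - v_target) ** 2).mean()

    # Positive generations are fit as usual; negative ones enter with opposite sign,
    # so training needs only clean images and no likelihoods or sampling trajectories.
    return fm_loss(pos_imgs) - beta * fm_loss(neg_imgs)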

TLDR: The paper introduces DiffusionNFT, a new online reinforcement learning paradigm for diffusion models that optimizes directly on the forward process using flow matching, achieving significant efficiency gains over existing methods like FlowGRPO.

Relevance: (8/10)
Novelty: (9/10)
Clarity: (7/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, Ming-Yu Liu

Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation

Score Distillation Sampling (SDS) enables high-quality text-to-3D generation by supervising 3D models through the denoising of multi-view 2D renderings, using a pretrained text-to-image diffusion model to align with the input prompt and ensure 3D consistency. However, existing SDS-based methods face two fundamental limitations: (1) their reliance on CLIP-style text encoders leads to coarse semantic alignment and struggles with fine-grained prompts; and (2) 2D diffusion priors lack explicit 3D spatial constraints, resulting in geometric inconsistencies and inaccurate object relationships in multi-object scenes. To address these challenges, we propose VLM3D, a novel text-to-3D generation framework that integrates large vision-language models (VLMs) into the SDS pipeline as differentiable semantic and spatial priors. Unlike standard text-to-image diffusion priors, VLMs leverage rich language-grounded supervision that enables fine-grained prompt alignment. Moreover, their inherent vision language modeling provides strong spatial understanding, which significantly enhances 3D consistency for single-object generation and improves relational reasoning in multi-object scenes. We instantiate VLM3D based on the open-source Qwen2.5-VL model and evaluate it on the GPTeval3D benchmark. Experiments across diverse objects and complex scenes show that VLM3D significantly outperforms prior SDS-based methods in semantic fidelity, geometric coherence, and spatial correctness.
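One plausible way to use a VLM as a differentiable prior on rendered views is sketched below with assumed interfaces: the reward is the VLM's log-probability of answering "yes" to a prompt-faithfulness question, and render_views, log_prob_of_answer, and the question template are all hypothetical, not the paper's API.

def vlm_guidance_loss(vlm, render_views, scene_params, prompt, num_views=4):
    """Differentiable VLM feedback on multi-view renders (illustrative sketch)."""
    images = render_views(scene_params, num_views)       # (V, 3, H, W) from a differentiable renderer
    question = f"Does this image faithfully show: {prompt}? Answer yes or no."
    # Assumed interface: log-probability of the answer "yes" given image and question,
    # kept in the computation graph so gradients flow back to scene_params.
    logp_yes = vlm.log_prob_of_answer(images, question, answer="yes")
    return -logp_yes.mean()                              # minimizing this maximizes VLM agreement

# In an SDS-style loop this term would be added to the usual score-distillation loss and
# backpropagated into the 3D representation (e.g., NeRF or Gaussian parameters).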

TLDR: The paper introduces VLM3D, a novel text-to-3D generation framework utilizing Vision-Language Models (VLMs) within the Score Distillation Sampling (SDS) pipeline to enhance semantic fidelity, geometric coherence, and spatial correctness in generated 3D models, particularly for complex scenes and fine-grained prompts.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Weimin Bai, Yubo Li, Weijian Luo, Wenzheng Chen, He Sun

Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification

Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59, without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.
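The task-composition principle (each modality gets an encoder into the shared Gaussian latent space and a decoder out of it, and tasks are encoder-decoder pairings) can be shown with a hedged skeleton; zone construction and the actual encoder/decoder architectures are omitted.

import torch.nn as nn

class LatentZoningSketch(nn.Module):
    """Tasks expressed as compositions of per-modality encoders and decoders (illustrative)."""
    def __init__(self, image_enc, image_dec, label_enc, label_dec):
        super().__init__()
        self.enc = nn.ModuleDict({"image": image_enc, "label": label_enc})
        self.dec = nn.ModuleDict({"image": image_dec, "label": label_dec})

    def generate(self, label):      # label-conditional image generation: label encoder + image decoder
        return self.dec["image"](self.enc["label"](label))

    def embed(self, image):         # representation learning: image encoder alone
        return self.enc["image"](image)

    def classify(self, image):      # classification: image encoder + label decoder
        return self.dec["label"](self.enc["image"](image))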

TLDR: The paper introduces Latent Zoning Network (LZN), a unified framework using shared latent spaces to address generative modeling, representation learning, and classification, achieving state-of-the-art results in several tasks.

Relevance: (7/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Zinan Lin, Enshu Liu, Xuefei Ning, Junyi Zhu, Wenyu Wang, Sergey Yekhanin

SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models

World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale-wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4× faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.
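The hybrid decoding order (causal across frames, parallel scale-by-scale within each frame) might be organized as in the sketch below; predict_scale, detokenize, and the scale schedule are placeholders rather than the paper's modules.

def rollout(predict_scale, detokenize, past_frames, actions, num_frames, scales=(1, 2, 4, 8)):
    """Causal next-frame generation with parallel scale-wise decoding inside each frame
    (illustrative sketch of the decoding order only)."""
    frames = list(past_frames)
    for k in range(num_frames):
        frame_tokens = []                                # coarse-to-fine token maps for frame k
        for s in scales:                                 # token-map side lengths, coarse to fine
            # All tokens at scale s are predicted in parallel (bidirectional spatial attention),
            # conditioned causally on past frames, the current action, and coarser scales.
            frame_tokens.append(predict_scale(frames, actions[k], frame_tokens, s))
        frames.append(detokenize(frame_tokens))          # decode multi-scale tokens to a frame
    return frames[len(past_frames):]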

TLDR: The paper introduces SAMPO, a novel autoregressive world model that uses scale-wise autoregression with motion prompts for more coherent and efficient video prediction, particularly in action-conditioned scenarios.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Sen Wang, Jingyi Tian, Le Wang, Zhimin Liao, Jiayi Li, Huaiyi Dong, Kun Xia, Sanping Zhou, Wei Tang, Hua Gang

OpenViGA: Video Generation for Automotive Driving Scenes by Streamlining and Fine-Tuning Open Source Models with Public Data

Recent successful video generation systems that predict and create realistic automotive driving scenes from short video inputs assign tokenization, future state prediction (world model), and video decoding to dedicated models. These approaches often utilize large models that require significant training resources, offer limited insight into design choices, and lack publicly available code and datasets. In this work, we address these deficiencies and present OpenViGA, an open video generation system for automotive driving scenes. Our contributions are as follows. First, unlike several earlier works for video generation, such as GAIA-1, we provide a deep analysis of the three components of our system (image tokenizer, world model, video decoder) through separate quantitative and qualitative evaluation. Second, we build purely upon powerful pre-trained open-source models from various domains, which we fine-tune on publicly available automotive data (BDD100K) using GPU hardware at academic scale. Third, we build a coherent video generation system by streamlining the interfaces of our components. Fourth, due to the public availability of the underlying models and data, we allow full reproducibility. Finally, we publish our code and models on GitHub. For an image size of 256x256 at 4 fps, we are able to predict realistic driving-scene videos frame-by-frame with only one frame of algorithmic latency.
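The three-component pipeline (image tokenizer, world model, video decoder) chained frame by frame could be wired as in the sketch below; the component interfaces are assumptions for illustration, not the repository's actual API.

def predict_video(tokenizer, world_model, video_decoder, context_frames, num_future):
    """Frame-by-frame driving-scene prediction as a tokenize -> predict -> decode loop
    (illustrative; the real components come from the published OpenViGA code)."""
    tokens = [tokenizer.encode(f) for f in context_frames]    # discrete tokens per 256x256 frame
    outputs = []
    for _ in range(num_future):
        next_tokens = world_model.predict_next(tokens)        # future-state prediction
        frame = video_decoder.decode(next_tokens)             # back to pixels, one frame of latency
        outputs.append(frame)
        tokens.append(next_tokens)                            # feed the prediction back in
    return outputs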

TLDR: OpenViGA is an open-source video generation system for automotive driving scenes, built upon publicly available pre-trained models and datasets, offering reproducibility and detailed component analysis.

Relevance: (9/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Björn Möller, Zhengyang Li, Malte Stelzer, Thomas Graave, Fabian Bettels, Muaaz Ataya, Tim Fingscheidt

MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation

Text-to-image diffusion models achieve impressive realism but often suffer from compositional failures on prompts with multiple objects, attributes, and spatial relations, resulting in cross-token interference where entities entangle, attributes mix across objects, and spatial cues are violated. To address these failures, we propose MaskAttn-SDXL, a region-level gating mechanism applied to the cross-attention logits of Stable Diffusion XL (SDXL)'s UNet. MaskAttn-SDXL learns a binary mask per layer, injecting it into each cross-attention logit map before softmax to sparsify token-to-latent interactions so that only semantically relevant connections remain active. The method requires no positional encodings, auxiliary tokens, or external region masks, and preserves the original inference path with negligible overhead. In practice, our model improves spatial compliance and attribute binding in multi-object prompts while preserving overall image quality and diversity. These findings demonstrate that logit-level masked cross-attention is a data-efficient primitive for enforcing compositional control, and our method thus serves as a practical extension for spatial control in text-to-image generation.
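The core operation, gating the cross-attention logits before softmax so that only relevant token-to-latent connections survive, reduces to a small change inside attention; below is a hedged sketch with placeholder shapes and a soft gate by default, not the authors' code.

import torch
import torch.nn.functional as F

def masked_cross_attention(q, k, v, mask_logits, hard=False):
    """Cross-attention with a (near-)binary gate applied to the logits before softmax (illustrative).
    q: (B, Lq, D) latent queries; k, v: (B, Lt, D) text keys/values;
    mask_logits: (Lq, Lt) learned per-layer gating parameters."""
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("bqd,btd->bqt", q, k) * scale
    gate = (mask_logits > 0).float() if hard else torch.sigmoid(mask_logits)
    # Gated-out connections receive a large negative bias, so softmax suppresses them
    # and only the remaining token-to-latent interactions stay active.
    logits = logits + torch.log(gate.clamp_min(1e-6)).unsqueeze(0)
    attn = F.softmax(logits, dim=-1)
    return torch.einsum("bqt,btd->bqd", attn, v)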

TLDR: The paper proposes MaskAttn-SDXL, a region-level gating mechanism for Stable Diffusion XL that uses masked cross-attention logits to improve compositional control in text-to-image generation, particularly for multi-object prompts.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yu Chang, Jiahao Chen, Anzhe Cheng, Paul Bogdan

LowDiff: Efficient Diffusion Sampling with Low-Resolution Condition

Diffusion models have achieved remarkable success in image generation, but their practical application is often hindered by slow sampling speed. Prior efforts to improve efficiency primarily focus on compressing models or reducing the total number of denoising steps, largely neglecting the possibility of leveraging multiple input resolutions in the generation process. In this work, we propose LowDiff, a novel and efficient diffusion framework based on a cascaded approach that generates increasingly higher-resolution outputs. Moreover, LowDiff employs a unified model to progressively refine images from low resolution to the desired resolution. With the proposed architecture design and generation techniques, we achieve comparable or even superior performance with far fewer high-resolution sampling steps. LowDiff is applicable to diffusion models in both pixel space and latent space. Extensive experiments on both conditional and unconditional generation tasks across CIFAR-10, FFHQ, and ImageNet demonstrate the effectiveness and generality of our method. Results show over 50% throughput improvement across all datasets and settings while maintaining comparable or better quality. On unconditional CIFAR-10, LowDiff achieves an FID of 2.11 and an IS of 9.87, and on conditional CIFAR-10, an FID of 1.94 and an IS of 10.03. On FFHQ 64x64, LowDiff achieves an FID of 2.43, and on ImageNet 256x256, LowDiff built on LightningDiT-B/1 produces high-quality samples with an FID of 4.00 and an IS of 195.06, together with substantial efficiency gains.
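A hedged sketch of the cascaded idea: spend most denoising steps at low resolution, then upsample the result and run a short refinement at the target resolution with the same unified model conditioned on it. The sample_fn interface and step counts below are placeholders.

import torch.nn.functional as F

def lowdiff_style_sample(sample_fn, shape_lo, shape_hi, steps_lo=40, steps_hi=10):
    """Cascaded sampling: cheap low-resolution generation, then a short high-resolution
    refinement conditioned on the upsampled output (illustrative sketch)."""
    x_lo = sample_fn(shape=shape_lo, steps=steps_lo, cond=None)        # most steps at low resolution
    cond = F.interpolate(x_lo, size=shape_hi[-2:], mode="bilinear",
                         align_corners=False)                          # low-resolution condition
    return sample_fn(shape=shape_hi, steps=steps_hi, cond=cond)        # few high-resolution steps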

TLDR: The paper introduces LowDiff, a novel diffusion framework that leverages a cascaded approach with a unified model to efficiently generate high-resolution images from low-resolution conditions, achieving significant throughput improvements while maintaining comparable or superior quality.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Jiuyi Xu, Qing Jin, Meida Chen, Andrew Feng, Yang Sui, Yangming Shi

Kuramoto Orientation Diffusion Models

Orientation-rich images, such as fingerprints and textures, often exhibit coherent angular directional patterns that are challenging to model using standard generative approaches based on isotropic Euclidean diffusion. Motivated by the role of phase synchronization in biological systems, we propose a score-based generative model built on periodic domains by leveraging stochastic Kuramoto dynamics in the diffusion process. In neural and physical systems, Kuramoto models capture synchronization phenomena across coupled oscillators, a behavior that we re-purpose here as an inductive bias for structured image generation. In our framework, the forward process performs synchronization among phase variables through globally or locally coupled oscillator interactions and attraction to a global reference phase, gradually collapsing the data into a low-entropy von Mises distribution. The reverse process then performs desynchronization, generating diverse patterns by reversing the dynamics with a learned score function. This approach enables structured destruction during forward diffusion and a hierarchical generation process that progressively refines global coherence into fine-scale details. We implement wrapped Gaussian transition kernels and periodicity-aware networks to account for the circular geometry. Our method achieves competitive results on general image benchmarks and significantly improves generation quality on orientation-dense datasets like fingerprints and textures. Ultimately, this work demonstrates the promise of biologically inspired synchronization dynamics as structured priors in generative modeling.
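One forward (synchronization) step on phase-valued pixels might combine mean-field Kuramoto coupling, attraction to a global reference phase, and wrapped Gaussian noise, as in the sketch below; the coefficients, global coupling, and schedule are placeholder assumptions, not the paper's dynamics.

import torch

def kuramoto_forward_step(theta, dt=0.01, coupling=1.0, gamma=0.5, sigma=0.2, theta_ref=0.0):
    """One forward synchronization step on phases theta in [-pi, pi) (illustrative sketch).
    theta: (B, C, H, W) angular values such as local orientations."""
    mean_field = torch.atan2(torch.sin(theta).mean(dim=(-2, -1), keepdim=True),
                             torch.cos(theta).mean(dim=(-2, -1), keepdim=True))
    drift = coupling * torch.sin(mean_field - theta)       # global coupling pulls phases together
    drift = drift + gamma * torch.sin(theta_ref - theta)   # attraction to a global reference phase
    theta = theta + dt * drift + sigma * (dt ** 0.5) * torch.randn_like(theta)
    return torch.remainder(theta + torch.pi, 2 * torch.pi) - torch.pi  # wrap back to [-pi, pi)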

TLDR: The paper introduces a novel score-based generative model, Kuramoto Orientation Diffusion Models, inspired by biological synchronization, to generate orientation-rich images such as fingerprints and textures more effectively than standard methods.

Relevance: (7/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (7/10)
Read Paper (PDF)

Authors: Yue Song, T. Anderson Keller, Sevan Brodjian, Takeru Miyato, Yisong Yue, Pietro Perona, Max Welling