ArXiv CS.CV Papers (Image/Video Generation)

Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models

We introduce Equilibrium Matching (EqM), a generative modeling framework built from an equilibrium dynamics perspective. EqM discards the non-equilibrium, time-conditional dynamics in traditional diffusion and flow-based generative models and instead learns the equilibrium gradient of an implicit energy landscape. Through this approach, we can adopt an optimization-based sampling process at inference time, where samples are obtained by gradient descent on the learned landscape with adjustable step sizes, adaptive optimizers, and adaptive compute. EqM surpasses the generation performance of diffusion/flow models empirically, achieving an FID of 1.90 on ImageNet 256$\times$256. EqM is also theoretically justified to learn and sample from the data manifold. Beyond generation, EqM is a flexible framework that naturally handles tasks including partially noised image denoising, OOD detection, and image composition. By replacing time-conditional velocities with a unified equilibrium landscape, EqM offers a tighter bridge between flow and energy-based models and a simple route to optimization-driven inference.

TLDR: This paper introduces Equilibrium Matching (EqM), a generative modeling framework using an implicit energy landscape for optimization-based sampling, achieving state-of-the-art image generation performance and offering a unified approach to various image tasks.

TLDR: 这篇论文介绍了平衡匹配（EqM），一种使用隐性能量景观进行基于优化的采样的生成建模框架，实现了最先进的图像生成性能，并为各种图像任务提供了一种统一的方法。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (9/10)

Overall: (9/10)

Read Paper (PDF)

Authors: Runqian Wang, Yilun Du

Pack and Force Your Memory: Long-form and Consistent Video Generation

Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. To address these challenges, we make two contributions. First, for dynamic context modeling, we propose MemoryPack, a learnable context-retrieval mechanism that leverages both textual and image information as global guidance to jointly model short- and long-term dependencies, achieving minute-level temporal consistency. This design scales gracefully with video length, preserves computational efficiency, and maintains linear complexity. Second, to mitigate error accumulation, we introduce Direct Forcing, an efficient single-step approximating strategy that improves training-inference alignment and thereby curtails error propagation during inference. Together, MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation, advancing the practical usability of autoregressive video models.

TLDR: This paper introduces MemoryPack and Direct Forcing to improve long-form video generation by enhancing temporal consistency and mitigating error accumulation, respectively.

TLDR: 该论文介绍了MemoryPack和Direct Forcing，分别用于提高长视频生成的时间一致性和减少误差累积，从而改善长视频生成质量。

Relevance: (10/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (9/10)

Read Paper (PDF)

Authors: Xiaofei Wu, Guozhen Zhang, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Xuming He

DisCo: Reinforcement with Diversity Constraints for Multi-Human Generation

State-of-the-art text-to-image models excel at realism but collapse on multi-human prompts - duplicating faces, merging identities, and miscounting individuals. We introduce DisCo (Reinforcement with Diversity Constraints), the first RL-based framework to directly optimize identity diversity in multi-human generation. DisCo fine-tunes flow-matching models via Group-Relative Policy Optimization (GRPO) with a compositional reward that (i) penalizes intra-image facial similarity, (ii) discourages cross-sample identity repetition, (iii) enforces accurate person counts, and (iv) preserves visual fidelity through human preference scores. A single-stage curriculum stabilizes training as complexity scales, requiring no extra annotations. On the DiverseHumans Testset, DisCo achieves 98.6 Unique Face Accuracy and near-perfect Global Identity Spread - surpassing both open-source and proprietary methods (e.g., Gemini, GPT-Image) while maintaining competitive perceptual quality. Our results establish DisCo as a scalable, annotation-free solution that resolves the long-standing identity crisis in generative models and sets a new benchmark for compositional multi-human generation.

TLDR: The paper introduces DisCo, a reinforcement learning framework with diversity constraints, to improve identity diversity in multi-human image generation, outperforming existing methods without requiring extra annotations.

TLDR: 该论文介绍了DisCo，一个具有多样性约束的强化学习框架，旨在提高多人图像生成中的身份多样性，无需额外注释即可胜过现有方法。

Relevance: (9/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (9/10)

Read Paper (PDF)

Authors: Shubhankar Borse, Farzad Farhadzadeh, Munawar Hayat, Fatih Porikli

NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Text-to-image diffusion models trained on a fixed set of resolutions often fail to generalize, even when asked to generate images at lower resolutions than those seen during training. High-resolution text-to-image generators are currently unable to easily offer an out-of-the-box budget-efficient alternative to their users who might not need high-resolution images. We identify a key technical insight in diffusion models that when addressed can help tackle this limitation: Noise schedulers have unequal perceptual effects across resolutions. The same level of noise removes disproportionately more signal from lower-resolution images than from high-resolution images, leading to a train-test mismatch. We propose NoiseShift, a training-free method that recalibrates the noise level of the denoiser conditioned on resolution size. NoiseShift requires no changes to model architecture or sampling schedule and is compatible with existing models. When applied to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, quality at low resolutions is significantly improved. On LAION-COCO, NoiseShift improves SD3.5 by 15.89%, SD3 by 8.56%, and Flux-Dev by 2.44% in FID on average. On CelebA, NoiseShift improves SD3.5 by 10.36%, SD3 by 5.19%, and Flux-Dev by 3.02% in FID on average. These results demonstrate the effectiveness of NoiseShift in mitigating resolution-dependent artifacts and enhancing the quality of low-resolution image generation.

TLDR: The paper presents NoiseShift, a training-free method to recalibrate noise levels in diffusion models conditioned on resolution, improving low-resolution image generation quality without model changes, and demonstrating significant FID improvements on Stable Diffusion models and Flux-Dev.

TLDR: 该论文提出了 NoiseShift，一种无需训练的方法，可根据分辨率重新校准扩散模型中的噪声水平，从而提高低分辨率图像的生成质量，无需更改模型，并展示了 Stable Diffusion 模型和 Flux-Dev 在 FID 方面的显着改进。

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Ruozhen He, Moayed Haji-Ali, Ziyan Yang, Vicente Ordonez

Continual Personalization for Diffusion Models

Updating diffusion models in an incremental setting would be practical in real-world applications yet computationally challenging. We present a novel learning strategy of Concept Neuron Selection (CNS), a simple yet effective approach to perform personalization in a continual learning scheme. CNS uniquely identifies neurons in diffusion models that are closely related to the target concepts. In order to mitigate catastrophic forgetting problems while preserving zero-shot text-to-image generation ability, CNS finetunes concept neurons in an incremental manner and jointly preserves knowledge learned of previous concepts. Evaluation of real-world datasets demonstrates that CNS achieves state-of-the-art performance with minimal parameter adjustments, outperforming previous methods in both single and multi-concept personalization works. CNS also achieves fusion-free operation, reducing memory storage and processing time for continual personalization.

TLDR: The paper introduces Concept Neuron Selection (CNS), a novel continual learning approach for personalizing diffusion models by selectively fine-tuning concept-related neurons, mitigating catastrophic forgetting and preserving zero-shot generation capabilities with minimal parameter adjustments.

TLDR: 该论文介绍了一种名为概念神经元选择 (CNS) 的新型持续学习方法，通过选择性地微调与概念相关的神经元来个性化扩散模型，从而减少灾难性遗忘并保留零样本生成能力，且只需最少的参数调整。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Yu-Chien Liao, Jr-Jen Chen, Chi-Pin Huang, Ci-Siang Lin, Meng-Lin Wu, Yu-Chiang Frank Wang

Test-Time Anchoring for Discrete Diffusion Posterior Sampling

We study the problem of posterior sampling using pretrained discrete diffusion foundation models, aiming to recover images from noisy measurements without retraining task-specific models. While diffusion models have achieved remarkable success in generative modeling, most advances rely on continuous Gaussian diffusion. In contrast, discrete diffusion offers a unified framework for jointly modeling categorical data such as text and images. Beyond unification, discrete diffusion provides faster inference, finer control, and principled training-free Bayesian inference, making it particularly well-suited for posterior sampling. However, existing approaches to discrete diffusion posterior sampling face severe challenges: derivative-free guidance yields sparse signals, continuous relaxations limit applicability, and split Gibbs samplers suffer from the curse of dimensionality. To overcome these limitations, we introduce Anchored Posterior Sampling (APS) for masked diffusion foundation models, built on two key innovations -- quantized expectation for gradient-like guidance in discrete embedding space, and anchored remasking for adaptive decoding. Our approach achieves state-of-the-art performance among discrete diffusion samplers across linear and nonlinear inverse problems on the standard benchmarks. We further demonstrate the benefits of our approach in training-free stylization and text-guided editing.

TLDR: The paper introduces Anchored Posterior Sampling (APS) for discrete diffusion models, a novel method for posterior sampling that addresses the limitations of existing techniques and achieves state-of-the-art performance in inverse problems and editing tasks.

TLDR: 该论文介绍了锚定后验采样 (APS)，这是一种用于离散扩散模型的新颖后验采样方法，解决了现有技术的局限性，并在反问题和编辑任务中实现了最先进的性能。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (7/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Litu Rout, Andreas Lugmayr, Yasamin Jafarian, Srivatsan Varadharajan, Constantine Caramanis, Sanjay Shakkottai, Ira Kemelmacher-Shlizerman

MultiModal Action Conditioned Video Generation

Current video models fail as world model as they lack fine-graiend control. General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. We consider senses of proprioception, kinesthesia, force haptics, and muscle activation. Such multimodal senses naturally enables fine-grained interactions that are difficult to simulate with text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further propose a regularization scheme to enhance causality of the action trajectory features in representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.

TLDR: The paper introduces a multimodal action-conditioned video generation model that incorporates proprioception, kinesthesia, force haptics, and muscle activation to achieve fine-grained control in simulated environments, demonstrating improved simulation accuracy and reduced temporal drift.

TLDR: 该论文介绍了一种多模态动作条件视频生成模型，该模型结合了本体感受、动觉、力触觉和肌肉激活，以在模拟环境中实现细粒度控制，从而提高了模拟精度并减少了时间漂移。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Yichen Li, Antonio Torralba

Learning to Generate Object Interactions with Physics-Guided Video Diffusion

Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements of object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, leading to effective support for synthesis of complex dynamical phenomena. Extensive experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available.

TLDR: The paper introduces KineMask, a physics-guided video diffusion approach for generating realistic object interactions with rigid body control, demonstrating improved performance on synthetic and real-world scenes.

TLDR: 该论文介绍了KineMask，一种基于物理引导的视频扩散方法，用于生成具有刚体控制的逼真物体交互，并在合成和真实场景中表现出改进的性能。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: David Romero, Ariana Bermudez, Hao Li, Fabio Pizzati, Ivan Laptev

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher's capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at https://self-forcing-plus-plus.github.io/

TLDR: This paper introduces Self-Forcing++, a method to improve the quality and temporal consistency of long-horizon video generation using diffusion models, achieving up to 4-minute videos without long-video teacher supervision or retraining.

TLDR: 本文介绍了 Self-Forcing++，一种用于改进长时程视频生成的质量和时间一致性的方法，它使用扩散模型，无需长视频教师监督或重新训练即可生成长达 4 分钟的视频。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh

DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing

Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models, Stable Diffusion, are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX's rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.

TLDR: DragFlow is a new drag-based image editing framework leveraging the strong priors of DiT models like FLUX, using region-based supervision and integration of personalization adapters to achieve state-of-the-art results.

TLDR: DragFlow是一个新的基于拖拽的图像编辑框架，它利用了DiT模型（如FLUX）的强大先验知识，通过基于区域的监督和个性化适配器的集成，实现了最先进的结果。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Zihan Zhou, Shilin Lu, Shuli Leng, Shaocong Zhang, Zhuming Lian, Xinlei Yu, Adams Wai-Kin Kong

FreeViS: Training-free Video Stylization with Inconsistent References

Video stylization plays a key role in content creation, but it remains a challenging problem. Na\"ively applying image stylization frame-by-frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references to a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works, without introducing flickers and stutters. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. Through extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economic solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via https://xujiacong.github.io/FreeViS/

TLDR: The paper introduces FreeViS, a training-free video stylization framework that leverages multiple stylized references and a pretrained image-to-video model to achieve temporally coherent and stylistically rich video stylization.

TLDR: 该论文介绍了一种名为FreeViS的免训练视频风格化框架，该框架利用多个风格化参考和预训练的图像到视频模型，以实现时间连贯且风格丰富的视频风格化。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Jiacong Xu, Yiqun Mei, Ke Zhang, Vishal M. Patel

Growing Visual Generative Capacity for Pre-Trained MLLMs

Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models remains challenging: hybrid approaches combine continuous embeddings with diffusion or flow-based objectives, producing high-quality images but breaking the autoregressive paradigm, while pure autoregressive approaches unify text and image prediction over discrete visual tokens but often face trade-offs between semantic alignment and pixel-level fidelity. In this work, we present Bridge, a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability through a Mixture-of-Transformers architecture, enabling both image understanding and generation within a single next-token prediction framework. To further improve visual generation fidelity, we propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens, achieving strong language alignment and precise description of visual details with only a 7.9% increase in sequence length. Extensive experiments across diverse multimodal benchmarks demonstrate that Bridge achieves competitive or superior results in both understanding and generation benchmarks, while requiring less training data and reduced training time compared to prior unified MLLMs.

TLDR: The paper introduces Bridge, a unified MLLM using a Mixture-of-Transformers architecture for both image understanding and generation, achieving competitive performance with less training data.

TLDR: 该论文介绍了Bridge，一种统一的MLLM，使用混合Transformer架构进行图像理解和生成，并在更少训练数据的情况下实现了有竞争力的性能。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Hanyu Wang, Jiaming Han, Ziyan Yang, Qi Zhao, Shanchuan Lin, Xiangyu Yue, Abhinav Shrivastava, Zhenheng Yang, Hao Chen

Towards Better Optimization For Listwise Preference in Diffusion Models

Reinforcement learning from human feedback (RLHF) has proven effectiveness for aligning text-to-image (T2I) diffusion models with human preferences. Although Direct Preference Optimization (DPO) is widely adopted for its computational efficiency and avoidance of explicit reward modeling, its applications to diffusion models have primarily relied on pairwise preferences. The precise optimization of listwise preferences remains largely unaddressed. In practice, human feedback on image preferences often contains implicit ranked information, which conveys more precise human preferences than pairwise comparisons. In this work, we propose Diffusion-LPO, a simple and effective framework for Listwise Preference Optimization in diffusion models with listwise data. Given a caption, we aggregate user feedback into a ranked list of images and derive a listwise extension of the DPO objective under the Plackett-Luce model. Diffusion-LPO enforces consistency across the entire ranking by encouraging each sample to be preferred over all of its lower-ranked alternatives. We empirically demonstrate the effectiveness of Diffusion-LPO across various tasks, including text-to-image generation, image editing, and personalized preference alignment. Diffusion-LPO consistently outperforms pairwise DPO baselines on visual quality and preference alignment.

TLDR: The paper introduces Diffusion-LPO, a method for optimizing diffusion models using listwise preferences derived from ranked image feedback, outperforming pairwise DPO baselines in text-to-image generation, image editing and personalized preference alignment.

TLDR: 该论文介绍了Diffusion-LPO，一种使用来自排序图像反馈的列表式偏好优化扩散模型的方法，在文本到图像生成、图像编辑和个性化偏好对齐方面优于成对DPO基线。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Jiamu Bai, Xin Yu, Meilong Xu, Weitao Lu, Xin Pan, Kiwan Maeng, Daniel Kifer, Jian Wang, Yu Wang

Purrception: Variational Flow Matching for Vector-Quantized Image Generation

We introduce Purrception, a variational flow matching approach for vector-quantized image generation that provides explicit categorical supervision while maintaining continuous transport dynamics. Our method adapts Variational Flow Matching to vector-quantized latents by learning categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space. This combines the geometric awareness of continuous methods with the discrete supervision of categorical approaches, enabling uncertainty quantification over plausible codes and temperature-controlled generation. We evaluate Purrception on ImageNet-1k 256x256 generation. Training converges faster than both continuous flow matching and discrete flow matching baselines while achieving competitive FID scores with state-of-the-art models. This demonstrates that Variational Flow Matching can effectively bridge continuous transport and discrete supervision for improved training efficiency in image generation.

TLDR: Purrception combines variational flow matching with vector-quantized image generation, achieving faster convergence and competitive FID scores on ImageNet-1k. It bridges continuous transport and discrete supervision for efficient image generation.

TLDR: Purrception将变分流匹配与向量量化图像生成相结合，在ImageNet-1k上实现了更快的收敛速度和具有竞争力的FID分数。它弥合了连续传输和离散监督之间的差距，从而提高了图像生成的效率。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Răzvan-Andrei Matişan, Vincent Tao Hu, Grigory Bartosh, Björn Ommer, Cees G. M. Snoek, Max Welling, Jan-Willem van de Meent, Mohammad Mahdi Derakhshani, Floor Eijkelboom

Image Generation Based on Image Style Extraction

Image generation based on text-to-image generation models is a task with practical application scenarios that fine-grained styles cannot be precisely described and controlled in natural language, while the guidance information of stylized reference images is difficult to be directly aligned with the textual conditions of traditional textual guidance generation. This study focuses on how to maximize the generative capability of the pretrained generative model, by obtaining fine-grained stylistic representations from a single given stylistic reference image, and injecting the stylistic representations into the generative body without changing the structural framework of the downstream generative model, so as to achieve fine-grained controlled stylized image generation. In this study, we propose a three-stage training style extraction-based image generation method, which uses a style encoder and a style projection layer to align the style representations with the textual representations to realize fine-grained textual cue-based style guide generation. In addition, this study constructs the Style30k-captions dataset, whose samples contain a triad of images, style labels, and text descriptions, to train the style encoder and style projection layer in this experiment.

TLDR: This paper proposes a three-stage training method for fine-grained stylized image generation using a style encoder and projection layer, trained on a newly constructed dataset (Style30k-captions) to align style representations with textual cues.

TLDR: 本文提出了一种三阶段训练方法，用于生成精细风格化图像。该方法使用风格编码器和投影层，并在新建的数据集（Style30k-captions）上进行训练，以将风格表示与文本线索对齐。

Relevance: (9/10)

Novelty: (7/10)

Clarity: (8/10)

Potential Impact: (7/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Shuochen Chang

4DGS-Craft: Consistent and Interactive 4D Gaussian Splatting Editing

Recent advances in 4D Gaussian Splatting (4DGS) editing still face challenges with view, temporal, and non-editing region consistency, as well as with handling complex text instructions. To address these issues, we propose 4DGS-Craft, a consistent and interactive 4DGS editing framework. We first introduce a 4D-aware InstructPix2Pix model to ensure both view and temporal consistency. This model incorporates 4D VGGT geometry features extracted from the initial scene, enabling it to capture underlying 4D geometric structures during editing. We further enhance this model with a multi-view grid module that enforces consistency by iteratively refining multi-view input images while jointly optimizing the underlying 4D scene. Furthermore, we preserve the consistency of non-edited regions through a novel Gaussian selection mechanism, which identifies and optimizes only the Gaussians within the edited regions. Beyond consistency, facilitating user interaction is also crucial for effective 4DGS editing. Therefore, we design an LLM-based module for user intent understanding. This module employs a user instruction template to define atomic editing operations and leverages an LLM for reasoning. As a result, our framework can interpret user intent and decompose complex instructions into a logical sequence of atomic operations, enabling it to handle intricate user commands and further enhance editing performance. Compared to related works, our approach enables more consistent and controllable 4D scene editing. Our code will be made available upon acceptance.

TLDR: This paper introduces 4DGS-Craft, a framework for consistent and interactive editing of 4D Gaussian Splatting scenes using a 4D-aware InstructPix2Pix model, a multi-view grid module, and an LLM-based user intent understanding module.

TLDR: 本文介绍了 4DGS-Craft，一个用于一致且可交互式编辑 4D 高斯溅射场景的框架，它使用了一个 4D 感知的 InstructPix2Pix 模型、一个多视图网格模块和一个基于 LLM 的用户意图理解模块。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Lei Liu, Can Wang, Zhenghao Chen, Dong Xu

$\text{G}^2$RPO: Granular GRPO for Precise Reward in Flow Models

The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDE) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO ($\text{G}^2$RPO ) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our $\text{G}^2$RPO significantly outperforms existing flow-based GRPO baselines,highlighting its effectiveness and robustness.

TLDR: The paper introduces $\text{G}^2$RPO, a novel framework that improves reward assessment in reinforcement learning for flow models by using a Singular Stochastic Sampling strategy and a Multi-Granularity Advantage Integration module. Experiments show it outperforms existing methods.

TLDR: 该论文介绍了$\text{G}^2$RPO，一种新颖的框架，通过使用单数随机抽样策略和多粒度优势集成模块，改进了流模型强化学习中的奖励评估。实验表明，该方法优于现有方法。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, Guangtao Zhai

Unsupervised Dynamic Feature Selection for Robust Latent Spaces in Vision Tasks

Latent representations are critical for the performance and robustness of machine learning models, as they encode the essential features of data in a compact and informative manner. However, in vision tasks, these representations are often affected by noisy or irrelevant features, which can degrade the model's performance and generalization capabilities. This paper presents a novel approach for enhancing latent representations using unsupervised Dynamic Feature Selection (DFS). For each instance, the proposed method identifies and removes misleading or redundant information in images, ensuring that only the most relevant features contribute to the latent space. By leveraging an unsupervised framework, our approach avoids reliance on labeled data, making it broadly applicable across various domains and datasets. Experiments conducted on image datasets demonstrate that models equipped with unsupervised DFS achieve significant improvements in generalization performance across various tasks, including clustering and image generation, while incurring a minimal increase in the computational cost.

TLDR: This paper introduces an unsupervised Dynamic Feature Selection (DFS) method to improve the robustness and generalization of latent representations in vision tasks by removing noisy or irrelevant features. Experiments show improvements in clustering and image generation.

TLDR: 本文提出了一种无监督的动态特征选择（DFS）方法，通过去除噪声或不相关的特征，来提高视觉任务中潜在表示的鲁棒性和泛化能力。实验表明，该方法在聚类和图像生成方面有所改进。

Relevance: (7/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Bruno Corcuera, Carlos Eiras-Franco, Brais Cancela

AIGC Daily Papers

Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models

Pack and Force Your Memory: Long-form and Consistent Video Generation

DisCo: Reinforcement with Diversity Constraints for Multi-Human Generation

NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Continual Personalization for Diffusion Models

Test-Time Anchoring for Discrete Diffusion Posterior Sampling

MultiModal Action Conditioned Video Generation

Learning to Generate Object Interactions with Physics-Guided Video Diffusion

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing

FreeViS: Training-free Video Stylization with Inconsistent References

Growing Visual Generative Capacity for Pre-Trained MLLMs

Towards Better Optimization For Listwise Preference in Diffusion Models

Purrception: Variational Flow Matching for Vector-Quantized Image Generation

Image Generation Based on Image Style Extraction

4DGS-Craft: Consistent and Interactive 4D Gaussian Splatting Editing

$\text{G}^2$RPO: Granular GRPO for Precise Reward in Flow Models

Unsupervised Dynamic Feature Selection for Robust Latent Spaces in Vision Tasks