ArXiv CS.CV Papers (Image/Video Generation)

SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation

Most text-to-video (T2V) generators prioritize aesthetic quality but often ignore spatial constraints in the generated videos. In this work, we present SPATIALALIGN, a self-improvement framework that enhances T2V models' ability to depict Dynamic Spatial Relationships (DSR) specified in text prompts. We present a zeroth-order regularized Direct Preference Optimization (DPO) to fine-tune T2V models towards better alignment with DSR. Specifically, we design DSR-SCORE, a geometry-based metric that quantitatively measures the alignment between generated videos and the DSRs specified in prompts, which is a step forward from prior works that rely on VLMs for evaluation. We also construct a dataset of text-video pairs with diverse DSRs to facilitate the study. Extensive experiments demonstrate that our fine-tuned model significantly outperforms the baseline in spatial relationships. The code will be released.
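
For readers unfamiliar with DPO, the sketch below shows the standard DPO loss for a single preference pair. It is not the zeroth-order regularized variant proposed in the paper, whose details the abstract does not give. The function name, argument names, and the default `beta` are illustrative assumptions; in a self-improvement setup the preferred and rejected samples would presumably be ranked by a score such as DSR-SCORE.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    All arguments are log-likelihoods: the first two under the model being
    fine-tuned, the last two under the frozen reference model.
    (Hypothetical helper; not the paper's regularized objective.)
    """
    # How much more the policy favors the preferred sample than the reference does.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); guard against overflow for very negative z.
    if z < -30.0:
        return -z
    return math.log1p(math.exp(-z))


# A policy that already prefers the winning sample gets a lower loss.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
print(dpo_loss(-12.0, -10.0, -11.0, -11.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2.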

TLDR: SPATIALALIGN is a self-improvement framework using DPO and a geometry-based metric (DSR-SCORE) to fine-tune text-to-video models for better alignment with dynamic spatial relationships specified in text prompts, also introducing a new dataset for this task.

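TLDR: SPATIALALIGN is a self-improvement framework that uses DPO and a geometry-based metric (DSR-SCORE) to fine-tune text-to-video models so that they align better with the dynamic spatial relationships specified in text prompts. The study also proposes a new dataset for this task.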

Relevance: (10/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (9/10)

Read Paper (PDF)

Authors: Fengming Liu, Tat-Jen Cham, Chuanxia Zheng

Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling

Generative models trained on sensitive image datasets risk memorizing and reproducing individual training examples, making strong privacy guarantees essential. While differential privacy (DP) provides a principled framework for such guarantees, standard DP finetuning (e.g., with DP-SGD) often results in severe degradation of image quality, particularly in high-frequency textures, due to the indiscriminate addition of noise across all model parameters. In this work, we propose a spectral DP framework based on the hypothesis that the most privacy-sensitive portions of an image are often low-frequency components in the wavelet space (e.g., facial features and object shapes) while high-frequency components are largely generic and public. Based on this hypothesis, we propose the following two-stage framework for DP image generation with coarse image intermediaries: (1) DP finetune an autoregressive spectral image tokenizer model on the low-resolution wavelet coefficients of the sensitive images, and (2) perform high-resolution upsampling using a publicly pretrained super-resolution model. By restricting the privacy budget to the global structures of the image in the first stage, and leveraging the post-processing property of DP for detail refinement, we achieve promising trade-offs between privacy and utility. Experiments on the MS-COCO and MM-CelebA-HQ datasets show that our method generates images with improved quality and style capture relative to other leading DP image frameworks.
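
To make the "low-frequency wavelet coefficients" concrete, here is a minimal sketch that computes the low-frequency (LL) band of a one-level 2D Haar transform. The choice of the Haar wavelet, the function name, and the list-of-lists image format are assumptions for illustration; the abstract does not say which wavelet or tokenizer the authors use.

```python
def haar_lowpass(image):
    """Return the LL band of a one-level orthonormal 2D Haar transform.

    `image` is a list of equal-length rows of numbers with even height and
    width. Each output value summarizes one 2x2 block, so the result has
    half the resolution in each dimension. (Illustrative helper only.)
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("this sketch assumes even image dimensions")
    return [
        [
            # Orthonormal Haar scaling: sum of the 2x2 block divided by 2.
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 2.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


coarse = haar_lowpass([
    [1, 1, 4, 4],
    [1, 1, 4, 4],
    [0, 2, 8, 8],
    [2, 0, 8, 8],
])
print(coarse)
```

In the two-stage pipeline described above, only a coarse representation like this would be touched by DP training; the missing detail is restored afterwards by a public super-resolution model, which costs no additional privacy budget because it is post-processing.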

TLDR: This paper introduces a two-stage differentially private image generation framework that leverages wavelet decomposition, applying the privacy budget primarily to low-frequency components and using a public super-resolution model for upsampling, achieving improved privacy-utility trade-offs.

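TLDR: The paper introduces a two-stage differentially private image generation framework that uses wavelet decomposition, spends the privacy budget mainly on low-frequency components, and upsamples with a public super-resolution model, yielding an improved privacy-utility trade-off.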

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Jasmine Bayrooti, Weiwei Kong, Natalia Ponomareva, Carlos Esteves, Ameesh Makadia, Amanda Prorok

ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation

Colonoscopy video generation delivers dynamic, information-rich data critical for diagnosing intestinal diseases, particularly in data-scarce scenarios. High-quality video generation demands temporal consistency and precise control over clinical attributes, but faces challenges from irregular intestinal structures, diverse disease representations, and various imaging modalities. To this end, we propose ColoDiff, a diffusion-based framework that generates dynamic-consistent and content-aware colonoscopy videos, aiming to alleviate data shortage and assist clinical analysis. At the inter-frame level, our TimeStream module decouples temporal dependency from video sequences through a cross-frame tokenization mechanism, enabling intricate dynamic modeling despite irregular intestinal structures. At the intra-frame level, our Content-Aware module incorporates noise-injected embeddings and learnable prototypes to realize precise control over clinical attributes, breaking through the coarse guidance of diffusion models. Additionally, ColoDiff employs a non-Markovian sampling strategy that cuts steps by over 90% for real-time generation. ColoDiff is evaluated across three public datasets and one hospital database, based on both generation metrics and downstream tasks including disease diagnosis, modality discrimination, bowel preparation scoring, and lesion segmentation. Extensive experiments show ColoDiff generates videos with smooth transitions and rich dynamics. ColoDiff presents an effort in controllable colonoscopy video generation, revealing the potential of synthetic videos in complementing authentic representation and mitigating data scarcity in clinical settings.
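
The abstract does not specify the non-Markovian sampler. One common way such samplers cut the step count is to denoise along an evenly strided subsequence of the training timesteps, as DDIM-style samplers do. The sketch below illustrates only that schedule; the function name and the 1000/50 step counts are assumptions, not values from the paper.

```python
def strided_timesteps(train_steps, sample_steps):
    """Evenly spaced, descending subset of timesteps in [0, train_steps).

    Sampling visits only these timesteps instead of all `train_steps`.
    (Hypothetical helper showing a DDIM-style schedule.)
    """
    if not 1 <= sample_steps <= train_steps:
        raise ValueError("sample_steps must be between 1 and train_steps")
    # Integer arithmetic keeps the selected timesteps distinct and in range.
    ascending = [(i * train_steps) // sample_steps for i in range(sample_steps)]
    return ascending[::-1]


schedule = strided_timesteps(1000, 50)
reduction = 1 - len(schedule) / 1000
print(schedule[:3], schedule[-1], reduction)
```

With these assumed numbers the sampler runs 50 steps instead of 1000, a 95% reduction, which is in the range the abstract describes.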

TLDR: ColoDiff is a diffusion-based framework for generating dynamic and content-aware colonoscopy videos, addressing data scarcity and assisting clinical analysis with novel TimeStream and Content-Aware modules, plus a non-Markovian sampling strategy for speed.

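TLDR: ColoDiff is a diffusion-based framework for generating dynamic, content-aware colonoscopy videos. With its novel TimeStream and Content-Aware modules and a non-Markovian sampling strategy for acceleration, it addresses data scarcity and assists clinical analysis.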

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Junhu Fu, Shuyu Liang, Wutong Li, Chen Ma, Peng Huang, Kehao Wang, Ke Chen, Shengli Lin, Pinghong Zhou, Zeju Li, Yuanyuan Wang, Yi Guo

UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.

TLDR: The paper introduces UCM, a new framework for world models that uses time-aware positional encoding warping to unify long-term memory and camera control in video generation, demonstrating improved consistency and controllability.

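TLDR: The paper introduces UCM, a new world-model framework that uses time-aware positional encoding warping to unify long-term memory and camera control in video generation, showing improved consistency and controllability.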

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Tianxing Xu, Zixuan Wang, Guangyuan Wang, Li Hu, Zhongyi Zhang, Peng Zhang, Bang Zhang, Song-Hai Zhang

ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization

Visual Autoregressive (VAR) models enhance generation quality but face a critical efficiency bottleneck in later stages. In this paper, we present a novel optimization framework for VAR models that fundamentally differs from prior approaches such as FastVAR and SkipVAR. Instead of relying on heuristic skipping strategies, our method leverages attention entropy to characterize the semantic projections across different dimensions of the model architecture. This enables precise identification of parameter dynamics under varying token granularity levels, semantic scopes, and generation scales. Building on this analysis, we further uncover sparsity patterns along three critical dimensions (token, layer, and scale) and propose a set of fine-grained optimization strategies tailored to these patterns. Extensive evaluation demonstrates that our approach achieves aggressive acceleration of the generation process while significantly preserving semantic fidelity and fine details, outperforming traditional methods in both efficiency and quality. Experiments on Infinity-2B and Infinity-8B models demonstrate that ToProVAR achieves up to 3.4x acceleration with minimal quality loss, effectively mitigating the issues found in prior work. Our code will be made publicly available.
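
As a rough illustration of the quantity the method builds on, the sketch below computes the Shannon entropy of a single attention row. How ToProVAR aggregates entropy across tokens, layers, and scales is not described in the abstract, so the function name and the example rows are purely illustrative.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one row of attention weights.

    The row is renormalized to sum to 1 first. Low entropy means attention
    is concentrated on a few keys; high entropy means it is spread out.
    (Illustrative helper, not the paper's implementation.)
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights]
    # Terms with p == 0 contribute nothing, by the usual 0 * log(0) = 0 convention.
    return -sum(p * math.log(p) for p in probs if p > 0)


focused = attention_entropy([1.0, 0.0, 0.0, 0.0])
diffuse = attention_entropy([0.25, 0.25, 0.25, 0.25])
print(focused, diffuse)
```

A one-hot row has zero entropy and a uniform row over four keys has entropy log 4, so the value gives a simple scalar signal for how focused a token's attention is.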

TLDR: ToProVAR introduces a novel entropy-aware optimization framework for Visual Autoregressive models, achieving significant acceleration in generation speed with minimal quality loss by exploiting sparsity patterns across token, layer, and scale dimensions.

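TLDR: ToProVAR proposes a novel entropy-aware optimization framework for visual autoregressive models. By exploiting sparsity patterns along the token, layer, and scale dimensions, the method substantially accelerates generation while keeping quality loss minimal.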

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Jiayu Chen, Ruoyu Lin, Zihao Zheng, Jingxin Li, Maoliang Li, Guojie Luo, Xiang Chen

PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning

With the recent fast development of generative models, instruction-based image editing has shown great potential in generating high-quality images. However, the quality of editing highly depends on carefully designed instructions, placing the burden of task decomposition and sequencing entirely on the user. To achieve autonomous image editing, we present PhotoAgent, a system that advances image editing through explicit aesthetic planning. Specifically, PhotoAgent formulates autonomous image editing as a long-horizon decision-making problem. It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution with memory and visual feedback, without requiring step-by-step user prompts. To support reliable evaluation in real-world scenarios, we introduce UGC-Edit, an aesthetic evaluation benchmark consisting of 7,000 photos and a learned aesthetic reward model. We also construct a test set containing 1,017 photos to systematically assess autonomous photo editing performance. Extensive experiments demonstrate that PhotoAgent consistently improves both instruction adherence and visual quality compared with baseline methods. The project page is https://github.com/mdyao/PhotoAgent.

TLDR: PhotoAgent introduces an agentic system for autonomous image editing with explicit aesthetic planning, using tree search and a learned aesthetic reward model, and evaluates it on a new aesthetic benchmark.

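TLDR: PhotoAgent proposes an autonomous image-editing agent system with explicit aesthetic planning. It uses tree search and a learned aesthetic reward model, and is evaluated on a new aesthetic benchmark.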

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (7/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Mingde Yao, Zhiyuan You, Tam-King Man, Menglu Wang, Tianfan Xue

Instruction-based Image Editing with Planning, Reasoning, and Generation

Editing images via instructions provides a natural way to generate interactive content, but it is challenging because it demands stronger scene understanding and generation. Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. However, the understanding models provide only single-modality ability, restricting the editing quality. We aim to bridge understanding and generation via a new multi-modality model that equips instruction-based image editing models with the intelligence needed for more complex cases. To achieve this goal, we decompose the instruction editing task with multi-modality chain-of-thought prompts into Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For Chain-of-Thought planning, the large language model reasons about the appropriate sub-prompts, considering the instruction provided and the ability of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, we propose a hint-guided instruction-based editing network, built on a large text-to-image diffusion model, that accepts the hints for generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images.

TLDR: This paper introduces a new instruction-based image editing method that leverages a multi-modality chain-of-thought approach for enhanced scene understanding and generation, leading to improved editing quality in complex real-world images.

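TLDR: This paper proposes a new instruction-based image editing method that uses a multi-modal chain-of-thought approach to strengthen scene understanding and generation, improving editing quality on complex real-world images.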

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Liya Ji, Chenyang Qi, Qifeng Chen

Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

Classifier-free guidance (CFG) has helped diffusion models achieve great conditional generation in various fields. Recently, more diffusion guidance methods have emerged with improved generation quality and human preference. However, can these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work mainly consists of four contributions. First, we reveal a critical evaluation pitfall: common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores due to strong semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale calibration to enable fair comparison between current guidance methods and CFG by identifying the effects orthogonal and parallel to CFG effects. Third, motivated by the evaluation pitfall, we design the Transcendent Diffusion Guidance (TDG) method, which can significantly improve human preference scores in the conventional evaluation framework but does not actually work in practice. Fourth, in extensive experiments, we empirically evaluate eight recent diffusion guidance methods within the conventional evaluation framework and the proposed GA-Eval framework. Notably, simply increasing the CFG scale can compete with most studied diffusion guidance methods, while all methods suffer severely from winning rate degradation over standard CFG. Our work should strongly motivate the community to rethink the evaluation paradigm and future directions of this field.
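
For reference, the guidance scale discussed above enters the sampler through a simple extrapolation between the unconditional and conditional noise predictions. The sketch below uses the widely used convention in which a scale of 1 recovers the plain conditional prediction; the function name and the toy vectors are illustrative, and flat lists stand in for the model's tensors.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    towards (and, for scale > 1, beyond) the conditional one.

    scale == 0 gives the unconditional prediction, scale == 1 the conditional
    one. (Illustrative helper operating on flat lists of floats.)
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


uncond = [0.0, 1.0, -1.0]
cond = [1.0, 1.0, 0.0]
print(cfg_combine(uncond, cond, 1.0))
print(cfg_combine(uncond, cond, 7.5))
```

Larger scales push the prediction further along the conditional direction, which is the knob the abstract argues can inflate preference-model scores even as image quality degrades.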

TLDR: This paper identifies a critical evaluation bias in text-to-image generation related to classifier-free guidance (CFG) scale, proposes a guidance-aware evaluation framework, and demonstrates that simply increasing the CFG scale can compete with most of the more complex methods studied. The work underscores the necessity of rethinking evaluation paradigms in the field.

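TLDR: The paper points out a critical evaluation bias in text-to-image generation tied to the classifier-free guidance (CFG) scale, proposes a guidance-aware evaluation framework, and shows that simply increasing the CFG scale can compete with most of the more complex methods. The work stresses the need to rethink the field's evaluation paradigm.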

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Dian Xie, Shitong Shao, Lichen Bai, Zikai Zhou, Bojun Cheng, Shuo Yang, Jun Wu, Zeke Xie

DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation

Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model's sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes. Notably, DrivePTS successfully generates rare scenes where prior methods fail, highlighting its strong generalization ability.

TLDR: DrivePTS introduces a progressive learning framework with enhanced textual and structural guidance for generating diverse and high-fidelity driving scenes, addressing limitations of existing methods in controllability and detail.

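TLDR: DrivePTS proposes a progressive learning framework that generates diverse, high-fidelity driving scenes through enhanced textual and structural guidance, addressing the limitations of existing methods in controllability and detail.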

Relevance: (9/10)

Novelty: (8/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Zhechao Wang, Yiming Zeng, Lufan Ma, Zeqing Fu, Chen Bai, Ziyao Lin, Cheng Lu

Solaris: Building a Multiplayer Video World Model in Minecraft

Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized capture of videos and actions. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show our architecture and training design outperform existing baselines. Through open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.

TLDR: The paper introduces Solaris, a multiplayer video world model trained on a new Minecraft dataset with 12.64 million frames, enabling consistent multi-view simulations and outperforming existing baselines with a novel Checkpointed Self Forcing technique.

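TLDR: The paper introduces Solaris, a multiplayer video world model trained on a new Minecraft dataset of 12.64 million frames. It achieves consistent multi-view simulation and surpasses existing baselines using a novel Checkpointed Self Forcing technique.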

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie

Multidimensional Task Learning: A Unified Tensor Framework for Computer Vision Tasks

This paper introduces Multidimensional Task Learning (MTL), a unified mathematical framework based on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors via the Einstein product. We argue that current computer vision task formulations are inherently constrained by matrix-based thinking: standard architectures rely on matrix-valued weights and vector-valued biases, requiring structural flattening that restricts the space of naturally expressible tasks. GE-MLPs lift this constraint by operating with tensor-valued parameters, enabling explicit control over which dimensions are preserved or contracted without information loss. Through rigorous mathematical derivations, we demonstrate that classification, segmentation, and detection are special cases of MTL, differing only in their dimensional configuration within a formally defined task space. We further prove that this task space is strictly larger than what matrix-based formulations can natively express, enabling principled task configurations such as spatiotemporal or cross-modal predictions that require destructive flattening under conventional approaches. This work provides a mathematical foundation for understanding, comparing, and designing computer vision tasks through the lens of tensor algebra.
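
For context, the Einstein product mentioned above has a standard definition in the tensor literature, reproduced here as a reference; the symbols are the conventional ones and are not taken from the paper, whose exact notation the abstract does not show. It contracts the last $N$ modes of one tensor with the first $N$ modes of another:

```latex
% Einstein product over N shared modes (standard definition; notation assumed)
\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_P \times J_1 \times \cdots \times J_N},
\qquad
\mathcal{B} \in \mathbb{R}^{J_1 \times \cdots \times J_N \times K_1 \times \cdots \times K_M}
\\[4pt]
(\mathcal{A} *_N \mathcal{B})_{i_1 \dots i_P \, k_1 \dots k_M}
  = \sum_{j_1=1}^{J_1} \cdots \sum_{j_N=1}^{J_N}
    a_{i_1 \dots i_P \, j_1 \dots j_N} \; b_{j_1 \dots j_N \, k_1 \dots k_M}
```

With $P = N = M = 1$ this reduces to the ordinary matrix product, which is the sense in which matrix-based layers are a special case of the tensor formulation.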

TLDR: The paper introduces Multidimensional Task Learning (MTL), a tensor-based framework for computer vision that unifies classification, segmentation, and detection, claiming to overcome limitations of matrix-based architectures and enabling novel task configurations.

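TLDR: The paper introduces Multidimensional Task Learning (MTL), a tensor-based computer vision framework that unifies classification, segmentation, and detection, aiming to overcome the limitations of matrix-based architectures and enable new task configurations.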

Relevance: (6/10)

Novelty: (8/10)

Clarity: (7/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Alaa El Ichi, Khalide Jbilou

OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis

Fingerspelling is a component of sign languages in which words are spelled out letter by letter using specific hand poses. Automatic fingerspelling recognition plays a crucial role in bridging the communication gap between Deaf and hearing communities, yet it remains challenging due to the signing-hand ambiguity issue, the lack of appropriate training losses, and the out-of-vocabulary (OOV) problem. Prior fingerspelling recognition methods rely on explicit signing-hand detection, which often leads to recognition failures, and on a connectionist temporal classification (CTC) loss, which exhibits the peaky behavior problem. To address these issues, we develop OpenFS, an open-source approach for fingerspelling recognition and synthesis. We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. The SF loss encourages cross-attention to focus on the signing hand, enabling implicit signing-hand detection during recognition. Furthermore, without relying on the CTC loss, we introduce a monotonic alignment (MA) loss that enforces the output letter sequence to follow the temporal order of the input pose sequence through cross-attention regularization. In addition, we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectiveness of the proposed recognizer and generator. Code and data are available at https://github.com/JunukCha/OpenFS.

TLDR: The paper introduces OpenFS, a novel open-source system for fingerspelling recognition and synthesis that addresses issues like signing-hand ambiguity and the out-of-vocabulary problem, using a multi-hand-capable recognizer with implicit hand detection and a frame-wise letter-conditioned generator. It achieves state-of-the-art performance.

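TLDR: This paper introduces OpenFS, a novel open-source system for fingerspelling recognition and synthesis that targets challenges such as signing-hand ambiguity and the out-of-vocabulary problem. The system uses a multi-hand-capable recognizer with implicit signing-hand detection and a frame-wise letter-conditioned generator, and achieves state-of-the-art performance.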

Relevance: (6/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Junuk Cha, Jihyeon Kim, Han-Mu Park

SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation

We introduce SceneTransporter, an end-to-end framework for structured 3D scene generation from a single image. While existing methods generate part-level 3D objects, they often fail to organize these parts into distinct instances in open-world scenes. Through a debiased clustering probe, we reveal a critical insight: this failure stems from the lack of structural constraints within the model's internal assignment mechanism. Based on this finding, we reframe the task of structured 3D scene generation as a global correlation assignment problem. To solve this, SceneTransporter formulates and solves an entropic Optimal Transport (OT) objective within the denoising loop of the compositional DiT model. This formulation imposes two powerful structural constraints. First, the resulting transport plan gates cross-attention to enforce an exclusive, one-to-one routing of image patches to part-level 3D latents, preventing entanglement. Second, the competitive nature of the transport encourages the grouping of similar patches, a process that is further regularized by an edge-based cost, to form coherent objects and prevent fragmentation. Extensive experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. Code and models will be publicly available at https://2019epwl.github.io/SceneTransporter/.
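
Entropic optimal transport of the kind mentioned above is usually solved with Sinkhorn iterations. The sketch below is a generic, minimal Sinkhorn solver on a toy cost matrix; it is not the paper's implementation, and the function name, the `epsilon` and `iters` defaults, and the patch-to-part reading of rows and columns are assumptions made for illustration.

```python
import math


def sinkhorn(cost, row_mass, col_mass, epsilon=0.1, iters=200):
    """Entropic OT via Sinkhorn iterations; returns the transport plan.

    cost[i][j] is the cost of routing source i (e.g. an image patch) to
    target j (e.g. a part-level latent). row_mass and col_mass are the
    marginals and must each sum to the same total. (Generic sketch.)
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: cheap assignments get weights near 1, costly ones near 0.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
print(plan)
```

On this toy problem nearly all mass lands on the diagonal, i.e. each source is routed almost exclusively to its cheapest target, which is the kind of near one-to-one assignment the abstract describes using to gate cross-attention.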

TLDR: SceneTransporter uses Optimal Transport within a diffusion model to improve instance-level coherence and geometric fidelity in single-image structured 3D scene generation.

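TLDR: SceneTransporter uses optimal transport inside a diffusion model to improve instance-level coherence and geometric fidelity in single-image structured 3D scene generation.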

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Ling Wang, Hao-Xiang Guo, Xinzhou Wang, Fuchun Sun, Kai Sun, Pengkun Liu, Hang Xiao, Zhong Wang, Guangyuan Fu, Eric Li, Yang Liu, Yikai Wang
