ArXiv CS.CV Papers (Image/Video Generation)

SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation

Controlling both camera motion and object dynamics is essential for coherent and expressive video generation, yet current methods typically handle only one motion type or rely on ambiguous 2D cues that entangle camera-induced parallax with true object movement. We present SymphoMotion, a unified motion-control framework that jointly governs camera trajectories and object dynamics within a single model. SymphoMotion features a Camera Trajectory Control mechanism that integrates explicit camera paths with geometry-aware cues to ensure stable, structurally consistent viewpoint transitions, and an Object Dynamics Control mechanism that combines 2D visual guidance with 3D trajectory embeddings to enable depth-aware, spatially coherent object manipulation. To support large-scale training and evaluation, we further construct RealCOD-25K, a comprehensive real-world dataset containing paired camera poses and object-level 3D trajectories across diverse indoor and outdoor scenes, addressing a key data gap in unified motion control. Extensive experiments and user studies show that SymphoMotion significantly outperforms existing methods in visual fidelity, camera controllability, and object-motion accuracy, establishing a new benchmark for unified motion control in video generation. Code and data are publicly available at https://grenoble-zhang.github.io/SymphoMotion/.

TLDR: SymphoMotion introduces a unified framework for controlling both camera and object motion in video generation, along with a new real-world dataset, demonstrating superior performance in visual fidelity and motion control.

Relevance: (10/10)

Novelty: (9/10)

Clarity: (10/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Guiyu Zhang, Yabo Chen, Xunzhi Xiang, Junchao Huang, Zhongyu Wang, Li Jiang

A Generative Foundation Model for Multimodal Histopathology

Accurate diagnosis and treatment of complex diseases require integrating histological, molecular, and clinical data, yet in practice these modalities are often incomplete owing to tissue scarcity, assay cost, and workflow constraints. Existing computational approaches attempt to impute missing modalities from available data but rely on task-specific models trained on narrow, single source-target pairs, limiting their generalizability. Here we introduce MuPD (Multimodal Pathology Diffusion), a generative foundation model that embeds hematoxylin and eosin (H&E)-stained histology, molecular RNA profiles, and clinical text into a shared latent space through a diffusion transformer with decoupled cross-modal attention. Pretrained on 100 million histology image patches, 1.6 million text-histology pairs, and 10.8 million RNA-histology pairs spanning 34 human organs, MuPD supports diverse cross-modal synthesis tasks with minimal or no task-specific fine-tuning. For text-conditioned and image-to-image generation, MuPD synthesizes histologically faithful tissue architectures, reducing Fréchet inception distance (FID) scores by 50% relative to domain-specific models and improving few-shot classification accuracy by up to 47% through synthetic data augmentation. For RNA-conditioned histology generation, MuPD reduces FID by 23% compared with the next-best method while preserving cell-type distributions across five cancer types. As a virtual stainer, MuPD translates H&E images to immunohistochemistry and multiplex immunofluorescence, improving average marker correlation by 37% over existing approaches. These results demonstrate that a single, unified generative model pretrained across heterogeneous pathology modalities can substantially outperform specialized alternatives, providing a scalable computational framework for multimodal histopathology.

TLDR: This paper introduces MuPD, a multimodal generative foundation model for histopathology that integrates histology images, RNA profiles, and clinical text, and demonstrates its superior performance in cross-modal synthesis tasks compared to specialized models.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (9/10)

Overall: (9/10)

Authors: Jinxi Xiang, Mingjie Li, Siyu Hou, Yijiang Chen, Xiangde Luo, Yuanfeng Ji, Xiang Zhou, Ehsan Adeli, Akshay Chaudhari, Curtis P. Langlotz, Kilian M. Pohl, Ruijiang Li
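
The FID figures quoted in the MuPD abstract refer to the Fréchet inception distance. For reference, the standard definition is given below; it is not specific to MuPD, and the abstract does not say which feature extractor or sample sizes the authors used. Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of feature embeddings for real and generated images, respectively:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower is better, so a 50% reduction means the generated feature distribution sits at half the Fréchet distance from the real one compared with the baseline.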

Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation

Recent proprietary models such as Sora2 demonstrate promising progress in generating multi-shot videos conditioned on multiple reference characters. However, academic research on this problem remains limited. We study this task and identify a core challenge: when reference images exhibit highly similar appearances, the model often suffers from reference confusion, where semantically similar tokens degrade the model's ability to retrieve the correct context. To address this, we introduce PoCo (Position Embedding as a Context Controller), which incorporates position encoding as additional context control beyond semantic retrieval. By employing side information of tokens, PoCo enables precise token-level matching while preserving implicit semantic consistency modeling. Building on PoCo, we develop a multi-reference and multi-shot video generation model capable of reliably controlling characters with extremely similar visual traits. Extensive experiments demonstrate that PoCo improves cross-shot consistency and reference fidelity compared with various baselines.

TLDR: The paper introduces PoCo, a novel method using position embeddings to improve multi-reference video generation by mitigating reference confusion, especially when characters have similar appearances.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Binyuan Huang, Yuning Lu, Weinan Jia, Hualiang Wang, Mu Liu, Daiqing Yang

DiffSparse: Accelerating Diffusion Transformers with Learned Token Sparsity

Diffusion models demonstrate outstanding performance in image generation, but their multi-step inference mechanism requires immense computational cost. Previous works accelerate inference by leveraging layer or token cache techniques to reduce computational cost. However, these methods fail to achieve superior acceleration performance in few-step diffusion transformer models due to inefficient feature caching strategies, manually designed sparsity allocation, and the practice of retaining complete forward computations in several steps in these token cache methods. To tackle these challenges, we propose a differentiable layer-wise sparsity optimization framework for diffusion transformer models, leveraging token caching to reduce token computation costs and enhance acceleration. Our method optimizes layer-wise sparsity allocation in an end-to-end manner through a learnable network combined with a dynamic programming solver. Additionally, our proposed two-stage training strategy eliminates the need for full-step processing in existing methods, further improving efficiency. We conducted extensive experiments on a range of diffusion-transformer models, including DiT-XL/2, PixArt-α, FLUX, and Wan2.1. Across these architectures, our method consistently improves efficiency without degrading sample quality. For example, on PixArt-α with 20 sampling steps, we reduce computational cost by 54% while achieving generation metrics that surpass those of the original model, substantially outperforming prior approaches. These results demonstrate that our method delivers large efficiency gains while often improving generation quality.

TLDR: This paper introduces DiffSparse, a method to accelerate diffusion transformer models by learning layer-wise token sparsity, reducing computational cost and often improving generation quality without full-step computation.

Relevance: (8/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Haowei Zhu, Ji Liu, Ziqiong Liu, Dong Li, Junhai Yong, Bin Wang, Emad Barsoum
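
The DiffSparse abstract mentions a dynamic programming solver for layer-wise sparsity allocation but gives no formulation. As a rough illustration of the general idea only, the sketch below treats allocation as a multiple-choice knapsack: each layer picks one sparsity level, each level has an integer compute cost, and the goal is to minimize a total quality penalty under a compute budget. The function name `allocate_sparsity`, the penalty table, and the cost values are all hypothetical. In the paper the per-layer scores would presumably come from the learnable network, whereas here they are fixed inputs.

```python
from typing import List, Tuple


def allocate_sparsity(
    penalties: List[List[float]],  # penalties[l][k]: assumed quality penalty of level k at layer l
    costs: List[int],              # costs[k]: integer compute units consumed by level k
    budget: int,                   # total compute units allowed across all layers
) -> Tuple[float, List[int]]:
    """Choose one sparsity level per layer, minimizing total penalty within the budget."""
    inf = float("inf")
    # best[b] = minimal penalty over the layers processed so far using exactly b units
    best = [inf] * (budget + 1)
    best[0] = 0.0
    choice: List[List[int]] = []  # choice[l][b] = level chosen at layer l to reach b units

    for layer_penalties in penalties:
        new_best = [inf] * (budget + 1)
        picked = [-1] * (budget + 1)
        for used in range(budget + 1):
            if best[used] == inf:
                continue
            for level, cost in enumerate(costs):
                total = used + cost
                if total > budget:
                    continue
                candidate = best[used] + layer_penalties[level]
                if candidate < new_best[total]:
                    new_best[total] = candidate
                    picked[total] = level
        best = new_best
        choice.append(picked)

    end = min(range(budget + 1), key=lambda b: best[b])
    if best[end] == inf:
        raise ValueError("budget too small for any allocation")

    # Backtrack from the best final state to recover the per-layer levels.
    levels = [0] * len(penalties)
    remaining = end
    for layer in range(len(penalties) - 1, -1, -1):
        level = choice[layer][remaining]
        levels[layer] = level
        remaining -= costs[level]
    return best[end], levels


# Hypothetical example: level 0 is dense (4 units), levels 1 and 2 keep fewer tokens.
example_costs = [4, 2, 1]
example_penalties = [
    [0.0, 0.5, 2.0],  # layer 0
    [0.0, 0.1, 0.3],  # layer 1
    [0.0, 1.0, 3.0],  # layer 2 is sensitive to sparsity
]
total_penalty, chosen_levels = allocate_sparsity(example_penalties, example_costs, budget=8)
print(total_penalty, chosen_levels)
```

With these made-up numbers the dense model would cost 12 units. Under a budget of 8, the solver keeps the sensitive last layer dense and sparsifies the first two layers, choosing levels `[1, 1, 0]`.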

CRAFT: Video Diffusion for Bimanual Robot Data Generation

Bimanual robot learning from demonstrations is fundamentally limited by the cost and narrow visual diversity of real-world data, which constrains policy robustness across viewpoints, object configurations, and embodiments. We present Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a video diffusion-based framework for scalable bimanual demonstration generation that synthesizes temporally coherent manipulation videos while producing action labels. By conditioning video diffusion on edge-based structural cues extracted from simulator-generated trajectories, CRAFT produces physically plausible trajectory variations and supports a unified augmentation pipeline spanning object pose changes, camera viewpoints, lighting and background variations, cross-embodiment transfer, and multi-view synthesis. We leverage a pre-trained video diffusion model to convert simulated videos, along with action labels from the simulation trajectories, into action-consistent demonstrations. Starting from only a few real-world demonstrations, CRAFT generates a large, visually diverse set of photorealistic training data, bypassing the need to replay demonstrations on the real robot (Sim2Real). Across simulated and real-world bimanual tasks, CRAFT improves success rates over existing augmentation strategies and straightforward data scaling, demonstrating that diffusion-based video generation can substantially expand demonstration diversity and improve generalization for dual-arm manipulation tasks. Our project website is available at: https://craftaug.github.io/

TLDR: The paper introduces CRAFT, a video diffusion framework that generates diverse bimanual robot manipulation demonstrations from simulator trajectories, improving policy robustness and generalization by bypassing the need for extensive real-world data collection.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Jason Chen, I-Chun Arthur Liu, Gaurav Sukhatme, Daniel Seita

ExpressEdit: Fast Editing of Stylized Facial Expressions with Diffusion Models in Photoshop

Facial expressions of characters are a vital component of visual storytelling. While current AI image editing models hold promise for assisting artists in the task of stylized expression editing, these models introduce global noise and pixel drift into the edited image, preventing the integration of these models into professional image editing software and workflows. To bridge this gap, we introduce ExpressEdit, a fully open-source Photoshop plugin that is free from common artifacts of proprietary image editing models and robustly synergizes with native Photoshop operations such as Liquify. ExpressEdit seamlessly edits an expression within 3 seconds on a single consumer-grade GPU, significantly faster than popular proprietary models. Moreover, to support the generation of diverse expressions according to different narrative needs, we compile a comprehensive expression database of 135 expression tags enriched with example stories and images designed for retrieval-augmented generation. We open source the code and dataset to facilitate future research and artistic exploration.

TLDR: ExpressEdit is a fast, open-source Photoshop plugin for stylized facial expression editing using diffusion models, addressing artifact issues and integrating with existing workflows. The authors also release an expression database of 135 tags for retrieval-augmented generation.

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Authors: Kenan Tang, Jiasheng Guo, Jeffrey Lin, Yao Qin
