ArXiv CS.CV Papers (Image/Video Generation)

Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians

Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation. However, these methods often lack robust user controllability, such as modifying the camera path, which limits their use in real-world applications. Most existing camera-controlled image-to-video models struggle to model camera motion accurately, maintain temporal consistency, and preserve geometric integrity. Leveraging explicit intermediate 3D representations offers a promising solution by enabling coherent video generation aligned with a given camera trajectory. These methods often use 3D point clouds to render scenes and introduce object motion in a later stage; although this two-step process allows precise control over camera movement, it still falls short of full temporal consistency. We propose a novel framework that, from a single image and in a single forward pass, constructs a 3D Gaussian scene representation and samples plausible object motion. This enables fast, camera-guided video generation without the need for iterative denoising to inject object motion into rendered frames. Extensive experiments on the KITTI, Waymo, RealEstate10K and DL3DV-10K datasets demonstrate that our method achieves state-of-the-art video quality and inference efficiency. The project page is available at https://melonienimasha.github.io/Pixel-to-4D-Website.
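To make the idea of rendering a moving 3D scene along a user-given camera path concrete, here is a minimal sketch, not the paper's method. It assumes each Gaussian is reduced to its center, that object motion is a constant per-frame velocity, that the camera only translates and looks along +z, and that projection is a simple pinhole model. The function `project_centers` and the parameters `focal`, `cx`, and `cy` are hypothetical names chosen for illustration.

```python
def project_centers(means, velocities, cam_positions,
                    focal=500.0, cx=320.0, cy=240.0):
    """Project moving 3D points into each frame of a camera path.

    Assumptions (illustrative only): constant per-frame velocity,
    translation-only camera looking along +z, pinhole projection.
    """
    frames = []
    for t, cam in enumerate(cam_positions):
        pixels = []
        for (x, y, z), (vx, vy, vz) in zip(means, velocities):
            # Advance the point to frame t under the assumed linear motion.
            px, py, pz = x + vx * t, y + vy * t, z + vz * t
            # Express the point relative to the camera position for this frame.
            rx, ry, rz = px - cam[0], py - cam[1], pz - cam[2]
            if rz <= 0:
                continue  # point is behind the camera, so it is not drawn
            pixels.append((focal * rx / rz + cx, focal * ry / rz + cy))
        frames.append(pixels)
    return frames


# One point moving along +x while the camera advances along +z.
frames = project_centers(
    means=[(0.0, 0.0, 10.0)],
    velocities=[(1.0, 0.0, 0.0)],
    cam_positions=[(0.0, 0.0, 0.0), (0.0, 0.0, 5.0)],
)
```

Because scene motion and camera pose are both explicit per frame, changing the camera path only changes `cam_positions`; no per-frame generative step is involved in this toy version.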

TLDR: This paper introduces a novel framework called Pixel-to-4D for camera-controlled image-to-video generation using dynamic 3D Gaussians, achieving state-of-the-art video quality and inference efficiency on several datasets.

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Melonie de Almeida, Daniela Ivanova, Tong Shi, John H. Williamson, Paul Henderson

AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models

The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (i.e., Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually annotated questions spanning 21 topics (including STEM, humanities, and daily life) and 6 reasoning types. To evaluate the world knowledge of UMMs without ambiguous metrics, we further propose Deterministic Checklist-based Evaluation (DCE), a protocol that replaces ambiguous prompt-based scoring with atomic "Y/N" judgments to improve evaluation reliability. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly with complex reasoning. Additionally, simple plug-in reasoning modules can partially mitigate these vulnerabilities, highlighting a promising direction for future research. These results highlight the importance of world-knowledge-based reasoning as a critical frontier for UMMs.
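A checklist of atomic "Y/N" judgments could be scored roughly as follows. This is a minimal sketch under stated assumptions, not the DCE protocol itself: the abstract does not say how verdicts are aggregated, so the fraction-of-items-passed rule is an assumption, and `ChecklistItem`, `parse_verdict`, and `checklist_score` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    question: str  # one atomic yes/no question about a model output
    passed: bool   # the judge's verdict for that question


def parse_verdict(raw: str) -> bool:
    """Map a judge's raw reply to a boolean; reject anything that is not Y/N."""
    verdict = raw.strip().upper()
    if verdict in ("Y", "YES"):
        return True
    if verdict in ("N", "NO"):
        return False
    raise ValueError(f"expected a Y/N verdict, got {raw!r}")


def checklist_score(items: list[ChecklistItem]) -> float:
    """Assumed aggregation: fraction of checklist items judged 'Y'."""
    if not items:
        raise ValueError("checklist must contain at least one item")
    return sum(item.passed for item in items) / len(items)


# Hypothetical checklist for one generated image.
items = [
    ChecklistItem("Is the depicted landmark the Eiffel Tower?", parse_verdict("Y")),
    ChecklistItem("Is the scene shown at night?", parse_verdict("n")),
    ChecklistItem("Is the tower lit?", parse_verdict(" yes ")),
]
score = checklist_score(items)
```

Rejecting any reply other than Y or N keeps each judgment binary, which is what lets the same checklist be applied consistently across outputs.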

TLDR: The paper introduces AEGIS, a new multi-task benchmark and deterministic evaluation protocol (DCE) to evaluate world knowledge capabilities of Unified Multimodal Models (UMMs) across visual understanding, generation, and editing tasks, revealing significant knowledge deficits in current models.

Relevance: (8/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Jintao Lin, Bowen Dong, Weikang Shi, Chenyang Lei, Suiyun Zhang, Rui Liu, Xihui Liu

DynaDrag: Dynamic Drag-Style Image Editing by Motion Prediction

Drag-style image editing, which edits images using points or trajectories as conditions to achieve pixel-level manipulation, is attracting widespread attention. Most previous methods follow a move-and-track framework, in which missed tracking and ambiguous tracking are unavoidable and challenging issues. Methods built on other frameworks suffer from various problems, such as a large gap between the source image and the edited target image, or unreasonable intermediate points, both of which can lead to low editability. To avoid these problems, we propose DynaDrag, the first dragging method under a predict-and-move framework. In DynaDrag, Motion Prediction and Motion Supervision are performed iteratively. In each iteration, Motion Prediction first predicts where the handle points should move, and then Motion Supervision drags them accordingly. We also propose to dynamically adjust the valid handle points to further improve performance. Experiments on face and human datasets showcase the superiority of DynaDrag over previous works.
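The control flow of an iterative predict-then-move loop can be sketched on plain 2D points. This is a toy illustration, not the DynaDrag implementation: the real method operates on images, whereas here Motion Prediction is replaced by a bounded step toward the target, Motion Supervision by directly moving the point, and the rule for which handle points stay valid (drop a point once it reaches its target) is an assumption. The names `predict_motion`, `drag`, `max_step`, and `tol` are hypothetical.

```python
import math


def predict_motion(handle, target, max_step=2.0):
    """Toy stand-in for Motion Prediction: next position, at most max_step away."""
    dx, dy = target[0] - handle[0], target[1] - handle[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return target
    scale = max_step / dist
    return (handle[0] + dx * scale, handle[1] + dy * scale)


def drag(handles, targets, max_iters=100, tol=1e-6):
    """Alternate prediction and movement until every handle reaches its target."""
    handles = list(handles)
    active = set(range(len(handles)))  # indices of still-valid handle points
    for _ in range(max_iters):
        # Assumed validity rule: a handle is dropped once it is at its target.
        active = {i for i in active
                  if math.dist(handles[i], targets[i]) > tol}
        if not active:
            break
        for i in active:
            predicted = predict_motion(handles[i], targets[i])
            handles[i] = predicted  # toy stand-in for Motion Supervision
    return handles


result = drag(
    handles=[(0.0, 0.0), (1.0, 1.0)],
    targets=[(10.0, 0.0), (1.0, 4.0)],
)
```

The point of the sketch is the ordering: each iteration first decides where a handle should go and only then moves it, so no step has to re-locate the handle after the move.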

TLDR: The paper introduces DynaDrag, a novel drag-style image editing method that uses a predict-and-move framework with motion prediction and supervision to improve the editability and tracking accuracy compared to previous methods.

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Authors: Jiacheng Sui, Yujie Zhou, Li Niu
