AIGC Daily Papers

Daily papers related to Image/Video/Multimodal Generation from cs.CV

December 06, 2025

TV2TV: A Unified Framework for Interleaved Language and Video Generation

Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before ``acting in pixels'' to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.

TLDR: The paper introduces TV2TV, a novel video generation framework that interleaves text and video generation using a Mixture-of-Transformers architecture to improve visual quality, prompt alignment, and controllability by enabling the model to reason about content in words before generating frames.

TLDR: 该论文介绍了TV2TV,一种新颖的视频生成框架,它使用混合Transformer架构交错文本和视频生成,通过使模型在生成帧之前用文字推理内容来提高视觉质量、提示对齐和可控性。

Relevance: (10/10)
Novelty: (9/10)
Clarity: (9/10)
Potential Impact: (9/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan

Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, almost maintaining overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.

TLDR: This paper introduces Deep Forcing, a training-free method for long video generation using Deep Sink and Participative Compression to address temporal repetition and drift issues in autoregressive video diffusion, enabling significant extrapolation and improved quality.

TLDR: 本文介绍了一种名为Deep Forcing的无需训练的长视频生成方法,它使用Deep Sink和Participative Compression来解决自回归视频扩散中的时间重复和漂移问题,从而实现了显著的外推和质量提升。

Relevance: (10/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (9/10)
Overall: (9/10)
Read Paper (PDF)

Authors: Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim

Light-X: Generative 4D Video Rendering with Camera and Illumination Control

Recent advances in illumination control extend image-based methods to video, yet still facing a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.

TLDR: Light-X is a video generation framework enabling controllable rendering from monocular videos with both viewpoint and illumination control, using disentangled geometry and lighting signals and a novel data synthesis pipeline called Light-Syn.

TLDR: Light-X是一个视频生成框架,通过解耦几何和光照信号以及一个名为Light-Syn的新型数据合成流程,实现了从单目视频中生成具有视点和光照控制的可控渲染。

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Tianqi Liu, Zhaoxi Chen, Zihao Huang, Shaocong Xu, Saining Zhang, Chongjie Ye, Bohan Li, Zhiguo Cao, Wei Li, Hao Zhao, Ziwei Liu

DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and input prompt, and performs refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.

TLDR: The paper introduces DraCo, a novel text-to-image generation method leveraging interleaved textual and visual CoT reasoning with draft image previews for improved planning, verification, and rare concept generation, demonstrating significant performance gains.

TLDR: 该论文介绍了一种名为 DraCo 的新型文本到图像生成方法,该方法利用交错的文本和视觉 CoT 推理以及草图图像预览,以改进规划、验证和罕见概念的生成,并展示了显著的性能提升。

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo, Jun He, Claire Guo, Junyan Ye, Rongyao Fang, Weijia Li, Rui Liu, Hongsheng Li

BulletTime: Decoupled Control of Time and Camera Pose for Video Generation

Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: https://19reborn.github.io/Bullet4D/

TLDR: This paper introduces a video diffusion framework, BulletTime, that decouples scene dynamics from camera pose, enabling more precise control over video generation through 4D positional encoding and adaptive normalizations, trained on a new dataset with independent temporal and camera variations.

TLDR: 这篇论文介绍了一个名为BulletTime的视频扩散框架,该框架将场景动态与相机姿态解耦,通过4D位置编码和自适应归一化实现对视频生成的更精确控制,并使用一个具有独立时间和相机变化的新的数据集进行训练。

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)
Read Paper (PDF)

Authors: Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, Gordon Wetzstein