Daily papers related to Image/Video/Multimodal Generation from cs.CV
January 22, 2026
Recent advancements in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods relying solely on 3D diffusion models tend to suffer a degradation in appearance details, while those utilizing only 2D diffusion models typically compromise camera controllability. To overcome this limitation, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can be optionally conditioned on inputs such as 3D bounding boxes, road maps, or text prompts. Then, we train a 2D video diffusion model to enhance appearance details conditioned on rendered images from the 3D Gaussians. By leveraging the coarse 3D scene as guidance for 2D video diffusion, ScenDi generates desired scenes based on input conditions and successfully adheres to accurate camera trajectories. Experiments on two challenging real-world datasets, Waymo and KITTI-360, demonstrate the effectiveness of our approach.
TLDR: The paper introduces ScenDi, a novel approach for urban scene generation that combines 3D Gaussian diffusion models for coarse scene layout and 2D video diffusion models for enhancing appearance details, achieving controllable and realistic urban environments.
Autoregressive (AR) visual generators model images as sequences of discrete tokens and are trained with next-token likelihood. This strict causality supervision optimizes each step only by its immediate next token, which diminishes global coherence and slows convergence. We ask whether foresight, training signals that originate from later tokens, can help AR visual generation. We conduct a series of controlled diagnostics along the injection level, foresight layout, and foresight source axes, unveiling a key insight: aligning foresight to AR models' internal representation on the 2D image grids improves causality modeling. We formulate this insight with Mirai (meaning "future" in Japanese), a general framework that injects future information into AR training with no architecture change and no extra inference overhead: Mirai-E uses explicit foresight from multiple future positions of unidirectional representations, whereas Mirai-I leverages implicit foresight from matched bidirectional representations. Extensive experiments show that Mirai significantly accelerates convergence and improves generation quality. For instance, Mirai can speed up LlamaGen-B's convergence by up to 10$\times$ and reduce the generation FID from 5.34 to 4.34 on the ImageNet class-conditional image generation benchmark. Our study highlights that visual autoregressive models need foresight.
TLDR: The paper introduces Mirai, a framework for improving autoregressive visual generation by injecting 'foresight' (future information) into the training process, leading to faster convergence and improved generation quality.
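Sketch: the abstract above contrasts next-token supervision with foresight aligned to the 2D image grid. The snippet below only illustrates why the two differ under raster-scan ordering: the token directly below a position is spatially adjacent yet lies a full row ahead in the sequence. The function name `future_grid_neighbors` and the choice of right and lower neighbours are assumptions made for illustration, not the foresight layout used by Mirai.

```python
def future_grid_neighbors(index, width, height):
    """Raster-order indices of the right and lower neighbours of token `index`.

    Hypothetical helper: it shows which spatially adjacent tokens lie in the
    future of a raster-scan sequence, not how Mirai selects foresight targets.
    """
    row, col = divmod(index, width)
    neighbors = []
    if col + 1 < width:
        # Right neighbour: one step ahead, i.e. the usual next token.
        neighbors.append(index + 1)
    if row + 1 < height:
        # Lower neighbour: adjacent on the grid but `width` steps ahead.
        neighbors.append(index + width)
    return neighbors


if __name__ == "__main__":
    # On a 4x4 token grid, token 0 has future neighbours at 1 and 4.
    print(future_grid_neighbors(0, width=4, height=4))
```

Under plain next-token training, the lower neighbour contributes no signal at the current step, which is the kind of gap a foresight objective could target.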
Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations which are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representation, which learns to map images to model weights for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an unprecedented compressed embedding space with outstanding performance for various visual tasks. The complete model competes with state-of-the-art results for image representation learning, while also enabling generative capabilities with its high-quality tiny embeddings. The code is available at https://github.com/tiktok/huvr.
TLDR: This paper introduces a novel image representation learning model based on Implicit Neural Representation (INR) that unifies recognition and generation by learning compressed embeddings useful for various visual tasks, achieving competitive results.
Videos convey richer information than images or text, capturing both spatial and temporal dynamics. However, most existing video customization methods rely on reference images or task-specific temporal priors, failing to fully exploit the rich spatio-temporal information inherent in videos, thereby limiting flexibility and generalization in video generation. To address these limitations, we propose OmniTransfer, a unified framework for spatio-temporal video transfer. It leverages multi-view information across frames to enhance appearance consistency and exploits temporal cues to enable fine-grained temporal control. To unify various video transfer tasks, OmniTransfer incorporates three key designs: Task-aware Positional Bias that adaptively leverages reference video information to improve temporal alignment or appearance consistency; Reference-decoupled Causal Learning separating reference and target branches to enable precise reference transfer while improving efficiency; and Task-adaptive Multimodal Alignment using multimodal semantic guidance to dynamically distinguish and tackle different tasks. Extensive experiments show that OmniTransfer outperforms existing methods in appearance (ID and style) and temporal transfer (camera movement and video effects), while matching pose-guided methods in motion transfer without using pose, establishing a new paradigm for flexible, high-fidelity video generation.
TLDR: The paper introduces OmniTransfer, a unified framework for spatio-temporal video transfer that leverages multi-view information and temporal cues for flexible and high-fidelity video generation, achieving state-of-the-art results across various transfer tasks.
We present Soft Tail-dropping Adaptive Tokenizer (STAT), a 1D discrete visual tokenizer that adaptively chooses the number of output tokens per image according to its structural complexity and level of detail. STAT encodes an image into a sequence of discrete codes together with per-token keep probabilities. Beyond standard autoencoder objectives, we regularize these keep probabilities to be monotonically decreasing along the sequence and explicitly align their distribution with an image-level complexity measure. As a result, STAT produces length-adaptive 1D visual tokens that are naturally compatible with causal 1D autoregressive (AR) visual generative models. On ImageNet-1k, equipping vanilla causal AR models with STAT yields competitive or superior visual generation quality compared to other probabilistic model families, while also exhibiting favorable scaling behavior that has been elusive in prior vanilla AR visual generation attempts.
TLDR: The paper introduces Soft Tail-dropping Adaptive Tokenizer (STAT), a novel 1D discrete visual tokenizer that adaptively determines the number of tokens per image based on complexity, improving the performance of autoregressive visual generative models.
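Sketch: the STAT abstract describes per-token keep probabilities that are regularized to decrease monotonically, so that a sequence can be cut to an image-dependent length. A minimal illustration of those two ideas follows. The function names, the hinge-style penalty, and the fixed `threshold` are assumptions for this sketch; the abstract does not state the form of the regularizer or the truncation rule that STAT uses.

```python
def monotonicity_penalty(keep_probs):
    """Sum of increases between consecutive keep probabilities.

    Assumed regularizer: zero exactly when the sequence is non-increasing.
    """
    return sum(max(0.0, nxt - cur) for cur, nxt in zip(keep_probs, keep_probs[1:]))


def tail_drop(tokens, keep_probs, threshold=0.5):
    """Keep the prefix of tokens before the first keep probability below threshold."""
    length = 0
    for p in keep_probs:
        if p < threshold:
            break
        length += 1
    return tokens[:length]


if __name__ == "__main__":
    tokens = [17, 4, 42, 8, 23]          # hypothetical codebook indices
    probs = [0.99, 0.9, 0.7, 0.3, 0.1]   # already non-increasing
    print(tail_drop(tokens, probs))      # prefix of length 3
    print(monotonicity_penalty(probs))   # 0.0 for a non-increasing sequence
```

Because only a prefix is kept, the shortened sequence remains a valid input for a causal 1D model, which matches the compatibility claim in the abstract.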
Traditional data masking techniques such as anonymization cannot achieve the expected privacy protection while ensuring data utility for privacy-preserving machine learning. Synthetic data plays an increasingly important role as it generates a large number of training samples and prevents information leakage in real data. Existing methods suffer from repeated trade-off processes between privacy and utility. We propose a novel framework for differentially private generation, which employs an Error Feedback Stochastic Gradient Descent (EFSGD) method and introduces a reconstruction loss and noise injection mechanism into the training process. We generate images with higher quality and usability under the same privacy budget as the related work. Extensive experiments demonstrate the effectiveness and generalization of our proposed framework for both grayscale and RGB images. We achieve state-of-the-art results over almost all metrics on three benchmarks: MNIST, Fashion-MNIST, and CelebA.
TLDR: This paper proposes a differential privacy image generation framework using Error Feedback SGD with reconstruction loss and noise injection, achieving state-of-the-art results on standard image datasets while addressing the privacy-utility trade-off.
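Sketch: the abstract names error feedback and noise injection but does not spell out the update rule. The snippet below shows a generic error-feedback pattern: the part of the gradient removed by norm clipping is stored and added back at the next step, and Gaussian noise is added to the clipped gradient. The function names, the clipping rule, and the noise model are all assumptions, and nothing here establishes a privacy guarantee or reproduces the paper's EFSGD.

```python
import math
import random


def clip(vec, max_norm):
    """Scale `vec` down so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def ef_noisy_step(grad, error, max_norm, noise_std, rng):
    """One hypothetical error-feedback step.

    Returns the noisy clipped gradient and the residual to carry forward.
    """
    corrected = [g + e for g, e in zip(grad, error)]          # re-inject old residual
    clipped = clip(corrected, max_norm)                       # bound the sensitivity
    new_error = [c - k for c, k in zip(corrected, clipped)]   # what clipping removed
    noisy = [k + rng.gauss(0.0, noise_std) for k in clipped]  # Gaussian noise injection
    return noisy, new_error


if __name__ == "__main__":
    rng = random.Random(0)
    update, residual = ef_noisy_step([3.0, 4.0], [0.0, 0.0], 1.0, 0.1, rng)
    print(update, residual)
```

The residual is what distinguishes this from plain clipped SGD: information lost to clipping is delayed rather than discarded.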
Artificial Intelligence-Generated Content (AIGC) has made significant strides, with high-resolution text-to-image (T2I) generation becoming increasingly critical for improving users' Quality of Experience (QoE). Although resource-constrained edge computing adequately supports fast low-resolution T2I generation, achieving high-resolution output still faces the challenge of ensuring image fidelity at the cost of latency. To address this, we first investigate the performance of super-resolution (SR) methods for image enhancement, confirming a fundamental trade-off: lightweight learning-based SR struggles to recover fine details, while diffusion-based SR achieves higher fidelity at a substantial computational cost. Motivated by these observations, we propose an end-edge collaborative generation-enhancement framework. Upon receiving a T2I generation task, the system first generates a low-resolution image based on adaptively selected denoising steps and super-resolution scales at the edge side, which is then partitioned into patches and processed by a region-aware hybrid SR policy. This policy applies a diffusion-based SR model to foreground patches for detail recovery and a lightweight learning-based SR model to background patches for efficient upscaling, ultimately stitching the enhanced patches into the high-resolution image. Experiments show that our system reduces service latency by 33% compared with baselines while maintaining competitive image quality.
TLDR: This paper proposes an end-edge collaborative framework for text-to-image generation that uses a hybrid super-resolution (SR) policy to balance image fidelity and latency by applying diffusion-based SR to foreground patches and lightweight SR to background patches.
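Sketch: the region-aware policy described above (partition into patches, route foreground patches to a heavy SR model and background patches to a light one, then stitch) can be outlined as follows. Everything concrete here is an assumption: the image is a 2D list of pixel values, the foreground mask is given, a patch counts as foreground if any of its pixels is masked, and nearest-neighbour upscaling stands in for both SR models.

```python
def upscale_nearest(patch, scale):
    """Nearest-neighbour upscaling of a 2D list; a stand-in for any SR model."""
    out = []
    for row in patch:
        wide = [v for v in row for _ in range(scale)]
        out.extend([list(wide) for _ in range(scale)])
    return out


def hybrid_sr(image, fg_mask, patch, scale, heavy_sr, light_sr):
    """Upscale `image` patch by patch, choosing the SR function per patch."""
    h, w = len(image), len(image[0])
    out = [[0] * (w * scale) for _ in range(h * scale)]
    routed = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            rows = range(top, min(top + patch, h))
            cols = range(left, min(left + patch, w))
            tile = [[image[r][c] for c in cols] for r in rows]
            # Assumed routing rule: any masked pixel makes the patch foreground.
            is_fg = any(fg_mask[r][c] for r in rows for c in cols)
            routed.append("heavy" if is_fg else "light")
            up = (heavy_sr if is_fg else light_sr)(tile, scale)
            # Stitch the enhanced patch back at its scaled position.
            for dr, up_row in enumerate(up):
                start = left * scale
                out[top * scale + dr][start:start + len(up_row)] = up_row
    return out, routed


if __name__ == "__main__":
    img = [[r * 4 + c for c in range(4)] for r in range(4)]
    mask = [[r == 0 and c == 0 for c in range(4)] for r in range(4)]
    result, routes = hybrid_sr(img, mask, 2, 2, upscale_nearest, upscale_nearest)
    print(routes)
```

In a real system the two callables would have very different costs, so the share of patches routed to the heavy model largely determines latency.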
We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. While recent advances have significantly improved 2D, video, and 3D content generation, 4D synthesis remains difficult due to limited training data and the inherent ambiguity of recovering geometry and motion from a monocular viewpoint. Motion 3-to-4 addresses these challenges by decomposing 4D synthesis into static 3D shape generation and motion reconstruction. Using a canonical reference mesh, our model learns a compact motion latent representation and predicts per-frame vertex trajectories to recover complete, temporally coherent geometry. A scalable frame-wise transformer further enables robustness to varying sequence lengths. Evaluations on both standard benchmarks and a new dataset with accurate ground-truth geometry show that Motion 3-to-4 delivers superior fidelity and spatial consistency compared to prior work. Project page is available at https://motion3-to-4.github.io/.
TLDR: Motion 3-to-4 is a framework for 4D dynamic object synthesis from a monocular video and an optional 3D reference mesh, decomposing the problem into static 3D shape generation and motion reconstruction for improved fidelity and spatial consistency.
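Sketch: the decomposition described above keeps one canonical mesh and predicts where each vertex is in every frame. The snippet below shows the data layout this implies, with fixed connectivity and moving vertices. Representing a trajectory as a per-frame displacement from the canonical position is an assumption for illustration, as is the function name `animate_mesh`; the abstract does not state how trajectories are parameterized.

```python
def animate_mesh(canonical_vertices, faces, displacements_per_frame):
    """Return one (vertices, faces) pair per frame.

    Faces are shared across frames; only vertex positions change, which keeps
    the topology consistent over time.
    """
    frames = []
    for disp in displacements_per_frame:
        if len(disp) != len(canonical_vertices):
            raise ValueError("one displacement per vertex is required")
        verts = [
            tuple(c + d for c, d in zip(v, dv))
            for v, dv in zip(canonical_vertices, disp)
        ]
        frames.append((verts, faces))
    return frames


if __name__ == "__main__":
    # A single triangle that translates by one unit along x in the second frame.
    canonical = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    faces = [(0, 1, 2)]
    still = [(0.0, 0.0, 0.0)] * 3
    shifted = [(1.0, 0.0, 0.0)] * 3
    for verts, _ in animate_mesh(canonical, faces, [still, shifted]):
        print(verts)
```

Since the number of frames only sets the length of the outer list, the same layout accommodates sequences of any length.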