ArXiv CS.CV Papers (Image/Video Generation)

WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.

TLDR: The paper introduces WorldCam, a novel approach for interactive 3D gaming worlds using camera pose as a unifying geometric representation to improve action control and long-horizon 3D consistency, along with a new large-scale dataset.

TLDR: 该论文介绍了 WorldCam，一种新颖的交互式 3D 游戏世界方法，使用相机姿态作为统一的几何表示，以提高动作控制和长期 3D 一致性，并提供了一个新的大型数据集。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou

Demystifing Video Reasoning

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

TLDR: This paper challenges the sequential frame-based reasoning assumption in diffusion-based video models, proposing a "Chain-of-Steps" mechanism where reasoning emerges during diffusion denoising. It also identifies emergent reasoning behaviors such as working memory and self-correction, and offers a training-free improvement strategy.

TLDR: 本文挑战了基于扩散的视频模型中基于顺序帧的推理假设，提出了一种“步骤链”机制，即推理在扩散去噪过程中产生。它还识别了诸如工作记忆和自我纠正等新兴推理行为，并提供一个无需训练的改进策略。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang

V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.

TLDR: The paper introduces V-Co, a systematic study of visual co-denoising for pixel-space diffusion models, identifying four key ingredients for effective representation alignment and demonstrating improved performance on ImageNet-256 with fewer training epochs.

TLDR: 该论文介绍了V-Co，一个针对像素空间扩散模型的可视化协同去噪的系统研究，识别了有效表征对齐的四个关键要素，并在ImageNet-256上展示了更少的训练周期即可提高的性能。

Relevance: (8/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Han Lin, Xichen Pan, Zun Wang, Yue Zhang, Chu Wang, Jaemin Cho, Mohit Bansal

Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search

We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.

TLDR: Search2Motion is a training-free framework for object-level motion editing in image-to-video generation using attention consensus search to guide motion and introducing metrics to isolate object-related artifacts in generated videos.

TLDR: Search2Motion是一个免训练的框架，用于图像到视频生成中物体级别的运动编辑。它使用注意力共识搜索来引导运动，并引入指标来隔离生成视频中与物体相关的伪影。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Sainan Liu, Tz-Ying Wu, Hector A Valdez, Subarna Tripathi

DermaFlux: Synthetic Skin Lesion Generation with Rectified Flows for Enhanced Image Classification

Despite recent advances in deep generative modeling, skin lesion classification systems remain constrained by the limited availability of large, diverse, and well-annotated clinical datasets, resulting in class imbalance between benign and malignant lesions and consequently reduced generalization performance. We introduce DermaFlux, a rectified flow-based text-to-image generative framework that synthesizes clinically grounded skin lesion images from natural language descriptions of dermatological attributes. Built upon Flux.1, DermaFlux is fine-tuned using parameter-efficient Low-Rank Adaptation (LoRA) on a large curated collection of publicly available clinical image datasets. We construct image-text pairs using synthetic textual captions generated by Llama 3.2, following established dermatological criteria including lesion asymmetry, border irregularity, and color variation. Extensive experiments demonstrate that DermaFlux generates diverse and clinically meaningful dermatology images that improve binary classification performance by up to 6% when augmenting small real-world datasets, and by up to 9% when classifiers are trained on DermaFlux-generated synthetic images rather than diffusion-based synthetic images. Our ImageNet-pretrained ViT fine-tuned with only 2,500 real images and 4,375 DermaFlux-generated samples achieves 78.04% binary classification accuracy and an AUC of 0.859, surpassing the next best dermatology model by 8%.

TLDR: The paper introduces DermaFlux, a rectified flow-based text-to-image generative framework for synthesizing skin lesion images from text descriptions to improve skin lesion classification, achieving significant performance gains over existing methods.

TLDR: 该论文介绍了DermaFlux，一个基于修正流的文本到图像生成框架，用于从文本描述中合成皮肤病变图像，以提高皮肤病变分类的性能，相比现有方法取得了显著的性能提升。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Stathis Galanakis, Alexandros Koliousis, Stefanos Zafeiriou

Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation

Visual generative models based on latent space have achieved great success, underscoring the significance of visual tokenization. Mapping images to latents boosts efficiency and enables multimodal alignment for scaling up in downstream tasks. Existing visual tokenizers primarily map images into fixed 2D spatial grids and focus on pixel-level restoration, which hinders the capture of representations with compact global semantics. To address these issues, we propose \textbf{SemTok}, a semantic one-dimensional tokenizer that compresses 2D images into 1D discrete tokens with high-level semantics. SemTok sets a new state-of-the-art in image reconstruction, achieving superior fidelity with a remarkably compact token representation. This is achieved via a synergistic framework with three key innovations: a 2D-to-1D tokenization scheme, a semantic alignment constraint, and a two-stage generative training strategy. Building on SemTok, we construct a masked autoregressive generation framework, which yields notable improvements in downstream image generation tasks. Experiments confirm the effectiveness of our semantic 1D tokenization. Our code will be open-sourced.

TLDR: The paper introduces SemTok, a novel semantic one-dimensional tokenizer for images, which achieves state-of-the-art image reconstruction and improves downstream image generation tasks by producing compact token representations with high-level semantics.

TLDR: 本文介绍了一种新的语义一维图像标记器 SemTok，它通过生成具有高级语义的紧凑token表示，实现了最先进的图像重建，并改进了下游图像生成任务。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Yunpeng Qu, Kaidong Zhang, Yukang Ding, Ying Chen, Jian Wang

VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment

Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach conducts error computation in a pointwise fashion, yielding a more physically grounded and robust error metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences to improve robustness. We apply this reward model to align video diffusion models through two complementary pathways: post-training of a bidirectional model via SFT or Reinforcement Learning and inference-time optimization of a Causal Video Model (e.g., Streaming video generator) via test-time scaling with our reward as a path verifier. Experimental results validate the effectiveness of our design, demonstrating that our geometry-based reward provides superior robustness compared to other variants. By enabling efficient inference-time scaling, our method offers a practical solution for enhancing open-source video models without requiring extensive computational resources for retraining.

TLDR: The paper introduces a geometry-based reward model called VIGOR to improve the consistency of generated videos by addressing object deformation and spatial drift issues in video diffusion models. It utilizes cross-frame reprojection error and geometry-aware sampling to enhance robustness and can be applied during post-training or inference-time optimization.

TLDR: 该论文介绍了一种基于几何的奖励模型 VIGOR，通过解决视频扩散模型中的对象变形和空间漂移问题来提高生成视频的一致性。它利用跨帧反投影误差和几何感知采样来增强鲁棒性，并且可以在后训练或推理时优化期间应用。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Tengjiao Yin, Jinglei Shi, Heng Guo, Xi Wang

Leveling3D: Leveling Up 3D Reconstruction with Feed-Forward 3D Gaussian Splatting and Geometry-Aware Generation

Feed-forward 3D reconstruction has revolutionized 3D vision, providing a powerful baseline for downstream tasks such as novel-view synthesis with 3D Gaussian Splatting. Previous works explore fixing the corrupted rendering results with a diffusion model. However, they lack geometric concern and fail at filling the missing area on the extrapolated view. In this work, we introduce Leveling3D, a novel pipeline that integrates feed-forward 3D reconstruction with geometrical-consistent generation to enable holistic simultaneous reconstruction and generation. We propose a geometry-aware leveling adapter, a lightweight technique that aligns internal knowledge in the diffusion model with the geometry prior from the feed-forward model. The leveling adapter enables generation on the artifact area of the extrapolated novel views caused by underconstrained regions of the 3D representation. Specifically, to learn a more diverse distributed generation, we introduce the palette filtering strategy for training, and a test-time masking refinement to prevent messy boundaries along the fixing regions. More importantly, the enhanced extrapolated novel views from Leveling3D could be used as the inputs for feed-forward 3DGS, leveling up the 3D reconstruction. We achieve SOTA performance on public datasets, including tasks such as novel-view synthesis and depth estimation.

TLDR: The paper introduces Leveling3D, a pipeline combining feed-forward 3D reconstruction with geometry-aware generation via a leveling adapter to improve novel-view synthesis and 3D reconstruction, achieving SOTA results.

TLDR: 本文介绍了一种名为Leveling3D的新流程，它结合了前馈3D重建和几何感知生成技术，通过一个leveling adapter来改进新视角合成和3D重建，并实现了SOTA性能。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Yiming Huang, Baixiang Huang, Beilei Cui, Chi Kit Ng, Long Bai, Hongliang Ren

Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Unified Multimodal Models (UMMs) are often constrained by the pre-training of their $\textbf{visual generation components}$, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for $\textbf{UMM visual generation}$ and identify these two issues as the major bottlenecks. To address them, we propose $\textbf{Image-Only Training for UMMs (IOMM)}$, a data-efficient two-stage training framework. The first stage pre-trains the visual generative component $\textbf{exclusively}$ using abundant unlabeled image-only data, thereby removing the dependency on paired data $\textbf{for this costly phase}$. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only $\sim \textbf{1050}$ H800 GPU hours (with the vast majority, $\textbf{1000}$ hours, dedicated to the efficient $\textbf{image-only pre-training stage}$). It achieves $\textbf{0.89}$ on GenEval and $\textbf{0.55}$ on WISE--surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50). Code is available $\href{https://github.com/LINs-lab/IOMM}{https://github.com/LINs-lab/IOMM}$.

TLDR: The paper introduces IOMM, a two-stage training framework for UMM visual generation that utilizes image-only pre-training to improve efficiency and performance, achieving SOTA results with significantly reduced training time and computational cost.

TLDR: 该论文提出了IOMM，一个用于UMM视觉生成的两阶段训练框架，其利用仅图像预训练来提高效率和性能，以显著减少的训练时间和计算成本实现了SOTA结果。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Peng Sun, Jun Xie, Tao Lin

OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE); it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train-inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.

TLDR: OneWorld introduces a diffusion-based 3D scene generation framework using a 3D unified representation autoencoder, achieving superior cross-view consistency compared to 2D-based methods. It addresses limitations in existing 2D latent space approaches by operating directly in 3D.

TLDR: OneWorld 提出了一个基于扩散的 3D 场景生成框架，使用 3D 统一表示自动编码器，与基于 2D 的方法相比，实现了卓越的跨视图一致性。它通过直接在 3D 中操作，解决了现有 2D 潜在空间方法的局限性。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Sensen Gao, Zhaoqing Wang, Qihang Cao, Dongdong Yu, Changhu Wang, Tongliang Liu, Mingming Gong, Jiawang Bian

LICA: Layered Image Composition Annotations for Graphic Design Research

We introduce LICA (Layered Image Composition Annotations), a large-scale dataset of 1,550,244 multi-layer graphic design compositions designed to advance structured understanding and generation of graphic layouts1. In addition to ren- dered PNG images, LICA represents each design as a hierarchical composition of typed components including text, image, vector, and group elements, each paired with rich per-element metadata such as spatial geometry, typographic attributes, opacity, and visibility. The dataset spans 20 design categories and 971,850 unique templates, providing broad coverage of real-world design structures. We further introduce graphic design video as a new and largely unexplored challenge for current vision-language models through 27,261 animated layouts annotated with per-component keyframes and motion parameters. Beyond scale, LICA establishes a new paradigm of research tasks for graphic design, enabling structured investiga- tions into problems such as layer-aware inpainting, structured layout generation, controlled design editing, and temporally-aware generative modeling. By repre- senting design as a system of compositional layers and relationships, the dataset supports research on models that operate directly on design structure rather than pixels alone.

TLDR: The paper introduces LICA, a large-scale dataset of layered graphic design compositions with rich metadata, enabling research on structured graphic design understanding, generation, and editing, including a novel graphic design video component.

TLDR: 该论文介绍了LICA，一个大规模的分层图形设计组合数据集，包含丰富的元数据，从而可以研究结构化图形设计理解、生成和编辑，其中包括一种新型的图形设计视频组件。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Elad Hirsch, Shubham Yadav, Mohit Garg, Purvanshi Mehta

Diffusion Models for Joint Audio-Video Generation

Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I explore four key contributions to advance this field. First, I release two high-quality, paired audio-video datasets. The datasets consisting on 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on our datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a sequential two-step text-to-audio-video generation pipeline: first generating video, then conditioning on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach yields high-fidelity generations of audio video generation.

TLDR: This paper introduces two new audio-video datasets and explores diffusion models for joint audio-video generation, including a sequential text-to-audio-video pipeline that generates high-fidelity results.

TLDR: 本文介绍了两个新的音频-视频数据集，并探索了扩散模型在联合音频-视频生成方面的应用，包括一个顺序的文本到音频-视频的生成流程，可生成高保真结果。

Relevance: (9/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Alejandro Paredes La Torre

Feed-forward Gaussian Registration for Head Avatar Creation and Editing

We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatar methods require time-consuming head tracking followed by expensive avatar optimization, often resulting in a total creation time of more than one day. MATCH, in contrast, directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in just 0.5 seconds per frame, without requiring data preprocessing. The learned intra-subject correspondence across frames enables fast creation of personalized head avatars, while correspondence across subjects supports applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We establish these correspondences end-to-end using a transformer-based model that predicts Gaussian splat textures in the fixed UV layout of a template mesh. To achieve this, we introduce a novel registration-guided attention block, where each UV-map token attends exclusively to image tokens depicting its corresponding mesh region. This design improves efficiency and performance compared to dense cross-view attention. MATCH outperforms existing methods in novel-view synthesis, geometry registration, and head avatar generation, while making avatar creation 10 times faster than the closest competing baseline. The code and model weights are available on the project website.

TLDR: The paper introduces MATCH, a fast multi-view Gaussian registration method using a transformer-based architecture for high-quality, editable head avatar creation, achieving 10x speedup over existing solutions and enabling various applications like expression transfer and identity interpolation.

TLDR: 本文介绍了MATCH，一种基于Transformer的高效多视图高斯注册方法，用于创建可编辑的高质量头部头像，相比现有方案速度提升10倍，并支持表情迁移和身份插值等多种应用。

Relevance: (8/10)

Novelty: (9/10)

Clarity: (9/10)

Potential Impact: (8/10)

Overall: (8/10)

Read Paper (PDF)

Authors: Malte Prinzler, Paulo Gotardo, Siyu Tang, Timo Bolkart

Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLMReward Models

Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy's prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on the VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming the methods trained on four times the data. Ablations show Proxy-SFT is a stronger verifier than Proxy-RL, and implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at https://github.com/Qwen-Applications/Proxy-GRM.

TLDR: The paper introduces Proxy-GRM, a method that uses proxy agents to verify and improve the quality of rubrics generated by vision-language reward models, leading to state-of-the-art performance and improved transferability.

TLDR: 该论文介绍了Proxy-GRM，一种使用代理来验证和提高视觉语言奖励模型生成的评分规则质量的方法，从而达到最先进的性能并提高可迁移性。

Relevance: (6/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Weijie Qiu, Dai Guan, Junxin Wang, Zhihang Li, Yongbo Gai, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

CompDiff: Hierarchical Compositional Diffusion for Fair and Zero-Shot Intersectional Medical Image Generation

Generative models are increasingly used to augment medical imaging datasets for fairer AI. Yet a key assumption often goes unexamined: that generators themselves produce equally high-quality images across demographic groups. Models trained on imbalanced data can inherit these imbalances, yielding degraded synthesis quality for rare subgroups and struggling with demographic intersections absent from training. We refer to this as the imbalanced generator problem. Existing remedies such as loss reweighting operate at the optimization level and provide limited benefit when training signal is scarce or absent for certain combinations. We propose CompDiff, a hierarchical compositional diffusion framework that addresses this problem at the representation level. A dedicated Hierarchical Conditioner Network (HCN) decomposes demographic conditioning, producing a demographic token concatenated with CLIP embeddings as cross-attention context. This structured factorization encourages parameter sharing across subgroups and supports compositional generalization to rare or unseen demographic intersections. Experiments on chest X-rays (MIMIC-CXR) and fundus images (FairGenMed) show that CompDiff compares favorably against both standard fine-tuning and FairDiffusion across image quality (FID: 64.3 vs. 75.1), subgroup equity (ES-FID), and zero-shot intersectional generalization (up to 21% FID improvement on held-out intersections). Downstream classifiers trained on CompDiff-generated data also show improved AUROC and reduced demographic bias, suggesting that architectural design of demographic conditioning is an important and underexplored factor in fair medical image generation. Code is available at https://anonymous.4open.science/r/CompDiff-6FE6.

TLDR: The paper introduces CompDiff, a hierarchical compositional diffusion framework for fair and zero-shot intersectional medical image generation, addressing the imbalanced generator problem by improving image quality and reducing demographic bias, especially for rare or unseen subgroups.

TLDR: 该论文提出了CompDiff，一个分层组合扩散框架，用于公平和零样本的交叉医学图像生成。它通过提高图像质量和减少人口统计偏差来解决不平衡的生成器问题，尤其是在罕见或未见过的子群体中。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Mahmoud Ibrahim, Bart Elen, Chang Sun, Gokhan Ertaylan, Michel Dumontier

Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning

With the rapid advancement of vision-language models, an increasing number of studies have explored their potential for SVG generation tasks. Although existing approaches improve performance by constructing large-scale SVG datasets and introducing SVG-specific tokens, they still suffer from limited generalization, redundant paths in code outputs, and a lack of explicit reasoning. In this work, we present CTRL-S (Chain-of-Thought Reinforcement Learning for SVG), a unified framework that introduces a chain-of-thought mechanism to explicitly expose the model's reasoning process during SVG generation. To support this structured reasoning, we construct SVG-Sophia, a high-quality dataset containing 145K samples across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks. By training the model to generate group-level structured SVG code, CTRL-S significantly improves structural coherence and visual fidelity. Furthermore, we adopt the GRPO algorithm and design a multi-reward optimization framework, incorporating DINO, image-text similarity, format, and code efficiency rewards. Through joint multi-reward optimization and multi-task training, our approach systematically enhances overall generation capabilities. Extensive experiments show that CTRL-S outperforms existing methods, achieving higher task success rates, superior SVG code quality, and exceptional visual fidelity.

TLDR: The paper introduces CTRL-S, a chain-of-thought reinforcement learning framework for SVG generation, using a new dataset (SVG-Sophia) and multi-reward optimization to improve SVG code quality and visual fidelity.

TLDR: 该论文介绍了CTRL-S，一个用于SVG生成的链式思考强化学习框架，它使用了一个新的数据集(SVG-Sophia)和多奖励优化来提高SVG代码质量和视觉保真度。

Relevance: (7/10)

Novelty: (8/10)

Clarity: (9/10)

Potential Impact: (7/10)

Overall: (7/10)

Read Paper (PDF)

Authors: Haomin Wang, Qi Wei, Qianli Ma, Shengyuan Ding, Jinhui Yin, Kai Chen, Hongjie Zhang

FlatLands: Generative Floormap Completion From a Single Egocentric View

A single egocentric image typically captures only a small portion of the floor, yet a complete metric traversability map of the surroundings would better serve applications such as indoor navigation. We introduce FlatLands, a dataset and benchmark for single-view bird's-eye view (BEV) floor completion. The dataset contains 270,575 observations from 17,656 real metric indoor scenes drawn from six existing datasets, with aligned observation, visibility, validity, and ground-truth BEV maps, and the benchmark includes both in- and out-of-distribution evaluation protocols. We compare training-free approaches, deterministic models, ensembles, and stochastic generative models. Finally, we instantiate the task as an end-to-end monocular RGB-to-floormaps pipeline. FlatLands provides a rigorous testbed for uncertainty-aware indoor mapping and generative completion for embodied navigation.

TLDR: The paper introduces FlatLands, a dataset and benchmark for generating complete traversability maps from a single egocentric view, aiming to improve indoor navigation, particularly through uncertainty-aware mapping and generative completion.

TLDR: 该论文介绍了FlatLands，一个用于从单视角图像生成完整可通行地图的数据集和基准，旨在通过不确定性感知地图绘制和生成式补全来改善室内导航。

Relevance: (5/10)

Novelty: (7/10)

Clarity: (9/10)

Potential Impact: (6/10)

Overall: (6/10)

Read Paper (PDF)

Authors: Subhransu S. Bhattacharjee, Dylan Campbell, Rahul Shome

AIGC Daily Papers

WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Demystifing Video Reasoning

V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search

DermaFlux: Synthetic Skin Lesion Generation with Rectified Flows for Enhanced Image Classification

Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation

VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment

Leveling3D: Leveling Up 3D Reconstruction with Feed-Forward 3D Gaussian Splatting and Geometry-Aware Generation

Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

LICA: Layered Image Composition Annotations for Graphic Design Research

Diffusion Models for Joint Audio-Video Generation

Feed-forward Gaussian Registration for Head Avatar Creation and Editing

Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLMReward Models

CompDiff: Hierarchical Compositional Diffusion for Fair and Zero-Shot Intersectional Medical Image Generation

Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning

FlatLands: Generative Floormap Completion From a Single Egocentric View