ArXiv CS.CV Papers (Image/Video Generation)

Show-o2: Improved Native Unified Multimodal Models

This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
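
To make the two objectives named in the abstract concrete, here is a minimal NumPy sketch of how a next-token loss (language head) and a flow matching loss (flow head) might be combined. It is an illustration under stated assumptions, not the authors' implementation: the straight-line noise-to-data path, the function names, and the weight `alpha` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Mean cross-entropy for next-token prediction.
    logits: (T, V) array, targets: (T,) integer token ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def flow_matching_pair(x1, rng):
    """Sample (x_t, t, v_target) on a straight path from noise x0 to data x1.
    Assumption: linear interpolation path; the paper may use another schedule."""
    x0 = rng.standard_normal(x1.shape)
    # One time value per sample, broadcast over the remaining dimensions.
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0  # velocity of the straight path
    return x_t, t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

def combined_loss(logits, targets, v_pred, v_target, alpha=1.0):
    """Hypothetical weighted sum of the two objectives."""
    return next_token_loss(logits, targets) + alpha * flow_matching_loss(v_pred, v_target)
```

In a real model, `logits` would come from the language head and `v_pred` from the flow head evaluated at `(x_t, t)`; here both are plain arrays so the sketch stays self-contained.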

TLDR: Show-o2 introduces improved unified multimodal models using autoregressive modeling and flow matching within a 3D causal VAE framework, achieving versatility in text, image, and video tasks.

Relevance: (9/10)

Novelty: (7/10)

Clarity: (8/10)

Potential Impact: (8/10)

Overall: (8/10)

Authors: Jinheng Xie, Zhenheng Yang, Mike Zheng Shou
