4M: Massively Multimodal Masked Modeling

Dec 10, 2023

About

Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of capabilities, hinting at the possibility of similarly versatile models in computer vision. In this paper, we take a step in this direction and propose an effective multi-modal pre-training scheme, called 4M. It is a single Transformer encoder-decoder trained using a masked modeling objective across a wide range of modalities – including text, images, geometric, and semantic modalities, as well as neural network feature maps. 4M achieves efficient training and scalability by unifying the representation space of all modalities through mapping them into discrete tokens and performing multi-modal masked modeling on a small randomized subset of tokens. 4M exhibits several key capabilities: (1) it can perform a diverse set of vision tasks out of the box, (2) it excels when fine-tuned for unseen downstream tasks or new input modalities, and (3) it can function as a generative model that can be conditioned on arbitrary modalities, enabling a wide variety of expressive multi-modal editing capabilities with unprecedented flexibility. Through comprehensive experimental analyses, we demonstrate 4M's potential as a versatile and scalable foundation model for vision tasks, setting the stage for further exploration and advancement in multi-modal learning for vision and other domains.
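To make the core idea concrete, here is a minimal, hypothetical Python sketch of the kind of randomized token-subset sampling the abstract describes: tokens from several modalities share one budget, and a small random subset is drawn as encoder inputs and another as decoder targets. The modality names, budget sizes, and flat sampling scheme are illustrative assumptions, not details taken from the talk.

```python
# Hypothetical sketch of 4M-style multi-modal masked modeling sampling.
# Modality names and budgets below are illustrative assumptions.
import random


def sample_masked_modeling_batch(modality_tokens, num_inputs=16, num_targets=16, seed=None):
    """Draw a small randomized subset of tokens as encoder inputs and a
    disjoint subset as decoder targets, shared across all modalities.

    modality_tokens: dict mapping modality name -> list of discrete token ids.
    Returns (inputs, targets), each a dict modality -> list of (position, token).
    """
    rng = random.Random(seed)
    # Flatten to (modality, position, token) triples so the token budget is
    # shared across modalities rather than fixed per modality.
    flat = [(m, i, t) for m, toks in modality_tokens.items() for i, t in enumerate(toks)]
    rng.shuffle(flat)

    chosen_inputs = flat[:num_inputs]
    chosen_targets = flat[num_inputs:num_inputs + num_targets]

    def group(triples):
        grouped = {m: [] for m in modality_tokens}
        for m, i, t in triples:
            grouped[m].append((i, t))
        return grouped

    return group(chosen_inputs), group(chosen_targets)


if __name__ == "__main__":
    # Toy token streams standing in for tokenized modalities.
    batch = {
        "rgb_tokens": list(range(256)),
        "caption_tokens": list(range(32)),
        "depth_tokens": list(range(256)),
    }
    inputs, targets = sample_masked_modeling_batch(batch, seed=0)
    print({m: len(v) for m, v in inputs.items()})
    print({m: len(v) for m, v in targets.items()})
```

In a full pre-training setup, the encoder would attend only to the sampled input tokens and the decoder would predict the sampled target tokens, which keeps the cost of each training step small regardless of how many modalities are present.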
