Dec 10, 2023
Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of capabilities, hinting at the possibility of similarly versatile models in computer vision. In this paper, we take a step in this direction and propose an effective multi-modal pre-training scheme, called 4M. It consists of a single Transformer encoder-decoder trained with a masked modeling objective across a wide range of modalities, including text, images, geometric and semantic modalities, and neural network feature maps. 4M achieves efficient training and scalability by unifying the representation space of all modalities: it maps them into discrete tokens and performs multi-modal masked modeling on a small randomized subset of tokens. 4M exhibits several key capabilities: (1) it can perform a diverse set of vision tasks out of the box, (2) it excels when fine-tuned for unseen downstream tasks or new input modalities, and (3) it can function as a generative model conditioned on arbitrary modalities, enabling a wide variety of expressive multi-modal editing capabilities with unprecedented flexibility. Through comprehensive experimental analyses, we demonstrate 4M's potential as a versatile and scalable foundation model for vision tasks, setting the stage for further exploration and advancement in multi-modal learning for vision and other domains.
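The core masking idea in the abstract — treating every modality as a sequence of discrete tokens and sampling only a small random subset as inputs and targets — can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the function name `sample_multimodal_masks`, the fixed token budgets, and the example token ids are all assumptions made for the sketch.

```python
import random

def sample_multimodal_masks(modality_tokens, num_inputs, num_targets, seed=0):
    """Flatten all modality token sequences into (modality, position, token)
    triples, then draw small disjoint random subsets to serve as encoder
    inputs and decoder targets. Hypothetical sketch of budgeted multi-modal
    masking; the actual 4M sampling scheme may differ."""
    rng = random.Random(seed)
    flat = [
        (name, pos, tok)
        for name, toks in modality_tokens.items()
        for pos, tok in enumerate(toks)
    ]
    rng.shuffle(flat)
    inputs = flat[:num_inputs]                         # visible tokens
    targets = flat[num_inputs:num_inputs + num_targets]  # tokens to predict
    return inputs, targets

# Toy example: three "modalities" already tokenized to discrete ids.
tokens = {
    "rgb":     [11, 12, 13, 14],
    "depth":   [21, 22, 23, 24],
    "caption": [31, 32],
}
inp, tgt = sample_multimodal_masks(tokens, num_inputs=4, num_targets=3)
```

Because inputs and targets are drawn without replacement from the same pool, the model only ever attends to a fixed, small token budget regardless of how many modalities are present, which is what makes this style of training scale.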