Jul 24, 2023
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
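To make the bridging idea concrete, here is a minimal PyTorch sketch of the role the Querying Transformer plays: a small set of learned queries cross-attends to frozen image features and is projected into the frozen LLM's embedding space, so only the bridge is trainable. All dimensions, module names, and the single cross-attention layer are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of BLIP-2's bridging idea (assumed sizes, not the real model).
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    """Learned queries cross-attend to frozen image features, then are
    projected into the frozen LLM's embedding space (dims are assumptions)."""
    def __init__(self, num_queries=32, d_qformer=768, d_image=1024, d_llm=2048):
        super().__init__()
        # Learned query tokens, shared across all images.
        self.queries = nn.Parameter(torch.randn(1, num_queries, d_qformer))
        # Stand-in for the Querying Transformer: one cross-attention layer;
        # the real Q-Former stacks several transformer blocks.
        self.cross_attn = nn.MultiheadAttention(
            d_qformer, num_heads=8, kdim=d_image, vdim=d_image, batch_first=True
        )
        # Linear projection into the frozen LLM's token-embedding space.
        self.proj = nn.Linear(d_qformer, d_llm)

    def forward(self, image_feats):
        # image_feats: (B, num_patches, d_image) from a frozen image encoder.
        q = self.queries.expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        # (B, num_queries, d_llm): soft visual prompts prepended to the
        # LLM's text embeddings during generative pre-training.
        return self.proj(out)

# Usage: only the bridge is trainable; encoder and LLM stay frozen.
bridge = QFormerBridge()
image_feats = torch.randn(2, 257, 1024)   # assumed frozen ViT output shape
visual_prompts = bridge(image_feats)
print(visual_prompts.shape)               # torch.Size([2, 32, 2048])
```

Freezing both the image encoder and the LLM is what keeps the trainable parameter count small relative to end-to-end models such as Flamingo80B.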
Presentations on a similar topic, category, or speaker
Rahul Ramesh, …
Xiaoyu Tan, …