Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee · Visual Instruction Tuning · SlidesLive

Kategorien

DE

Anmelden Kostenvoranschlag

Visual Instruction Tuning

Dez 15, 2023

Sprecher:innen

Über

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has been shown to improve zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. To facilitate future research on visual instruction following, we construct two evaluation benchmarks with diverse and challenging application-oriented tasks. Our experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields a 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model, and code publicly available.

Organisator

Präsentation speichern

Soll diese Präsentation für 1000 Jahre gespeichert werden?

Wie speichern wir Präsentationen?

Freigeben

Empfohlene Videos

Präsentationen, deren Thema, Kategorie oder Sprecher:in ähnlich sind