ML//Multimodal//LLaVA
- Large Language and Vision Assistant: the "late fusion" approach to visual language models.
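Architecture: CLIP ViT-L/14 vision encoder + linear projection + Vicuna (a LLaMA fine-tune) as the language model. The projection maps image patch features into the LLM's token-embedding space; LLaVA-1.5 later replaced it with a two-layer MLP. Simple, cheap, effective.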
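A minimal sketch of the connector idea, in plain Python with toy sizes: each patch feature from the vision encoder is linearly projected to the language model's embedding width, and the resulting "visual tokens" are placed in front of the text token embeddings. The function names, dimensions, and random stand-in features are all hypothetical; a real model would use CLIP outputs, learned weights, and a tensor library.

```python
import random


def linear_project(patch_features, weight, bias):
    """Map each vision feature vector (length d_vision) to length d_model.

    weight has d_model rows, each of length d_vision; bias has length d_model.
    """
    return [
        [sum(x * w for x, w in zip(feat, row)) + b for row, b in zip(weight, bias)]
        for feat in patch_features
    ]


def build_llm_input(patch_features, text_embeddings, weight, bias):
    """Prepend projected visual tokens to the text token embeddings."""
    visual_tokens = linear_project(patch_features, weight, bias)
    return visual_tokens + text_embeddings


# Toy sizes (hypothetical, far smaller than any real model).
rng = random.Random(0)
D_VISION, D_MODEL, N_PATCHES, N_TEXT = 4, 6, 3, 2

weight = [[rng.gauss(0, 0.1) for _ in range(D_VISION)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

# Stand-ins for the vision encoder output and the LLM's text embeddings.
patch_features = [[rng.random() for _ in range(D_VISION)] for _ in range(N_PATCHES)]
text_embeddings = [[rng.random() for _ in range(D_MODEL)] for _ in range(N_TEXT)]

sequence = build_llm_input(patch_features, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 5 tokens, each of width 6
```

The language model then processes this combined sequence like any other token sequence, which is why the connector can stay so small.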
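Visual instruction tuning: fine-tune on image-text pairs where the text is an instruction-following conversation about the image (the original data was generated with GPT-4 from captions and bounding boxes). Training is two-stage: first train only the projection on image-caption pairs, then fine-tune the projection and the LLM on the instruction data, keeping the vision encoder frozen.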
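A sketch of how one instruction-tuning sample might be turned into training targets, assuming the common practice of computing the loss only on the assistant's reply. The whitespace tokenizer, the `<image>` placeholder handling, and the sample text are hypothetical; `-100` is the ignore value conventionally used with PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # conventional "do not compute loss here" label


def build_labels(turns, tokenize):
    """Return (input_ids, labels) for a list of (role, text) turns.

    Assistant tokens keep their ids as labels; every other position is masked.
    """
    input_ids, labels = [], []
    for role, text in turns:
        ids = tokenize(text)
        input_ids.extend(ids)
        labels.extend(ids if role == "assistant" else [IGNORE_INDEX] * len(ids))
    return input_ids, labels


# Toy whitespace tokenizer with a growing vocabulary (hypothetical).
vocab = {}


def toy_tokenize(text):
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]


# "<image>" marks where the projected visual tokens would be spliced in.
sample = [
    ("user", "<image> What is in the picture?"),
    ("assistant", "A dog on a beach."),
]

input_ids, labels = build_labels(sample, toy_tokenize)
print(input_ids)
print(labels)
```

Masking the user turn means the model is trained to answer instructions about the image rather than to reproduce the prompt.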
Spawned a huge family: LLaVA-1.5, LLaVA-NeXT, plus countless variants. The "Alpaca moment" for multimodal models.