ML//Multimodal//LLaVA

- Large Language and Vision Assistant: the "late fusion" approach to visual language models.


Large Language and Vision Assistant: the "late fusion" approach to visual language models.

Architecture: CLIP vision encoder + linear projection + LLaMA language model. Simple, cheap, effective.

Visual instruction tuning: fine-tune on image-text pairs where the text includes instructions.

Spawned a huge family: LLaVA-1.5, LLaVA-NeXT, plus countless variants. The "Alpaca moment" for multimodal.