ML//Multimodal//BLIP

- BLIP (Bootstrapping Language-Image Pre-training) is a family of vision-language models from Salesforce: BLIP (2022) and BLIP-2 (2023).


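BLIP-2 innovation: the Q-Former (Querying Transformer), a lightweight transformer that bridges a frozen vision encoder and a frozen LLM. A small, fixed set of learned query tokens cross-attends to the image features, and the resulting query outputs are passed to the LLM as input embeddings.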
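The bridging idea can be illustrated with a toy, pure-Python sketch of a single cross-attention step. This is not the BLIP-2 implementation: the names `cross_attend`, `queries`, and `image_feats` are hypothetical, the sizes are arbitrary, and the learned projection matrices, self-attention, feed-forward layers, and multi-layer stacking of a real Q-Former are all omitted. It only shows that a fixed number of queries yields a fixed number of output vectors regardless of how many image patches come in.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, image_feats):
    """Single-head cross-attention with identity projections (a simplification).

    Each query vector attends over all image patch features and returns
    a weighted average of them, so the output has one vector per query.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every patch.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in image_feats]
        weights = softmax(scores)
        # Weighted sum of patch features, one output dimension at a time.
        outputs.append([sum(w * k[i] for w, k in zip(weights, image_feats))
                        for i in range(d)])
    return outputs


rng = random.Random(0)
NUM_QUERIES, NUM_PATCHES, DIM = 4, 10, 8  # toy sizes, chosen arbitrarily

# In a real model the queries are trained parameters; here they are random.
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
# Stand-in for the output of a frozen vision encoder.
image_feats = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

llm_inputs = cross_attend(queries, image_feats)
print(len(llm_inputs), len(llm_inputs[0]))
```

The design point is the bottleneck: the LLM only ever sees `NUM_QUERIES` vectors, however large the image feature grid is.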

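Neither the vision encoder nor the LLM is fine-tuned; both stay frozen. Only the Q-Former, together with a small projection into the LLM's input space, is trained, so the trainable parameters are a small fraction of the full model.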
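To make the parameter-efficiency point concrete, here is a minimal sketch that computes the trainable share of a model with two frozen components and one trained bridge. The `Component` class and `trainable_fraction` helper are hypothetical, and the parameter counts are round, illustrative assumptions, not figures from this note or from any specific BLIP-2 checkpoint.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(components):
    """Share of all parameters that receive gradient updates."""
    total = sum(c.num_params for c in components)
    trainable = sum(c.num_params for c in components if c.trainable)
    return trainable / total


# Illustrative round numbers only (assumptions, not measured values).
components = [
    Component("vision_encoder", 1_000_000_000, trainable=False),  # frozen
    Component("q_former", 200_000_000, trainable=True),           # trained
    Component("llm", 3_000_000_000, trainable=False),             # frozen
]

fraction = trainable_fraction(components)
print(f"trainable share: {fraction:.1%}")
```

Under these assumed sizes, only the bridge is updated, which is why swapping in a larger frozen LLM lowers the trainable share further instead of raising the training cost of the bridge itself.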

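Outperformed CLIP on many vision-language tasks, especially generative ones such as captioning and VQA, which CLIP's contrastive-only design does not handle directly. Related later work from Google: SigLIP, a CLIP-style contrastive encoder trained with a sigmoid loss, and PaliGemma, a vision-language model that pairs a SigLIP image encoder with a Gemma language model.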