ML//Multimodal//BLIP

2026-03-06

- Bootstrapping Language-Image Pre-training (Salesforce). BLIP (2022) and BLIP-2 (2023)

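BLIP-2 innovation: Q-Former, a lightweight querying transformer bridging a frozen vision encoder to a frozen LLM. A small, fixed set of learnable query embeddings cross-attends to the image features; the query outputs are then passed to the LLM as a visual prompt.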
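A minimal sketch of the bridging idea, assuming a heavily simplified single-head cross-attention with no learned projections, no self-attention, and no text branch. It is not the BLIP-2 implementation. The names `cross_attend`, `random_vectors`, `NUM_QUERIES`, and `DIM` are hypothetical, and the sizes are toy values. The point it illustrates: the number of output vectors depends only on the number of queries, not on how many patch features the frozen vision encoder emits.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, image_feats):
    """Each query attends over all image features (scaled dot-product).

    Simplification: features serve as both keys and values, with no
    learned projection matrices.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in image_feats])
        # Weighted average of the image features for this query.
        outputs.append(
            [sum(w * f[i] for w, f in zip(weights, image_feats)) for i in range(dim)]
        )
    return outputs


def random_vectors(n, dim, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
DIM = 8          # toy feature width (assumption)
NUM_QUERIES = 4  # toy query count (assumption)

# In a real model the queries are trained parameters; here they are random.
queries = random_vectors(NUM_QUERIES, DIM, rng)

# Stand-ins for frozen vision-encoder outputs at two different resolutions.
few_patches = random_vectors(16, DIM, rng)
many_patches = random_vectors(256, DIM, rng)

summary_small = cross_attend(queries, few_patches)
summary_large = cross_attend(queries, many_patches)

# Both summaries have NUM_QUERIES vectors, whatever the patch count.
print(len(summary_small), len(summary_large))
```

The fixed-size output is what lets a small module sit between two large frozen models: the LLM always receives the same number of visual tokens.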

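Neither the vision encoder nor the LLM is fine-tuned. Only the Q-Former (plus a small projection layer into the LLM's input space) trains, so the trainable parameter count is a small fraction of the full model.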
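A rough back-of-the-envelope sketch of what "parameter-efficient" means here. The parameter counts below are rounded, approximate figures chosen for illustration (a roughly 1B-parameter vision encoder, a roughly 2.7B-parameter LLM, a Q-Former of under 200M); treat them as assumptions, not exact numbers for any specific checkpoint. `PARAMS` and `trainable_fraction` are hypothetical names.

```python
# (parameter count, is_trainable) per component.
# Rounded illustrative values, not exact figures for a released model.
PARAMS = {
    "vision_encoder": (1_000_000_000, False),  # frozen
    "q_former": (188_000_000, True),           # the only trained part
    "llm": (2_700_000_000, False),             # frozen
}


def trainable_fraction(components):
    """Share of all parameters that receive gradient updates."""
    total = sum(count for count, _ in components.values())
    trainable = sum(count for count, train in components.values() if train)
    return trainable / total


frac = trainable_fraction(PARAMS)
print(f"{frac:.1%} of parameters are trained")
```

Under these assumed sizes, only about one parameter in twenty is updated; the frozen components still run in the forward pass but need no optimizer state.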

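Reported stronger results than CLIP on some benchmarks (e.g. image-text retrieval), and also covers generative tasks such as captioning and VQA that CLIP's contrastive-only design does not. Related later work from Google: SigLIP, a CLIP-style contrastive model trained with a sigmoid loss, and PaliGemma, a vision-language model pairing a SigLIP image encoder with a Gemma language model. Neither is a direct successor to BLIP.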