ML//Multimodal

2026-02-15

Diffusion models (Stable Diffusion, DALL-E 3, Midjourney): generate images but use an architecture separate from the LLM. They don't "think" with images; they only produce them as output.

Native multimodal (GPT-4o, Gemini): process and generate text, audio, and images within a single architecture. The model "thinks" in multiple modalities simultaneously.

Hybrid/pipeline: an LLM acts as the brain, orchestrating specialized models (one for text, one for image, one for voice). These systems look native from the outside but are separate pieces glued together inside.
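
To make the hybrid/pipeline idea concrete, here is a minimal sketch of the routing pattern. Everything in it is hypothetical: `fake_llm_router` stands in for the orchestrating LLM (a keyword check, not a real model), and the three `fake_*_model` functions stand in for the specialized models. It only illustrates the "pieces glued together" shape, not how any particular product is built.

```python
from dataclasses import dataclass
from typing import Callable, Dict


# Hypothetical stand-ins for the specialized models. A real pipeline would
# call an image generator, a speech synthesizer, and so on.
def fake_text_model(prompt: str) -> str:
    return f"[text] {prompt}"


def fake_image_model(prompt: str) -> str:
    return f"[image] {prompt}"


def fake_voice_model(prompt: str) -> str:
    return f"[audio] {prompt}"


def fake_llm_router(request: str) -> str:
    """Stand-in for the orchestrating LLM: picks a modality by keyword."""
    lowered = request.lower()
    if "draw" in lowered or "image" in lowered:
        return "image"
    if "say" in lowered or "speak" in lowered:
        return "voice"
    return "text"


@dataclass
class Pipeline:
    """One entry point in front of several independent models."""
    router: Callable[[str], str]
    specialists: Dict[str, Callable[[str], str]]

    def handle(self, request: str) -> str:
        # The router decides which specialist runs; the caller never sees
        # that separate models sit behind the single interface.
        modality = self.router(request)
        return self.specialists[modality](request)


pipeline = Pipeline(
    router=fake_llm_router,
    specialists={
        "text": fake_text_model,
        "image": fake_image_model,
        "voice": fake_voice_model,
    },
)
```

Calling `pipeline.handle("Draw a cat")` goes to the image stand-in, while `pipeline.handle("Summarize this")` falls through to the text stand-in. The caller sees one interface, which is why such systems can look native from the outside.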