ML//Multimodal

Diffusion models (Stable Diffusion, DALL-E 3, Midjourney): generate images but are architecturally separate from the LLM. They don't "think" with images; they only produce them as output, typically from a text prompt.

Native multimodal (GPT-4o, Gemini): process and generate text, audio, and images within a single architecture, so the model "thinks" across several modalities at once instead of handing work off to separate systems.

Hybrid/pipeline: use an LLM as the brain that orchestrates specialized models (one for text, one for images, one for voice). These systems look native from the outside but are separate components glued together internally.
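
The hybrid/pipeline idea can be illustrated with a minimal sketch. Everything here is hypothetical: `text_model`, `image_model`, and `voice_model` are stub functions standing in for real specialized models, and `route` uses simple keyword matching as a stand-in for the decision an orchestrating LLM would make. It only shows the shape of the architecture, not how any real product is built.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Output:
    modality: str     # "text", "image", or "voice"
    payload: str      # placeholder for real text, pixels, or audio
    produced_by: str  # which specialized component generated it


# Hypothetical stand-ins for separate, specialized models.
def text_model(prompt: str) -> Output:
    return Output("text", f"answer to: {prompt}", "text_model")


def image_model(prompt: str) -> Output:
    return Output("image", f"<image for: {prompt}>", "image_model")


def voice_model(prompt: str) -> Output:
    return Output("voice", f"<audio for: {prompt}>", "voice_model")


SPECIALISTS: Dict[str, Callable[[str], Output]] = {
    "text": text_model,
    "image": image_model,
    "voice": voice_model,
}


def route(request: str) -> str:
    """Assumed stand-in for the orchestrating LLM: pick a modality by keyword."""
    lowered = request.lower()
    if lowered.startswith(("draw", "paint")):
        return "image"
    if lowered.startswith(("say", "read aloud")):
        return "voice"
    return "text"


def handle(request: str) -> Output:
    """Single entry point: looks native to the caller, dispatches internally."""
    modality = route(request)
    return SPECIALISTS[modality](request)


if __name__ == "__main__":
    for req in ("draw a cat", "say hello", "what is a diffusion model?"):
        out = handle(req)
        print(f"{req!r} -> {out.modality} via {out.produced_by}")
```

The caller only ever sees `handle`, which is why such a system can look native. The `produced_by` field makes the glue visible: each modality comes from a different component, whereas a native multimodal model would have no such dispatch step.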