ML//Multimodal
- Diffusion models (Stable Diffusion, DALL-E 3, Midjourney): generate images with an architecture separate from the LLM. They don't "think" with images; they only produce them as output.
- Native multimodal models (GPT-4o, Gemini): process and generate text, audio, and images within a single architecture, so the model "thinks" in multiple modalities simultaneously.
- Hybrid/pipeline systems: use an LLM as the brain that orchestrates specialized models (one for text, one for image, one for voice). They look native from the outside but are separate pieces glued together inside.
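
To make the hybrid/pipeline idea concrete, here is a minimal sketch of an orchestrator that routes a request to one of several specialized models. Everything in it is an assumption for illustration: `plan`, `orchestrate`, and the three stub models are hypothetical names, and the keyword rules in `plan` stand in for the routing decision that a real LLM "brain" would make.

```python
from typing import Callable, Dict, Tuple


# Hypothetical stand-ins for the specialized models. A real pipeline would
# call an LLM, an image generator, and a text-to-speech model here.
def text_model(prompt: str) -> str:
    return f"[text] {prompt}"


def image_model(prompt: str) -> str:
    return f"[image] {prompt}"


def voice_model(prompt: str) -> str:
    return f"[audio] {prompt}"


SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "text": text_model,
    "image": image_model,
    "voice": voice_model,
}


def plan(user_message: str) -> Tuple[str, str]:
    """Pick a modality and a prompt for the specialist.

    Assumption: simple keyword rules replace the LLM that would normally
    make this decision.
    """
    lowered = user_message.lower()
    if lowered.startswith("draw "):
        return "image", user_message[len("draw "):]
    if lowered.startswith("say "):
        return "voice", user_message[len("say "):]
    return "text", user_message


def orchestrate(user_message: str) -> str:
    """Route the message to one specialist and return its output."""
    modality, prompt = plan(user_message)
    return SPECIALISTS[modality](prompt)


if __name__ == "__main__":
    print(orchestrate("draw a red fox"))
    print(orchestrate("say good morning"))
    print(orchestrate("what is a diffusion model?"))
```

From the caller's side there is a single `orchestrate` entry point, which is why such a system can look native. Internally each modality is handled by a separate component, and only text passes between them.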