ML//Multimodal//vision transformer
- ViT: chop image into 16×16 patches, treat each patch as a token, run through a standard transformer
ViT: chop image into 16×16 patches, treat each patch as a token, run through a standard transformer
No convolutions. Surprisingly competitive with CNNs given enough data.
The moment transformers started eating computer vision too.