ML//Multimodal//vision transformer

- ViT: chop image into 16×16 patches, treat each patch as a token, run through a standard transformer


ViT: chop image into 16×16 patches, treat each patch as a token, run through a standard transformer

No convolutions. Surprisingly competitive with CNNs given enough data.

The moment transformers started eating computer vision too.