ML//Multimodal//vision transformer

2021-06-12

- ViT: chop image into 16×16 patches, treat each patch as a token, run through a standard transformer

ViT: chop image into 16×16 patches, treat each patch as a token, run through a standard transformer

No convolutions. Surprisingly competitive with CNNs given enough data.

The moment transformers started eating computer vision too.