15:30 - 16:15
Transformers, and architectures based on them such as BERT, GPT and T5, are the new stars of Natural Language Processing (NLP). With the advent of Vision Transformers (ViTs), it seems they will take image and video processing by storm as well.
More than enough reason to take a good look under the hood at what makes Transformers tick. In this session we will study Attention, the mechanism that powers all these Transformer variants, and learn the difference between Encoder, Decoder and Encoder-Decoder architectures.
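To give a flavour of the Attention mechanism the session covers, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)·V; the function name and toy shapes are illustrative, not from the talk itself.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

# Toy self-attention: 3 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (3, 4): one output vector per input token
```

Because the softmax weights in each row sum to one, every output token is a convex combination of the value vectors, which is what lets each token "attend" to the others.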
Finally, we will take a first glimpse at ViTs and see what changes are necessary to use a Transformer for image processing instead of NLP.
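The key change ViTs make is to turn an image into a sequence of tokens before feeding it to a standard Transformer. A minimal sketch of that step, splitting the image into non-overlapping patches and linearly projecting each one (the helper name and toy sizes are assumptions for illustration):

```python
import numpy as np

def image_to_patch_embeddings(img, patch, W):
    """Split an image into non-overlapping patches and project each
    flattened patch to an embedding vector, ViT-style (minimal sketch)."""
    H, Wd, C = img.shape
    patches = []
    for i in range(0, H, patch):
        for j in range(0, Wd, patch):
            patches.append(img[i:i + patch, j:j + patch].reshape(-1))
    patches = np.stack(patches)  # (num_patches, patch * patch * C)
    return patches @ W           # (num_patches, d_model)

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8, 3))      # toy 8x8 RGB "image"
W = rng.normal(size=(4 * 4 * 3, 16))  # learned projection in a real ViT
tokens = image_to_patch_embeddings(img, 4, W)
print(tokens.shape)  # (4, 16): four patch tokens, ready for attention
```

From this point on the model sees only a token sequence, so the same attention layers used for NLP apply unchanged (in a real ViT, plus position embeddings and a class token).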