Attention and transformers: a curated guide

A hand-selected list of resources to learn about attention and transformers.

When trying to learn something new, I am often overwhelmed by the sheer volume of information that the first web search returns. Hundreds of blog posts, videos, research papers, slides: how can one make sense of it all? For me, the process starts with searching and weighing, where I click through links, skim each one, and eventually select a few good resources to study in detail.

So instead of writing my own “What is Attention?” post, which would undoubtedly add clutter to the knowledge base rather than anything substantial, I have organized a list of excellent resources made by others. These resources were especially helpful in my own learning journey. I hope that this guide will help people who, like me, feel dizzy at the sheer number of links related to transformers, with no clue where to begin.

This list is:

  1. Ordered: from high-level to detailed (basically simple to advanced). But you should not feel the need to complete each step in its entirety before moving on to the next.
  2. Minimal: composed of materials that each add to the learning process in largely non-overlapping ways.
  3. Beginner-friendly: made by a fellow beginner 👼.

1. Andrew Ng’s videos (30 minutes)

Provides an approachable, high-level explanation of attention, which is the fundamental building block of transformers.

For background into recurrent neural networks (RNNs), I recommend a chapter in the Deep Learning Book (

2. Lilian Weng’s blog post (1 hour)

The attention mechanism originally proposed by Bahdanau et al. in 2014 was refined through a series of works until it reached the form used in transformers. This post provides a very nice summary of that progress. It has also been updated multiple times to include recent papers.

3. The Transformer paper (1 hour)

Attention Is All You Need (Vaswani et al., 2017) is the paper that first proposed transformers. The writing is clear and approachable, especially with regard to the motivation and philosophy behind the discovery. However, when explaining how exactly the neural network was designed and how it works, the paper is quite succinct. But don’t worry, the next few links will help us out.

4. Pascal Poupart’s lecture at the University of Waterloo (1.5 hours)

The previous materials are quite high-level in that they do not go into much detail on how self-attention and multi-head attention actually work, i.e., which vectors and matrices are involved, which operations take place between them, and so on. This lecture elaborates upon these details.
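To make "vectors and matrices" a bit more concrete, here is a minimal sketch of single-head scaled dot-product self-attention, following the formula softmax(QK^T / sqrt(d_k))V from the transformer paper. It is not taken from the lecture. The function names, the toy input, and the identity projection matrices are all my own assumptions, and it uses plain Python lists rather than a tensor library so that every operation is visible.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x: (n_tokens, d_model) token vectors.
    w_q, w_k, w_v: (d_model, d_k) projection matrices (hypothetical values here).
    Returns (output, weights), where weights[i][j] is how much token i attends to token j.
    """
    q = matmul(x, w_q)  # queries
    k = matmul(x, w_k)  # keys
    v = matmul(x, w_v)  # values
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # (n_tokens, n_tokens) dot products
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy example: 3 tokens with d_model = 2 and identity projections (assumed values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors. Multi-head attention, roughly speaking, runs several of these in parallel with different learned projections and concatenates the results.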

5. The Annotated Transformer by the Harvard NLP group (4 hours)

An invaluable resource that implements a full working transformer in PyTorch. Code snippets accompany the corresponding paragraphs from the original paper. You can try out the model with a single GPU. Note that the PyTorch version used in the notebook is outdated, so you will have to make a few modifications if you use a recent version.

For background on residual connections, refer to Andrew Ng’s

6. Amirhossein Kazemnejad’s post on positional encodings (30 minutes)

Explains the sinusoidal positional encodings used in the transformer model. I found the analogy to binary representations of integers to be especially helpful.
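As a rough illustration, the sinusoidal encoding from the transformer paper can be sketched in a few lines of plain Python. This is my own sketch, not code from the post. The function name `positional_encoding` and the small sizes are assumptions, and `d_model` is assumed to be even. The loop at the end prints each position's binary form next to its encoding so the two can be compared side by side.

```python
import math

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encodings (sketch; assumes d_model is even).

    Even dimensions hold sin(pos / 10000^(i / d_model)) and the following
    odd dimension holds the cosine of the same angle, for i = 0, 2, 4, ...
    """
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(n_positions=8, d_model=4)

# Print binary position next to its encoding for comparison.
for pos, row in enumerate(pe):
    print(format(pos, "03b"), [round(value, 3) for value in row])
```

In binary, the lowest bit flips at every step while higher bits flip more slowly. Similarly, the first sine/cosine pair changes quickly with position while later pairs change slowly.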

7. Ashish Vaswani’s guest lecture at Stanford CS224N (1 hour)

As often happens, this lecture by one of the original authors of the paper is not easy to understand. However, I found that it provides interesting insights into the thinking behind the research process.

Concluding remarks

You might or might not have noticed that this post itself can be viewed as an analogy for the attention mechanism. Meta! 🤔