Attention

Query - what I am interested in

Key - what I can offer

Value - what I give you

In general, attention is a communication mechanism between nodes. 
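
A toy sketch of this communication (names and sizes chosen only for illustration): one node broadcasts a query, every node answers with a key, and the match between them decides how much of each value gets passed back.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 4                                       # toy embedding dimension
q      = np.random.randn(d)                 # query: what this node is interested in
keys   = np.random.randn(5, d)              # keys: what each of 5 other nodes can offer
values = np.random.randn(5, d)              # values: what each node actually gives

weights = softmax(keys @ q / np.sqrt(d))    # relevance of each node to the query, sums to 1
message = weights @ values                  # information communicated back to the querying node
```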

Attention (MIT intro)

Step 1. Position-aware embeddings

To solve the previous issues with RNNs, we want to eliminate recurrence!

Thus, to feed all the data at once, we need to encode position into the embeddings.

To do this, we combine simple embeddings with some position metric

As a result, we get position-aware embeddings.
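
One common choice of "position metric" is the sinusoidal encoding from the original Transformer paper; a minimal sketch, assuming token embeddings X of shape (seq_len, d_model):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                 # positions 0 .. seq_len-1
    i   = np.arange(d_model)[None, :]                 # embedding dimensions
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions: cosine
    return pe

# X = token_embeddings                                # plain embeddings, shape (seq_len, d_model)
# X = X + positional_encoding(seq_len, d_model)       # position-aware embeddings
```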

Step 2: Extract query - key - value
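
In the standard Transformer formulation, the query, key, and value vectors are obtained from the same position-aware embeddings through three learned linear projections; a minimal sketch (the weight matrices below are random stand-ins for learned parameters):

```python
import numpy as np

seq_len, d_model = 6, 8
X   = np.random.randn(seq_len, d_model)     # position-aware embeddings from step 1

W_q = np.random.randn(d_model, d_model)     # learned in practice, random here
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q, K, V = X @ W_q, X @ W_k, X @ W_v         # one query / key / value per token
```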

Step 3: Compute Attention Score (aka Weighting)

Attention scores lie in [0, 1]

Example: He tossed the tennis ball to serve

we pay attention to: tennis, ball, serve
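
Concretely, the scores come from comparing every query with every key and normalizing with a softmax, which is what keeps them in [0, 1]; a self-contained sketch with random stand-in queries and keys:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_k = 6, 8
Q = np.random.randn(seq_len, d_k)        # queries from step 2
K = np.random.randn(seq_len, d_k)        # keys from step 2

A = softmax(Q @ K.T / np.sqrt(d_k))      # attention scores, shape (seq_len, seq_len)
# each row sums to 1: row i says how strongly token i attends to every token,
# e.g. mostly to "tennis", "ball", "serve" in the sentence above
```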

Step 4: Extract features with high attention

At step 3, we only determine which embeddings we should pay attention to.

At step 4, we actually compute the features to make predictions from (the output).
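
The output features are just the attention-weighted sum of the value vectors; continuing the step-3 sketch (same import, reusing A, seq_len, and d_k):

```python
V = np.random.randn(seq_len, d_k)    # values from step 2 (random stand-in here)

out = A @ V                          # (seq_len, d_k): one output feature vector per token,
                                     # each a blend of the values it attended to
```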

Self-attention head block
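
Putting steps 1-4 together, a single self-attention head can be written as one small block; a minimal sketch with random matrices standing in for learned weights (class and variable names are illustrative only):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SelfAttentionHead:
    def __init__(self, d_model, d_head):
        # learned projections (random stand-ins here)
        self.W_q = np.random.randn(d_model, d_head)
        self.W_k = np.random.randn(d_model, d_head)
        self.W_v = np.random.randn(d_model, d_head)

    def __call__(self, X):                                    # X: position-aware embeddings (seq_len, d_model)
        Q, K, V = X @ self.W_q, X @ self.W_k, X @ self.W_v    # step 2: extract query / key / value
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))           # step 3: attention scores
        return A @ V                                          # step 4: attention-weighted features

head = SelfAttentionHead(d_model=8, d_head=8)
out = head(np.random.randn(6, 8))                             # (6, 8) output features
```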

Attention (Stanford intro)

Attention input:

Attention function output (black ellipse):

Attention algorithm output (context):

Note: the attention function (black ellipse) can be any arbitrary function, for example:

It solves:

What is new compared to plain Seq2Seq?

Why does the context sum over all hidden states?
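
A sketch of one decoder step with attention in a Seq2Seq model, assuming encoder hidden states H (one per source token) and a current decoder state s; the dot-product score used here is just one possible choice for that arbitrary attention function:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

src_len, d = 7, 16
H = np.random.randn(src_len, d)     # encoder hidden states (attention input)
s = np.random.randn(d)              # current decoder hidden state (attention input)

scores  = H @ s                     # attention function output: one score per source token
alphas  = softmax(scores)           # normalized weights in [0, 1]
context = alphas @ H                # context vector: weighted sum over ALL hidden states
```

Summing over all hidden states lets the decoder look at the entire source sequence instead of squeezing everything through the single final encoder state.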

Self-attention

Self-attention: the encoder attends to the encoder (itself), or the decoder attends to the decoder (itself)

1. Compute self-attention for Input_1

2. Integrate self-A_1 into Hidden_state_1

3. Compute self-attention for Input_2

4. Integrate self-A_2 into Hidden_state_2

Note: self-attention uses only the initial hidden states (not the updated ones) => ✔️ parallelization

5. Decoder self-attention is the same
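
Because every self-attention vector is computed from the same initial hidden states (nothing is updated in between), the whole sequence can be processed with one matrix product instead of a token-by-token loop; a simplified sketch with identity projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

H = np.random.randn(5, 16)                          # initial hidden states for 5 tokens
A = softmax(H @ H.T / np.sqrt(H.shape[-1])) @ H     # self-attention for ALL tokens at once
# no token waits for another token's result, unlike the sequential updates in an RNN
```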



Multi-head attention

Multi-head attention: multiple attention mechanisms applied to the same sentence.

Each attention head aims to extract a particular relationship between two tokens.
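
A sketch of multi-head attention built from independent heads whose outputs are concatenated and mixed by a final projection (all weights are random stand-ins for learned parameters, and the helper names are illustrative only):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

def multi_head_attention(X, n_heads=4):
    d_model = X.shape[-1]
    d_head  = d_model // n_heads
    # each head gets its own projections, so it can specialize in one kind of relationship
    heads = [attention_head(X,
                            np.random.randn(d_model, d_head),
                            np.random.randn(d_model, d_head),
                            np.random.randn(d_model, d_head))
             for _ in range(n_heads)]
    W_o = np.random.randn(n_heads * d_head, d_model)    # output projection
    return np.concatenate(heads, axis=-1) @ W_o         # back to (seq_len, d_model)

out = multi_head_attention(np.random.randn(6, 16))      # (6, 16)
```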