Advanced NLP

Please check DL2. RNN for an intro to #Recurrent Networks and #Attention.

Machine Translation

Bilingual evaluation understudy (BLEU) is a "good" metric in NLP.

Example: with 1-gram precision = 1, 2-gram precision = 0.5, 3-gram precision = 0, BLEU is not the plain sum (1 + 0.5 + 0); the precisions are combined as a geometric mean times a brevity penalty, so the score stays in [0, 1].


N-grams: BLEU counts matching n-grams between the candidate translation and the reference.

BLEU is "good" because it correlates well with human judgment.
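
A minimal sketch of how BLEU combines clipped n-gram precisions (geometric mean times a brevity penalty). Function names and the toy sentences are illustrative; real evaluations should use an established implementation.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: candidate counts are capped by reference counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of n-gram precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:          # geometric mean collapses to 0 if any precision is 0
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

cand = "the cat sat on the mat".split()
ref  = "the cat is on the mat".split()
print(bleu(cand, ref, max_n=2))   # ~0.707, always in [0, 1]
```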

Encoder-Decoder

The encoder and decoder might be any architecture (e.g., an RNN).

Forward pass

  1. Encode input X into a context embedding:

    • the final hidden state of the encoder (last layer, last time step)

  2. Feed the context to the decoder RNN

  3. The decoder generates the output sentence iteratively, token by token (see the sketch below)
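
A minimal sketch of this forward pass with GRU encoder/decoder in PyTorch. Class names, sizes, and the greedy loop are assumptions for illustration, not the exact model from the lecture.

```python
import torch
import torch.nn as nn

vocab_size, emb_size, hidden_size = 1000, 64, 128

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.GRU(emb_size, hidden_size, batch_first=True)

    def forward(self, src):                       # src: (batch, src_len)
        _, h = self.rnn(self.emb(src))            # h: (1, batch, hidden) = context embedding
        return h                                  # final hidden state summarizes the input

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.GRU(emb_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_token, h):             # one decoding step
        out, h = self.rnn(self.emb(prev_token), h)
        return self.out(out), h                   # logits over the next token, updated state

# Forward pass: encode once, then generate step by step from the context.
enc, dec = Encoder(), Decoder()
src = torch.randint(0, vocab_size, (2, 7))        # batch of 2 toy source sentences
h = enc(src)                                      # step 1: context embedding
tok = torch.zeros(2, 1, dtype=torch.long)         # step 2: context + <bos> go to the decoder
for _ in range(5):                                # step 3: iterative generation
    logits, h = dec(tok, h)
    tok = logits.argmax(-1)                       # greedy choice (see "Problem 1" below)
```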

Learning Encoder-Decoder

Training set: tuples (source, target)

Objective: minimize Cross-Entropy(prediction, truth), i.e. maximize the likelihood of the target sentence.
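
A minimal sketch of this objective, assuming teacher forcing and PyTorch's `nn.CrossEntropyLoss`; the decoder's per-step logits are faked with random tensors to keep it self-contained.

```python
import torch
import torch.nn as nn

vocab_size, batch, tgt_len = 1000, 2, 6
criterion = nn.CrossEntropyLoss()

# Pretend these are the decoder's logits for each target position, produced while
# feeding it the ground-truth previous tokens (teacher forcing).
logits = torch.randn(batch, tgt_len, vocab_size, requires_grad=True)
target = torch.randint(0, vocab_size, (batch, tgt_len))   # ground-truth translation

# CrossEntropyLoss expects (N, C): flatten the batch and time dimensions.
loss = criterion(logits.reshape(-1, vocab_size), target.reshape(-1))
loss.backward()      # gradients flow back through decoder and encoder in a real model
print(loss.item())   # an optimizer step would minimize this value
```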

Problem 1: the greedy decoder takes the argmax at every step.

It forces the NN to translate "word by word": the locally best token is chosen instead of the globally best sequence.
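
A toy illustration (made-up probabilities) of why the per-step argmax can be suboptimal: the locally best first token leads to a full sequence that is less probable than an alternative.

```python
# P(first token) and P(second token | first token) for a 2-step "translation".
p_first = {"A": 0.6, "B": 0.4}
p_second = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.9, "y": 0.1}}

# Greedy: pick the argmax at every step.
first = max(p_first, key=p_first.get)                       # -> "A"
second = max(p_second[first], key=p_second[first].get)      # -> "x"
greedy_prob = p_first[first] * p_second[first][second]      # 0.6 * 0.5 = 0.30

# Exhaustive search over full sequences finds a better one.
best_seq, best_prob = max(
    (((f, s), p_first[f] * p_second[f][s]) for f in p_first for s in p_second[f]),
    key=lambda item: item[1],
)
print(first, second, greedy_prob)   # A x 0.30  (greedy)
print(best_seq, best_prob)          # ('B', 'x') 0.36  (globally better)
```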

Solution : Beam Search

At every decoding step, we keep the K best candidate sequences.

Finally, the most probable output among all kept candidates is taken.
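
A minimal beam-search sketch; `step_log_probs` is a hypothetical stand-in for a real decoder, and the fixed length and lack of length normalization are simplifications.

```python
import math

def step_log_probs(prefix):
    """Toy next-token log-probabilities given the current prefix (stand-in for a decoder)."""
    vocab = {"a": 0.5, "b": 0.3, "<eos>": 0.2}
    return {tok: math.log(p) for tok, p in vocab.items()}

def beam_search(K=2, max_len=5):
    beams = [([], 0.0)]                              # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "<eos>":     # finished sequences are kept as they are
                candidates.append((tokens, score))
                continue
            for tok, lp in step_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        # Keep only the best K candidates at every step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]
    # Finally, take the most probable output among all kept results.
    return max(beams, key=lambda c: c[1])

print(beam_search())
```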

Problems: Forgetting: for very long sequences, the encoder and decoder might forget the beginning of the encoded/decoded sentence.

Bottleneck: the whole input X has to be forced into the last layer's hidden state (a single fixed-size vector).

Solution: Attention: direct access to the "important" parts of the input.

It solves:

  • Forgetting: no need to memorize everything in one vector at once

  • Bottleneck: the decoder now has direct access to any encoder state

What is new compared to plain Seq2Seq?

  • the decoder has access to all encoder hidden states (not only the last one)

Why is the context a (weighted) sum over all hidden states?

  • one output token might depend on several input hidden states (see the sketch below)
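
A minimal sketch of dot-product attention, assuming the decoder state is used directly as the query: the context is a softmax-weighted sum over all encoder hidden states. Shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

src_len, hidden = 7, 128
encoder_states = torch.randn(src_len, hidden)   # one hidden state per input token
decoder_state  = torch.randn(hidden)            # current decoder hidden state (the query)

scores  = encoder_states @ decoder_state        # (src_len,) similarity of each input position to the query
weights = F.softmax(scores, dim=0)              # attention distribution over input positions
context = weights @ encoder_states              # (hidden,) weighted sum, not just the last state

# `context` is fed to the decoder together with its own state, giving it direct
# access to any "important" input position (this addresses forgetting and the bottleneck).
```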

Self-attention

Self-attention: the encoder attends to the encoder (itself), or the decoder attends to the decoder (itself).

  1. Compute self-attention for Input_1

  2. Integrate Self-A_1 into Hidden_state_1

  3. Compute self-attention for Input_2

  4. Integrate Self-A_2 into Hidden_state_2

Note: self-attention uses only the initial hidden states (not the updated ones) => ✔️ parallelization (see the sketch after this list)

  5. Decoder self-attention works the same way
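
A minimal sketch of scaled dot-product self-attention over one sentence, with random projection matrices standing in for learned ones. All positions are computed from the initial hidden states in one matrix product, which is what enables the parallelization.

```python
import math
import torch
import torch.nn.functional as F

seq_len, d = 5, 64
X = torch.randn(seq_len, d)            # initial hidden states of one sentence

Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # projections (learned in a real model)
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores  = Q @ K.T / math.sqrt(d)       # (seq_len, seq_len) pairwise similarities
weights = F.softmax(scores, dim=-1)    # each row: attention of one token over all tokens
Z = weights @ V                        # (seq_len, d) self-attention outputs for ALL tokens at once

# Z[i] is the self-attention for token i; it is then integrated into hidden state i.
```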



Multi-head attention

Multi-head attention: multiple attention mechanisms run on the same sentence.

Each head aims to extract a particular kind of relationship between two tokens.
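
A minimal sketch using PyTorch's `nn.MultiheadAttention` as self-attention over one sentence; the sizes are illustrative, and by default the returned weights are averaged over the heads.

```python
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len = 64, 8, 5        # embed_dim must be divisible by num_heads
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, seq_len, embed_dim)          # one sentence of 5 token embeddings
out, weights = mha(x, x, x)                     # self-attention: query = key = value = the sentence

print(out.shape)       # (1, 5, 64): one combined output per token (heads concatenated + projected)
print(weights.shape)   # (1, 5, 5): attention weights (averaged over the 8 heads by default)
```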