Advanced NLP

Please check DL2. RNN for an intro to #Recurrent Networks and #Attention.

Machine Translation

Bilingual evaluation understudy (BLEU) is a "good" metric in NLP.

Example: with 1-gram precision = 1, 2-gram precision = 0.5, 3-gram precision = 0, BLEU is not the plain sum (1 + 0.5 + 0); the precisions are combined as a geometric mean times a brevity penalty, so the score stays in [0, 1].


N-grams: BLEU counts matching n-grams between the candidate translation and the reference.

BLEU is "good" because it correlates well with human judgment.
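
A minimal sketch of how BLEU combines clipped n-gram precisions (geometric mean times a brevity penalty). Function names and the toy sentences are illustrative; real evaluations should use an established implementation.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: candidate counts are capped by reference counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of n-gram precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:          # geometric mean collapses to 0 if any precision is 0
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

cand = "the cat sat on the mat".split()
ref  = "the cat is on the mat".split()
print(bleu(cand, ref, max_n=2))   # ~0.707, always in [0, 1]
```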

Encoder-Decoder

The encoder and decoder might be any architecture (e.g., an RNN).

Forward pass

  1. Encode input X into a context embedding:

    • the final hidden state of the encoder (last layer, last time step)

  2. Feed the context to the decoder RNN

  3. The decoder generates the output sentence iteratively, token by token (see the sketch below)
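
A minimal sketch of this forward pass with GRU encoder/decoder in PyTorch. Class names, sizes, and the greedy loop are assumptions for illustration, not the exact model from the lecture.

```python
import torch
import torch.nn as nn

vocab_size, emb_size, hidden_size = 1000, 64, 128

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.GRU(emb_size, hidden_size, batch_first=True)

    def forward(self, src):                       # src: (batch, src_len)
        _, h = self.rnn(self.emb(src))            # h: (1, batch, hidden) = context embedding
        return h                                  # final hidden state summarizes the input

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.GRU(emb_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_token, h):             # one decoding step
        out, h = self.rnn(self.emb(prev_token), h)
        return self.out(out), h                   # logits over the next token, updated state

# Forward pass: encode once, then generate step by step from the context.
enc, dec = Encoder(), Decoder()
src = torch.randint(0, vocab_size, (2, 7))        # batch of 2 toy source sentences
h = enc(src)                                      # step 1: context embedding
tok = torch.zeros(2, 1, dtype=torch.long)         # step 2: context + <bos> go to the decoder
for _ in range(5):                                # step 3: iterative generation
    logits, h = dec(tok, h)
    tok = logits.argmax(-1)                       # greedy choice (see "Problem 1" below)
```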

Learning Encoder-Decoder

Training set: tuples (source, target)

Objective: minimize Cross-Entropy(prediction, truth), i.e. maximize the likelihood of the target sentence.
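
A minimal sketch of this objective, assuming teacher forcing and PyTorch's `nn.CrossEntropyLoss`; the decoder's per-step logits are faked with random tensors to keep it self-contained.

```python
import torch
import torch.nn as nn

vocab_size, batch, tgt_len = 1000, 2, 6
criterion = nn.CrossEntropyLoss()

# Pretend these are the decoder's logits for each target position, produced while
# feeding it the ground-truth previous tokens (teacher forcing).
logits = torch.randn(batch, tgt_len, vocab_size, requires_grad=True)
target = torch.randint(0, vocab_size, (batch, tgt_len))   # ground-truth translation

# CrossEntropyLoss expects (N, C): flatten the batch and time dimensions.
loss = criterion(logits.reshape(-1, vocab_size), target.reshape(-1))
loss.backward()      # gradients flow back through decoder and encoder in a real model
print(loss.item())   # an optimizer step would minimize this value
```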

Problem 1: the greedy decoder takes the argmax at every step.

It forces the NN to translate "word by word": the locally best token is chosen instead of the globally best sequence.
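
A toy illustration (made-up probabilities) of why the per-step argmax can be suboptimal: the locally best first token leads to a full sequence that is less probable than an alternative.

```python
# P(first token) and P(second token | first token) for a 2-step "translation".
p_first = {"A": 0.6, "B": 0.4}
p_second = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.9, "y": 0.1}}

# Greedy: pick the argmax at every step.
first = max(p_first, key=p_first.get)                       # -> "A"
second = max(p_second[first], key=p_second[first].get)      # -> "x"
greedy_prob = p_first[first] * p_second[first][second]      # 0.6 * 0.5 = 0.30

# Exhaustive search over full sequences finds a better one.
best_seq, best_prob = max(
    (((f, s), p_first[f] * p_second[f][s]) for f in p_first for s in p_second[f]),
    key=lambda item: item[1],
)
print(first, second, greedy_prob)   # A x 0.30  (greedy)
print(best_seq, best_prob)          # ('B', 'x') 0.36  (globally better)
```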

Solution : Beam Search

At every decoding step, we keep the K best candidate sequences.

Finally, the most probable output among all kept candidates is taken.
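
A minimal beam-search sketch; `step_log_probs` is a hypothetical stand-in for a real decoder, and the fixed length and lack of length normalization are simplifications.

```python
import math

def step_log_probs(prefix):
    """Toy next-token log-probabilities given the current prefix (stand-in for a decoder)."""
    vocab = {"a": 0.5, "b": 0.3, "<eos>": 0.2}
    return {tok: math.log(p) for tok, p in vocab.items()}

def beam_search(K=2, max_len=5):
    beams = [([], 0.0)]                              # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "<eos>":     # finished sequences are kept as they are
                candidates.append((tokens, score))
                continue
            for tok, lp in step_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        # Keep only the best K candidates at every step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]
    # Finally, take the most probable output among all kept results.
    return max(beams, key=lambda c: c[1])

print(beam_search())
```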

Problems: Forgetting: for very long sequences, the encoder and decoder might forget the beginning of the encoded/decoded sentence.

Bottleneck: the whole input X has to be forced into the last layer's hidden state (a single fixed-size vector).

Solution: Attention: direct access to the "important" parts of the input.

It solves:

  • Forgetting: no need to memorize everything in one vector at once

  • Bottleneck: the decoder now has direct access to any encoder state

What is new compared to plain Seq2Seq?

  • the decoder has access to all encoder hidden states (not only the last one)

Why is the context a (weighted) sum over all hidden states?

  • one output token might depend on several input hidden states (see the sketch below)
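
A minimal sketch of dot-product attention, assuming the decoder state is used directly as the query: the context is a softmax-weighted sum over all encoder hidden states. Shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

src_len, hidden = 7, 128
encoder_states = torch.randn(src_len, hidden)   # one hidden state per input token
decoder_state  = torch.randn(hidden)            # current decoder hidden state (the query)

scores  = encoder_states @ decoder_state        # (src_len,) similarity of each input position to the query
weights = F.softmax(scores, dim=0)              # attention distribution over input positions
context = weights @ encoder_states              # (hidden,) weighted sum, not just the last state

# `context` is fed to the decoder together with its own state, giving it direct
# access to any "important" input position (this addresses forgetting and the bottleneck).
```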

Self-attention

Self-attention: the encoder attends to the encoder (itself), or the decoder attends to the decoder (itself).

  1. Compute self-attention for Input_1

  2. Integrate Self-A_1 into Hidden_state_1

  3. Compute self-attention for Input_2

  4. Integrate Self-A_2 into Hidden_state_2

Note: self-attention uses only the initial hidden states (not the updated ones) => ✔️ parallelization (see the sketch after this list)

  5. Decoder self-attention works the same way
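
A minimal sketch of scaled dot-product self-attention over one sentence, with random projection matrices standing in for learned ones. All positions are computed from the initial hidden states in one matrix product, which is what enables the parallelization.

```python
import math
import torch
import torch.nn.functional as F

seq_len, d = 5, 64
X = torch.randn(seq_len, d)            # initial hidden states of one sentence

Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # projections (learned in a real model)
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores  = Q @ K.T / math.sqrt(d)       # (seq_len, seq_len) pairwise similarities
weights = F.softmax(scores, dim=-1)    # each row: attention of one token over all tokens
Z = weights @ V                        # (seq_len, d) self-attention outputs for ALL tokens at once

# Z[i] is the self-attention for token i; it is then integrated into hidden state i.
```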



Multi-head attention

Multi-head attention: multiple attention mechanisms run on the same sentence.

Each head aims to extract a particular kind of relationship between two tokens.
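
A minimal sketch using PyTorch's `nn.MultiheadAttention` as self-attention over one sentence; the sizes are illustrative, and by default the returned weights are averaged over the heads.

```python
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len = 64, 8, 5        # embed_dim must be divisible by num_heads
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, seq_len, embed_dim)          # one sentence of 5 token embeddings
out, weights = mha(x, x, x)                     # self-attention: query = key = value = the sentence

print(out.shape)       # (1, 5, 64): one combined output per token (heads concatenated + projected)
print(weights.shape)   # (1, 5, 5): attention weights (averaged over the 8 heads by default)
```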