Advanced NLP
Machine Translation
Bilingual evaluation understudy (BLEU) is a "good" automatic metric for machine translation quality.
Example: with n-gram precisions p_1 = 1, p_2 = 0.5, p_3 = 0, BLEU is not 1 + 0.5 + 0 = 1.5. The precisions are combined by a geometric mean times a brevity penalty, BLEU = BP * exp((1/N) * Σ_n log p_n), so BLEU always lies in [0, 1] and any zero precision drives the score to 0.
N-grams: BLEU compares n-gram overlap (typically n = 1..4) between the candidate and the reference translation.
BLEU is "good" because it is well correlated with Human estimation.
Encoder-Decoder
Any architecture can be used (e.g., an RNN)
Forward pass
Encode input X into a context embedding:
the encoder's final hidden state
Feed context to decoder RNN
The decoder generates the output sentence token by token
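A rough PyTorch sketch of this forward pass, assuming a single-layer GRU encoder/decoder and greedy decoding; the `Seq2Seq` class, all dimensions, and the BOS index are illustrative, not from the notes.

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID, BOS = 1000, 64, 128, 1

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, src, max_len=20):
        # Encode input X; the final hidden state is the context vector.
        _, context = self.encoder(self.emb(src))           # (1, batch, HID)
        # Feed the context to the decoder RNN and generate iteratively.
        token = torch.full((src.size(0), 1), BOS, dtype=torch.long)
        hidden, outputs = context, []
        for _ in range(max_len):
            dec_out, hidden = self.decoder(self.emb(token), hidden)
            logits = self.out(dec_out[:, -1])               # next-token scores
            token = logits.argmax(dim=-1, keepdim=True)     # greedy choice
            outputs.append(token)
        return torch.cat(outputs, dim=1)

model = Seq2Seq()
print(model(torch.randint(0, VOCAB, (2, 7))).shape)        # (2, 20)
```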
Learning Encoder-Decoder
Training set: pairs (source → target)
Minimize Cross-Entropy(prediction, truth), i.e., maximize the log-likelihood of the target translation
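A minimal training-step sketch, assuming teacher forcing and PyTorch's `CrossEntropyLoss`; the `training_step` helper and the toy projection layer are hypothetical, only there to show the minimize-cross-entropy objective.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def training_step(logits, target, optimizer):
    # logits: (batch, tgt_len, vocab), target: (batch, tgt_len)
    loss = criterion(logits.reshape(-1, logits.size(-1)), target.reshape(-1))
    optimizer.zero_grad()
    loss.backward()        # minimizing CE == maximizing target log-likelihood
    optimizer.step()
    return loss.item()

proj = nn.Linear(8, 100)                        # toy stand-in for the decoder output layer
opt = torch.optim.SGD(proj.parameters(), lr=0.1)
logits = proj(torch.randn(2, 5, 8))             # (batch, tgt_len, vocab)
target = torch.randint(0, 100, (2, 5))
print(training_step(logits, target, opt))
```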
Problem 1 : a greedy decoder takes the argmax at every step
This forces the network to translate "word by word": each choice is locally best but ignores the context of the rest of the sentence.
Solution : Beam Search
At every decoding step, we keep the K best partial sequences (beam width K).
Finally, the most probable complete sequence among all kept candidates is returned (see the sketch below).
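A beam-search sketch, assuming a `next_log_probs` callback standing in for the decoder; the toy probability table is random and only makes the example runnable.

```python
import numpy as np

def beam_search(next_log_probs, bos, eos, k=3, max_len=10):
    beams = [([bos], 0.0)]                     # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                 # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            log_probs = next_log_probs(seq)    # scores for every vocab token
            for tok in np.argsort(log_probs)[-k:]:
                candidates.append((seq + [int(tok)], score + log_probs[tok]))
        # Keep only the K most probable partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0]                            # most probable final hypothesis

# Toy decoder: a fixed next-token distribution per previous token.
rng = np.random.default_rng(0)
table = np.log(rng.dirichlet(np.ones(6), size=6))
print(beam_search(lambda seq: table[seq[-1]], bos=0, eos=5))
```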
Problems: Forgetting : for very long sequences, the encoder and decoder may forget the beginning of the sentence being encoded / decoded
Bottleneck: all of input X must be squeezed into a single fixed-size vector (the encoder's final hidden state)
Solution: Attention : direct access to the "important" parts of the input
It solves:
Forgetting : no need to memorize everything at once
Bottleneck : the decoder can now attend to any encoder state directly
What does attention add to Seq2Seq?
the decoder has access to all encoder hidden states (not only the last one)
Why is the context a weighted sum of all hidden states?
a single output token may depend on several input hidden states (see the sketch below)
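A NumPy sketch of attention over encoder states, assuming the decoder state and encoder states live in the same vector space; the dot-product scoring is one common choice (Luong-style), not necessarily the variant used in the lecture.

```python
import numpy as np

def attention(decoder_state, encoder_states):
    # decoder_state: (hid,), encoder_states: (src_len, hid)
    scores = encoder_states @ decoder_state          # one score per input position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over source positions
    context = weights @ encoder_states               # weighted sum of hidden states
    return context, weights

enc = np.random.randn(5, 8)                          # 5 source positions, hid = 8
dec = np.random.randn(8)
ctx, w = attention(dec, enc)
print(w.round(2), ctx.shape)                         # weights sum to 1, ctx: (8,)
```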
Self-attention
Self-attention: the encoder attends to itself, or the decoder attends to itself
1. Compute self-attention for Input_1
2. Integrate Self-A_1 into Hidden_state_1
3. Compute self-attention for Input_2
4. Integrate Self-A_2 into Hidden_state_2
Note: self-A uses only the initial hidden states (not the updated ones) => ✔️ parallelization
5. Decoder self-attention is the same
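A scaled dot-product self-attention sketch in NumPy, assuming learned projection matrices `wq`, `wk`, `wv` (all shapes illustrative); every position is computed from the initial states in one matrix product, which is why it parallelizes.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    # x: (seq_len, d_model); wq/wk/wv: (d_model, d_k)
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # every position vs. every position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax per position
    return weights @ v                               # updated representations

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))                     # 4 tokens, d_model = 16
wq, wk, wv = (rng.standard_normal((16, 8)) for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)           # (4, 8)
```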
Multi-head attention
Multi-head attention: multiple attention mechanisms run over the same sentence.
Each head aims to extract a particular relationship between tokens (see the sketch below).
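A multi-head sketch in NumPy, assuming two heads with illustrative shapes; each head is an independent scaled dot-product attention, and the head outputs are concatenated and then mixed by an output projection `wo`.

```python
import numpy as np

def one_head(x, wq, wk, wv):
    # Single scaled dot-product attention head (same as the sketch above).
    q, k, v = x @ wq, x @ wk, x @ wv
    s = q @ k.T / np.sqrt(k.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def multi_head_attention(x, heads, wo):
    # Run every head on the same sentence, concatenate, then project with wo.
    return np.concatenate([one_head(x, *h) for h in heads], axis=-1) @ wo

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 16))                       # 4 tokens, d_model = 16
heads = [tuple(rng.standard_normal((16, 8)) for _ in range(3)) for _ in range(2)]
wo = rng.standard_normal((2 * 8, 16))                  # project back to d_model
print(multi_head_attention(x, heads, wo).shape)        # (4, 16)
```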