Transformers
Pros:
constant path length between any two tokens
(contrary to RNNs, where the first token was far from the last one)
parallelization
Cons:
self-attention: quadratic in time and space in the sequence length (scaling is an issue)
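The quadratic cost comes from the attention score matrix, which compares every token with every other token. A minimal sketch (not from the notes; the function and weight names are illustrative) of scaled dot-product self-attention:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d) token embeddings; Wq/Wk/Wv: (d, d) hypothetical projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (n, n) matrix: quadratic in sequence length n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # (n, d) output

n, d = 8, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)
```

Doubling n quadruples the size of `scores`, which is why long contexts are expensive for vanilla Transformers.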
History
1951 (Shannon) [Statistical] 3-gram model: a big lookup table P(word | two prev. words) computed on all available texts.
result sample: a bunch of nonsense, but the words come from the same "space"
2011 (Sutskever) [Neural] RNN
result sample: still nonsense, but it has a "flow", we can read it as a correct sentence.
2016 (Jozefowicz) [Neural] LSTM
result sample: a bit of sense; some samples, even long ones, might be interpreted in some intelligent way.
2018 (Liu, Saleh) Transformers
result sample: sentences make sense, though the logic might be wrong; nonsense still occurs.
2019 (Radford) GPT-2
result sample: the model can make consistent sentences, coherent across many paragraphs, forming stories; nonsense still occurs.
2020 (Brown) GPT-3
result sample: fully makes sense; able to flow across many paragraphs; inherits the style of the prompt text (e.g. poetic aspects).