Transformers

Pros:

  • constant path length between any two tokens
    (unlike an RNN, where the first token is far from the last one)

  • parallelization
    (all token positions can be processed at once during training)

Cons:

  • self-attention: quadratic in sequence length, in both time and space
    (scaling to long sequences is an issue; see the sketch below)
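
A minimal NumPy sketch of single-head self-attention (illustrative only:
untrained random weights, and the names self_attention/Wq/Wk/Wv are my own)
showing where the quadratic cost comes from: the score matrix holds one
entry per token pair, so it grows as n^2 in the sequence length.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        # Project the n token embeddings (X is n x d) to queries, keys, values.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        # scores is n x n: one entry per token pair.
        # This matrix is the quadratic cost, in both time and memory.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        # Row-wise softmax, then mix the values.
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    rng = np.random.default_rng(0)
    n, d = 128, 64                       # 128 tokens, 64-dim embeddings
    X = rng.normal(size=(n, d))
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    out = self_attention(X, Wq, Wk, Wv)  # out: 128 x 64; scores was 128 x 128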

History

1951 (Shannon) [Statistical] 3-gram model: a big lookup table P(word | two prev. words) computed over all available text (see the sketch after this entry).

  • result sample: a bunch of nonsense, but the words come from the same "space"
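
A minimal sketch of such a lookup table (toy corpus and plain counts, no
smoothing; the function names are my own, purely illustrative):

    from collections import Counter, defaultdict

    def train_trigram(tokens):
        # The "big lookup table": counts of next word given two previous words.
        table = defaultdict(Counter)
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            table[(a, b)][c] += 1
        return table

    def prob(table, a, b, c):
        # Estimate P(c | a, b) from raw counts (no smoothing).
        counts = table[(a, b)]
        total = sum(counts.values())
        return counts[c] / total if total else 0.0

    tokens = "the cat sat on the mat the cat ran".split()
    table = train_trigram(tokens)
    print(prob(table, "the", "cat", "sat"))  # 0.5: "the cat" -> {sat: 1, ran: 1}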

2011 (Sutskever) [Neural] RNN

  • result sample: still nonsense, but it has a "flow"; we can read it as a well-formed sentence.

2016 (Jozefowicz) [Neural] LSTM

  • result sample: a bit of sense; some samples, even long ones, can be interpreted in some intelligent way.

2018 (Liu, Saleh) Transformers

  • sentences make sense, though the logic may be wrong; nonsense still occurs

2019 (Radford) GPT-2

  • the model can produce consistent sentences, coherent across many paragraphs, forming stories; nonsense still occurs

2020 (Brown) GPT-3

  • fully makes sense; able to flow across many paragraphs; inherits the style of the text (even poetic aspects)