Positional Encoding

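Transformers have replaced RNNs and CNNs in many sequence tasks. Their advantage is parallel computation: all objects in a sequence are processed at once rather than one after another. But, unlike an RNN or a CNN, a Transformer does not by itself take the position of each object in the sequence into account.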

Idea: inject information about an object's position into its input embedding.
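
A minimal sketch of what "injecting" position can look like. These notes do not say which encoding is used, so this assumes the fixed sinusoidal scheme from the original Transformer paper, added element-wise to each embedding. The function names and the toy input are hypothetical and only for illustration.

```python
import math


def sinusoidal_encoding(num_positions, dim):
    """Return a num_positions x dim table of sinusoidal position vectors.

    Assumption: the sin/cos scheme with base 10000; other encodings exist.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Dimensions (2k, 2k+1) form a pair that shares one frequency.
            angle = pos / (10000 ** ((i - i % 2) / dim))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Add the position vector for index p to the embedding at index p."""
    dim = len(embeddings[0])
    table = sinusoidal_encoding(len(embeddings), dim)
    return [[e + p for e, p in zip(emb, pos)]
            for emb, pos in zip(embeddings, table)]


# Hypothetical toy input: the same 4-dimensional embedding at three positions.
tokens = [[0.5, 0.5, 0.5, 0.5] for _ in range(3)]
encoded = add_positions(tokens)
```

The three input embeddings are identical, yet the three rows of `encoded` differ, so later layers can tell the positions apart even though everything is processed in parallel.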