Natural language processing basics

Basics of NLP: the count-based (non-neural) approach.

NLP = linguistics + machine learning

NLP structure

Before neural networks, NLP used a count-based (aka statistical) approach, mainly Bayesian, that couldn't:

Text preprocessing

From raw text to a "vocabulary" of tokens like "change", "go", "=", etc.

Note: for now, "change" and "changes" are two different tokens.

1. Tokenization : split the raw text into tokens

2. Normalization : reduce the token vocabulary (e.g. map "changes" to "change")

3. Deletion : remove unimportant data (e.g. stop words, punctuation)
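
A minimal preprocessing sketch of these three steps (the text, the stop-word list, and the choice of lower-casing as the only normalization are made up for illustration):

```python
import re

text = "The cat changes. The cats change!"

# 1. Tokenization : split the raw text into tokens (words and punctuation)
tokens = re.findall(r"[a-zA-Z]+|[^\sa-zA-Z]", text)

# 2. Normalization : reduce the vocabulary (here: just lower-casing)
tokens = [t.lower() for t in tokens]

# 3. Deletion : remove unimportant tokens (made-up stop-word list, plus punctuation)
stop_words = {"the", "a", "an"}
tokens = [t for t in tokens if t not in stop_words and t.isalpha()]

print(tokens)  # ['cat', 'changes', 'cats', 'change'] -- still 4 different tokens
```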

"Hello world" example of NLP model : token  OHE  FC  Softmax next word distribution

Words encoding

Example : ['Text of the very first new sentence with the first words in sentence.', 'Text of the second sentence.', 'Number three with lot of words words words.', 'Short text, less words.']

Def. Embedding of word X = any transformation of word X into a numerical representation (a vector)

One-hot encoding

OHE represents a word as a vector with a 1 at that word's index in the vocabulary and 0 everywhere else.
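
A small sketch of OHE over a vocabulary built from the example sentences above (the crude lower-case, punctuation-stripping tokenization is just for illustration):

```python
import numpy as np

corpus = [
    'Text of the very first new sentence with the first words in sentence.',
    'Text of the second sentence.',
    'Number three with lot of words words words.',
    'Short text, less words.',
]

# Crude tokenization: lower-case and strip punctuation
tokens = [w.strip('.,').lower() for doc in corpus for w in doc.split()]
vocab = sorted(set(tokens))
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_idx[word]] = 1
    return vec

print(len(vocab))           # vocabulary size = length of every OHE vector
print(one_hot('sentence'))  # 1 at the index of 'sentence', 0 elsewhere
```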

Problems: the vector length equals the vocabulary size (huge, sparse vectors), and all words are equally distant from each other, so OHE captures no similarity in meaning.

Bag of words

A sentence is represented as the count of every vocabulary word in it.

Problem: word order and context are not taken into account.
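
A bag-of-words sketch on the same example corpus, using scikit-learn's CountVectorizer (assuming scikit-learn ≥ 1.0 is available):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Text of the very first new sentence with the first words in sentence.',
    'Text of the second sentence.',
    'Number three with lot of words words words.',
    'Short text, less words.',
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)       # sparse matrix: documents x vocab

print(vectorizer.get_feature_names_out())    # the vocabulary
print(bow.toarray())                         # per-document word counts
```

Note that the three occurrences of "words" in the third sentence simply become a count of 3: the order and context are lost.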

Encoding weights 

Encoding weights help to encode a word so that its embedding better represents the word's meaning and importance.

Term frequency - inverse document frequency (TF-IDF)

TF-IDF tells how important a particular word is for a document.

TF-IDF relies on the hypothesis that less frequent words are more important.

So TF-IDF lowers the weights of common words ('this', 'a', 'she') and raises the weights of rare, important words ('space', 'nature', 'shark').
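
One common variant of the weight (the exact smoothing differs between implementations) is $\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log \frac{N}{\text{df}(t)}$, where $\text{tf}(t, d)$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $\text{df}(t)$ is the number of documents containing $t$. A sketch with scikit-learn's TfidfVectorizer on the example corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'Text of the very first new sentence with the first words in sentence.',
    'Text of the second sentence.',
    'Number three with lot of words words words.',
    'Short text, less words.',
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # documents x vocab matrix of TF-IDF weights

# IDF per word: frequent words ('of', 'text', 'the') get lower values than rare ones ('number', 'short')
print(dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_.round(2))))
```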

Pointwise mutual information (PMI)

PMI tells how likely two (or N) words are to come together.

ex. "Puerto" and "Rico" are more likely to come together than "Puerto" and "Cat"
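
The usual definition is $\mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$. A tiny sketch that estimates it from adjacent-pair (bigram) counts; the toy corpus is made up:

```python
import math
from collections import Counter

# Made-up toy corpus
tokens = "puerto rico is an island puerto rico has beaches the cat sleeps".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(x, y):
    p_xy = bigrams[(x, y)] / n_bi
    p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
    return math.log(p_xy / (p_x * p_y)) if p_xy > 0 else float('-inf')

print(pmi("puerto", "rico"))  # high: they always appear together
print(pmi("puerto", "cat"))   # -inf here: never adjacent in this toy corpus
```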

Embeddings

Check Embeddings page

Context embeddings

Main Hypothesis: similar words (by meaning) should have similar vectors (by distance)

Def. Context embedding of word X = the counts of co-occurrences of word X with the other words within a window of surrounding words
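
A sketch that builds such co-occurrence counts with a symmetric window (the window size and the toy sentence are arbitrary):

```python
from collections import defaultdict, Counter

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2  # arbitrary window size

cooc = defaultdict(Counter)
for i, word in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            cooc[word][sentence[j]] += 1

# The context embedding of "fox" is its row of co-occurrence counts
print(dict(cooc["fox"]))  # {'quick': 1, 'brown': 1, 'jumps': 1, 'over': 1}
```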

Word2Vec : use the context  

Word2Vec transforms individual words into numerical representations (aka embeddings)

It relies on the main NLP hypothesis: similar words (by meaning) should have similar vectors (by distance)

Unlike OHE, word2vec takes the context into account. It has two main architectures: Skip-gram and CBOW

Continuous bag of words (CBOW)

CBOW predicts a target word from a list of context words

✔️ fast, as it predicts only one distribution

⚠️ the order of the words in the context is not taken into account

Skip-Gram Model

Skip-Gram predicts context words from a target word

✔️ works well for rare target words

⚠️ slow, hard to train
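
A quick way to try both architectures is gensim's Word2Vec (a sketch, assuming gensim 4.x where the parameter is vector_size; the toy corpus and hyperparameters are arbitrary):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus (made up)
sentences = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the quick brown cat sleeps on the lazy dog".split(),
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # sg=0 -> CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> Skip-gram

print(skipgram.wv.most_similar("fox", topn=3))
```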

Cross-Entropy Loss (CEL)

CEL tells how far the predicted distribution is from the ground truth.
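
For a ground-truth distribution $p$ and a predicted distribution $q$, $\mathrm{CE}(p, q) = -\sum_i p_i \log q_i$; with a one-hot ground truth this reduces to $-\log q_{\text{target}}$. A tiny NumPy sketch (the distributions are made up):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """CE between ground-truth distribution p and prediction q."""
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0, 0.0])        # one-hot ground truth
q_good = np.array([0.1, 0.7, 0.1, 0.1])   # prediction close to the truth
q_bad = np.array([0.7, 0.1, 0.1, 0.1])    # prediction far from the truth

print(cross_entropy(p, q_good))  # ~0.36 (low loss)
print(cross_entropy(p, q_bad))   # ~2.30 (high loss)
```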

Skip-Gram structure

Last layer: SoftMax to predict the distribution of the context word o given a center word c:

$$ P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)} $$

where $u_o$ is the "outside" (context) vector of word o, $v_c$ is the "center" vector of word c, and V is the vocabulary.

Skip-Gram likelihood to maximize:

$$ L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta) $$

which gives the (averaged) negative log-likelihood to minimize:

$$ J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta) $$

Fig 3. End-to-end Skip-gram model training on the sentence “The quick brown fox”. Window m=1 (only left and right)
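
A small sketch of the (center, context) training pairs this setup generates (window m = 1, so only the immediate left and right neighbour of each word):

```python
sentence = "the quick brown fox".split()
m = 1  # window size from the figure

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - m), min(len(sentence), i + m + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs)
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```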

Skip-gram captures the meaning of words given their local context

Problem: in the sentence "The cat...", the tokens "The" and "cat" might often appear together, but Skip-gram doesn't know whether "The" is simply a common word everywhere or a word closely related to "cat" specifically.

The solution is the GloVe model : it takes into account the frequency of "The" in the whole text and, in particular, together with the word "cat".

GloVe: word2vec with PMI

The GloVe model takes into account both the local context and the global statistics of words in the text.

Main idea: focus on co-occurrence probabilities, more precisely on the ratio P(k | i) / P(k | j), which tells whether word k is more related to word i or to word j.

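For reference, the objective that GloVe minimizes (as given in the original GloVe paper) ties the learned word vectors to the logarithm of the global co-occurrence counts $X_{ij}$:

$$ J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 $$

where $f$ is a weighting function that caps the influence of very frequent word pairs.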