Natural language processing basics
Basics of NLP and the count-based (non-neural) approach
NLP = linguistics + machine learning
NLP structure
Before neural networks, NLP used count-based (aka statistical) approaches, mainly Bayesian, which couldn't:
work with unseen words
Text preprocessing
1. Tokenization : from raw text to a "vocabulary" of tokens like "change", "go", "=" etc
note: for now "change" and "changes" are two different tokens.
2. Normalization : reduce the token vocabulary
Stemming : chop off word endings ("changes" → "chang")
Lemmatization : reduce to the dictionary form, the lemma ("changes" → "change")
3. Deletion : remove unimportant data
stop words : “the”, “is” , “and” etc
non-informative words : "hello", "best regards" etc
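The three preprocessing steps above can be sketched in plain Python. The stemmer and stop-word list here are toy stand-ins (real pipelines use libraries like NLTK or spaCy), just to show how "changes"/"change" and "cats"/"cat" collapse into one token:

```python
import re

# Hypothetical minimal preprocessing pipeline (toy stemmer, tiny stop-word list)
STOP_WORDS = {"the", "is", "and", "a", "of"}

def tokenize(text):
    # 1. Tokenization: lowercase and split on non-word characters
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    # 2. Normalization (crude stemming): strip a few common English suffixes
    for suffix in ("ing", "es", "e", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # 3. Deletion: drop stop words, then stem what remains
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("The cat changes and the cats change"))
# → ['cat', 'chang', 'cat', 'chang']
```

After normalization, "changes" and "change" both map to "chang", so the vocabulary shrinks.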
"Hello world" example of NLP model : token → OHE → FC → Softmax → next word distribution
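The token → OHE → FC → Softmax pipeline can be sketched end to end in a few lines. The vocabulary and the weight matrix W below are made up for illustration (a real model would learn W from data):

```python
import math
import random

# Toy next-word model: token -> OHE -> fully connected layer -> softmax
vocab = ["hello", "world", "there"]
V = len(vocab)

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

random.seed(0)
# Hypothetical untrained FC weights, V x V
W = [[random.uniform(-1, 1) for _ in range(V)] for _ in range(V)]

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_distribution(word):
    x = one_hot(word)                                   # token -> OHE
    logits = [sum(W[i][j] * x[j] for j in range(V))     # OHE -> FC
              for i in range(V)]
    return softmax(logits)                              # FC -> softmax

probs = next_word_distribution("hello")
print(dict(zip(vocab, probs)))  # a probability distribution over the next word
```

The output always sums to 1, i.e. it is a valid distribution over the next word; training would push probability mass toward the true continuation.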
Words encoding
Example :
['Text of the very first new sentence with the first words in sentence.',
 'Text of the second sentence.',
 'Number three with lot of words words words.',
 'Short text, less words.']
Def. Embedding of word X = any arbitrary transformation of word X into a numerical representation (vector)
One-hot encoding
OHE places a 1 at the word's index in the vocabulary and 0 everywhere else.
Problems:
OHE vectors don't take context into account
no meaningful metric between two words can be defined (all pairs are equally distant)
huge memory footprint: an OHE vector has as many axes as the vocabulary (possibly millions), each carrying a single bit of information
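A minimal sketch of OHE over a vocabulary built from two toy sentences (the sentences are shortened stand-ins for the example corpus above):

```python
# One-hot encoding over a small vocabulary
sentences = ["text of the first sentence", "short text less words"]
vocab = sorted({w for s in sentences for w in s.split()})

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1   # 1 at the word's index, 0 everywhere else
    return vec

print(vocab)
print(one_hot("text"))
# every vector has len(vocab) axes but carries only one bit of information
```

Note that one_hot("text") and one_hot("words") are equally distant from each other as from any other pair, which is exactly the "no meaningful metric" problem.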
Bag of words
A sentence is represented as the count of every vocabulary word in it.
Problems: word order and context are not taken into account.
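Bag of words is just per-word counting, e.g. with `collections.Counter` over one of the example sentences:

```python
from collections import Counter

# Bag of words: a sentence becomes per-word counts over its tokens
sentence = "number three with lot of words words words"
counts = Counter(sentence.split())

print(counts["words"])  # → 3
# "words words words lot" would give the same counts: word order is lost
```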
Encoding weights
Encoding weights help to encode a word so that its embedding better represents the word's meaning and importance.
Term frequency - inverse document frequency (TF-IDF)
TF-IDF tells how important a particular word is for a document.
TF-IDF relies on the hypothesis that words frequent in a document but rare across the corpus are more important
So TF-IDF lowers the weights of common words ('this', 'a', 'she') and raises the weights of rare important words ('space', 'nature', 'shark')
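One common TF-IDF variant (tf × log(N / df), without the smoothing that libraries like scikit-learn add) can be computed by hand on a toy corpus:

```python
import math
from collections import Counter

# Toy corpus: 3 documents, made up for illustration
docs = [
    "the shark swims in the sea",
    "the cat sits on the mat",
    "the sea is deep",
]
N = len(docs)
tokenized = [d.split() for d in docs]
# df[w] = number of documents containing word w
df = Counter(w for doc in tokenized for w in set(doc))

def tf_idf(word, doc_tokens):
    tf = doc_tokens.count(word) / len(doc_tokens)   # term frequency
    idf = math.log(N / df[word])                    # inverse document frequency
    return tf * idf

print(tf_idf("the", tokenized[0]))    # → 0.0 ("the" occurs in every document)
print(tf_idf("shark", tokenized[0]))  # > 0 (rare word gets a high weight)
```

"the" appears in all N documents, so its idf is log(N/N) = 0 and its weight vanishes, while "shark" appears in only one document and gets the highest weight.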
Pointwise mutual information (PMI)
PMI tells how likely two (or N) words are to come together.
ex. "Puerto" and "Rico" are more likely to come together than "Puerto" and "Cat"
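PMI(x, y) = log( p(x, y) / (p(x) p(y)) ), i.e. how much more often the pair occurs than it would under independence. A sketch with made-up counts for the "Puerto Rico" example:

```python
import math

# Hypothetical corpus statistics (counts are made up for illustration)
total = 1000
pair_count = {("puerto", "rico"): 50, ("puerto", "cat"): 1}
word_count = {"puerto": 60, "rico": 55, "cat": 40}

def pmi(x, y):
    p_xy = pair_count[(x, y)] / total
    p_x = word_count[x] / total
    p_y = word_count[y] / total
    # log of "observed co-occurrence" over "expected under independence"
    return math.log(p_xy / (p_x * p_y))

print(pmi("puerto", "rico") > pmi("puerto", "cat"))  # → True
```

A positive PMI means the words co-occur more often than chance; "puerto"/"cat" here even comes out negative, i.e. rarer together than independence would predict.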
Embeddings
Check Embeddings page
Context embeddings
Main Hypothesis: similar words (by meaning) should have similar vectors (by distance)
Def. Context embedding of word X = vector of co-occurrence counts of word X with each word appearing within a window of surrounding words
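Counting co-occurrences within a window can be sketched directly (window size and sentence are arbitrary choices for illustration):

```python
from collections import Counter

# Context embedding by counting co-occurrences within a +/-2 word window
tokens = "the quick brown fox jumps over the lazy dog".split()
window = 2

def context_embedding(target):
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            # count every word within `window` positions of the target
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

print(context_embedding("fox"))
# counts of the words near "fox"; distant words like "dog" never appear
```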
Word2Vec : use the context
Word2Vec transforms individual words into a numerical representation (aka embeddings)
It relies on the main NLP hypothesis : similar words (by meaning) should have similar vectors (by distance)
Unlike OHE, word2vec takes context into account. It has two main architectures: Skip-gram and CBOW
Continuous bag of words (CBOW)
CBOW predicts a target word from a list of context words
✔️ fast, as it predicts only one distribution
⚠️ the order of words in the context is not important
Skip-Gram Model
Skip-Gram predicts context words from a target word
✔️ works well for rare target words
❌ slow, hard to train
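The two architectures differ only in the direction of prediction, so their training data is the same set of (center, context) pairs. A sketch of extracting those pairs with window m=1 from the sentence used in Fig 3 (Skip-gram maps center → context; CBOW goes the other way):

```python
# Generating (center, context) training pairs, window m = 1
tokens = "the quick brown fox".split()
m = 1

pairs = []
for i, center in enumerate(tokens):
    # every word within m positions of the center becomes a context word
    for j in range(max(0, i - m), min(len(tokens), i + m + 1)):
        if j != i:
            pairs.append((center, tokens[j]))

print(pairs)
# → [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#    ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```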
Cross-Entropy Loss (CEL)
CEL tells how far a predicted distribution is from the ground truth.
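With a one-hot ground truth, cross-entropy reduces to minus the log of the probability assigned to the true class:

```python
import math

# Cross-entropy: CE = -sum_i y_i * log(p_i)
# With one-hot y this reduces to -log(p_true)
def cross_entropy(y_true, y_pred):
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)

y_true = [0, 1, 0]                               # true next word is class 1
good = cross_entropy(y_true, [0.1, 0.8, 0.1])    # close prediction, low loss
bad = cross_entropy(y_true, [0.8, 0.1, 0.1])     # wrong prediction, high loss
print(good < bad)  # → True
```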
Skip-Gram structure
Last layer: SoftMax to predict the distribution of the outside (context) word o given a center word c:

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

where:
v - center word vector
u - context word vector
Skip-Gram likelihood to maximize:

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$

which is the negative log-likelihood to minimize:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$
Fig 3. End-to-end Skip-gram model training on the sentence “The quick brown fox”. Window m=1 (only left and right)
Skip-gram captures the meaning of words given their local context
Problem: In the sentence "The cat..." , tokens "The" and "cat" might often appear together but Skip-gram doesn’t know if “The” is a common word or a word closely related to “cat” specifically.
Solution is the GloVe model : it takes into account the frequency of "The" in the global text and specifically together with the word "cat".
The GloVe model takes into account both the local context and the global statistics of words in the text.
Main idea: focus on co-occurrence probabilities:

$$P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}$$

how often word j appears within the context of word i, where:
X - matrix of co-occurrences
X_ij = number of times word j appears in the context of word i
X_i = Σ_k X_ik - total number of words appearing in the context of word i
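The co-occurrence probability P_ij = X_ij / X_i can be computed directly from a toy co-occurrence matrix (counts here are made up for illustration):

```python
# Co-occurrence probabilities from a toy matrix X
# rows/columns are indexed by `words`; X[i][j] = times word j
# appears in the context of word i
words = ["the", "cat", "dog"]
X = [
    [0, 10, 8],   # "the" row
    [10, 0, 2],   # "cat" row
    [8, 2, 0],    # "dog" row
]

def p(i, j):
    X_i = sum(X[i])           # total co-occurrences of word i
    return X[i][j] / X_i      # P_ij = X_ij / X_i

print(p(0, 1))  # P("cat" appears in the context of "the")
```

GloVe's key quantity is then the ratio p(i, k) / p(j, k), which separates words related to i from words related to j.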