Natural language processing basics
Basics of NLP and the count-based (non-neural) approach
NLP = linguistics + machine learning
NLP structure
Before neural networks, NLP used count-based (aka statistical) approaches, mainly Bayesian, which couldn't:
work with unseen words
Text preprocessing
1. Tokenization : from raw text to a "vocabulary" of tokens like "change", "go", "=" etc
note: for now "change" and "changes" are two different tokens.
2. Normalization : reduce the token vocabulary
Stemming : chop off word endings ("changes" → "chang")
Lemmatization : reduce to the dictionary form, the lemma ("changes" → "change")
3. Deletion : remove unimportant data
stop words : “the”, “is” , “and” etc
non-informative words : "hello", "best regards" etc
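The three preprocessing steps above can be sketched in plain Python. The stemmer and stop-word list here are toy stand-ins (real pipelines use libraries like NLTK or spaCy), just to show how "changes"/"change" and "cats"/"cat" collapse into one token:

```python
import re

# Hypothetical minimal preprocessing pipeline (toy stemmer, tiny stop-word list)
STOP_WORDS = {"the", "is", "and", "a", "of"}

def tokenize(text):
    # 1. Tokenization: lowercase and split on non-word characters
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    # 2. Normalization (crude stemming): strip a few common English suffixes
    for suffix in ("ing", "es", "e", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # 3. Deletion: drop stop words, then stem what remains
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("The cat changes and the cats change"))
# → ['cat', 'chang', 'cat', 'chang']
```

After normalization, "changes" and "change" both map to "chang", so the vocabulary shrinks.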
"Hello world" example of NLP model : token → OHE → FC → Softmax → next word distribution
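The token → OHE → FC → Softmax pipeline can be sketched end to end in a few lines. The vocabulary and the weight matrix W below are made up for illustration (a real model would learn W from data):

```python
import math
import random

# Toy next-word model: token -> OHE -> fully connected layer -> softmax
vocab = ["hello", "world", "there"]
V = len(vocab)

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

random.seed(0)
# Hypothetical untrained FC weights, V x V
W = [[random.uniform(-1, 1) for _ in range(V)] for _ in range(V)]

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_distribution(word):
    x = one_hot(word)                                   # token -> OHE
    logits = [sum(W[i][j] * x[j] for j in range(V))     # OHE -> FC
              for i in range(V)]
    return softmax(logits)                              # FC -> softmax

probs = next_word_distribution("hello")
print(dict(zip(vocab, probs)))  # a probability distribution over the next word
```

The output always sums to 1, i.e. it is a valid distribution over the next word; training would push probability mass toward the true continuation.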
Words encoding
Example :
['Text of the very first new sentence with the first words in sentence.',
 'Text of the second sentence.',
 'Number three with lot of words words words.',
 'Short text, less words.']
Def. Embedding of word X = any arbitrary transformation of word X into a numerical representation (vector)
One-hot encoding
OHE places a 1 at the word's index in the vocabulary and 0 everywhere else.
Problems:
OHE vectors don't take context into account
no meaningful metric between two words can be defined (all pairs are equally distant)
huge memory footprint: an OHE vector has as many axes as the vocabulary (possibly millions), each carrying a single bit of information
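A minimal sketch of OHE over a vocabulary built from two toy sentences (the sentences are shortened stand-ins for the example corpus above):

```python
# One-hot encoding over a small vocabulary
sentences = ["text of the first sentence", "short text less words"]
vocab = sorted({w for s in sentences for w in s.split()})

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1   # 1 at the word's index, 0 everywhere else
    return vec

print(vocab)
print(one_hot("text"))
# every vector has len(vocab) axes but carries only one bit of information
```

Note that one_hot("text") and one_hot("words") are equally distant from each other as from any other pair, which is exactly the "no meaningful metric" problem.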
Bag of words
A sentence is represented as the count of every vocabulary word in it.
Problems: word order and context are not taken into account.
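Bag of words is just per-word counting, e.g. with `collections.Counter` over one of the example sentences:

```python
from collections import Counter

# Bag of words: a sentence becomes per-word counts over its tokens
sentence = "number three with lot of words words words"
counts = Counter(sentence.split())

print(counts["words"])  # → 3
# "words words words lot" would give the same counts: word order is lost
```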
Encoding weights
Encoding weights help to encode a word so that its embedding better represents the word's meaning and importance.
Term frequency - inverse document frequency (TF-IDF)
TF-IDF tells how important a particular word is for a document.
TF-IDF relies on the hypothesis that words frequent in a document but rare across the corpus are more important
So TF-IDF lowers the weights of common words ('this', 'a', 'she') and raises the weights of rare important words ('space', 'nature', 'shark')
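One common TF-IDF variant (tf × log(N / df), without the smoothing that libraries like scikit-learn add) can be computed by hand on a toy corpus:

```python
import math
from collections import Counter

# Toy corpus: 3 documents, made up for illustration
docs = [
    "the shark swims in the sea",
    "the cat sits on the mat",
    "the sea is deep",
]
N = len(docs)
tokenized = [d.split() for d in docs]
# df[w] = number of documents containing word w
df = Counter(w for doc in tokenized for w in set(doc))

def tf_idf(word, doc_tokens):
    tf = doc_tokens.count(word) / len(doc_tokens)   # term frequency
    idf = math.log(N / df[word])                    # inverse document frequency
    return tf * idf

print(tf_idf("the", tokenized[0]))    # → 0.0 ("the" occurs in every document)
print(tf_idf("shark", tokenized[0]))  # > 0 (rare word gets a high weight)
```

"the" appears in all N documents, so its idf is log(N/N) = 0 and its weight vanishes, while "shark" appears in only one document and gets the highest weight.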
Pointwise mutual information (PMI)
PMI tells how likely two (or N) words are to come together.
ex. "Puerto" and "Rico" are more likely to come together than "Puerto" and "Cat"
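PMI(x, y) = log( p(x, y) / (p(x) p(y)) ), i.e. how much more often the pair occurs than it would under independence. A sketch with made-up counts for the "Puerto Rico" example:

```python
import math

# Hypothetical corpus statistics (counts are made up for illustration)
total = 1000
pair_count = {("puerto", "rico"): 50, ("puerto", "cat"): 1}
word_count = {"puerto": 60, "rico": 55, "cat": 40}

def pmi(x, y):
    p_xy = pair_count[(x, y)] / total
    p_x = word_count[x] / total
    p_y = word_count[y] / total
    # log of "observed co-occurrence" over "expected under independence"
    return math.log(p_xy / (p_x * p_y))

print(pmi("puerto", "rico") > pmi("puerto", "cat"))  # → True
```

A positive PMI means the words co-occur more often than chance; "puerto"/"cat" here even comes out negative, i.e. rarer together than independence would predict.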
Embeddings
Check Embeddings page
Context embeddings
Main Hypothesis: similar words (by meaning) should have similar vectors (by distance)
Def. Context embedding of word X = vector of co-occurrence counts of word X with each word appearing within a window of surrounding words
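Counting co-occurrences within a window can be sketched directly (window size and sentence are arbitrary choices for illustration):

```python
from collections import Counter

# Context embedding by counting co-occurrences within a +/-2 word window
tokens = "the quick brown fox jumps over the lazy dog".split()
window = 2

def context_embedding(target):
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            # count every word within `window` positions of the target
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

print(context_embedding("fox"))
# counts of the words near "fox"; distant words like "dog" never appear
```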
Word2Vec : use the context
Word2Vec transforms individual words into a numerical representation (aka embeddings)
It relies on the main NLP hypothesis : similar words (by meaning) should have similar vectors (by distance)
Unlike OHE, word2vec takes context into account. It has two main architectures: Skip-gram and CBOW
Continuous bag of words (CBOW)
CBOW predicts a target word from a list of context words
✔️ fast, as it predicts only one distribution
⚠️ the order of words in the context is not important
Skip-Gram Model
Skip-Gram predicts context words from a target word
✔️ works well for rare target words
❌ slow, hard to train
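The two architectures differ only in the direction of prediction, so their training data is the same set of (center, context) pairs. A sketch of extracting those pairs with window m=1 from the sentence used in Fig 3 (Skip-gram maps center → context; CBOW goes the other way):

```python
# Generating (center, context) training pairs, window m = 1
tokens = "the quick brown fox".split()
m = 1

pairs = []
for i, center in enumerate(tokens):
    # every word within m positions of the center becomes a context word
    for j in range(max(0, i - m), min(len(tokens), i + m + 1)):
        if j != i:
            pairs.append((center, tokens[j]))

print(pairs)
# → [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#    ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```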
Cross-Entropy Loss (CEL)
CEL tells how far a predicted distribution is from the ground truth.
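With a one-hot ground truth, cross-entropy reduces to minus the log of the probability assigned to the true class:

```python
import math

# Cross-entropy: CE = -sum_i y_i * log(p_i)
# With one-hot y this reduces to -log(p_true)
def cross_entropy(y_true, y_pred):
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)

y_true = [0, 1, 0]                               # true next word is class 1
good = cross_entropy(y_true, [0.1, 0.8, 0.1])    # close prediction, low loss
bad = cross_entropy(y_true, [0.8, 0.1, 0.1])     # wrong prediction, high loss
print(good < bad)  # → True
```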
Skip-Gram structure
Last layer: SoftMax to predict the distribution of the outside (context) word o given a center word c:

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

where:
v - center word vector
u - context word vector
Skip-Gram likelihood to maximize:

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$

which is the negative log-likelihood to minimize:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$
Fig 3. End-to-end Skip-gram model training on the sentence “The quick brown fox”. Window m=1 (only left and right)
Skip-gram captures the meaning of words given their local context
Problem: In the sentence "The cat..." , tokens "The" and "cat" might often appear together but Skip-gram doesn’t know if “The” is a common word or a word closely related to “cat” specifically.
Solution is the GloVe model : it takes into account the frequency of "The" in the global text and specifically together with the word "cat".
The GloVe model takes into account both the local context and the global statistics of words in the text.
Main idea: focus on co-occurrence probabilities:

$$P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}$$

how often word j appears within the context of word i, where:
X - matrix of co-occurrences
X_ij = number of times word j appears in the context of word i
X_i = Σ_k X_ik - total number of words appearing in the context of word i
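The co-occurrence probability P_ij = X_ij / X_i can be computed directly from a toy co-occurrence matrix (counts here are made up for illustration):

```python
# Co-occurrence probabilities from a toy matrix X
# rows/columns are indexed by `words`; X[i][j] = times word j
# appears in the context of word i
words = ["the", "cat", "dog"]
X = [
    [0, 10, 8],   # "the" row
    [10, 0, 2],   # "cat" row
    [8, 2, 0],    # "dog" row
]

def p(i, j):
    X_i = sum(X[i])           # total co-occurrences of word i
    return X[i][j] / X_i      # P_ij = X_ij / X_i

print(p(0, 1))  # P("cat" appears in the context of "the")
```

GloVe's key quantity is then the ratio p(i, k) / p(j, k), which separates words related to i from words related to j.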