Multi-lingual NER

Goal: Identify named entities (NER) in text from Swiss users, covering 4 languages

Model: base : XLM-RoBERTa + head : token-classification

Dataset: multi-lingual PAN-X dataset (DE, FR, IT and EN)

Steps:

Dataset

XTREME PAN-X Dataset

Labeled sentences in IOB format

We import PAN-X for DE, FR, IT and EN and sample each language in proportion to its share in Switzerland:
Swiss = 63% DE + 23% FR + 8% IT + 6% EN.
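The language mix above can be turned into per-language row counts. The following is a minimal sketch, not the code used for these notes: `SPLIT_SIZE` (20,000 rows per language) and the helper name `subset_sizes` are assumptions made for illustration. With that assumed size, the German share works out to the 12.6k training examples quoted in the German fine-tuning section.

```python
# Language shares as stated in the notes (fractions of the Swiss mix).
SHARES = {"de": 0.63, "fr": 0.23, "it": 0.08, "en": 0.06}

# Assumed number of rows available per language in one split (not from the notes).
SPLIT_SIZE = 20_000


def subset_sizes(split_size: int, shares: dict[str, float]) -> dict[str, int]:
    """Return how many rows to keep per language for a split of `split_size` rows."""
    # round() avoids off-by-one results caused by floating-point products.
    return {lang: round(share * split_size) for lang, share in shares.items()}


sizes = subset_sizes(SPLIT_SIZE, SHARES)
for lang, n_rows in sizes.items():
    print(f"{lang}: {n_rows}")
```

Each language corpus would then be shuffled and truncated to its row count before training.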

Total 7 tags : O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC

Example:
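As an illustration of the tag set, the sketch below maps the 7 tags to integer ids and encodes one sentence. The sentence is hypothetical (written for this example, not taken from PAN-X), and the id order is assumed to follow the order in which the tags are listed above.

```python
# The 7 tags, in the order listed in the notes.
TAGS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
tag2index = {tag: idx for idx, tag in enumerate(TAGS)}
index2tag = {idx: tag for tag, idx in tag2index.items()}

# Hypothetical sentence for illustration only (not a PAN-X row).
tokens = ["Roger", "Federer", "was", "born", "in", "Basel"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]

# B- marks the first token of an entity, I- a continuation, O a non-entity token.
tag_ids = [tag2index[tag] for tag in tags]

for token, tag, tag_id in zip(tokens, tags, tag_ids):
    print(f"{token:<8} {tag:<6} {tag_id}")
```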

Model

Body: Cross-lingual RoBERTa (XLM-R)

XLM-RoBERTa is a multilingual version of RoBERTa

XLM-R is pre-trained on 2.5TB of filtered CommonCrawl text covering 100 languages

Head: Token-Classification for NER

nn.Dropout()
nn.Linear(hidden_size, num_labels)  # num_labels = 7
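The two layers above can be wrapped in a small PyTorch module. This is a sketch under assumptions: the class name `TokenClassificationHead`, the dropout probability of 0.1, and the hidden size of 768 (the size of the base XLM-R checkpoint) are not stated in the notes, and a random tensor stands in for the encoder output.

```python
import torch
from torch import nn


class TokenClassificationHead(nn.Module):
    """Dropout followed by a per-token linear projection to tag logits."""

    def __init__(self, hidden_size: int, num_labels: int = 7, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)  # dropout probability is an assumption
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) produced by the encoder body
        return self.classifier(self.dropout(hidden_states))


# Random stand-in for encoder output: 2 sentences, 5 tokens each, hidden size 768.
head = TokenClassificationHead(hidden_size=768)
head.eval()  # disables dropout for inference
logits = head(torch.randn(2, 5, 768))  # (2, 5, 7): one score per tag per token
predictions = logits.argmax(dim=-1)    # (2, 5): predicted tag id per token
```

The same linear layer is applied independently at every token position, which is what distinguishes this head from a sequence-classification head that only reads one pooled vector.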

Tokenizer

Preprocess:
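XLM-R splits words into subword pieces, while the dataset labels whole words, so the labels have to be aligned to the tokenizer output. The sketch below shows one common convention, assumed here rather than taken from the notes: the first piece of each word keeps the word's label, and remaining pieces and special tokens get -100 so the loss ignores them. `word_ids` mimics the list a fast Hugging Face tokenizer returns from `word_ids()`; the values used are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels):
    """Expand word-level label ids to subword level.

    word_ids: for each subword token, the index of the word it came from,
              or None for special tokens such as <s> and </s>.
    word_labels: one label id per word.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None or word_id == previous_word:
            # Special token, or a non-first piece of the current word.
            aligned.append(IGNORE_INDEX)
        else:
            # First piece of a new word keeps that word's label.
            aligned.append(word_labels[word_id])
        previous_word = word_id
    return aligned


# Hypothetical tokenization: word 1 is split into three pieces.
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [1, 2, 0]  # B-PER, I-PER, O
aligned = align_labels(word_ids, word_labels)
print(aligned)
```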

Fine-tune on German

Fine-tuning dataset: PANX.DE['train']

data size: 12.6k

Results:
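The notes do not say which metric the results use. NER is commonly scored with entity-level F1, where a prediction counts only if both the span and the entity type match exactly. The sketch below is a simplified, single-sentence version of that idea; the function names and the example tags are hypothetical.

```python
def extract_spans(tags):
    """Return a set of (entity_type, start, end) spans from IOB tags; end is exclusive."""
    spans, start, etype = set(), None, None
    # A trailing "O" sentinel closes any span still open at the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Entity-level F1 for one sentence: exact match on span boundaries and type."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted tags: the person is correct, the location type is wrong.
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]
score = span_f1(gold, pred)
print(score)
```

For a real evaluation the true-positive, predicted and gold counts would be accumulated over the whole test set before computing F1.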

Fine-tune on French subsets

Fine-tuning dataset: PANX.FR['train']

data sizes: 250, 500, 1k, 2k and 4k
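One way to build the subsets listed above is to shuffle the training indices once and take prefixes, so that every smaller subset is contained in the larger ones and runs differ only in the amount of data. This is a sketch under assumptions: the notes do not say whether the subsets were nested, and both the helper name `nested_subsets` and the corpus size of 4,600 rows are hypothetical.

```python
import random


def nested_subsets(num_examples, sizes, seed=0):
    """Map each size to a list of row indices; smaller subsets are prefixes of larger ones."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the subsets reproducible
    return {size: indices[:size] for size in sizes if size <= num_examples}


SIZES = [250, 500, 1000, 2000, 4000]
NUM_FRENCH_ROWS = 4600  # assumed size of the French training split

subsets = nested_subsets(NUM_FRENCH_ROWS, SIZES)
for size, rows in subsets.items():
    print(size, len(rows))
```

Each index list would then select the training rows for one fine-tuning run.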

Results:

Fine-tune on DE+FR+IT+EN

Fine-tuning dataset:

Notes:

Conclusion