Multi-lingual NER
Goal: identify named entities (NER) for a Swiss user (4 languages)
Model: base: XLM-RoBERTa + head: token classification
Dataset: multilingual PAN-X dataset (DE, FR, IT and EN)
Steps:
Fine-tune XLM-R on PANX.DE only and zero-shot on FR, IT, EN.
Fine-tune XLM-R on PANX.FR with 250, 500, 1k, 2k and 4k examples and compare
Fine-tune XLM-R on PANX DE + FR, compare results
Fine-tune XLM-R on PANX DE + FR + IT + EN. Draw conclusions
Dataset
XTREME PAN-X dataset
Sentences labeled in IOB format
We import PAN-X subsets for DE, FR, IT and EN in Swiss proportions:
Swiss = 63% DE + 23% FR + 8% IT + 6% EN.
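The Swiss proportions translate directly into per-language sample counts; a minimal sketch (the helper name is made up for illustration):

```python
# Hypothetical helper: number of training examples per language so the
# combined corpus mirrors Swiss language proportions.
def swiss_sample_sizes(total: int) -> dict:
    fractions = {"de": 0.63, "fr": 0.23, "it": 0.08, "en": 0.06}
    return {lang: round(total * frac) for lang, frac in fractions.items()}

sizes = swiss_sample_sizes(20000)
# → {'de': 12600, 'fr': 4600, 'it': 1600, 'en': 1200}
```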
Total 7 tags : O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC
Example:
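A made-up illustrative sentence (not taken from PAN-X), tagged in IOB format: B-X opens an entity of type X, I-X continues it, O is outside any entity.

```python
# Made-up example sentence with one tag per token (IOB scheme)
tokens = ["Jeff", "Dean", "arbeitet", "bei", "Google", "in", "Zürich"]
tags   = ["B-PER", "I-PER", "O",      "O",   "B-ORG",  "O",  "B-LOC"]
assert len(tokens) == len(tags)  # NER labels are token-aligned
```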
Model
Body: cross-lingual RoBERTa (XLM-R)
XLM-RoBERTa is a multilingual version of RoBERTa
XLM-R is pre-trained on 2.5TB of CommonCrawl data covering 100 languages
Head: Token-Classification for NER
nn.Dropout()
nn.Linear(hidden_size, num_labels)  # num_labels = 7
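The two layers above can be wrapped into a small module; a sketch assuming hidden_size=768 (xlm-roberta-base) and the 7 PAN-X labels:

```python
import torch
import torch.nn as nn

class TokenClassificationHead(nn.Module):
    """Sketch of the NER head on top of XLM-R hidden states.
    hidden_size=768 assumes xlm-roberta-base; num_labels=7 for the PAN-X tags."""
    def __init__(self, hidden_size: int = 768, num_labels: int = 7, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the body
        return self.classifier(self.dropout(hidden_states))

head = TokenClassificationHead()
logits = head(torch.randn(2, 10, 768))  # one score per tag: (2, 10, 7)
```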
Tokenizer
Preprocess:
Tokenizer : AutoTokenizer.from_pretrained("xlm-roberta-base")
tag subsequent subwords with IGN (-100, ignored by the loss)
wrap with <s> and </s>
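The label-alignment step can be sketched in pure Python over the tokenizer's word_ids() output (`align_labels` is a hypothetical helper, no model download needed):

```python
IGN = -100  # label value ignored by PyTorch's CrossEntropyLoss

def align_labels(word_ids, word_labels):
    """Keep the tag on the first subword of each word; mark special tokens
    (<s>, </s> → word_id None) and subsequent subwords with IGN."""
    labels, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:   # special token or subsequent subword
            labels.append(IGN)
        else:                            # first subword of a new word
            labels.append(word_labels[wid])
        prev = wid
    return labels

# e.g. word 1 split into two subwords: <s> w0 w1a w1b </s>
print(align_labels([None, 0, 1, 1, None], [0, 5]))
# → [-100, 0, 5, -100, -100]
```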
Fine-tune on German
FT dataset : PANX.DE['train']
data size: 12.6k
Results:
F1 score on PANX.DE['test'] : 87%
zero-shot F1 PANX.FR['test'] : 71%
zero-shot F1 PANX.EN['test'] : 59%
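The F1 scores above are entity-level; the standard tool is seqeval, but the metric can be illustrated with a simplified stand-in that counts only exact entity-span matches:

```python
def entity_f1(true_entities, pred_entities):
    """Simplified stand-in for seqeval's entity-level F1: entities are
    (type, start, end) tuples and only exact matches count as correct."""
    tp = len(set(true_entities) & set(pred_entities))
    precision = tp / len(pred_entities) if pred_entities else 0.0
    recall = tp / len(true_entities) if true_entities else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# one of two predictions is exact, one of two gold entities is found
print(entity_f1([("PER", 0, 1), ("LOC", 5, 5)],
                [("PER", 0, 1), ("ORG", 3, 4)]))  # → 0.5
```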
Fine-tune on French subsets
Fine-tuning dataset : PANX.FR['train']
data size: 250, 500, 1k, 2k and 4k
Results:
With few French examples, zero-shot transfer from the German model is a strong baseline
NER knowledge transfers across languages
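The subset experiment can be mimicked on a plain list; with Hugging Face datasets the equivalent is `dataset.shuffle(seed=...).select(range(n))` (`subsample` is a stand-in helper):

```python
import random

def subsample(examples, n, seed=0):
    """Stand-in for datasets' shuffle(seed=...).select(range(n)) on a list."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[:n]

corpus = list(range(4580))  # stand-in for PANX.FR['train'] (~4.6k examples)
for n in [250, 500, 1000, 2000, 4000]:
    subset = subsample(corpus, n)
    assert len(subset) == n  # fine-tune on each subset, compare test F1
```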
Fine-tune on DE + FR + IT + EN
Fine-tuning dataset :
12.6k PANX.DE['train']
4.6k PANX.FR['train']
1.6k PANX.IT['train']
1.2k PANX.EN['train']
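Combining the four splits gives roughly a 20k-example corpus in the Swiss proportions; a minimal sketch with plain lists and rounded sizes (with Hugging Face datasets this would be `concatenate_datasets` followed by a shuffle):

```python
import random

# Approximate split sizes from the notes; examples are stand-in tuples.
corpora = {
    "de": [("de", i) for i in range(12600)],  # PANX.DE['train'], ~12.6k
    "fr": [("fr", i) for i in range(4600)],   # PANX.FR['train'], ~4.6k
    "it": [("it", i) for i in range(1600)],   # PANX.IT['train'], ~1.6k
    "en": [("en", i) for i in range(1200)],   # PANX.EN['train'], ~1.2k
}
combined = [ex for corpus in corpora.values() for ex in corpus]
random.Random(0).shuffle(combined)  # mix the languages before training
len(combined)  # 20000 examples total
```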
Notes:
English is the low-resource language here
English sees a significant gain in performance
Conclusion
Cross-lingual transfer is extremely beneficial for low-resource languages
The farther apart the linguistic groups, the smaller the benefit from cross-lingual transfer