Extractive QA

Goal: Given a user question about a product, retrieve the relevant documents and extract the answer from them.

Model: MiniLM (deepset/minilm-uncased-squad2)

Dataset: SubjQA (question-answer pairs about products)

Steps:

1. Load the SubjQA dataset and the MiniLM reader model
2. Set up the Retriever (BM25) and a document store
3. Set up the Reader
4. Run the extractive QA pipeline
5. Evaluate the Retriever, the Reader, and the full pipeline

Example: Extractive QA for an e-commerce website

DATASET: SubjQA electronics

Train examples: 1,295

Validation examples: 255

Test examples:  358
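
These splits can be loaded with the datasets library; a minimal sketch (the hub id "subjqa", the "electronics" config name, and the column names are assumptions — adjust to wherever the data actually lives):

```python
from datasets import load_dataset

# SubjQA, electronics subset (hub id and config name assumed)
subjqa = load_dataset("subjqa", "electronics")
print(subjqa)                          # DatasetDict with train / validation / test
print(subjqa["train"][0]["question"])  # a sample question
```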

Example Context

I really like this keyboard. I give it 4 stars because it doesn’t have a CAPS LOCK key so I never know if my caps are on.  But for the price, it really suffices as a wireless keyboard.  I have very large hands and this keyboard is compact, but I have no complaints.

Example Question

Does the keyboard lightweight? (kept verbatim from SubjQA; the dataset questions are sometimes ungrammatical)

Example Answer

this keyboard is compact
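
Just to see the model in action on this example, a quick sketch with the transformers question-answering pipeline (the printed span is the model's prediction, which may or may not match the gold answer above):

```python
from transformers import pipeline

# Extractive QA pipeline with the MiniLM checkpoint named above
qa = pipeline("question-answering", model="deepset/minilm-uncased-squad2")

context = (
    "I really like this keyboard. I give it 4 stars because it doesn't have "
    "a CAPS LOCK key so I never know if my caps are on. But for the price, "
    "it really suffices as a wireless keyboard. I have very large hands and "
    "this keyboard is compact, but I have no complaints."
)
question = "Does the keyboard lightweight?"  # kept verbatim from the dataset

result = qa(question=question, context=context)
print(result["answer"], result["score"])  # gold answer in SubjQA: "this keyboard is compact"
```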

MODEL


Span classification

Common QA frameworks:



Let's check the Wright Flyer example with MiniLM:

Span classification QA framework:

Example: The Wright brothers flew the motor-operated airplane on December 17, 1903. Their aircraft, the Wright Flyer, used ailerons for control and had a 12-horsepower engine.

It works well! 
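
What "span classification" means in practice: the model outputs a start logit and an end logit for every context token, and the answer is the highest-scoring span. A minimal sketch with naive argmax decoding (the question here is my own, since the notes don't state it):

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_ckpt = "deepset/minilm-uncased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(model_ckpt)

question = "When did the Wright brothers fly?"  # assumed question, not stated in the notes
context = (
    "The Wright brothers flew the motor-operated airplane on December 17, 1903. "
    "Their aircraft, the Wright Flyer, used ailerons for control and had a "
    "12-horsepower engine."
)

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One start logit and one end logit per token; take the argmax of each (naive decoding)
start_idx = int(outputs.start_logits.argmax())
end_idx = int(outputs.end_logits.argmax())
answer_ids = inputs["input_ids"][0][start_idx : end_idx + 1]
print(tokenizer.decode(answer_ids))
```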

However, in real life we only have the questions 🤔 So we need to somehow find the relevant passages in the entire corpus.

The simplest way: concatenate all reviews into one huge context. But this would have unacceptable latency ☹️ The smarter way is:

Retriever-Reader architecture

Retriever: embed the contexts and select the ones whose embeddings have the highest dot product with the query embedding (dense retrieval); sparse retrievers like BM25 score by term matching instead

Reader: extract the answer from the top documents returned by the Retriever

Document store: the database of documents that is served to the Retriever at query time

Set up Retriever

We use a BM25 retriever (a sparse, TF-IDF-style ranking function).

Let's query "What is the length of the cord?"

The retriever managed to pull up related reviews where a potential answer might be found (see the table).
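
A sketch of this retriever setup, assuming the Haystack 1.x API with an in-memory BM25 document store (the notes hint at an Elasticsearch-backed setup; the sample reviews below are toy placeholders, not SubjQA data):

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever

# In-memory store with BM25 enabled (an Elasticsearch-backed store works the same way)
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents([
    {"content": "The cord is about six feet long, plenty for my desk.",  # toy review
     "meta": {"item_id": "B0074BW614"}},
    {"content": "Compact wireless keyboard, great value for the price.",  # toy review
     "meta": {"item_id": "B00001P4ZH"}},
])

retriever = BM25Retriever(document_store=document_store)
docs = retriever.retrieve(query="What is the length of the cord?", top_k=3)
for doc in docs:
    print(doc.meta["item_id"], "->", doc.content)
```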

Set up Reader

The reader is basically the QA framework's abstraction around the model we already played with (MiniLM).
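
A sketch of wrapping that checkpoint in a reader, again assuming Haystack 1.x (FARMReader is one such wrapper):

```python
from haystack.nodes import FARMReader
from haystack.schema import Document

# Wrap the MiniLM checkpoint so it can plug into a retriever-reader pipeline
reader = FARMReader(model_name_or_path="deepset/minilm-uncased-squad2",
                    return_no_answer=True)

# The reader can also be called directly on documents
# (in the full pipeline the retriever supplies them)
context = ("The Wright brothers flew the motor-operated airplane on December 17, 1903. "
           "Their aircraft, the Wright Flyer, used ailerons for control and had a "
           "12-horsepower engine.")
prediction = reader.predict(query="When did the Wright brothers fly?",
                            documents=[Document(content=context)], top_k=1)
print(prediction["answers"][0].answer)
```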


Extractive QA

Finally, we achieved our Goal!

Product: Amazon Kindle e-book (code: B0074BW614)
Query : "Is it good for reading?"
Retriever: BM25
Document dataset: SubjQA
Reader: MiniLM (BERT-based) model

Top 3 answers for the query "Is it good for reading?" and the Kindle e-book product
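
A sketch of running the full retriever-reader pipeline for this query, reusing the retriever and reader objects from the sketches above (Haystack 1.x assumed; the item_id filter key is an assumption about how the SubjQA metadata is stored):

```python
from haystack.pipelines import ExtractiveQAPipeline

# retriever and reader come from the two sketches above
pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)

result = pipe.run(
    query="Is it good for reading?",
    params={
        "Retriever": {"top_k": 3, "filters": {"item_id": ["B0074BW614"]}},
        "Reader": {"top_k": 3},
    },
)
for answer in result["answers"]:
    print(f"{answer.answer!r}  (score={answer.score:.3f})")
```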

Evaluate Retriever

Most of the answer quality comes from the neural Reader (BERT-based), but the Retriever sets the upper bound: the Reader can only find answers in the documents the Retriever provides.

Evaluation

Result:



Sparse (BM25) vs Dense (DPR) retriever evaluation
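
The usual retriever metric is recall@k: for each labeled question, does the document containing the gold answer show up among the top-k retrieved documents? A framework-agnostic sketch (function and variable names are my own):

```python
def recall_at_k(questions, gold_doc_ids, retrieve_fn, k=3):
    """Fraction of questions whose gold document appears in the top-k results.

    retrieve_fn(question, k) is assumed to return a list of document ids.
    """
    hits = 0
    for question, gold_id in zip(questions, gold_doc_ids):
        if gold_id in retrieve_fn(question, k):
            hits += 1
    return hits / len(questions)

# Usage sketch: compare a sparse (BM25) and a dense (DPR) retriever at several k
# for k in (1, 3, 5, 10):
#     print(k, recall_at_k(questions, gold_ids, bm25_retrieve, k),
#              recall_at_k(questions, gold_ids, dpr_retrieve, k))
```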

Evaluate Reader

Extractive QA readers are evaluated with two metrics: Exact Match (EM) and F1.

A representative score balances both: EM is strict (exact string match), while F1 gives partial credit for token overlap.
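
A sketch of computing them with the evaluate library's squad metric (the example strings and the answer_start offset are placeholders):

```python
import evaluate

squad_metric = evaluate.load("squad")

predictions = [{"id": "q1", "prediction_text": "this keyboard is compact"}]
references = [{"id": "q1",
               "answers": {"text": ["this keyboard is compact"],
                           "answer_start": [230]}}]  # placeholder offset

scores = squad_metric.compute(predictions=predictions, references=references)
print(scores)  # {'exact_match': 100.0, 'f1': 100.0}
```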

Domain Adaptation

EM and F1 scores on the SubjQA dataset for 3 models.

base model: MiniLM-L12-H384-uncased
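
Domain adaptation here means continuing to train the SQuAD-tuned reader on the SubjQA training split. A hedged sketch with Haystack 1.x's FARMReader.train; the data directory and file name are assumptions, and the data must first be converted to SQuAD JSON format:

```python
from haystack.nodes import FARMReader

# Start from the SQuAD-tuned checkpoint and keep training on SubjQA
reader = FARMReader(model_name_or_path="deepset/minilm-uncased-squad2")
reader.train(
    data_dir="data/subjqa",       # assumed directory
    train_filename="train.json",  # SubjQA converted to SQuAD JSON format (assumed name)
    use_gpu=True,
    n_epochs=1,
    batch_size=16,
)
reader.save(directory="models/minilm-subjqa")  # reuse later without retraining
```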



Evaluate QA Pipeline

Let's compare the Reader on its own vs the Retriever & Reader pipeline.

We can see the Retriever's impact on overall performance.

Reader-only vs Retriever & Reader scores on SubjQA
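
A sketch of how such a comparison can be run end to end with Haystack 1.x's pipeline evaluation (pipe is the pipeline from the earlier sketch; eval_labels is assumed to be a list of Haystack MultiLabel objects built from the SubjQA test split):

```python
# pipe: the ExtractiveQAPipeline from the earlier sketch
# eval_labels: list of haystack MultiLabel objects built from SubjQA (assumed)
eval_result = pipe.eval(labels=eval_labels,
                        params={"Retriever": {"top_k": 3}})
metrics = eval_result.calculate_metrics()

# Retriever quality vs end-to-end reader quality
print("Retriever recall:", metrics["Retriever"]["recall_single_hit"])
print("Reader F1:       ", metrics["Reader"]["f1"])
print("Reader EM:       ", metrics["Reader"]["exact_match"])
```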

Generative QA

So far we have implemented only extractive QA, which predicts the answer's start and end tokens in the context (span/token classification).

Generative QA can synthesize an answer from parts scattered across the entire context.

Retrieval-Augmented Generation (RAG): couple a retriever with a generative (seq2seq) reader, so the answer is generated conditioned on the retrieved documents rather than extracted as a span.
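
As a quick illustration of the idea (not what these notes built), the transformers library ships pretrained RAG models that couple a dense retriever with a seq2seq generator; the dummy index below avoids downloading the full Wikipedia index:

```python
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

model_name = "facebook/rag-sequence-nq"
tokenizer = RagTokenizer.from_pretrained(model_name)
# use_dummy_dataset=True keeps this sketch lightweight
retriever = RagRetriever.from_pretrained(model_name, index_name="exact",
                                         use_dummy_dataset=True)
model = RagSequenceForGeneration.from_pretrained(model_name, retriever=retriever)

inputs = tokenizer("Is it good for reading?", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```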

Conclusion