Open Domain Question Answering
Question Answering (QA) aims to provide precise answers to a user's questions posed in natural language. QA can be divided into two types: Knowledge Base QA (KB-QA) and Textual QA. KB-QA searches for answers in a predefined knowledge base, whereas Textual QA extracts them from unstructured text documents. Depending on the textual information available, textual QA is studied under two settings: Machine Reading Comprehension (MRC) and Open-domain QA (OpenQA). In the MRC setting, both the question and a context passage or document containing the correct answer are provided, whereas in the OpenQA setting the system must answer a question given only a collection of documents. In this second setting, in addition to finding the correct answer in a context passage, the model must first rank the documents by their likelihood of containing the correct answer.
The traditional QA system usually comprises three stages. The first is question analysis, which facilitates relevant document retrieval by generating search queries, and predicts the question type to facilitate answer extraction in the later stages. The second stage is document retrieval, for which various models have been used, among them:
• The Boolean model: a simple model that records the words present in each document; queries are structured as Boolean expressions combining terms with Boolean operators, of the form « word1 AND word2 NOT word3 ». The model returns 1 if the document satisfies the expression and 0 otherwise.
• Vector Space Models: both queries and documents are first transformed into vector representations, then their similarity is computed using simple functions (e.g. cosine similarity).
• Probabilistic Models: integrate term statistics, for example term frequency and document length, into a relevance model.
• Language Models: for each document d in the collection, a language model Md (such as a unigram language model) is constructed. For a query q, the documents are then ranked according to the probability p(Md|q).
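As a minimal sketch of the vector space model above, the following snippet represents queries and documents as raw term-frequency vectors and ranks documents by cosine similarity (a real system would add TF-IDF weighting and proper tokenization; all names are illustrative):

```python
import math
from collections import Counter

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse term-count vectors."""
    common = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in common)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, documents):
    """Rank documents by decreasing cosine similarity to the query."""
    q_vec = Counter(query.lower().split())
    scored = [(cosine_similarity(q_vec, Counter(d.lower().split())), d)
              for d in documents]
    return [d for score, d in sorted(scored, reverse=True)]
```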
Finally, the answer extraction phase is responsible for extracting the answer from the context passage; it relies on different matching methods, for example word or phrase matching.
Modern OpenQA systems
Inspired by traditional architectures, modern OpenQA systems are based on the « retriever-reader » architecture. The retriever can be thought of as an information retrieval (IR) system: it is designed to extract the documents relevant to a given question, ranking the documents in the corpus according to their relevance and returning the top-k scoring ones. In other words, retrieval is often formulated as a text ranking problem: the task of ranking the documents of a corpus C = {d_i}, composed of an arbitrary number of textual documents, in terms of their relevance to a query q. The reader, in turn, is designed to infer the final answer from the received documents; it is commonly implemented as a neural MRC model. These are the two key elements of a modern OpenQA system.
Figure 1: Modern OpenQA Architecture
In order to quantify the quality of the OpenQA system modules, several metrics are used. The evaluation of a model over a set of queries is called a run. The evaluation can be performed on each module separately or on the pipeline as a whole. The most important metrics used in this project are described below.
The Retriever model takes as input a set of questions and a set of documents, and outputs, for each question, a ranked list of documents that might be useful for answering it. In order to evaluate the retriever, we need a gold standard: for each question, the correct document(s) to retrieve.
Recall. Recall is the fraction of correct retrievals over the queries in a single run. It is often evaluated at a cutoff k, hence Recall@k.

Recall = \frac{\text{correct retrieval count}}{Q}

where the correct retrieval count is the number of times the retriever found at least one relevant document for a query in a single run, and Q is the number of queries (questions) in the run. The metric is simple to interpret but does not take into account the position of the relevant document, also known as the rank.
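A minimal computation of Recall@k under this definition might look as follows (function and variable names are illustrative):

```python
def recall_at_k(runs, gold, k):
    """Fraction of queries whose top-k retrieved list contains at least one
    gold-relevant document.

    runs: dict query_id -> ranked list of retrieved doc ids
    gold: dict query_id -> set of relevant doc ids
    """
    hits = sum(1 for qid, ranked in runs.items() if set(ranked[:k]) & gold[qid])
    return hits / len(runs)
```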
Mean Average Precision (MAP) is the mean of the average precision over the queries. It is used when there are many relevant documents per query.

MAP = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{AvePrecision}(q)

where AvePrecision(q) is the mean of the precision scores obtained after each relevant document is retrieved.
Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks of the first retrieved relevant document over a sample of queries Q, where the reciprocal rank is the multiplicative inverse of the rank of the first correct answer.
Recall is considered a limiting factor for QA systems and is regarded as the system's upper bound. MAP and MRR, on the other hand, are measurements of retrieval quality and provide an assessment of the rank.
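The two rank-aware metrics can be sketched as follows (illustrative implementation, assuming each query comes with a set of gold-relevant documents):

```python
def average_precision(ranked, relevant):
    """Mean of the precision values taken at the rank of each relevant document."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, gold):
    """MAP: mean of per-query average precision."""
    return sum(average_precision(runs[q], gold[q]) for q in runs) / len(runs)

def mean_reciprocal_rank(runs, gold):
    """MRR: average of 1/rank of the first relevant document (0 if none found)."""
    total = 0.0
    for q, ranked in runs.items():
        for rank, doc in enumerate(ranked, start=1):
            if doc in gold[q]:
                total += 1.0 / rank
                break
    return total / len(runs)
```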
The Reader model takes as input a set of questions and a context document for each question, and outputs, for each question, an answer in natural language. In order to evaluate the reader, we need a gold standard: for each question, the correct answer to extract.
Exact match (em) is a metric that checks whether the predicted answer matches exactly the true answer, for every query in a single run.
em = \frac{\text{exact matches count}}{\text{correct retrievals}}

The number of correct retrievals is used as denominator when the reader is evaluated separately, i.e. it is not penalized for retriever errors. When the whole system is evaluated, the number of queries Q is used instead.
The definition of the F1 score is not straightforward in the case of natural language processing. The F1 metric measures the overlap between the true answer and the predicted answer, which makes it less strict than em. It is calculated in the same way as the standard F1 score:

F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)}

where:

TP: the number of tokens (usually characters in this context) present in both the predicted answer and the true answer.

FP: the number of tokens present in the predicted answer but not in the true answer.

FN: the number of tokens present in the true answer but not in the predicted answer.
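Under these definitions, em and token-overlap F1 can be sketched as follows. Word tokens are used here for simplicity (as in SQuAD-style evaluation); a character-level variant, as noted above, may be preferable in some contexts. Names are illustrative:

```python
from collections import Counter

def exact_match(prediction, truth):
    """1 if the normalized prediction equals the normalized gold answer, else 0."""
    return int(prediction.strip().lower() == truth.strip().lower())

def f1_score(prediction, truth):
    """Token-overlap F1 between predicted and gold answers.

    Equivalent to TP / (TP + (FP + FN) / 2) from the definitions above.
    """
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    common = Counter(pred_tokens) & Counter(true_tokens)
    tp = sum(common.values())
    if tp == 0:
        return 0.0
    precision = tp / len(pred_tokens)
    recall = tp / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```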
Initial status of the project Piaf
The system Piaf is based on Haystack, an end-to-end framework for building production-ready search pipelines for Question Answering and semantic document search.
This framework offers strong backends such as Elasticsearch, FAISS and Transformers, and it is highly customizable for integrating custom models. Piaf QA includes the two building blocks described previously: a retriever for document selection and a reader for answer extraction. The reader is a transformer-based language model trained on a French QA dataset, and the retriever is a TF-IDF-like algorithm (BM25) provided by the Elasticsearch backend.
To assess the system’s performance, LabIA maintains datasets from its clients. DILA (Direction de l’information légale et administrative), which runs the service-public website, is one of these clients. The DILA dataset is composed of 3121 documents from the service-public.fr website and 380 user-asked queries with answers provided by public service agents. The dataset spans a wide range of themes, such as social security, working conditions and legal procedures. It is characterized by documents that are considered long in the context of text ranking, with an average length of 913 words per document, and by domain-specific terminology.
When supplied with an appropriate context, the reader is adequately efficient; however, the retriever component frequently fails to provide the appropriate context in the full pipeline setup. In this project, we therefore focus on improving the retriever component of the Piaf system.
Figure 2: Example of a context with annotated answer and question in the passport file of DILA dataset
We now have a fundamental understanding of OpenQA technologies as well as the project’s initial state. The first stage of the internship is to investigate classical methods, also known as sparse retrievers or searchers. The purpose of this section is to present these methods before implementing them and testing their results on the DILA dataset.
Exact term matching
Exact term matching methods, or sparse retrievers, are based on classical IR methods. In exact-term matching, a term from the query contributes to the ranking or relevance score only if it matches a document term exactly. Usually, these terms are normalized to some extent, for example by stemming. The similarity function between a document d and a query q for these methods can be written as follows:
S(q, d) = \sum_{t \in q \cap d} f(t)
where f is a function of a term and its associated statistics, the two most important of which are term frequency and document frequency. A central theme of early research on this topic was the exploration of various term-weighting schemes for representing documents in vector space using easily computed statistics. Two of the best known of these methods are TF-IDF (term frequency–inverse document frequency) and BM25. These methods are still an entry point for many recent approaches to text ranking.
Okapi BM25 is a probabilistic model used to estimate the relevance of documents to a given search query. It is an improvement on the TF-IDF retrieval method.
\mathrm{Score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,\frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)} \quad (1)

where f(q_i, D) is the frequency of term q_i in document D, |D| is the length of document D, avgdl is the average document length in the corpus, IDF(q_i) is the inverse document frequency of q_i, and k_1 and b are free parameters.
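A direct transcription of the BM25 formula on toy tokenized documents might look as follows (illustrative only; the exact IDF variant differs slightly between implementations, and in practice the scoring is delegated to Elasticsearch):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """BM25 score of one tokenized document (list of terms) for a query.

    corpus is a list of tokenized documents; uses the common
    log(1 + (N - df + 0.5) / (df + 0.5)) IDF variant.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)   # document frequency
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc.count(term)                       # term frequency f(q_i, D)
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score
```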
LMDirichletSimilarity and LMJelinekMercerSimilarity. Unlike BM25, query words that are not present in a document still contribute to the score: both models use smoothing techniques, mixing document and collection statistics, to estimate the relevance of documents to a given search query.

– LMDirichletSimilarity

\mathrm{Score}(D, Q) = \sum_{i=1}^{n} \log\left(\frac{f(q_i, D) + \mu\, f(q_i, C)/|C|}{|D| + \mu}\right) \quad (2)

– LMJelinekMercerSimilarity

\mathrm{Score}(D, Q) = \sum_{i=1}^{n} \log\left((1 - \lambda)\,\frac{f(q_i, D)}{|D|} + \lambda\,\frac{f(q_i, C)}{|C|}\right) \quad (3)

where f(q_i, D) is the frequency of term q_i in document D, f(q_i, C) its frequency in the collection C, |D| is the total number of words in document D, |C| is the total number of words in the collection C, and \mu and \lambda are free parameters.
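Both smoothed language models can be sketched directly from the formulas above (illustrative; the Lucene implementations used by Elasticsearch differ in details such as score normalization):

```python
import math

def lm_dirichlet_score(query_terms, doc, corpus, mu=2000):
    """Query log-likelihood under a Dirichlet-smoothed unigram document model."""
    coll = [t for d in corpus for t in d]          # whole collection as one bag
    score = 0.0
    for term in query_terms:
        p_c = coll.count(term) / len(coll)         # collection model f(q,C)/|C|
        p_d = (doc.count(term) + mu * p_c) / (len(doc) + mu)
        score += math.log(p_d) if p_d > 0 else float("-inf")
    return score

def lm_jelinek_mercer_score(query_terms, doc, corpus, lam=0.1):
    """Query log-likelihood with Jelinek-Mercer (linear) smoothing."""
    coll = [t for d in corpus for t in d]
    score = 0.0
    for term in query_terms:
        p_c = coll.count(term) / len(coll)
        p_d = (1 - lam) * doc.count(term) / len(doc) + lam * p_c
        score += math.log(p_d) if p_d > 0 else float("-inf")
    return score
```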
Divergence from randomness (DFR). The DFR model assumes that the relevant words in a document are those whose frequencies diverge from the frequency indicated by a fundamental randomness model.
w_{i,j} = f(q_i, Q)\,\left[-\log p(q_i \mid C)\right]\,\left[1 - p(q_i \mid D)\right]

where f(q_i, Q) is the frequency of the term in the query, and p(q_i|C) and p(q_i|D) are probability estimates of the term in the collection and in the document respectively, obtained from randomness models such as Poisson or Bose-Einstein.
Divergence from independence (DFI). The DFI model replaces the notion of randomness in DFR with the notion of independence. A term in a document has more weight if its frequency diverges from the frequency predicted by the independence model.
e(q_i, D) = f(q_i, C)\,\frac{|D|}{|C|}

where e(q_i, D) is the expected frequency of the term in document D, and |D| and |C| are the lengths of the document and the collection respectively, in number of words.
The initial step was to perform a benchmark of sparse retrievers, to use the findings as a reference point for the forthcoming parts of the project. The current system uses the default BM25 retriever with its default parameters. The goal is to explore the integration of other methods and to search for the parameters that produce the best results on the data.
Before proceeding to the retrievers, we first apply the necessary preprocessing to the data. In this part, we performed the following preprocessing steps:
• elision: removes elisions from the beginning of a token, e.g. l'arbre -> arbre
• stemming: reduces a word to its root form; this guarantees that different variations of a word match.
• stop words: removes stop words from the token sequence, e.g. le, la, les
• lower case
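These steps can be sketched as a simple Python pipeline (the stop-word list and elision prefixes are illustrative subsets; stemming, omitted here, would typically use a French Snowball stemmer):

```python
import re

# Illustrative subset of French stop words; a real list is much longer.
FRENCH_STOP_WORDS = {"le", "la", "les", "de", "des", "un", "une", "et"}

# Common French elision prefixes (l', d', qu', ...) at the start of a token.
ELISION_RE = re.compile(r"^(l|d|j|m|n|s|t|c|qu)'", re.IGNORECASE)

def preprocess(text):
    """Lowercase, strip elisions, and drop stop words from a text."""
    tokens = text.lower().split()
    tokens = [ELISION_RE.sub("", t) for t in tokens]
    return [t for t in tokens if t not in FRENCH_STOP_WORDS]
```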
For each algorithm mentioned in 3.2, we perform the test with and without preprocessing, together with a grid search over their parameters.
The test was performed using Elasticsearch similarity modules and the piaf-ml testing framework, on the DILA dataset.
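As an illustration of the grid search, the index settings accepted by Elasticsearch's similarity module can be generated for each parameter combination (parameter values are illustrative; the settings shape follows the Elasticsearch similarity-module documentation, and applying them to a live index is omitted):

```python
from itertools import product

def bm25_settings(k1, b):
    """Elasticsearch index settings declaring a custom BM25 similarity
    (field mappings that reference "custom_bm25" are omitted)."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    "custom_bm25": {"type": "BM25", "k1": k1, "b": b}
                }
            }
        }
    }

# Candidate parameter grid to evaluate, one settings body per combination.
grid = [bm25_settings(k1, b)
        for k1, b in product([0.8, 1.2, 2.0], [0.5, 0.75, 1.0])]
```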
Table of contents :
1 Internship Presentation
1.1 Problem formulation
1.2 Piaf project
1.3 Internship objectives
2.1 Open Domain Question Answering
2.1.1 Traditional Architectures
2.1.2 Modern OpenQA systems
2.1.3 Evaluation Metrics
2.2 Initial status of the project Piaf
3 Sparse Methods
3.1 Exact term matching
3.2 Theoretical background
4 Neural Language Models
4.1.1 The encoder layer
4.1.2 The decoder layer
4.3 BERT for text ranking task
4.4 State of the art
4.4.1 Multi stage re-rankers
4.4.2 Dense retrievers
5.1 French QA datasets
5.2 Text ranking dataset creation
5.3 Dataset Analysis
5.4 Data split
6 Methodology and implementation
6.1 BERT based model
6.2 Embeddings extraction
6.3 Classification Approach
6.4 Pairwise Approach
6.5 Complementary scoring
6.6 Fine-tuning the complete model
6.7 Long Documents Ranking
7 Results and discussion
7.1 Experimental setup
7.2 Evaluation Metrics