Machine learning for information extraction
Recent years have seen an explosive adoption of machine learning techniques in the Informa-tion Extraction area. We distinguish the following family of techniques:
• Supervised learning designates a family of machine learning methods that map an input data point, i.e., a text representation (Section 2.2.5) of text unit (e.g., a phrase, a sentence, a paragraph or an article), to an output, after learning the input-output correspondence on a labeled dataset. A standard practice is dividing the labeled dataset into three parts: the training set of labeled examples is used to learn (“train”) the statistical model, the development set (also called “dev set”, in short) is used to tune the model’s parameters, finally the test set is used to evaluate the model’s performance. The advantage of super-vised learning methods is that they are based on “gold standard” points (usually provided by humans); however, this is also their weakness, as human labor is generally expensive. Another weakness is the quality of gold standard because two humans could easily give two different labels to the same data point. Examples of supervised learning are
– Linear Support Vector Machines (Linear SVM) [Cortes and Vapnik, 1995].
– Gaussian Na¨ıve Bayes [Friedman et al., 1997].
– Random Forests [Breiman, 2001].
– Logistic Regression [Cramer, 2002].
– Gradient Boosted Decision Tree [Friedman, 2000].
• Unsupervised learning is a class of machine learning methods that do not require labeled data; this is an advantage given that they are no longer dependent on the availability of costly human-labeled examples. These methods try to learn from the data distribution in order to discover the interesting structures or semantic characteristics of the data. For example, given a dataset of news articles, a classifier following an unsupervised approach could group together similar articles thanks to the similarity of their semantic represen-tations. Some examples of unsupervised learning are:
– k-Nearest Neighbour Classifiers [Cunningham and Delany, 2007]
– Autoencoder [Rumelhart et al., 1986].
– Local outlier factor [Breunig et al., 2000].
• Semi-supervised learning is a hybrid between supervised and unsupervised learning. Given a set of labeled data L and another (bigger) set of unlabeled data U, the semi-supervised learning methods try to combine the useful information from both sets in order to improve over the performance of supervised models trained on only L. Further, by relying both on L and on U, semi-supervised learning reduces the dependency on a large set of labeled examples. An example of this approach is Transductive SVM [Joachims, 1999].
• Another approach to overcome the difficulty of creating large labeled datasets is distant supervision, which leverages knowledge from a knowledge base to generate the training data. However, this method faces the challenge on handling potentially noisy labeled (training) data thus obtained. This learning paradigm was proposed in [Mintz et al., 2009].
• Data programming is an approach proposed by [Ratner et al., 2016] that could create labeled data from user’s labeling functions using heuristics, patterns, etc. to express weak supervision signals. For example, on the task of extracting the relation isSpouseOf from text, a user could write the labeling function: “If two people who have the same child then they might be husband and wife” in order to return a label 1 (a positive training example) for the relation (A. Nguyen, isSpouseOf, B. Tran) from the sentence “C. Nguyen is the child of A. Nguyen and B. Tran”. The authors implemented a system called Snorkel [Rat-ner et al., 2017] which learns from the agreements and disagreements of all the provided labeling functions to generate probabilistic label for every unlabeled data example. Their system analyzes the relationships of labeling functions to decide when to simply apply majority vote and when to infer the label from all the labeling functions. Finally, a super-vised machine learning model is trained to minimize a loss function of the probabilistic labels. The authors develop connectors to popular machine learning libraries to let end-users have a wide range of platform choices.
The predecessor of Snorkel is DeepDive [Zhang, 2015]. This system relies on heuristic rules and distant supervision to obtain the training data. Multiple labels for the same data example are resolved by majority vote.
Deep learning for information extraction
We recall here basic terminology and most frequently recurring deep learning models to which this manuscript will refer in the sequel.
A neuron is the basic unit (a node) of a neural network. A neuron receives its inputs consisting of a vector of numerical values fx1; x2; : : : ; xng, a vector of weights fw1; w2; : : : ; wng that reflects the importance of each xi, and a bias vector b. The output of neuron Y is computed by a non-linear activation function f: X Y = f( wi xi + b) 1 i n.
The purpose of f is learning the non linear representations of the input data. The most simple neural network is a feed-forward neural network that consists of multiple layers. Each layer is a collection of neurons. There are edges connecting neurons from two adjacent layers. Each edge contains weight that has been discussed previously. Layers could be classified into three types: input layer that represents the input data, hidden layer that trans-forms the data from the input layer to the output layer, output layer that represents the expected output data. A feed-forward neural network with more than one hidden layers is called multi layer perceptron.
There are more sophisticated neural network architectures. Convolutional neural networks (CNNs) [LeCun and Bengio, 1998] are designed to classify images. A CNN uses convolu-tional layers to learn the representations of local regions (e.g., a pixel and its eight surrounding pixels) from the input image. Each convolutional layer learns a specific visual feature. All these extracted features are sent to the output layer for the classification task. To handle text data, a local region is considered as n contiguous words. Specific linguistic feature is learnt in each convolutional layer. CNNs have been applied in text classification [Kim, 2014], relation extraction [Zeng et al., 2014], etc.
Recurrent neural networks (RNNs) are designed to process sequence data, e.g., text as a sequence of words. In the standard feed-forward neural network, neurons in the hidden layer are only connected to neurons from the previous layers. RNN allows the connection between two adjacent neurons in the hidden layer. This modification makes the network learn from the history of the sequence encoded in the hidden layer. The effectiveness of RNNs has been shown in many tasks such as text classification, time series prediction, etc. RNN variants such as Long Short Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997], Bidirectional LSTM (Bi-LSTM) [Schuster and Paliwal, 1997] are more popular than the vanilla RNN in practice.
Metrics for evaluating information extraction quality
It is often the case that we need to evaluate the quality of an information extraction process, in order to get a quantitative grasp of the trust which can be put in its output.
To evaluate the quality, a gold standard of answers considered correct (typically provided by humans) is usually assumed available. The gold standard is in some cases a set, e.g., objects which are sure to belong to a certain class (in a classification problem), or the set of all mentions of humans which an information extraction algorithm must identify in a text. In other cases, the gold standard is a list, for instance, when the problem is to return a ranked list of answers from the most relevant to the least relevant, and a ranked list of relevant answers is specified by a human. Based on such a gold standard, for a given information extraction method which returns a certain set of answers to a given task, a set of popular metrics are:
• Precision, denoted p, is the fraction of the returned results that are part of the gold stan-dard. Precision can be seen as reflecting the usefulness of a method, i.e., how many of the correct results are returned.
• Recall, denoted r, is the fraction of the gold standard that is part of the returned results. Recall can be seen as reflecting the completeness of the method. Precision indicates the usefulness while recall indicates the completeness.
• There is a natural tension between precision and recall; returning more results cannot decrease precision, but it can decrease recall, and vice versa. Thus, a single metric in-cluding both is the F1-score, defined as the harmonic mean of the two previous metrics: F 1 = 2 ppr+ r .
• The above discussion is based on a “binary” setting where a result can be either be part of the gold standard (e.g., be “relevant”) or not. In a more general setting, e.g., classification with more than two classes, two variants of the F1-score can be defined, respectively, are macro-average and micro-average. The macro-averaged F1-score is the average of the F1-score of all classes. The micro-averaged F1-score is the weighted average of the classes’ F1-scores, which takes into account the contribution of all classes.
• ROC curve (receiver operating characteristic curve) is a plot of the two metrics of a classi-fication model: recall on the vertical axis, and FPR (False Positive Rate) in the horizontal axis. FPR is calculated as FPR = False Positive / (False Positive + True Negative) where False Positive refers to the situation when a binary classifier incorrectly predicts the pos-itive class and True Negative refers to the case when a binary classifier correctly predicts the negative class. The curve is constructed from all pairs (recall; F P R) correspond-ing to all classification thresholds, i.e., a float value to compare with the binary model’s probability in order to classify an example as positive or negative class.
Area Under the ROC Curve (AUC) is the area below the ROC curve which indicates the probability that a random positive example could be ranked higher than a random negative example.
Pattern matching In [Levy et al., 2017], claims are extracted from Wikipedia using pattern matching with an automatically generated lexicon, and developing a ranking score on the matched sentences. We detail this below.
Firstly, the authors manually identify a list of debate topics (main concepts), denoted M C. Each of these topics is a concept, e.g., “doping in sport”, “boxing” etc. From Wikipedia, the authors extract a corpus of 1.86 millions sentences that contain these topics.
A sentence which has the structure “… that … MC …” is a candidate in which they search for a claim, e.g., “The president claimed that his government has reduced the unemployment rate to 5%.” The corpus is divided into two sets: c1 which contains the above structure and c2 which does not. They define a sentence suffix as the part of the sentence that follows a topic from M C. Then, for each word w in the vocabulary obtained from all sentences, they compute Psuff (c1jw) = n1=(n1 + n2), where n1 is the number of sentences in c1 that contain w in the sentence suffix, and n2 is the number of sentences in c2 that contain w. If Psuff (c1jw) > jc1j , then w belongs to the set of words called Claim Lexicon (LC).
To rank a sentence s that matches the pattern “… that … M C … cl …”, where cl is a word in CL (if there are more than one word from CL that appears in s after M C, the first one is selected), they take the average of two following scores:
• w2v: for each word w in s following the first ‘that’, they find the best cosine similar-ity between the word2vec representation of w and that of each word in M C, and then average the obtained scores;
• slop: this is number of tokens between the word ‘that’ and cl.
Using linguistic features
[Nakashole and Mitchell, 2014] aim to determine the truthfulness of a fact such as “Einstein died in Princeton” by analyzing the presence of language’s level of objectivity.
They use crowdsourcing to analyze the correlation between the objectivity of language and the trustworthiness of news articles. For each article, they ask workers to label it as subjective / objective and trustworthy / untrustworthy. An example of objective sentence is “Theories allege that Obama’s published birth certificate is a forgery, that his actual birthplace is not Hawaii but Kenya”. An example of subjective sentence is “Well, I think Obama was born in Kenya because his grandma who lives in Kenya said he was born there.”. On a dataset of 420 articles, they report that 74% of the untrustworthy articles are also found subjective. On the other hand, 64% of trustworthy articles were labeled as objective. Thus, they form a hypothesis objective text is more trustworthy than subjective one.
To train an objectivity detector, they crowdsource a labeled dataset and reuse another dataset from prior work on subjectivity detection [Pang and Lee, 2004]. The final dataset consists of 4,600 documents. They train a binary classifier with five sets of features: (1) subjectivity lexicon of strong and weak subjective words, (2) sentiment lexicon of positive and negative words, (3) Wikipedia-derived bias lexicon, (4) part-of-speech, (5) frequent bi-grams. They report a precision of 78.14%. The probability that the learned model assigns to the objective label is call objectivity of a document. The feature sets (1) and (2) are also adopted by [Nadeem et al., 2019] in a similar effort to analyze the language used in relevant documents of a claim. A new feature set Wiki-bias lexicon, bias cues and controversial words such as abortion and execute, are also introduced by [Nadeem et al., 2019].
A fact candidate is a piece of evidence connected to a claim. It can appear in one or more sources, where a source is a collection of documents such as the set of articles in a newspaper, or Wikipedia. A fact candidate is represented as a RDF triple (subject; verbal phrase; object). Using the objectivity of a document, they compute the two following scores for a fact candidate:
• objectivity score to ensure that a fact candidate mentioned in 10 sources with 0.9 objec-tivity should be given more faith than a fact candidate stated in 1 source of 0.9 objectivity.
• co-mention score to gives “boost” to the fact candidate that is co-mentioned in the same source(s) with another fact candidate that has high objectivity score.
Finally they compute a believability score for each fact candidate which is a weighted sum of the two previous mentioned scores. The believability score is used to determine whether a fact candidate is true or false. High accuracy scores are reported on three different datasets:
• KB dataset provided by [Carlson et al., 2010] consisting of 190 fact candidates about company acquisitions, book authors, movie directors, and athlete teams.
• Wikipedia dataset consisting of 54 fact candidates about US politicians. They created this dataset manually.
• General Knowledge Quiz dataset consisting of 18 fact candidates collected from a gen-eral knowledge quiz24.
Table of contents :
1.2 Contributions and outline
2.1 Resource Description Framework
2.2 Information extraction
2.2.1 Information extraction tasks
2.2.2 Machine learning for information extraction
2.2.3 Deep learning for information extraction
2.2.4 Metrics for evaluating information extraction quality
2.2.5 Text representation
3 State of the art of computational fact checking
3.1 Claim extraction
3.1.1 Unsupervised approaches
3.1.2 Supervised methods
3.2 Reference source search
3.3 Related datasets
3.4 Claim accuracy assessment
3.4.1 Using external sources
3.4.2 Using a knowledge graph
3.4.3 Using linguistic features
3.4.4 Using user input
3.5 Fact checking challenges
3.5.1 Fake news challenge
3.5.2 Fact Extraction and VERification
3.5.3 Check worthiness
3.6 Automated end-to-end fact checking systems
4 Extracting linked data from statistic spreadsheets
4.2 Reference statistic data
4.2.1 INSEE data sources
4.2.2 Conceptual data model
4.3 Spreadsheet data extraction
4.3.1 Data cell identification
184.108.40.206 The leftmost data location
220.127.116.11 Row signature
18.104.22.168 Collect additional data cells
4.3.2 Identification and extraction of header cells
22.214.171.124 The horizontal border
126.96.36.199 Cell borders
188.8.131.52 Collect header cells
4.3.3 Populating the data model
4.4 Linked data vocabulary
4.7 Related works
4.8 Conclusion and future works
5 Searching for truth in a database of statistics
5.2 Search problem and algorithm
5.2.1 Dataset search
5.2.2 Text processing
5.2.3 Word-dataset score
5.2.4 Relevance score function
184.108.40.206 Content-based relevance score function
220.127.116.11 Location-aware score components
18.104.22.168 Content- and location-aware relevance score
5.2.5 Data cell search
5.3.1 Datasets and queries
22.214.171.124 Evaluation metric
126.96.36.199 Parameter estimation and results
188.8.131.52 Running time
184.108.40.206 Comparison against baselines
5.3.3 Web application for online statistic search
5.5 Related works
5.6 Conclusion and future works
6 Statistical mentions from textual claims
6.2 Statistical claim extraction outline
6.3 Entity, relation and value extraction
6.3.1 Statistical entities
6.3.2 Relevant verbs and measurement units
6.3.3 Bootstrapping approach
6.3.4 Extraction rules
6.4.1 Evaluation of the extraction rules
6.4.2 Evaluation of the end-to-end system
6.6 Related works
6.7 Conclusion and future works
7 Topics exploration and classification
7.1 Corpus construction
7.2 Topic extraction
7.3 Topic classification
7.3.2 Model training