Deep learning for information extraction
We recall here basic terminology and the deep learning models most frequently referred to in the sequel of this manuscript.
A neuron is the basic unit (a node) of a neural network. A neuron receives as input a vector of numerical values {x_1, x_2, …, x_n}, a vector of weights {w_1, w_2, …, w_n} that reflects the importance of each x_i, and a bias b. The output Y of the neuron is computed by a non-linear activation function f: Y = f(∑_{i=1}^{n} w_i x_i + b). The purpose of f is to let the network learn non-linear representations of the input data.
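The computation of a single neuron can be sketched as follows; the logistic sigmoid is used here as one possible choice for the non-linear activation f (the text does not fix a particular one):

```python
import math

def neuron(xs, ws, b):
    """Single neuron: weighted sum of the inputs plus a bias,
    passed through a non-linear activation (here the sigmoid)."""
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation f

# Two inputs, their weights, and a bias: z = 0.5*1.0 - 0.25*2.0 + 0.1 = 0.1
y = neuron([1.0, 2.0], [0.5, -0.25], 0.1)
```

The sigmoid squashes the weighted sum into (0, 1); other common choices for f are tanh and ReLU.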
The simplest neural network is the feed-forward neural network, which consists of multiple layers. Each layer is a collection of neurons, and edges connect neurons from two adjacent layers; each edge carries a weight, as discussed above. Layers fall into three types: the input layer, which represents the input data; the hidden layers, which transform the data from the input layer towards the output layer; and the output layer, which represents the expected output. A feed-forward neural network with more than one hidden layer is called a multi-layer perceptron. More sophisticated architectures exist. Convolutional neural networks (CNNs) [LeCun and Bengio, 1998] were designed to classify images: a CNN uses convolutional layers to learn representations of local regions of the input image (e.g., a pixel and its eight surrounding pixels). Each convolutional filter learns a specific visual feature, and the extracted features are sent to the output layer for the classification task. To handle text data, a local region is taken to be a window of n contiguous words, so that each convolutional filter learns a specific linguistic feature. CNNs have been applied to text classification [Kim, 2014], relation extraction [Zeng et al., 2014], etc.
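A minimal sketch of the forward pass of such a feed-forward network, with randomly initialized weights (the layer sizes and the ReLU activation are illustrative assumptions, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """A common non-linear activation: max(0, z) element-wise."""
    return np.maximum(0.0, z)

def feed_forward(x, layers):
    """Propagate input x through a list of (weights, bias) layers.
    Each hidden layer computes relu(W x + b); the output layer
    returns its raw scores."""
    *hidden, (W_out, b_out) = layers
    for W, b in hidden:
        x = relu(W @ x + b)
    return W_out @ x + b_out

# A multi-layer perceptron-style network: 4-dim input,
# one hidden layer of 8 neurons, 3 output scores.
layers = [(rng.normal(size=(8, 4)), rng.normal(size=8)),
          (rng.normal(size=(3, 8)), rng.normal(size=3))]
scores = feed_forward(rng.normal(size=4), layers)
```

Training would adjust the weights and biases by backpropagation; only the forward computation is shown here.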
Recurrent neural networks (RNNs) are designed to process sequence data, e.g., text as a sequence of words. In a standard feed-forward neural network, neurons in a hidden layer are only connected to neurons of the previous layer. An RNN adds recurrent connections within the hidden layer, so that the hidden state at one position in the sequence also depends on the hidden state at the previous position. This modification lets the network learn from the history of the sequence, encoded in the hidden layer. RNNs have proved effective on many tasks, such as text classification, time series prediction, etc. In practice, RNN variants such as Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] and Bidirectional LSTM (Bi-LSTM) [Schuster and Paliwal, 1997] are more popular than the vanilla RNN.
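The recurrence can be sketched as a vanilla RNN forward pass: each hidden state h_t is computed from the current input x_t and the previous hidden state h_{t-1} (the tanh activation and the dimensions below are illustrative assumptions):

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Vanilla RNN over a sequence: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b).
    Returns the hidden state at every position."""
    h = np.zeros(W_hh.shape[0])  # initial hidden state h_0
    states = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return states

rng = np.random.default_rng(1)
seq = [rng.normal(size=5) for _ in range(4)]   # 4 word vectors of dim 5
W_xh = rng.normal(size=(3, 5))                 # input-to-hidden weights
W_hh = rng.normal(size=(3, 3))                 # hidden-to-hidden (recurrent) weights
hs = rnn_forward(seq, W_xh, W_hh, np.zeros(3))
```

The recurrent weight matrix W_hh is what distinguishes the RNN from a feed-forward network: it carries information from one sequence position to the next.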
Metrics for evaluating information extraction quality
It is often necessary to evaluate the quality of an information extraction process, in order to get a quantitative grasp of the trust that can be put in its output.
To evaluate the quality, a gold standard of answers considered correct (typically provided by humans) is usually assumed available. The gold standard is in some cases a set, e.g., the objects which are sure to belong to a certain class (in a classification problem), or the set of all mentions of humans which an information extraction algorithm must identify in a text. In other cases, the gold standard is a list, for instance, when the problem is to return a ranked list of answers from the most relevant to the least relevant, and a ranked list of relevant answers is specified by a human. Based on such a gold standard, for a given information extraction method which returns a certain set of answers to a given task, the most popular metrics are:
• Precision, denoted p, is the fraction of the returned results that are part of the gold standard. Precision can be seen as reflecting the correctness of a method, i.e., how many of the returned results are correct.
• Recall, denoted r, is the fraction of the gold standard that is part of the returned results. Recall can be seen as reflecting the completeness of the method.
• There is a natural tension between precision and recall: returning more results cannot decrease recall, but it can decrease precision, and conversely. Thus, a single metric combining both is the F1-score, defined as the harmonic mean of the two previous metrics: F1 = 2pr / (p + r).
• The above discussion assumes a “binary” setting, where a result either is part of the gold standard (e.g., is “relevant”) or is not. In a more general setting, e.g., classification with more than two classes, two variants of the F1-score can be defined: the macro-average and the micro-average. The macro-averaged F1-score is the unweighted average of the per-class F1-scores. The micro-averaged F1-score is computed by pooling the true positives, false positives and false negatives of all classes before computing a single score, so that each class contributes in proportion to its number of instances.
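The binary-setting metrics above can be sketched directly from their set definitions:

```python
def precision_recall_f1(returned, gold):
    """Compute p, r and F1 for a set of returned results against
    a gold standard set (the binary setting described above)."""
    returned, gold = set(returned), set(gold)
    tp = len(returned & gold)                 # correct returned results
    p = tp / len(returned) if returned else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# 2 of the 4 returned results are correct (p = 0.5);
# 2 of the 3 gold answers were found (r ≈ 0.667); F1 ≈ 0.571.
p, r, f1 = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
```

Note how the harmonic mean pulls F1 towards the lower of the two values, penalizing methods that trade one metric entirely for the other.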
Reference source search
For a given claim, a fact-checking system searches for relevant reference information among sources of several kinds. Below, we list the main categories of reference sources and outline how the data most relevant for a given claim is found.
• Search engines such as Google or Bing. The claim can be issued directly to the search engine [Zhi et al., 2017], or converted into a search query by retaining only its verbs, nouns and adjectives [Nadeem et al., 2019, Karadzhov et al., 2017]. Named entities, e.g., locations, person names, etc., can also be added to the query issued to the search engine [Karadzhov et al., 2017, Wang et al., 2018].
• Knowledge bases such as DBpedia [Lehmann et al., 2015] and SemMedDB [Kilicoglu et al., 2012] can be leveraged to find the most probable paths in the knowledge base that connect the subject and the object of a claim given in the triple format (subject, predicate, object) [Shi and Weninger, 2016]. Evidence facts related to a given claim can also be extracted from knowledge bases [Ahmadi et al., 2019].
• Wikipedia pages can be used to support or refute a given claim [Thorne et al., 2018]. A subset of sentences from these pages can also be retrieved to give specific evidence explaining the system’s decision.
• Previously fact-checked claims can be compared with the given claim to find out whether a fact check for this claim already exists [Hassan et al., 2017, Lotan et al., 2013]. Such a comparison can be made based on a text similarity measure between the claim and the previously fact-checked claims.
• Social media content has been used as background (reference) information in [Goasdoué et al., 2013]: social media content is archived, then person names are used as search terms in order to identify the posts from a given actor.
• Table cells could be aligned with textual mentions of quantities in [Ibrahim et al., 2019].
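Two of the steps above (turning a claim into a keyword query, and matching it against previously fact-checked claims) can be sketched as follows. This is a toy stand-in: real systems use part-of-speech tagging to keep verbs, nouns and adjectives, which is crudely approximated here by a stop-word list, and Jaccard overlap is just one possible text similarity measure:

```python
# Hypothetical stop-word list approximating POS-based filtering.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "was", "were",
              "has", "have", "than", "by", "to", "and", "or"}

def to_query(claim):
    """Keep only content words of a claim, to form a search query."""
    return [w for w in claim.lower().split() if w not in STOP_WORDS]

def jaccard(claim_a, claim_b):
    """Word-set overlap between two claims, in [0, 1]."""
    a, b = set(to_query(claim_a)), set(to_query(claim_b))
    return len(a & b) / len(a | b) if a | b else 0.0

def most_similar(claim, fact_checked):
    """Return the previously fact-checked claim closest to the input."""
    return max(fact_checked, key=lambda c: jaccard(claim, c))

checked = ["unemployment rose by 3 percent in 2019",
           "the moon landing was faked"]
match = most_similar("unemployment in 2019 rose sharply", checked)
```

Production systems would replace the word-overlap measure with stronger similarities (e.g., embedding-based ones) and add named entities to the query, as described above.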
Table of contents:
1.2 Contributions and outline
2.1 Resource Description Framework
2.2 Information extraction
2.2.1 Information extraction tasks
2.2.2 Machine learning for information extraction
2.2.3 Deep learning for information extraction
2.2.4 Metrics for evaluating information extraction quality
2.2.5 Text representation
3 State of the art of computational fact checking
3.1 Claim extraction
3.1.1 Unsupervised approaches
3.1.2 Supervised methods
3.2 Reference source search
3.3 Related datasets
3.4 Claim accuracy assessment
3.4.1 Using external sources
3.4.2 Using a knowledge graph
3.4.3 Using linguistic features
3.4.4 Using user input
3.5 Fact checking challenges
3.5.1 Fake news challenge
3.5.2 Fact Extraction and VERification
3.5.3 Check worthiness
3.6 Automated end-to-end fact checking systems
4 Extracting linked data from statistic spreadsheets
4.2 Reference statistic data
4.2.1 INSEE data sources
4.2.2 Conceptual data model
4.3 Spreadsheet data extraction
4.3.1 Data cell identification
4.3.1.1 The leftmost data location
4.3.1.2 Row signature
4.3.1.3 Collect additional data cells
4.3.2 Identification and extraction of header cells
4.3.2.1 The horizontal border
4.3.2.2 Cell borders
4.3.2.3 Collect header cells
4.3.3 Populating the data model
4.4 Linked data vocabulary
4.7 Related works
4.8 Conclusion and future works
5 Searching for truth in a database of statistics
5.2 Search problem and algorithm
5.2.1 Dataset search
5.2.2 Text processing
5.2.3 Word-dataset score
5.2.4 Relevance score function
5.2.4.1 Content-based relevance score function
5.2.4.2 Location-aware score components
5.2.4.3 Content- and location-aware relevance score
5.2.5 Data cell search
5.3.1 Datasets and queries
5.3.2.1 Evaluation metric
5.3.2.2 Parameter estimation and results
5.3.2.3 Running time
5.3.2.4 Comparison against baselines
5.3.3 Web application for online statistic search
5.5 Related works
5.6 Conclusion and future works
6 Statistical mentions from textual claims
6.2 Statistical claim extraction outline
6.3 Entity, relation and value extraction
6.3.1 Statistical entities
6.3.2 Relevant verbs and measurement units
6.3.3 Bootstrapping approach
6.3.4 Extraction rules
6.4.1 Evaluation of the extraction rules
6.4.2 Evaluation of the end-to-end system
6.6 Related works
6.7 Conclusion and future works
7 Topics exploration and classification
7.1 Corpus construction
7.2 Topic extraction
7.3 Topic classification
7.3.2 Model training