TKB software architecture


Preliminary research

The LaTeX extraction script

As there is no publicly available dataset of PDF articles and their theoretical results, the first step in this project is to create one. Thankfully arXiv, a database of more than 1M open-access research papers, has an API to download article PDFs and their sources. The source files that we are interested in are LaTeX sources because they carry a lot of semantic content. LaTeX is a document preparation system and markup language designed to facilitate the writing of textual documents, letting the user focus on semantic content by taking care of formatting considerations. Theoretical results are often marked by dedicated environments such as \begin{theorem}…\end{theorem}, which tell LaTeX that special formatting should be used there. These environments are defined using the \newtheorem command. This also means that one could use these markers to identify where theorems are in a given paper. During Daria’s project, a LaTeX plugin was developed to extract all results from a given source.
It works by re-defining the \newtheorem command, adding a hook to output each result on its own page. As the plugin re-defines the \newtheorem command, it needs to be applied before the command is used but after its definition. That list of extracted results can then be used for machine learning purposes. However, this method is somewhat unwieldy because it cannot be used to quickly determine whether a word in the article is part of a result. To generate a sequence of labeled tokens, the PDF must be read through in order while sequentially matching content with the list of extracted results.
As an alternative to this ad-hoc algorithm, I suggested changing the LaTeX script to generate custom PDF links while keeping the original layout. The output of the LaTeX extraction step became a set of labeled bounding boxes highlighting the results of the article. Combined with a spatial index, these boxes allow a fast lookup of each token’s kind: result or not.
This method was applied to 6000 articles downloaded from arXiv in the field of complexity theory. Some losses occurred: for example, 258 papers are not written in LaTeX. Compilation errors arose when the extraction script was included in the wrong place (for example after result environments had been defined). And even if there is a de facto standard way of describing results in LaTeX, there is a multitude of small variations that make it hard to identify all of them. Theorems that have a special name, are written in another language, or are not defined using \newtheorem are not extracted by this method. This is not that problematic as the goal is to have a large enough dataset to train machine learning models, but it can introduce a bias as the extracted dataset is not representative of the real paper distribution. For example, books (which are behind paywalls) and old publications (which fail to be extracted using the LaTeX method) are underrepresented in this dataset.
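To make the role of the \newtheorem markers concrete, here is a minimal Python sketch. It is not the LaTeX plugin described above: it only scans a source string with a regular expression to list the environments declared with \newtheorem and to count their uses. The function names and the sample source are hypothetical, and titles containing nested braces are not handled.

```python
import re

# Matches \newtheorem{env}{Title}, \newtheorem*{env}{Title} and
# \newtheorem{env}[counter]{Title}; nested braces in the title are not handled.
NEWTHEOREM = re.compile(
    r"\\newtheorem\*?\s*\{(?P<env>[^{}]+)\}\s*(?:\[[^\]]*\])?\s*\{(?P<title>[^{}]+)\}"
)


def declared_environments(source: str) -> dict:
    """Map each environment declared with newtheorem to its printed title."""
    return {m.group("env"): m.group("title") for m in NEWTHEOREM.finditer(source)}


def count_results(source: str) -> dict:
    """Count the begin-markers of every declared environment."""
    counts = {}
    for env in declared_environments(source):
        pattern = r"\\begin\{" + re.escape(env) + r"\}"
        counts[env] = len(re.findall(pattern, source))
    return counts


# Hypothetical sample source, for illustration only.
sample = r"""
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\begin{theorem} A first statement. \end{theorem}
\begin{lemma} A helper statement. \end{lemma}
\begin{lemma} Another one. \end{lemma}
\textbf{Claim.} A result typeset by hand, invisible to this scan.
"""

print(declared_environments(sample))
print(count_results(sample))
```

The hand-typeset claim on the last line of the sample is not found, which illustrates the kind of loss discussed above for results that are not defined using \newtheorem.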

Conditional random fields for theorem extraction

Using the LaTeX extraction method, we obtain a dataset of PDF papers with custom annotations to identify the results. The next step is to convert this to a sequence of pairs of features and labels that will be fed into the machine learning system. To do that, the PDF is transformed into an XML file using PDFalto. This tool extracts PDF content in the ALTO format presented earlier and PDF annotations in a separate file. Then the Python lxml library is used to parse the XML content, producing a set of bounding boxes from the annotations and a sequence of tokens corresponding to the lines (or words, depending on the selected granularity) of the document. These tokens can then be associated with numeric and categorical features, computed from XML information. The initial features were the number of words, the average word length, the proportion of italic and bold symbols, the usage of a math font, and whether the first word is bold, capitalized, equal to “proof” or a heading (“theorem”, “definition”, …). I also added more geometric features such as font size, vertical space between the line and the previous line, and whether there is indentation or not.
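A few of the line-level features listed above can be sketched as follows. This is an illustration under stated assumptions, not the actual feature code: it uses the standard library's xml.etree.ElementTree instead of lxml, it reads a simplified ALTO-like snippet in which FONTSIZE sits directly on each <String>, and the heading list and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed heading words; the source only names "theorem" and "definition".
HEADINGS = {"theorem", "definition"}

# Simplified, hypothetical ALTO-like input.
ALTO_SNIPPET = """
<TextBlock>
  <TextLine>
    <String CONTENT="Theorem" FONTSIZE="10"/>
    <String CONTENT="1." FONTSIZE="10"/>
    <String CONTENT="Every" FONTSIZE="10"/>
    <String CONTENT="tree" FONTSIZE="10"/>
  </TextLine>
  <TextLine>
    <String CONTENT="Proof." FONTSIZE="10"/>
    <String CONTENT="By" FONTSIZE="10"/>
    <String CONTENT="induction." FONTSIZE="10"/>
  </TextLine>
</TextBlock>
"""


def line_features(line):
    """Compute a few numeric and categorical features for one <TextLine>."""
    words = [s.get("CONTENT", "") for s in line.iter("String")]
    # Normalize the first word so that "Proof." and "proof" compare equal.
    first = words[0].strip(".:").lower() if words else ""
    return {
        "n_words": len(words),
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0,
        "first_is_capitalized": bool(words) and words[0][:1].isupper(),
        "first_is_proof": first == "proof",
        "first_is_heading": first in HEADINGS,
    }


root = ET.fromstring(ALTO_SNIPPET.strip())
features = [line_features(line) for line in root.iter("TextLine")]
for row in features:
    print(row)
```

Each dictionary is one element of the feature sequence; pairing it with the label obtained from the annotation boxes gives one (features, label) pair for the sequence model.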
I focused on linear-chain Conditional Random Fields [8] (CRFs) because they are known to perform well on sequence labelling. CRFs form a class of graphical models: a CRF is described by a graph (V, E) that has the following properties:
• $V = X \cup Y$, X being the observations and Y the latent variables.
• Each $Y_i$, conditioned on X, depends only on its neighbors, thus following the Markov property: $p(Y_i \mid X, Y_j, j \neq i) = p(Y_i \mid X, Y_j, j \sim i)$ ($\sim$ being the graph neighbor relation).
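For the linear-chain case, where the neighbors of $Y_i$ are $Y_{i-1}$ and $Y_{i+1}$, the conditional distribution takes the standard form below. The feature functions $f_k$, their weights $\lambda_k$, the sequence length $n$ and the normalization term $Z(X)$ are notation introduced here for illustration; they do not appear in the text above.

```latex
\[
p(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(Y_{i-1}, Y_i, X, i) \Big),
\qquad
Z(X) = \sum_{Y'} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(Y'_{i-1}, Y'_i, X, i) \Big)
\]
```

Each $f_k$ looks at one pair of consecutive labels together with the whole observation sequence, which is what allows token features such as font size or first word to interact with label transitions such as theorem followed by proof.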

Choosing metrics

We have a dataset and a model, but metrics are now needed to compare our results. As this is a classification problem, the focus will be on precision and recall, combined into the F1 score. However, the dataset is prone to class imbalance, as non-results dominate results, so we focus on two metrics: the macro (unweighted) average and the weighted average of F1-scores.
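The difference between the two averages can be sketched in plain Python. The label names and the toy predictions below are made up for illustration; the weighted average uses each class's number of true samples as its weight.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return the F1 score of every label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true samples per class."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Imbalanced toy data: 8 non-result tokens, 2 result tokens, one result missed.
y_true = ["other"] * 8 + ["result"] * 2
y_pred = ["other"] * 8 + ["result", "other"]

print(f1_per_class(y_true, y_pred))
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

On this toy data the macro average is lower than the weighted average, because the missed result token hurts the minority class, which the macro average counts as much as the dominant one.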

Implementing CRFs on Grobid

After obtaining somewhat good preliminary results (0.90 F1-score on identifying theorems), I decided that taking advantage of the Grobid infrastructure would be a good idea to further improve results. To achieve this, the following tasks needed to be accomplished:
• Add a new word-based model to extract results, based on what was done for full-text content.
• Create and test new layout features.
• Generate an annotated dataset that Grobid can handle, using the LaTeX extraction tool.
Adding a new model. Model definitions are located in grobid-core/src/[…]/engines, with one file per model. Each model is a class that can generate features from a document, apply a tagger (a CRF model for instance) and output a TEI XML file. TEI (Text Encoding Initiative) is the main output format for Grobid: it’s a lossy format that describes the document at a higher level. Even if models are supposed to behave similarly, there is no code abstraction for this and a lot of content is redundant between files. Consequently, adding a new model would not benefit from the work that has already been done for the other models.
Creating new features. As feature generation is not shared between models, improving one model’s set of features won’t improve the others’. Moreover, even if the dataset is distributed in the GitHub repository, its format makes it hard to add new features: for each model, the training data is distributed as a set of pairs of input features and target TEI XML file. Updating features would therefore require finding the original papers and re-generating these features from them. To date, there is no way to automatically download the training dataset’s PDFs. Therefore, even if we only want to add new features, either the original articles need to be found on the internet or a new custom dataset has to be created. But creating a new dataset is time-consuming: for each PDF, a dummy TEI file needs to be generated, before annotating it with the target nodes. The flow of the document must not be changed, or else the matching procedure done by Grobid will fail.
Figure 3: The annotation process in Grobid. The annotator has to follow through the parsed document text in order to add the tags corresponding to each section.
Generate an annotated dataset. Given what has been explored above, generating an annotated dataset is probably the easiest task. The algorithm needs to take a list of PDFs as input, generate the non-annotated TEI file using Grobid, and then parse the TEI file and the PDF in parallel, using PDF annotations to figure out the label for each node and grouping neighbors under the same parent node.
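The last step of that algorithm, grouping neighbors under the same parent node, can be sketched as below. This is a simplified illustration that assumes neighbors are grouped when they carry the same label; the tag names are hypothetical and do not follow Grobid's TEI schema, and the labels are given directly instead of being read from PDF annotations.

```python
from itertools import groupby
import xml.etree.ElementTree as ET


def group_by_label(labeled_tokens, root_tag="body"):
    """Wrap each run of consecutive tokens sharing a label in one parent element."""
    root = ET.Element(root_tag)
    for label, run in groupby(labeled_tokens, key=lambda pair: pair[1]):
        parent = ET.SubElement(root, label)
        parent.text = " ".join(text for text, _ in run)
    return root


# Hypothetical (text, label) pairs in reading order.
tokens = [
    ("Theorem 1.", "theorem"),
    ("Every tree is bipartite.", "theorem"),
    ("Proof.", "proof"),
    ("By induction.", "proof"),
    ("We now turn to", "paragraph"),
]

tree = group_by_label(tokens)
print(ET.tostring(tree, encoding="unicode"))
```

Because groupby only merges consecutive items, the reading order of the document is preserved, which matters given that Grobid's matching procedure fails when the flow of the document changes.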
Grobid is very good software for production purposes: it is fast and provides very accurate results for most use cases. However, I found that developing models and features, and testing them on training data, was hard and error-prone. Even if Grobid could benefit from a results model as an extension, I found that integrating the work into Grobid would have taken too much time and would not have been flexible enough for future iterations. I therefore chose to develop new software, with a focus on flexibility and improved dataset management.


TKB: an annotated paper management system

The main work of this internship is the TKB software. It comprises a set of interfaces and libraries to manage data and machine learning algorithms. More precisely, TKB is a database of articles that can be annotated, be it manually or using previously trained machine learning models. It makes it easy to update these annotations, to design and manage machine learning algorithms and features in a flexible way, and to batch-process documents. The project began as a simple annotation tool that quickly became a dataset management system.

TKB software architecture

TKB is made of a central library and several modules that make use of this library.
lib: TKB-lib is the Python library managing the database. It introduces and manages the main concepts required in this project:
• Paper: a PDF document that can be annotated through annotation layers.
• Annotation layer: a set of labeled bounding boxes, labels being elements of a given annotation class. Annotation layers can be tagged.
• Tags: a set of annotation layers. Tags make dataset management easier, as they, for example, can be used to identify data provenance or virtually separate data in training and test sets.
• Annotation class: a set of possible labels and where they can appear. For starters, three classes were introduced:
– segmentation makes a coarse partitioning of document elements: [front, headnote, footnote, page, body, appendix, acknowledgements]
– header labels elements from the document header: [title]
– results identifies results in the document body and appendix: [lemma, theorem, proof, definition, …]
• Annotation extractor: a module that can automatically create an annotation layer for a given article. This module is usually a machine learning model, which can be trained using manually annotated samples. Extractors may depend on other annotation classes in order to take advantage of previously extracted information. This hasn’t been used yet, but we could imagine, for example, a results extractor that makes use of the location of specific items such as bibliographical references or formulas, identified by another extractor.
cli: A command-line interface manages bulk operations on the dataset. Training and applying models is done through this interface.
server: The API frontend exposes TKB-lib’s features through a REST API. It allows paper annotations management such as editing annotations or applying an extraction to create a new annotation layer.
web: The web interface offers a user-friendly way of editing annotations, which are displayed as bounding box overlays on the PDF document. One goal of the web interface is to make it easy to iterate over machine learning designs by visualizing how the algorithm behaves.
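The concepts managed by TKB-lib can be pictured with a few dataclasses. This is a hypothetical sketch of how they relate, not TKB-lib's actual classes: every class, field and method name below is an assumption, and only the labels explicitly listed above for the results class are included.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AnnotationClass:
    """A named set of allowed labels."""
    name: str
    labels: tuple


@dataclass
class BoundingBox:
    """A labeled rectangle on one page of the PDF."""
    page: int
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    label: str


@dataclass
class AnnotationLayer:
    """A set of labeled boxes for one annotation class, optionally tagged."""
    annotation_class: AnnotationClass
    boxes: list = field(default_factory=list)
    tags: set = field(default_factory=set)

    def add(self, box: BoundingBox) -> None:
        # Reject labels that do not belong to the layer's annotation class.
        if box.label not in self.annotation_class.labels:
            raise ValueError(f"unknown label {box.label!r}")
        self.boxes.append(box)


@dataclass
class Paper:
    """A PDF document with its annotation layers."""
    title: str
    pdf_path: str
    layers: list = field(default_factory=list)

    def layers_tagged(self, tag: str) -> list:
        return [layer for layer in self.layers if tag in layer.tags]


results = AnnotationClass("results", ("lemma", "theorem", "proof", "definition"))
layer = AnnotationLayer(results, tags={"train", "manual"})
layer.add(BoundingBox(page=1, min_x=50, min_y=100, max_x=500, max_y=180, label="theorem"))
paper = Paper("A sample paper", "papers/1/paper.pdf", layers=[layer])
print(len(paper.layers_tagged("train")), len(paper.layers_tagged("test")))
```

Here a tag is modeled as a string attached to layers, so selecting a training or test split amounts to filtering layers by tag.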

Data management and representation

The list of papers is stored in a SQLite database file. The title, PDF location and list of annotation layers of each paper are stored there. Annotation layer content is stored in separate files, in per-paper metadata directories. The list was initially stored as a simple Python dictionary; the move to SQLite was motivated by the need for access in a multi-process context. Locks and synchronization are therefore delegated to the SQL driver. SQLAlchemy is used to interface with the database. It is quite efficient and features an ORM (object-relational mapping), meaning that data is represented as natural Python objects that can be used (and possibly mutated) in the library.
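The storage layout described above can be sketched with the standard library's sqlite3 module. This deliberately replaces the SQLAlchemy ORM with plain SQL to stay dependency-free, and the table names, column names and file paths are hypothetical.

```python
import sqlite3

# In-memory database for the example; a real deployment would use a file.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE papers (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        pdf_path TEXT NOT NULL
    );
    CREATE TABLE annotation_layers (
        id INTEGER PRIMARY KEY,
        paper_id INTEGER NOT NULL REFERENCES papers(id),
        class_name TEXT NOT NULL,
        content_path TEXT NOT NULL  -- layer content lives in a separate file
    );
    """
)

# The connection used as a context manager commits on success, rolls back on error.
with conn:
    cursor = conn.execute(
        "INSERT INTO papers (title, pdf_path) VALUES (?, ?)",
        ("A sample paper", "papers/1/paper.pdf"),
    )
    paper_id = cursor.lastrowid
    conn.execute(
        "INSERT INTO annotation_layers (paper_id, class_name, content_path) "
        "VALUES (?, ?, ?)",
        (paper_id, "results", "papers/1/layers/results.pkl"),
    )

rows = conn.execute(
    "SELECT p.title, l.class_name, l.content_path "
    "FROM papers p JOIN annotation_layers l ON l.paper_id = p.id"
).fetchall()
print(rows)
```

Only the index of papers and layers goes through the database; the bounding boxes themselves stay in the file that content_path points to.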
Paper annotations form a list of bounding boxes, each bounding box containing its coordinates in the PDF document, its label, and optional data for the Web front-end to display. The pickle Python module is used to write this data structure to files. This is not the most efficient way of storing data but the usage is straightforward and this format could be used to export data to other platforms. As spatial operations over annotation layers are expected, a spatial index is populated when loading bounding boxes. This spatial index is used to perform low-cost lookups of bounding box intersections, which is useful to label every token in the document. The rtree library is used for this purpose; it’s a binding for libspatialindex, which implements the R*-tree data structure.
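The load-then-index-then-lookup flow can be sketched without external dependencies. Note the substitution: instead of the R*-tree provided by rtree, this sketch uses a much simpler uniform grid index, which plays the same role of narrowing down candidate boxes before the exact intersection test. The class name, the cell size, the dictionary layout of annotations and the "other" fallback label are all assumptions.

```python
import pickle
from collections import defaultdict


def overlaps(a, b):
    """True if boxes (min_x, min_y, max_x, max_y) a and b intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class GridIndex:
    """Uniform grid index: each box is registered in every cell it touches."""

    def __init__(self, cell_size=50.0):
        self.cell_size = cell_size
        self.cells = defaultdict(set)
        self.boxes = {}

    def _cells(self, box):
        c = self.cell_size
        for gx in range(int(box[0] // c), int(box[2] // c) + 1):
            for gy in range(int(box[1] // c), int(box[3] // c) + 1):
                yield gx, gy

    def insert(self, box_id, box):
        self.boxes[box_id] = box
        for cell in self._cells(box):
            self.cells[cell].add(box_id)

    def intersection(self, query):
        # Collect candidates from the touched cells, then test them exactly.
        candidates = set()
        for cell in self._cells(query):
            candidates |= self.cells.get(cell, set())
        return sorted(i for i in candidates if overlaps(self.boxes[i], query))


# Hypothetical annotation layer, written and read back with pickle.
annotations = [
    {"bbox": (50, 100, 500, 180), "label": "theorem"},
    {"bbox": (50, 200, 500, 400), "label": "proof"},
]
restored = pickle.loads(pickle.dumps(annotations))

# The index is populated when the boxes are loaded.
index = GridIndex()
for box_id, annotation in enumerate(restored):
    index.insert(box_id, annotation["bbox"])


def token_label(token_bbox):
    """Label of the first annotation intersecting the token, else 'other'."""
    hits = index.intersection(token_bbox)
    return restored[hits[0]]["label"] if hits else "other"


print(token_label((60, 210, 90, 222)), token_label((60, 450, 90, 462)))
```

A grid works well when boxes have similar sizes; an R*-tree adapts to boxes of very different sizes, which is one reason to prefer a library such as rtree in practice.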

Building features while keeping hierarchical information

Feature management is an area where I wanted more flexibility than Grobid offers. In our case, features are automatically extracted descriptors of XML tokens that can be fed into machine learning systems. As XML is a hierarchical format, I wanted to be able to easily create and compose features that live on different hierarchy levels. For example, featurizing the ALTO format can be done by designing <Page>-level, <TextBlock>-level, <TextLine>-level and <String>-level features and composing them together. Given an output hierarchy level, these features are joined and aggregated to produce one single feature table synchronized with the selected tokens. I also developed document-wise feature normalization and token-wise deltas, which are both useful preprocessing methods in the case of CRFs.
Example: let’s say we have this document and we want to generate <TextLine>-wise features.
<Page>
  <TextLine>
    <String CONTENT="I'm" FONTSIZE="12"/>
    <String CONTENT="an" FONTSIZE="10"/>
    <String CONTENT="example" FONTSIZE="10"/>
  </TextLine>
  <TextLine>
    <String CONTENT="the" FONTSIZE="10"/>
    <String CONTENT="end." FONTSIZE="10"/>
  </TextLine>
  <TextLine>
    <String CONTENT="footnote" FONTSIZE="8"/>
  </TextLine>
</Page>
First step: for each tag type, a set of features is generated and stored in a pandas DataFrame. To this end, several feature extractors work independently, each performing a linear pass over the whole document. In order not to repeat that costly phase, feature tables are cached to disk.
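The aggregation of <String>-level features to <TextLine> level, followed by document-wise normalization and token-wise deltas, can be sketched as below. This is an illustration under stated assumptions: plain lists of dictionaries stand in for pandas DataFrames, the choice of the mean as the aggregation function is mine, and the function and column names are hypothetical.

```python
import xml.etree.ElementTree as ET
from statistics import mean, pstdev

PAGE = """
<Page>
  <TextLine>
    <String CONTENT="I'm" FONTSIZE="12"/>
    <String CONTENT="an" FONTSIZE="10"/>
    <String CONTENT="example" FONTSIZE="10"/>
  </TextLine>
  <TextLine>
    <String CONTENT="the" FONTSIZE="10"/>
    <String CONTENT="end." FONTSIZE="10"/>
  </TextLine>
  <TextLine>
    <String CONTENT="footnote" FONTSIZE="8"/>
  </TextLine>
</Page>
"""


def line_table(page_xml):
    """Aggregate <String>-level features into one row per <TextLine>."""
    rows = []
    for line in ET.fromstring(page_xml.strip()).iter("TextLine"):
        sizes = [float(s.get("FONTSIZE")) for s in line.iter("String")]
        rows.append({"n_words": len(sizes), "font_size": mean(sizes)})
    return rows


def normalize(rows, key):
    """Document-wise normalization: zero mean, unit variance over all rows."""
    values = [row[key] for row in rows]
    mu, sigma = mean(values), pstdev(values)
    for row in rows:
        row[key + "_norm"] = (row[key] - mu) / sigma if sigma else 0.0


def add_delta(rows, key):
    """Token-wise delta: difference with the previous row (0 for the first)."""
    previous = None
    for row in rows:
        row[key + "_delta"] = 0.0 if previous is None else row[key] - previous
        previous = row[key]


table = line_table(PAGE)
normalize(table, "font_size")
add_delta(table, "font_size")
for row in table:
    print(row)
```

The delta column makes the drop in font size before the footnote line directly visible to the model, without it having to compare absolute sizes across documents.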

Table of contents :

1 Introduction 
1.1 State of the art
1.2 Internship overview
2 Preliminary research 
2.1 The LaTeX extraction script
2.2 Conditional random fields for theorem extraction
2.3 Choosing metrics
2.4 Implementing CRFs on Grobid
3 TKB: an annotated paper management system 
3.1 TKB software architecture
3.2 Data management and representation
3.3 Building features while keeping hierarchical information
3.4 User interface
4 Models, features and extractors 
4.1 Building features
4.2 Trainable extractors
4.3 Extracting LaTeX metadata
4.4 A naive algorithm for result extraction
4.5 Special extractors: agreement and features
5 Results 
5.1 Segmentation class
5.2 Header model
5.3 Results
6 Conclusion 
A Experiments
A.1 Choosing the L1-regularization parameter in the case of CRFs

