The EPIQUE project
This thesis has been financed by the ANR EPIQUE project. The main goals of the EPIQUE project are:
• The definition and implementation of new, innovative tools for reconstructing and exploring multi-scale dynamics in complete real-world scientific corpora, and for obtaining new insights into the evolution of complex human-generated knowledge and information.
• The definition of a uniform framework for specifying, implementing and integrating large-scale text and graph mining tasks which can be customized independently of the higher-level mining algorithms with respect to specific cost models and hardware constraints (memory, CPU).
• The validation of classical hypotheses concerning the evolution of scientific fields and content, and the testing and improvement of these hypotheses in the light of the reconstructed phylomemies and of general patterns detectable within them. From the perspective of philosophy of science, EPIQUE should enable the empirical validation of theories on science evolution which have been formulated by considering only a few canonical texts. Preliminary results on small document collections covering particular scientific fields already demonstrate that phylomemetic graphs reveal novel semantic insights about science evolution. Complete scientific archives like the Web of Science can more reliably be seen as a plausible testimony of scientific activity, and taking the whole corpus into account not only covers more scientific fields but also enables a deeper understanding of inter-disciplinary evolution.
To achieve these goals, the EPIQUE project includes four academic partners:
• Computer Science Laboratory of Sorbonne Université (Lip6, project coordinator): The LIP6 Database research team has long experience in research on large-scale distributed data management, data integration, web data processing and data quality.
• Institute for History and Philosophy of Sciences and Techniques (IHPST): IHPST is the leading French laboratory in philosophy of science and has produced works on the epistemology of computer simulation and the epistemology of big data science that are relevant to the current project.
• Complex Systems Institute of Paris Ile-de-France (ISC-PIF): ISC-PIF is an interdisciplinary research and training center of CNRS that promotes the development of French, European and international strategic projects on complex adaptive systems. ISC-PIF helps its partners to pool their resources for large corpora mining and analysis, and is in charge of several technological platforms in the domains of high performance computing, big data, digital humanities and visualisation.
• Research Institute of Computer Science and Random Systems (IRISA Rennes): The IRISA DRUID research team is specialized in data modeling, data protection and crowdsourcing.
This thesis mainly concerns the goal of defining and implementing the general EPIQUE framework for generating and analyzing phylomemetic networks. This framework is based on scalable text and graph mining algorithms and is integrated into a workflow composed of four main steps (Figure 1.2):
1. The term extraction step applies preprocessing (term extraction, stop-word removal, stemming) and transforms each document into a set of weighted terms.
2. The transformed documents are then grouped according to periods predefined by the user.
3. The topic detection step consists first of detecting topics (sets of strongly semantically related terms) in the term sets. In the framework of the project, we consider different ways of extracting topics, in particular word embedding, word co-occurrence and probabilistic models (LDA).
4. Topics extracted from different time intervals are then compared to each other, for example using the Jaccard distance or the cosine distance, to generate the phylomemetic graph with alignments (split, merge, equivalent) representing their temporal evolution. The analysis and customization step of the phylomemetic graph allows experts to interact with the workflow by generating and visualizing phylomemetic trees, and by customizing the workflow through modifications of the data (e.g. removing or adding a term in a topic) and of the parameters (e.g. the time interval for splitting the document collection).
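As an illustration of the topic alignment step, the two similarity measures mentioned above can be sketched as follows (the topic vectors and the threshold are invented for illustration):

```python
from math import sqrt

def jaccard(a, b):
    """Jaccard similarity between two topic term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine similarity between two weighted term vectors (term -> weight)."""
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in set(u) | set(v))
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Two topics from consecutive periods (weights invented).
topic_p1 = {"database": 0.6, "query": 0.4}
topic_p2 = {"database": 0.5, "query": 0.3, "distributed": 0.2}

# An alignment edge is created when the similarity exceeds a threshold.
aligned = cosine(topic_p1, topic_p2) > 0.8
```

In the real workflow these comparisons are performed for every topic pair of adjacent periods, and the threshold controls the density of the resulting phylomemetic graph.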
Example 2 The evolution graphs in Figure 1.1 are obtained by applying a workflow to the subsets of documents covering each time period. This workflow includes a standard NLP document preprocessing step and applies the Latent Dirichlet Allocation (LDA) method to extract a predefined number of weighted term vectors (topics) describing the scientific publication activity for each period. Finally, topics from two subsequent periods are linked by applying cosine similarity.
To cope with archives of increasing size, the EPIQUE workflow is not a sequence of independent tasks over a given dataset, but an integrated framework which allows experts to interact with and control the whole process through high-level languages and interfaces (e.g. for specifying the scientific field and time range of interest).
This thesis addresses three main issues when building and exploring topic evolution networks:
Challenge 1: Scalability
Scalability is the first important challenge of our work. Most current science evolution workflows focus on domain-specific datasets: the dynamics of phylomemies have been studied on documents from the domain of embryology; Computer Science articles from DBLP have been analyzed to map the evolution of scientific fields; and lead-lag analyses have identified different topic evolution patterns for preprints and papers in astrophysics. The size of these datasets is limited. Our first challenge is to build global maps of the evolution of science over large scientific domains by applying appropriate scientometric models to large databases like the Web of Science, MEDLINE or open archives like arXiv. Making sense of unstructured text through generic data processing tasks (term extraction, stop-word removal, stemming, index term generation and term selection) is an ambitious goal, and these tasks become complex when dealing with very large amounts of digitized text. In this thesis we explore the development of new scalable solutions exploiting recent parallel data processing frameworks like Hadoop, Spark, and Pregel.
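The generic preprocessing tasks listed above can be sketched per document; in a Spark deployment a function like this would typically run as a map() transformation over a distributed collection of documents. The stop-word list and suffix rules below are invented placeholders:

```python
from collections import Counter

# Hypothetical stop-word list; a real pipeline uses a full language-specific list.
STOP_WORDS = {"the", "of", "and", "a", "in"}

def stem(term):
    # Naive suffix stripping; a real pipeline would use a proper stemmer.
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(document):
    """Turn one raw document into a bag of stemmed index terms."""
    terms = [w.lower() for w in document.split()]
    terms = [stem(t) for t in terms if t not in STOP_WORDS]
    return Counter(terms)

bag = preprocess("the evolution of evolving databases")
```

Because each document is processed independently, this step parallelizes trivially over a cluster, which is what makes corpora of millions of documents tractable.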
Challenge 2: Tuning and Quality
The second challenge concerns the quality of phylomemetic networks. The generation of topic evolution networks is a difficult task involving complex unsupervised text and graph mining algorithms. The workflow usually consists of several steps, each of which can be performed in different ways by choosing different algorithms and parameters. Building “meaningful” topic evolution networks is an iterative process: domain experts must choose the right algorithms and parameters for each step of the phylomemy generation workflow, especially for the topic extraction and topic alignment steps, and correctly tune method-specific hyper-parameters and thresholds with respect to a given dataset and an expected output. Each method depends on parameters which can significantly affect the quality of the topic sets and of the evolution graphs. Most of these parameters are dataset-specific and require empirical tuning. Besides, the tuning process also includes the threshold-based filtering of topic alignment edges to produce phylomemies at different levels of detail. To address this challenge, we explore different solutions to accelerate the graph generation workflow and to assist experts in choosing the optimal number of topics per period to produce highly diverse topic evolution networks.
Challenge 3: Interactive Exploration and Analysis
Figure 1.3 gives an example of a phylomemetic graph extracted from a subset of the arXiv corpus covering 20 years of publications. This corpus is split into 20 periods where each period contains 50 topics. If a corpus is extremely large and its content is quite diverse, we obtain a very large topic evolution network like the one in Figure 1.3, which is too complex to be analyzed visually. Existing graph visualisation standards and tools like Gephi or Graphviz can be used to generate high-quality visualisations, but their use for exploring large graphs and identifying meaningful evolution patterns is still limited. An important goal of our thesis is to allow experts to explore phylomemies interactively by filtering interesting subgraphs according to particular evolution patterns. For example, a user may want to find, in such a large and heterogeneous graph, all topics containing the term “database” and connected by a path to a future topic with the emerging term “deep learning”. This is practically impossible to achieve by exploring the visual representation without any filtering capabilities. Some users also want to spot interesting evolution patterns by their structure, for example, to find topics whose evolution patterns have a diameter greater than 5 and which evolve into more than 3 domains. But evidently, most of the patterns in Figure 1.3 are connected together, forming a huge component of the phylomemy and making it hard to identify important patterns. By adjusting the topic alignment thresholds, it is possible to reduce the complexity, but this also reduces the opportunity to detect interesting evolution patterns. To address this challenge, we introduce a new query language for filtering topics and topic sub-graphs according to their content and structure.
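The “database → deep learning” query described above is, at its core, a content filter combined with a reachability test. A minimal sketch on a toy graph (topic ids, terms and edges are all invented):

```python
from collections import deque

# Toy topic evolution network: topic id -> set of topic terms.
topics = {
    "t1": {"database", "query"},
    "t2": {"database", "learning"},
    "t3": {"deep learning", "neural"},
    "t4": {"biology"},
}
# Alignment edges point forward in time.
edges = {"t1": ["t2"], "t2": ["t3"], "t4": []}

def reaches(start, target_term):
    """BFS along alignment edges: does a future topic contain target_term?"""
    queue, seen = deque([start]), {start}
    while queue:
        node = queue.popleft()
        if target_term in topics[node]:
            return True
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

matches = [t for t, terms in topics.items()
           if "database" in terms and reaches(t, "deep learning")]
```

On the toy graph, both t1 and t2 match, while t4 is filtered out; on a real phylomemy the same traversal prunes thousands of irrelevant topics before any visualisation is attempted.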
Contributions and thesis outline
The main contributions of this thesis are listed as follows:
Topic Evolution Framework
Topic Evolution Model and Query Language We proposed a generic framework for the computation and interactive exploration of evolution networks. This framework includes a high-level data model using standard document and data processing technologies for extracting, storing and exploring meaningful topic evolution networks. The model relies on the notion of pivot topic graphs describing the content and the evolution dynamics of topics at different levels of detail. It also includes a high-level filter-based query language which enables users to interactively explore the evolution of topics by composing structural, temporal and semantic topic filters, and to specify their search criteria in a simple and sound way. These topic filters consist of structural conditions on evolution graph properties like average out- and in-degree, and temporal conditions on topic term trends like the emergence and decay of terms. This generic topic evolution framework allows users to interactively explore and analyze topic evolution networks (phylomemies). The extraction of meaningful evolution patterns from very large document archives through the framework will be introduced in Section 3.1 of Chapter 3, and the model for the interactive exploration of large topic evolution networks will be explained in Chapter 4. The presentation of the model also includes a formal analysis of the monotonicity properties of pivot filter expressions.
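The filter-composition idea can be illustrated with a minimal sketch; the actual pivot query language is much richer, and the topic representation and filter names below are invented for illustration:

```python
# Structural filter: at least k outgoing alignment edges.
def min_out_degree(k):
    return lambda topic: len(topic["successors"]) >= k

# Semantic filter: the topic contains a given term.
def contains_term(term):
    return lambda topic: term in topic["terms"]

# Conjunction of arbitrary filters.
def compose(*filters):
    return lambda topic: all(f(topic) for f in filters)

topic = {"terms": {"database", "query"}, "successors": ["t5", "t6"]}
query = compose(min_out_degree(2), contains_term("database"))
result = query(topic)
```

Representing filters as composable predicates is what makes it possible to reason about properties such as monotonicity of the composed expression, independently of any one filter.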
Diversity-based Topic Number Estimation In this model, we also defined a quality metric based on topic diversity which guarantees that a topic evolving toward many topics actually represents a significant evolution. We proposed a topic extraction method that automatically computes, for each period, a set of topics that meets a given diversity condition. This diversity-based measure for estimating the quality of a topic set addresses the challenge of topic number tuning and helps users to extract representative topics. The definition of this measure will be given in Section 3.2 of Chapter 3.
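As a toy stand-in for the metric defined in Section 3.2 (the thesis uses its own definition, not the one below), average pairwise dissimilarity already captures the intuition that a good topic set avoids redundant topics:

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def diversity(topic_set):
    """Average pairwise dissimilarity (1 - Jaccard) over all topic pairs."""
    pairs = list(combinations(topic_set, 2))
    return sum(1 - jaccard(a, b) for a, b in pairs) / len(pairs)

# Invented topic sets: one redundant, one well spread over themes.
redundant = [{"db", "sql"}, {"db", "sql"}, {"db", "query"}]
diverse = [{"db", "sql"}, {"gene", "cell"}, {"star", "galaxy"}]
```

A topic number estimation loop would then increase the number of extracted topics per period until the diversity of the resulting topic set satisfies the user-given condition.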
Prototype Implementation We implemented a scalable proof-of-concept prototype on top of Apache Spark for processing large scientific corpora containing millions of documents and finding meaningful topic evolution networks for both stable topics and highly evolving ones. The prototype, which includes a pivot graph generation interface (notebook) and an interface (notebook) for exploring and analyzing pivot graphs, is able to address all previous challenges and will be illustrated in Chapter 6.
Pivot graph computation
Two Topic Alignment Strategies To generate multi-stage topic evolution networks, we proposed two alignment strategies: the first, which has been integrated in our prototype, takes into account the similarities of all topic pairs; the second aligns topics only from consecutive periods and uses a nearest-neighbor condition to select candidate topics. Both strategies address the problem of computing a very large number of cosine-based topic alignments on top of Apache Spark. We also showed that Spark's native distributed cosine similarity computation can be improved for the nearest-neighbor strategy in our context of evolution graphs of scientific documents. A more detailed description is provided in Section 3.3 of Chapter 3.
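The contrast between the two strategies can be sketched on a toy two-period example (topic term sets invented, Jaccard used in place of the cosine scoring for brevity):

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

# Two periods, two topics each (invented term sets).
periods = [
    [{"db", "sql"}, {"ai", "logic"}],      # period 0
    [{"db", "nosql"}, {"ai", "neural"}],   # period 1
]

# Full strategy: score every cross-period topic pair (quadratic).
full = [(i, j, jaccard(a, b))
        for i, a in enumerate(periods[0])
        for j, b in enumerate(periods[1])]

# Nearest-neighbor strategy: keep, for each topic, only its best
# match in the immediately following period.
nearest = [max(((i, j, jaccard(a, b)) for j, b in enumerate(periods[1])),
               key=lambda e: e[2])
           for i, a in enumerate(periods[0])]
```

The nearest-neighbor variant produces far fewer candidate edges, which is precisely what makes it cheaper to distribute when the number of topics per period grows.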
Two Pivot Graph Computation Strategies To achieve a high level of interactivity, we first proposed an incremental join-based transitive closure algorithm for the materialization of pivot graphs and their graph properties, which computes and stores all possible aggregated values in advance. This kind of materialization is computation- and storage-intensive, but it can directly benefit from standard big data technologies to achieve scalability. Its main benefit is that even complex structural and temporal topic filters can be implemented by simple value-based selections on the generated topic properties. To avoid the many join operations of the join-based materialization strategy, we proposed an alternative pivot graph materialization strategy based on GraphX and the Bulk Synchronous Parallel paradigm. The objective of proposing these two pivot graph computation strategies is to address the scalability challenge. Both strategies will be discussed in Sections 5.1 and 5.2 of Chapter 5.
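The general technique behind the incremental join-based strategy is semi-naive transitive closure: at each iteration only the newly discovered pairs are joined with the base edge relation, instead of re-joining the whole closure. A self-contained sketch on an invented edge relation (in the prototype the join runs as a distributed relational operation):

```python
def transitive_closure(edges):
    """Semi-naive TC: only join newly discovered pairs at each iteration."""
    closure = set(edges)
    delta = set(edges)
    while delta:
        # Join the new pairs (delta) with the base edge relation.
        new = {(a, c) for (a, b) in delta for (b2, c) in edges if b == b2}
        delta = new - closure      # keep only genuinely new pairs
        closure |= delta
    return closure

edges = {(1, 2), (2, 3), (3, 4)}
tc = transitive_closure(edges)
```

Restricting each join to the delta is what makes the iteration count and the per-iteration work tractable on long topic evolution chains.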
Experimental Evaluation
We preprocessed several corpora of different scales, such as a small corpus on economics and the Glyphosate, arXiv, Nature, Wiley and Elsevier corpora, most of which come from the ISTEX platform, which provides online access to more than 23 million articles from all scientific disciplines. We then conducted a detailed experimental evaluation of the performance of the different EPIQUE workflow steps, including LDA generation, topic alignment, pivot graph computation, topic labeling and graph metrics computation, over four of the previously mentioned real-world datasets. This evaluation also includes experiments measuring the scalability of our pivot graph generation algorithm over larger synthetic topic evolution networks. These experiments show that each step of our framework is capable of efficiently handling large datasets (scientific document collections or evolution graphs). The detailed descriptions of these experiments are distributed across the thesis and will be given in Sections 3.2 and 3.3 of Chapter 3 and Section 5.3 of Chapter 5.
Topic modeling can be applied to solve various problems like document classification [34, 35, 36, 37], sentiment analysis [38, 39], topic discovery [22, 40] and image object localization [41, 42]. In our work, we apply topic models mainly for extracting a representative set of topics describing a collection of documents.
Most topic models are based on the assumption that groups of words describing a semantic concept (topic) will often occur together in semantically similar documents. Various methods have been applied to the topic modeling problem, such as Frequent Itemset Mining, Community Detection and Clique Percolation, Word Embedding and Statistical Topic Modeling. All these models rely on a preprocessing phase where each text document is first transformed into a structured (sequence, graph) or unstructured (set) representation of its terms. These representations are then analyzed taking into account the term co-occurrences within them. We will first give a short description of the different topic modeling approaches and then concentrate on the statistical topic model which we applied in our work.
Topic Modeling Approaches
• Frequent Itemset: A topic is a set of terms that are relevant to one or several documents and describe their contents. One hypothesis about topic extraction is to consider as a topic a set of terms that show up together in a certain number of documents. Frequent Itemset Mining consists in detecting sets of items (terms) that often appear together in the same transaction (document). It has been applied to market basket analysis, where it aims at finding regularities in the shopping behavior of customers of supermarkets, mail-order companies, on-line shops, etc. Figure 2.1 illustrates the use of Frequent Itemset mining for topic extraction. Documents can be considered as transactions, while the indexed terms of each document can be regarded as items. Then, from the example corpus, two topics can be extracted: one topic contains “influenza” and “fever”, and the other contains “cancer” and “tobacco”.
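A minimal sketch of this idea on a toy corpus modeled after the example (the documents below are invented): term pairs that co-occur in at least min_support documents become candidate topics.

```python
from itertools import combinations
from collections import Counter

# Toy transactions: each document is a set of index terms.
documents = [
    {"influenza", "fever", "cough"},
    {"influenza", "fever"},
    {"cancer", "tobacco"},
    {"cancer", "tobacco", "lung"},
]

def frequent_pairs(docs, min_support=2):
    """Count co-occurring term pairs and keep the frequent ones."""
    counts = Counter()
    for doc in docs:
        for pair in combinations(sorted(doc), 2):
            counts[pair] += 1
    return {pair for pair, c in counts.items() if c >= min_support}

topics = frequent_pairs(documents)
```

On this corpus the two surviving pairs correspond exactly to the two example topics; real miners such as Apriori or FP-Growth extend the same counting principle to itemsets of arbitrary size.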
• Clique Percolation: The Clique Percolation Method (CPM) is a popular approach for analyzing the overlapping community structure of networks. A network community is usually defined as a group of nodes that are more densely connected to each other than to the other nodes in the network. The clique percolation topic model considers terms as nodes and topics as communities in the network. One advantage of this approach is that clusters can overlap, so a given term can be present with various meanings in different topics.
One approach applies k-clique percolation for extracting overlapping topics (paradigmatic fields) from scientific documents and defines an asymmetric topic proximity metric to represent the hierarchical structure of scientific activity.
Another work proposes ACPM, an extension of CPM, to analyze the dynamics between topics and to identify topic clusters exhibiting an increase of collaborations that may lead to the emergence of new topics. The approach generates semantically enhanced topic networks by selecting all keywords from the publications in a given year that also appear as concepts in the CSO Computer Science Ontology.
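The core of CPM can be sketched in a few lines: enumerate the k-cliques of a term co-occurrence graph, then merge cliques that share k-1 nodes into communities (topics). This is a minimal single-pass sketch on an invented graph, not a full CPM implementation:

```python
from itertools import combinations

def k_clique_communities(edges, k=3):
    """Minimal k-clique percolation: find k-cliques, then merge cliques
    sharing k-1 nodes into communities."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    nodes = sorted(adj)
    cliques = [set(c) for c in combinations(nodes, k)
               if all(v in adj[u] for u, v in combinations(c, 2))]
    communities = []
    for clique in cliques:
        merged, rest = set(clique), []
        for com in communities:
            if len(merged & com) >= k - 1:
                merged |= com       # cliques percolate into one community
            else:
                rest.append(com)
        communities = rest + [merged]
    return communities

# Invented term co-occurrence graph with two overlapping triangles.
edges = [("gene", "cell"), ("cell", "protein"), ("gene", "protein"),
         ("protein", "enzyme"), ("cell", "enzyme")]
topics = k_clique_communities(edges)
```

The two triangles share the edge cell-protein (k-1 = 2 nodes), so they percolate into a single four-term topic; a term sitting in two non-adjacent cliques would instead end up in two overlapping topics.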
• Word Embedding: Word Embedding represents words as vectors (embeddings) such that words with a similar meaning are represented by similar vectors. The real-valued vector representation is learned from a corpus for a predefined fixed-size vocabulary. Vocabulary terms are mapped into a high-dimensional vector space such that terms frequently used in the same context appear close together in this space. The denser regions identified in this space can then be considered as scientific fields.
Our EPIQUE project partners from IRISA Rennes study the structure of scientific topics and their evolution using word embeddings. They trained a word embedding model on the Wiley corpus for a given time period and applied cosine similarity to compute the distance matrix of the data points. By applying a new hierarchical clustering method, they obtained a richly structured representation of the topic clusters extracted for the different periods and of their evolution.
• Statistical Topic Models: Statistical topic modeling is based on the distributional hypothesis that words that are close in meaning will occur in similar documents. The analysis of the relationships between a set of documents and a set of terms (document-term matrix) produces a fixed number of topics, the distribution of the topics over the documents (document-topic matrix) and the distribution of the terms over the topics (topic-term matrix). The goal of topic modeling is to uncover these distributions (latent variables) that shape the meaning of the documents and of the whole corpus.
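The relationship between the three matrices can be made concrete with invented numbers: each row of the document-topic and topic-term matrices is a probability distribution, and their product reconstructs a per-document distribution over terms.

```python
# Toy factorization (all values invented): 2 documents, 2 topics, 3 terms.
doc_topic = [[0.9, 0.1],   # document 0 is mostly topic 0
             [0.2, 0.8]]   # document 1 is mostly topic 1
topic_term = [[0.7, 0.3, 0.0],   # topic 0 favors terms 0 and 1
              [0.0, 0.1, 0.9]]   # topic 1 favors term 2

def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Each row of doc_term is again a distribution over the terms.
doc_term = matmul(doc_topic, topic_term)
```

Inference in a statistical topic model such as LDA works in the opposite direction: starting from the observed document-term counts, it estimates the two latent factor matrices.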
Table of contents:
1.1 Modeling Science Evolution
1.2 The EPIQUE project
1.3 Research Challenges
1.3.1 Challenge 1: Scalability
1.3.2 Challenge 2: Tuning and Quality
1.3.3 Challenge 3: Interactive Exploration and Analysis
1.4 Contributions and thesis outline
1.4.1 Topic Evolution Framework
1.4.2 Pivot graph computation
1.4.3 Experimental Evaluation
1.5 List of Publications
2 Related Work
2.1 Topic Modeling
2.1.1 Topic Modeling Approaches
2.1.2 Latent Topic Models
2.1.3 Dynamic Topic Modeling
2.2 Topic Evolution Models
2.2.1 Topic Trend Analysis
2.2.2 Topic Evolution Networks
3 Phylomemy Generation
3.1 Phylomemy Generation Workflow
3.1.1 Data Preprocessing
3.1.2 Corpus Periodization
3.1.3 Topic Extraction
3.1.4 Topic Evolution Alignment
3.2 Topic Extraction
3.2.1 Diversity Estimation
3.2.2 Experimental Evaluation
3.3 Topic Alignment
3.3.1 Full Matrix Alignment Computation
3.3.2 Nearest-Neighbor Alignment Computation
4 Pivot Graph Model and Query Language
4.1 Topic Evolution Model
4.2 Pivot Topic Graphs
4.3 Pivot Topic Functions
4.4 Pivot Topic Calculus
4.5 Pivot Topic Query Language
5 Pivot Graph Generation
5.1 Relational Pivot Graph Computation
5.1.1 Baseline Join-based TC computation
5.1.2 Smart Join-based TC Computation
5.1.3 Incremental Join-based TC Computation
5.2 BSP-based Pivot Graph Computation
5.2.1 The Pregel operator
5.2.2 The Pregel operator in GraphX
5.2.3 Future Pivot Graph Generation with Pregel
5.3 Experimental Evaluation
5.3.1 Comparison of Relational TC Computations
5.3.2 Parallelization and Scalability
5.3.3 Join-based versus Pregel-based TC Computation
6 Implementation and Prototype
6.1 Pivot Graph Generation and Query Evaluation
6.1.1 Pivot Graph Computation with Spark
6.1.2 Topic Label Computation
6.1.3 Pivot Graph Metrics Computation
6.1.4 Pivot Query Evaluation
6.2 EPIQUE Prototype
6.2.1 Prototype Architecture
6.2.2 The Topic Evolution Graph Generator Notebook
6.2.3 The Topic Evolution Graph Explorator Notebook
6.2.4 Some Pivot Graph Examples
7 Conclusion and Future Work
7.1 Main Contributions
7.1.1 Topic Evolution Network Computation
7.1.2 Pivot Graph Model and Query Language
7.1.3 Implementation and Optimization
7.2 Future Work
7.2.1 Generic Topic Evolution Workflow
7.2.2 Topic Evolution Model and Query Language