Challenges in describing and comparing EBS pipelines
For several reasons, EBS system pipelines and the methods they rely on are heterogeneously described in the literature, thereby making it difficult to precisely and formally compare them.
First, research devoted to text-mining methods for web-based health surveillance and the implementation of such methods in EBS systems are two parallel yet non-synchronous processes. Hence, it is not easy to know which method is actually integrated into an EBS system. Similarly, some implemented algorithms are not documented or fully described.
Second, there is a lack of consistency in the vocabulary used to describe the steps of the pipeline and the data flow. Indeed, the vocabulary of EBS systems lies at the crossroads between recent EI definitions and concepts and the well-established informatics jargon. We noted several discrepancies between the vocabulary describing EBS pipelines and WHO formal definitions of computational terms. For instance, the “data triage” step is referred to as “classification” in most EBS systems, as it usually relies on classification algorithms to categorise news articles. In HealthMap, however, the term “classification” is used in reference to the extraction of the location and disease describing an outbreak. In computational terms, this task corresponds to event extraction. In both HealthMap and PADI-web, the term “signal” refers to unverified data detected by the system (Arsevska et al., 2018; Brownstein et al., 2008), while in the formal EBS process, a signal corresponds to data considered relevant after the triage step (Section 1.3.2). As detailed later, the term “event” is particularly prone to multiple definitions, whether it is used in an epidemiological or a computational context.
Third, the EBS process is formally described as a unidirectional process, since it was modelled on the IBS scheme (Figure 2). However, the flow of data within EBS systems is not necessarily linear. For instance, filtering and validation steps can occur at different levels: a news article can be categorised as relevant, while some events extracted from its content will be considered irrelevant (e.g. a reference to an old outbreak).
Thus, in the following section we do not categorise text-mining methods based on the steps of the EBS process. Instead, we distinguish approaches dedicated to analysing news articles as a whole (document-based) from more fine-grained methods based on epidemiological entities (entity-based).
4 Approaches, stakes and limits of text-mining methods applied to online news monitoring

In this section, we describe how EBS systems rely on text-mining methods to automate or enhance part or all of their process. We exclude the ProMED system in the remainder of this section, since it involves a fully manual process. However, we refer to ProMED in several parts of the discussion to compare automated approaches to the fully expert-based paradigm. Besides, a comprehensive description of the ProMED pipeline is detailed elsewhere (Carrion and Madoff, 2017).
Importantly, all of the studied EBS systems, except for AquaticHealth.net and PADI-web, were primarily designed for public health surveillance. The methods and limits discussed hereafter are therefore not specific to animal health.
We divided the approaches into two levels, i.e. the document and entity levels. The term document refers to a news article, while the term entity refers to an epidemiological entity, as described in Section 2.3. The main features of each EBS system are summarised in Table 3, at the end of the section. The current pipeline of PADI-web is based on four steps (i.e. data collection, data processing, data classification, information extraction), detailed in Appendices B and C. The contributions of this thesis improve the last steps of PADI-web by proposing new information retrieval and event extraction methods.
Compared to IBS systems, which mostly passively receive information, EBS systems implement an active retrieval process. When automated, data retrieval is a critical step to ensure the sensitivity of the EBS system since it channels what will be further analysed and potentially detected as signals.
A commonly shared strategy to retrieve news articles from online sources involves the implementation of Really Simple Syndication (RSS) feeds. RSS feeds are easily customisable by the user and they can be used to automatically retrieve data from a broad range of sources:
• A specific online news source, e.g. AquaticHealth.net monitors websites dedicated to aquaculture, such as thefishsite.com;
• A range of online news sources through news aggregators, e.g. Google News for HealthMap and PADI-web; Al Bawaba and Factiva for GPHIN;
• Another EBS system, e.g. HealthMap, BioCaster and AquaticHealth collect ProMED alerts;
• An IBS system, e.g. HealthMap, Argus and AquaticHealth retrieve official OIE notifications through the WAHIS platform. BioCaster and Argus also collect WHO alerts.
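The collection step described above can be sketched with a minimal RSS parser. The feed content and field layout below are hypothetical (a standard RSS 2.0 skeleton), not the actual feed of any of the cited systems:

```python
import xml.etree.ElementTree as ET

# A minimal, made-up RSS 2.0 feed, standing in for a real news source.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Animal health news</title>
    <item>
      <title>Avian influenza outbreak reported</title>
      <link>https://example.com/articles/1</link>
      <pubDate>Mon, 06 Sep 2021 08:00:00 GMT</pubDate>
    </item>
    <item>
      <title>New swine fever cases confirmed</title>
      <link>https://example.com/articles/2</link>
      <pubDate>Mon, 06 Sep 2021 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def parse_rss_items(feed_xml):
    """Extract the title, link and publication date of each RSS <item>."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pub_date": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]

articles = parse_rss_items(SAMPLE_FEED)
```

In a deployed system, the same parsing logic would run periodically against each subscribed feed URL, with newly seen links queued for translation and triage.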
EBS systems are more heterogeneous than IBS systems regarding the range of sources they monitor, with a number of them retrieving validated events from official sources or from another EBS system. In such cases, the EBS systems act as web aggregators, i.e. they gather material from a variety of sources, and the detected signals are systematically validated and displayed.
An RSS feed consists of a query, i.e. a combination of keywords and boolean operators (e.g. “and”, “or”). There are two types of query, corresponding to two distinct strategies. Disease-based (specific) queries contain names of diseases (including scientific names) and pathogens, and aim at detecting timely information about a known disease. By contrast, non-specific queries do not contain any disease-specific terms, but instead host, symptom and outbreak-related keywords (e.g. “cases”, “spread”, etc.). These queries aim at detecting early signals of an unknown disease or a rare condition that is not yet known. They are implemented in HealthMap, Medisys, and PADI-web (Arsevska et al., 2017; Blench, 2008; Mantero et al., 2011).
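The two query strategies can be illustrated with a small helper that joins keywords with a boolean operator. The keyword lists are illustrative examples, not the actual query terms of any cited system:

```python
def build_query(terms, operator="OR"):
    """Combine keywords into a boolean query string; multi-word
    expressions are quoted so they are matched as phrases."""
    return f" {operator} ".join(f'"{t}"' if " " in t else t for t in terms)

# Disease-based (specific) query: disease and pathogen names.
disease_query = build_query(["avian influenza", "H5N1", "foot-and-mouth disease"])

# Non-specific (syndromic) query: host, symptom and outbreak-related terms.
syndromic_query = build_query(["cases", "outbreak", "spread", "die-off"])
```

A real feed would typically combine both kinds of clause, e.g. a disjunction of disease names conjoined (AND) with host or location terms to narrow the results.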
The terms used in the queries are either proposed by domain experts or trained analysts (Medisys and GPHIN), by the system users (AquaticHealth.net), or based on a medical ontology (BioCaster) (Collier et al., 2007; Mantero et al., 2011; Mykhalovskiy and Weir, 2006). A mixed approach was proposed to create the PADI-web queries, combining automatic extraction of terms from a corpus of relevant documents with validation by a consensus of domain experts (Arsevska et al., 2016). The data retrieval frequency differs sharply between systems, i.e. from every 15 min for the most reactive system (GPHIN) to every 24 h (PADI-web).
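The automatic extraction step of such a mixed approach can be sketched as a frequency contrast between a relevant corpus and a background corpus; the ratio score and the toy corpora below are illustrative simplifications, not the method of Arsevska et al. (2016), and the ranked candidates would still require expert validation:

```python
from collections import Counter

def candidate_terms(relevant_docs, background_docs, top_n=5):
    """Rank words over-represented in relevant documents relative to
    a background corpus, as candidate query terms for expert review."""
    rel = Counter(w for d in relevant_docs for w in d.lower().split())
    bkg = Counter(w for d in background_docs for w in d.lower().split())
    # Simple ratio score with add-one smoothing on the background counts.
    scores = {w: c / (bkg[w] + 1) for w, c in rel.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

top = candidate_terms(
    ["outbreak of avian influenza", "new outbreak confirmed in poultry"],
    ["stock market news today", "new phone released today"],
)
```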
Most EBS systems have implemented RSS feeds in several languages. As their pipelines are designed for texts in standard English, news articles are translated from their source language into English. This approach assumes that even a lossy translation can be sufficient for machine learning components (e.g. classifiers) trained on English data.
The HealthMap system has a translation scheme that uses the article link (i.e. its URL) to detect the original language automatically. For instance, if “trto=en&trf=zh” appears in the link, the source language is identified as Chinese. Non-English articles are displayed in their source language in the HealthMap interface, and users are provided a link to Google Translate to enable them to access the translated version.
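This URL-based detection amounts to reading the translation parameters out of the query string. The sketch below assumes the parameter names cited above (“trf” as the source language, “trto” as the target); the URL itself is a made-up example:

```python
from urllib.parse import urlparse, parse_qs

def detect_source_language(url):
    """Infer the source language from Google-Translate-style URL
    parameters: 'trf' = translate from, 'trto' = translate to."""
    params = parse_qs(urlparse(url).query)
    source = params.get("trf")
    return source[0] if source else None

lang = detect_source_language("https://example.com/article?trto=en&trf=zh")
```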
Both PADI-web and GPHIN rely on the Microsoft Azure Translate API, which was selected to detect the source language and translate the news content into English (Carter et al., 2020). Microsoft Azure is a neural machine translation system, whereby each word in a sentence is encoded as a 500-dimensional vector representing its unique characteristics within a specific language (Research, 2018).
Expert feedback suggests that automatic translations are somewhat noisy (Carter et al., 2020; Valentin et al., 2020c). Domain-specific terms, especially compound expressions, can result in erroneous translations. For instance, in PADI-web, the name of the African swine fever disease was translated into various incorrect forms. Translation errors can negatively impact the extraction of epidemiological information, e.g. by creating spurious matches with geographical names. To overcome these shortcomings and limit the time lag induced by automatic translation, Lejeune et al. (2015) proposed a language-independent approach that directly extracts information from the news in its native language. This sharply different approach, implemented in the so-called Data Analysis for Information Extraction in any Language (DAnIEL) system, handles language variations by processing text at the character level rather than at the word level (Lejeune et al., 2012). This allows the detection of subpatterns of location and disease names, such as “deng” in dengę and dengi, two Polish variants of dengue.
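The character-level subpattern idea can be reduced to a minimal sketch: collect every token that contains a given stem, so inflected variants are caught without any language-specific morphology. This is only an illustration of the principle, not the DAnIEL algorithm itself, and the Polish sentence is a made-up example:

```python
def find_subpattern(stem, text):
    """Return the word tokens containing a character-level subpattern,
    e.g. the stem 'deng' matches inflected variants of 'dengue'."""
    return [tok for tok in text.lower().split() if stem in tok]

# Illustrative Polish sentence with two inflected forms of 'denga' (dengue).
tokens = find_subpattern("deng", "Przypadki dengi rosną; denga dotarła do Europy")
```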
Document triage: de-duplication and related documents
De-duplication consists in eliminating redundant information. In the NLP field, this task is linked with the concept of text similarity: two documents with a similarity score above a predefined threshold are considered redundant. Text similarity encompasses two notions: (i) semantic similarity and (ii) lexical similarity. The lexical similarity between two documents relies on characters or lexical units that occur in both texts (Corley and Mihalcea, 2005). It is usually calculated in terms of edit distance (Guégan and Hernandez, 2006), lexical overlap (Jijkoun and de Rijke, 2005) or longest common substring. Another common approach derived from lexical overlap involves representing the two documents as vectors, where each dimension corresponds to a term (Gudivada et al., 2018). This representation relies on the segmentation of texts into words, referred to as “bag-of-words” (described in Section 2.2.1). The similarity between two documents can then be calculated with different metrics, such as the cosine of the angle between the vectors (cosine similarity) or the Jaccard coefficient (Gomaa and Fahmy, 2013).
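The two metrics mentioned above can be sketched over bag-of-words representations; the example sentences are illustrative:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the bag-of-words term vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def jaccard_similarity(doc_a, doc_b):
    """Size of the shared term set relative to the union of both sets."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

s1 = "avian influenza outbreak in poultry"
s2 = "avian influenza outbreak in wild birds"
cos_sim = cosine_similarity(s1, s2)
jac_sim = jaccard_similarity(s1, s2)
```

A de-duplication step would then flag the pair as redundant whenever one of these scores exceeds the predefined threshold.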
To a certain extent, lexical similarity methods are often used as a proxy for semantic similarity. However, they fail to capture semantic similarity even in trivial cases, e.g. when different terms describe the same concept. More sophisticated methods have thus been proposed: topological or knowledge-based methods, which rely on semantic and ontological relationships between words (e.g. polysemy, synonymy, etc.); corpus-based methods, which learn statistical similarity from data (e.g. latent semantic analysis); and recent word embedding representations (Billah Nagoudi et al., 2017; Majumder et al., 2016; Nguyen et al., 2019).