
## Adverse drug reactions detection

Improving Adverse Drug Reaction (ADR) detection is one of the promises carried by the increased volume and availability of observational data [Sta+10]. An ADR can be defined as a harmful event following a single dose or prolonged use of a drug. An ADR may or may not be related to the dose. Dose-related effects can occur at supra-therapeutic doses (toxic effects resulting from excessive dosage), at sub-therapeutic doses (hyper-susceptibility), or at standard therapeutic doses (collateral effects, such as an effect occurring in a non-targeted tissue) [AF03]. Dose relationships are hard to assess when using data from SNDS, as prescriptions are not known and drug packaging is standardized [Tup+17a]. Hence, this thesis focuses on the temporality of ADRs. Indeed, while some ADRs can be time-independent (e.g. digoxin toxicity caused by potassium depletion [AF03]), many of them might happen either at the first dose (e.g. anaphylaxis after first penicillin use) or with some delay of varying length. Delayed effects might occur at the time of drug withdrawal (e.g. opiates) or later on (e.g. carcinogenesis [AF03]). Individual susceptibility might heavily affect the occurrence and timing of ADRs [AF03]. Figure 5 represents examples of ADR risk patterns.

Historically, post-marketing surveillance has relied on spontaneous reports from physicians and consumers [Sch+16], and thus depends on human detection of adverse effects. Reports then trigger statistical confirmation studies, possibly using claims data, as in [Neu+12]. Unfortunately, relying on human detection has been shown to result in ADR under-reporting [Alv+98]. Indeed, when ADR events are rare, joining the dots might be very hard for a human observer. Data mining of LODs might complement human detection by screening a vast number of drug and reaction combinations to improve ADR detection.

**Modeling challenges**

Several methodological challenges specific to LODs such as SNDS complicate this task. Indeed, healthcare data is the result of three intertwined processes [Alb+18; Hag+14]:

(i) An epidemiological process, reflecting the physiology and pathophysiology of the observed patients.

(ii) A behavioral process related to the patients’ lifestyles and healthcare utilization habits.

(iii) An institutional process related to the structure and the operation of the health-care system.

As a result, studies using this data should consider the peculiarities described in the next paragraphs.

Missing information and coding errors. While SNDS is very rich, it lacks information that might be critical, depending on the study conducted. Typical examples are socioeconomic characteristics (income, marital status), lifestyle habits (smoking status, alcohol consumption, nutrition), examination results, test results, over-the-counter drugs, drugs delivered during hospital stays, prescriptions, and accurate drug dosages. SNDS has recorded the cause of death since 2018. The absence of such information might cause biases, depending on the statistical modeling strategy.

Besides, data can be inaccurate. While some errors can be random, healthcare billing systems might lead to systematic errors. For example, French hospitals are paid a flat price corresponding to a stay’s “main diagnosis.” Therefore, they are incentivized to code health events in a specific way to optimize both their revenue and practitioners’ time. The resulting recordings may thus conflict with the nominal definition of a concept [HA13]. These errors can be correlated with individual healthcare providers, depending on their coding policy and working habits. In the case of SNDS, this issue might influence hospital stay data. However, as outpatient care is automatically recorded, it should be less affected by such coding biases [Tup+17a].

Pathways. Specific pathways might influence the results of studies based on observational data. When the studied molecules are often prescribed in a given sequential order, it is hard to separate their individual influence on an event of interest [Hri+16].

Reverse dynamics. Healthcare data capture beneficiaries’ interactions with the healthcare system rather than a direct recording of their physiology, resulting in feedback loops and reversed dynamics [HA13]. Physiologically, diseases precede their symptoms, yet the data may record the symptoms (through exams or medical acts, for example) before the disease is actually identified [HAP11].

Not-at-random sampling. Beneficiaries’ events are recorded only when they interact with the healthcare system, i.e., data is only sampled when beneficiaries have health issues. As such, data sampling should not be considered random. In response, some studies impute data [Piv+14], use information missingness as a feature [Hag+14], or use flexible models, which is the approach developed in this thesis.

These issues might result in biases, the most pervasive in observational studies being indication bias. This bias occurs when an indication (e.g. fever) both prompts an exposure (e.g. paracetamol) and causes outcomes (e.g. asthma) [Aro+18]. Following this example, a study ignoring the fact that some viral infections causing fever increase the risk of developing asthma would wrongly associate asthma with paracetamol. Such biases are hard to avoid, especially when using SNDS, as drug prescriptions are not recorded. For now, the only solution is to tailor the studies to address each database’s peculiarities [Mad+14]. The approach developed in this thesis relies on careful phenotyping and study designs, flexible models, and cautious interpretation to derive useful insights on many aspects of the healthcare system and its beneficiaries. However, causal inference is hindered by many unobserved confounding variables and the impossibility of taking actions on healthcare policies for research purposes.
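The indication bias described in this paragraph can be made concrete with a small worked example. The counts below are purely hypothetical (they come neither from SNDS nor from any cited study): a viral infection makes paracetamol use more likely and raises the asthma risk, while paracetamol has no effect within either stratum.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds ratio of a 2x2 exposure-by-outcome table."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


# Hypothetical counts, stratified by the indication (viral infection).
# Each tuple: (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases)
strata = {
    "infection": (80, 720, 20, 180),    # asthma risk is 10% with or without the drug
    "no_infection": (2, 198, 8, 792),   # asthma risk is 1% with or without the drug
}

# Pooling the strata ignores the indication and mixes two different baseline risks.
crude_table = tuple(sum(cells) for cells in zip(*strata.values()))
crude_or = odds_ratio(*crude_table)
stratum_ors = {name: odds_ratio(*cells) for name, cells in strata.items()}

print(f"crude odds ratio: {crude_or:.2f}")
for name, value in stratum_ors.items():
    print(f"odds ratio within '{name}': {value:.2f}")
```

In this toy setting the pooled odds ratio is well above 1 although the odds ratio is exactly 1 within each stratum. The spurious association disappears only because the indication is observed here, which is generally not the case in SNDS.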

### Mathematical tools

This section briefly introduces several mathematical tools used in this thesis. The works presented here aim at estimating temporal associations between diverse healthcare events. Statistical learning methods were used to establish general patterns structuring the data under study. The ADR detection problem was formulated as a supervised learning problem, aiming to predict a specific longitudinal event based on other longitudinal health events. The modeling of temporal dynamics borrows concepts from point process theory and survival analysis. Estimating the resulting models’ parameters relies on sparsity-inducing penalties, proximal operators, and stochastic optimization.

Supervised learning. Given a training sample of annotated examples

$$\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\},$$

where $x_i \in \mathcal{X} \subset \mathbb{R}^d$ are $d$-dimensional input features and $y_i \in \mathcal{Y} \subset \mathbb{R}$ are values to be predicted, supervised learning consists in learning a prediction function $h \colon \mathcal{X} \to \mathcal{Y}$.

A goodness-of-fit function $f$ is used to measure how well a statistical model fits a dataset given the model parameters $\theta \in \mathbb{R}^p$. For example, the negative log-likelihood function of a model, the quadratic loss, or the cross-entropy loss might serve as goodness-of-fit functions depending on the supervised task. In this thesis, a data sample corresponds to the history of patient $i$. Data samples are assumed to be generated independently, so that the goodness-of-fit decomposes over samples,

$$f(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta),$$

where $f_i$ implicitly depends on the data sample $i$. Fitting the model to a dataset consists in solving the minimization problem

$$\min_{\theta \in \mathbb{R}^p} f(\theta).$$
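As an illustration of a goodness-of-fit function built from independent samples, the sketch below averages per-sample negative log-likelihoods. The logistic model, the function names, and the toy data are assumptions chosen for the example; they are not the model used in the thesis.

```python
import math


def sample_loss(theta, x, y):
    """Negative log-likelihood of one sample under a logistic model, with y in {0, 1}."""
    score = sum(t * v for t, v in zip(theta, x))
    # log(1 + exp(score)) - y * score, written in a numerically stable form.
    return max(score, 0.0) + math.log1p(math.exp(-abs(score))) - y * score


def goodness_of_fit(theta, samples):
    """Average of the per-sample losses over the whole dataset."""
    return sum(sample_loss(theta, x, y) for x, y in samples) / len(samples)


# Hypothetical toy dataset: (features, label) pairs.
samples = [([1.0, 0.5], 1), ([1.0, -1.2], 0), ([1.0, 2.0], 1), ([1.0, -0.3], 0)]

print(goodness_of_fit([0.0, 0.0], samples))  # uninformative parameters
print(goodness_of_fit([0.0, 1.0], samples))  # parameters aligned with the labels
```

Lower values indicate a better fit, so fitting the model amounts to searching for the parameters that minimize this average.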

Penalization and proximal operators. Statistical models considered in this thesis require the estimation of parameters $\theta \in \mathbb{R}^p$ given a goodness-of-fit function $f \colon \mathbb{R}^p \to \mathbb{R}$. When the number of parameters is large, there is a risk of model overfitting. In this case, the model is too closely adjusted to the training dataset and does not generalize well to other datasets. To prevent overfitting, the parameter space can be constrained using a sparsity-inducing norm $g \colon \mathbb{R}^p \to \mathbb{R}$, leading to an optimization problem of the form

$$\min_{\theta \in \mathbb{R}^p} f(\theta) + \lambda g(\theta), \tag{1}$$

where $\lambda > 0$ is the penalization strength. A cross-validation procedure selects the best performing penalization strength according to some performance metric. The function $f$ is assumed to be differentiable and $L$-smooth, i.e.

$$\|\nabla f(\theta) - \nabla f(\theta')\| \le L \|\theta - \theta'\|$$

for any $\theta, \theta' \in \mathbb{R}^p$, where $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^p$ and $L > 0$ is the Lipschitz constant. The penalty $g$ can be non-differentiable, but it is assumed to be prox-capable, in the sense that its proximal operator (see definition below) can be easily computed. A range of methods can be used to solve Equation (1). Techniques known as proximal methods are particularly well suited to such problems thanks to their good convergence rates and scalability [Bac+12].
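The proximal operator of a penalty is commonly defined as $\operatorname{prox}_{\lambda g}(v) = \arg\min_{u} \{ \tfrac{1}{2}\|u - v\|^2 + \lambda g(u) \}$; for the $\ell_1$ norm it reduces to coordinate-wise soft-thresholding. The sketch below is a minimal proximal gradient descent (often called ISTA) under assumptions that are not taken from the thesis: a least-squares goodness-of-fit, an $\ell_1$ penalty, a fixed step size derived from a crude upper bound on the Lipschitz constant, and hypothetical toy data.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |.| applied to a scalar."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def proximal_gradient(X, y, lam, n_iter=500):
    """Minimize (1 / 2n) * ||X theta - y||^2 + lam * ||theta||_1."""
    n, p = len(X), len(X[0])
    # trace(X^T X) / n upper-bounds the Lipschitz constant of the gradient.
    lipschitz = sum(v * v for row in X for v in row) / n
    theta = [0.0] * p
    for _ in range(n_iter):
        residuals = [sum(t * v for t, v in zip(theta, row)) - target
                     for row, target in zip(X, y)]
        gradient = [sum(r * row[j] for r, row in zip(residuals, X)) / n
                    for j in range(p)]
        # Gradient step on the smooth part, then proximal step on the penalty.
        theta = [soft_threshold(t - g / lipschitz, lam / lipschitz)
                 for t, g in zip(theta, gradient)]
    return theta


# Hypothetical toy data: the target depends strongly on the first feature only.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [2.0, 0.1, -2.0, -0.1]
theta_hat = proximal_gradient(X, y, lam=0.1)
print(theta_hat)
```

On this toy problem the penalty sets the weakly informative second coefficient exactly to zero, which is the sparsity effect sought when screening many candidate features.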

**Common methodologies**

Besides the mathematical model formulation and estimation, the design of ADR detection studies is crucial, even more so when using LODs. The Observational Medical Outcomes Partnership (OMOP) pioneered an alert system based on vast amounts of claims data [Rya+12; Rya+13b]. Their approach relied on designs and models usually employed in observational studies. These designs can be divided into four categories.

**Multiple groups designs**

Such designs compare subjects who experienced the adverse effect (cases) with subjects who did not (controls).

Cohort studies compare groups of patients selected according to their exposure to some risk factor. For example, the risk associated with a drug might be studied with a new-users cohort. In this case, patients newly exposed to the drug of interest (DoI) are compared with a comparator population. The comparator group might consist of patients exposed to a drug of a different pharmacologic class sharing the DoI’s indication, or of patients with a diagnosis for the DoI’s indication.

The Case-control method compares two population groups according to the occurrence of an adverse effect. The patients who experienced an adverse effect (the cases) are compared with the patients who did not (the controls). When performed on administrative databases, such designs are always nested within a cohort.

The different groups can be compared by estimating odds ratios using a logistic regression model predicting the target event from drug exposure. Odds ratios are said to be unadjusted when estimated with a univariate logistic regression predicting the target event from drug exposure alone. They are said to be adjusted when estimated using a multivariate logistic regression that controls for confounding variables. When using a Cox model (described in Section 2.2), the survival time and the incidence rates are estimated and compared between the patient groups.

These approaches are very sensitive to residual systematic differences between the studied groups. Thus, their performance heavily depends on measuring confounding factors or ensuring that the compared groups share similar characteristics (e.g. demographics, life habits, or existing diseases).

**Single group designs**

Such designs, called Self-controlled designs, include only cases, i.e. subjects who experienced the studied adverse event. They compare the periods during which subjects are considered to be at risk of experiencing the event (risk periods) with the periods during which the same subjects are not at risk (control or risk-free periods). As each included patient is known to have experienced the studied event, statistical models are conditional on the occurrence of this event. These approaches are not sensitive to observed or unobserved covariates that are constant over time. However, self-controlled designs remain sensitive to systematic differences between risk periods and control periods. A few single group designs are introduced below.

The Case-Crossover method compares, for each individual, a single risk period immediately preceding the adverse event to one or several control periods, always preceding the risk period. The lengths of these risk and control periods are the same for all individuals. The association between drug exposure and the adverse effect is measured through case-crossover odds ratios, defined as the rate of exposure during the risk period divided by the rate of exposure during control periods. These rates can be estimated using a conditional logistic regression or a conditional Poisson model [AGT14].
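In the simplest case-crossover setting, with one control period matched to one risk period per case, the conditional logistic regression estimate of the odds ratio reduces to a ratio of discordant cases. The sketch below illustrates this special case only; the function name and the toy data are hypothetical.

```python
def case_crossover_odds_ratio(cases):
    """Matched-pair odds ratio for a 1:1 case-crossover design.

    `cases` is an iterable of (exposed_in_risk_period, exposed_in_control_period)
    booleans, one pair per case. Concordant cases (exposed in both periods or
    in neither) carry no information and are ignored.
    """
    risk_only = sum(1 for risk, control in cases if risk and not control)
    control_only = sum(1 for risk, control in cases if control and not risk)
    if control_only == 0:
        raise ValueError("no case exposed in the control period only")
    return risk_only / control_only


# Hypothetical toy data: 6 cases exposed in the risk period only,
# 2 exposed in the control period only, and 3 concordant cases.
toy_cases = ([(True, False)] * 6 + [(False, True)] * 2
             + [(True, True)] * 2 + [(False, False)])
print(case_crossover_odds_ratio(toy_cases))
```

With several control periods per case, or with time-varying covariates, this closed form no longer applies and a conditional model has to be fitted.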

Similarly, the Self-Controlled Case Series (SCCS) design relies on case data. However, instead of relying on time periods common to all patients, risk and control periods are defined individually according to the information available during the whole observation period [FW06]. An observation period is defined according to assumptions often associated with the event of interest. Then, risk periods (or exposure periods) are characterized according to drug exposure times and a set of assumptions specific to the drugs or event under study. The control periods are defined as the periods when individuals are observed but not exposed. In contrast to case-crossover, this method is bi-directional, as it uses information from both the periods preceding and following the event time. The relative risk associated with exposure is estimated using a conditional Poisson model. The drug effect is then assessed by comparing the risk of the target event during exposure and control periods.
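To illustrate how a conditional Poisson model yields a relative risk in an SCCS design, the sketch below fits a deliberately simplified version: a single binary exposure, a constant baseline risk within each patient, and no age or calendar effects. Conditioning on each case’s total number of events removes the patient-specific baseline and leaves a multinomial likelihood, whose score is solved here by bisection. The function names and data are hypothetical, and this is not the model developed in the thesis.

```python
import math


def sccs_score(beta, cases):
    """Derivative of the conditional log-likelihood with respect to beta.

    Each case is (events_exposed, events_unexposed, time_exposed, time_unexposed).
    """
    total = 0.0
    for events_exposed, events_unexposed, time_exposed, time_unexposed in cases:
        n_events = events_exposed + events_unexposed
        weight = math.exp(beta) * time_exposed
        # Observed minus expected number of events falling in exposed time.
        total += events_exposed - n_events * weight / (time_unexposed + weight)
    return total


def fit_relative_incidence(cases, low=-20.0, high=20.0, tol=1e-10):
    """Maximize the conditional likelihood by bisection on the decreasing score.

    If no event (or every event) falls in exposed time, the estimate is not
    finite and the search stops at a bound of the interval.
    """
    while high - low > tol:
        mid = (low + high) / 2.0
        if sccs_score(mid, cases) > 0:
            low = mid
        else:
            high = mid
    return math.exp((low + high) / 2.0)


# Hypothetical cases observed for 365 days, 30 of which are exposed.
toy_cases = [(1, 0, 30.0, 335.0), (0, 1, 30.0, 335.0), (1, 1, 30.0, 335.0),
             (0, 1, 30.0, 335.0), (1, 0, 30.0, 335.0)]
print(fit_relative_incidence(toy_cases))
```

Because every toy case shares the same exposed and unexposed durations, the estimate coincides with the ratio of the event rates in exposed and unexposed time, which gives a quick sanity check of the fit.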

Self-controlled Cohort is a self-controlled design applied directly to a population as a whole, in contrast with the previous designs, which model individual patients’ trajectories [RSM13]. It estimates Incidence Rate Ratios (IRRs) as

$$\mathrm{IRR} = \frac{n_0 / t_0}{n_1 / t_1},$$

where $t_0$ (resp. $t_1$) is the length of post-exposure (resp. pre-exposure) risk periods, and $n_0$ (resp. $n_1$) is the number of adverse events observed during post-exposure (resp. pre-exposure) risk periods.
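The incidence rate ratio above is a plain ratio of two event rates. A minimal sketch, with hypothetical argument names and toy counts:

```python
def incidence_rate_ratio(events_post, time_post, events_pre, time_pre):
    """Ratio of the post-exposure event rate to the pre-exposure event rate.

    Times are total lengths of the risk periods, in any common unit.
    """
    if events_pre == 0 or time_post <= 0 or time_pre <= 0:
        raise ValueError("rates are undefined for these inputs")
    return (events_post / time_post) / (events_pre / time_pre)


# Hypothetical population-level counts over 1000 person-years on each side.
irr = incidence_rate_ratio(events_post=30, time_post=1000.0,
                           events_pre=10, time_pre=1000.0)
print(irr)
```

A ratio above 1 indicates that events are more frequent after exposure than before it, without by itself establishing that the drug causes them.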

**Hybrids**

Other approaches borrow ideas from the two design families previously described. Information Component Temporal Pattern Discovery (ICTPD) is a variant of self-controlled cohort comparing patients with themselves (self-control) and assessing the existence of systematic differences by comparing case time intervals with equivalent periods in a control group (case-control). This approach adds a comparator group to self-controlled designs to control systematic differences between the risk and control periods. However, this approach remains sensitive to systematic differences between risk and control periods unique to the case group [Nor+13].

**Others**

Disproportionality analysis directly compares the co-occurrences of drug-event pairs using $\chi^2$ tests to identify pairs that are reported together more often than expected [Mon+11].
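For a single drug-event pair, such a test can be computed from a 2x2 table of report counts. The sketch below uses the Pearson statistic without continuity correction; the counts are hypothetical, and the correction for testing many pairs at once is deliberately left out.

```python
def chi_square_2x2(drug_event, drug_no_event, no_drug_event, no_drug_no_event):
    """Pearson chi-square statistic (1 degree of freedom) of a 2x2 table."""
    a, b, c, d = drug_event, drug_no_event, no_drug_event, no_drug_no_event
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical report counts for one drug-event pair.
statistic = chi_square_2x2(20, 80, 100, 1800)
# 3.841 is the 5% critical value of the chi-square distribution with 1 d.o.f.
print(statistic, statistic > 3.841)
```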

Longitudinal Gamma Poisson Shrinker is based on a similar idea, adapted to longitudinal data [Sch11]. Instead of merely counting events or non-events occurring within or outside drug exposure periods, it considers the length of drug exposure and non-exposed time to detect disproportionality. This approach is combined with LEOPARD, an algorithm comparing drug prescription rates in a fixed window, before and after the occurrence of a target event. It allows us to detect false positives caused by protopathic bias, i.e. situations when a drug is prescribed to cure the target event or its early manifestation instead of causing it [Fai15].
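The before/after comparison behind this kind of protopathic bias check can be sketched as a one-sided binomial test: if the event did not affect prescribing, a prescription in the window would be equally likely to fall before or after the event. The test, the significance level, and the function names below are assumptions made for illustration; the text does not specify the exact procedure used by LEOPARD.

```python
from math import comb


def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X following a Binomial(trials, p) distribution."""
    return sum(comb(trials, k) * p ** k * (1 - p) ** (trials - k)
               for k in range(successes, trials + 1))


def flags_protopathic_bias(prescriptions_before, prescriptions_after, alpha=0.05):
    """Flag a drug-event pair when prescriptions are significantly more
    frequent after the event than before it within the fixed window."""
    total = prescriptions_before + prescriptions_after
    if total == 0:
        return False
    return binomial_upper_tail(prescriptions_after, total) < alpha


# Hypothetical prescription counts in equal-length windows around the event.
print(flags_protopathic_bias(5, 20))   # many more prescriptions after the event
print(flags_protopathic_bias(12, 13))  # balanced counts
```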

#### Comparison

OMOP developed benchmarks to evaluate these approaches’ performance on ADR detection when using claims databases [Rya+12; Rya+13b]. To perform these benchmarks, the researchers produced an ADR database containing drug and adverse event pairs. They estimated different combinations of models and designs, varying hypotheses and hyperparameters, to produce binary answers for each (drug, reaction) pair. These answers were compared to a database [Rya+13a] listing positive and negative associations between molecules and reactions. These benchmarks concluded that self-controlled designs perform better than case-control designs. The scarcity of demographic and individual habits data in claims databases may explain this conclusion, as it hinders control matching when using case-control designs. However, the evaluation method presented in [Rya+13b] has several shortcomings:

• Estimates are produced iteratively on drug and reaction pairs, which poses a high risk of obtaining estimates biased by unobserved confounding variables.

• Their ground truth ADR database has since been criticized for having misclassified some of the considered pairs [HAF16].

• The method used to choose between the many assumptions and hyperparameters is likely to overfit the ADR corpus, as they did not use an ADR testing set distinct from their training set.

**Selected approach**

Neither human detection of ADRs nor tailored risk quantification studies are scalable enough to perform large-scale ADR screening. Indeed, the latter require a tremendous amount of manual tuning to provide results; see for example [Neu+12].

Moreover, the LOD-specific issues raised in Section 2.1 show that a fully automated ADR detection system built on SNDS data would suffer from too many biases to be used effectively in practice [Mad+14]. However, the OMOP benchmarks described in Section 2.3 indicate that models combined with self-controlled designs might be robust enough to derive useful information from claims data when it comes to detecting ADRs. These observations motivated the development of a new model to improve ADR detection.

**Table of contents:**

**Introduction**

1 Use of large observational databases for research

1.1 Characterization of large observational databases in healthcare

1.2 Barriers to methodological research

1.3 Contribution: a framework for reproducible and fast data processing

**2 Adverse drug reactions detection**

2.1 Modeling challenges

2.2 Mathematical tools

2.3 Common methodologies

2.4 Selected approach

2.5 Contribution: Convolutional SCCS

2.6 Applications

2.7 Discussion

**3 Learning representations for health data**

3.1 Deep learning architectures for healthcare

3.2 Pre-training strategies

3.3 Contribution: attention and pre-training strategies comparison

3.4 Experiments

**I SCALPEL3: a scalable open-source library for healthcare claims databases**

I.1 Introduction

I.2 Background

I.3 Material and Methods

I.3.1 The SNDS database

I.3.2 SCALPEL3: a SCAlable Pipeline for hEaLth data

I.3.3 SCALPEL-Flattening: denormalization of the data

I.3.4 SCALPEL-Extraction: extraction of concepts

I.3.5 SCALPEL-Analysis: interactive manipulation and analysis of cohorts

I.4 Results

I.5 Discussion

I.6 Conclusion

I.7 Summary Table

I.8 Declarations of interest

I.9 Authors’ contribution

I.10 Acknowledgments

**Appendix**

I.A Scalpel Analysis usage examples

I.B List of SNDS databases currently denormalized

I.C List of available extractors

I.D List of the available transformers

**II ConvSCCS: convolutional self-controlled case series model for lagged adverse event detections**

II.1 Introduction

II.2 Self-controlled case series models

II.2.1 Conditional Poisson regression and SCCS models

II.2.2 Risk screening

II.3 ConvSCCS: an extension of SCCS models

II.3.1 Discrete convolutional SCCS

II.3.2 Penalised estimation

II.4 Experiments

II.4.1 Simulations

II.4.2 Application on data from the French national health insurance information system

II.5 Conclusion

Appendix

II.A Likelihood in SCCS models

II.B Discrete time SCCS

II.C Numerical implementation

II.D Software

II.E Simulations details

**III Screening anxiolytics, hypnotics, antidepressants and neuroleptics for bone fracture risk among elderly**

III.1 Introduction

III.2 Materials and methods

III.2.1 Data Source

III.2.2 Study design

III.2.3 Case definition

III.2.4 Exposure definition

III.2.5 Statistical Analysis

III.2.6 Sensitivity and subgroup analysis

III.2.7 Software

III.3 Results

III.3.1 All fractures

III.4 Discussion

III.4.1 Key results

III.4.2 Limitations

III.4.3 Interpretation

III.5 Conclusion

Appendix

III.A Codes

III.B Sensitivity analysis

III.C SCCS assumption assessment

**IV Attention and unsupervised pre-training for EHR**

IV.1 Introduction

IV.2 Methods

IV.2.1 Models architecture

IV.2.2 Unsupervised pre-training

IV.2.3 Supervised fine-tuning, losses and metrics

IV.2.4 Hyper-parameters and training details

IV.3 Experiments

IV.4 Results

IV.5 Conclusion

Appendix

IV.A Encoders

IV.A.1 Vanilla transformer

IV.A.2 Linear transformer

IV.A.3 Graph Attention Network

IV.B Unsupervised Pre-training Strategies

IV.B.1 Masked Language Model

IV.B.2 Triplet loss

IV.B.3 Contrastive Predictive Coding

A Résumé des contributions

A.1 Utilisation de bases de données observationnelles

A.1.1 Contribution : un logiciel d’extraction rapide et reproductible de concepts médicaux

A.2 Détection d’effets indésirables médicamenteux

A.2.1 Défis méthodologiques

A.2.2 Approche retenue

A.2.3 Contribution : Convolutional SCCS

A.2.4 Applications

A.2.5 Discussion

A.3 Apprentissage de représentations en santé

A.3.1 Apprentissage profond en santé

A.3.2 Stratégies de préentraînement

A.3.3 Contribution : comparaison de modèles d’attention et de méthodes de préentraînement

A.3.4 Expériences

List of Figures

List of Tables

**Bibliography**

List of acronyms

Index