About the Orpailleur team
The team was created in 1997 by Amedeo Napoli. He is no longer the team leader: now an emeritus researcher, he has handed the position over to Miguel Couceiro. The team is composed of about 23 members, including interns, 10 PhD students, 1 post-doc, 1 engineer, and 9 permanent researchers. Part of the team is located in Metz. Orpailleur conducts research on data mining and knowledge discovery in general. These fields are nevertheless close to others, such as artificial intelligence and natural language processing, which Orpailleur also studies. Orpailleur is French for “gold panner”, a way of saying that team members search for the gold in data.
If you look at the organization chart, you will notice that INRIA project teams are marked on the right. INRIA teams, however, are not necessarily composed exclusively of INRIA researchers. When I joined the Orpailleur team, it was indeed an INRIA project team⁴.
Internship and research contract
This end-of-study internship follows a research contract that started in September 2020 and that I chose to take on out of interest in research. This is when I met the team and became familiar with the laboratory and its wonderful restaurant, two days a week throughout the first semester. The other three days of the week were reserved for courses at TELECOM Nancy. The second semester is dedicated to the full-time internship, which started on March 1 and will end on August 31.
In September, the initial plan was for me to work on molecule classification and prediction, a project supervised by Amedeo Napoli and Bernard Maigret (CNRS emeritus researcher, CAPSID team) that had started a few months earlier, during a master’s internship. This was in fact postponed to my internship, in March/April, in order to collaborate with Clément Bellanger, another M2 intern. This is why I first focused on machine learning fairness and biases with Miguel Couceiro and Guilherme Alves Da Silva (a Brazilian PhD student), and kept working on it to some extent, alongside molecules, during my end-of-study internship.
The covid-19 section
Of course, this thesis could not avoid a word on the ongoing covid-19 pandemic. Unfortunately, a new epidemic wave hit France, which forced the government to impose a new lockdown at the end of October 2020. I then worked remotely from the All Saints’ Day vacation until the 9th of June, when the laboratory finally reopened its canteen and allowed people to come in more flexibly: two days a week at first, then three days a week starting in July.
My two or three days at the laboratory were chosen by picking the hottest days of the week. I was not in the interns’ room, but in one of the team’s offices, with a PhD student. There was no schedule imposed, but I adopted a rhythm, from about 9:00 / 9:30 am until 6:00 pm. Of course, I took a lunch break to eat with the PhD students.
⁴ For some reason, this is no longer the case, so the team should now be marked on the left on the chart.
Thursday mornings are the team’s ritual: a Malotec conference⁵, organized by Miguel. A researcher comes to present a topic for a little under an hour, followed by time for questions from the audience. It is a kind of knowledge-sharing moment for the team.
During the lockdown, we kept in touch through the Microsoft Teams and Discord communication platforms. Roughly once a week, a meeting was held with Miguel and Guilherme to report progress on the research around FixOut, discuss the results and try to bring out new ideas; the same happened every Monday with Amedeo, Bernard and Clément for antibiotics.
The project is managed in an Agile fashion, with Miguel and Amedeo as coordinators. Some deadlines are imposed when it comes to submitting an article to a journal or symposium. The peer-review system gives us feedback from the scientific community of the domain, which helps us improve our future papers.
On the hardware side, I use my personal computer to work. For heavier computations, I have access to the Grid’5000 computing cluster⁶. I also use Google Colab to prepare some models.
Machine learning models
First of all, this end-of-study project deals with many machine learning models. As the two words of “machine learning” suggest, the idea is to let machines learn by themselves from data. By “model”, we mean here a function that takes an instance as input and outputs a prediction. Machine learning models thus generally involve two stages:
• the training stage, where the model learns from data: its parameters are adjusted in order to fit the training set;
• the testing stage, where we assess the model on new data.
The way the model is constructed is described by hyper-parameters.
We focus here on supervised training, which means we must tell the model what output we expect. More specifically, such a model is trained on labelled data, i.e. the dataset consists of a list of instances (inputs), each associated with the output we expect to get from the model.
We also focus on classification tasks, i.e. the role of our model is to predict one class among several (two or more). This is to be distinguished from regression tasks, which aim to predict continuous values. For a classification problem, a machine learning model generally outputs a value between 0 and 1 for each class, which we can interpret as a membership probability, or a likelihood. The final binary prediction for class membership is obtained by comparing the predicted probability with a threshold, 0.5 being the default value in many systems.
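As a minimal sketch of this thresholding step, the following Python snippet turns per-class scores into binary decisions. The function name, the example scores and the choice of counting a score exactly equal to the threshold as a membership are illustrative assumptions, not taken from the project code.

```python
def predict_classes(probabilities, threshold=0.5):
    """Turn per-class scores in [0, 1] into binary membership decisions."""
    # Assumption: a score equal to the threshold counts as a membership.
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up scores for three classes.
scores = [0.91, 0.42, 0.50]
decisions = predict_classes(scores)                 # default threshold of 0.5
stricter = predict_classes(scores, threshold=0.95)  # a more conservative threshold
```

Raising the threshold makes the model more conservative: fewer instances are assigned to a class, which trades recall for precision.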
Logistic regression
This method is quite similar to linear regression, though logistic regression is used for classification. The prediction is a simple weighted sum of the inputs, plus an additional bias. This sum is then passed through an activation function, giving a result between 0 and 1:
$$y = f\left(\sum_{i} a_i x_i + b\right)$$
where $y$ is the predicted output, $x_i$ are the inputs, $a_i$ and $b$ are the trainable parameters, i.e. parameters that will change during the training, and $f$ is a logistic function¹. A weight is attributed to each input, which gives us an indication of the importance given to each of these inputs. A common logistic function is the sigmoid, where $f$ is defined as:
$$f(z) = \frac{1}{1 + e^{-z}}$$
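The prediction step of a logistic regression can be sketched in a few lines of Python. This is only an illustration of the weighted sum followed by a sigmoid; the function names and the numeric values are assumptions chosen for the example, and no training is performed here.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_predict(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the sigmoid."""
    z = sum(a * x for a, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Made-up parameters: the two weighted inputs cancel out, so z = 0.
neutral = logistic_predict([1.0, 2.0], [0.5, -0.25], 0.0)
# A positive bias pushes the prediction above 0.5.
positive = logistic_predict([1.0, 2.0], [0.5, -0.25], 2.0)
```

The sign and magnitude of each weight indicate how strongly the corresponding input pushes the prediction towards 0 or 1.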
Random forests and AdaBoost
Random forests are ensembles of decision trees. Each decision tree of the ensemble is trained on a random subset of the dataset, from which some attributes are also randomly removed. To get the final result, each tree makes a prediction. These predictions are then averaged to obtain a global score, which generally turns out to be more accurate than that of a single decision tree trained on the whole dataset. This technique was first formalized in 2001 by the American statistician Leo Breiman.
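The principle can be illustrated with a deliberately simplified sketch. Instead of full decision trees, it uses one-split trees (stumps), each trained on a bootstrap sample and on a single randomly chosen attribute, and it averages their scores. The squared-error split criterion, the function names and the toy data are assumptions made for the example; this is not Breiman's complete algorithm.

```python
import random


def train_stump(rows, labels, feature):
    """Fit a one-split tree on one feature: (feature, threshold, left, right)."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        left_score = sum(left) / len(left)  # never empty: contains the threshold
        right_score = sum(right) / len(right) if right else 0.5
        error = (sum((y - left_score) ** 2 for y in left)
                 + sum((y - right_score) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, (feature, threshold, left_score, right_score))
    return best[1]


def stump_predict(stump, row):
    feature, threshold, left_score, right_score = stump
    return left_score if row[feature] <= threshold else right_score


def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Random subset of the dataset (bootstrap sample, with replacement).
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Random attribute: all the other attributes are ignored by this tree.
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feature))
    return forest


def forest_predict(forest, row):
    """Average the scores of all trees to obtain the global score."""
    return sum(stump_predict(s, row) for s in forest) / len(forest)


# Made-up two-attribute dataset with binary labels.
rows = [[0.1, 1.0], [0.2, 0.9], [0.3, 0.8], [0.7, 0.3], [0.8, 0.2], [0.9, 0.1]]
labels = [0, 0, 0, 1, 1, 1]
forest = train_forest(rows, labels)
low_score = forest_predict(forest, [0.1, 1.0])
high_score = forest_predict(forest, [0.9, 0.1])
```

Individual stumps trained on an unlucky sample can be wrong, but averaging many of them smooths these errors out, which is the intuition behind the ensemble.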
Because decision trees may be complex and because ensembles such as random forests typically contain around a hundred decision trees, such models are difficult to interpret. We can nonetheless visualize single trees, as shown in figure 3.1 below.
AdaBoost is also an ensemble method, but the idea is different in that it trains sub-models consecutively, one by one. It begins with a simple model (e.g. a decision tree), and then trains another one that concentrates mostly on the instances for which the previous model predicted a wrong answer, and so on. This technique is not really a model in itself, but a meta-algorithm that is agnostic to the model, i.e. AdaBoost can be used with any model. It is however used most of the time with decision trees³.
It was first introduced by Yoav Freund and Robert Schapire in 1997.
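A compact sketch of this boosting loop is given below, assuming the common formulation in which misclassified instances are given a larger weight rather than literally resampled. It uses decision stumps as sub-models and labels in {-1, +1}; the function names and the toy dataset are illustrative assumptions.

```python
import math


def best_stump(rows, labels, weights):
    """Return (error, feature, threshold, sign) with the lowest weighted error."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({r[feature] for r in rows}):
            for sign in (1, -1):
                preds = [sign if r[feature] <= threshold else -sign for r in rows]
                error = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
                if best is None or error < best[0]:
                    best = (error, feature, threshold, sign)
    return best


def adaboost_train(rows, labels, n_rounds=5):
    n = len(rows)
    weights = [1.0 / n] * n  # every instance starts with the same weight
    ensemble = []
    for _ in range(n_rounds):
        error, feature, threshold, sign = best_stump(rows, labels, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1 - error) / error)  # vote of this sub-model
        ensemble.append((alpha, feature, threshold, sign))
        # Increase the weight of misclassified instances, decrease the others.
        for i, (row, y) in enumerate(zip(rows, labels)):
            pred = sign if row[feature] <= threshold else -sign
            weights[i] *= math.exp(-alpha * y * pred)
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, row):
    score = sum(alpha * (sign if row[feature] <= threshold else -sign)
                for alpha, feature, threshold, sign in ensemble)
    return 1 if score >= 0 else -1


# Made-up one-attribute dataset: the negative class sits in the middle,
# so no single stump can separate it.
rows = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
labels = [1, 1, -1, -1, 1, 1]
model = adaboost_train(rows, labels, n_rounds=3)
predictions = [adaboost_predict(model, row) for row in rows]
```

On this toy dataset, the best single stump misclassifies two of the six instances, whereas the weighted vote of three stumps fits all of them.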
Neural networks
Neural networks consist of several layers of artificial neurons. A neuron is a simple function that takes many inputs and returns one output, which can be distributed to other neurons. A well-known and widely used one is the perceptron, introduced in 1958 by Frank Rosenblatt, inspired by the biological neuron, and described in figure 3.2. An artificial neuron differs slightly from a simple logistic regression in that it can output values in wider ranges (not only in [0, 1]), depending on the chosen activation function.
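To make the role of the activation function concrete, here is a small sketch of a single artificial neuron whose activation is passed as a parameter. The function names and numeric values are assumptions chosen for the example.

```python
import math


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def relu(z):
    """Rectified linear unit: output range [0, +inf)."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic function: output range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Same made-up inputs and parameters, three different activation functions.
inputs, weights, bias = [1.0, 2.0], [3.0, -1.0], 0.5  # weighted sum z = 1.5
out_relu = neuron(inputs, weights, bias, relu)
out_tanh = neuron(inputs, weights, bias, math.tanh)   # output range (-1, 1)
out_sigmoid = neuron(inputs, weights, bias, sigmoid)
```

With the sigmoid, the neuron is exactly the logistic regression unit; with other activations, the same weighted sum is mapped to a different output range.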
Based on this simple idea, many combinations and operations can be used to form a neural network layer. Below are the roles of some layers that we will refer to when describing the project. Even though their inner mechanisms are very interesting, we will not focus on them and will skip the details.
³ We can also add that scikit-learn and other machine learning toolkits, such as Weka and Microsoft Azure ML Studio, implement AdaBoost with decision trees by default.
Differentiability is necessary because deep learning models are trained with a gradient descent algorithm, which minimizes the error (computed between the predicted $y$ and the expected $\hat{y}$) by calculating its gradient with respect to the model’s weights. To make this possible, every operation used in the network, including the error computation, must be differentiable.
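The following sketch shows gradient descent on the smallest possible model, a single weight with a mean squared error, where the gradient can be written by hand. The model, the learning rate, the number of steps and the data are assumptions chosen for the illustration; deep learning frameworks compute such gradients automatically.

```python
def gradient_descent(xs, targets, learning_rate=0.1, steps=200):
    """Fit prediction = w * x by minimizing the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of mean((w*x - target)^2) with respect to w.
        grad = sum(2 * (w * x - t) * x for x, t in zip(xs, targets)) / n
        # Move the weight against the gradient to reduce the error.
        w -= learning_rate * grad
    return w


# Made-up data generated by the rule target = 2 * x.
learned_w = gradient_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The update only works because the error is a differentiable function of the weight; a non-differentiable operation anywhere in the chain would leave the gradient undefined.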
Discovered simultaneously in 1986 by the French researcher and engineer Yann Le Cun, and also by David Rumelhart and Geoffrey Hinton, the backpropagation algorithm is a technique used to compute the gradient with respect to any parameter of a neural network. They received the Turing Award in 2018 for their work, except for Rumelhart, who regrettably died in 2011.
Recurrent layers are used to deal with sequences of data. More precisely, we will use Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) layers, which contain a memory cell that is updated each time a new element of the sequence is given, as described in figure 3.4.
Figure 3.4: Illustration of an LSTM unit, a typical example of a recurrent layer. At step $t$ (i.e. for the $t$-th term of the sequence), the layer outputs a vector and takes its previous output, from step $t-1$, as an additional input.
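The recurrence itself, where the output of one step is fed back as an input of the next, can be sketched with a plain recurrent unit. Note that this is a deliberately simpler stand-in: it has a scalar state and none of the gates of an LSTM or a GRU, and the parameter values are arbitrary assumptions.

```python
import math


def recurrent_step(x_t, h_prev, w_in, w_rec, bias):
    """One step of a plain recurrent unit (no gates, scalar state)."""
    return math.tanh(w_in * x_t + w_rec * h_prev + bias)


def run_sequence(sequence, w_in=0.5, w_rec=0.8, bias=0.0):
    """Process a sequence element by element, carrying the state along."""
    h = 0.0  # initial state, before any element has been seen
    outputs = []
    for x_t in sequence:
        h = recurrent_step(x_t, h, w_in, w_rec, bias)  # previous output is reused
        outputs.append(h)
    return outputs


# A single non-zero element followed by zeros: the state keeps a fading
# trace of the first element.
outputs = run_sequence([1.0, 0.0, 0.0])
```

LSTM and GRU layers follow the same loop but add gates that decide what to keep in, and what to erase from, the memory cell.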
LSTMs were first described in 1995 by the German researchers Sepp Hochreiter and Jürgen Schmidhuber, and GRUs were proposed in 2014 by Kyunghyun Cho as a simplification of LSTMs. They have mostly been used for text classification and generation, since text can be considered as a sequence of characters, syllables or words.
Formulated in 2014, attention layers also deal with sequences of data, but process all the elements of the sequence at once and compute an importance weight for each element. These weights thus represent how much attention we should pay to each item, and are also called attention weights.
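A minimal sketch of how attention weights can be computed is given below. It assumes dot-product scoring against a query vector followed by a softmax, chosen here for brevity; this is not necessarily the scoring function of the 2014 formulation mentioned above, and the names and vectors are illustrative.

```python
import math


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max improves numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, sequence):
    """Return the attention weights and the weighted sum of the sequence."""
    # One relevance score per element (assumption: dot product with the query).
    scores = [sum(q * v for q, v in zip(query, item)) for item in sequence]
    weights = softmax(scores)
    dim = len(sequence[0])
    context = [sum(w * item[d] for w, item in zip(weights, sequence))
               for d in range(dim)]
    return weights, context


# Made-up sequence of three 2-dimensional elements, all processed at once.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([1.0, 0.0], sequence)
```

The weights can be read directly as an explanation of which elements the layer relied on, which is part of what makes attention attractive for interpretability.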
Understanding what an embedding is
It is very common to talk about embeddings in deep learning, and we will be using this term many times in this report.
An embedding is the short name for an embedding vector, which describes an object with a restricted number of dimensions. We expect that the more similar two embedded objects are, the smaller the distance between their vectors will be. This conversion results in a vector space of embeddings, also called a semantic space, or a latent space.
Embeddings are often applied in NLP, for example to embed words; we then talk about word embeddings. In such a latent space, words that are close in meaning are close in terms of vector distance (measured either with the Euclidean distance or with the cosine similarity). Figure 3.6 shows an artificial example illustrating the typical distribution of words in a two-dimensional latent space.
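Both measures are straightforward to compute. The sketch below uses hand-made two-dimensional vectors that do not come from any trained model; the words and coordinates are purely illustrative assumptions.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors (smaller means closer)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (larger means closer)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 2-dimensional embeddings, for illustration only.
embeddings = {
    "cat": [0.9, 0.8],
    "dog": [0.8, 0.9],
    "car": [-0.7, 0.2],
}

cat_dog_distance = euclidean_distance(embeddings["cat"], embeddings["dog"])
cat_car_distance = euclidean_distance(embeddings["cat"], embeddings["car"])
cat_dog_cosine = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car_cosine = cosine_similarity(embeddings["cat"], embeddings["car"])
```

Note that the two measures point in opposite directions: similar objects have a small distance but a large cosine similarity.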
Antibiotics classification
This part only concerns antibiotics classification, more precisely the task of predicting whether a given molecule is an antibiotic. Two models have been used for this task. They are conceptually different, but both work on the same input format.
Databases and SMILES format
Antibiotics classification is more precisely the action of classifying a molecule as an antibiotic or not. This is therefore a simple two-class classification problem. However, in some cases, we are also interested in knowing which antibiotics family the molecule belongs to. Since there are many families, the classification problem then becomes multi-class.
The datasets are therefore quite simple: CSV files that contain a molecule in the first column, as the input, followed by $n$ columns for the $n$ classes of the classification problem, each containing a zero in case of non-membership and a one otherwise.
However, it is not obvious how to encode molecules in the dataset, since molecules have complex structures, which can be simplified as graphs. In our case, all molecules are described in the SMILES format, which stands for Simplified Molecular-Input Line-Entry System. In simple terms, this is just a way to describe the graph of a molecule with a string of characters.
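To make this layout concrete, here is a small sketch that reads such a file with Python's standard csv module. The presence of a header row, the column names and the 0/1 labels are assumptions made up for the example; the labels carry no chemical meaning. The two SMILES strings are those of ethanol (CCO) and benzene (c1ccccc1).

```python
import csv
import io


def load_dataset(csv_text):
    """Split a CSV into class names, SMILES strings and 0/1 label rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)      # assumption: the first row names the columns
    class_names = header[1:]   # the n class columns follow the molecule column
    molecules, labels = [], []
    for row in reader:
        molecules.append(row[0])                   # SMILES string (input)
        labels.append([int(v) for v in row[1:]])   # one 0/1 flag per class
    return class_names, molecules, labels


# Hypothetical file content with n = 2 classes and arbitrary labels.
toy_csv = "smiles,class_1,class_2\nCCO,0,0\nc1ccccc1,1,0\n"
class_names, molecules, labels = load_dataset(toy_csv)
```

In benzene's SMILES string, the lowercase letters denote aromatic atoms and the two occurrences of the digit 1 mark where the ring closes, which is how a cyclic graph is flattened into a line of text.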
Table of contents:
2 Context and environment
2.2 About the Orpailleur team
2.3 Internship and research contract
2.4 The covid-19 section
3 State of the art
3.1 Machine learning models
3.1.1 Logistic regression
3.1.2 Random forests and AdaBoost
3.1.3 Neural networks
3.2 Antibiotics classification
3.2.1 Databases and SMILES format
3.4 Models metrics
4 Contributions to FixOut
4.1 Problem description
4.1.1 Fairness issues in machine learning
4.1.2 The purpose of FixOut
4.1.3 The interest in textual applications
4.2 General adaptation of FixOut to text
4.2.1 Problem analysis
4.2.2 Proposed solution
4.3 Adaptation to neural networks
4.3.1 Avoiding re-training of many sub-models
4.3.2 Using gradient based explainers
4.5 Interactive web demo
5 Explanation of antibiotic molecules
5.1 Problem analysis
5.1.1 Models to explain
5.2 Proposed solution
5.2.1 Explanation of DeepChem
5.2.2 Explanation of Chemprop
5.3.1 LIME and SHAP for Chemprop
5.3.2 Molecule visualizations
5.3.3 Mol graph organization
5.4.1 Interpretation of DeepChem results is not trivial
5.4.2 Chemprop, by considering the average contribution
5.4.3 By considering each feature separately
5.4.4 Color scale normalization
5.5 Comparison to PathExplain and interaction explanation
5.5.1 PathExplain for interaction explanations
5.5.2 The problem of explaining interactions with LIME
5.5.3 Implementation of PathExplain
5.7 Web interface