Get Complete Project Material File(s) Now! »

## From Counterfactual Reasoning to Counterfactual Explanations in Machine Learning

- Counterfactual reasoning. The term counterfactual originally comes from psychology (see e.g. Roese, 1997; Byrne, 2008), and Artificial Intelligence literatures (Lewis, 1973; Ginsberg, 1986). In these domains, counterfactual reasoning « is a concept that involves the creation of possible alternatives to life events that have already occurred (counterfactual world). This resoning revolves around answering the question What if …? when thinking of how things could have turned out differently » (Wikipedia definition for Counterfactual Thinking2). In psychology, counterfactual reasoning is associated to multiple benefits, some of them mentioned in the next Section 2.3.1.2. Loosely inspired from cognitive sciences, counterfactual reasoning in machine learning can be found in several tasks such as planning failures (Halpern and Pearl, 2005), reinforcement learning (Swaminathan and Joachims, 2015) or generative adversarial networks (Neal et al., 2018). Another highly studied topic is counterfactual fairness (Kusner et al., 2017), which uses counterfactual reasoning to assess the presence of bias. In this paradigm, a prediction of a model is considered to be biased if it differs in the counterfactual world.

Counterfactuals for Machine Learning Interpretability. Counterfactual explanations for machine learning predictions (also sometimes referred to as contrastive explanations) have been used for several years (Bottou et al., 2013). They have been recently the focus of attention sinceWachter et al. (2018): this law research paper dis- cusses the right for an explanation stated in the GDPR. More precisely, it proposes to use counterfactual explanations to explain individual predictions made by a blackbox model. Since then, multiple works on counterfactual explanations have been proposed (see Artelt and Hammer (2019) for a recent survey on counterfactual explanations), some of them being presented further in this section. Several reasons explain this success, some of them detailed in the next section, after counterfactual explanations for Machine Learning Interpretability are defined here. Counterfactual explanations for Machine Learning Interpretability can be seen as an adaptation of counterfactual reasoning to the context of Machine Learning Interpretability.

The key idea is to analyze the predictions of a classifier by envisaging alternative situations that may modify them. Let us consider a trained classifier and the case where a single prediction, associated with an instance from the input space, is to be interpreted. In this context, the goal of counterfactual explanations is to consider an alternative version of this instance, that is to say another instance, and study the differences observed in their predictions. This difference between the original state and the counterfactual one constitutes a change that can be observed and measured.

This change can also be interpreted as a list of actions that are required to alter the prediction. Hence, the objective of counterfactual explanations can be formulated as the following question:

1 What actions are required to alter this prediction?

2 Another, slightly different, formulation is:

3 How does applying this change impact the prediction?.

### Two Counterfactual Explanation Approaches

This section describes in turn two counterfactual approaches, namely HCLS proposed by Lash et al. (2017a), and the already presented LORE (Guidotti et al., 2019a), that are relevant for the rest of the thesis. In particular, these approaches are the focus of HCLS (Lash et al., 2017a) As mentioned earlier, finding a counterfactual explanation is equivalent to solving a specific in inverse classification problem. Lash et al. (2017b) consider a framework for model-agnostic budget-constrained inverse classification.

The idea behind this problem is to solve an inverse classification problem with a notion of maximum budget, capping the allowed cost c(e). The proposed problem is therefore slightly different from the one written in Equation2.2: e = arg max e2X p(e).

#### From Locality to Counterfactuals

As presented in Section 2.1.3, page 17, local interpretability approaches aim at generating explanations to help a user understand the reasons for a single prediction by a classifier by focusing on identifying some local behavior of the classifier (Rueping, 2006; Guidotti et al., 2018; Carvalho et al., 2019). Following the discussion of Section 2.1.2, page 13, the local explanation should give insights about the « parts » of the model that are activated specifically when making the prediction. These parts can be easily identified in the context of specific classifiers: one can mention as examples the list of activated rules of a decision tree (Guidotti et al., 2019a) or the visualization of the neuron activations of a deep neural network (Yosinki et al., 2015; Selvaraju et al., 2016). However, identifying and extracting such a local rationale is not possible in the post-hoc context, as it requires knowledge and access to the inner workings of the model to be explained. Furthermore, this local behavior can be hard to define for other models such as SVM (e.g. with a complex unknown kernel) or ensemble methods for instance: the former may use non-linear kernels which make the definition of the locality difficult, while the latter aggregate the decisions of multiple weak classifiers, making the notion of local behavior complex to define for the ensemble model. Therefore, we propose to reduce the generation of a local explanation to identifying the local decision boundary of the classifier. In this context, the local decision boundary is defined as the closest part of the decision boundary to the observation whose prediction is to be interpreted. For that reason, counterfactual explanations (presented in Section 2.3, page 28) appear as a good answer to the task of generating a local explanation. Indeed, they try to identify the closest touchpoint of the decision boundary of the classifier, and therefore generate the most local explanation possible. Nevertheless, as discussed in Section 2.3.1.3, page 32, generating a counterfactual explanation relies first and foremost on defining a cost function that is relevant for the considered problem and user. This problem is tacked in the next section, in which the desired behavior of the post-hoc counterfactual explainer (hence the proposed cost function) is discussed, and used to define our objective function.

**Discussion about the Desired Behavior**

Two objectives: locality and sparsity. As explained in Section 2.3.1.3, page 32, given a black-box classifier and an observation whose prediction is to be interpreted, a counterfactual explanation is associated to a specific data point that is predicted to belong to the other class. The final explanation given to the user is then expressed in the form of the change vector between the observation and the identified data point (see Definition 1, page 30). Following the principle of locality discussed in the previous section, the explaining data point must be as close as possible to the observation whose prediction is to be interpreted. However, defining the associated cost function in the post-hoc context is challenging. Due to the considered agnosticity assumptions, no information about the potential metric functions used by the classifier is available to help define the associated cost function of the explanation vector. Moreover, the classifier might have used some specific distance function, sometimes even defined in another feature space, to make predictions. In this context, using the Euclidean distance to define this closeness has been shown to be a reasonable assumption (see e.g. Strumbelj et al., 2009; Lash et al., 2017b). We thus propose to identify the l2-closest touchpoint of the decision boundary.

However, recalling the second objective of explanation methods defined in Section 2.1.2, page 13, in order to be usable, the explanation should furthermore be easily understandable by the user. In this context, focusing solely on the Euclidean distance may lead to explanations that are difficult to read. This is especially true in the case of high dimensional data for instance: finding a counterfactual example by looking for the closest in terms of Euclidean distance would lead to an explanation involving a high number of attributes. As evoked in Section 2.1.4, page 19, including the sparsity of the explanation in the formulated objective seems therefore important.

**Proposed Problem Formalization and the Growing Spheres Algorithm**

As stated in the previous section, generating a counterfactual explanation that is both local and sparse is challenging in the post-hoc paradigm. Existing approaches thus generally rely on optimizing one of these notions at the cost of the other. In this section, we propose a formalization of the objective of the explainer and use the discussions of Section 3.1 to propose a post-hoc counterfactual explanation approach, called Growing Spheres.

First, in Section 3.2.1, a formalization of the counterfactual problem we want to solve is proposed. Section 3.2.2 describes the Growing Spheres algorithm. Then, a discussion about the main hyperparameters that Growing Spheres is relying on is proposed in Section 3.2.3. Finally in Section 3.2.4, we tackle the extension of the proposed approach to the context of multiclass classification.

**Growing Spheres Hyperparameters: n, h and w**

The Growing Spheres algorithm relies on three hyperparameters that highly impact the performance of the algorithm and its outputs. While the hyperparameter h (radius of the initial hypersphere) mostly influences the computation time of the algorithm rather than its results, this is not the case for the hyperparameters n and w. Indeed, at each iteration of the Generation step, n instances are drawn following a uniform distribution over the hyperspherical layers SL, defined by its width w. Yet, the approximation error considered to define ˜ e with respect to the analytical solution e directly depends on the density of the instances drawn in the space, i.e. the volume of SL divided by the number of generated instances n. However, because this value is less transparent for the user, we choose to consider w and n separately instead of defining the density as a hyperparameter. In practice, we choose an arbitrary value for w (that needs to be small for the considered problem) and calibrate Growing Spheres using n only. An obvious consequence of this choice is that the density of the generated instances decreases exponentially at each iteration, meaning that the average approximation error increases with the number of iteration. However, this assumption seems more reasonable than requiring either n to increase exponentially or w to decrease exponentially for convergence reasons.

Nevertheless, considering the highly stochastic nature of the Growing Spheres algorithm, although these parameters influence the produced results, their exact value is not crucial.

**Adapting the Algorithm to Multiclass Classification**

This chapter, and in particular the proposed Growing Spheres algorithm, focuses on binary classification. However, adapting the considered problem to the context of multiclass classification is not particularly challenging. When there are more than two classes, two possibilities can arise: counterfactual explanations can either be tar- geted or untargeted. Considering a multiclass classifier f : X ! Y, the following modifications ensue:

Untargeted counterfactuals. Untargeted counterfactuals aim at generating the closest instance classified differently. The predicted class itself of the final counterfactual example is thus not important. Applying this to Growing Spheres is thus easy, since the algorithm does not need to be modified: the closest instance e f 2 X satisfying f (e f ) 6= f (x) is still thus to be returned. Targeted counterfactuals. Targeted counterfactuals aim at generating the closest instance predicted to belong to a specific class l 2 Y. The stopping criterion of the generation step thus becomes f ( ˜ e) = l. Similarly, the condition in the projection step becomes f (e0) = l (instead of line 4 of Algorithm 3) to ensure that f (projH( ˜ e)) = l.

**Table of contents :**

**1 Introduction **

**2 Technical Context **

2.1 Key Notions of Machine Learning Interpretability

2.2 Surrogate Model Approaches

2.3 Counterfactual Explanation Approaches

2.4 Conclusion

2.5 Notations

**3 Generating Post-hoc Counterfactuals and the Risk of Out-of-distribution Explanations **

3.1 Motivations

3.2 Proposed Problem Formalization and the Growing Spheres Algorithm .

3.3 Experimental Validation

3.4 Discussion: Out-of-Distribution Counterfactuals

3.5 Conclusion

**4 The Risk of Unjustified Explanations **

4.1 Ground-truth Justification

4.2 LRA: an Algorithm to Detect Unjustified Classification Regions

4.3 Experimental Assessment of the Local Risk of Generating Unjustified Counterfactuals

4.4 VE: An Algorithm to Assess the Vulnerability of Post-hoc Counterfactual

Approaches

4.5 Conclusion

**5 Defining Explanation Locality for Post-hoc Surrogate Models **

5.1 Locality for Local Surrogate Models

5.2 Measuring Locality: the Local Fidelity Criterion

5.3 A New Local Surrogate Approach: the LS Algorithm

5.4 Discussion: Local Surrogates and Counterfactuals

5.5 Conclusion

**6 Conclusion and Perspectives **

6.1 Summary of the Contributions

6.2 FutureWorks

**Appendix A Justification of Adversarial Examples on MNIST **

**References **