Improvement of the failure prediction model
To start from a solid basis, we began by reading research articles so as to give the project an initial direction. We retained 4 articles that could be useful to us. How did we choose them?
First, we had to take the constraints of the problem into account. Our goal is to provide a list of companies to people who do not necessarily have knowledge of, or even confidence in, Artificial Intelligence and, more specifically, Data Science. It was therefore important to build this trust, which makes the explainability of the model a major criterion: so-called « black box » models were to be avoided as much as possible. Second, we wanted models that would improve our results, that is, models that perform well on the different metrics used. Finally, we considered the volume of data and the time complexity: we want robust models that can be trained on data sets of around one million entries without the training taking too long.
Thus, we have chosen the following 4 articles :
— Systematic review of Bankruptcy Prediction Models – Alaka et al, 2015  : Comparison of several models, advantages and disadvantages of each
— Bankruptcy prediction using imaged financial ratios and convolutional neural networks  : Use of Deep Learning (CNN)
— Anchors : High-Precision Model-Agnostic Explanations  : Adding explainability to the model
— White-box Induction From SVM Models : Explainable AI with Logic Programming  : Explainability for SVM only
Systematic review of Bankruptcy Prediction Models – Alaka et al, 2015
In this article, several methods are tested on different datasets, and the models are evaluated according to several criteria. Among these criteria are the three we need (explainability, performance and execution time). There are 5 others, which are the following (ranked by importance) :
— Multicollinearity/Correlation of variables : measures the sensitivity of the model to the presence of collinear features
— Assumptions needed for the model : measures the initial constraints that must be met by, for example, the input data
— Variable selection : the selection of useful variables to optimize the model performance
— Robustness to the non-homogeneity of the distribution : this is useful here because the distribution of our variables is heterogeneous (only a small share of the firms go bankrupt, which creates a strong imbalance)
— Propensity to overfit : this should not be much of an issue here because our volume of data remains modest
The models we have chosen are: logistic regression (LR), support vector machine (SVM), decision tree (DT) and multi-layer perceptron (MLP). Let us detail what each of these models does. Logistic regression is a conditional probability model that applies the sigmoid function to a linear combination of the features. It is a classical binary classifier that returns a value between 0 and 1, interpreted as the probability that a company goes bankrupt. By default, a probability above 0.5 is classified as a bankruptcy, but this threshold can be modified.
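The thresholding step described above can be illustrated with a minimal sketch. The coefficients, the feature values and the function names (`bankruptcy_probability`, `classify`) are hypothetical and chosen only for illustration; they are not the fitted model of this report.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bankruptcy_probability(weights, bias, features):
    """Sigmoid of a linear combination of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def classify(probability, threshold=0.5):
    """Return 1 (bankruptcy) when the probability is above the threshold."""
    return 1 if probability > threshold else 0

# Hypothetical coefficients and one hypothetical firm with two features.
weights = [-1.2, 0.8]
bias = -0.5
firm = [0.3, 1.5]

p = bankruptcy_probability(weights, bias, firm)  # about 0.58
default_label = classify(p)                      # default threshold of 0.5
strict_label = classify(p, threshold=0.7)        # stricter threshold
```

With these made-up numbers the same firm is flagged with the default threshold but not with the stricter one, which is the trade-off the threshold controls.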
The support vector machine looks for the hyperplane that optimally separates the two classes, that is, the one with the largest margin. The training points closest to this hyperplane are called support vectors, and they determine the decision boundary and therefore the output (here, bankruptcy or not).
The decision tree is a recursive model that works by successive binary splits. Once trained, it provides decision rules. The order in which these rules are applied is determined by the importance of each feature. For example, if turnover turns out to be more important for the classification than the number of employees, the rule on turnover is placed upstream, closer to the root.
The multi-layer perceptron is the basic neural network model. It consists of several layers, each fully connected to the next. The design is loosely inspired by the functioning of the human brain, hence the term neural network.
Nevertheless, this article has some limitations. First, we can wonder whether cross-validation was applied for each of the methods used, and whether the temporal nature of the data is actually taken into account by the neural models, which would be of great help in our case. Moreover, most of the studies included assume a roughly balanced dataset: 37 studies use proportions between 50% bankruptcies / 50% non-bankruptcies and 30% / 70%, only 2 studies use proportions around 25% / 75%, 2 more around 15% / 85%, and only one uses 0.3% / 99.7%. For the studies with balanced datasets, we would have to compare the results of the review with our own results after oversampling to obtain a balanced dataset. On the other hand, it is convenient for us that the studies with unbalanced datasets use the models we selected.
Bankruptcy prediction using imaged financial ratios and convolutional neural networks
This article was of particular interest to us because it uses sophisticated Deep Learning methods, which was one of our goals. Let us detail what it brings.
First, let us explain why Convolutional Neural Networks (CNN) are used. CNNs obtain much better results (in terms of performance) than conventional methods, but they are designed for image-related problems. Therefore, in order to use a CNN, the numerical data must first be transposed into images.
Now let us turn to the database used. The study focuses on companies listed on the Tokyo Stock Exchange. The data come from the Nikkei NEEDS Financial QUEST database as of the end of June 2016, taking the last 4 available values for each company. Data collection is not regular: for some companies the data are only available quarterly, and some periods may be missing, which is why the last 4 values obtained are used rather than the values of the 4 preceding periods. In total, there are 175 input features from the balance sheets and 88 features from the P&L, which makes a total of 283 input features. A first filter is then applied: only the features that are actually present for at least p% of the companies are kept (here, p is 80). After this filter, the total number of features is 133.
This leaves 102 firms that failed and 2062 that did not, for a failure rate of 4.71%. The bankrupt firms are split into 5 sets of 20 firms, and the last two are left aside. In the same way, 5 sets of 20 are formed from the non-bankrupt firms and the 1962 others are left aside. These sets are transformed into images, and synthetic data are created using the weighted average method. Four of the 5 sets are then drawn for each of the two types of companies (there are thus 5 possible choices of training set/test set pairs). The training set contains 50% bankruptcies and 50% non-bankruptcies, for a total of 7520 x 2 images. The test set contains 88 images for the bankruptcies and 7928 for the non-bankruptcies. In other parts of the paper, the authors choose not to generate synthetic data for the non-bankrupt firms (which are already over-represented), or not to respect the 50-50 proportion in the training set. The results differ but are not necessarily better.
The paragraphs above explain how the data are distributed and which data are kept when building the dataset.
Let us now detail how the numerical data are transformed into images. Two methods are applied:
— a random method, which assigns each item at random to a pixel of an image of size N x N, where N is the number of items;
— a correlated method, which assigns correlated ratios to pixels that are « close » to each other.
Both methods are sensible: with the random method, a CNN with enough layers is able to detect patterns even between distant pixels, while with the correlated method the patterns are discernible with fewer layers. To begin, the ratios are placed at random on the image, which yields the images of the random method. We then compute the cost function (energy) associated with this image:
$$d(i, j) = \sqrt{[x(i) - x(j)]^2 + [y(i) - y(j)]^2}$$
$$E = \sum_{(i, j)} |c(R(i), R(j))| \, d(i, j)$$
where R(i) is the financial ratio i, c(R(i), R(j)) is the correlation coefficient between ratio i and ratio j, and (x(i), y(i)) are the coordinates of pixel i. If E can be reduced by swapping two pixels, the swap is made; otherwise nothing changes. This process is repeated and stops when E has not decreased for 3N steps. Once this is done, synthetic data are generated. To do this, existing firms are « averaged », under the assumption that the resulting synthetic firm remains in the same class (bankruptcy or not) as the two firms used. Synthetic data are needed because there are not enough observations in the bankruptcy category, so companies belonging to it are created artificially.
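The swap procedure and the averaging step can be sketched as follows. This is a simplified illustration, not the implementation of the paper: the correlation matrix is a made-up toy, a « swap » is modelled as moving one ratio to a random pixel (exchanging places if that pixel is occupied), and the names `layout_energy`, `correlated_layout` and `blend`, as well as the blending weight, are assumptions.

```python
import math
import random

def layout_energy(positions, corr):
    """E = sum over pairs (i, j) of |c(R(i), R(j))| * d(i, j)."""
    total = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            (xi, yi), (xj, yj) = positions[i], positions[j]
            total += abs(corr[i][j]) * math.hypot(xi - xj, yi - yj)
    return total

def correlated_layout(positions, corr, size, rng, patience):
    """Accept a move only if it lowers E; stop after `patience` failures in a row."""
    positions = list(positions)
    energy = layout_energy(positions, corr)
    stalled = 0
    while stalled < patience:
        i = rng.randrange(len(positions))
        target = (rng.randrange(size), rng.randrange(size))
        trial = list(positions)
        if target in trial:
            # The target pixel already holds another ratio: exchange the two.
            trial[trial.index(target)] = trial[i]
        trial[i] = target
        candidate = layout_energy(trial, corr)
        if candidate < energy:
            positions, energy, stalled = trial, candidate, 0
        else:
            stalled += 1
    return positions, energy

def blend(firm_a, firm_b, weight=0.5):
    """Synthetic firm as a weighted average of two firms of the same class."""
    return [weight * a + (1 - weight) * b for a, b in zip(firm_a, firm_b)]

# Toy example: 4 ratios on a 4 x 4 image, ratios (0, 1) and (2, 3) correlated.
n = 4
corr = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
rng = random.Random(0)
cells = [(x, y) for x in range(n) for y in range(n)]
start = rng.sample(cells, n)                 # layout of the random method
start_energy = layout_energy(start, corr)
final, final_energy = correlated_layout(start, corr, n, rng, patience=3 * n)
synthetic = blend([1.0, 3.0], [3.0, 5.0])    # hypothetical ratio values
```

Because a move is only accepted when it strictly lowers E, the final energy can never exceed the energy of the random layout.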
The architecture used afterwards is GoogLeNet, a CNN with 27 layers and about 7,000,000 parameters, which can be seen in figure 1.
The limitations of this article are various. The first is the complexity of the architecture: deep neural networks require more resources for training and prediction. Moreover, these models are not explainable. An explanatory layer would therefore have to be added to the pipeline, making the model even more complex.
Anchors : High-Precision Model-Agnostic Explanations
In this article, the authors present an algorithm to explain « black box » models, such as the one described in the previous article.
The problem with existing explanatory models is that their explanations are too local: it is unclear to which other inputs they still apply (e.g. in « it’s not too bad » the « not » is positive, whereas in « it’s not too good » it is negative, and most explanatory algorithms fail to make this distinction). The goal is to overcome this problem. The Anchors model relies on intuitive rules to explain the prediction of the model.
The goal of the algorithm is to find the anchor (a kind of rule set) that satisfies the precision constraint with high probability. If several anchors are suitable, the one with the largest coverage is kept (the set of rules that applies to the largest number of inputs). This amounts to an optimization problem.
The algorithm works as follows. Initially, the set of rules A is empty (an empty anchor applies to all inputs). A single rule is then added, which gives the set of rules at step 1. To find this rule, one candidate rule is generated per feature, and among these candidates the one that best satisfies the condition stated in the previous paragraph is kept. This process is repeated until the constraint is met. It yields the smallest anchor, but says nothing about its « coverage », i.e. its scope.
This greedy approach has some qualities but also some flaws. First, it adds only one rule at each step, so any suboptimal choice weighs heavily on the following steps. Moreover, it does not seek to satisfy the « coverage » condition, but rather to find the smallest set of rules. Another approach is therefore necessary to respect both criteria. This approach (a beam search) is similar to the greedy one, except that instead of keeping only the best candidate at each iteration, it keeps a predefined number B of them. Among these, it chooses the one with the highest coverage, so the second criterion is more likely to be respected.
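To make the difference between the greedy and the beam variants concrete, here is a toy sketch. It is a strong simplification and not the authors' implementation: precision and coverage are measured exhaustively on a small table instead of being estimated by sampling, a rule is simply « feature k has the same value as in the explained instance », and the names `precision_and_coverage`, `beam_anchor`, `tau` and `toy_model` are assumptions.

```python
from itertools import product

def precision_and_coverage(anchor, instance, rows, model):
    """Empirical precision and coverage of an anchor (a set of feature indices)."""
    covered = [r for r in rows if all(r[k] == instance[k] for k in anchor)]
    if not covered:
        return 0.0, 0.0
    target = model(instance)
    precision = sum(model(r) == target for r in covered) / len(covered)
    return precision, len(covered) / len(rows)

def beam_anchor(instance, rows, model, tau=0.95, beam_width=2):
    """Grow anchors one rule at a time, keeping `beam_width` candidates per step.

    beam_width=1 gives the greedy variant. Returns (coverage, rules) or None.
    """
    n_features = len(instance)
    beam = [frozenset()]  # the empty anchor applies to every input
    for _ in range(n_features):
        candidates = {a | {k} for a in beam
                      for k in range(n_features) if k not in a}
        scored = [(precision_and_coverage(c, instance, rows, model), c)
                  for c in candidates]
        valid = [(cov, sorted(c)) for (prec, cov), c in scored if prec >= tau]
        if valid:
            return max(valid)  # highest coverage among anchors precise enough
        scored.sort(key=lambda item: item[0][0], reverse=True)
        beam = [c for _, c in scored[:beam_width]]
    return None

def toy_model(row):
    """Hypothetical black box: predicts 1 only when features 0 and 2 are both 1."""
    return int(row[0] == 1 and row[2] == 1)

rows = list(product([0, 1], repeat=3))  # all 8 binary inputs
instance = (1, 0, 1)                    # the prediction to explain

beam_result = beam_anchor(instance, rows, toy_model, beam_width=2)
greedy_result = beam_anchor(instance, rows, toy_model, beam_width=1)
```

On this toy problem both variants recover the rule set {feature 0, feature 2}, which covers a quarter of the inputs; the two only diverge when an early greedy choice rules out a better anchor.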
In terms of results, LIME is compared to Anchors on precision and coverage. LIME is an algorithm that perturbs the input values, observes the consequences on the prediction, and thus draws conclusions about how the model works. Three models are compared: a logistic regression, a neural network with 2 layers of 50 units each, and 400 gradient boosted trees. They are applied to three classification tasks: first, predicting the income of adults (more or less than 50k/year) from data about them; second, predicting whether a person who has had dealings with the law will be sentenced again; and finally, predicting the quality of a loan from the borrower's financial data. For the three tasks and the three models, Anchors gives a much better precision than LIME (about 97% versus about 70%). In terms of coverage, however, LIME is more interesting for the first two tasks, while Anchors is better for the third. On average, coverage is 12% for Anchors against 15% for LIME, so it remains of the same order of magnitude.
The limits of this model are the following. Near the decision boundary, the rules can become very specific and therefore generalize poorly, or rules can conflict with one another. Finally, coverage is weak: only 12% to 15% of the predictions are explained, which leaves 85% of the predictions unexplained.
White-box Induction From SVM Models : Explainable AI with Logic Programming
The goal is to bring explainability to the SVM, so that the resulting rules are understandable, accurate and faithful to the model. To this end, accuracy, recall and F1-score are measured (the latter to assess fidelity).
First, we look at SHAP, which measures the importance of each feature in the model. For every feature xi, the model is applied to every subset S of features not containing xi, then to the same subset with xi added, and the two results are subtracted. Taking a weighted average of these differences over all subsets not containing xi gives what is called the Shapley value of feature i. Doing this for all data samples and all features yields a matrix.
That is the first approach. Let us now turn to the Shap FOIL approach, which is a bit more sophisticated. SHAP determines a set of features that leads to a certain decision of the model. Shap FOIL goes further: if a set of features explains the decision for a certain support vector, then it provides an explanation for all data points « similar » to this vector.
This model is compared to ALEPH (a state-of-the-art algorithm in inductive logic programming, ILP), both being applied to SVMs on a dataset from the UCI repository that collects data such as heart rate, blood pressure, etc. The dataset is separated into 8 categories. A classical SVM and the two other models are applied. In 7 cases out of 8, Shap FOIL outperforms ALEPH. The fidelity of Shap FOIL is shown by its F1-score, which is close to 0.9, against 0.8 for ALEPH. Above all, its F1-score always remains higher than 0.8, whereas that of ALEPH drops to 0.55 in one case. In addition, Shap FOIL produces fewer rules than ALEPH (roughly half as many), which is a big advantage for explainability and for the user's understanding.
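The computation described above can be written out exactly for a small number of features. In the sketch below, the average over subsets uses the standard Shapley weights |S|! (n - |S| - 1)! / n!. The set function `toy_value`, which stands in for « the model applied to a subset of features », is a hypothetical example; how SHAP actually evaluates a model on a subset of features is not reproduced here.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n_features):
    """Exact Shapley values for a set function value(frozenset) -> float."""
    features = range(n_features)
    phi = [0.0] * n_features
    for i in features:
        others = [f for f in features if f != i]
        for size in range(len(others) + 1):
            # Weight of a subset of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_value(subset):
    """Hypothetical model output: two additive effects and one interaction."""
    out = 0.0
    if 0 in subset:
        out += 2.0
    if 1 in subset:
        out += 1.0
    if 0 in subset and 2 in subset:
        out += 4.0
    return out

phi = shapley_values(toy_value, 3)
```

In this toy case the interaction term of 4 is shared equally between features 0 and 2, and the three values sum to the difference between the output with all features and the output with none. The enumeration is exponential in the number of features, so it only serves as an illustration.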
Benchmark of the different models
As said before, we decided to restrict ourselves to a benchmark of a few models. We kept the four introduced earlier (LR, SVM, DT and MLP), and we decided to add two others: the Random Forest, which trains N Decision Trees in parallel and takes a vote on their outcomes, and Gradient Boosting, an ensemble architecture that aggregates models (here Decision Trees) sequentially, each new tree being trained to correct the errors made by the previous ones. Finally, we decided to look at the Voting Classifier built on these six models. Put simply, it applies the six models above simultaneously and takes a vote on their outcomes. Two choices are possible: either a weighted vote, in which case weights must be assigned to each of the models, or a majority vote without any particular weighting (all models count equally).
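The two voting schemes can be sketched in a few lines. The predictions and the weights below are made up, and the tie-breaking rule (a tie goes to class 0) is an assumption of this sketch.

```python
def majority_vote(predictions):
    """Unweighted vote over binary predictions; a tie goes to class 0."""
    return 1 if 2 * sum(predictions) > len(predictions) else 0

def weighted_vote(predictions, weights):
    """Weighted vote: class 1 wins if it gathers more than half the total weight."""
    score = sum(w for p, w in zip(predictions, weights) if p == 1)
    return 1 if 2 * score > sum(weights) else 0

# Hypothetical outputs of the six models for one company
# (order: LR, SVM, DT, MLP, Random Forest, Gradient Boosting).
predictions = [1, 0, 0, 1, 1, 0]
weights = [1, 1, 1, 2, 2, 1]  # hypothetical weights favouring MLP and Random Forest

unweighted = majority_vote(predictions)
weighted = weighted_vote(predictions, weights)
```

With six models a 3-3 split is possible, as in this example, so the unweighted vote needs a tie-breaking convention, whereas suitable weights can settle the outcome.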
In what follows, unless otherwise stated, we have a train set of 8.4 million entries, split into 80% for training and 20% for validation. Similarly, we have a test set of 1.2 million entries. The pipeline is as follows: first, we apply one-hot encoding to our textual features, then we normalize our data, and then we apply the chosen model. Finally, we perform a Grid Search, which optimizes the hyperparameters of the model (for a Decision Tree, for example, an interesting hyperparameter is the depth of the tree) according to a chosen metric (here, the Balanced Accuracy). It is important to specify that this optimization is only partial: the Grid Search only explores a (non-exhaustive) list of values supplied by the user.
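The principle of a grid search scored by balanced accuracy can be sketched as follows. This is not the pipeline of the report: the « model » is a hypothetical one-feature threshold rule that needs no training, whereas a real grid search refits the model for every combination. The data, the grid and the names `build_stump` and `grid_search` are assumptions.

```python
from itertools import product

def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the positive class and the recall on the negative class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    positives = sum(t == 1 for t in y_true)
    negatives = len(y_true) - positives
    return 0.5 * (tp / positives + tn / negatives)

def grid_search(build_model, grid, X_val, y_val):
    """Try every combination in the grid; keep the best balanced accuracy."""
    best_params, best_score = None, -1.0
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = build_model(**params)
        score = balanced_accuracy(y_val, [model(x) for x in X_val])
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def build_stump(feature, threshold):
    """Hypothetical model: predict bankruptcy when one feature exceeds a threshold."""
    return lambda x: int(x[feature] > threshold)

# Hypothetical, imbalanced validation data (2 bankruptcies out of 8 firms).
X_val = [(0.1, 5), (0.2, 1), (0.3, 4), (0.4, 2),
         (0.5, 3), (0.6, 6), (0.8, 2), (0.9, 5)]
y_val = [0, 0, 0, 0, 0, 0, 1, 1]
grid = {"feature": [0, 1], "threshold": [0.25, 0.7, 3]}

best_params, best_score = grid_search(build_stump, grid, X_val, y_val)
always_negative = balanced_accuracy(y_val, [0] * len(y_val))
```

A classifier that never predicts a bankruptcy would reach 75% plain accuracy on this toy data but only 0.5 in balanced accuracy, which is why the latter suits an imbalanced problem. The search only ever sees the six combinations listed in `grid`, which illustrates the partial nature of the optimization.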
Logistic Regression
We decided to use Logistic Regression for two reasons: first, to improve the existing model (which is itself a logistic regression), and second, to have a classical binary model that will serve as a control in what follows.
For this model, we applied a Grid Search on two hyperparameters: the penalty (l1, l2, elasticnet or none) and the coefficient C, which controls the strength of the regularization of the model.
Following this step, the hyperparameters that maximize our balanced accuracy were an l2 penalty and a coefficient C of 1000 (since it is the inverse of C that counts, this large value reduces the importance of the regularization).
To assess this model, we looked at the metrics discussed earlier, together with another metric that we use in all our later tests: the Area Under the Precision-Recall Curve (AUCPR). This metric is the area under the curve that plots Precision against Recall as the decision threshold varies. It complements Precision and Recall taken individually, because it measures the trade-off between the two, i.e. whether or not it is worthwhile to decrease one in order to increase the other.
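As an illustration, the area under the precision-recall curve can be estimated with the step-wise sum often called average precision, which is one common estimator among several. The labels and scores below are made up, and the sketch assumes that all scores are distinct.

```python
def precision_recall_points(y_true, scores):
    """(recall, precision) after each example, from highest to lowest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(y_true)
    tp = fp = 0
    points = []
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_positives, tp / (tp + fp)))
    return points

def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve."""
    area, previous_recall = 0.0, 0.0
    for recall, precision in precision_recall_points(y_true, scores):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area

# Hypothetical labels (1 = bankruptcy) and predicted probabilities.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]

ap = average_precision(y_true, scores)
perfect = average_precision([1, 1, 0], [0.9, 0.8, 0.1])
```

A model that ranks every bankruptcy above every healthy firm reaches an area of 1, and each healthy firm ranked above a bankruptcy lowers the precision at the corresponding recall level.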
Table 2 summarizes all the metrics for this model. Compared to the current model, two of the main metrics improve: we gain 1.45% in balanced accuracy and 0.225 in F2-score. Nevertheless, the improvement remains relatively small for the first metric. The objective, if we follow Alaka et al., would be to obtain results above 70%.
Looking at the other metrics, Precision and Recall are in line with what the current model already offered. The AUCPR, on the other hand, has decreased (by about 0.1). The proposed model is therefore an improvement on two of our reference metrics, but it is not a clear-cut winner, and it is worthwhile to continue the research.
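For reference, the F2-score belongs to the F-beta family, which combines Precision and Recall and, for beta = 2, gives more weight to Recall. A minimal sketch, with made-up Precision and Recall values:

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical values: the model finds most bankruptcies but with false alarms.
f2 = f_beta(0.5, 0.8)            # about 0.71
f1 = f_beta(0.5, 0.8, beta=1.0)  # about 0.62
```

With these values the F2-score is higher than the F1-score because Recall is the stronger of the two quantities.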
Decision Tree
One of the reasons for choosing the Decision Tree was its explainability and the possibility of extracting a chain of rules intelligible to users.
For the hyperparameters, we therefore focus on the two most important ones: the depth of the tree (the deeper it is, the more rules there are, and the less interpretable the tree becomes for the user) and the criterion (the function that measures the contribution of adding a new rule, i.e. the quality of a split).
Here we arrived at an interesting case. Beyond a certain depth, we observed (see figure 2) that the Test Accuracy (the one we are interested in) remained relatively constant (with only a small increase) as the depth of the tree increased, until over-fitting set in at a depth of 12. To maintain relatively good explainability, we therefore limited ourselves to a depth of 5.
Table of contents :
Context and problem statement
Improvement of the failure prediction model
1 Related Work
1.1 Systematic review of Bankruptcy Prediction Models – Alaka et al, 2015
1.2 Bankruptcy prediction using imaged financial ratios and convolutional neural networks
1.3 Anchors : High-Precision Model-Agnostic Explanations
1.4 White-box Induction From SVM Models : Explainable AI with Logic Programming
2 Benchmark of the different models
2.1 Logistic Regression
2.2 Decision Tree
2.3 Support Vector Machine
2.4 Random Forest
2.5 Gradient Boosting
2.6 Multi Layer Perceptron
2.7 Voting Classifier
2.8 Conclusion of this benchmark
3 Taking into account the temporality
3.1 Comparison between the initial dataset and the fully modified dataset
3.2 Comparison between the initial dataset and the datasets where only the lag variables of a feature category are added
3.2.1 Logistic Regression
3.2.2 Decision Tree
3.2.3 Random Forest
3.2.4 Multi Layer Perceptron
4 A model by sector ?
5 The importance of features
5.1 Logistic Regression
5.2 Decision Tree
5.3 Random Forest
Table of Figures and Tables