Chapter 3 Measures of Model Assessment
This chapter describes the methods that will be used in Chapters 4 and 5 to assess the performance of the Bayesian models developed in Chapter 2 (non-hierarchical models, including two-component mixtures) and Chapter 5 (hierarchical models, including two-component mixtures) for predicting stutter ratio. The theoretical background, benefits, and limitations of various performance measures are reviewed in order to identify appropriate measures for evaluating the Bayesian statistical models presented.
Statistical models, in general, are developed based on a few fundamental assumptions. The distributional assumption on the data, for instance, plays a key role in the plausibility of the inference that is based on the fitted model. Models used in Bayesian data analysis are also subject to key assumptions, and a model whose assumptions are implausible tends to produce misleading inferences. Therefore, assessing these assumptions is always good practice. Generally, it is essential to check the capability of a model to produce a realistic summary of the data at hand. In the classical (frequentist) approach, comparisons between the observations and the predictions (expected results under the model) are used as the basis of goodness-of-fit tests that quantify the inconsistency in terms of a probability value (p-value). In the context of Bayesian data analysis, the posterior predictive distribution, which describes the characteristics and statistical behaviour of unobserved future observations conditioned on the observed data at hand, is used to answer prediction problems.
In general, a statistical model is a probabilistic system that involves a probability distribution or a finite/infinite mixture of distributions. These models are widely used for explanation, prediction, or making inferences about real-world phenomena. It is possible to approximate a given phenomenon with more than one model. Accordingly, the complexity of a model can vary from simple to very complex, and very complex models may include a very large number of parameters. Models where the number of parameters can grow with the size of the training data set are more appropriately referred to as non-parametric; a fully non-parametric model, for example, may consist of an enormous number of parameters. In 1976, George E. P. Box stated, "all models are wrong, but some are useful". This view is widely accepted in modelling: no single statistical model is able to capture the real mechanism behind naturally generated data. However, a model that is rich enough to approximate the behaviour of the data, including the essential uncertainties, is generally accepted as a good model. Usually, it is more convenient to build different models based on one particular distribution (e.g. regression models with a normal distribution). A set of such models can be easily compared using an appropriate criterion. However, in situations where the models originate from different distributions, the comparisons become more challenging. This is further complicated by the use of models adopting different modelling concepts. For example, a situation that requires selecting one model out of a set including hierarchical, mixture, and hierarchical mixture models will be very complicated in practice.
Assessing Bayesian models can involve evaluating the fit of a model to the data and comparing several candidate models for predictive accuracy and potential improvements. The methods available for assessing model fit are of three types [74, 75]:
- posterior predictive checks
- prior predictive checks
- mixed checks.
Prior predictive checks are used to evaluate replications with different parameter values, whereas mixed checks are used for evaluating hierarchical models. In posterior predictive checks, data simulated under the fitted model are compared with the actual data. This examines whether there are systematic differences between the actual and replicated data. Predictive model accuracy is estimated using information criteria such as the Akaike information criterion (AIC), Bayesian information criterion (BIC), deviance information criterion (DIC), and Watanabe-Akaike (or widely applicable) information criterion (WAIC), and by cross-validation (CV). In addition, Bayesian p-values calculated from discrepancy measures (test quantities) can also be used as tools for posterior predictive checks, especially in the context of model improvement. The goal of information criteria is to obtain an unbiased measure of out-of-sample prediction error [74, 173]. Since posterior checks use the data twice, once for model estimation and once for testing, a penalty constant or bias correction is applied to these criteria. Although these criteria are unable to reflect goodness-of-fit in an absolute sense, the differences in the chosen information criterion between competing models can measure their relative performance. However, the use of some of these measures is only valid under certain circumstances. Computational cost is another concern: the calculation of predictive accuracy measures should not take a long time relative to fitting the model and obtaining the initial posterior draws.
Any particular model may provide an adequate fit to the data; however, there may be plausible alternative models that are capable of producing a fairly similar fit. Therefore, in the context of posterior inferences, where the model at hand differs from the others, posterior predictive checks are very informative. Any discrepancy observed as a result of this self-consistency assessment is considered a consequence of model misfit, chance, or both.
Tail-area probabilities can be used, as they are in classical statistics, to obtain p-values. The posterior distribution of the unknown model parameters is used to answer Bayesian inferential problems. Hence, a test quantity that represents the level of discrepancy between the fitted Bayesian model and the data is a function of both the data and the unknown model parameters.
Let us assume that the observed data and all the parameters of the fitted model are denoted y and θ respectively, where any hyperparameters in a hierarchical model are also included in the parameter vector θ. The simulated data drawn from the posterior predictive distribution and the future observable data are denoted by y^rep and ỹ respectively. Then the posterior predictive distribution of ỹ (equivalently, the distribution of y^rep) is defined in terms of the posterior of the unknown parameter θ as

\[ p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta . \]

T(y, θ) denotes a discrepancy measure that summarises the parameters and data into a scalar, and T(y) is a test statistic that depends only on the data. The tail-area probability is calculated from the posterior simulations of (θ, y^rep) and used to measure the goodness-of-fit of the data with respect to the posterior predictive distribution. The extremeness of the simulated data in comparison with the observed data is calculated as a probability p_B, the Bayesian p-value:

\[ p_B = \Pr\big( T(y^{rep}, \theta) \ge T(y, \theta) \mid y \big). \qquad (3.1) \]
The posterior density of θ, p(θ | y), the posterior predictive density of y^rep, p(y^rep | θ, y) = p(y^rep | θ), and the indicator function I_{T(y^rep, θ) ≥ T(y, θ)} are used to calculate the posterior predictive p-value p_B as follows:

\[ p_B = \iint I_{T(y^{rep},\, \theta) \ge T(y,\, \theta)}\; p(y^{rep} \mid \theta)\, p(\theta \mid y)\; dy^{rep}\, d\theta . \]
Let us consider the observed data y consisting of n observations, and assume that there are S draws from the posterior distribution of θ. Then each replicated data set y^{rep,s} also consists of n values, one set for each parameter draw θ_s (where s = 1, 2, ..., S). The posterior predictive probability p_B, defined in Equation 3.1, can be approximated from these replicated samples. In the context of Bayesian p-values, the realised and predictive test quantities, denoted T(y, θ_s) and T(y^{rep,s}, θ_s) respectively, are compared over the S replicated draws to perform the posterior predictive checks. The proportion of predictive test quantities T(y^{rep,s}, θ_s) that are not less than the corresponding realised value T(y, θ_s) is an estimate of the Bayesian p-value. Mathematically,

\[ \hat{p}_B = \frac{1}{S} \sum_{s=1}^{S} I_{T(y^{rep,s},\, \theta_s) \ge T(y,\, \theta_s)} . \]
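The estimator above can be sketched numerically. The snippet below is an illustrative example only: it assumes a hypothetical normal model with unit variance and a normal posterior for the mean (not one of the stutter-ratio models of this thesis), and uses the mean absolute deviation from θ as the discrepancy T(y, θ).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n observations from N(theta, 1) and S posterior
# draws of theta; the conjugate posterior for the mean is N(ybar, 1/n).
n, S = 50, 1000
y = rng.normal(0.0, 1.0, size=n)                         # observed data
theta = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=S)   # posterior draws

def T(data, th):
    """Discrepancy measure T(y, theta): mean absolute deviation from theta."""
    return np.mean(np.abs(data - th))

# For each draw theta_s, simulate a replicated data set y_rep of size n
# and compare the predictive test quantity with the realised one.
exceed = 0
for s in range(S):
    y_rep = rng.normal(theta[s], 1.0, size=n)
    if T(y_rep, theta[s]) >= T(y, theta[s]):
        exceed += 1

p_B = exceed / S  # proportion of T(y_rep, theta_s) >= T(y, theta_s)
```

Because the simulated data really do come from the assumed model, the estimate p_B lands near 0.5 here; under a misspecified model it would drift towards 0 or 1.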
Since various aspects of the model can be tested using the concept of posterior predictive p-value, the selection of the test quantity is very important. The inferential aspect that is expected to be assessed by the test quantity must be in line with the practical purpose of the model.
The interpretation of posterior predictive p-values differs from that of classical p-values. In the classical approach, a p-value close to zero implies greater disagreement between the data and the statistical hypothesis being tested, while a value close to one evidences greater agreement between them. In the Bayesian approach, by contrast, an extreme posterior predictive p-value (close to 0 or 1) implies a greater discrepancy between the data and the model. However, these extreme p-values can be disregarded in situations where the misfits of the model are practically very small in comparison with the variation within the model. In general, extreme p-values should be used to identify possible departures of the test quantities from the model rather than to reject the model outright. They are very important in practice for identifying unusual observations and suggesting appropriate improvements to the model and data. A p-value close to 0.5 in a posterior predictive check indicates a better adequacy of the model to the data, except in some misleading situations. Since a test quantity based on a sufficient statistic, such as the sample variance, is fitted almost automatically by the model, such a quantity may not be capable of assessing the quality of a posterior predictive distribution; discrepancy measures of this kind generally produce p-values close to 0.5 and can be misleading. A scatter plot of T(y, θ_s) versus T(y^{rep,s}, θ_s), or a histogram of the differences T(y, θ_s) − T(y^{rep,s}, θ_s), can also be used to display the discrepancy between the data and the model. For a well-fitting model, the scatter plot should be symmetric around the line T(y, θ_s) = T(y^{rep,s}, θ_s), and the value zero should lie in the middle of the histogram.
Gelman et al. suggested the following discrepancy quantity, which corresponds to the chi-squared goodness-of-fit measure, as an omnibus goodness-of-fit test when the model parameter θ is known:

\[ D(y, \theta) = \sum_{i=1}^{n} \frac{\big( y_i - E[y_i \mid \theta] \big)^2}{\operatorname{Var}(y_i \mid \theta)} . \]
This can be calculated for both the observed data y^T = (y_1, y_2, ..., y_n) and unobserved future data ỹ^T = (ỹ_1, ỹ_2, ..., ỹ_n) as D(y, θ) and D(ỹ, θ) respectively. In the Bayesian context, where the posterior distribution of θ represents its behaviour, a p-value can be defined in the following way to evaluate the extremeness of future observations:

\[ p_D = \Pr\big( D(\tilde{y}, \theta) \ge D(y, \theta) \mid y \big). \]
As in the other posterior checks, p_D can be estimated over the posterior predictive simulations as

\[ \hat{p}_D = \frac{1}{S} \sum_{s=1}^{S} I_{D(y^{rep,s},\, \theta_s) \ge D(y,\, \theta_s)} . \]
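A minimal sketch of estimating p_D, again under a hypothetical normal model y_i ~ N(θ, 1), for which E[y_i | θ] = θ and Var(y_i | θ) = 1; the model and all numbers are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy normal model: n observations, S posterior draws of the mean theta.
n, S = 50, 1000
y = rng.normal(0.0, 1.0, size=n)
theta = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=S)

def chi_sq_discrepancy(data, th):
    """D(y, theta) = sum_i (y_i - E[y_i|theta])^2 / Var(y_i|theta)."""
    return np.sum((data - th) ** 2 / 1.0)  # unit variance in this toy model

# Proportion of simulations where the replicated discrepancy reaches
# the realised discrepancy.
count = 0
for s in range(S):
    y_rep = rng.normal(theta[s], 1.0, size=n)
    if chi_sq_discrepancy(y_rep, theta[s]) >= chi_sq_discrepancy(y, theta[s]):
        count += 1

p_D = count / S
```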
Marginal Predictive Checks
Marginal predictive distributions are calculated for each observation y_i of y^T = (y_1, y_2, ..., y_n) in the observed data and used for overall model calibration or to find possible outliers. Let us assume that y^rep_i denotes the replicated values of the ith observation in the data. Then the tail-area probability p_i corresponding to each observation y_i is calculated as

\[ p_i = \Pr\big( T(y^{rep}_i) \ge T(y_i) \mid y \big). \]
A natural discrepancy measure T(y_i) is T(y_i) = y_i when y_i is continuous. In this case the tail-area probability reduces to the computation of

\[ p_i = \Pr\big( y^{rep}_i \ge y_i \mid y \big). \]
Similar to the way the Bayesian p-value was calculated in the previous section, p_i can be estimated as

\[ \hat{p}_i = \frac{1}{S} \sum_{s=1}^{S} I_{y^{rep,s}_i \ge y_i} . \]
It is useful to perform a combined check by pooling these marginal predictive p-values into a single summary. Therefore, this study derives the following p-value, p̂_M, to estimate the overall average of the marginal predictive p-values:

\[ \hat{p}_M = \frac{1}{n} \sum_{i=1}^{n} \hat{p}_i . \]
In addition, the overall variability in marginal predictive p-values can be represented by their standard deviation.
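The marginal p-values, their pooled average, and their standard deviation can all be computed from the same matrix of replicated data. The sketch below again assumes a toy normal model with a known conjugate posterior; everything here is illustrative rather than one of the fitted stutter-ratio models.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy normal model: n observations, S posterior draws of the mean theta.
n, S = 40, 2000
y = rng.normal(0.0, 1.0, size=n)
theta = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=S)

# y_rep has shape (S, n): one replicated data set per posterior draw.
y_rep = rng.normal(theta[:, None], 1.0, size=(S, n))

# Marginal p-value for each observation: p_i = Pr(y_rep_i >= y_i | y),
# estimated as the fraction of replicates at or above the observed value.
p = (y_rep >= y[None, :]).mean(axis=0)

p_M = p.mean()        # pooled average of the marginal predictive p-values
p_sd = p.std(ddof=1)  # overall variability of the marginal p-values
```

A p_i near 0 or 1 flags observation i as a potential outlier, while p_M near 0.5 with a spread consistent with a uniform distribution suggests adequate overall calibration.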
The cross-validation predictive p-value is an alternative approach that can be used for posterior model checking. However, p-values calculated from marginal and posterior predictive checks generally reveal different behaviours. Here, the marginal distribution of y_i is calculated from all the other observations except y_i (i.e. y_{−i}). Consequently, the cross-validation p-value for y_i is defined as

\[ p_i = \Pr\big( y^{rep}_i \ge y_i \mid y_{-i} \big). \]
Replicated data can be used to estimate this in the same way as the marginal predictive p-values. Since cross-validation predictive p-values involve additional computation, their computational cost has to be considered in practice. However, in situations where new observations under exactly similar conditions of the model predictors are possible, the gap between cross-validation and the full Bayesian predictive check can be bridged. This is regarded as a mixed predictive check in Bayesian data analysis.
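In general the leave-one-out posterior must be re-estimated for every observation, but the idea can be illustrated cheaply in a hypothetical conjugate case: y_i ~ N(θ, 1) with a flat prior on θ, where the leave-one-out posterior predictive is available in closed form as y_i | y_{−i} ~ N(mean(y_{−i}), 1 + 1/(n − 1)). All model choices below are assumptions made for the sketch.

```python
import math
import numpy as np

rng = np.random.default_rng(3)

n = 30
y = rng.normal(0.0, 1.0, size=n)  # toy observed data

def normal_sf(x):
    """Upper-tail probability Pr(Z >= x) of the standard normal."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

p_cv = np.empty(n)
for i in range(n):
    y_minus_i = np.delete(y, i)                 # drop the ith observation
    m = y_minus_i.mean()                        # leave-one-out predictive mean
    s = math.sqrt(1.0 + 1.0 / (n - 1))          # leave-one-out predictive sd
    p_cv[i] = normal_sf((y[i] - m) / s)         # Pr(y_rep_i >= y_i | y_{-i})
```

For models without a closed-form predictive, each p_cv[i] would instead be estimated from replicates drawn after refitting the model to y_{−i}, which is what makes the approach expensive.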
Measuring the accuracy of predictions made by a model is a common way of evaluating models. In model assessment, various measures can be considered. For instance, a scoring function is a method for measuring the predictive accuracy of a point prediction. A value replicated using the fitted model for an observed value, representing the future observation under circumstances similar to those of the observed value, is regarded as a point prediction. The mean squared error, mean absolute error, and mean absolute percentage error of predictions are examples of simple scoring functions that can be used to evaluate the predictive accuracy of a model that is close to a normal distribution.
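These three scoring functions are straightforward to compute; the short sketch below uses small made-up vectors of observations and point predictions purely for illustration.

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error of point predictions."""
    return np.mean((y - y_pred) ** 2)

def mae(y, y_pred):
    """Mean absolute error of point predictions."""
    return np.mean(np.abs(y - y_pred))

def mape(y, y_pred):
    """Mean absolute percentage error; only sensible when no y_i is zero."""
    return 100.0 * np.mean(np.abs((y - y_pred) / y))

# Made-up observed values and corresponding point predictions.
y = np.array([10.0, 12.0, 9.0, 11.0])
y_pred = np.array([9.5, 12.5, 9.0, 10.0])

print(mse(y, y_pred))  # 0.375
print(mae(y, y_pred))  # 0.5
print(mape(y, y_pred))
```

MSE penalises large errors more heavily than MAE, which is why it is the natural companion of near-normal predictive distributions.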
The predictive accuracy of probabilistic predictions is evaluated using scoring rules such as quadratic, logarithmic, and zero-one scores. The logarithmic score is a widely used scoring rule for probabilistic predictions and for selecting models [74, 174]. Let us consider a model with parameter θ that is fitted to data y^T = (y_1, y_2, ..., y_n). Assuming independence of the data, the likelihood function is

\[ p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta), \]

so the logarithmic score of the data is the log-likelihood, log p(y | θ) = Σ_{i=1}^{n} log p(y_i | θ).
The log density of unobserved future data given the model parameters and observed data is generally referred to as the log predictive density. It is a well-known summary measure of predictive fit. For normal models with constant variance, the log predictive density is proportional to the mean squared error. In statistical model comparison the log predictive density plays a decisive role, as it is connected to the Kullback-Leibler information measure. Especially for large samples, the expected log predictive density, the Kullback-Leibler information, and the posterior probabilities are closely interconnected. The model with the lowest Kullback-Leibler information produces the highest expected log predictive density, and will have the highest posterior probability compared with the other models. Hence, the expected log predictive density is used to measure overall model fit.
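The connection between the log predictive density and the mean squared error for a fixed-variance normal model can be made concrete. The sketch below is an assumed toy example: for a normal predictive density with fixed σ, the log score is a decreasing affine function of the sum of squared errors, so ranking models by log predictive density matches ranking them by MSE.

```python
import numpy as np

def log_predictive_density(y, mu, sigma):
    """Log score of data y under a normal predictive density N(mu, sigma^2)."""
    n = len(y)
    return (-0.5 * n * np.log(2.0 * np.pi * sigma ** 2)
            - 0.5 * np.sum((y - mu) ** 2) / sigma ** 2)

# Made-up data and two candidate point-prediction models.
y = np.array([1.0, 2.0, 3.0])
lpd_good = log_predictive_density(y, mu=2.0, sigma=1.0)  # centred prediction
lpd_bad = log_predictive_density(y, mu=5.0, sigma=1.0)   # badly off-centre

# The better-centred model attains the higher log predictive density.
assert lpd_good > lpd_bad
```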
The relationship between the log predictive density and the Kullback-Leibler information measure has been discussed in the literature related to information theory (e.g. [7, 29, 74, 151, 156, 170]). The idea of measuring the conceptual distance between two models (or densities) as a directed divergence was originally introduced in 1951 by Solomon Kullback and Richard A. Leibler [110, 111]. The Kullback-Leibler (K-L) information measures the quality of approximation, or the information loss, I(f, g) [9, 101]. In a situation where one approximates the true density f(x) by g(x), where x is a q × 1 random vector, the K-L information is defined as

\[ I(f, g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx . \]

I(f, g) is always non-negative and is zero when f(x) = g(x). In model selection, the true function f(x) is treated as fixed but unknown. The function g(x) with parameter vector θ, i.e. g(x | θ), is used to approximate f(x). Then I(f, g) becomes

\[ I(f, g) = \int f(x) \log \frac{f(x)}{g(x \mid \theta)}\, dx . \]
The logarithmic term of the above equation can be expanded into a difference of two logarithmic terms:

\[ I(f, g) = \int f(x) \log f(x)\, dx - \int f(x) \log g(x \mid \theta)\, dx . \]

Both integrals are statistical expectations with respect to the true density f(x). Since f(x) is unknown, the first term cannot be computed, and hence the absolute appropriateness of g(x | θ) in approximating f(x) cannot be evaluated. Fortunately, because this term is the same for every candidate, the selection of the best model among two or more alternatives in terms of information loss is still straightforward. It is known that the inferential quantities used in the calculation of information criteria are highly conditional on the data. Hence, model comparisons cannot be made across different datasets and are completely restricted to a fixed given dataset. However, two or more models fitted to a fixed dataset can be compared. Let us assume that g_1(x | θ_1) and g_2(x | θ_2) are two models used to approximate f(x). As the Kullback-Leibler information I(f, g_i), where i = 1, 2, measures the information loss, or the closeness between the true and fitted models, the one corresponding to the lowest information loss is the better of the two. Hence, the model g_1(x | θ_1) is better than g_2(x | θ_2) in approximating f(x) if I(f, g_1) < I(f, g_2).
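This comparison can be illustrated with normal densities, for which the K-L information has the well-known closed form KL(N(μ_f, σ_f²) ‖ N(μ_g, σ_g²)) = log(σ_g/σ_f) + (σ_f² + (μ_f − μ_g)²)/(2σ_g²) − 1/2. The true density and both candidates below are hypothetical choices for the sketch.

```python
import math

def kl_normal(mu_f, sd_f, mu_g, sd_g):
    """Closed-form Kullback-Leibler information I(f, g) for two normal densities."""
    return (math.log(sd_g / sd_f)
            + (sd_f ** 2 + (mu_f - mu_g) ** 2) / (2.0 * sd_g ** 2)
            - 0.5)

# True density f = N(0, 1); two candidate approximations g1 and g2.
I_g1 = kl_normal(0.0, 1.0, 0.2, 1.0)   # g1 = N(0.2, 1), slightly off-centre
I_g2 = kl_normal(0.0, 1.0, 1.5, 1.0)   # g2 = N(1.5, 1), badly off-centre

# g1 loses less information, so it is the better approximation to f,
# and the information is exactly zero when g coincides with f.
assert I_g1 < I_g2
assert kl_normal(0.0, 1.0, 0.0, 1.0) == 0.0
```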
Finally, this expression reveals the statistical basis for using the expected log predictive density as the key quantity in model comparison. The model that produces the highest expected log predictive density, especially for large samples, attains the highest posterior probability compared with the alternative candidates.