Tooth Growth in Guinea Pigs


Theoretical Models

In order to investigate whether the RF approach can bring forecast improvements to economics we first need to provide a brief overview of the theoretical framework. We first give a brief account of the econometric method that is commonly used, which serves as our benchmark for evaluation. We then describe the RF algorithm in a stylized manner for easier comprehension.

Autoregressive Process

Often when evaluating econometric models we use simple linear models as benchmarks. According to Marcellino (2008) the simplest linear time series models are still, in a time when more sophisticated models are abundant, justified and perform well when tested against alternative models, as long as they are well specified. The autoregressive process (AR) of order p is such a simple time series model, which can be written as

$$Y_t = \sum_{i=1}^{p} \phi_i Y_{t-i} + u_t, \qquad (1)$$

where $\phi_1, \ldots, \phi_p$ are constants and $u_t$ is a Gaussian white noise term. The key assumption in this model is that the lagged values $Y_{t-i}$ can explain the behavior of $Y$ at time $t$. For this model to be stationary the roots of the characteristic equation must lie outside the unit circle; in the AR(1) case this reduces to $|\phi_1| < 1$. Otherwise the series will explode as $t$ increases.
This type of model is frequently used in time series analysis and will therefore serve as our benchmark model for evaluating the performance of the RF. For more on the properties of the AR(p) process, see Asteriou and Hall (2011).
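As a minimal illustration of the AR(p) process, the following R sketch simulates a stationary AR(2) series and recovers its coefficients with the built-in arima() function; the coefficient values and series length are illustrative assumptions, not estimates from our data.

```r
# Sketch: simulate a stationary AR(2) process and recover its coefficients.
# The coefficients 0.5 and 0.3 are illustrative, not taken from the thesis data.
set.seed(123)
y <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 200)  # Gaussian white noise u_t by default

fit <- arima(y, order = c(2, 0, 0), include.mean = TRUE)  # AR(2) with an intercept
print(fit)  # estimated phi_1, phi_2 and the intercept
```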

Example: Tooth Growth in Guinea Pigs

The RF method is based on regression trees, and to illustrate how regression trees work we begin with an example. We use a dataset provided with R, originally from C. I. Bliss (1952). It consists of observations on the tooth length (in mm) of 10 guinea pigs at three different dose levels of vitamin C (0.5, 1 and 2 mg) and two different delivery methods, orange juice (OJ) and ascorbic acid (VC). The reason we can make repeated observations on the same guinea pigs is that their teeth are worn down when eating and as a result grow continuously. We fit the data to a regression tree with tooth length as the response variable and dosage and delivery method as predictor variables. The regression tree diagram is shown in Figure 2.
At each node in Figure 2 we have the splitting criterion. Observations that satisfy the criterion go to the left and those that do not go to the right. The node number is given on top, and the ellipse contains the average predicted tooth length of the observations that fall into that node. The node number corresponds to an output table provided by R, a sample of which is found in Table 2. To illustrate, the interpretation of node number 5 is the following: the predicted tooth length of a guinea pig on a vitamin C dose of less than 0.75 mg delivered as orange juice is 13.23 mm, with an MSE of 17.900.
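For reference, a tree like the one in Figure 2 can be grown with the rpart package in R as sketched below; the control settings behind the thesis figure are not stated, so the package defaults are assumed.

```r
# Sketch: regression tree for the ToothGrowth data shipped with R.
# Response: tooth length (len); predictors: dose (mg) and delivery method (supp: OJ/VC).
library(rpart)

data(ToothGrowth)
tree <- rpart(len ~ dose + supp, data = ToothGrowth, method = "anova")

print(tree)               # node numbers, split criteria and mean prediction per node
plot(tree, margin = 0.1)  # tree diagram, cf. Figure 2
text(tree, use.n = TRUE)  # annotate splits and show node sizes
```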

An Introduction to Random Forests

The RF was introduced by Breiman (2001a) as an extension of his previous work on bagging (Breiman, 1996). It is an algorithm that can handle both high-dimensional classification and regression, which has made it one of the most popular methods in data mining. The method is widely used in different fields, such as biostatistics and finance, although it has not been applied to any greater extent in the field of economics. The algorithm itself is still to some extent unknown from a mathematical viewpoint and only stylized outlines are presented in textbooks and articles (Biau and D'Elia, 2011). Work is ongoing to map the algorithm, but most papers focus on just parts of it at a time. The outline of the RF method in this section draws heavily on Zhang and Ma (2012) and the chapter on RFs by Adele Cutler, D. Richard Cutler, and John R. Stevens. The notation used by them is reproduced here.

A Forest Made of Trees

As the name itself suggests, the RF is a tree-based ensemble method where all trees depend on a collection of random variables. That is, the forest is grown from many regression trees put together, forming an ensemble. We can formally describe this with a p-dimensional random vector $X = (X_1, X_2, \ldots, X_p)^T$ representing the real-valued predictor variables and $Y$ representing the real-valued response variable. Their joint distribution $P_{XY}(X, Y)$ is assumed unknown. This is one advantage of the RF: we do not need to assume any distribution for our variables. The aim of the method is to find a prediction function $f(X)$ to predict $Y$. This is done by estimating the conditional expectation

$$f(x) = E(Y \mid X = x), \qquad (2)$$

known as the regression function. Generally, ensemble methods construct $f$ from a collection of « base learners » $h_1(x), h_2(x), \ldots, h_J(x)$, which are combined into the « ensemble predictor »

$$f(x) = \frac{1}{J} \sum_{j=1}^{J} h_j(x). \qquad (3)$$
In the RF the jth base learner is a regression tree, which we denote $h_j(X, \Theta_j)$, where $\Theta_j$ is a collection of random variables that are independent for $j = 1, 2, \ldots, J$.
In RFs the trees are binary recursive partitioning trees. They partition the predictor space through a sequence of binary partitions, or « splits », on individual variables, which form the branches of the tree. The « root » node of the tree is made up of the entire predictor space. Nodes that are not split are called « terminal nodes » or « leaves », and these form the final partition of the predictor space. Each nonterminal node is split into two descendant nodes, one to the left and one to the right, according to the value of one of the predictor variables and a splitting criterion called a « split point ». Observations with a value of the predictor variable smaller than the split point go to the left and the rest go to the right.
The split of a node is chosen by considering every possible split on every predictor variable and then selecting the « best » according to some splitting criterion. If the response values at the node are $y_1, y_2, \ldots, y_n$, then a common splitting criterion is the mean squared residual at the node,

$$Q = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad (4)$$

where $\bar{y}$ is the average response value at the node. The splitting criterion provides a « goodness of fit » measure, with large values representing poor fit and vice versa. A possible split creates two descendant nodes, one on the left and one on the right. If we denote the splitting criterion for the possible descendants by $Q_L$ and $Q_R$ and their respective sample sizes by $n_L$ and $n_R$, then the split is chosen to minimize

$$Q_{\text{split}} = n_L Q_L + n_R Q_R. \qquad (5)$$
Finding the best possible split means sorting the values of the predictor variable and then considering every distinct pair of adjacent values. Once the best possible split is found, the data are partitioned into the two descendant nodes, which are in turn split in the same way as the original node. This procedure is recursive and stops when a stopping criterion is met, for example that a specified number of unsplit nodes should remain. The unsplit nodes remaining when the stopping criterion is met are the terminal nodes. A predicted value for the response variable is then obtained as the average of the response values in the terminal node that an observation falls into.
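The split search just described can be sketched in a few lines of R for a single numeric predictor; the function and variable names below are ours, chosen purely for illustration.

```r
# Sketch: exhaustive split search on one numeric predictor, minimizing
# Q_split = n_L * Q_L + n_R * Q_R, with Q the mean squared residual in a node.
node_mse <- function(y) mean((y - mean(y))^2)

best_split <- function(x, y) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between adjacent distinct values
  q <- sapply(cuts, function(cut) {
    left  <- y[x <= cut]
    right <- y[x >  cut]
    length(left) * node_mse(left) + length(right) * node_mse(right)
  })
  list(split_point = cuts[which.min(q)], Q_split = min(q))
}

# Toy usage with the guinea pig data from the earlier example:
data(ToothGrowth)
best_split(ToothGrowth$dose, ToothGrowth$len)
```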

The Random Forest

Up until now we have described the theoretical workings of one regression tree. Now we can begin to understand the RF. As mentioned before, the RF uses trees $h_j(X, \Theta_j)$ as base learners. We define a training set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, in which $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,p})^T$ denotes the p predictor variables and $y_i$ is the response; given a specific realization $\theta_j$ of the randomness component $\Theta_j$, the fitted tree is denoted $h_j(x, \theta_j, D)$. This is the original formulation by Breiman (2001a), but we do not consider $\Theta_j$ directly; rather, it enters implicitly as we inject randomness into the forest in two ways. First, every tree is fitted to an independent bootstrap sample of the original dataset. This is the first part of the randomness. The second part comes from splitting the nodes: instead of each split being considered over all p predictor variables, we use a random subset of m predictors. This means that different, randomly selected predictor variables are considered at different splits and in different trees.
When drawing the bootstrap sample $D_j$ of size N from the training set, some observations are left out and do not make it into the sample. These observations are called « out-of-bag data » and are used to estimate the generalization error (to guard against overfitting) and the variable importance measure described in section 3.7.
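As a sketch of how this looks in practice, the randomForest package handles both the bootstrap sampling and the random predictor subsets internally and reports the out-of-bag error; the data below are simulated placeholders, not the thesis data.

```r
# Sketch: the randomForest package draws the bootstrap samples and the random
# predictor subsets internally; the out-of-bag (OOB) observations give an
# estimate of the generalization error. Data here are simulated placeholders.
library(randomForest)

set.seed(1)
X <- data.frame(matrix(rnorm(100 * 20), nrow = 100))  # 20 illustrative predictors
y <- X[[1]] + 0.5 * X[[2]] + rnorm(100)

rf <- randomForest(x = X, y = y, ntree = 500)
rf$mse[rf$ntree]   # OOB mean squared error after all 500 trees
```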

Tuning

Generally the RF is not sensitive to its settings and requires little to no tuning, unlike other ensemble methods. There are three parameters that can be tuned to improve performance if necessary: m, the number of randomly selected predictor variables considered at each node split; J, the number of trees grown in the forest; and the tree size, measured e.g. by the maximum number of terminal nodes.
The only parameter that seems to be sensitive when using RFs is m, the number of predictors considered at each node. When using RFs for regression, m is commonly chosen as $m = p/3$, where p is the total number of predictor variables (Zhang and Ma, 2012, p. 167). The potential problem that requires tuning is overfitting, but as Díaz-Uriarte and Alvarez de Andrés (2006) found, the effects of overfitting are small. Many ensemble methods tend to overfit when J becomes large. As Breiman (2001a) shows, this is almost a non-issue with RFs; the number of trees can be large without consequence. Breiman (2001a) shows that the generalization error converges almost surely as the number of trees grows, so the only concern is that J should not be too small.
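A minimal tuning sketch is the following loop over a few values of m, comparing the out-of-bag error; the grid and data are illustrative assumptions, not the settings used in the thesis.

```r
# Sketch: compare the OOB error for a few values of m (the mtry argument).
# The default for regression is floor(p / 3); the grid below is illustrative.
library(randomForest)

set.seed(1)
X <- data.frame(matrix(rnorm(100 * 20), nrow = 100))
y <- X[[1]] + 0.5 * X[[2]] + rnorm(100)

for (m in c(2, 6, 13, 20)) {
  rf <- randomForest(x = X, y = y, ntree = 500, mtry = m)
  cat("mtry =", m, " OOB MSE =", round(rf$mse[rf$ntree], 3), "\n")
}
```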


Variable Importance

When dealing with regression in many dimensions we can use principal components analysis to reduce the number of variables to include. The RF has its own method for determining which predictors are the most important. To measure the importance of variable k, the out-of-bag data are first passed down the tree and predictions are computed. Then the values of variable k are randomly permuted in the out-of-bag data while keeping all other predictors fixed. Next, the modified out-of-bag data are passed down the tree and a new set of predictions is computed. The difference in the MSE of the predictions between the two sets, the real one and the permuted one, is then obtained. The larger this difference is, the more important the variable is deemed to be for the response.
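A sketch of how this measure is obtained with the randomForest package is given below; setting importance = TRUE requests the permutation-based measure (reported as %IncMSE), and the data are simulated placeholders.

```r
# Sketch: permutation-based variable importance. importance = TRUE makes the
# forest permute each predictor in the OOB data and record the resulting
# increase in MSE (%IncMSE). Data are simulated placeholders.
library(randomForest)

set.seed(1)
X <- data.frame(matrix(rnorm(100 * 20), nrow = 100))
y <- X[[1]] + 0.5 * X[[2]] + rnorm(100)

rf  <- randomForest(x = X, y = y, ntree = 500, importance = TRUE)
imp <- importance(rf, type = 1)                           # type = 1: permutation importance
imp[order(imp, decreasing = TRUE)[1:10], , drop = FALSE]  # ten most important predictors
```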

Software

When estimating the RF and calculating the variable importance measure we use the open source statistical software R. The code produced in order to obtain our results is included in the appendix.

Empirical Findings

Before we can estimate our benchmark model and the RF, we must establish that the benchmark will perform well. In order to have the best possible model for comparison, the time series it is based on must be stationary and its process correctly specified. In the interest of a fair comparison, we begin this section by investigating the GDP time series that we aim to predict.

Stationarity

A stationary process is a stochastic process whose joint probability distribution does not change over time. This means that if a process is stationary, its mean and variance are also constant over time, and the process can be described as being in statistical equilibrium. The assumption of stationarity is important in order to make statistical inference based on the observed record of the process (Cryer and Chan, 2008).
To find out whether the quarter-on-quarter GDP series is stationary we test it formally using the Augmented Dickey-Fuller (ADF) test. The ADF test tests for a unit root in the time series, with the alternative hypothesis that the series is stationary. When choosing the lag length for the test we use the Bayesian information criterion (BIC). The result of the ADF test proves to be significant and is presented in Table 3. This means that the null hypothesis of a unit root is rejected at the one percent significance level, and we thus conclude that the series is stationary. Choosing the lag length with other criteria, such as the Akaike information criterion (AIC), gives the same result. We therefore conclude that the choice of information criterion is more or less arbitrary and present only the result based on the BIC.
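A sketch of the test in R is given below; gdp_growth is a placeholder for the actual series (not reproduced here), so a simulated stationary series stands in, and the urca package is assumed for BIC-based lag selection.

```r
# Sketch: ADF test for a unit root in quarterly GDP growth. gdp_growth is a
# placeholder for the actual series, so a simulated stationary series stands in.
library(tseries)   # adf.test()
library(urca)      # ur.df(), which can pick the lag length by BIC

set.seed(1)
gdp_growth <- ts(arima.sim(list(ar = 0.4), n = 73),
                 start = c(1996, 2), frequency = 4)   # 1996Q2-2014Q2, 73 quarters

adf.test(gdp_growth)                                  # H0: unit root; HA: stationary
summary(ur.df(gdp_growth, type = "drift", lags = 8, selectlags = "BIC"))
```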

Autocorrelation

In order to best estimate and forecast the GDP growth series we investigate the correlograms for the autocorrelation function (ACF) and the partial autocorrelation function (PACF). These are shown in Figure 3.
In the correlogram of the ACF we notice a rapidly decreasing pattern in the spikes, which indicates an autoregressive process with short memory. In the correlogram of the PACF we have one significant spike at the first lag. Weighing these two results together, we conclude that the observed process is an AR of order one.
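The correlograms can be produced with the base R functions acf() and pacf(), as sketched below with the same placeholder series as above.

```r
# Sketch: ACF and PACF correlograms for the GDP growth series (placeholder data).
set.seed(1)
gdp_growth <- ts(arima.sim(list(ar = 0.4), n = 73),
                 start = c(1996, 2), frequency = 4)

par(mfrow = c(1, 2))
acf(gdp_growth, main = "ACF")    # geometric decay points to a short-memory AR
pacf(gdp_growth, main = "PACF")  # a single spike at lag 1 points to an AR(1)
```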

AR(1)

Having examined the properties of the GDP growth series, we are now ready to estimate the benchmark model. We begin by partitioning our dataset into two parts: a training set, which we use for building the model, and a test set, over which we evaluate by forecasting. The training set consists of the observations from the second quarter of 1996 to the third quarter of 2010. The test set thus contains the observations from the fourth quarter of 2010 to the second quarter of 2014. We have chosen to divide the dataset this way to have approximately 80 percent of the observations in the training set and 20 percent in the test set. We fit an AR(1) with a drift component to the GDP growth series over the training set.
The output shows the $\phi$-coefficient to be 0.381, which is reflected in the ACF correlogram. Once the model is fitted we can use what it has learned to forecast over our test set. We use the built-in prediction function available in R and arrive at an RMSE of 0.949 for the benchmark model. This value is interpreted, compared and further elaborated on in Table 7 in section 4.6.
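The estimation and forecast can be sketched as follows; the series is again a simulated placeholder, so the coefficient and RMSE it produces will not match the values reported above.

```r
# Sketch: AR(1) with drift on the training window, forecast over the test window.
# gdp_growth is a simulated placeholder, so the numbers will differ from the text.
set.seed(1)
gdp_growth <- ts(arima.sim(list(ar = 0.4), n = 73),
                 start = c(1996, 2), frequency = 4)

train <- window(gdp_growth, end = c(2010, 3))     # 1996Q2-2010Q3, roughly 80 percent
test  <- window(gdp_growth, start = c(2010, 4))   # 2010Q4-2014Q2, roughly 20 percent

ar1 <- arima(train, order = c(1, 0, 0), include.mean = TRUE)  # intercept acts as the drift
fc  <- predict(ar1, n.ahead = length(test))$pred

sqrt(mean((test - fc)^2))   # test-set RMSE of the benchmark
```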

Random Forest

The RF approach is very well suited to this kind of estimation. We have a high-dimensional regression problem ($n \ll p$), since we have 58 observations in the training set and p = 466 predictor variables to choose from.
We begin by partitioning the dataset in the same way as for the AR estimation. We estimate the model with J = 500 trees, and the algorithm randomly selects m = 155 predictor variables to compare at each node split instead of the p = 466 available. We obtain an $R^2$ of 27.96 percent, which means that approximately 28 percent of the variation in the quarterly GDP growth rate is explained by the variables included in the model. Table 4 lists the ten most important predictors as determined by the algorithm.
The importance measures listed in Table 4 are calculated as described in section 3.7. The mean decrease is the mean decrease in prediction accuracy, as measured by the MSE, if the predictor were removed from the model. The most important predictor of the quarterly GDP growth rate according to the RF algorithm is a difference series of consumer question three, alternative two. This question regards consumers and reads « [h]ow do you think the general economic situation in the country has changed over the past 12 months? », with the second alternative being « got a little better ». All questions and alternatives are fully outlined in the appendix. When examining the full output ranking the predictor variables, we note that the difference in importance between one variable and the next decreases rapidly after the tenth variable. Beyond this point it is hard to motivate where to draw the line for how many variables to include, which is why we choose to include the ten most relevant variables.
Again we use the fitted model to forecast the GDP growth rate over the test set. The RMSE is 0.753 for the RF model. This value is interpreted, compared and further elaborated on in Table 7 in section 4.6.
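A sketch of the forest estimation and test-set forecast is given below; the predictor matrix is a simulated stand-in for the 466 survey series, so the resulting $R^2$ and RMSE are not those reported in the text.

```r
# Sketch: random forest with J = 500 trees and the default m = floor(p / 3),
# fitted on the training window and used to forecast over the test window.
# The predictor matrix is a simulated stand-in for the 466 survey series.
library(randomForest)

set.seed(1)
n <- 73; p <- 466
X <- data.frame(matrix(rnorm(n * p), nrow = n))
y <- 0.4 * X[[1]] + rnorm(n)                # stand-in for quarterly GDP growth

train_idx <- 1:58                           # roughly 80 percent of the observations
rf <- randomForest(x = X[train_idx, ], y = y[train_idx],
                   ntree = 500, mtry = floor(p / 3))

rf$rsq[rf$ntree] * 100                      # OOB percent of variance explained
pred <- predict(rf, newdata = X[-train_idx, ])
sqrt(mean((y[-train_idx] - pred)^2))        # test-set RMSE of the RF
```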

Ad Hoc Linear Model

We have used the RF approach to find the key variables explaining the GDP growth rate in our dataset. We now wish to estimate a linear model containing these variables and use it to predict the GDP growth rate. To this end we use the same partitioning as before, where the training set consists of approximately 80 percent of the observations and the test set of the remaining 20 percent. Estimation of the model is done in two steps. First, we estimate the model using the ten variables ranked by the RF as the most important in explaining GDP. Second, we keep the variables that prove to be significant and estimate the reduced model. The result of this procedure can be seen in Table 5, where we present the coefficients and p-values for each variable in the two models.
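The two-step procedure can be sketched as follows; the predictors are simulated placeholders, and the five percent significance level used to reduce the model is our assumption, since the thesis does not state the cutoff explicitly.

```r
# Sketch: two-step ad hoc linear model. Step 1 regresses GDP growth on the ten
# predictors ranked most important by the forest; step 2 keeps those significant
# at the 5 percent level (an assumed cutoff) and re-estimates. Placeholder data.
library(randomForest)

set.seed(1)
n <- 73; p <- 466
X <- data.frame(matrix(rnorm(n * p), nrow = n))
y <- 2 * X[[1]] + 0.5 * X[[2]] + rnorm(n)   # stand-in for quarterly GDP growth
train_idx <- 1:58

rf    <- randomForest(x = X[train_idx, ], y = y[train_idx],
                      ntree = 500, importance = TRUE)
imp   <- importance(rf, type = 1)
top10 <- rownames(imp)[order(imp, decreasing = TRUE)[1:10]]

train_df <- data.frame(y = y[train_idx], X[train_idx, top10])
full_lm  <- lm(y ~ ., data = train_df)                 # step 1: full ten-variable model

pvals      <- summary(full_lm)$coefficients[-1, 4]     # p-values, intercept excluded
keep       <- names(pvals)[pvals < 0.05]
reduced_lm <- lm(y ~ ., data = train_df[, c("y", keep), drop = FALSE])  # step 2
summary(reduced_lm)
```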

Table of contents :

1 Introduction 
2 Data 
3 Theoretical Models 
3.1 Autoregressive Process
3.2 Example: Tooth Growth in Guinea Pigs
3.3 An Introduction to Random Forests
3.4 A Forest Made of Trees
3.5 The Random Forest
3.6 Tuning
3.7 Variable Importance
3.8 Software
4 Empirical Findings 
4.1 Stationarity
4.2 Autocorrelation
4.3 AR(1)
4.4 Random Forest
4.5 Ad Hoc Linear Model
4.6 Evaluation
5 Conclusion 
6 References 
7 Appendix 
7.1 R Code
7.2 Survey Questionnaire
7.2.1 Industry quarterly questions
7.2.2 Retail quarterly questions
7.2.3 Consumers monthly questions
7.2.4 Confidence indicators
