# Akaike’s Information Criterion (AIC)


## ARIMA

ARIMA stands for autoregressive integrated moving average, and the ARIMA model is one of the most commonly used forecasting methods for time series. The model is fitted to capture the autocorrelations with earlier observations in the data (Hyndman and Athanasopoulos, 2018). In the autoregressive part of an ARIMA model, each data point is a linear regression on past values, together with an error term that captures what cannot be explained by the past values (Cryer & Chan 2008). We assume that the error term $e_t$ is independent of the past values of $Y$ throughout the entire time series and that its variance satisfies $\sigma_e^2 > 0$. The process is expressed generally as an AR($p$), i.e. an autoregressive model of order $p$: $Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + e_t$.
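
As an illustration of the AR($p$) expression, the following minimal sketch computes the one-step conditional mean of $Y_t$ from the $p$ most recent values. Because the error term has mean zero, it drops out of the prediction. The function name `ar_one_step` and the coefficient values in the usage note are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
def ar_one_step(history, phi):
    """Conditional mean of Y_t under an AR(p) model.

    history: past observations, oldest first (Y_{t-1} is the last element)
    phi:     coefficients [phi_1, ..., phi_p] (illustrative, not estimated)
    """
    p = len(phi)
    if len(history) < p:
        raise ValueError("need at least p past observations")
    # Reverse the last p values so that phi_1 pairs with Y_{t-1},
    # phi_2 with Y_{t-2}, and so on.
    recent = history[-1:-p - 1:-1]
    return sum(coef * y for coef, y in zip(phi, recent))
```

For example, `ar_one_step([1.0, 2.0, 3.0], [0.5, 0.2])` combines the two most recent values as 0.5 · 3.0 + 0.2 · 2.0.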
In contrast to the autoregressive part, the moving average part of an ARIMA model does not use past values of the variable; it uses past error terms to forecast future values (Cryer & Chan 2008). The current value is expressed by applying weights to the past error terms. A moving average process of order $q$, MA($q$), can generally be expressed as $Y_t = e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \dots + \theta_q e_{t-q}$.
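
A moving average value can be sketched in the same way. The sketch assumes the plus-sign convention with the current error $e_t$ included (some texts write the weights with minus signs instead), and the name `ma_value` and all numbers in the usage note are hypothetical.

```python
def ma_value(errors, theta):
    """Value of an MA(q) process: e_t + theta_1*e_{t-1} + ... + theta_q*e_{t-q}.

    errors: error terms, oldest first; the last element is the current e_t
    theta:  weights [theta_1, ..., theta_q] (illustrative, not estimated)
    """
    q = len(theta)
    if len(errors) < q + 1:
        raise ValueError("need the current error and q past errors")
    current = errors[-1]
    # Reverse the q errors before e_t so theta_1 pairs with e_{t-1}.
    past = errors[-2:-q - 2:-1]
    return current + sum(w * e for w, e in zip(theta, past))
```

For example, `ma_value([0.5, -1.0, 2.0], [0.4, 0.3])` weights the two past errors −1.0 and 0.5 and adds the current error 2.0.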
The ARIMA model is a general form that covers both stationary and nonstationary time series: a series follows an ARIMA model if its $d$th difference is a stationary ARMA process. Stationarity can generally be described as the condition where the properties of a time series are constant over time (Cryer & Chan 2008). For further explanation and the assumptions behind stationarity, see the mentioned reference. The model consists of both weighted lags of past values and weighted lags of error terms, with the orders written as ARIMA($p,d,q$). The entire model can be expressed concisely as $\phi(B)(1-B)^d Y_t = \theta(B) e_t$,
where $B$ is the backshift operator defined by $B Y_t = Y_{t-1}$. The term $\phi(B)$ is the AR characteristic polynomial, $\theta(B)$ is the MA characteristic polynomial and $(1-B)^d$ is the $d$th difference. The term $Y_t$ is the observed variable of interest and the $e_t$ are independent error terms with mean 0 and variance $\sigma_e^2 > 0$ (Cryer & Chan, 2008). The orders $p$ and $q$ of the ARIMA model are chosen with the function `auto.arima`, which minimizes the AICc (see Section 2.4) after the data have been differenced. The coefficients $\phi_i$ and $\theta_i$ are estimated by maximum likelihood (Hyndman and Athanasopoulos, 2018).
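
The differencing operator $(1-B)^d$ can be illustrated with a short sketch: applying $(1-B)$ once replaces each value by its change from the previous value, and repeating this $d$ times gives the $d$th difference. The function name `difference` is hypothetical and the sketch is not how `auto.arima` is implemented.

```python
def difference(series, d=1):
    """Apply (1 - B)^d to a series, i.e. difference it d times.

    Each pass shortens the series by one observation.
    """
    out = list(series)
    for _ in range(d):
        # (1 - B) Y_t = Y_t - Y_{t-1}
        out = [current - previous for previous, current in zip(out, out[1:])]
    return out
```

For example, the series 1, 4, 9, 16 has first difference 3, 5, 7 and second difference 2, 2, so a quadratic trend is removed by differencing twice.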

## Naïve Method

Forecasting with the naïve method is very simple: the forecast is set equal to the last observed value (Hyndman and Athanasopoulos, 2018), i.e. $\hat{Y}_{t+h} = Y_t$ for forecast horizon $h$.
The simplest naïve method is used in this study, with forecasts calculated as described above. There are several variants of the naïve method with small differences, e.g. the seasonal naïve method, which takes the last value from the same season, and the naïve method with drift, which takes the average past change into account (Hyndman and Athanasopoulos, 2018). These variants are not included, since the forecasting horizon is short and the mean temperature does not change noticeably over only a couple of days.
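
A minimal sketch of the simple naïve method: every forecast up to horizon $h$ repeats the last observed value. The function name `naive_forecast` and the temperatures in the usage note are hypothetical.

```python
def naive_forecast(history, h=1):
    """Return naive forecasts for horizons 1..h.

    Every forecast equals the last observed value in `history`.
    """
    if not history:
        raise ValueError("history must contain at least one observation")
    return [history[-1]] * h
```

For example, with observed daily mean temperatures `[3.1, 4.0, 2.5]`, the forecasts for the next three days are all 2.5.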

## Neural Network

Artificial neural networks are a type of machine learning model named after their resemblance to the connections in the nervous system of living beings. They can be described as networks composed of processing units, called nodes, that carry information and have many internal connections. The increasing interest in neural networks is due to their ability to learn underlying structures in data and to capture non-linear relationships (da Silva et al., 2017). Neural networks are applicable in many fields and for different purposes, including prediction, classification, and forecasting (da Silva et al., 2017).
The structure of a neural network can generally be separated into three parts. The input layer receives the data, and the hidden layers consist of nodes that are trained to carry information and patterns from the data. The last part is the output layer, which creates and presents the final output (da Silva et al., 2017); in the case of time series this is the forecast.
The simplest neural network has only an input layer and an output layer and has, in that case, the same properties as linear regression. When hidden layers are added, the neural network can also capture non-linear structures in the data (Hyndman & Athanasopoulos, 2018). Figure 2.1 above shows an example of a feed-forward network, which moves information in only one direction, i.e. it is not cyclical. The input layer in the feed-forward network receives inputs that are weighted in a linear combination and passed to the nodes in the hidden layer, where they are transformed by a non-linear function before resulting in an output (Hyndman & Athanasopoulos, 2018). When a neural network is trained, the weights $b_j$ and $w_{j,i}$ are estimated by minimizing a cost function, which in this study is the mean squared error (MSE). The input to the $j$th node in the hidden layer is calculated as $z_j = b_j + \sum_{i} w_{j,i} x_i$, where the sum runs over the inputs $x_i$, starting at $i = 1$.
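
The computation in a single hidden node can be sketched as follows. The linear combination follows the expression for $z_j$; the sigmoid is used here only as one common choice of non-linear function and is an assumption, since the text does not name the function. The names `hidden_node_input` and `sigmoid` and all numbers in the usage note are hypothetical.

```python
import math


def hidden_node_input(bias, weights, inputs):
    """Linear combination z_j = b_j + sum_i w_{j,i} * x_i for one hidden node."""
    if len(weights) != len(inputs):
        raise ValueError("one weight is needed per input")
    return bias + sum(w * x for w, x in zip(weights, inputs))


def sigmoid(z):
    """Assumed non-linear activation: maps z to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))
```

For example, `hidden_node_input(0.5, [1.0, -2.0], [3.0, 1.0])` gives 0.5 + 3.0 − 2.0, and passing the result through `sigmoid` yields the node's output.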

## Akaike’s Information Criterion (AIC)

Akaike’s Information Criterion (AIC) is one of the most commonly used information criteria. It is designed for comparing models, with the model that minimizes the AIC value being preferred (Cryer & Chan 2008). The purpose of the AIC is to estimate the relative loss of information for different models, and it is defined as $AIC = -2\log(\text{maximum likelihood}) + 2k$.
The term $k$ is the number of parameters in the model; for an ARIMA model, $k = p + q + 1$ if an intercept is included and $k = p + q$ if not. The term $2k$ serves as a penalty for overfitting the model by adding too many parameters. The AIC is, however, considered a biased estimator in small samples, which has given rise to a successor called the corrected AIC (AICc) that reduces this bias by adding one more penalty term (Cryer & Chan 2008). The AICc is defined as $AICc = AIC + \frac{2(k+1)(k+2)}{n-k-2}$, where the AIC is as defined above, $k$ is the number of parameters as before, and $n$ is the sample size. The AICc has been suggested as preferable to other model selection approaches in forecasting, especially when working with many parameters and smaller sample sizes (Cryer & Chan 2008).
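
The two criteria can be computed directly from a fitted model's maximized log-likelihood. The sketch below follows the definitions given above; the function names `aic` and `aicc` and the numbers in the usage note are hypothetical and not results from this study.

```python
def aic(log_likelihood, k):
    """AIC = -2 * log(maximum likelihood) + 2k."""
    return -2.0 * log_likelihood + 2.0 * k


def aicc(log_likelihood, k, n):
    """AICc = AIC + 2(k+1)(k+2) / (n - k - 2), with sample size n."""
    if n - k - 2 <= 0:
        raise ValueError("the correction requires n > k + 2")
    return aic(log_likelihood, k) + 2.0 * (k + 1) * (k + 2) / (n - k - 2)
```

For example, with a maximized log-likelihood of −100, $k = 3$ and $n = 25$, the AIC is 206 and the correction term is 40/20 = 2, so the AICc is 208. As $n$ grows the correction shrinks towards zero and the two criteria agree.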

## Residual Diagnostics

Once a model has been selected, its order decided and its parameters estimated, diagnostics are carried out to assess the goodness of fit to the time series. One approach is to analyze the residuals of the fitted model on the training set. A model can be said to fit well, and to be close to representing the real process, if the residuals have properties similar to white noise (Cryer & Chan 2008). Examining whether the residuals are close to white noise ensures that no important patterns in the data have been left out of the fitted model. The autocorrelation of the residuals is therefore investigated to check their independence (Cryer & Chan 2008). This is done both visually, for individual lags in an autocorrelation function (ACF) plot, and for the overall extent of autocorrelation across lags with the Ljung-Box test. The Ljung-Box test is based on the statistic $Q^* = n(n+2)\sum_{k=1}^{h} (n-k)^{-1} r_k^2$,
where $r_k$ is the autocorrelation for lag $k$, $n$ is the number of observations in the training set and $h$ is the largest lag considered from the ACF (Hyndman and Athanasopoulos, 2018). Using $h = 10$ is suggested as a rule of thumb, since too many lags can be bad for the test (Hyndman and Athanasopoulos, 2018). The null hypothesis of the test is that the residuals are indistinguishable from white noise, and the alternative hypothesis is that they are distinguishable from white noise. A large $Q^*$ gives a small p-value and leads to rejection of the null hypothesis.
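
The Ljung-Box statistic can be computed from the residuals in a few lines. The sketch below uses the usual sample autocorrelation and returns only $Q^*$; turning it into a p-value requires a chi-squared distribution, which is left out here. The function names are hypothetical.

```python
def autocorrelation(residuals, lag):
    """Sample autocorrelation r_k of the residuals at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((r - mean) ** 2 for r in residuals)
    if denom == 0:
        raise ValueError("residuals are constant; autocorrelation is undefined")
    num = sum(
        (residuals[t] - mean) * (residuals[t - lag] - mean)
        for t in range(lag, n)
    )
    return num / denom


def ljung_box_q(residuals, h=10):
    """Q* = n(n+2) * sum_{k=1}^{h} r_k^2 / (n - k)."""
    n = len(residuals)
    if h >= n:
        raise ValueError("h must be smaller than the number of residuals")
    total = sum(
        autocorrelation(residuals, k) ** 2 / (n - k) for k in range(1, h + 1)
    )
    return n * (n + 2) * total
```

As a small check, the strongly alternating residuals `[1, -1, 1, -1]` have $r_1 = -0.75$, so with $h = 1$ the statistic is $4 \cdot 6 \cdot 0.5625 / 3 = 4.5$.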

## Cross-validation

When forecasting with a horizon of one or just a few steps into the future, time-series cross-validation can be used to include many point forecasts in the evaluation (Hyndman and Athanasopoulos, 2018). Specifically, this study uses walk forward validation with an expanding window. Walk forward validation includes many forecasts with a short horizon by iteratively making point forecasts one step at a time, using multiple overlapping training sets. The expanding window means that the training set grows with every new forecast while keeping all the observations from the original training set. The procedure is iterative and can be divided into the following four steps (Brownlee, 2016).
1. The different models are first estimated on the training set.
2. The models are used to make a point forecast with forecast horizon $h$ at the point $t$, where $t$ is the last point in the training set.
3. Once the value of $Y_{t+h}$ has been predicted, the predicted value is compared with the known real value from the test set.
4. For the next forecast, the training set is expanded by including the observation at $t+1$, and steps 1 to 4 are repeated until the entire test set has been covered.

Cross-validation can be used for both one-step and multi-step forecasts. In this study, the original training set consists of 730 observations corresponding to all days in 2017 and 2018, and the test set consists of the days of the first three months of 2019, corresponding to 90 observations, as will be described in Section 3. When the walk forward validation is applied to the dataset in this study, the original models are estimated on the training set and predictions are made for the following $90-(h-1)$ observations in the test set. The forecast horizons included in the study are $h = 1, 2, 3, 5$.
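
The four steps can be sketched as a loop over an expanding training window. This is a minimal illustration, not the study's implementation: the names `walk_forward` and `naive_h_step` are hypothetical, and a naïve forecaster stands in for the fitted models so that the example is self-contained.

```python
def walk_forward(series, n_train, h, forecast_fn):
    """Expanding-window walk forward validation.

    series:      all observations, training set first, then test set
    n_train:     size of the original training set
    h:           forecast horizon
    forecast_fn: callable (train, h) -> point forecast of the value h steps ahead
    Returns a list of (target_index, forecast, actual) tuples.
    """
    results = []
    # t is the 0-based index of the last observation in the training set.
    for t in range(n_train - 1, len(series) - h):
        train = series[: t + 1]          # steps 1 and 4: (re)fit on the expanded set
        forecast = forecast_fn(train, h)  # step 2: point forecast for t + h
        results.append((t + h, forecast, series[t + h]))  # step 3: compare
    return results


def naive_h_step(train, h):
    """Stand-in model: the forecast is the last observed training value."""
    return train[-1]
```

With 730 training observations and 90 test observations, the loop produces $90-(h-1)$ forecasts, for example 88 forecasts when $h = 3$.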

## Forecasting Accuracy Measures

Two different measures of forecasting accuracy are used: one scale-dependent measure and one that can be used to compare forecasting accuracy between time series on different scales.

### Mean Absolute Error (MAE)

The scale-dependent accuracy measure used is the mean absolute error (MAE). The MAE is an easily interpreted measure that can be used to compare different forecasting approaches applied to the same time series, or to time series measured in the same unit (Hyndman and Athanasopoulos, 2018). The MAE is calculated as $MAE = \text{mean}(|e_t|) = \frac{1}{T}\sum_{t=1}^{T} |y_t - \hat{y}_t|$, where $T$ is the number of forecasts.
Even though MAE is restricted to the same time series for comparison, it is meaningful to use because of the easy and direct interpretation of the measurement.
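
A minimal sketch of the MAE calculation, averaging the absolute forecast errors over all forecasts. The function name `mean_absolute_error` and the values in the usage note are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """MAE: the mean of |y_t - y_hat_t| over all forecasts."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equally long")
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)
```

For example, actual values `[2.0, 4.0, 6.0]` and forecasts `[1.0, 5.0, 8.0]` give absolute errors 1, 1 and 2, so the MAE is 4/3 in the same unit as the data.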

1. Introduction
2. Method
2.1 ARIMA
2.2 Naïve Method
2.3 Neural Network
2.4 Akaike’s Information Criterion (AIC)
2.5 Residual Diagnostics
2.6 Cross-validation
2.7 Forecasting Accuracy Measures
2.8 Diebold-Mariano Test
3. Data
4. Results
4.1 One-step Forecasting
4.2 Multi-step Forecasting
5. Discussion
Bibliography
