Multiple Linear Regression

Multiple Linear Regression (MLR) is a supervised learning technique used to estimate the relationship between one dependent variable and two or more independent variables. Identifying these relationships, and the cause and effect behind them, makes it possible to predict the dependent variable from the independent ones [4]. When estimating these relationships, the prediction accuracy of the model is essential, and the complexity of the model is of further interest. However, Multiple Linear Regression is prone to problems such as multicollinearity, noise, and overfitting, all of which affect the prediction accuracy.
Regularised regression plays a significant part in Multiple Linear Regression because it helps to reduce variance at the cost of introducing some bias, avoids the overfitting problem, and addresses the problems of ordinary least squares (OLS) estimation. There are two main types of regularisation: the L1 norm, which penalises the absolute values of the coefficients, and the L2 norm, which penalises their squares. L1 and L2 therefore penalise model complexity through different cost functions [5].

Lasso Regression

Least Absolute Shrinkage and Selection Operator (Lasso) is an L1-norm regularised regression technique that was formulated by Robert Tibshirani in 1996 [6]. Lasso is a powerful technique that performs both regularisation and feature selection. Like Ridge regression, Lasso introduces some bias through a penalty term, but instead of the squared slope it adds the absolute value of the slope. Lasso is defined as: $L = \min\left(\text{sum of squared residuals} + \alpha \cdot |\text{slope}|\right)$ (1).
Where $\min(\text{sum of squared residuals})$ is the Least Squared Error, and $\alpha \cdot |\text{slope}|$ is the penalty term. $\alpha$ is the tuning parameter that controls the strength of the penalty term; in other words, it sets the amount of shrinkage. $|\text{slope}|$ is the sum of the absolute values of the coefficients [7].
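To make equation (1) concrete, the following minimal sketch evaluates the Lasso objective for a given set of coefficients. The function name `lasso_objective`, the intercept argument, and the toy data are assumptions introduced for illustration; the sketch only evaluates the objective and does not minimise it.

```python
def lasso_objective(X, y, intercept, coefs, alpha):
    """Sum of squared residuals plus alpha times the sum of |coefficients|."""
    ssr = 0.0
    for row, target in zip(X, y):
        # Linear prediction: intercept + sum of coefficient * feature value.
        prediction = intercept + sum(c * x for c, x in zip(coefs, row))
        ssr += (target - prediction) ** 2
    # L1 penalty: the intercept is conventionally left unpenalised.
    penalty = alpha * sum(abs(c) for c in coefs)
    return ssr + penalty


# Hypothetical toy data: three observations, two features.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 4.0, 8.0]
coefs = [2.0, 1.0]

print(lasso_objective(X, y, 0.0, coefs, alpha=0.0))  # plain least squares error
print(lasso_objective(X, y, 0.0, coefs, alpha=0.5))  # least squares error + penalty
```

With alpha set to zero the value is just the sum of squared residuals; any positive alpha adds a cost proportional to the total absolute size of the coefficients.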
Cross-validation is a technique used to compare different machine learning models and to estimate how they will perform in practice. It divides the data into blocks. Each block in turn is used for testing, while the remaining blocks are used for training the model. In the end, the results from all blocks are summarised, and the model or parameter setting that performs best is chosen [8]. In Lasso, $\alpha$ is determined by using cross-validation. When $\alpha = 0$, Lasso reduces to the Least Squared Error, and when $\alpha > 0$, the magnitudes of the coefficients are penalised, which can drive some of them to exactly zero. There is an inverse relationship between $\alpha$ and the upper bound $t$ on the sum of the absolute values of the coefficients. When $t \to \infty$, the tuning parameter $\alpha = 0$. Conversely, when $t = 0$, the coefficients shrink to zero and $\alpha \to \infty$ [7]. Therefore, Lasso assigns zero weights to the most redundant or irrelevant features in order to enhance the prediction accuracy and interpretability of the regression model.
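The block-wise procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions: a single feature with no intercept, for which the Lasso solution has a closed form via soft-thresholding, contiguous blocks, and mean squared error as the score. All function names and the toy data are hypothetical.

```python
def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def fit_lasso_1d(xs, ys, alpha):
    """Minimise sum((y - b*x)^2) + alpha*|b| for one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, alpha / 2.0) / sxx


def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k contiguous blocks of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cv_error(xs, ys, alpha, k):
    """Average test mean squared error over the k blocks."""
    total = 0.0
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        b = fit_lasso_1d(train_x, train_y, alpha)
        total += sum((ys[i] - b * xs[i]) ** 2 for i in test_idx) / len(test_idx)
    return total / k


def select_alpha(xs, ys, alphas, k=3):
    """Return the candidate alpha with the lowest cross-validated error."""
    return min(alphas, key=lambda a: cv_error(xs, ys, a, k))


# Hypothetical noise-free data with y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
print(select_alpha(xs, ys, [0.0, 1.0, 10.0], k=3))
```

Because the toy data contain no noise, the smallest penalty wins here; on noisy data with irrelevant features a positive alpha would often score better.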
Throughout the process of feature selection, the variables that still have non-zero coefficients after the shrinking process are selected to be part of the regression model [7]. Therefore, Lasso is powerful when it comes to feature selection and reducing overfitting.

Ridge Regression

Ridge Regression is an L2-norm regularised regression technique that was introduced by Hoerl in 1962 [9]. It is an estimation procedure for managing collinearity without removing variables from the regression model. In multiple linear regression, multicollinearity is a common problem: the least squares estimates remain unbiased, but their variances are so large that the estimates may be far from the true values. Therefore, by adding a degree of bias to the regression model, Ridge Regression reduces the standard errors and shrinks the least squares coefficients towards the origin of the parameter space [10]. The Ridge formula is: $R = \min\left(\text{sum of squared residuals} + \alpha \cdot \text{slope}^2\right)$ (2).
Where $\min(\text{sum of squared residuals})$ is the Least Squared Error, and $\alpha \cdot \text{slope}^2$ is the penalty term that Ridge adds to the Least Squared Error.
When the Least Squared Error determines the parameter values, it minimises the sum of squared residuals. When Ridge determines the parameter values, it minimises the sum of squared residuals plus the penalty term, where $\alpha$ determines the severity of the penalty and therefore the size of the slope. Increasing $\alpha$ makes the slope approach zero asymptotically. As with Lasso, $\alpha$ is determined by applying cross-validation. Therefore, Ridge helps to reduce variance by shrinking the parameters, which makes the predictions less sensitive to the training data.
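The asymptotic shrinkage can be illustrated with a minimal sketch. It assumes a single feature and no intercept, in which case minimising equation (2) gives the closed form slope = sum(x*y) / (sum(x^2) + alpha). The function name `fit_ridge_1d` and the toy data are hypothetical.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Minimise sum((y - b*x)^2) + alpha*b^2 for one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative to zero gives b * (sxx + alpha) = sxy.
    return sxy / (sxx + alpha)


# Hypothetical data with y = 2x, so the least squares slope is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for alpha in [0.0, 1.0, 14.0, 1000.0]:
    print(alpha, fit_ridge_1d(xs, ys, alpha))
```

The slope decreases as alpha grows but stays strictly positive for any finite alpha, in contrast to Lasso, which can set a coefficient to exactly zero.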

Random Forest Regression

A Random Forest is an ensemble technique capable of performing classification and regression tasks with the help of multiple decision trees and a method called Bootstrap Aggregation, known as Bagging [11].
Decision Trees are used in classification and regression tasks, where the model (tree) is formed of nodes and branches. The tree starts with a root node, and each internal node corresponds to a test on an input attribute. The nodes that do not have children are called leaves, and each leaf provides a prediction of the output variable [12].
A Decision Tree can be defined as a model [13]: $\varphi : X \mapsto Y$ (3).
Where any node $t$ represents a subspace $X_t \subseteq X$ of the input space, and each internal node $t$ is labelled with a split $s_t$ taken from a set of questions $Q$. To determine the best split in a Decision Tree, the decrease in impurity obtained by dividing a node should be taken into consideration, which is defined as: $\Delta i(s, t) = i(t) - p_L \, i(t_L) - p_R \, i(t_R)$ (4).
Where $s \in Q$, and $t_L$ and $t_R$ are the left and right child nodes, respectively. $p_L = N_{t_L} / N_t$ and $p_R = N_{t_R} / N_t$ are the proportions of learning samples from $\mathcal{L}_t$ going to $t_L$ and $t_R$, respectively. $N_t$ is the size of the subset $\mathcal{L}_t$.
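Equation (4) can be evaluated directly for a candidate split. The sketch below assumes a regression setting in which the impurity i(t) is the variance of the target values in node t; that choice, the threshold-style split on one feature, and all names are assumptions made for illustration rather than part of the definition above.

```python
def variance(values):
    """Impurity i(t) of a node, taken here to be the variance of its targets."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def impurity_decrease(xs, ys, threshold):
    """Delta i(s, t) = i(t) - p_L * i(t_L) - p_R * i(t_R) for the split x <= threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    if not left or not right:
        return 0.0  # the split does not divide the node
    p_left = len(left) / len(ys)    # N_tL / N_t
    p_right = len(right) / len(ys)  # N_tR / N_t
    return variance(ys) - p_left * variance(left) - p_right * variance(right)


def best_split(xs, ys):
    """Threshold among the observed values that maximises the impurity decrease."""
    candidates = sorted(set(xs))[:-1]  # the largest value would leave t_R empty
    return max(candidates, key=lambda s: impurity_decrease(xs, ys, s))


# Hypothetical node with four samples.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 5.0, 5.0]
print(best_split(xs, ys), impurity_decrease(xs, ys, 2.0))
```

In this toy node the split at 2.0 separates the two target levels perfectly, so both children have zero impurity and the decrease equals the impurity of the parent.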
Random Forest is a model that constructs an ensemble predictor by averaging over a collection of decision trees. Therefore, it is called a forest, and there are two reasons for calling it random. The first reason is that each tree is grown on an independent random bootstrap sample of the data. The second reason is that the nodes are split using random subsets of the features [14]. Using a bootstrapped sample and considering only a subset of the variables at each step results in a wide variety of trees. This variety is what makes Random Forest more effective than an individual Decision Tree.
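The two sources of randomness and the final averaging can be sketched in a heavily simplified form. The sketch assumes that each tree is a single-split tree (a stump) chosen by squared error, so drawing one feature subset per tree is the same as drawing it per split. All names, parameters, and data are hypothetical, and this is not a full Random Forest implementation.

```python
import random


def fit_stump(rows, targets, feature_ids):
    """Best single split by squared error among the allowed features."""
    overall_mean = sum(targets) / len(targets)
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in feature_ids:
        # Excluding the largest value keeps both children non-empty.
        for threshold in sorted({row[f] for row in rows})[:-1]:
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:  # no valid split in this sample: predict the mean
        return (None, None, overall_mean, overall_mean)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    if feature is None:
        return left_mean
    return left_mean if row[feature] <= threshold else right_mean


def fit_forest(rows, targets, n_trees, n_split_features, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample rows with replacement
        feats = rng.sample(range(n_features), n_split_features)  # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Ensemble prediction: the average over all trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


# Hypothetical data: the target depends on the first feature only.
rows = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
targets = [1.0, 1.0, 5.0, 5.0]
forest = fit_forest(rows, targets, n_trees=25, n_split_features=1, seed=0)
print(predict_forest(forest, [4.0, 1.0]))
```

Individual stumps differ because they see different rows and different features; averaging them smooths out the errors of any single tree.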

Artificial Neural Network

An artificial neural network (ANN) is an attempt to simulate the workings of a biological brain. The brain learns and evolves through the experiences it encounters over time, which allows it to make decisions and predict the results of particular actions. Thus, an ANN tries to imitate the brain by learning the patterns in given data in order to predict the output for that data, whether or not the expected output was provided during the learning process [17].
An ANN is built from an assemblage of connected elements, or nodes, called neurons. Neurons act as channels that take an input, process it, and then pass it to other neurons for further processing. This transfer of data between neurons is organised in layers. A network consists of at least three layers: an input layer, one or more hidden layers, and an output layer. Each layer holds a set of neurons that take input, process it, and pass the output to the neurons in the next layer. This process repeats until the output layer is reached, at which point the result can be presented. The ANN architecture shown in the following figure is also known as feed-forward, because values pass in one direction only.
The value held in each neuron is called its activation, and it ranges from 0 to 1. As shown in figure 3, each neuron is linked to all neurons in the previous layer. Together, the activations of that layer decide whether the neuron will be triggered, which is done by taking all of those activations and computing their weighted sum [18]: $w_1 a_1 + w_2 a_2 + w_3 a_3 + \dots + w_n a_n$ (5).
However, the weighted sum could be any number, whereas the activation should lie only between 0 and 1. The output therefore has to be mapped into the accepted range. This can be done with the Sigmoid function, which squashes the output into the range from 0 to 1. A bias for inactivity is then added to the equation so that the neuron is activated only when the weighted sum is meaningfully large: $\sigma(w_1 a_1 + w_2 a_2 + w_3 a_3 + \dots + w_n a_n - b)$ (6).
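Equations (5) and (6) and the layer-by-layer feed-forward pass can be sketched as follows. The network representation (a list of layers, each holding one weight row and one bias per neuron), the function names, and the numeric values are assumptions chosen for illustration; no training is shown.

```python
import math


def sigmoid(z):
    """Map any real number into the range 0 to 1 (numerically stable form)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def neuron_activation(weights, activations, bias):
    """Equation (6): sigmoid of the weighted sum of inputs minus the bias."""
    weighted_sum = sum(w * a for w, a in zip(weights, activations))  # equation (5)
    return sigmoid(weighted_sum - bias)


def layer_forward(weight_rows, biases, activations):
    """Every neuron in the layer sees all activations of the previous layer."""
    return [neuron_activation(w, activations, b)
            for w, b in zip(weight_rows, biases)]


def feed_forward(network, inputs):
    """Pass values in one direction, from the input layer to the output layer."""
    activations = inputs
    for weight_rows, biases in network:
        activations = layer_forward(weight_rows, biases, activations)
    return activations


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
network = [
    ([[0.5, -0.3], [0.8, 0.1]], [0.0, 0.2]),  # hidden layer
    ([[1.0, -1.0]], [0.1]),                   # output layer
]
print(feed_forward(network, [0.2, 0.9]))
```

A larger bias means the weighted sum has to be larger before the neuron's activation moves meaningfully above 0.5.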

Table of contents :

1. Introduction
1.1. Aim and Purpose
1.2. Research Questions
1.3. Limitations
1.4. Thesis Structure
2. Background
2.1. Multiple Linear Regression
2.2. Lasso Regression
2.3. Ridge Regression
2.4. Random Forest Regression
2.5. Artificial Neural Network
3. Method
3.1. Literature Study
3.2. Experiment
3.2.1. Evaluation Metrics
3.2.2. Computer Specifications
3.2.3. Algorithms’ Properties/Design
4. Literature Study
4.1. Related Work
4.2. Feature Engineering
4.2.1. Imputation
4.2.2. Outliers
4.2.3. Binning
4.2.4. Log Transformation
4.2.5. One-hot Encoding
4.2.6. Feature Selection
4.3. Evaluation Metrics
4.4. Research Question 1 Results
4.5. Factors
4.5.1. Crime Rate
4.5.2. Interest Rate
4.5.3. Unemployment Rate
4.5.4. Inflation Rate
4.6. Correlation
4.7. Research Question 2 Results
5. Experiment
5.1. Data Used
5.2. Public Data
5.3. Local Data
5.4. Correlation
5.5. Experiment Results
5.5.1. Prediction Accuracy
5.5.2. Correlation
5.5.3. Factors
6. Discussion
7. Conclusion
7.1. Ethics
8. Bibliography
9. Appendixes