Data mining and machine learning
Data mining is the process of discovering patterns in data using methods and techniques from machine learning and statistics. This process is usually automated and performed on databases. Much of data mining involves applying machine learning tools to extract information from the data and find underlying structures. The goal is to find general patterns that explain something about the data, and to use these to guide future decision making.
In the context of data mining, machine learning can be explained as creating structured descriptions of data. The form of these descriptions depends on the algorithm used to create them, and can consist of, for example, mathematical functions, rule sets or decision trees. The descriptions represent what has been learned from the data, and are used to make predictions about new and previously unseen data. By looking at which attributes of the data had the biggest impact on the creation of the descriptions, it is also possible to verify suspected relationships, or to see other patterns that may have been unknown earlier.
The most common way of acquiring these descriptions is by looking at examples. The machine learning model is provided with input data, and the objective is to find the best transformation of input data into corresponding output data. In unsupervised machine learning, the model does not know the output in advance, while supervised learning models are given examples as pairs of inputs and correct outputs. Unlike an unsupervised model, a supervised model receives feedback about how close its predictions are to the correct answers, and it learns by trying to minimize the error reported by that feedback. A classification model attempts to predict an output that takes one of a discrete set of values, while a regression model makes predictions for continuous data.
A decision tree is a data structure commonly used to represent a machine learning model’s knowledge. At each internal node of the tree, the data set is split into two parts based on the values of one input variable. Growing the tree involves deciding which variable to use for the split, and what value of the variable to split the data on. This is done through an exhaustive search over the different variables, meaning that all possible splits are considered, and the split that produces the lowest error is chosen. The number of splits to be considered can be reduced by using heuristic search and function optimization techniques.
The leaves of the tree contain the model’s predictions. In a classification model the values in the leaves are different categories, and in a regression model a leaf’s value is the mean target value of the training data points that end up in that leaf. A few of the steps involved in creating a decision tree from a data set with two-dimensional input can be seen in Figure 2.1.
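The exhaustive split search described above can be sketched in a few lines. This is a minimal illustration for a regression tree, assuming the sum of squared errors as the error measure and midpoints between observed values as candidate thresholds; the function names and the toy data are hypothetical and not taken from any particular library.

```python
from statistics import mean

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, targets):
    """Try every variable and every candidate threshold.

    rows: list of equal-length feature tuples; targets: list of numbers.
    Returns (feature_index, threshold, error) for the lowest-error split,
    or None if no variable has two distinct values.
    """
    best = None
    for j in range(len(rows[0])):
        # Candidate thresholds: midpoints between consecutive distinct values.
        values = sorted({row[j] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[j] > threshold]
            error = sse(left) + sse(right)
            if best is None or error < best[2]:
                best = (j, threshold, error)
    return best

# Toy data (illustrative): the target depends only on the first variable.
rows = [(1.0, 5.0), (2.0, 3.0), (3.0, 8.0), (4.0, 1.0)]
targets = [10.0, 12.0, 30.0, 32.0]
feature, threshold, error = best_split(rows, targets)
print(feature, threshold, error)
```

On this toy data the search selects the first variable, since splitting on it separates the low targets from the high ones. Each resulting leaf would then predict the mean of the targets that fall into it.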
Figure 2.1: Selected steps in the creation of a decision tree, and its corresponding partitioning of the training set
Decision trees with many variables can grow very complex and often have problems with over-fitting, which occurs when the model is so specialized to the training set that it loses accuracy when applied to other, more general problems. One way to avoid this is to keep the tree as simple as possible. Typical criteria for when to stop growing the tree are that it has reached a maximum depth, that it has a certain number of leaves, or that the error reduction of a new split is less than a threshold value. It is also possible to prune the tree afterwards, a process that removes split nodes from the tree. A node is removed if its elimination either reduces the error, or does not increase the error too much while reducing the size of the tree sufficiently.
The process of applying several machine learning models to the same problem and combining their results is called ensembling. This approach is based on an idea from probability theory: if several independent predictors are each correct with a probability higher than 50%, their combined prediction has a higher probability of being correct than any single predictor. If the ensemble consists of decision trees, it is called a forest.
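The probabilistic argument behind ensembling can be checked numerically. The sketch below assumes a simple majority vote among an odd number of predictors that are fully independent and equally accurate, which is an idealization; the function name is hypothetical.

```python
from math import comb

def majority_correct_probability(n, p):
    """Probability that more than half of n independent predictors are
    correct, when each is correct with probability p (n assumed odd)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# Each predictor is right 60% of the time (an assumed value).
for n in (1, 5, 25):
    print(n, majority_correct_probability(n, 0.6))
```

With these assumptions the probability of a correct majority grows as predictors are added. In practice models trained on the same data are correlated, so the gain is smaller than this idealized calculation suggests.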
There are two different ways of creating forests: bagging and boosting. Bagging involves sampling subsets of the training data and creating a new independent tree for each subset, so that the trees hopefully reflect different aspects of the data. These trees are usually grown from small subsets to reduce their similarity, but are allowed to grow deeper to be able to discover more complex patterns. The final prediction is decided using all of the models, for example by taking the most common, mean or median value.
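Bagging can be sketched as follows. To keep the example short, each "tree" is a single-split classifier on one input variable, the subsets are drawn with replacement, and the final prediction is the most common vote; all names, sizes and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Pick the threshold t that minimizes the errors of the rule
    'predict 1 if x > t, else 0' on the given (x, label) pairs."""
    xs = sorted({x for x, _ in sample})
    candidates = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]

    def errors(t):
        return sum((x > t) != bool(label) for x, label in sample)

    return min(candidates, key=errors)

def bagged_forest(data, n_trees, sample_size, seed=0):
    """Train one stump per random subset of the training data."""
    rng = random.Random(seed)
    return [
        fit_stump([rng.choice(data) for _ in range(sample_size)])
        for _ in range(n_trees)
    ]

def predict(forest, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(int(x > t) for t in forest)
    return votes.most_common(1)[0][0]

# Toy data (illustrative): the label is 1 exactly when x > 5.
data = [(x, int(x > 5)) for x in range(11)]
forest = bagged_forest(data, n_trees=25, sample_size=6)
print(predict(forest, 0), predict(forest, 10))
```

Because every stump sees a different small subset, the learned thresholds vary between trees, and the vote averages out their individual mistakes near the class boundary.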
In boosting, the trees are not independent but are instead created sequentially based on the error of the previous trees. By removing the already captured patterns, the following tree can focus more on finding new aspects of the data that are harder to discover. One issue with this approach is that, because the trees do not train on the whole training set, they are more likely to find patterns that do not reflect the actual data. Therefore, the trees are usually not allowed to grow too deep, and are sometimes even kept to only one split.
The boosting algorithm starts by creating an initial model that roughly approximates the training data, usually by simply using the minimum or mean value. The next step is to calculate how much the approximation differs from what is expected. The result is a set of vectors, one for every point in the training data, called the residual vectors, that describe in which direction and how far off each estimate is from its target. A second model is then trained to approximate these residuals. The new model’s estimate of the residuals is added to the first model. The differences from the expected results are once again calculated, and another set of residual vectors is produced. This continues until a certain number of models have been trained, or the algorithm stops making sufficient progress.
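The steps above can be sketched for a single input variable. This is a simplified illustration, assuming the mean as the initial model, one-split regression trees as the individual models, and a fixed number of rounds as the stopping rule; it leaves out refinements such as a learning rate, and the function names and data are hypothetical.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """One-split regression tree on a single input variable.

    Assumes xs contains at least two distinct values. Returns a function
    that predicts the mean residual on each side of the best threshold."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = mean(left), mean(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_models):
    """Start from the mean, then fit n_models stumps on successive residuals."""
    initial = mean(ys)
    models = []
    predictions = [initial] * len(xs)
    for _ in range(n_models):
        # Residuals: how far off the current combined model is at each point.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        models.append(stump)
        # Add the new model's estimate of the residuals to the combined model.
        predictions = [p + stump(x) for x, p in zip(xs, predictions)]
    return lambda x: initial + sum(m(x) for m in models)

def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return mean((y - model(x)) ** 2 for x, y in zip(xs, ys))

# Toy data (illustrative): a quadratic curve sampled at 20 points.
xs = [0.5 * i for i in range(20)]
ys = [x * x for x in xs]
for n in (0, 1, 10):
    print(n, mse(boost(xs, ys, n), xs, ys))
```

Each added stump lowers the training error a little further, which is the behaviour the stopping criteria are meant to monitor.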
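The combined prediction $\hat{y}$ of a boosted model can be represented as the sum of all the individual models’ predictions $f_i(x)$, where $n$ is the number of individual models:

$$\hat{y} = \sum_{i=1}^{n} f_i(x) \tag{2.1}$$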
The current boosted model at step $i$ can therefore be described as the previous boosted model combined with the current individual model:

$$F_i(x) = F_{i-1}(x) + f_i(x) \tag{2.2}$$
We can show the usefulness of boosting machine learning models with a simplified example using the function $y = 5 + x + \sin(x)$, seen in Figure 2.2a. The first model looks at where the curve intersects the y-axis, at (0, 5), and so approximates the function as $f_1(x) = 5$, shown together with the original function in Figure 2.2b. The errors remaining after the first model’s predictions are shown in Figure 2.2d. The second model receives this curve as its input and sees that it can be represented quite well by a simple linear function, which matches the curve exactly at every multiple of $\pi$ (approximately every 3.14 steps). It therefore approximates the curve as $f_2(x) = x$, which is added to the previous function, as can be seen in Figure 2.2c. The third model, again, is trained on the aggregated error produced by the predictions from the previous models, shown in Figure 2.2e. The model recognizes the pattern of a sine wave and adds $f_3(x) = \sin(x)$ to the approximation, which eliminates the error completely, as the boosted models have found the correct function. In real use cases the input is not a single variable, the function is often much more complex and requires many more models, and it cannot be approximated exactly, but the process remains the same.
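The example can be reproduced numerically. The sketch below hard-codes the three models from the text rather than training them, and measures the largest remaining error after each stage on an assumed sample of points in the interval [0, 10].

```python
import math

def target(x):
    """The function the boosted models try to approximate."""
    return 5 + x + math.sin(x)

# The three individual models from the example, in the order they are added.
stages = [
    lambda x: 5.0,          # f1: constant, the y-axis intersection
    lambda x: x,            # f2: linear trend in the first residual
    lambda x: math.sin(x),  # f3: sine wave in the second residual
]

def boosted(x, n):
    """Combined prediction of the first n models."""
    return sum(f(x) for f in stages[:n])

def max_residual(n, xs):
    """Largest absolute error of the first n models over the points xs."""
    return max(abs(target(x) - boosted(x, n)) for x in xs)

xs = [0.1 * i for i in range(101)]  # assumed sample points on [0, 10]
for n in (1, 2, 3):
    print(n, max_residual(n, xs))
```

After the second model the remaining error is bounded by the amplitude of the sine wave, and it vanishes at multiples of $\pi$, which is where the linear approximation matches the residual curve.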
In 2016, Chen and Guestrin published their study of a tree boosting system called eXtreme Gradient Boosting, or XGBoost (XGB). XGBoost is an effective implementation of the gradient boosted decision trees algorithm. The framework is fast and has both CPU and GPU implementations. It also supports multi-thread parallelism, which makes it even faster.
XGBoost has two types of split decision algorithms, a pre-sort-based algorithm called Exact Greedy, which is used by default, and a histogram-based algorithm. Exact Greedy first sorts each feature before enumerating over all the possible splits and calculating the gradients. Each tree that is created is given a score that will represent how good it is. The trees are built sequentially so that the result of the previous tree can be used to help build the next tree.
A common problem that can occur when using machine learning techniques on large data sets is that all the values may not fit in the CPU cache, which can greatly increase computation times. XGBoost has solved this problem with a cache-aware algorithm and therefore achieves better performance than similar machine learning frameworks when applied to large data sets. This makes XGBoost a great tool when computation resources are limited.
Another advantage of XGBoost is that it has implemented a way to handle missing values, which is another common problem in machine learning. XGBoost solves this by using a default direction in each split, which will be selected if values are missing for this feature.
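The idea of a default direction can be illustrated with a simplified sketch. It is not XGBoost's actual implementation, which works with gradient statistics; here the direction for a given split is chosen by trying both options and keeping the one with the lower squared error. All names and values are hypothetical.

```python
from statistics import mean

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def choose_default_direction(xs, ys, threshold):
    """xs may contain None for missing values. Returns 'left' or 'right',
    whichever side gives the lower error when the missing rows are sent there."""
    left = [y for x, y in zip(xs, ys) if x is not None and x <= threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x > threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]
    error_if_left = sse(left + missing) + sse(right)
    error_if_right = sse(left) + sse(right + missing)
    return "left" if error_if_left <= error_if_right else "right"

# Toy data (illustrative): rows with a missing value resemble the high-x rows.
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [10.0, 11.0, 30.0, 31.0, 29.0, 32.0]
direction = choose_default_direction(xs, ys, threshold=5.0)
print(direction)
```

At prediction time, a row whose value for this variable is missing simply follows the stored direction, so no imputation step is needed.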
Predicting future sales of food with machine learning is not a new subject; however, much of the research in this area has focused on forecasting for retail grocery stores.
A survey by Tsoumakas reviewed 13 research papers on machine learning in food sales prediction. Only one of these articles studied a restaurant. The author found that most forecasts made daily predictions, while some used longer time spans of weeks or even quarters. The most commonly predicted output variable was the amount sold, in pieces or weight, followed by the monetary amount. According to the survey, the most common input variables to the machine learning algorithms are historical sales figures for different intervals. These can be, for example, the amount sold on this day last week or last year, and the average sales for the last week or month. Other common inputs are characteristics of the date and time, such as the day of the week, the month of the year, and whether the date is a holiday. Some inputs that occur less often in these papers are external factors that would require collecting data from outside the company, for example financial, social and weather factors, and different types of events occurring in the vicinity of the business. According to the author, there exists an unexplored opportunity to make use of product information as input variables to predict demand for multiple products with the same model.
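Input variables of the kinds listed in the survey can be derived from a plain sales history. The following is a small sketch with made-up dates and sales figures; the feature names and the function are illustrative assumptions and are not taken from any of the surveyed papers.

```python
from datetime import date, timedelta
from statistics import mean

def build_features(sales, day, holidays=frozenset()):
    """sales: dict mapping date -> units sold. Returns input variables for `day`.

    Lagged values are None when the history does not reach back far enough."""
    last_week = [
        sales[day - timedelta(days=k)]
        for k in range(1, 8)
        if day - timedelta(days=k) in sales
    ]
    return {
        "sales_same_day_last_week": sales.get(day - timedelta(days=7)),
        "mean_sales_last_7_days": mean(last_week) if last_week else None,
        "day_of_week": day.weekday(),  # Monday = 0
        "month": day.month,
        "is_holiday": day in holidays,
    }

# Made-up history: two weeks of steadily increasing sales from a Monday.
start = date(2024, 1, 1)
sales = {start + timedelta(days=i): 100 + i for i in range(14)}
features = build_features(sales, date(2024, 1, 15))
print(features)
```

Each call produces one row of input variables for the day to be forecast, and external factors such as weather could be added to the same dictionary.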
Doganis et al. combined a neural network with a genetic variable selection algorithm and tried to predict the daily sales of fresh milk for a dairy producer in Greece. They created additional input variables from the provided sales data, similar to what was used in current forecast methods. The examined variables were sales figures from the previous week of the current year, the previous week of last year, the corresponding day of last year, and the percentile change in sales from the previous year. The authors compared the forecast accuracy of the neural network to other linear regression algorithms, and found that the neural network, which was the only model provided with the additional variables, had an average error of less than 5% compared to 7-10% for the other models. The most useful input variables identified by the variable selection algorithm were the sales on the previous day and the same day last week.
Žliobaitė et al. predict the weekly sales of products for a food wholesaler in the Netherlands. They start by categorizing products into predictable and random sales patterns. This is performed by creating variables from the products’ sales history, including variations of the mean and median values, quartiles, and standard deviation, and feeding these to an ensemble of classification models that decide the product category by majority vote. Forecasts are then made for the products with predictable sales patterns, using an ensemble of predictive algorithms. The input variables provided to the predictors in this study were daily product sales, average weekly product sales, daily total sales, product promotions, holidays, season, temperature, air pressure and rain. The most used variables were the product related variables, the season, the temperature and one of the included holidays. The results show that the presented solution outperformed the baseline moving average forecast method, and that by reducing the threshold for which products were categorized as predictable, the accuracy could be improved further. They also discuss the possibility that the classification model can be excluded and its input variables incorporated directly into the prediction model.
İşlek and Öğüdücü developed a forecasting method for a Turkish distributor of dried nuts and fruits. The company has almost 100 main distribution warehouses, which have their own sub-distribution warehouses. The input data included warehouse related attributes, for example location, size, number of sub-warehouses and transportation vehicles, selling area in square meters, number of employees and amount of products sold weekly, as well as product information such as price and product categories. Their solution first used a bipartite graph clustering algorithm to group warehouses with similar sales patterns, and then a moving average and Bayesian network combination model to predict the weekly sales of individual products at each warehouse. The authors evaluated the forecast accuracy for three differently trained models. One handled all of the warehouses grouped together, one used clusters of main warehouses, and the last was given sub-warehouse clusters. The clustering algorithm generated 29 different main warehouse clusters, and 97 clusters for the sub-distribution warehouses. The results showed that the error rate of the model dropped from 49% without clustering, to 24% with main distribution warehouse clusters, and 17% with sub-distribution warehouse clusters.
Liu and Ichise performed a case study of a Japanese supermarket chain, in which they implemented a long short-term memory neural network with weather data as input parameters to predict the sales of weather-sensitive products. They used six different weather factors: solar radiation, precipitation, relative humidity, temperature, and north and east wind velocity. Their results show that their model’s predictions had an accuracy of 61.94%. The authors mention plans to improve their work in the future by adding other factors known to affect sales, such as area population, nearby competitors, price strategy and campaigns.
An early research paper that focused on restaurants was written by Takenaka et al., who developed a forecasting method for service industry businesses based on the factors that interviewed managers took into account when forecasting manually. The examined factors include the weekday, rain, temperature and holidays. They found that their regression model could provide more accurate predictions for a restaurant in Tokyo than the restaurant manager.
In a study from 2017, Bujisic et al. researched 17 different weather factors and their effects on restaurant sales. They analyzed a data set consisting of every meal sold at a restaurant in southern Florida during 47 weeks, from March 2010 to March 2011. Their results showed that weather factors can have a significant effect on sales of individual products; however, not all products are affected by the weather, and the same weather factor has different effects on different products. They also found that the most important weather factor was the temperature, followed by wind speed and air pressure.
Xinliang and Dandan used a neural network to forecast daily sales of four restaurants located at a university campus in Shanghai. The input variables provided to their model were the restaurant’s name, the date, the teaching week, the week of the year, whether the date was a holiday, temperature, precipitation, maximum wind speed, and four different search metrics from Baidu, China’s largest search engine. The authors found that the name of the restaurant was the most important variable, followed by the teaching week, holiday, and one of the Baidu variables. The weather factors were shown to have low importance to the model, with temperature slightly above the other two.
Ma et al. predicted future visitors to restaurants using a mix of K-nearest-neighbour, random forests and XGBoost. Their data was obtained from large restaurant ordering sites, and included 150 different restaurants. To compare different restaurants, the authors constructed several input variables from restaurant attributes such as a unique ID, latitude, longitude, genre, and location area. The results showed that XGBoost was the best individual model, and that the most important variables were the week of year, mean visitors, restaurant ID and maximum visitors.
Holmberg and Halldén researched how to implement machine learning algorithms for restaurant sales forecasts. To improve the accuracy of the predictions, they included variables based on sales history, date characteristics and weather factors. The weather factors used for their model were temperature, average temperature of the last 7 days, rainfall, minutes of sunshine, wind speed, cloud cover, and snow depth. They examined two different machine learning algorithms and found that the XGBoost algorithm was more accurate than the long short-term memory neural network. The date variables were the most significant and the weather factors had the least impact. They also found that the daily sales were weather dependent for all of the researched restaurants, and that introducing weather factors improved their models’ performance by 2-4 percentage points. They suggest that continued work could be to create more general models that can make predictions for multiple restaurants, possibly by categorizing different restaurants based on features such as latitude/longitude, inhabitants, size of restaurants, and opening hours.
From these studies a few key facts can be extracted. While models that perform time series forecasting by default have access to previous sales figures, results can be improved by emphasizing certain patterns with their own input variables. Which input variables prove to be most important seems to differ between studies, and depends on the variables included and on the behaviour of each restaurant’s customers. Some variables are, however, more likely to have a greater effect, with variables related to date and sales figures more often represented in the lists of most important variables. Almost all of the early studies make use of some kind of neural network architecture, but in more recent papers the use of XGBoost becomes more popular, and it has been shown to be one of the best performing algorithms for prediction and forecasting problems [19, 27–33].