An improved parameter estimation algorithm
The review of the literature in Chapter 2 provides evidence that a quantitative, process-based modelling framework is the most promising foundation for a computational scientific discovery system. The latest development along this front is the Structurally Constrained Inductive Process Modeller (SC-IPM) (Bridewell & Langley, 2010). The first section of this chapter describes the approach taken by SC-IPM and how lessons from that approach led to the development of a new parameter estimation algorithm that addresses some of the scaling issues that hindered SC-IPM. The second section describes in more detail the development and testing of this new regression-based algorithm.
The problem with SC-IPM
The SC-IPM system uses a framework based on quantitative processes. The system is designed to assist scientists in making discoveries by automating the model-structure and parameter search tasks, resulting in a model that both accurately accounts for the data and explains the behaviour. The structure of a process model defines which processes appear as part of the model and the relationships between objects in the system. Parameters are numerical constants that appear as part of the model. Background knowledge provided by the domain expert is incorporated into this framework through a generic process library. The library specifies all the ways that the objects within a particular domain can interact and often connects to domain theories and taxonomies. As a simple example, in the domain of organism ecosystems, there could be a predation process in which one species consumes another, causing a decrease in the prey population and an increase in the predator population. Generic processes are made into specific process instances when applied to actual data. For example, the data may contain variables x and y, and one of the specific process instances created from the generic process is that x consumes y. Process instances form the components of the model structure, and their interaction creates the dynamic behaviour of the system.
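To make the instantiation step concrete, the sketch below shows one possible way a generic predation process could be bound to observed variables. The data structures and names here are illustrative assumptions, not SC-IPM's actual representation.

```python
from itertools import permutations

# A hypothetical generic-library entry: a predation process that binds
# two roles, a predator and a prey. (Illustrative structure only.)
GENERIC_PREDATION = {
    "name": "predation",
    "roles": ("predator", "prey"),
}

def instantiate(generic, variables):
    """Create one process instance per ordered binding of observed
    variables to the generic process's roles."""
    instances = []
    for binding in permutations(variables, len(generic["roles"])):
        instances.append({
            "process": generic["name"],
            "bindings": dict(zip(generic["roles"], binding)),
        })
    return instances
```

Applied to data with variables x and y, this yields the two candidate instances "x consumes y" and "y consumes x", which a structure search would then choose between.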
A complete model is a combination of process instances and their associated parameter values that accounts for and explains the given data. The process model can be expressed as a set of differential equations that can be simulated to make predictions about the future behaviour of the system. The processes within the model form the basis for explaining its function, and each can be traced back to the generic process library, which in turn connects to domain theory and taxonomy. This connection to domain theory through the processes provides the explanatory power necessary for computational scientific discovery.
Figure 1 shows an example model output from SC-IPM in which the processes are identified with labels. The system being modelled is a predator-prey ecosystem with the variable Aurelia playing the part of the prey species and Nasutum the predator. The first equation states that the rate of change of the prey species involves the difference between “exponential-growth” and “grazing”, with the former having a positive coefficient and the latter a negative one. This translates to the idea that the prey species’ rate of change is positively influenced by growth and negatively influenced by being grazed upon, with the net rate of change depending on the interaction of these two processes. Similarly, the second equation states that the rate of change of the predator species Nasutum involves the difference between “exponential-loss” and “grazing”, with the former being negative and the latter positive. This translates to the idea that the predator species is negatively influenced by a loss mechanism but positively influenced by grazing. The grazing process captures the idea that grazing reduces the population of the prey species while simultaneously increasing the predator population.
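As an illustration of how such a two-process model can be simulated, the sketch below integrates equations of this general form with a simple Euler step. The functional forms and parameter values are illustrative assumptions, not the fitted values reported for Figure 1.

```python
def simulate(aurelia0, nasutum0, growth, loss, attack, conv, dt=0.01, steps=1000):
    """Euler integration of a two-process predator-prey model:
       dA/dt = growth*A - attack*A*N        (exponential-growth minus grazing)
       dN/dt = -loss*N + conv*attack*A*N    (exponential-loss plus grazing gain)
    The conversion factor `conv` is an assumed detail of the grazing term."""
    a, n = aurelia0, nasutum0
    trajectory = [(a, n)]
    for _ in range(steps):
        grazing = attack * a * n           # shared grazing process term
        da = growth * a - grazing          # prey equation
        dn = -loss * n + conv * grazing    # predator equation
        a, n = a + da * dt, n + dn * dt
        trajectory.append((a, n))
    return trajectory
```

A fixed-step Euler scheme is the simplest choice for a sketch; a production simulator would typically use an adaptive ODE solver instead.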
The model shown in Figure 1 is a process model expressed in differential equation form. This model was one of the best-fitting models found when SC-IPM was presented with empirical data from a predator-prey ecosystem described in (Bridewell & Langley, 2010). Trajectories produced by simulating this model are shown in Figure 2 and provide a synthetic data set that will be used to further evaluate the performance of the SC-IPM system. The synthetic data set offers several advantages over the empirical data for testing, including a known search target, the ability to generate a data set of any size, and the absence of noise. The synthetic data will help highlight some of the undesirable outcomes of the parameter search algorithm incorporated into SC-IPM. The system should have no trouble recovering the exact model from this perfect data when provided with the appropriate generic process library. Because parameter estimation appears to be the main difficulty encountered by SC-IPM, the system is provided with the known model structure, which eliminates the need to search for one and focuses the search effort solely on parameter estimation.
The parameter estimation routine involves a gradient descent search. An initial set of random parameter values is chosen and used to run a simulation. The error between the simulation output and the target data is computed and used as part of an objective function that minimizes the sum of squared errors between the predicted and target data. An error gradient with respect to each parameter is computed, giving an estimate of the effect of changing that parameter's value on the overall error. The parameter values are adjusted in a direction that drives the error to a smaller value, the model is simulated again with the new values, and a new error is computed. These steps of adjusting parameters and simulating are repeated until the gradients are approximately zero, indicating that the error has reached a minimum. In many cases the error space contains many local minima, so the parameters must be re-initialized to random values and the process repeated to increase the chances of finding the global minimum. The increased chance of finding the set of parameters with the smallest error comes at the cost of additional computational effort. The presence of many local minima, requiring many restarts, will prove to be one of the primary causes of SC-IPM's inefficient search.
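The estimation loop described above can be sketched as follows, using finite-difference gradients and random restarts on a generic objective function. This is a minimal illustration of the restart strategy, not SC-IPM's actual implementation; learning rate, tolerances, and bounds are assumed values.

```python
import random

def gradient_descent(objective, init, lr=0.01, eps=1e-6, max_iter=5000):
    """Minimize `objective` from `init` using forward-difference gradients;
    stop when every gradient component is approximately zero."""
    params = list(init)
    for _ in range(max_iter):
        base = objective(params)
        grad = []
        for i in range(len(params)):
            bumped = params[:]
            bumped[i] += eps
            grad.append((objective(bumped) - base) / eps)
        if all(abs(g) < 1e-4 for g in grad):
            break  # gradient ~ 0: a (possibly local) minimum
        params = [p - lr * g for p, g in zip(params, grad)]
    return params, objective(params)

def estimate_with_restarts(objective, n_params, n_restarts, bounds=(-5.0, 5.0)):
    """Re-initialize randomly `n_restarts` times and keep the best local
    minimum found, mirroring the restart loop described in the text."""
    best = None
    for _ in range(n_restarts):
        init = [random.uniform(*bounds) for _ in range(n_params)]
        candidate = gradient_descent(objective, init)
        if best is None or candidate[1] < best[1]:
            best = candidate
    return best
```

On a multimodal objective, a single descent frequently lands in the wrong basin; the restarts trade extra simulations for a higher chance of reaching the global minimum, which is exactly the cost discussed above.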
Figure 3 shows the resulting distribution of model scores when SC-IPM is provided with the model structure and asked to run parameter estimation 2500 times with no random re-initializations. Note that both the vertical and horizontal axes have discontinuities in order to capture the extreme values without losing all of the detail. The reported score is a normalized mean squared error: the squared error of the model's predictions divided by the squared error of the mean model, a model that always predicts the mean value of the data. Graphically, the mean model corresponds to drawing a straight horizontal line through the middle of the data. In practical terms, a score of 0 is a perfect fit and a score greater than 1 is worse than simply predicting the average value of the data. The 2500 estimations produced over 1700 unique scores (rounded to 3 decimal places), and five of the 2500 attempts recovered the correct set of parameters, i.e. the parameters that were used to create the data. Every other result scored 1.0 or higher, indicating a fit equivalent to or worse than a mean-value guess.
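Assuming the score is the ratio of the model's sum of squared errors to that of the mean model, it can be computed as:

```python
def relative_score(predicted, observed):
    """Error of the model divided by the error of the mean model.

    0.0  -> perfect fit
    1.0  -> no better than always predicting the mean
    >1.0 -> worse than the mean model
    """
    mean = sum(observed) / len(observed)
    sse_model = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    sse_mean = sum((o - mean) ** 2 for o in observed)
    return sse_model / sse_mean
```

This quantity is equivalent to one minus the familiar coefficient of determination (R²), which is why scores above 1.0 mark fits worse than a horizontal line through the data.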
Because each attempt involved a single parameter estimation run, the number of unique scores gives a lower bound on the number of local minima in the parameter space. Several conclusions can be drawn from this experiment. The first is that the parameter space contains a large number of local minima. The parameter search algorithm appears to be inherently unreliable on this data set because random initial conditions are unlikely (5 out of 2500 tries) to lead to the global minimum. The algorithm also does not seem to find models that are close: the second-best reported model score was greater than 1.0 and thus worse than a mean model.
Table 2 shows the results of an experiment to determine the success rate of the program as a function of the number of parameter estimation restarts. Parameter estimation restarts refers to the number of times the parameter estimation algorithm is restarted with random initial values. The number of model-finding attempts is how many times SC-IPM returns the best set of parameters it found from those restarts. A success is counted when the best set of parameters returned is the correct one. With 250 parameter estimation restarts, the system returned the correct set of parameters on 10 out of 30 runs. This suggests that, given the correct model structure and 250 restarts, SC-IPM would have about a 30% chance of finding the correct parameter values. At 500 restarts it found the correct parameters on 22 out of 30 attempts, and at 1000 restarts it was successful 27 out of 30 times. A second, similar data set showed a success rate of 29 out of 30 with 1000 restarts. This provides evidence that, on this particular data set, configuring SC-IPM to run 1000 parameter estimation restarts gives roughly a 90% chance of returning the correct parameters. An ordinary desktop computer circa 2013 took approximately two minutes to run through 1000 parameter estimation restarts. This is for a single model structure with only two observed variables and five parameters, and the effort must be repeated for every model structure in the space. Presumably the parameter estimation algorithm will take longer for models containing more parameters and variables, and there is no guarantee that the reliability observed on this data set will carry over to other settings.
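If each restart is treated as an independent trial with the single-run success rate observed in Figure 3 (5 out of 2500), the chance that at least one of n restarts finds the global minimum can be estimated directly. Independence is an assumption here, but the predicted rates line up reasonably with the observed 10/30, 22/30, and 27/30:

```python
def p_success(per_restart_rate, n_restarts):
    """Probability that at least one of n independent restarts
    finds the global minimum: 1 - (1 - p)^n."""
    return 1 - (1 - per_restart_rate) ** n_restarts

rate = 5 / 2500  # single-restart success rate from the Figure 3 experiment

# Predicted success probabilities for the restart counts in Table 2:
#   250 restarts  -> about 0.39
#   500 restarts  -> about 0.63
#   1000 restarts -> about 0.86
```

The rough agreement between this simple binomial model and the measured success rates supports the interpretation that each restart is an essentially independent draw from a landscape dominated by local minima.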
Computational Scientific Discovery Using Rate-Based Process Models