Sparsely constrained neural networks for model discovery of PDEs

Heuristics

Besides these regularizations, various other heuristics are often applied to improve the sparsity of the obtained coefficients. I discuss two commonly used ones: thresholding and sequential regression.

Thresholding

It is quite common to recover a sparse vector which is approximately right: the required terms stand out, but several small non-zero components remain despite regularization. Performance can be strongly improved by thresholding this sparse vector. Since the components have different dimensions, the features are first normalised by their $\ell_2$ norm,

$$\Theta^*_j = \frac{\Theta_j}{\lVert \Theta_j \rVert_2}.$$
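The normalisation and thresholding steps can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the synthetic library `theta`, the plain least-squares fit standing in for the regularized regression, the helper names and the threshold value `tol=1.0` are all assumptions made for the example.

```python
import numpy as np

def normalize_features(theta):
    """Scale every column of the library matrix to unit l2 norm."""
    norms = np.linalg.norm(theta, axis=0)
    return theta / norms, norms

def threshold(xi, tol):
    """Set every coefficient with magnitude below tol to exactly zero."""
    return np.where(np.abs(xi) >= tol, xi, 0.0)

rng = np.random.default_rng(0)
# Hypothetical library: three features with very different scales.
theta = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
# Synthetic target: only the first two features are active.
u_t = 2.0 * theta[:, 0] + 0.03 * theta[:, 1] + 0.1 * rng.normal(size=200)

theta_norm, norms = normalize_features(theta)
# Plain least squares stands in for the regularized fit (assumption).
xi_norm = np.linalg.lstsq(theta_norm, u_t, rcond=None)[0]
xi_sparse = threshold(xi_norm, tol=1.0)  # tol is an assumed value
xi_physical = xi_sparse / norms          # undo the column scaling
print(xi_sparse != 0, xi_physical)
```

Thresholding the normalised coefficients rather than the raw ones matters here: in physical units the true coefficient of the second feature is small only because that feature is large, so a threshold applied to the raw coefficients could remove it.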

Stability selection

In the previous section we considered the effect of different regularizations on sparsity. An additional important factor is the strength of this regularization, denoted by the factor $\lambda$ in eq. (2.9). Figure 2.6 shows the solution $\hat{\xi}$ as a function of the regularization strength $\lambda$. The resulting sparse vector depends strongly on $\lambda$: the weaker the regularization, the more features are selected.

A common way to choose $\lambda$ is cross-validation (CV). CV consists of splitting the data into $k$ folds, training the model on $k - 1$ folds and testing model performance on the remaining fold. This process is repeated for all folds, and averaging over the results yields a data-efficient but computationally expensive approach to hyperparameter tuning. CV optimizes for predictive performance (how well does the model predict unseen data?), but model discovery is interested in finding the underlying structure of the model. Optimizing for predictive performance might not be optimal; a correct model certainly generalizes well, but optimizing for it on noisy data will likely bias the estimator to include additional, spurious terms.

An alternative metric which optimizes for variable selection is stability selection [Meinshausen and Buehlmann, 2009]. The key idea is that, while a single fit will probably not be able to discriminate between required and unnecessary terms, we expect that among multiple fits on sub-sampled datasets the required terms will consistently be non-zero. In contrast, the unnecessary terms essentially model the error and noise, and will likely differ between subsamples. More specifically, stability selection bootstraps a dataset of size $N$ into $M$ subsets $I_m$ of $N/2$ samples, and defines the selection probability $\Pi^\lambda_j$ as the fraction of subsamples in which a term $j$ is non-zero,

$$\Pi^\lambda_j = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\left(j \in S^\lambda(I_m)\right),$$

where $S^\lambda(I_m)$ is the set of terms selected on subsample $I_m$ at regularization strength $\lambda$.
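The selection probability can be estimated with a few lines of code. The sketch below is illustrative only and rests on several assumptions: a small proximal-gradient (ISTA) Lasso solver stands in for whichever sparse regressor is actually used, a single value of the regularization strength is considered, and the synthetic data, function names and cut-off of 0.9 are hypothetical choices.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimise (1/2n)||y - Xw||^2 + lam*||w||_1 by proximal gradient descent."""
    n = X.shape[0]
    step = n / np.linalg.norm(X, 2) ** 2  # inverse Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = w - step * X.T @ (X @ w - y) / n
        w = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)
    return w

def selection_probability(X, y, lam, n_subsamples=50, seed=0):
    """Fraction of random half-size subsamples in which each term is non-zero."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    counts = np.zeros(X.shape[1])
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        counts += lasso_ista(X[idx], y[idx], lam) != 0
    return counts / n_subsamples

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
# Synthetic target: only terms 0 and 3 are part of the true model.
y = 1.5 * X[:, 0] - 2.0 * X[:, 3] + 0.5 * rng.normal(size=200)

pi = selection_probability(X, y, lam=0.2)
stable = np.flatnonzero(pi >= 0.9)  # 0.9 is an assumed cut-off
print(pi, stable)
```

In practice the selection probability is usually computed along a range of regularization strengths rather than at one value, and a term is kept if it is selected consistently anywhere on that path.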

Multi-experiment datasets

A dataset rarely consists of a single experiment; it typically comprises several experiments, each with different parameters, initial and boundary conditions. Despite these variations, they all share the same underlying dynamics. To apply model discovery on these datasets we need a mechanism to exchange information about the underlying equation among the experiments. Specifically, we need to introduce and apply the constraint that all experiments share the same support (i.e. which terms are non-zero).
Consider $k$ experiments, all taken from the same system, $\{\Theta \in \mathbb{R}^{N \times M},\ u_t \in \mathbb{R}^{N}\}_k$, with unknown sparse coefficients $\xi \in \mathbb{R}^{M \times k}$. Applying sparse regression on each experiment separately is likely to violate this constraint; while it will yield a sparse solution for each experiment, the support will not be the same (see figure 2.8). Instead of penalizing single coefficients, we must penalize groups, an idea known as group lasso [Huang and Zhang, 2009]. Group lasso first calculates a norm over all members in each group (typically an $\ell_2$ norm), and then applies an $\ell_1$ norm over these group norms (see figure 2.8).
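The group penalty and its effect on the support can be illustrated with a short sketch. It assumes that each row of the coefficient matrix holds one candidate term across all experiments, so that rows are the groups; the example values, the function names and the threshold `tau` are hypothetical. The second function is the standard proximal operator of the group lasso penalty, shown here in isolation rather than as part of a full solver.

```python
import numpy as np

def group_lasso_penalty(xi):
    """l1 norm over the l2 norms of the rows (one row per candidate term)."""
    return np.sum(np.linalg.norm(xi, axis=1))

def group_soft_threshold(xi, tau):
    """Shrink each row towards zero as a whole; rows with norm <= tau vanish."""
    norms = np.linalg.norm(xi, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return xi * scale

# Hypothetical coefficients: 4 candidate terms (rows), 3 experiments (columns).
xi = np.array([
    [1.00, 1.20, 0.90],     # required term, present in every experiment
    [0.05, 0.00, -0.04],    # small spurious term
    [0.00, 0.30, 0.00],     # spurious term in a single experiment
    [-2.00, -1.80, -2.10],  # required term
])

xi_group = group_soft_threshold(xi, tau=0.4)  # tau is an assumed value
support = np.any(xi_group != 0, axis=1)
print(group_lasso_penalty(xi), support)
```

Because a row is either shrunk as a whole or removed as a whole, every experiment ends up with the same support, while the coefficient values are still free to differ between experiments.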

Time/space dependent coefficients

So far we have implicitly assumed that all the governing equations have fixed coefficients. This places a very strong limit on the applicability of model discovery; usually coefficients are fields depending on space or time (or even both!), so that $\xi = \xi(x, t)$. As a first attempt at solving this, Rudy et al. [2018] learn the spatial or temporal dependence of $\xi$ by considering each corresponding slice as a separate experiment with different coefficients, and applying the group sparsity approach we introduced in the previous section. A Bayesian version of this approach was studied by Chen and Lin [2021]. Note that in these approaches it is not the functional form of the coefficients which is inferred, but simply their values at the locations of the samples.

Information-theoretic approaches

The sparse regression approach tacitly assumes that the sparsest equation is also the correct one. While not an unreasonable assumption, it does require some extra nuance, as there often is not a single correct model.
Many (effective) models are constructed as approximations of a certain order, depending on the accuracy required. A first order approximation yields the sparsest equation, while the second order model could describe the system better. Another example is that a system can be locally modelled by a simpler, sparser equation. In both cases neither model is incorrect; they simply make a different trade-off. Information-theoretic approaches give a principled way to balance sparsity of the solution with accuracy. Consider two models $A$ and $B$, with likelihoods $\mathcal{L}_A$ and $\mathcal{L}_B$ and with $k_A$ and $k_B$ terms respectively. Model $B$ has a higher likelihood, $\mathcal{L}_B > \mathcal{L}_A$, indicating a better fit to the data, but also consists of more terms, $k_B > k_A$. The Bayesian Information Criterion (BIC) and the closely related Akaike Information Criterion (AIC) are two metrics which balance these two objectives,

$$\mathrm{BIC} = k \ln n - 2 \ln \mathcal{L},$$

where $n$ is the number of samples; the model with the lowest value is preferred.
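As a rough illustration of how such a criterion ranks candidate models, the sketch below compares three nested models on synthetic data. It assumes Gaussian residuals with the noise variance set to its maximum-likelihood value, counts only the regression terms in $k$, and uses hypothetical data, candidate supports and function names; the AIC is included in its standard form, $2k - 2 \ln \mathcal{L}$.

```python
import numpy as np

def gaussian_log_likelihood(theta, u_t):
    """Maximised log-likelihood of a least-squares fit with Gaussian noise."""
    n = len(u_t)
    xi = np.linalg.lstsq(theta, u_t, rcond=None)[0]
    rss = np.sum((u_t - theta @ xi) ** 2)
    return -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)

def bic(log_l, k, n):
    return k * np.log(n) - 2.0 * log_l

def aic(log_l, k):
    return 2.0 * k - 2.0 * log_l

rng = np.random.default_rng(0)
n = 500
theta = rng.normal(size=(n, 5))
# Synthetic target: only the first two library terms are active.
u_t = 1.0 * theta[:, 0] - 0.5 * theta[:, 1] + 0.1 * rng.normal(size=n)

# Hypothetical candidate supports, from too sparse to too dense.
candidates = {"A": [0], "B": [0, 1], "C": [0, 1, 2, 3, 4]}
log_l = {m: gaussian_log_likelihood(theta[:, s], u_t) for m, s in candidates.items()}
bic_scores = {m: bic(log_l[m], len(s), n) for m, s in candidates.items()}
aic_scores = {m: aic(log_l[m], len(s)) for m, s in candidates.items()}

best = min(bic_scores, key=bic_scores.get)
print(bic_scores, aic_scores, best)
```

The likelihood can only increase as terms are added to a nested model, so it is the penalty term that prevents the densest candidate from always winning.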

Table of contents :

1 Introduction
1.1 Model discovery
1.2 Contributions
1.3 Organization of this thesis
2 Regression
2.1 Structure of differential equations
2.2 Regression for model discovery
2.3 Regularized regression
2.4 Heuristics
2.5 Extensions
3 Differentiation and surrogates
3.1 The need for surrogates
3.2 Surrogates: local versus global
3.3 Neural networks as surrogates
4 DeepMoD
4.1 DeepMoD: Deep Learning for Model discovery in noisy data
4.2 Sparsely constrained neural networks for model discovery of PDEs
4.3 Fully differentiable model discovery
4.4 Temporal Normalizing Flows
4.5 Model discovery in the sparse sampling regime
5 Conclusion
5.1 Challenges and questions unanswered