Using the response variable to estimate the structure of sub-regression


Variable selection methods

Stepwise selection [Seber and Lee, 2012] is an algorithm that chooses a subset of covariates to use in the final regression model. It is a variable selection method relying on ols for estimation, provided in the R package stats through the function step. The main idea is to start from an initial model (which can be empty, use the whole set of covariates, or use any subset of them) and then to add and remove covariates step by step so as to improve a chosen criterion. The criterion to optimize can be the adjusted R2, the Akaike information criterion, the Bayesian information criterion, etc.
• Starting from an empty model and allowing only adding steps is called Forward Selection. At each step, the covariate that improves the criterion the most is added. The algorithm stops when all the covariates are in the model or when none of the remaining covariates improves it.
• Backward Elimination works like Forward Selection but starts from the full model and removes, at each step, the covariate whose deletion improves the criterion the most.
• Bidirectional elimination is more flexible and allows starting from any model. Each step proposes either to add a covariate or to delete one, so the construction is no longer hierarchical: successive models are not necessarily nested in each other.
A critical value can be defined to stop the algorithm when the improvement becomes too small, in order to avoid over-fitting.
Stepwise regression is subject to over-fitting, and the algorithm runs into trouble when confronted with correlated covariates [Miller, 2002], giving unstable results, especially for nested strategies, just like regression trees, which are unstable because of their discrete nested nature. Figure 3.7 illustrates the consequences of correlations in the dataset.
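As an illustration, the step function from the stats package can run all three strategies. The following is a minimal sketch on simulated data (the dataset and variable names are hypothetical), using the default AIC criterion:

    # Minimal stepwise sketch with stats::step (simulated, hypothetical data)
    set.seed(1)
    n <- 100
    dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
    dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

    # Forward Selection: start from the empty model, adding steps only
    m0 <- lm(y ~ 1, data = dat)
    forward <- step(m0, scope = ~ x1 + x2 + x3, direction = "forward")

    # Backward Elimination: start from the full model, deleting steps only
    mfull <- lm(y ~ ., data = dat)
    backward <- step(mfull, direction = "backward")

    # Bidirectional elimination: adding and deleting steps are both allowed
    both <- step(m0, scope = ~ x1 + x2 + x3, direction = "both")

By default step optimizes the AIC (penalty k = 2); passing k = log(n) switches to the bic instead.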

CLusterwise Effect REgression (clere)

The CLusterwise Effect REgression (clere [Yengo et al., 2012]) describes the βj no longer as fixed-effect parameters but as unobserved independent random variables, with the βj following a Gaussian mixture distribution, which allows grouping them by their component membership.
The idea is that if the model contains a small number of groups of covariates, then the mixture has few enough components for the number of parameters to estimate to be significantly lower than d. In such a case, it improves interpretation and the ability to yield reliable predictions, with a smaller variance on β̂. An R package clere is available on cran [Yengo and Canouil, 2014].
However, the maximum number of components g has to be chosen, and we have no method to choose this value. Yengo recommends using g = 5 in our case. It can be interpreted as allowing a group of irrelevant covariates and groups with small or large values (both positive and negative). The package is able to choose automatically the best number of components between 1 and g based on a bic criterion, but setting g = d leads to over-fitting. Here again, there is no specific protection against, nor specific model for, correlations.
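To fix ideas, here is a minimal sketch on simulated data, assuming the fitClere entry point documented for the clere package on cran (argument names may differ across package versions):

    # Minimal clere sketch (simulated data; fitClere interface assumed)
    library(clere)
    set.seed(2)
    n <- 100; d <- 20
    x <- matrix(rnorm(n * d), n, d)
    beta <- c(rep(0, 10), rep(2, 5), rep(-3, 5))  # three groups of effects
    y <- as.numeric(x %*% beta + rnorm(n))

    # g = 5 is the maximum number of mixture components for the beta_j,
    # as recommended above; the package can also compare fewer components
    # with a bic criterion (see its documentation)
    fit <- fitClere(y = y, x = x, g = 5)
    summary(fit)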

Spike and Slab

Spike and Slab variable selection [Ishwaran and Rao, 2005] also relies on a Gaussian mixture hypothesis (the spike and the slab) for the βj and yields a subset of covariates (not grouped) on which to compute ols, but it has no specific protection against correlation issues. The βj are supposed to come from a mixture distribution, as shown in Figure 3.15. This allows some coefficients to be set exactly to zero after some draws. The package spikeslab for R is on cran.
Modeling the parameters implies that no exact value is attached to each coefficient, which is not really user-friendly, especially in our industrial context.
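For completeness, here is a minimal sketch on simulated data, assuming the spikeslab entry point of the spikeslab package:

    # Minimal spike-and-slab sketch (simulated data; interface assumed)
    library(spikeslab)
    set.seed(3)
    n <- 100; d <- 20
    x <- matrix(rnorm(n * d), n, d)
    y <- as.numeric(x %*% c(rep(0, 15), rep(3, 5)) + rnorm(n))

    # Posterior draws under the spike-and-slab prior; some coefficients
    # end up set exactly to zero, giving the selected subset of covariates
    fit <- spikeslab(x = x, y = y)
    print(fit)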

Simultaneous Equation Model (sem) and Path Analysis

Applied statistics for non-statisticians are well developed in sociology, where the stakes of interpretation are far beyond prediction. Sociologists use simple models like linear regression (often with R2 < 0.2) and describe complex situations with systems of linear regressions. Such systems are called Structural Equation Models or Simultaneous Equation Models, better known as sem [Davidson and MacKinnon, 1993]. Several software packages, from the open-source Gretl [Cottrell and Lucchetti, 2007] to the proprietary STATA, implement sem. These systems describe which covariates have an influence on others, with an orientation that users can interpret as causality [Pearl, 2000, Pearl, 1998].
sem are easy to understand for non-statisticians and can be summarized by Directed Acyclic Graphs (DAG), just as Bayesian networks are. But the problem is that the structure of regression between the covariates is defined a priori. sem are often used to confirm sociological theories, not to create new ones.
Moreover, the estimation of a recursive sem without instrumental variables is exactly a succession of independent ols fits (confirmed with both Gretl and STATA), so the structure is only used for interpretation, not for estimation [Brito and Pearl, 2006]. Last but not least, there is no specific status for a response variable: each regression has the same status. We want to be able to model complex dependencies within the covariates and use this knowledge to estimate and understand a specific, distinct response variable.
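The point about recursive sem can be made concrete with a minimal two-equation sketch on simulated data: with independent errors and no instrumental variables, each equation reduces to its own ols fit.

    # Recursive SEM sketch: x2 depends on x1, and y depends on x1 and x2
    set.seed(4)
    n <- 200
    x1 <- rnorm(n)
    x2 <- 0.8 * x1 + rnorm(n, sd = 0.3)   # regression between covariates
    y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n)

    # Estimating this recursive system without instrumental variables is
    # exactly a succession of independent ols fits, one per equation:
    fit_sub  <- lm(x2 ~ x1)       # structure between the covariates
    fit_main <- lm(y ~ x1 + x2)   # main regression

The directed structure (x1 -> x2 -> y) is thus only used for interpretation: the estimates coincide with those of the two separate regressions above, which is precisely the limitation discussed here.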


Table of contents:

Glossary of notations
1 Introduction 
1.1 Industrial motivation
1.1.1 Steel making process
1.1.2 Impact of the industrial context
1.1.3 Industrial tools
1.2 Mathematical motivation
1.3 Outline of the manuscript
2 Résumé substantiel en français 
2.1 Position du problème
2.2 Modélisation explicite des corrélations
2.3 Modèle marginal
2.4 Notion de prétraitement
2.5 Estimation de la structure
2.6 Relaxation des contraintes et nouveau critère
2.7 Résultats
2.8 Modèle plug-in sur les résidus du modèle marginal
2.9 Valeurs manquantes
3 State of the art in linear regression 
3.1 Regression
3.1.1 General purpose
3.1.2 Linear models
3.1.3 Non-linear models
3.2 Parameter estimation
3.2.1 Maximum likelihood and Ordinary Least Squares (ols)
3.2.2 Ridge regression: a penalized estimator
3.3 Variable selection methods
3.3.1 Stepwise
3.3.2 Least Absolute Shrinkage and Selection Operator (lasso)
3.3.3 Least Angle Regression (lar)
3.3.4 Elasticnet
3.3.5 Octagonal Shrinkage and Clustering Algorithm for Regression (oscar)
3.4 Modeling the parameters
3.4.1 CLusterwise Effect REgression (clere)
3.4.2 Spike and Slab
3.5 Taking correlations into account
3.5.1 Principal Component Regression (pcr)
3.5.2 Partial Least Squares Regression (pls)
3.5.3 Simultaneous Equation Model (sem) and Path Analysis
3.5.4 Seemingly Unrelated Regression (sur)
3.5.5 Selvarclust: Linear regression within covariates for clustering
3.6 Conclusion
I Model for regression with correlation-free covariates 
4 Structure of inter-covariates regressions 
4.1 Introduction
4.2 Explicit modeling of the correlations
4.3 A by-product model: marginal regression with decorrelated covariates
4.4 Strategy of use: pre-treatment before classical estimation/selection methods
4.5 Illustration of the trade-off conveyed by the pre-treatment
4.6 Connexion with graphs
4.7 mse comparison on the running example
4.8 Numerical results with a known structure on more complex datasets
4.8.1 The datasets
4.8.2 Results when the response depends on all the covariates, true structure known
4.9 Conclusion
5 Estimation of the structure of sub-regression 
5.1 Model choice: Brief state of the art
5.1.1 Cross validation
5.1.2 Bayesian Information Criterion (bic)
5.2 Revisiting the Bayesian approach for an over-penalized bic
5.2.1 Probability associated to the redundant covariates (responses)
5.2.2 Probability associated to the free covariates (predictors)
5.2.3 Probability associated to the discrete structure S
5.2.4 Penalization of the integrated likelihood by P(S)
5.3 Random walk to optimize the new criterion
5.3.1 Transition probabilities
5.3.2 Deterministic neighbourhood
5.3.3 Stochastic neighbourhood
5.3.4 Enlarged neighbourhood by constraint relaxation
5.3.5 Pruning
5.3.6 Initialization of the walk
5.3.7 Implementing and visualizing the walk by the CorReg software
5.4 Conclusion
6 Numerical results on simulated datasets 
6.1 Simulated datasets
6.2 Results about the estimation of the structure of sub-regression
6.2.1 Comparison with Selvarclust
6.2.2 Computational time
6.3 Results on the main regression with specific designs
6.3.1 Response variable depends on all the covariates
6.3.2 Response variable depends only on free covariates (predictors)
6.3.3 Response variable depends only on redundant covariates
6.3.4 Robustness with non-linear case
6.4 Conclusion
7 Experiments on steel industry 
7.1 Quality case study
7.2 Production case study
7.3 Conclusion
II Two extensions: Re-injection of correlated covariates and Dealing with missing data
8 Re-injection of correlated covariates to improve prediction 
8.1 Motivations
8.2 A plug-in model to reduce the noise
8.3 Model selection consistency of lasso improved
8.4 Numerical results with specific designs
8.4.1 Response variable depends on all the covariates
8.4.2 Response variable depends only on free covariates (predictors)
8.4.3 Response variable depends only on redundant covariates
8.5 Conclusion
9 Using the full generative model to manage missing data 
9.1 State of the art on missing data
9.2 Choice of the model of sub-regressions despite missing values
9.2.1 Marginal (observed) likelihood
9.2.2 Weighted penalty for bic*
9.3 Maximum likelihood estimation of the coefficients of sub-regression
9.3.1 Stochastic EM
9.3.2 Stochastic imputation by Gibbs sampling
9.3.3 Parameters computation for the Gibbs sampler
9.4 Missing values in the main regression
9.5 Numerical results on simulated datasets
9.5.1 Estimation of the sub-regression coefficients
9.5.2 Multiple imputation
9.5.3 Results on the main regression
9.6 Numerical results on real datasets
9.7 Conclusion
10 Conclusion and prospects 
10.1 Conclusion
10.2 Prospects
10.2.1 Qualitative variables
10.2.2 Regression mixture models
10.2.3 Using the response variable to estimate the structure of sub-regression
10.2.4 Pre-treatment for non-linear regression
10.2.5 Missing values in classical methods
10.2.6 Improved programming and interpretation
References
