Prediction Model for Pregnancy Outcome 


Chapter 2 A Likelihood Model for Identifying Differentially Expressed Proteins

This chapter was published as:
WU, S. H., BLACK, M. A., NORTH, R. A., ATKINSON, K. R. & RODRIGO, A. G. 2009. A Statistical Model to Identify Differentially Expressed Proteins in 2D PAGE Gels. PLoS Comput Biol, 5, e1000509.

I developed the methods, performed the analyses and wrote the manuscript. The other authors contributed to the development of the methods and the experimental design. They also contributed to the writing of the manuscript.

Introduction

Two-dimensional polyacrylamide gel electrophoresis (2D PAGE) (O’Farrell 1975) separates thousands of proteins within a sample by their isoelectric points (pI) in the first dimension and their molecular weights in the second dimension. Gels are scanned and spot detection is performed using commercial or in-house software packages. These programs convert gel images into vectors of matched spot volumes, and most analyses are subsequently performed on these data (Biron et al. 2006). 2D PAGE may be used to identify proteins that differentiate or characterise certain patient groups or sample sets. For instance, by comparing specimens from patients with a specified disease to a control group, statistical differences in protein levels can be determined to identify proteins associated with a disease state that may serve as diagnostic or prognostic biomarkers (Atkinson et al. 2009).

Several statistical tests have been applied to detect differences in protein expression. These include the classical Student's t-test, analysis of variance (Biron et al. 2006), principal component analysis and partial least squares analysis (Jacobsen et al. 2007; Marengo et al. 2006). A key disadvantage of these methods is their failure to adequately address the difficulties of dealing with non-expressed or undetected proteins in some or all subjects within a group (Chang et al. 2004; Grove et al. 2006).

There are three broad reasons why a given protein may not be detected in 2D PAGE experiments: (1) the lack of sensitivity of the experimental setup or software to detect the presence of an expressed protein, usually a consequence of some threshold of detectable concentration (Dowsey et al. 2003); (2) the true absence or non-expression of a protein; and (3) software-induced error, when proteins are incorrectly designated as being absent (Wheelock and Buckpitt 2005). Some researchers have developed methods that impute missing values from the existing data (Chang et al. 2004). However, without knowing the true causes of these missing values, imputation may introduce additional errors to the dataset (Grove et al. 2006; Marengo et al. 2006); in particular, by ignoring the possibility that a protein may not be expressed in a certain group of subjects, imputation may lead to an elevation in the number of false negatives.

The problem of missing values may be addressed by incorporating missing observations into a statistical model of the data. Under the principle of likelihood, estimates of parameters (such as the mean expression intensity or the probability of expression) may then be obtained by computing the probability of obtaining the observed data, given different values of these parameters. The best estimates are those that maximise this probability; they are called the maximum likelihood estimates. Wood and co-workers (Wood et al. 2004) first proposed a statistical method to compute the likelihood for expressed proteins which simultaneously takes missing data into account along with expression profiles. Their method does not distinguish between the processes that may account for why a protein is undetected. This means that the probability associated with non-detection is a composite of the probabilities of protein non-expression and of expression below the level of detection.

In this chapter, a new likelihood model is proposed that extends the approach of Wood et al. and is specifically applicable to situations where subjects belong to either a Case group or a Control group, in keeping with a case-control experimental design. This extended model allows for non-detected proteins and classifies them into two categories: either (a) the protein truly is not expressed, or (b) the protein is expressed but the expression level is below the limit of detection. I show how the proposed new method performs under simulations and compare results with standard statistical approaches commonly applied to detect differences in protein expression between groups. I also present an example using a subset of spots from a Case-Control 2D PAGE experiment.
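
To make the idea of a likelihood that accounts for non-detection more concrete, the following is a minimal sketch and not the model developed in this chapter. It assumes, purely for illustration, that log spot volumes of an expressed protein are normally distributed and that any value below a fixed detection limit is recorded as missing. The function names and the parameters `p`, `mu`, `sigma` and `limit` are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu, sigma):
    """Probability that a normal variable falls below x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def log_likelihood(values, p, mu, sigma, limit):
    """Log-likelihood of one spot within one group (illustrative assumptions).

    values: log spot volumes; None marks a spot that was not detected.
    p:      probability that the protein is expressed (0 < p < 1).
    mu, sigma: mean and standard deviation of expression when expressed.
    limit:  assumed detection limit, on the same scale as the values.
    """
    below_limit = normal_cdf(limit, mu, sigma)
    total = 0.0
    for v in values:
        if v is None:
            # Non-detection: (a) truly not expressed, or
            # (b) expressed but below the limit of detection.
            total += math.log((1.0 - p) + p * below_limit)
        else:
            # Detection: expressed, with the observed intensity v.
            total += math.log(p) + math.log(normal_pdf(v, mu, sigma))
    return total


# Made-up data: four detected spots and two non-detections.
spots = [11.2, 10.8, 11.5, 10.9, None, None]
print(log_likelihood(spots, p=0.8, mu=11.0, sigma=0.5, limit=9.0))
```

Under these assumptions, maximum likelihood estimates would be the values of `p`, `mu` and `sigma` that make this quantity as large as possible, and fitting the Case and Control groups with shared versus separate parameters gives the two likelihoods that a likelihood ratio test compares.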


Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Overview of Bioinformatics 
1.2 Statistical Challenges in the “-Omic” Age
1.2.1 Missing data
1.2.2 Multiple comparisons and false discovery
1.2.3 Permutation methods
1.2.4 2D PAGE and microarray
1.3 Bioinformatics in Clinical Medicine
1.3.1 “-Omic” technologies to discover novel biomarkers
1.3.2 Clinical studies for biomarker discovery and validation
1.4 SCOPE (Screening for Pregnancy Endpoints) Project
1.4.1 Disease endpoints
1.4.2 Preeclampsia
1.4.3 Screening for preeclampsia
1.4.4 Discovery of protein biomarkers for preeclampsia
1.5 Disease Prediction Models Based on Clinical Variables
1.5.1 Database development, data monitoring and data cleaning
1.5.2 A priori clinical knowledge
1.5.3 Evaluation method
1.5.4 Measuring performance
1.5.5 Comparison group
1.5.6 Impact of misclassification in the prediction model
1.5.7 Receiver Operating Characteristic (ROC) curves
1.5.8 Variable selection
1.5.9 Multiple endpoints
1.6 Common Statistical Methods for Disease Prediction 
1.6.1 Cluster analysis
1.6.2 Regression
1.6.3 Multivariate analysis and dimension reduction methodologies
1.7 Statistical Methods for Biomarker Discovery
1.7.1 Maximum likelihood estimation
1.7.2 Bayesian inference
1.7.3 Markov chain Monte Carlo
1.8 My study 
Chapter 2 A Likelihood Model for Identifying Differentially Expressed Proteins
2.1 Introduction
2.2 Development of a Likelihood Model
2.2.1 Application of the Likelihood Ratio Test (LRT)
2.3 Simulation Analysis
2.3.1 Application of model to simulated datasets
2.3.2 Application of model to 2D PAGE example
2.4 Results
2.4.1 Application of model to 2D PAGE data
2.5 Discussion
Chapter 3 The Global Bayesian Model 
3.1 Introduction
3.2 The Global Bayesian Model 
3.2.1 Reparameterisation
3.2.2 The local layer
3.2.3 The global layer
3.3 Markov Chain Monte Carlo (MCMC)
3.3.1 Bayesian inference
3.3.2 Metropolis-Hastings algorithm
3.3.3 Prior distributions
3.3.4 Proposal distributions
3.3.5 Using 95% highest posterior density (HPD) to identify differentially expressed proteins
3.3.6 Adaptive MCMC
3.4 Simulation Analysis
3.5 Results
3.6 Discussion
Chapter 4 Prediction Model for Pregnancy Outcome 
4.1 The Dataset 
4.1.1 Training and Validation Datasets
4.1.2 Study Population
4.2 Prediction Algorithms
4.2.1 Logistic regression
4.2.2 Discriminant analysis
4.3 Variable Selection
4.3.1 Percentage cutoff point
4.3.2 Stepwise variable selection
4.3.3 AIC – Akaike Information Criterion
4.3.4 BIC – Bayesian Information Criterion
4.3.5 Stepwise variable selection for LDA
4.3.6 Stepwise variable selection for LDA (with phi correlation)
4.3.7 Grouping of variables
4.4 Prediction Models
4.4.1 M1 – Logistic regression with no variable selection
4.4.2 M2 – Logistic regression with top 50% of variables
4.4.3 M3 – Logistic regression with AIC
4.4.4 M4 – Logistic regression with BIC
4.4.5 M5 – LDA with percentage cutoff point
4.4.6 M6 – LDA using the variables selected by logistic regression
4.4.7 M7 – Stepwise LDA
4.4.8 M8 – Stepwise LDA with phi correlation
4.4.9 M4C – Logistic Regression with BIC (Including Clustering)
4.4.10 Measuring Performance
4.5 Result and Discussion
4.5.1 Uncomplicated pregnancy
4.5.2 Preeclampsia
4.5.3 Interaction terms
4.6 Conclusion 
Chapter 5 Hierarchical Prediction Framework
5.1 Overview of Methods of the Three Stage Hierarchical Prediction Framework 
5.1.1 Analysis for each training set
5.1.2 Analysis in each validation set
5.1.3 Example of the three stage hierarchical prediction framework
5.1.4 A two dimensional grid search
5.2 Overview of Methods of the Four Stage Hierarchical Prediction Framework 
5.2.1 Analysis in each training set
5.2.2 Analysis in each validation set
5.2.3 Three dimensional grid search
5.3 Results
5.3.1 Three stage hierarchical prediction framework
5.3.2 Four stage hierarchical prediction framework
5.4 Conclusion 
Chapter 6 Conclusions
6.1 Identification of Differentially Expressed Proteins Using 2D PAGE 
6.1.1 Suggested Improvements for the Global Model
6.2 Prediction Model for Pregnancy Endpoints
6.2.1 Future work
6.3 Final Summary
Appendix A Publication
Appendix B List of Variables in the SCOPE Database 
Appendix C Lists of Variables in Each Cluster Used in Model M4C 
Appendix D Lists of Frequently Selected Variables
D.1 By the uncomplicated pregnancy prediction model 
D.2 By the preeclampsia prediction model
References

Novel Statistical Methods For Analysing Proteomic And Clinical Data Associated With Pregnancy-Related Diseases
