Dataset Selection and Modeling from Large Datasets

Feature ranking based on a single sample

This section investigates and makes explicit the problems and consequences of using a single small or large sample. Ten small samples, 10 medium samples, and 10 large samples were taken from the forest cover type dataset using sequential random sampling (SRS). Table 5.4 shows the number of features selected for each sample by the Gaussian probe and by the Z-test, based on class-feature correlations measured using Kendall’s tau. For the probes, the selection criterion is a class-feature correlation coefficient greater than that of the Gaussian probe. The numbers of features selected by the uniform probe and the uniform-binary probe are given in tables D.1, D.4 and D.7 of appendix D. Since only one correlation value is available for each predictive feature in these experiments, the Z-test for a single correlation value was used to test the hypothesis that the class-feature correlation is greater than or equal to 0.1, that is, that the feature has a correlation of practical significance (Cohen, 1988). The Z-test for a single correlation measurement was discussed in chapter 3.

Two problems can be deduced from table 5.4. The first is that sample sizes of 100 result in very few features being selected. The second is that the number of selected features varies from sample to sample. Smyth (2001) has argued that if a single sample is used to measure correlations between variables, then features may be lucky (or unlucky) in that sample and be selected (or eliminated) on the basis of a single correlation measurement.
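The details of the single-correlation Z-test are given in chapter 3 and are not repeated here, but the following Python sketch illustrates one common construction: the correlation is passed through the Fisher z-transformation, and the variance approximation 0.437/(n − 4) for transformed Kendall’s tau (Fieller, Hartley and Pearson, 1957) is assumed. Whether chapter 3 uses exactly this variance form is not shown in this excerpt, and the function name and example values are illustrative only.

```python
import math
from scipy import stats

def z_test_single_tau(tau: float, n: int, tau0: float = 0.1):
    """One-sided Z-test of H0: tau <= tau0 against H1: tau > tau0 for a
    single Kendall's tau measured on a sample of size n.

    Uses the Fisher z-transformation; the variance approximation
    0.437 / (n - 4) for transformed Kendall's tau follows Fieller,
    Hartley and Pearson (1957).
    """
    z = (math.atanh(tau) - math.atanh(tau0)) / math.sqrt(0.437 / (n - 4))
    p_value = 1.0 - stats.norm.cdf(z)  # one-sided p-value
    return z, p_value

# Illustrative values only: a feature with tau = 0.18 on a sample of 1000.
z, p = z_test_single_tau(0.18, 1000)
print(f"Z = {z:.2f}, one-sided p = {p:.4f}")
```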
It could be argued that as sample sizes get larger, the variability in the measured correlation coefficient will decrease. However, even for sample sizes of 1000, which is large for statistical hypothesis testing, table 5.4 shows that the variability in the number of features selected remains high. A further problem that arises when a single sample is used for feature selection is illustrated in table 5.5, which shows the class-feature correlation values for four of the features in the KDD Cup 1999 dataset, measured using Kendall’s tau on samples of size 1000. It can be deduced from table 5.5 that a feature (e.g. NumFailedLogins) can have no correlation, small correlation, medium correlation, or high correlation with the class variable depending on the sample used, even when the sample size for correlation measurement is large.
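The sampling variability reported in table 5.5 is easy to reproduce on synthetic data. The sketch below draws repeated samples of size 1000 from an invented population with a weak monotone class-feature association (a stand-in for a feature such as NumFailedLogins, not the actual KDD Cup 1999 data) and prints the spread of measured Kendall’s tau values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical population: a numeric feature weakly associated with a
# binary class variable. Synthetic data, not the KDD Cup 1999 dataset.
N = 500_000
feature = rng.normal(size=N)
klass = (feature + 3.0 * rng.normal(size=N) > 0).astype(int)

# Measure Kendall's tau on 10 independent samples of size 1000.
taus = []
for _ in range(10):
    idx = rng.choice(N, size=1000, replace=False)
    tau, _ = stats.kendalltau(feature[idx], klass[idx])
    taus.append(tau)

print(["%.3f" % t for t in taus])  # tau varies noticeably across samples
```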

Empirical study of feature subset search

Feature subset search is the process of searching for an optimal subset of features according to specified criteria. A common criterion is to select the subset of features (from a set of identified relevant features) that maximises relevance and minimises redundancy within the selected subset. Feature subset search methods, and examples of the merit measures employed in heuristic search for feature subsets, were discussed in detail in chapter 3. The experiments reported in this section are for feature subset selection using forward search. Forward search algorithms that employ the correlation-based feature selection (CFS) merit measure (Hall, 1999) and the differential prioritisation (DP) measure (Ooi et al., 2007) were implemented and tested using the features selected in the previous section as inputs. Section 5.4.1 discusses and analyses how the CFS (Hall, 1999) and DP (Ooi et al., 2007) search procedures implement the definitions of feature relevance and redundancy, and makes the weaknesses of their merit measures explicit. A new algorithm for feature subset search is proposed in section 5.4.2, and its feature selection performance is compared with that of the CFS and DP methods.
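To make the search procedure concrete, the following Python sketch implements greedy forward search driven by the CFS merit measure, Merit(S) = k * r_cf / sqrt(k + k(k-1) * r_ff) (Hall, 1999), where r_cf is the mean feature-class correlation and r_ff the mean pairwise feature-feature correlation over the k features in S. This is a minimal sketch, assuming precomputed, symmetric correlation tables (e.g. Kendall’s tau magnitudes); the stopping rule (halt when merit no longer improves) is one common choice rather than necessarily the exact procedure evaluated in this chapter, and the toy feature names are invented.

```python
import math

def cfs_merit(subset, class_corr, feat_corr):
    """CFS merit (Hall, 1999): k * r_cf / sqrt(k + k*(k-1) * r_ff),
    where r_cf is the mean feature-class correlation and r_ff the mean
    pairwise feature-feature correlation over the k features in subset."""
    k = len(subset)
    r_cf = sum(class_corr[f] for f in subset) / k
    if k == 1:
        r_ff = 0.0
    else:
        pairs = [(f, g) for i, f in enumerate(subset) for g in subset[i + 1:]]
        # feat_corr is assumed symmetric: both (f, g) and (g, f) present.
        r_ff = sum(feat_corr[(f, g)] for f, g in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def forward_search(features, class_corr, feat_corr):
    """Greedy forward search: repeatedly add the candidate feature that
    most improves the CFS merit; stop when no addition improves it."""
    selected, best_merit = [], 0.0
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        merit, best_f = max(
            (cfs_merit(selected + [f], class_corr, feat_corr), f)
            for f in candidates
        )
        if merit <= best_merit:
            break
        selected.append(best_f)
        best_merit = merit
    return selected

# Illustrative toy inputs: correlation magnitudes (e.g. |Kendall's tau|).
class_corr = {"f1": 0.40, "f2": 0.35, "f3": 0.30}
feat_corr = {("f1", "f2"): 0.80, ("f2", "f1"): 0.80,
             ("f1", "f3"): 0.10, ("f3", "f1"): 0.10,
             ("f2", "f3"): 0.15, ("f3", "f2"): 0.15}
print(forward_search(["f1", "f2", "f3"], class_corr, feat_corr))
```

On these toy inputs the search selects f1 and then f3: although f2 has a higher class correlation than f3, its strong correlation with f1 (0.80) makes it redundant, which is exactly the relevance-versus-redundancy trade-off the CFS merit encodes.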

1 Introduction 
1.1 Motivation for the research
1.2 Current debates and practices in data mining from large datasets
1.3 Scope of the research
1.4 The claims of the thesis
1.5 Research paradigm
1.6 Research contributions
1.7 Overview of the thesis
2 Dataset Selection and Modeling from Large Datasets 
2.1 The need for dataset selection
2.2 Classification modeling from very large datasets
2.3 The dataset selection problem
2.4 Theoretical methods for single sample selection
2.5 Empirical methods for single sample selection
2.6 Methods for selecting multiple training datasets
2.7 Conceptual views of classification modeling
2.8 Sources of classification error
2.9 The limitations of current methods of dataset selection
2.10 Proposed approach to selection of training data from very large datasets
2.11 Conclusions
3 The Feature Selection Problem 
3.1 The need for feature selection
3.2 Implicit feature selection
3.3 Explicit feature selection
3.4 Merit measures for heuristic search of feature subsets
3.5 Measuring correlations
3.6 Validation methods for feature selection
3.7 Conclusions
4 Research Methods 
4.1 Research questions and objectives
4.2 The central argument for the thesis
4.3 The research paradigm and methodology
4.4 The datasets used in the experiments
4.5 Sampling methods
4.6 The data mining algorithms used in the experiments
4.7 Measures of model performance
4.8 Software used for the experiments
4.9 Chapter summary
5 Feature Selection for Large Datasets 
5.1 The feature selection problem revisited
5.2 Alternative approaches to feature selection for large datasets
5.3 Empirical study of feature ranking methods for large datasets
5.4 Empirical study of feature subset search
5.5 Predictive performance of features selected with different methods
5.6 Discussion
5.7 Conclusions
6 Methods for Dataset Selection and Base Model Aggregation 
6.1 Problem decomposition for OVA and pVn modeling
6.2 Methods for improving predictive performance
6.3 Design and selection of training and test datasets
6.4 Methods for creating and testing OVA and pVn models
6.5 Chapter summary
7 Evaluation of Dataset Selection for One-Versus-All Aggregate Modeling  
7.1 OVA modeling
7.2 Experiments to study OVA models for 5NN classification
7.3 Experiments to study OVA models for See5 classification
7.4 Discussion
7.5 Conclusions
8 Evaluation of Dataset Selection for Positive-Versus-Negative Aggregate Modeling  
8.1 pVn modeling
8.2 Experiments to study pVn models for 5NN classification
8.3 Experiments to study pVn models for See5 classification
8.4 Comparison of performance variability for single and aggregate models
8.5 Discussion
8.6 Conclusions
9 ROC Analysis for Single and Aggregate Models 
9.1 ROC analysis for 2-class predictive models
9.2 ROC analysis for multi-class predictive models
9.3 ROC analysis for 5NN models
9.4 ROC analysis for See5 models
9.5 Conclusions
10 Recommendations for Dataset Selection 
10.1 Reduction of prediction error
10.2 Recommendations for feature selection
10.3 Recommendations for training dataset selection for aggregate modeling
10.4 Chapter summary
11 Discussion of Research Contributions 
11.1 Outputs of design science research
11.2 Evaluation of design science research
11.3 Limitations of the proposed dataset selection methods
11.4 Chapter summary
12 Conclusions 
12.1 Summary of the thesis
12.2 Conclusions and reflection
12.3 Future work
References
