Résumé
The main objective of this research is to provide a theoretical foundation for analysing grouped data, taking the underlying continuous nature of the variable(s) into account. Statistical techniques have been developed and applied extensively for continuous data, but the analysis of grouped data has been somewhat neglected. This creates numerous problems, especially in the social and economic disciplines, where variables are grouped for various reasons. For lack of appropriate statistical techniques to evaluate grouped data, researchers are often tempted to ignore the underlying continuous nature of the data and to use, for example, the class midpoint values instead. This oversimplifies the problem and discards valuable information in the data. The first part of the thesis demonstrates how to fit a continuous distribution to a grouped data set.
By implementing the ML estimation procedure of Matthews and Crowther (1995: A maximum likelihood estimation procedure when modelling in terms of constraints. South African Statistical Journal, 29, 29-51) the ML estimates of the parameters are obtained. The standard errors of the ML estimates are derived from the multivariate delta theorem. Notably, little accuracy is lost by grouping the data, which suggests that statistical inference can be carried out effectively from a grouped data set. The main concern of this part of the thesis is to establish the basic principles. The examples and distributions discussed serve only to illustrate and explain the approach from these basic principles. Other continuous distributions not discussed in the thesis, such as the gamma and lognormal distributions, can be fitted using the same approach.
The second part of the thesis concentrates on the analysis of generalised linear models where the response variable is presented in grouped format. A cross-classification of the independent variables leads to a number of so-called cells, each containing a frequency distribution of the response variable. Due to the nature of the response variable, the usual analysis of variance and covariance models can no longer be applied in the usual sense. A completely new approach is introduced, in which a specified underlying continuous distribution for the grouped variable is fitted to each cell of the multifactor design. Certain measures of the fitted distributions, such as the mean, the median or any other percentile, are modelled to explain the influence of the independent variables on the response variable. This evaluation may be done by means of a saturated model, where no additional constraints are employed in the ML estimation procedure, or by means of any other model in which certain structures with regard to the independent variables are incorporated. The ultimate objective is to provide a satisfactory model that describes the data as effectively as possible and reveals the various trends in the data. Using the multivariate delta theorem, the standard errors of the ML estimates are calculated, which enables the testing of relevant hypotheses.
The goodness of fit of the model is evaluated with the Pearson and Wald statistics. Two applications of multifactor models are presented. In the first application normal distributions are fitted to the cells in a single-factor design. The behaviour of the means of the fitted normal distributions reveals the effect of the single independent variable. Various models are employed to demonstrate the versatility of the technique. In the second application a two-factor model is employed for data from short-term insurance. The positive skewness of the grouped response variable suggested that a log-logistic distribution should be fitted to the data. The median of the log-logistic distributions is modelled in a two-factor model to explain the effect of the independent variables on the response variable. It is also illustrated how to incorporate a grouped independent variable as a covariate or regressor in the model. Whereas researchers might previously have been restricted to tabulations and graphical representations, it is now shown that the possibilities for modelling a grouped response variable in a generalised model are in principle unlimited. The application of a three-factor model or any higher-order model follows similarly. A typical example arises from population census data, where the grouped variable income can be explained using independent variables such as gender, province, population group, age, education level, occupation, etc.
Estimation
The frequency vector $f$ follows a multinomial distribution and consequently belongs to the exponential class. Since the vector of cumulative relative frequencies $p$ is a one-to-one transformation of $f$, the random vector $p$ may be used in the ML estimation procedure of Matthews and Crowther (1995) presented in Proposition 1. With this procedure it is possible to find the ML estimate of $\pi$ under the restriction that $\pi$ satisfies the constraints defined in the procedure. The foundation of this research is given in the following two propositions; the proofs are given in Matthews and Crowther (1995).

Proposition 1 (ML estimation procedure) Consider a random vector of cumulative relative frequencies $p$, which may be considered as a non-singular (one-to-one) transformation of the canonical vector of observations, belonging to the exponential family, with $E(p) = \pi$ and $\mathrm{Cov}(p) = V$. The observed $p$ is the unrestricted ML estimate of $\pi$, and the covariance matrix $V$ may be a function of $\pi$. Let $g(\pi)$ be a continuous vector-valued function of $\pi$ for which the first-order partial derivatives

$$G_\pi = \frac{\partial g(\pi)}{\partial \pi} \tag{2.10}$$

with respect to $\pi$ exist. The ML estimate of $\pi$, subject to the constraints $g(\pi) = 0$, is obtained iteratively from

$$\hat{\pi} = p - (G_\pi V)' \, (G_p V G_\pi')^{*} \, g(p) \tag{2.11}$$

where $G_p = \left. \dfrac{\partial g(\pi)}{\partial \pi} \right|_{\pi = p}$ and $(G_p V G_\pi')^{*}$ is a generalized inverse of $G_p V G_\pi'$.

The iterative procedure implies a double iteration over $p$ and $\pi$. The procedure starts with the unrestricted ML estimate of $\pi$ as the starting value for both $p$ and $\pi$. Convergence is first obtained over $p$ using (2.11). The converged value of $p$ is then used as the next value of $\pi$, with convergence over $p$ starting again at the observed $p$. In this procedure $V$ is recalculated for each new value of $\pi$. Convergence over $\pi$ ultimately leads to $\hat{\pi}$, the restricted ML estimate of $\pi$.
Proposition 2 The asymptotic covariance matrix of $\hat{\pi}$, under $g(\pi) = 0$, is

$$\mathrm{Cov}(\hat{\pi}) \approx V - (G_\pi V)' \, (G_\pi V G_\pi')^{*} \, (G_\pi V) \tag{2.12}$$

which is estimated by replacing $\pi$ by $\hat{\pi}$.

In Matthews and Crowther (1995) it is assumed that the restrictions are linearly independent, but Matthews and Crowther (1998) show that if the restrictions are linearly dependent, the generalized inverse $(G_\pi V G_\pi')^{*}$ must be used in (2.11) and (2.12).
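The double iteration of Proposition 1 and the covariance formula of Proposition 2 can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the tolerance, the choice of the Moore-Penrose pseudoinverse as the generalized inverse, and the toy data are all hypothetical. The toy constraint says that the first two of three classes are equally probable, which in terms of cumulative probabilities reads $\pi_2 - 2\pi_1 = 0$; the covariance of the cumulative relative frequencies is taken as $V_{ij} = (\pi_{\min(i,j)} - \pi_i \pi_j)/n$, which follows from the multinomial distribution of $f$.

```python
import numpy as np

def cov_cumulative(pi, n):
    """Multinomial covariance of cumulative relative frequencies:
    V[i, j] = (pi[min(i, j)] - pi[i] * pi[j]) / n (pi must be increasing)."""
    pi = np.asarray(pi, dtype=float)
    return (np.minimum.outer(pi, pi) - np.outer(pi, pi)) / n

def restricted_ml(p_obs, g, G, n, tol=1e-12, max_iter=500):
    """Double iteration of (2.11): inner loop over p, outer loop over pi."""
    p_obs = np.asarray(p_obs, dtype=float)
    pi = p_obs.copy()                          # start at the unrestricted estimate
    for _ in range(max_iter):                  # outer iteration over pi
        GpiV = G(pi) @ cov_cumulative(pi, n)   # V is recalculated for each new pi
        p = p_obs.copy()                       # inner iteration restarts at observed p
        for _ in range(max_iter):
            # (G_pi V)' (G_p V G_pi')^* g(p), with pinv as generalized inverse
            step = GpiV.T @ np.linalg.pinv(G(p) @ GpiV.T) @ g(p)
            p = p - step
            if np.max(np.abs(step)) < tol:
                break
        converged = np.max(np.abs(p - pi)) < tol
        pi = p                                 # converged p becomes the next pi
        if converged:
            break
    return pi

def restricted_cov(pi_hat, G, n):
    """Asymptotic covariance (2.12), evaluated at the restricted estimate."""
    V = cov_cumulative(pi_hat, n)
    GV = G(pi_hat) @ V
    return V - GV.T @ np.linalg.pinv(GV @ G(pi_hat).T) @ GV

# Hypothetical toy data: three classes with frequencies 30, 50, 20.
freq = np.array([30.0, 50.0, 20.0])
n = freq.sum()
p_obs = np.cumsum(freq)[:-1] / n               # cumulative relative frequencies

A = np.array([[-2.0, 1.0]])                    # g(pi) = pi_2 - 2 * pi_1
g = lambda pi: A @ pi
G = lambda pi: A                               # Jacobian of a linear constraint

pi_hat = restricted_ml(p_obs, g, G, n)
cov_hat = restricted_cov(pi_hat, G, n)
```

For this toy constraint the restricted ML estimate has the closed form $((f_1 + f_2)/(2n), (f_1 + f_2)/n)$, which gives a convenient check on the iteration. Because the constraint is linear, the inner loop converges after a single step; the outer loop is still needed because $V$ depends on $\pi$.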
Goodness of fit
In order to test the deviation of the observed probabilities $p$ from the restricted ML estimates $\hat{\pi}$, imposed by the constraints $g(\pi) = 0$, it is convenient to formulate and test the null hypothesis $H_0 : g(\pi) = 0$ by a goodness of fit statistic such as the Pearson $\chi^2$-statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(p_i - \hat{\pi}_i)^2}{\hat{\pi}_i} \tag{2.15}$$

or the Wald statistic

$$W = g(p)' \, (G_p V G_p')^{*} \, g(p). \tag{2.16}$$

Both the Pearson and the Wald statistic have a $\chi^2$-distribution with $r$ degrees of freedom, where $r$ is equal to the number of linearly independent constraints in $g(\pi)$. Another useful measure is the measure of discrepancy

$$D = W/n \tag{2.17}$$

which will provide more conservative results for large sample sizes. As a rule of thumb, the observed and expected frequencies are considered not to deviate significantly from each other if the discrepancy is less than 0.05.
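The three measures can be computed along the following lines. This is a hedged sketch with hypothetical function names and toy data (frequencies 30, 50, 20 and the constraint $\pi_2 - 2\pi_1 = 0$ on the cumulative probabilities, whose restricted estimate is $(0.4, 0.8)$). Two assumptions go beyond the text: $V$ in (2.16) is evaluated at the observed $p$, and the Pearson statistic is computed on the frequency scale, that is, from observed and expected class frequencies; if $p_i$ and $\hat{\pi}_i$ in (2.15) are read as relative frequencies, the value computed here equals $n$ times that sum.

```python
import numpy as np

def cov_cumulative(pi, n):
    """Multinomial covariance of cumulative relative frequencies."""
    pi = np.asarray(pi, dtype=float)
    return (np.minimum.outer(pi, pi) - np.outer(pi, pi)) / n

def wald_and_discrepancy(p_obs, g, G, n):
    """Wald statistic (2.16), discrepancy (2.17) and degrees of freedom r.
    Assumption: V is evaluated at the observed p."""
    Gp = G(p_obs)
    gp = g(p_obs)
    V = cov_cumulative(p_obs, n)
    W = float(gp @ np.linalg.pinv(Gp @ V @ Gp.T) @ gp)
    r = int(np.linalg.matrix_rank(Gp))         # number of independent constraints
    return W, W / n, r

def pearson_frequency_scale(freq, pi_hat_cum):
    """Pearson statistic from observed and expected class frequencies.
    Assumption: frequency scale (n times the relative-frequency sum)."""
    freq = np.asarray(freq, dtype=float)
    # class probabilities from cumulative probabilities
    probs = np.diff(np.concatenate(([0.0], pi_hat_cum, [1.0])))
    expected = freq.sum() * probs
    return float(np.sum((freq - expected) ** 2 / expected))

# Hypothetical toy data and constraint.
freq = np.array([30.0, 50.0, 20.0])
n = freq.sum()
p_obs = np.cumsum(freq)[:-1] / n
A = np.array([[-2.0, 1.0]])                    # g(pi) = pi_2 - 2 * pi_1
g = lambda pi: A @ pi
G = lambda pi: A
pi_hat = np.array([0.4, 0.8])                  # restricted estimate for this constraint

W, D, r = wald_and_discrepancy(p_obs, g, G, n)
X2 = pearson_frequency_scale(freq, pi_hat)
```

The statistics `W` and `X2` would then be compared with a $\chi^2$-distribution with `r` degrees of freedom, and `D` with the 0.05 rule of thumb.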
The exponential distribution
To illustrate the underlying methodology of fitting a distribution via the ML estimation process described in Proposition 1, it will be shown how to fit an exponential distribution to the frequency data in Table 2.1. The probability density function (pdf) of an exponential random variable with expected value $\mu$ is given by

$$f(x; \mu) = \frac{1}{\mu} e^{-x/\mu} \tag{3.1}$$

and the cumulative distribution function (cdf) is given by

$$F(x; \mu) = 1 - e^{-x/\mu}. \tag{3.2}$$

To fit an exponential distribution it is required (see (2.13)) that

$$\mathbf{1} - \exp(-\theta x) = \pi \tag{3.3}$$

where $\mathbf{1}$ is a $(k-1) \times 1$ vector of ones, $x$ is the vector of upper class boundaries and $\theta = \mu^{-1}$. From this requirement (3.3) two alternative ways of performing the estimation procedure are described. In Sections 3.1 and 3.2 it will be shown that although the specifications of the two sets of constraints, $g(\pi) = 0$, seem completely different, the final results obtained are identical.
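One way to turn requirement (3.3) into constraints of the form $g(\pi) = 0$ is sketched below. Rearranging (3.3) gives $\ln(\mathbf{1} - \pi) = -\theta x$, so $\ln(\mathbf{1} - \pi)$ must lie in the column space of $x$; projecting onto the orthogonal complement of $x$ with $Q = I - x(x'x)^{-1}x'$ yields $g(\pi) = Q \ln(\mathbf{1} - \pi) = 0$. This formulation is an assumption made for illustration and is not necessarily the one used in Sections 3.1 or 3.2; the function names and class boundaries are hypothetical.

```python
import numpy as np

def make_exponential_constraint(x):
    """Return g and its Jacobian G for the constraint Q ln(1 - pi) = 0,
    where Q projects onto the orthogonal complement of span(x)."""
    x = np.asarray(x, dtype=float)
    Q = np.eye(x.size) - np.outer(x, x) / (x @ x)

    def g(pi):
        return Q @ np.log1p(-np.asarray(pi, dtype=float))

    def G(pi):
        # d/dpi ln(1 - pi) = -1 / (1 - pi), applied elementwise
        return Q @ np.diag(-1.0 / (1.0 - np.asarray(pi, dtype=float)))

    return g, G

def theta_from_pi(pi, x):
    """Least-squares solution of ln(1 - pi) = -theta * x for theta."""
    x = np.asarray(x, dtype=float)
    return float(-(x @ np.log1p(-np.asarray(pi, dtype=float))) / (x @ x))

# Hypothetical upper class boundaries and rate parameter.
x = np.array([1.0, 2.0, 3.0, 4.0])
theta_true = 0.5
pi_exp = -np.expm1(-theta_true * x)            # 1 - exp(-theta * x), as in (3.3)

g, G = make_exponential_constraint(x)
g_at_exp = g(pi_exp)                           # vanishes for exponential pi
theta_back = theta_from_pi(pi_exp, x)          # recovers theta_true
g_at_other = g(np.array([0.2, 0.5, 0.6, 0.9])) # non-exponential pi violates g = 0
```

Since $Q$ has rank $k - 2$ while $g$ has $k - 1$ components, the constraints in this form are linearly dependent, which is the situation in which a generalized inverse is needed in the estimation procedure.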
Contents:
- 1 Introduction
- I Fitting distributions to grouped data
- 2 The ML estimation procedure
- 2.1 Formulation
- 2.2 Estimation
- 2.3 Goodness of fit
- 3 The exponential distribution
- 3.1 Direct set of constraints
- 3.2 Constraints in terms of a linear model
- 3.3 Simulation study
- 4 The normal distribution
- 4.1 Direct set of constraints
- 4.2 Constraints in terms of a linear model
- 4.3 Simulation study
- 5 The Weibull, log-logistic and Pareto distributions
- 5.1 The Weibull distribution
- 5.2 The log-logistic distribution
- 5.3 The Pareto distribution
- 5.4 Generalization
- II Linear models for grouped data
- 6 Multifactor design
- 6.1 Formulation
- 6.2 Estimation
- 7 Normal distributions
- 7.1 Estimation of distributions
- 7.2 Equality of variances
- 7.3 Multifactor model
- 7.4 Application: Single-factor model
- 7.4.1 Model 1: Unequal variances
- 7.4.2 Model 2: Equal variances
- 7.4.3 Model 3: Ordinal factor
- 7.4.4 Model 4: Regression model
- 8 Log-logistic distributions
- 8.1 Estimation of distributions
- 8.2 Multifactor model
- 8.3 Application: Two-factor model
- 8.3.1 Model 1: Saturated model
- 8.3.2 Model 2: No interaction model
- 8.3.3 Model 3: Regression model with no interaction
- 8.3.4 Model 4: Regression model with interaction
- III Bivariate normal distribution
- 9 Bivariate grouped data
- 9.1 Formulation
- 9.2 Estimation
- 10 The bivariate normal distribution
- 10.1 Joint distribution
- 10.2 Marginal distributions
- 10.3 Standard bivariate normal distribution
- 10.4 Conditional distributions
- 10.5 Bivariate normal probabilities
- 10.5.1 Calculation of bivariate normal probabilities
- 10.5.2 Calculation of ρ
- 11 Estimating the bivariate normal distribution
- 11.1 Bivariate normal probabilities
- 11.2 Parameters
- 11.2.1 Marginal distribution of x
- 11.2.2 Marginal distribution of y
- 11.2.3 Joint distribution of x and y
- 11.3 Vector of constraints
- 11.3.1 Marginal distribution of x
- 11.3.2 Marginal distribution of y
- 11.3.3 Joint distribution of x and y
- 11.4 Matrix of Partial Derivatives
- 11.4.1 Marginal distribution of x
- 11.4.2 Marginal distribution of y
- 11.4.3 Joint distribution of x and y
- 11.5 Iterative procedure
- 11.6 ML estimates
- 11.6.1 ML estimates of the natural parameters
- 11.6.2 ML estimates of the original parameters
- 11.7 Goodness of fit
- 12 Application
- 12.1 ML estimation procedure
- 12.1.1 Unrestricted estimates
- 12.1.2 ML estimates
- 13 Simulation study
- 13.1 Theoretical distribution
- 14 Résumé
- V Appendix
- A SAS programs: Part I
- A.1 EXP1.SAS
- A.2 EXP2.SAS
- A.3 EXPSIM.SAS
- A.4 NORM1.SAS
- A.5 NORM2.SAS
- A.6 NORMSIM.SAS
- A.7 FIT.SAS