Construction and validation of a synthetic panel in Mexico: 2002-2005
The procedure detailed above was applied forward in time to estimate the synthetic 2005 income of the households sampled in 2002. A synthetic panel can be built at either the household or the individual level. We focused on households as observational units, as these tend to offer a wider perspective on family wellbeing and give access to a larger set of time-invariant attributes. The resulting income mobility estimates were then compared with the actual income mobility observed in the panel.
The data come from the Mexican Family Life Survey (MxFLS hereafter). It is based on a sample of households that is representative at the national, regional and urban-rural levels. It was fielded by the National Institute of Statistics (INEGI, by its acronym in Spanish) but was coded and critically assessed by its study directors. The MxFLS is a multi-thematic, longitudinal database that gathers information on socioeconomic indicators, migration, demographics and health indicators for the Mexican population. This panel survey is expected to track the Mexican population over a period of at least ten years.
The first and second waves, conducted in 2002 and 2005 respectively, rely on a baseline sample of 8,400 households and collected data on the socio-demographic characteristics of each household member, individual occupation and earnings, household income and expenditures, and asset ownership. The sample in 2005 was expanded to compensate for attrition, which amounted to 10% of the original sample in the second wave. For confidentiality reasons, data on the sample design (primary/secondary sampling units) are not public (see the MxFLS website).
Household income data follow the official definition used to compute income poverty in Mexico. They include both monetary and non-monetary resources. The former comprise receipts from employment, own businesses, rents from assets, and public and private transfers. Non-monetary income includes in-kind gifts received and the value of services provided within the household, such as the rental value of owner-occupied dwellings or self-consumption. Total income is then divided by household size to obtain per capita income and is deflated by the Consumer Price Index (August 2005=100) to make the 2002 and 2005 data comparable.
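The conversion to real per capita income can be illustrated with a minimal sketch. The function name, the income figure and the CPI value of 88 are hypothetical and serve only to show the arithmetic; they are not taken from the survey or from official price data.

```python
import numpy as np

def real_per_capita_income(total_income, household_size, cpi, base_cpi=100.0):
    """Per capita income expressed in base-period prices (base period CPI = base_cpi)."""
    total_income = np.asarray(total_income, dtype=float)
    household_size = np.asarray(household_size, dtype=float)
    per_capita = total_income / household_size
    # Deflate: rescale nominal amounts to the price level of the base period.
    return per_capita * (base_cpi / cpi)

# Hypothetical 2002 household: nominal income 6000, four members,
# and an assumed 2002 CPI of 88 relative to August 2005 = 100.
income_2002 = real_per_capita_income([6000.0], [4], cpi=88.0)
```

Incomes observed in the base period itself keep their nominal per capita value, since the deflator is then equal to one.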
In order to focus on the most stable set of households and to facilitate the use of pseudo-panel instruments, the sample was restricted to households whose head was aged between 25 and 62 years in 2002, the baseline year (28-65 years old in 2005). Finally, to limit possible adverse effects of atypical observations, two percent of the sample at each end of the income distribution and households with missing income were discarded.
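A minimal sketch of this sample restriction follows. It assumes that two percent is trimmed from each tail and that the trimming thresholds are computed after applying the age and missing-income filters; the text does not specify either point, and the function name is hypothetical.

```python
import numpy as np

def restrict_sample(head_age, income, min_age=25, max_age=62, trim=0.02):
    """Boolean mask of households kept in the estimation sample.

    Keeps households whose head is aged min_age..max_age, whose income is
    observed, and whose income lies between the `trim` and `1 - trim`
    quantiles of the eligible households (assumption: trimming per tail).
    """
    head_age = np.asarray(head_age, dtype=float)
    income = np.asarray(income, dtype=float)
    eligible = (head_age >= min_age) & (head_age <= max_age) & ~np.isnan(income)
    low, high = np.quantile(income[eligible], [trim, 1.0 - trim])
    return eligible & (income >= low) & (income <= high)
```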
Time invariant attributes and the income models
This section describes the income models estimated in the initial and terminal waves. It first introduces the criteria for selecting time-invariant attributes. The estimation of the returns to these attributes in 2002 and 2005 is discussed in a second stage.
Time invariant attributes
Time-invariant attributes can stem from multiple criteria and sources. Individual deterministic attributes such as year of birth, sex, educational achievement and ethnicity are the most natural set of characteristics. Depending on the issue of interest, the time horizon and the country studied, other variables can be obtained from household characteristics, such as household size, which could be introduced in terms of its demographic composition. One may also consider the location, the population density in the area of residence (urban or rural localities), and state or regional fixed effects, depending on the territorial representativeness of the survey. Needless to say, all variables ought to strictly follow the same definition and construction in all periods.
It is reasonable to question how realistic the assumption of ‘time invariance’ is for these attributes. In this respect, it helps to bear in mind that the longer the period between the cross-sections, the stricter the time-invariance criterion ought to be. The long-standing nature of these attributes is perhaps more important than the number of variables when designing the specification of the income model. Many variables, such as current employment status and occupation, are not strictly time-invariant and should readily be discarded, but this has to be assessed for the particular country under analysis. Other variables could be considered time-invariant under reasonable circumstances, such as marital status and high-value wealth possessions (dwelling or physical assets) during periods of economic stability.
Estimating the income model and residuals
We followed a graduated approach, using alternative model specifications to assess the sensitivity of the results to the selection of variables. All of them adhere to a strict criterion of time invariance. The first specification uses the head’s individual characteristics, such as gender, years of formal schooling and birth year, together with the household composition by age groups. It also includes variables for the size of the locality (urban/rural), marital status and regions. An alternative specification adds long-lasting productive assets such as real estate and farming assets (land for agricultural production and cattle), the household dwelling, and the possession of dwellings other than the one in use. This is our preferred model. See Annex 1 for descriptive statistics and OLS estimates.
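As an illustration of this step, the sketch below fits a log-income model by OLS and returns the residuals that feed the later stages. It is a generic least-squares fit under the assumption that the attributes are already coded as numeric columns (dummies for categories); the function name is hypothetical and the code is not the specification reported in Annex 1.

```python
import numpy as np

def fit_income_model(log_income, attributes):
    """OLS of log income on time-invariant attributes, with an intercept.

    Returns (coefficients, residuals); the first coefficient is the intercept.
    """
    y = np.asarray(log_income, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(attributes, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return beta, residuals
```

The same function would be applied separately to the 2002 and the 2005 cross-sections, so that the returns to each attribute are allowed to differ between waves.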
It is important to mention some restrictions encountered when trying to enrich the income model. The survey collected data on ethnicity, religious conviction and household head literacy. It also contains retrospective data such as the size of the city of birth, the year of marriage, the education of the household head’s parents, place of birth and migration records. Those attributes, like many others, were gathered by the survey but were ultimately not included in the income model due to: 1) high prevalence of missing data, 2) lack of statistical significance, or 3) extremely low frequency.
Although the proposed method does not assume normality of the residuals in either the initial or the final year, we tested this assumption for these models. For illustrative purposes, Graph 1 shows the kernel density of the (log) income residuals for the last model specification, in both years, and compares it with the normal distribution. These and other tests (skewness and kurtosis tests and Shapiro-Wilk normality tests; see the note on that graph) confirm that normality of the residuals is strongly rejected.
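The normality checks mentioned above can be sketched with SciPy. The particular tests chosen here (`skewtest`, `kurtosistest` and `shapiro`) are an assumption about how such checks might be run and need not match the exact implementation behind Graph 1; the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def normality_pvalues(residuals):
    """P-values of three normality checks applied to a residual vector."""
    r = np.asarray(residuals, dtype=float)
    return {
        "skewness": stats.skewtest(r)[1],       # H0: skewness of a normal
        "kurtosis": stats.kurtosistest(r)[1],   # H0: kurtosis of a normal
        "shapiro_wilk": stats.shapiro(r)[1],    # H0: sample is normal
    }
```

Small p-values lead to rejecting normality, which is the outcome reported for both waves.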
The autocorrelation coefficient and the calibration parameters
Estimating the autocorrelation coefficient is a central and sensitive task in this procedure. This section aims at obtaining an autocorrelation coefficient from the two waves of cross-sectional data at hand. First, observations were grouped by common characteristics. In our case, thirty-five clusters were obtained by interacting seven birth-year cohorts, of six years each, with five groups of education: incomplete primary education, complete primary but incomplete secondary education, complete secondary education but incomplete high school and complete high school or more.
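The grouping step can be sketched as follows. The first birth year of the earliest cohort is left as a parameter because the text does not state it, the education groups are assumed to be coded 0 to 4, and both function names are hypothetical. The sketch covers only the construction of the cells and their means, not the pseudo-panel estimation itself.

```python
import numpy as np

def assign_cells(birth_year, edu_group, first_year, width=6, n_cohorts=7, n_edu=5):
    """Cell index (0..n_cohorts*n_edu-1) from birth cohort and education group."""
    birth_year = np.asarray(birth_year)
    edu_group = np.asarray(edu_group)
    cohort = (birth_year - first_year) // width
    if ((cohort < 0) | (cohort >= n_cohorts)).any():
        raise ValueError("birth year outside the cohort range")
    if ((edu_group < 0) | (edu_group >= n_edu)).any():
        raise ValueError("education group outside 0..n_edu-1")
    return cohort * n_edu + edu_group

def cell_means(values, cells, n_cells=35):
    """Mean of `values` within each cell; NaN for empty cells."""
    values = np.asarray(values, dtype=float)
    sums = np.bincount(cells, weights=values, minlength=n_cells)
    counts = np.bincount(cells, minlength=n_cells)
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts
```

Cell means computed in each wave with the same cell definition provide the cohort-level observations on which pseudo-panel estimators are typically built.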
Table 1 shows separate estimates from equations 8 and 9. Results have the expected signs and order of magnitude. The genuine coefficient served here as a benchmark and appears to be around 0.25. Although the genuine parameter is close to this set of estimates, it is reassuring that the combined use of these two approaches, through a non-linear equation system, delivers a more accurate estimate whose confidence intervals are fully consistent with those from the actual panel.
The rho estimate and its corresponding 95% confidence interval enabled us to determine the set of calibration parameters from the empirical basis of two normal variables. We followed two regimes. The first regime employs the point estimate of rho. The second is based on 100 different values of rho, with the corresponding set of calibration parameters for each. In this case, rho is randomly drawn from a normal distribution within its 95% confidence interval. This means that the mean of these random draws corresponds to the point estimate, although individual draws may deviate from it. Table 2 shows the descriptive statistics of the resulting parameters for each model. These parameters characterize the distribution of the innovation terms used to compute the expected value of the mobility measures and their distribution, as described next.
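The second regime can be sketched as below. The text does not spell out the sampling scheme, so this sketch assumes a normal distribution centred on the point estimate, a standard deviation backed out from the width of the 95% interval, and rejection of draws that fall outside the interval. The interval bounds in the example are hypothetical, as is the function name.

```python
import numpy as np

def draw_rhos(rho_hat, ci_low, ci_high, n_draws=100, seed=0):
    """Draw rho values from a normal centred on rho_hat, kept within its 95% CI."""
    rng = np.random.default_rng(seed)
    # Assumption: the CI is symmetric and equals rho_hat +/- 1.96 standard errors.
    sd = (ci_high - ci_low) / (2 * 1.96)
    draws = np.empty(0)
    while draws.size < n_draws:
        candidates = rng.normal(rho_hat, sd, size=2 * n_draws)
        inside = (candidates >= ci_low) & (candidates <= ci_high)
        draws = np.concatenate([draws, candidates[inside]])
    return draws[:n_draws]

# Hypothetical interval around a point estimate of 0.25.
rhos = draw_rhos(0.25, 0.15, 0.35)
```

Each drawn value would then be passed to the calibration step to obtain its own set of calibration parameters.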
Estimation results
This section provides empirical estimates from a household-level synthetic panel over a period characterized by positive economic growth in Mexico. We first examined the shape of the synthetic income distribution for 2005 compared with the genuine income distribution. Graph 2 shows the kernel density of one hundred simulations to provide a first visual assessment of the shape of the distribution at every income level. This preliminary inspection shows that even a basic model specification is capable of reproducing the shape of the actual income distribution.
Table 3 shows the resulting synthetic panel 2002-2005 through a transition matrix defined on the income brackets given by the quintiles of 2002. This means that the marginal income distribution in the base year is the same in the genuine and the synthetic panels by construction. Each row shows the movement of individuals that belonged to a specific income quintile in the baseline relative to the same real income thresholds in the final year. The table contains three sections. Sections 1 and 3 correspond to two different regimes: regime 1 computes the virtual income using the point estimate of rho and 500 repetitions, whereas regime 2 uses 100 random draws of rho and the corresponding calibration parameters for each. Section 2 contains the genuine estimates. In most cases, the synthetic figures appear close to the genuine ones and fall within their 95% confidence intervals (reported in parentheses). As expected, working with various values of rho, i.e. regime 2, delivers slightly larger confidence intervals. In general, both the genuine and the synthetic panels suggest a process of upward mobility, implied by a reduction in the share of households below the income limit of the first quintile from 2002 to 2005.
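A transition matrix of this kind can be sketched as follows: the bracket limits are the quintiles of the base-year distribution, and the same real thresholds are applied to final-year incomes. The function name is hypothetical and the sketch ignores sampling weights, which an actual application would likely need.

```python
import numpy as np

def transition_matrix(income_start, income_end, n_groups=5):
    """Row-normalised transition matrix over base-year quantile brackets."""
    y0 = np.asarray(income_start, dtype=float)
    y1 = np.asarray(income_end, dtype=float)
    # Bracket limits: interior quantiles of the base-year distribution.
    cuts = np.quantile(y0, np.arange(1, n_groups) / n_groups)
    # An income equal to a limit is assigned to the lower bracket.
    origin = np.searchsorted(cuts, y0, side="left")
    destination = np.searchsorted(cuts, y1, side="left")
    counts = np.zeros((n_groups, n_groups))
    np.add.at(counts, (origin, destination), 1)
    return counts / counts.sum(axis=1, keepdims=True)
```

Because the thresholds are fixed in real terms, the columns need not each hold 20% of the sample in the final year: general income growth shows up as mass moving towards the upper brackets.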
We also used the Mann-Whitney test to assess the synthetic rank distribution of 2005 conditional on the rank at the origin. The test delivers a statistic based on the difference between the sums of the ranks of the two distributions: the genuine and the synthetic one. To increase the sensitivity of the test, we use twenty equally sized groups. Results in Table 4 show that our synthetic estimates satisfactorily reproduce the dynamics described by the genuine panel at almost all points of the distribution, the exception being the ventile at the bottom of the 2002 income distribution. At least 90% of samples passed this test in most of the remaining groups.
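The test can be sketched for a single simulated panel as below, using SciPy's `mannwhitneyu`. The sketch assumes that households are already assigned to one of twenty origin groups (coded 0 to 19) and that genuine and synthetic final-year incomes are compared within each group; the function name is hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def mann_whitney_by_origin(origin_group, genuine_end, synthetic_end, n_groups=20):
    """Two-sided Mann-Whitney p-value per origin group (NaN for empty groups)."""
    origin_group = np.asarray(origin_group)
    genuine_end = np.asarray(genuine_end, dtype=float)
    synthetic_end = np.asarray(synthetic_end, dtype=float)
    pvalues = np.full(n_groups, np.nan)
    for g in range(n_groups):
        in_group = origin_group == g
        if in_group.any():
            pvalues[g] = mannwhitneyu(
                genuine_end[in_group], synthetic_end[in_group],
                alternative="two-sided",
            )[1]
    return pvalues
```

Repeating the call over all simulated panels and recording the share of p-values above the chosen significance level in each group gives a pass rate of the kind reported in Table 4.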
Poverty dynamics are the most popular empirical application of this type of procedure. To illustrate the performance of this approach on this issue, we computed two sets of poverty transitions, into and out of poverty, using the upper limits of the first two income quintiles as poverty thresholds. These thresholds constitute a direct reference to the ‘shared prosperity’ goal recently adopted by the World Bank. Table 5 shows that the proposed approach delivers an encouraging approximation to the actual figures in all poverty transitions. For instance, our estimate of persistent poverty (being poor in both periods) for rho = 0.20 (and 0.30), using the first poverty line, is 6.5% (7.3% respectively), whereas the actual figure is 6.6%. The largest difference is found in the downward mobility group. Larger values of rho, also in Table 5, illustrate the sensitivity of this type of methodology to this parameter. Interestingly, substantial differences emerge when using a correlation coefficient that departs from the actual parameter. Note that this occurs on top of the calibration procedure implemented here. These results reinforce the utmost importance of this parameter for this and similar methodological approaches.
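The four poverty transitions can be sketched as follows. The poverty line is taken as a quantile of the base-year income distribution (0.2 or 0.4 for the two thresholds described above) and is held fixed in real terms across years. The function name and the returned labels are hypothetical, and treating a household with income exactly on the line as poor is an assumption.

```python
import numpy as np

def poverty_transitions(income_start, income_end, quantile=0.2):
    """Population shares of the four poverty transitions for a fixed real line."""
    y0 = np.asarray(income_start, dtype=float)
    y1 = np.asarray(income_end, dtype=float)
    line = np.quantile(y0, quantile)  # upper limit of the base-year bracket
    poor0 = y0 <= line
    poor1 = y1 <= line
    return {
        "persistent_poverty": np.mean(poor0 & poor1),
        "exit_from_poverty": np.mean(poor0 & ~poor1),
        "entry_into_poverty": np.mean(~poor0 & poor1),
        "never_poor": np.mean(~poor0 & ~poor1),
    }
```

Applying the function to each simulated 2005 income vector and averaging the shares gives synthetic counterparts to the genuine transitions.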
Table of contents :
Introduction in English
From snapshots to a motion picture: On synthetic income panels
Income mobility over a generation: a ‘long run’ motion picture
Mobility across three generations
Résumé Substantiel en Français
Chapter 1: On synthetic income panels
2. The construction of a synthetic panel
3. Construction and validation of a synthetic panel in Mexico: 2002-2005
4. The autocorrelation coefficient and the calibration parameters
5. Estimation results
6. Concluding remarks
Appendix 1. Algorithm to calibrate the distribution of the innovation terms
Chapter 2: Three decades of income mobility with synthetic panels: empirical evidence from Mexico
2. Analytical framework
3. Trends of income mobility in Mexico
4. Concluding remarks
Chapter 3: Intergenerational transmission of education across three generations
2. Data and sample
3. The general approach to the intergenerational transmission of education
4. The effect of grandfather’s education (G0) on their offspring’s education (G1 and G2)
5. The effect of parents’ education on their offspring, G1-G2
A. Héctor Moreno M | PhD dissertation | Université Paris 1 – Sorbonne | Paris School of Economics
6. The ‘conditioned’ effect of grandparents’ education on their grandchildren, G0-G2
7. Discussion. The direct vs. the extrapolated effect on long run mobility (G02)