Nonparametric estimation in conditional moment restricted models via Generalized Empirical Likelihood 


Relaxing the i.i.d. assumption matters in econometrics

There are many natural reasons to go beyond the i.i.d. assumption. Perhaps the most pervasive one is time: in time series or panel data, observations are allowed to be dependent over time and to have a time-varying distribution. As a result, they are neither independent nor identically distributed. We do not discuss the role of time in statistical modelling any further, since time is never a core element of the models studied in this dissertation.
Even in the context of cross-sectional data (data that is not indexed by time), i.i.d.-ness is often deemed implausible by applied econometricians. Let us take a simple example: we observe a sample of $n$ workers and we have information on their commuter zone and industry. It is quite standard to allow for unobserved aggregate economic shocks at the geographical-area and industry levels ([1, 27, 110]). The goal is then to build confidence sets (CSs) that are robust to the presence of such shocks, where a CS is called robust if it has (asymptotic) coverage at the desired level whether or not the data are i.i.d. The i.i.d. assumption is also not very credible with interaction data, that is, data that stems from the interactions of the individuals of one population among themselves. In this setting, datasets have the form $(W_{i,j})_{1 \le i \ne j \le n}$, where $W_{i,j}$ is an observation relative to the pair formed by individuals $i$ and $j$. These notions of cross-sectional dependence exist in other statistical fields such as spatial statistics or network analysis. In those fields, however, dependence tends to be the main topic of interest: a model of the dependence structure is posited and the goal is to recover its parameters. In econometrics (or at least part of it), the aim is quite different: dependence is mainly seen as a nuisance that has to be accounted for to conduct valid inference on some other quantity. Cross-sectional dependence is at the heart of Chapter 4.
In the preceding paragraph we did not relax the assumption that observations are identically distributed. We never give up on that assumption in this dissertation and view it as quite fundamental (except in the case of data that exhibits a time dimension): as a matter of fact, it seems fairly natural to assume that two individuals from the same sample, no matter how different they may be in terms of education or wage for instance, are simply two distinct draws from the same distribution. Some researchers have a different view on the matter: they take the observed explanatory variables $(Z_{e,i})_{i=1}^{n}$ as fixed and nonrandom, which leads to a non-identically distributed sample (see Chapter 2.8 in [136]).

Causality and machine learning

Causality is one of the pillars of the econometric discipline. The notion became popular in econometrics following an article by Donald Rubin ([124]). It relies on a thought experiment: there exist two states of nature (labelled 0 and 1) and each individual is placed in one of the two. Individuals are given an outcome variable $Z_o(0)$ or $Z_o(1)$ depending on which state they are in. At the individual level, the causal impact of changing states is simply the difference $Z_o(1) - Z_o(0)$. Why is causality interesting in econometrics? It is a convenient framework to model the impact of a public policy at the aggregate level. If the government could observe $Z_o(1) - Z_o(0)$ for everybody, it could measure the consequences of making people switch states according to some predefined criterion. In this context, enforcing a public policy is equivalent to making individuals switch states.
In reality, the government observes either $Z_o(1)$ or $Z_o(0)$ but never both: the causal framework is an example of a missing data problem in statistics ([121]). Denoting by $D$ the state individuals are in, the government only observes $Z_o = D Z_o(1) + (1-D) Z_o(0)$. Without further restrictions, it is only possible
to recover $Q_{Z_o(1) \mid D=1}$ and $Q_{Z_o(0) \mid D=0}$. Imposing further that $(Z_o(1), Z_o(0)) \perp\!\!\!\perp D$ ensures that $Q_{Z_o(1) \mid D=1} = Q_{Z_o(1)}$ and $Q_{Z_o(0) \mid D=0} = Q_{Z_o(0)}$. We refer to [81] for a thorough presentation of the identification question in Rubin's causal framework. Identifying $Q_{Z_o(0)}$ and $Q_{Z_o(1)}$ allows one to compute the average change associated with the treatment $D$, $E_{Q_{Z_o(0),Z_o(1)}}[Z_o(1) - Z_o(0)]$, or the change in the $\tau$-th quantile, $q_{Q_{Z_o(1)}}(\tau) - q_{Q_{Z_o(0)}}(\tau)$. On the other hand, it does not allow one to recover the $\tau$-th quantile of the treatment effect, $q_{Q_{Z_o(1)-Z_o(0)}}(\tau)$. To get $q_{Q_{Z_o(1)}}(\tau) - q_{Q_{Z_o(0)}}(\tau) = q_{Q_{Z_o(1)-Z_o(0)}}(\tau)$, one has to assume that the rank of an individual under $Q_{Z_o(0)}$ is the same under $Q_{Z_o(1)}$ (rank invariance property, cf. [65]).
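For the mean, this identification argument reduces to a one-line computation (using $D \in \{0, 1\}$):
$$E[Z_o \mid D = 1] = E[D Z_o(1) + (1-D) Z_o(0) \mid D = 1] = E[Z_o(1) \mid D = 1] = E[Z_o(1)],$$
where the last equality uses $(Z_o(1), Z_o(0)) \perp\!\!\!\perp D$; the same computation with $D = 0$ recovers $E[Z_o(0)]$.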
In the remainder of this paragraph, we focus on the parameter $E_{Q_{Z_o(0),Z_o(1)}}[Z_o(1) - Z_o(0)]$, the average treatment effect. One drawback of the assumption $(Z_o(1), Z_o(0)) \perp\!\!\!\perp D$ is that it is not testable. It is often replaced with $(Z_o(1), Z_o(0)) \perp\!\!\!\perp D \mid Z_e$, which is not testable either but strictly weaker. Under this last assumption, one can show ([81]) that the average treatment effect equals $E_{Q_{Z_e}}\big[E[Z_o \mid D = 1, Z_e] - E[Z_o \mid D = 0, Z_e]\big]$. The right-hand side depends only on observable variables. Researchers are mainly interested in two tasks: i) estimation of and inference on the average treatment effect; ii) testing for heterogeneity of the treatment effect across individual profiles $z_e$. This second goal consists in testing whether $E_{Q_{(Z_o(0),Z_o(1)) \mid Z_e}}[Z_o(1) - Z_o(0) \mid Z_e = z_1] = E_{Q_{(Z_o(0),Z_o(1)) \mid Z_e}}[Z_o(1) - Z_o(0) \mid Z_e = z_2]$ when $z_1 \neq z_2$. In both cases, a first step consists in estimating the functions $E[Z_o \mid D = 1, Z_e = \cdot]$ and $E[Z_o \mid D = 0, Z_e = \cdot]$ (only evaluated at the points $z_1$ and $z_2$ in the second case). How can these functions be estimated in a flexible fashion? One possibility is to use classical nonparametric tools such as Nadaraya-Watson or local linear regressions ([135]). The theoretical guarantees of these methods have long been established ([59, 70]); their main limitation is their poor practical performance when the dimension of $Z_e$ is large. On the other hand, machine learning techniques such as random forests or deep neural networks perform well in simulations even when the dimension of $Z_e$ is large, but their theoretical properties are much less well understood. Recent efforts from both the econometrics and statistical learning communities have led to theoretical advances on machine learning algorithms: Theorem 3 in [71] shows the asymptotic normality of an estimator of the average treatment effect based on a deep neural network architecture, and [138] prove the asymptotic normality of a random forest method to estimate $E_{Q_{(Z_o(0),Z_o(1)) \mid Z_e}}[Z_o(1) - Z_o(0) \mid Z_e = z_e]$ for a fixed $z_e$. Quite interestingly, these theoretical properties are not very different from those of more classical nonparametric tools: deep neural networks have been shown to work for exactly the same function classes as classical nonparametric tools and suffer from the same curse of dimensionality in $Z_e$; random forests can approximate functions that are less smooth than standard methods allow, but are still subject to the curse of dimensionality.
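To fix ideas, here is a minimal sketch of this first step: Nadaraya-Watson estimates of the two regression functions, followed by the plug-in average. The code is illustrative only; the Gaussian kernel, the fixed bandwidth, and all names are our assumptions, not the chapter's procedure.

```python
import numpy as np

def nadaraya_watson(x_eval, X, Y, h):
    # Nadaraya-Watson estimate of E[Y | X = x_eval] with a Gaussian kernel.
    w = np.exp(-0.5 * np.sum(((X - x_eval) / h) ** 2, axis=1))
    return np.sum(w * Y) / np.sum(w)

def ate_plug_in(Zo, D, Ze, h=0.5):
    # Plug-in analogue of E_{Ze}[ E[Zo | D=1, Ze] - E[Zo | D=0, Ze] ]:
    # estimate both regression functions, average their difference over the sample.
    m1 = [nadaraya_watson(z, Ze[D == 1], Zo[D == 1], h) for z in Ze]
    m0 = [nadaraya_watson(z, Ze[D == 0], Zo[D == 0], h) for z in Ze]
    return np.mean(np.array(m1) - np.array(m0))

# Toy data with a true average treatment effect of 2.
rng = np.random.default_rng(0)
n = 500
Ze = rng.normal(size=(n, 1))
D = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-Ze[:, 0]))).astype(int)
Zo = Ze[:, 0] + 2.0 * D + rng.normal(size=n)
print(ate_plug_in(Zo, D, Ze))  # should be close to 2
```

In practice the bandwidth would be chosen in a data-driven way (e.g. by cross-validation), and this is exactly where the curse of dimensionality bites: the local averages become unreliable as the dimension of $Z_e$ grows.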

Summary of Chapter 3

In this chapter, we focus on the generic problem (2.6). As explained before, most research articles interested in this problem build an estimator based on (2.7) ([3], [112], [20], and [37] to name a few). There are, however, other ways to construct an estimator for this class of problems, and we study the family of Generalized Empirical Likelihood (GEL) estimators ([113], [99]). To present GEL estimators, it is easier to start with a simplified version of (2.6): we assume that $h$ is replaced with a finite-dimensional parameter $\beta \in B$ and that the true value $\beta^*$ is such that $E_{Q_Z}[\psi(Z, \beta)] = 0 \iff \beta = \beta^*$. As explained in [113, 99], $\beta^*$ can equivalently be identified by
$$\beta^* = \operatorname*{argmin}_{\beta \in B} \; \sup_{\lambda \in \Lambda(\beta, Q_Z)} E_{Q_Z}\big[\rho_\gamma\big(\lambda' \psi(Z, \beta)\big)\big], \qquad (2.8)$$
where $\Lambda(\beta, Q_Z) := \bigcap_{z \in \operatorname{supp}(Q_Z)} \{\lambda : \rho_\gamma(\lambda' \psi(z, \beta)) \text{ exists}\}$ and $\rho_\gamma : u \mapsto -(\gamma u + 1)^{(\gamma+1)/\gamma}/(\gamma + 1)$. Taking the sample analogue of this saddle-point problem yields one estimator for each function $\rho_\gamma$: we thus obtain a family of estimators called the GEL family. The most popular estimators in this class are the Empirical Likelihood (EL) estimator, popularized by [117], the Exponential Tilting (ET) estimator of [100], and the Continuously Updating Estimator (CUE) of [88]. The previous ideas extend to problems of the form (2.6): [93] shows that (2.6) can be reformulated in the form of (2.8) with a number of moment equalities that diverges with $n$, so that $h^*$ is the unique parameter value solving saddle-point problems of the form (2.9) or (2.10). To construct GEL estimators, the previously cited articles take the sample analogue of (2.9) or (2.10) and use a nonparametric estimator to approximate $E_{Q_{Z \mid X}}[\,\cdot \mid X = \cdot\,]$. Very few contributions allow $h$ to be infinite-dimensional; the main ones are [116] and [40].
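In the standard parametrization of this Cressie-Read family (up to affine normalizations of $\rho_\gamma$ that do not affect the estimator), EL corresponds to the limit $\gamma \to -1$ ($\rho(u) = \ln(1-u)$), ET to $\gamma \to 0$ ($\rho(u) = -e^u$), and CUE to $\gamma = 1$ ($\rho(u) = -(1+u)^2/2$). As a minimal illustration of the sample saddle point, the sketch below (our toy code, with a grid search for transparency; not the chapter's procedure) computes the CUE member for the simplest moment condition $\psi(Z, \beta) = Z - \beta$:

```python
import numpy as np
from scipy.optimize import minimize

def psi(Z, beta):
    # Toy moment function: E[psi(Z, beta*)] = 0 identifies beta* = E[Z].
    return Z - beta

def gel_profile(beta, Z, rho):
    # Sample analogue of sup_lambda E[rho(lambda' psi(Z, beta))]:
    # maximize the empirical mean of rho over the scalar lambda.
    neg = lambda lam: -np.mean(rho(lam[0] * psi(Z, beta)))
    return -minimize(neg, x0=[0.0], method="BFGS").fun

rho_cue = lambda u: -(1.0 + u) ** 2 / 2.0  # CUE member (gamma = 1)

rng = np.random.default_rng(0)
Z = rng.exponential(scale=2.0, size=400)  # beta* = E[Z] = 2

# Outer minimization over beta, done on a grid for transparency.
grid = np.linspace(0.5, 4.0, 200)
beta_hat = grid[np.argmin([gel_profile(b, Z, rho_cue) for b in grid])]
print(beta_hat)  # close to the sample mean of Z
```

For the quadratic $\rho_{\text{CUE}}$, the inner supremum is available in closed form and the outer problem reduces to continuously updated GMM, which is why the minimizer lands near the sample mean here.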
The class of functions $\mathcal{H}$ is always chosen as a subset of a metric space endowed with a norm $\|\cdot\|_{\mathcal{H}}$. This metric space is usually taken to be the space of square-integrable functions with respect to the Lebesgue measure, or the space of uniformly bounded functions with respect to the same measure. In [116], the author focuses on models where $\psi$ is a smooth function of $h$ and studies the behaviour of the EL estimator based on (2.10). He uses a Nadaraya-Watson method to estimate $E_{Q_{Z \mid X}}[\,\cdot \mid X = \cdot\,]$. His main results are the consistency of his estimator in $\|\cdot\|_{\mathcal{H}}$-norm and the asymptotic normality of a specific functional of his estimator. The main limitation of this article is that unnecessary restrictions are placed on the parameter space $\mathcal{H}$ to avoid the use of regularization. In [40], the authors study the behaviour of the whole family of GEL estimators based on (2.9) in one specific model, namely the Nonparametric Quantile Instrumental Variables (NPQIV) model ([41]). The NPQIV does not satisfy the smoothness property on the function $\psi$ and is not covered by [116]. Larger classes of functions $\mathcal{H}$ are considered than in [116] thanks to a regularization term added to the estimation procedure. Their main results are the consistency with rate in $\|\cdot\|_{\mathcal{H}}$-norm and the asymptotic normality of a large class of functionals of the estimator.
In Chapter 3, we study the properties of the whole GEL family of estimators for a class of functions $\psi$, and therefore of models, that encompasses both those studied in [116] and the NPQIV. Similar to [40], we consider a regularized estimation procedure and allow for larger classes $\mathcal{H}$ than those in [116]. One specificity of our approach is that we rely on a slightly modified version of (2.10) to build our estimation method. As explained earlier, the use of regularization does not mean that $\mathcal{H}$ can be taken arbitrarily large: we assume that $\mathcal{H}$ is a subset of the space of square-integrable functions with respect to the Lebesgue measure that contains functions that are differentiable up to a certain order with all partial derivatives square-integrable. Let us denote by $\|\cdot\|_{L^2(\mathrm{leb})}$ the norm on the space of square-integrable functions with respect to the Lebesgue measure. In our work, we prove the consistency (without rate) of our estimators in $\|\cdot\|_{L^2(\mathrm{leb})}$-norm and we derive an upper bound on the rate at which $E_{Q_X}\big[\big\|E_{Q_{Z \mid X}}[\psi(Z, \hat{h}_n) \mid X]\big\|_2\big]$ converges to 0. We prove a generic slow rate that requires weak moment assumptions and we show that the rate can be improved under more stringent moment conditions. We also discuss how those results could be used to derive consistency with rate of our estimators in $\|\cdot\|_{L^2(\mathrm{leb})}$-norm. As we recall in Chapter 3, the key to obtaining the latter is to control the ratio $\|h - h^*\|_{L^2(\mathrm{leb})} \,/\, E_{Q_X}\big[\big\|E_{Q_{Z \mid X}}[\psi(Z, h) \mid X]\big\|_2\big]$ uniformly over $h$ in a suitable neighbourhood of $h^*$. This ratio compares a norm in the numerator with a quantity in the denominator that can, loosely speaking, be seen as a weaker norm. The supremum of the ratio is sometimes called the degree of ill-posedness of the model ([37]).
A large body of work has investigated, and is still actively searching for, general sufficient conditions under which the degree of ill-posedness can be controlled (see [39, 33] for extensive reviews). As we explain at the end of Chapter 3, we believe there is still room to find more transparent conditions to control the degree of ill-posedness. This is definitely an avenue for future research, with implications beyond the models considered in this chapter. Other relevant extensions of our results are: i) to derive the asymptotic normality of the same class of functionals as in [40]; ii) in a more statistics-oriented direction, to build oracle inequalities on the estimation performance of our estimator.
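For reference, one common way to formalize this notion, in the spirit of the sieve literature ([37, 39]) and possibly differing in details from the chapter's definition, is the measure of ill-posedness
$$\tau(\delta) := \sup \left\{ \frac{\|h - h^*\|_{L^2(\mathrm{leb})}}{E_{Q_X}\big[\big\|E_{Q_{Z \mid X}}[\psi(Z, h) \mid X]\big\|_2\big]} \;:\; h \in \mathcal{H},\ 0 < \|h - h^*\|_{L^2(\mathrm{leb})} \le \delta \right\},$$
so that any rate $r_n$ on $E_{Q_X}\big[\|E_{Q_{Z \mid X}}[\psi(Z, \hat{h}_n) \mid X]\|_2\big]$ immediately yields the rate $\tau(\delta)\, r_n$ on $\|\hat{h}_n - h^*\|_{L^2(\mathrm{leb})}$, provided $\tau(\delta)$ is finite and $\hat{h}_n$ eventually lies in the $\delta$-neighbourhood of $h^*$.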


Summary of Chapter 4

Even with cross-sectional data, the i.i.d. assumption can be too restrictive. In applied econometrics, it is often plausible that the data is affected by several sources of aggregate shocks. Suppose you observe several economic variables at the industry-area level: the data can be written $(Z_{i_1,i_2})_{1 \le i_1 \le n_1,\, 1 \le i_2 \le n_2}$, where $n_1$ (resp. $n_2$) is the number of industries (resp. areas). Observations correspond to industry-area cells, and they are likely to be correlated whenever they share the same industry or area, because of shocks at the industry or area level. One usually says that the data is clustered at the industry and area levels; this is an instance of multiway clustering. Polyadic data are another data type that naturally exhibits dependence: polyadic data stem from the interactions of several individuals from the same population. Data on interactions between pairs of individuals, called dyadic data, are the most common; dyadic data can be written $(Z_{i_1,i_2})_{1 \le i_1 \ne i_2 \le n}$. Intuitively, polyadic data should exhibit more dependence than multiway-clustered data: in the first case, observations are dependent because of shocks that stem from a single population, while in the second case, shocks come from two distinct sources. To capture these ideas, we impose that the data be jointly exchangeable in the polyadic case and separately exchangeable under multiway clustering. The two notions of exchangeability are presented in great detail in [96]. Those assumptions are powerful as they allow us to use deep and very useful probabilistic results ([89, 4, 95]) which ensure that the data can be represented in terms of a series of independent shocks in the different dimensions, as illustrated below. While separate exchangeability is a subcase of joint exchangeability, we still have to handle multiway clustering on its own: the unbalanced number of clusters in each dimension makes the problem more complicated. Quite importantly, exchangeability implies that observations remain identically distributed: the dependence we introduce is therefore very different from time-series dependence.
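Concretely, these representation results take the following standard form from the probability literature (we gloss over dissociation and ergodicity caveats, and the chapter's exact statements may differ): a jointly exchangeable dyadic array and a separately exchangeable array can be written as
$$Z_{i_1,i_2} = f\big(\alpha,\; \xi_{i_1},\; \xi_{i_2},\; \zeta_{\{i_1,i_2\}}\big) \quad \text{and} \quad Z_{i_1,i_2} = g\big(\alpha,\; \xi_{i_1},\; \eta_{i_2},\; \zeta_{i_1,i_2}\big),$$
respectively, where $\alpha$, $(\xi_i)$, $(\eta_j)$ and $(\zeta_{i,j})$ are mutually independent uniform random variables and $f, g$ are measurable functions: all the dependence is channelled through a small number of independent shocks.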
When exchangeability is assumed instead of the i.i.d. assumption, the construction of estimators is not affected. However, one still has to show that estimators remain consistent and asymptotically normal. Existing results are mostly concerned with sample means and linear regression models: in the jointly exchangeable case, asymptotic normality of sample means can be traced back to [66], and the asymptotic normality of t-statistics in linear regression models is studied in [131]; under multiway clustering, [109] studies the limit in distribution of sample means when the number of relevant clustering dimensions is unknown and shows the consistency of a bootstrap procedure (we define bootstrap procedures a few lines below). A number of articles also propose estimators of the asymptotic variance for a large class of models without proving their consistency ([69, 30]). When one is interested in models beyond the linear regression case, theoretical results for sample means are in general not enough. In the i.i.d. case, a powerful generic approach consists in controlling the asymptotic behaviour of the empirical process associated with the model (see [137] for more details). We extend well-known results on empirical processes in the i.i.d. case to multiway-clustered and polyadic data. To do so, the definition of an empirical process has to be modified. As an example, under two-way clustering, the empirical process associated with a class of functions $\mathcal{F}$ is the random map $\mathbb{G}_{n_1,n_2} : f \in \mathcal{F} \mapsto \sqrt{n_1 \wedge n_2}\,\Big(\frac{1}{n_1 n_2}\sum_{i_1=1}^{n_1}\sum_{i_2=1}^{n_2} f(Z_{i_1,i_2}) - E[f(Z_{1,1})]\Big)$.
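As a quick numerical illustration of this scaling (a simulation sketch of ours, not code from the chapter; the additive-shock design instantiates the representation displayed above), the mean of a two-way array rescaled by $\sqrt{n_1 \wedge n_2}$ has a variance that stabilizes rather than vanishes:

```python
import numpy as np

rng = np.random.default_rng(0)

def two_way_sample_mean(n1, n2, rng):
    # One draw of a separately exchangeable array with additive shocks,
    # Z_{i1,i2} = xi_{i1} + eta_{i2} + eps_{i1,i2}, and its sample mean.
    # E[Z] = 0, so the sample mean is already centered in expectation.
    xi = rng.normal(size=(n1, 1))    # shocks in the first dimension
    eta = rng.normal(size=(1, n2))   # shocks in the second dimension
    eps = rng.normal(size=(n1, n2))  # cell-level noise
    return (xi + eta + eps).mean()

# Rescaled by sqrt(min(n1, n2)), the variance of the sample mean stabilizes
# (here around 1 + n1/n2 = 1.5 since we keep n2 = 2 * n1) instead of
# vanishing: this is the scaling used to define G_{n1,n2}.
for n1, n2 in [(20, 40), (80, 160), (320, 640)]:
    draws = [two_way_sample_mean(n1, n2, rng) for _ in range(1000)]
    print(n1, n2, np.var(np.sqrt(min(n1, n2)) * np.array(draws)))
```

Under the classical i.i.d. scaling $\sqrt{n_1 n_2}$, the same rescaled means would diverge, which is why the slower $\sqrt{n_1 \wedge n_2}$ normalization is the natural one for these arrays.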

Table of contents:

Remerciements
1 Introduction/ Résumé substantiel en français
2 Introduction in English
3 Nonparametric estimation in conditional moment restricted models via Generalized Empirical Likelihood 
3.1 Introduction
3.2 A general presentation of GEL estimators
3.2.1 From GMCs to GELs
3.2.2 Construction of the estimation procedure
3.3 Results
3.3.1 Consistency
3.3.2 Rate
3.4 Conclusion
3.5 Proofs of the main results
3.5.1 Proof of Theorem 3.1
3.5.2 Proof of Theorem 3.2
3.6 Appendix
3.6.1 Lemmas
3.6.2 Proofs
3.6.2.1 Proof of Lemma 3.2
3.6.2.2 Proof of Lemma 3.3
3.6.2.3 Proof of Lemma 3.4
3.6.2.4 Proof of Lemma 3.5
3.6.2.5 Proof of Lemma 3.6
3.6.2.6 Proof of Lemma 3.7
3.6.2.7 Proof of Lemma 3.8
3.6.2.8 Proof of Lemma 3.9
3.6.2.9 Proof of Lemma 3.10
3.6.2.10 Proof of Lemma 3.11
3.6.2.11 Proof of Lemma 3.12
3.6.2.12 Proof of Lemma 3.13
3.6.2.13 Proof of Lemma 3.14
4 Empirical Process Results for Exchangeable Arrays 
4.1 Introduction
4.2 The set up and main results
4.2.1 Set up
4.2.2 Uniform laws of large numbers and central limit theorems
4.2.3 Convergence of the bootstrap process
4.2.4 Application to nonlinear estimators
4.3 Extensions
4.3.1 Heterogeneous number of observations
4.3.2 Separately exchangeable arrays
4.4 Simulations and real data example
4.4.1 Monte Carlo simulations
4.4.2 Application to international trade data
4.5 Conclusion
4.6 Appendix A
4.6.1 Proof of Lemma 2.2
4.6.1.1 A decoupling inequality
4.7 Appendix B
4.7.1 Proofs of the main results
4.7.1.1 Lemma 4.3
4.7.1.2 Theorem 4.1
4.7.1.2.1 Uniform law of large numbers
4.7.1.2.2 Uniform central limit theorem
4.7.1.3 Theorem 4.2
4.7.1.4 Theorem 4.3
4.7.1.5 Theorem 4.4
4.7.2 Proofs of the extensions
4.7.2.1 Theorem 4.5
4.7.2.1.1 Uniform law of large numbers
4.7.2.1.2 Uniform central limit theorem
4.7.2.2 Convergence of the bootstrap process
4.7.2.3 Theorem 4.6
4.7.2.3.1 Uniform law of large numbers
4.7.2.3.2 Uniform central limit theorem
4.7.2.3.3 Convergence of the bootstrap process
4.7.3 Technical lemmas
4.7.3.1 Results related to the symmetrisation lemma
4.7.3.1.1 Proof of Lemma S4.4
4.7.3.1.2 Proof of Lemma S4.5
4.7.3.2 Results related to laws of large numbers
4.7.3.2.1 Proof of Lemma S4.6
4.7.3.2.2 Proof of Lemma S4.7
4.7.3.2.3 Proof of Lemma S4.8
4.7.3.2.4 Proof of Lemma S4.9
4.7.3.3 Covering and entropic integrals
4.7.3.3.1 Proof of Lemma S4.10
4.7.3.3.2 Proof of Lemma S4.11
5 On the construction of confidence intervals for ratios of expectations 
5.1 Introduction
5.2 Our framework
5.3 Limitations of the delta method: when are asymptotic confidence intervals valid?
5.3.1 Asymptotic approximation takes time to hold
5.3.2 Asymptotic results may not hold in the sequence-of-model framework
5.3.3 Extension of the delta method for ratios of expectations in the sequence-of-model framework
5.3.4 Validity of the nonparametric bootstrap for sequences of models
5.4 Construction of nonasymptotic confidence intervals for ratios of expectations
5.4.1 An easy case: the support of the denominator is well-separated from 0
5.4.2 General case: no assumption on the support of the denominator
5.5 Nonasymptotic CIs: impossibility results and practical guidelines
5.5.1 An upper bound on testable confidence levels
5.5.2 A lower bound on the length of nonasymptotic confidence intervals
5.5.3 Practical methods and plug-in estimators
5.6 Numerical applications
5.6.1 Simulations
5.6.2 Application to real data
5.7 Conclusion
5.8 General definitions about confidence intervals
5.9 Proofs of the results in Sections 5.3, 5.4 and 5.5
5.9.1 Proof of Theorem 5.1
5.9.2 Proof of Theorem 5.2
5.9.2.1 Proof of Lemma 5.4
5.9.2.2 Proof of Lemma 5.5
5.9.3 Proof of Example 5.3
5.9.4 Proof of Theorem 5.3
5.9.5 Proof of Theorem 5.4
5.9.5.1 Proof of Lemma 5.6
5.9.6 Proof of Theorem 5.5
5.9.6.1 Proof of Lemma 5.7
5.9.7 Proof of Theorem 5.6
5.9.7.1 Proof of Lemma 5.8
5.10 Adapted results for “Hoeffding” framework
5.10.1 Concentration inequality in an easy case: the support of the denominator is well-separated from 0
5.10.2 Concentration inequality in the general case
5.10.3 An upper bound on testable confidence levels
5.10.4 Proof of Theorems 5.8 and 5.9
5.10.5 Proof of Theorem 5.10
5.10.5.1 Proof of Lemma 5.9
5.11 Additional simulations
5.11.1 Gaussian distributions
5.11.2 Student distributions
5.11.3 Exponential distributions
5.11.4 Pareto distributions
5.11.5 Bernoulli distributions
5.11.6 Poisson distributions
5.11.7 Delta method and nonparametric percentile bootstrap confidence intervals
6 Fuzzy Differences-in-Differences with Stata 
6.1 Introduction
6.2 Set-up
6.2.1 Parameters of interest, assumptions, and estimands
6.2.2 Estimators
6.3 Extensions
6.3.1 Including covariates
6.3.2 Multiple periods and groups
6.3.3 Other extensions
6.3.3.1 Special cases
6.3.3.2 No “stable” control group
6.3.3.3 Non-binary treatment
6.4 The fuzzydid command
6.4.1 Syntax
6.4.2 Description
6.4.3 Options
6.4.4 Saved results
6.5 Example
6.6 Monte Carlo Simulations
6.7 Conclusion

