
## Measuring segregation on small units

**Introduction**

We consider a population made of two groups (minority and majority) whose individuals are spread across units. Units can be geographical areas, residential neighborhoods, firms, school classes, or other clusters, provided that every individual belongs to exactly one unit. We seek to measure the extent to which individuals from the minority group are concentrated in some units more than in others. Throughout the paper, we follow the literature and use the word “segregation” as a neutral term to refer to such concentration. Measuring the magnitude of segregation is a necessary step to understand the underlying mechanisms and design adequate policies.

A natural way to measure segregation is to start from the minority shares Xi/Ki, where Xi is the number of individuals from the minority group and Ki the number of individuals (or unit’s size) in unit i ∈ {1, …, n}, and then compute an inequality index based on the distribution of the proportions Xi/Ki across the n units.

There are two possible benchmarks to assess the magnitude of these indices. Evenness relates to the case where all minority shares Xi/Ki are equal across units. Randomness relates to the case where the underlying allocation process assigns minority individuals at random across units. If pi is the probability that an arbitrary individual in unit i belongs to the minority, randomness means that the probabilities pi are equal across units. Past research has stressed the difference between both benchmarks, especially when the units are of small size (Cortese et al., 1976). The minority share Xi/Ki is only an estimate of pi, and even if p1, …, pn are all equal, there will be some variation in the Xi/Ki, all the more so as the units’ sizes Ki are small. If one is interested in the deviations from the randomness case, indices based on minority shares, which measure the deviation from evenness, will overestimate the level of segregation. This issue is known as the small-unit bias.
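The gap between the two benchmarks can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: it uses the standard formula of the Duncan dissimilarity index as the naive index, draws Xi from a binomial distribution with the same probability in every unit (the randomness benchmark), and relies on hypothetical function names and parameter values that do not come from any existing package.

```python
import random

def duncan_index(minority_counts, unit_sizes):
    """Naive Duncan dissimilarity index computed from observed counts."""
    total_minority = sum(minority_counts)
    total_majority = sum(unit_sizes) - total_minority
    if total_minority == 0 or total_majority == 0:
        raise ValueError("both groups must be non-empty")
    return 0.5 * sum(
        abs(x / total_minority - (k - x) / total_majority)
        for x, k in zip(minority_counts, unit_sizes)
    )

def simulate_random_allocation(n_units, unit_size, p, rng):
    """Draw X_i ~ Bin(unit_size, p) with the same p in every unit."""
    return [
        sum(rng.random() < p for _ in range(unit_size))
        for _ in range(n_units)
    ]

def mean_naive_index(n_units, unit_size, p, n_draws, seed=0):
    """Average the naive index over repeated draws under randomness."""
    rng = random.Random(seed)
    sizes = [unit_size] * n_units
    values = [
        duncan_index(simulate_random_allocation(n_units, unit_size, p, rng), sizes)
        for _ in range(n_draws)
    ]
    return sum(values) / n_draws

# Illustrative values only: same p everywhere, so there is no segregation
# relative to the randomness benchmark.
small = mean_naive_index(n_units=300, unit_size=5, p=0.2, n_draws=20)
large = mean_naive_index(n_units=300, unit_size=100, p=0.2, n_draws=20)
print(f"units of size 5:   {small:.3f}")
print(f"units of size 100: {large:.3f}")
```

Although all units share the same probability, the average naive index is far from zero with units of size 5 and noticeably smaller with units of size 100, which is the pattern the small-unit bias predicts.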

The problem is pervasive in applied research. For workplace and school segregation, a large share of firms have fewer than ten employees, and classes usually have between twenty and forty students. The bias also arises when the units are not small per se but only surveys of individuals are available. This is the case when one attempts to measure residential segregation using the local strata of household surveys.

Two main approaches have been proposed in the literature to deal with the small-unit bias. One strand proposes to correct the so-called naive inequality indices based on the minority shares Xi/Ki. The idea was initially proposed by Cortese et al. (1976) and Winship (1977) for the Duncan index. Carrington and Troske (1997, CT hereafter) extend the correction to other indices. Åslund and Skans (2009) adapt it to measure segregation conditional on covariates. Allen et al. (2015) develop another adjustment based on the bootstrap. These corrections all aim to switch the benchmark from evenness to randomness by subtracting an estimate of the bias from the initial, naive index. Another approach, adopted by Rathelot (2012, R hereafter) and D’Haultfœuille and Rathelot (2017, HR hereafter), defines segregation using an inequality index based on the unobserved probabilities pi, as a functional of the distribution Fp of pi. In line with the rest of the literature, they assume that the Xi are independent and follow, conditional on Ki and pi, a Bin(Ki, pi) distribution. R assumes a mixture of Beta distributions for Fp and derives the segregation index as a function of the distribution parameters. HR follow a nonparametric method leaving Fp unspecified; they show that the first moments of Fp are identified under the previous binomial assumption and obtain partial identification results on the segregation measure. Both R and HR construct confidence intervals for the segregation indices. HR also extend the methodology to study conditional segregation indices, namely measures of “net” or “residual” segregation taking into account other covariates (either of units or individuals) that may influence the allocation process.

The Stata command segregsmall allows social researchers to measure segregation in the context of small units. The command implements the methods proposed by R, HR, and CT. Conditional indices are available for all three methods. With R and HR, the command computes confidence intervals obtained by bootstrap. Finally, the command also implements a test of the binomial assumption.

This paper describes the command and presents the three methods it implements. Section 1.2 defines the set-up and the parameters of interest, and synthesizes the estimation and inference methods of R, HR, and CT. Section 1.3 details the syntax, options, and stored results of the segregsmall command and discusses its execution time. Section 1.4 presents an application of the command to French firm data to measure segregation between foreigners and natives across workplaces. Section 1.5 concludes.

### Set-up, estimation, and inference

**The setting and the parameters of interest**

The population studied is assumed to be split into two groups: a group of interest, henceforth the minority group, and the rest of the population. Individuals are distributed across units. For each unit, we assume that there exists a random variable p that represents the probability for any individual belonging to this unit to be a member of the minority. The total number of individuals in a unit is denoted by K.

We now introduce the segregation indices we focus on hereafter. We consider first unconditional indices; conditional indices are introduced in Section 1.2.6. Let us first assume that K is fixed. A segregation index θ is then a functional of the cumulative distribution function (c.d.f.) Fp of p and of m01 = E(p), that is, θ = g(Fp, m01). Roughly speaking, one expects such an index to be minimal when Fp is degenerate (Dirac), and maximal when p ∈ {0, 1} (Bernoulli). In the former case, the probability of belonging to the minority is the same in all units. In contrast, the minority group is concentrated in a subset of units only in the latter case.

**The small-unit bias.** To estimate θ, we assume hereafter that the researcher has at her disposal K; however, the probability p remains unobserved. Instead, she only observes X, the number of individuals belonging to the minority in the unit. By definition of p, we have E[X | K, p] = Kp, which implies that the proportion of individuals from the minority, X/K, is an unbiased estimator of p. However, because it varies conditional on p, X/K is more dispersed than p. As a result, we have for usual segregation indices, including the five indices above, g(FX/K, m01) > g(Fp, m01) = θ.

In other words, even in the absence of statistical uncertainty on the distribution of X/K, we would still overestimate the segregation index by using X/K in place of p. Moreover, this bias increases as K decreases. We refer to this issue as the small-unit bias hereafter.

**The binomial assumption.** We assume henceforth that individuals are allocated into units independently from each other. Namely, X is assumed to follow, conditional on p and K, a binomial distribution Bin(K, p). This hypothesis may be restrictive when the allocation process is in some way sequential and influenced by the composition of units. But importantly, this assumption is testable (see Section 1.2.5).
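Under the binomial assumption and for fixed K, the extra dispersion of X/K relative to p can be made explicit with the law of total variance. The derivation below is only an illustration based on the variance, which is not one of the segregation indices considered here; it shows why the excess dispersion vanishes as K grows.

```latex
\begin{aligned}
\operatorname{Var}\!\left(\frac{X}{K}\right)
  &= \operatorname{Var}\!\left(E\!\left[\frac{X}{K} \,\middle|\, p\right]\right)
   + E\!\left[\operatorname{Var}\!\left(\frac{X}{K} \,\middle|\, p\right)\right] \\
  &= \operatorname{Var}(p) + \frac{E[p(1-p)]}{K}.
\end{aligned}
```

The second term is strictly positive unless p ∈ {0, 1} almost surely, and it is of order 1/K, in line with a bias that increases as K decreases.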

**Nonparametric approach**

**Identification.** This approach, followed by HR, leaves the distribution Fp of p unrestricted. Combined with the binomial assumption, it entails a nonparametric binomial mixture model for X. Let us first suppose that K is constant; if not, we can simply retrieve aggregated indices θu and θi using (1.1) and (1.2). We also assume that K > 1; if K = 1, the distribution of X is not informative on θ and we only get trivial bounds on it, namely 0 and 1 for the five indices above.
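To see why the distribution of X carries information on the first moments of Fp, recall that for X | p ~ Bin(K, p) and 1 ≤ j ≤ K, E[C(X, j) / C(K, j) | p] = p^j, where C(·, ·) is the binomial coefficient. The sketch below uses this identity to estimate the moments of p from simulated counts. It is a simple moment-based illustration with a hypothetical two-point distribution for p and hypothetical function names; it is not the estimator used by HR or by the segregsmall command.

```python
import random
from math import comb

def estimate_moments(minority_counts, unit_size):
    """Estimate E[p^j], j = 1..unit_size, from counts X_i ~ Bin(unit_size, p_i).

    Relies on E[C(X, j) / C(K, j) | p] = p^j for 1 <= j <= K.
    """
    n = len(minority_counts)
    return [
        sum(comb(x, j) for x in minority_counts) / (n * comb(unit_size, j))
        for j in range(1, unit_size + 1)
    ]

rng = random.Random(1)
K = 4
# Hypothetical distribution of p: 0.1 or 0.5 with equal probability.
p_values = [rng.choice([0.1, 0.5]) for _ in range(50_000)]
counts = [sum(rng.random() < p for _ in range(K)) for p in p_values]

moments = estimate_moments(counts, K)
true_moments = [(0.1 ** j + 0.5 ** j) / 2 for j in range(1, K + 1)]
for j, (m, t) in enumerate(zip(moments, true_moments), start=1):
    print(f"moment {j}: estimate {m:.4f}, true value {t:.4f}")
```

Only the first K moments can be recovered this way, since C(X, j) is zero for every j > K; moments of higher order are not pinned down by the distribution of X.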

Note that as soon as the index θk is not point identified for one unit size k, the resulting aggregated index is only partially identified too. In other words, point identification of θu or θi requires being in the constrained case for each k ∈ K. This is unlikely to happen when the support of K contains very small sizes k, typically lower than 10.

As in the case of constant unit size, confidence intervals for the aggregated indices θu and θi are constructed by the modified bootstrap procedure detailed in HR. The randomness of K just involves an additional step that consists in drawing K from its empirical distribution.

**Assuming independence between K and p.** The previous estimation and inference procedures are fully agnostic as regards possible dependence between K and p, which is a safe option when unit size may be a potential determinant of segregation. However, if one is ready to impose independence between these two variables, the identified bounds on θu = θi get closer to each other. This is because the Fpk coincide with the unconditional distribution of p. Thus, we can gather all units and identify the first K̄ moments of Fp, where K̄ denotes the largest unit size in the support of K. Estimation and inference are performed as in the case of constant unit size, with K replaced by K̄. Thus, assuming independence between K and p improves identification since we identify more moments of Fp. It also leads to more accurate estimators since one estimates a single vector P on the whole sample, instead of doing so on each subsample {i : Ki = k}, for all k ∈ K.

An important particular case occurs when only some individuals in the unit are observed (e.g., survey data). Imagine units are of size (Ki)i=1,…,n but that, for each unit i, only nK,i individuals are sampled and observed. We let Xi denote the number of individuals belonging to the reference group in this subgroup of nK,i people. As previously, Xi follows a binomial distribution Bin(nK,i, pi) conditional on pi and nK,i. The previous results apply by simply replacing the unit size K with the number nK of individuals observed in each unit. Moreover, in such settings, it is usually plausible to assume that the random variable nK is independent of p conditional on the unit size, as nK depends on the survey process, which, a priori, is orthogonal to the segregation phenomenon.

**Parametric approach**

This approach, followed by R, is similar to that of HR, except that it imposes a parametric restriction on Fp. Specifically, Fp is assumed to be a mixture of Beta distributions. Combined with the binomial assumption for the conditional distribution of X, the model becomes fully parametric and thus can be estimated by maximum likelihood. Therefore, the indices are point identified, contrary to the nonparametric approach of HR.
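To make the likelihood concrete, the sketch below writes the log-likelihood in the simplest case of a single Beta component, for which X follows a beta-binomial distribution. This is a simplification of the mixture of Beta distributions used by R, the function names and parameter values are hypothetical, and no optimizer is included; a full implementation would maximize the likelihood over the parameters of every mixture component.

```python
import random
from math import exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_logpmf(x, k, a, b):
    """log P(X = x) when X | p ~ Bin(k, p) and p ~ Beta(a, b)."""
    log_comb = lgamma(k + 1) - lgamma(x + 1) - lgamma(k - x + 1)
    return log_comb + log_beta(x + a, k - x + b) - log_beta(a, b)

def log_likelihood(counts, sizes, a, b):
    """Log-likelihood of independent units under a single Beta(a, b) for p."""
    return sum(beta_binomial_logpmf(x, k, a, b) for x, k in zip(counts, sizes))

# Simulated data with hypothetical parameters a = 0.5, b = 2 and units of size 8.
rng = random.Random(2)
K = 8
sizes = [K] * 2000
counts = []
for _ in sizes:
    p = rng.betavariate(0.5, 2.0)
    counts.append(sum(rng.random() < p for _ in range(K)))

# Same mean E(p) = 0.2 in both cases, but Beta(5, 20) is far less dispersed.
ll_true = log_likelihood(counts, sizes, 0.5, 2.0)
ll_concentrated = log_likelihood(counts, sizes, 5.0, 20.0)
print(f"log-likelihood at (0.5, 2): {ll_true:.1f}")
print(f"log-likelihood at (5, 20):  {ll_concentrated:.1f}")
```

The two parameter vectors imply the same mean of p but very different dispersion, so comparing their log-likelihoods shows how the likelihood discriminates between distributions Fp that the minority share alone would not separate.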

A concern might be that the parametric restriction leads to invalid results when the model is misspecified. However, R shows through simulations that segregation indices associated with various distributions, both continuous and discrete, are accurately proxied by his parametric approach.

**Random unit size.** The adaptation to this case is identical to that of the HR method. For each k ∈ K, the MLE of θk is obtained using the subsample of units of size k. The weights are estimated by their empirical counterparts. The estimated aggregated indices are then obtained by plug-in, using (1.1) and (1.2). When K and p are assumed independent, all units can be pooled, independently of their size, to compute the MLE of v for the whole sample. As above, the resulting estimator v̂ allows us to estimate the distribution of p, and then θ.

**Correction of the naive index**

The approaches of HR and R are immune to the small-unit bias as they directly estimate g(Fp, m01). Earlier approaches instead start from the naive index θN = g(FX/K, m01) and attempt to modify it so that the parameter becomes less sensitive to changes in K. We present here the correction proposed by CT, which is the most popular in applied work.

CT’s correction relies on the distinction between the randomness and evenness benchmarks, introduced notably by Cortese et al. (1976) and Winship (1977). Evenness corresponds to X/K being constant, whereas randomness refers to the case where p is constant. Under the binomial model, however, evenness cannot occur. The central idea of CT is then to convert θN, which measures departure from evenness, into a distance to randomness. To do so, CT compare θN to its expected value θNra under the random allocation of individuals into units.
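The comparison can be sketched as follows. Two elements of this sketch are assumptions that are not stated in this section: the expected value under randomness is approximated by Monte Carlo, reshuffling individuals across units while keeping unit sizes and group totals fixed, and the rescaling (θN − θNra)/(1 − θNra) when θN ≥ θNra, (θN − θNra)/θNra otherwise, is the form usually attributed to CT in the literature. The Duncan index serves as the naive index, and all function names and data are hypothetical.

```python
import random

def duncan_index(minority_counts, unit_sizes):
    """Naive Duncan dissimilarity index computed from observed counts."""
    total_minority = sum(minority_counts)
    total_majority = sum(unit_sizes) - total_minority
    return 0.5 * sum(
        abs(x / total_minority - (k - x) / total_majority)
        for x, k in zip(minority_counts, unit_sizes)
    )

def expected_index_under_randomness(minority_counts, unit_sizes, n_draws, rng):
    """Monte Carlo mean of the naive index when individuals are reshuffled."""
    total_minority = sum(minority_counts)
    labels = [1] * total_minority + [0] * (sum(unit_sizes) - total_minority)
    total = 0.0
    for _ in range(n_draws):
        rng.shuffle(labels)
        counts, start = [], 0
        for k in unit_sizes:
            counts.append(sum(labels[start:start + k]))
            start += k
        total += duncan_index(counts, unit_sizes)
    return total / n_draws

def ct_corrected_index(naive, expected):
    """Rescale the distance between the naive index and its random benchmark."""
    if naive >= expected:
        return (naive - expected) / (1 - expected)
    return (naive - expected) / expected

# Hypothetical data: 200 units of size 5, minority present in half of them.
sizes = [5] * 200
counts = [3] * 100 + [0] * 100
rng = random.Random(3)

naive = duncan_index(counts, sizes)
expected = expected_index_under_randomness(counts, sizes, n_draws=200, rng=rng)
corrected = ct_corrected_index(naive, expected)
print(f"naive {naive:.3f}, expected under randomness {expected:.3f}, "
      f"corrected {corrected:.3f}")
```

With units this small, the benchmark value under randomness is itself large, so the corrected index is markedly lower than the naive one while remaining positive for these deliberately segregated data.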

**Table of contents :**

Introduction in English

Introduction en français

**1 Measuring segregation on small units**

1.1 Introduction

1.2 Set-up, estimation, and inference

1.2.1 The setting and the parameters of interest

1.2.2 Nonparametric approach

1.2.3 Parametric approach

1.2.4 Correction of the naive index

1.2.5 Test of the binomial assumption

1.2.6 Conditional segregation indices

1.3 The segregsmall command

1.3.1 Syntax

1.3.2 Description and main options

1.3.3 Additional options

1.3.4 Saved results

1.3.5 Execution time

1.4 Example

1.5 Conclusion

Appendix 1.A Magnitude of the small-unit bias

Appendix 1.B Plug-in estimators of naive indices

Appendix 1.C Indices in the parametric approach

Appendix 1.D Supplements to the example

**2 Measures of residential segregation in France**

2.1 Introduction

2.2 Data and dimensions

2.2.1 Labor Force Survey and units/neighborhoods

2.2.2 Dimensions of segregation and individual covariates

2.3 Methodology and robustness

2.3.1 Parameters of interest, estimation, and inference

2.3.2 Specification choices and robustness

2.4 Results

2.4.1 Unconditional analysis

2.4.2 Conditional analysis

Appendix 2.A Additional figures

**3 Identification and estimation of speech polarization**

3.1 Introduction

3.2 The framework

3.2.1 Statistical model

3.2.2 Parameter of interest

3.3 Main theoretical results

3.3.1 Bounds under minimal conditions

3.3.2 Point-identification and estimation through extrapolation

3.4 Extensions

3.4.1 Including covariates

3.4.2 Test of the binomial assumption

3.5 Application to speech polarization in the U.S. Congress (1873-2016)

3.5.1 The evolution of speech polarization over time

3.5.2 Dictionaries, processing operations, and small-unit bias

3.5.3 Testing our approach

3.6 Conclusion

Appendix 3.A Proofs of main theorems

3.A.1 Proof of Theorem 3.1 (definition of the index)

3.A.2 Proof of Theorem 3.2 (partial identification)

Appendix 3.B Supplements to the methodology

3.B.1 Connection with the Coworker index of segregation

3.B.2 Restriction to K > 0 without loss of generality

3.B.3 Formal results for estimation and inference

3.B.4 Proof of Proposition 3.1

3.B.5 Proof of Proposition 3.2

3.B.6 Proof of Proposition 3.3

3.B.7 Special case of independence between K and

3.B.8 Naive method and small-unit bias

Appendix 3.C Supplements to the application

3.C.1 Processing

3.C.2 Additional figures

Appendix 3.D Formal link with the model of Gentzkow et al. (2019)

3.D.1 GST’s framework

3.D.2 Connection between GST’s and our model

**4 Nonasymptotic bounds for Edgeworth expansions**

4.1 Introduction

4.2 Control of n,E under moment conditions only

4.3 Improved bounds on n,E under assumptions on the tail behavior of fSn

4.4 Conclusion and statistical applications

Appendix 4.A Proof of the main results

4.A.1 A smoothing inequality

4.A.2 Outline of the proofs of Theorems 4.1 and 4.2

4.A.3 Proof of Theorem 4.1 under Assumption 4.1

4.A.4 Proof of Theorem 4.1 under Assumption 4.2

4.A.5 Proof of Theorem 4.2 under Assumption 4.1

4.A.6 Proof of Theorem 4.2 under Assumption 4.2

Appendix 4.B Technical lemmas

4.B.1 Control of the term

4.B.2 Control of the residual term in an Edgeworth expansion under Assumption 4.1

4.B.3 Control of the residual term in an Edgeworth expansion under Assumption 4.2

4.B.4 Two bounds on incomplete Gamma-like integrals

4.B.5 Proof of Proposition 4.4

**5 Nonasymptotic confidence intervals in linear models**

5.1 Introduction

5.2 Quality measures for confidence sets

5.3 Linear regression without endogeneity

5.3.1 Model and standard asymptotic inference

5.3.2 Our confidence interval

5.3.3 Theoretical results

5.4 Practical considerations

5.5 Simulations

5.6 Extension: linear regression with endogeneity

5.6.1 Model and standard Anderson-Rubin inference

5.6.2 Our confidence set

5.6.3 Theoretical results

Appendix 5.A Proof of results in Section 5.3

5.A.1 Nonasymptotic conservativeness for the Edgeworth regime

5.A.2 Nonasymptotic conservativeness for the Exponential regime

5.A.3 Proof of Theorem 5.1

5.A.4 Proof of Theorem 5.2

Appendix 5.B Proof of results in Section 5.6

5.B.1 Nonasymptotic conservativeness for the Edgeworth and Exponential regimes

5.B.2 Proof of Theorem 5.3

5.B.3 Proof of Theorem 5.4

Appendix 5.C Additional lemmas

Appendix 5.D Definition of the remainders of the bound on Edgeworth expansion

5.D.1 Definition of rgen

5.D.2 Definition of rcont

Appendix 5.E Additional figures