dfl_decompose: DFL reweighting decomposition

Description

dfl_decompose divides between-group differences in distributional statistics of an outcome variable into a structure effect and a composition effect. Following DiNardo, Fortin, and Lemieux (1996), the procedure reweights the sample distribution of a reference group such that the group's covariates distribution matches the covariates distribution of a comparison group.

The function derives counterfactual distributions with inverse probability weighting. Reweighting factors are estimated by modelling the probability of belonging to the comparison group conditional on covariates.

The function allows detailed decompositions of the composition effect by sequentially reweighting (conditional) covariate distributions. Standard errors can be bootstrapped.

Usage

dfl_decompose(
  formula,
  data,
  weights,
  group,
  na.action = na.exclude,
  reference_0 = TRUE,
  subtract_1_from_0 = FALSE,
  right_to_left = TRUE,
  method = "logit",
  estimate_statistics = TRUE,
  statistics = c("quantiles", "mean", "variance", "gini", "iq_range_p90_p10",
    "iq_range_p90_p50", "iq_range_p50_p10"),
  probs = c(1:9)/10,
  custom_statistic_function = NULL,
  trimming = FALSE,
  trimming_threshold = NULL,
  return_model = TRUE,
  estimate_normalized_difference = TRUE,
  bootstrap = FALSE,
  bootstrap_iterations = 100,
  bootstrap_robust = FALSE,
  cores = 1,
  ...
)

Value

An object of class dfl_decompose containing: a data.frame with the decomposition results for the quantiles and for the other distributional statistics, respectively; a data.frame with the estimated reweighting factor for every observation; a data.frame with sample quantiles of the reweighting factors; a list with standard errors for the decomposition terms, the quantiles of the reweighting factor, and the bootstrapped Kolmogorov-Smirnov distribution to construct uniform confidence bands for quantiles; and a list with the normalized differences between the covariate means of the comparison group and the reweighted reference group.

Arguments

formula: a formula object with an outcome variable Y on the left-hand side and the covariates X on the right-hand side. For sequential decompositions, the sets of covariates in the sequence are separated by the | operator. Covariates are used to estimate the conditional probabilities for the reweighting factors.
data: a data.frame containing all variables and observations of both groups.
weights: name of the observation weights variable or vector of observation weights.
group: name of a binary variable (numeric or factor) identifying the two groups for which the differences are to be decomposed. The group identified by the lower ranked value in group (i.e., 0 in the case of a dummy variable or the first level of a factor variable) is defined as group 0. Per default, group 0 is the reference group (see reference_0).
na.action: a function to filter missing data (default na.exclude).
reference_0: boolean: if TRUE (default), then the group 0 -- i.e., the group identified by the lower ranked value in group -- will be defined as reference group. The reference group will be reweighted to match the covariates distribution of the sample of the comparison group.
subtract_1_from_0: boolean: By default (`FALSE`), the distributional statistic of group 0 is subtracted from the one of group 1 to compute the overall difference. Setting `subtract_1_from_0` to `TRUE` merely changes the sign of the decomposition results.
right_to_left: determines the direction of a sequential decomposition. If TRUE (default), the sequential decomposition starts from the right and first reweights the reference group using only the variables entered last into the formula sequence. The other variables are then added sequentially. Otherwise, the decomposition starts from the left, using all variables entered into the formula object from the start and sequentially removing variables.
method: specifies the method to fit and predict conditional probabilities used to derive the reweighting factor. At the moment, "logit", "fastglm", and "random_forest" are available.
estimate_statistics: boolean: if TRUE (default), then distributional statistics are estimated and the decomposition is performed. If FALSE, the function only returns the fitted inverse propensity weights.
statistics: a character vector that defines the distributional statistics for which the decomposition is performed. Per default, c("quantiles", "mean", "variance", "gini", "iq_range_p90_p10", "iq_range_p90_p50", "iq_range_p50_p10") are estimated and decomposed. Also implemented are `c("iq_ratio_p90_p10", "iq_ratio_p90_p50", "iq_ratio_p50_p10")`. Note: If the logarithm of a variable Y is set as outcome variable in formula, the function calculates the Gini coefficient for the untransformed variable (i.e., exp(log(Y))).
probs: a vector of length 1 or more with the probabilities of the quantiles to be estimated with default c(1:9)/10.
custom_statistic_function: a function estimating a custom distributional statistic that will be decomposed (NULL by default). Every custom_statistic_function needs the parameters dep_var (vector of the outcome variable) and weights (vector with observation weights); additional arguments are not allowed or need to be 'hardcoded'. See examples for further details.
trimming: boolean: If TRUE, observations with dominant reweighting factor values are trimmed according to the rule of Huber, Lechner, and Wunsch (2013). Per default, trimming is set to FALSE.
trimming_threshold: numeric: threshold defining the maximal accepted relative weight of the reweighting factor value (i.e., inverse probability weight) of a single observation. If NULL, the threshold is set to $\sqrt{N}/N$, where $N$ is the number of observations in the reference group.
return_model: boolean: If TRUE (default), the object(s) of the model fit(s) used to predict the conditional probabilities for the reweighting factor(s) are returned.
estimate_normalized_difference: boolean: If TRUE (default), the normalized differences between the covariate means of the comparison group and the reweighted reference group are calculated.
bootstrap: boolean: If FALSE (default), then the estimation is not bootstrapped and no standard errors are calculated.
bootstrap_iterations: positive integer with default 100 indicating the number of bootstrap iterations to be executed.
bootstrap_robust: boolean: if FALSE (default), then bootstrapped standard errors are estimated as the standard deviations of the bootstrap estimates. Otherwise, the function uses the bootstrap interquartile range rescaled by the interquartile range of the standard normal distribution to estimate standard errors.
cores: positive integer with default 1 indicating the number of cores to use when computing bootstrap standard errors.
...: other parameters passed to the function estimating the conditional probabilities.
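
The trimming rule described for trimming and trimming_threshold can be illustrated with a minimal base R sketch. It assumes that the relative weight of an observation is its reweighting factor divided by the sum of the factors in the reference group; the values in psi are made up for illustration, and the function's internal rule may differ in details.

```r
# Hypothetical reweighting factors for a reference group of N = 100 observations
psi <- c(rep(1, 98), 30, 60)
N <- length(psi)

# Relative weight of each observation (assumption: normalized within the group)
relative_weight <- psi / sum(psi)

# Default threshold when trimming_threshold = NULL
threshold <- sqrt(N) / N

# Indices of observations whose relative weight exceeds the threshold
trimmed <- which(relative_weight > threshold)
trimmed
```

Only the two observations with dominant factors exceed the threshold here; all other observations are kept.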

Details

The observed difference to be decomposed equals the difference between the values of the distributional statistic of group 1 and group 0, respectively:

$$\Delta_O = \nu_1 - \nu_0,$$

where $\nu_g = \nu(F_g)$ denotes the statistic of the outcome distribution $F_g$ of group $g$. Group 0 is identified by the lower ranked value of the group variable.

If reference_0=TRUE, then group 0 is the reference group and its observations are reweighted such that they match the covariates distribution of group 1, the comparison group. The counterfactual combines the covariates distribution $F_1(x)$ of group 1 with the conditional outcome distribution $F_0(y|x)$ of group 0 and is derived by reweighting group 0

$$F_C(y) = \int F_0(y|x) dF_1(x) = \int F_0(y|x) \Psi(x) dF_0(x),$$

where $\Psi(x)$ is the reweighting factor, i.e., the inverse probabilities of belonging to the comparison group conditional on covariates x.

The distributional statistic of the counterfactual distribution, $\nu_C = \nu(F_C)$, makes it possible to decompose the observed difference into a (wage) structure effect ($\Delta_S = \nu_1 - \nu_C$) and a composition effect ($\Delta_C = \nu_C - \nu_0$).
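
The aggregate decomposition with group 0 as reference group can be sketched in base R for the mean. This is an illustration on simulated data, not the function's implementation: the reweighting factor is written in the usual DFL form $\Psi(x) = \frac{P(g=1|x)}{P(g=0|x)} \cdot \frac{P(g=0)}{P(g=1)}$, which is an assumption about the exact expression, and all variable names are hypothetical.

```r
set.seed(1)
n <- 2000
g <- rep(c(0, 1), each = n)                      # group indicator
x <- rbinom(2 * n, 1, ifelse(g == 1, 0.6, 0.3))  # covariate distribution differs by group
y <- 1 + 0.5 * x + 0.3 * g + rnorm(2 * n, sd = 0.5)
d <- data.frame(y, x, g)

# Conditional probability of belonging to group 1, fitted in the pooled sample
fit <- glm(g ~ x, family = binomial(), data = d)
p_x <- predict(fit, type = "response")
p_1 <- mean(d$g)

# Reweighting factor (assumed DFL form)
psi <- (p_x / (1 - p_x)) * ((1 - p_1) / p_1)

ref <- d$g == 0
nu_0 <- mean(d$y[ref])
nu_1 <- mean(d$y[!ref])
nu_C <- weighted.mean(d$y[ref], w = psi[ref])    # counterfactual mean

delta_O <- nu_1 - nu_0   # observed difference
delta_S <- nu_1 - nu_C   # structure effect
delta_C <- nu_C - nu_0   # composition effect
c(observed = delta_O, structure = delta_S, composition = delta_C)
```

By construction, the structure and composition effects add up to the observed difference, and the reweighted group 0 reproduces the covariate mean of group 1.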

If reference_0=FALSE, then the counterfactual is derived by combining the covariates distribution of group 0 with the conditional outcome distribution of group 1 and, thus, reweighting group 1

$$F_C(y) = \int F_1(y|x) dF_0(x) = \int F_1(y|x) \Psi(x) dF_1(x).$$

The composition effect becomes $\Delta_C = \nu_1 - \nu_C$ and the structure effect $\Delta_S = \nu_C - \nu_0$, respectively.

The covariates are defined in formula. The reweighting factor is estimated in the pooled sample with observations from both groups. method = "logit" uses a logit model to fit the conditional probabilities. method = "fastglm" also fits a logit model but with a faster algorithm from fastglm. method = "random_forest" uses the Ranger implementation of the random forests classifier.

The counterfactual statistics are then estimated with the observed data of the reference group and the fitted reweighting factors.
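
For quantiles, this step amounts to computing weighted quantiles of the reference group's outcome with the reweighting factors as weights. The helper below is a hypothetical sketch that inverts the weighted empirical distribution function; the function itself may rely on a different quantile definition (e.g., with interpolation), so results can differ slightly.

```r
# Hypothetical helper: smallest outcome value at which the weighted
# empirical distribution function reaches the probability p
weighted_quantile <- function(y, w, probs) {
  ord <- order(y)
  cum_w <- cumsum(w[ord]) / sum(w)
  sapply(probs, function(p) y[ord][which(cum_w >= p)[1]])
}

# Equal weights: the usual empirical quantile
weighted_quantile(1:10, rep(1, 10), probs = 0.5)

# Shifting weight to large outcomes moves the median upwards
weighted_quantile(c(1, 2, 3), c(1, 1, 8), probs = 0.5)
```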

Interaction terms can be specified in formula for the conditional probability models. If you are interested in an aggregate decomposition, then all covariates have to be entered at once, e.g., Y ~ X + Z.

The procedure allows for sequential decomposition of the composition effect. In this case, more than one reweighting factor based on different sets of covariates are estimated.

If you are interested in a sequential decomposition, the decomposition sequence has to be distinguished by the | operator in the formula object. For instance, Y ~ X | Z would decompose the aggregate composition effect into the contribution of covariate(s) X and the one of covariate(s) Z, respectively.

In this two-fold sequential decomposition with group 0 as the reference group and right_to_left=TRUE, we have the detailed composition effects

$$\Delta_{C_X} = \nu_C - \nu_{CX},$$ and $$\Delta_{C_Z} = \nu_{CX} - \nu_0,$$

which sum up to the aggregate composition effect $\Delta_C = \nu_C - \nu_0$. With right_to_left=FALSE, the roles of the two differences are swapped, i.e., $\Delta_{C_X} = \nu_{CX} - \nu_0$ and $\Delta_{C_Z} = \nu_C - \nu_{CX}$. $\nu_C$ is defined as above. It captures the contribution of all covariates (i.e., X and Z). In contrast, $\nu_{CX}$ corresponds to the statistic of the counterfactual distribution isolating the contribution of covariate(s) X in contrast to the one of covariate(s) Z.

If right_to_left=TRUE, then the counterfactual is defined as $$F_{CX}(y) = \iint F_0(y|x,z) dF_0(x|z) dF_1(z),$$

where $F_0(x|z)$ is the conditional distribution of X given Z of group 0 and $F_1(z)$ the distribution of Z of group 1. If right_to_left=FALSE, we have $$F_{CX}(y) = \iint F_0(y|x,z) dF_1(x|z) dF_0(z).$$
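
The two counterfactuals $F_{CX}$ can be obtained by reweighting group 0 with different factors, as the following base R sketch for the mean illustrates. It rests on assumptions: the factors take the usual DFL odds form, the helper dfl_factor and the simulated data are hypothetical, and the detailed effects are read off the definitions of $F_{CX}$ with group 0 as reference (Z moves from $\nu_0$ to $\nu_{CX}$ when right_to_left=TRUE, and from $\nu_{CX}$ to $\nu_C$ otherwise).

```r
set.seed(2)
n <- 3000
g <- rep(c(0, 1), each = n)
z <- rbinom(2 * n, 1, ifelse(g == 1, 0.5, 0.3))
x <- rbinom(2 * n, 1, plogis(-0.5 + z + 0.8 * g))
y <- 1 + 0.4 * x + 0.6 * z + 0.2 * g + rnorm(2 * n, sd = 0.5)
d <- data.frame(y, x, z, g)

# Hypothetical helper: DFL-type factor from a logit fitted in the pooled sample
dfl_factor <- function(formula, data) {
  p <- predict(glm(formula, family = binomial(), data = data), type = "response")
  p_1 <- mean(data$g)
  (p / (1 - p)) * ((1 - p_1) / p_1)
}

psi_xz <- dfl_factor(g ~ x * z, d)  # matches the joint distribution of (X, Z)
psi_z  <- dfl_factor(g ~ z, d)      # matches the distribution of Z only

ref  <- d$g == 0
nu_0 <- mean(d$y[ref])
nu_C <- weighted.mean(d$y[ref], psi_xz[ref])

# right_to_left = TRUE: F_0(x|z) combined with F_1(z)
nu_CX_rtl <- weighted.mean(d$y[ref], psi_z[ref])
# right_to_left = FALSE: F_1(x|z) combined with F_0(z)
nu_CX_ltr <- weighted.mean(d$y[ref], (psi_xz / psi_z)[ref])

effects <- rbind(
  right_to_left = c(X = nu_C - nu_CX_rtl, Z = nu_CX_rtl - nu_0),
  left_to_right = c(X = nu_CX_ltr - nu_0, Z = nu_C - nu_CX_ltr)
)
effects
```

Each row adds up to the same aggregate composition effect, while the split between X and Z generally differs between the two directions.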

Note that it is possible to specify the detailed models in every part of formula. This is useful if you want to estimate a fully saturated model in every step, e.g., Y ~ X * Z | Z. If not further specified, the variables are additively included in the model used to derive the aggregate reweighting factor.

The detailed decomposition terms are path-dependent. The results depend on the sequence in which the covariates enter the decomposition (e.g., Y ~ X | Z yields different detailed decomposition terms than Y ~ Z | X). Even for the same sequence, the results differ depending on the 'direction' of the decomposition. In the example above using right_to_left=TRUE, the contribution of Z is evaluated using the conditional distribution of X given Z from group 0. If we use right_to_left=FALSE instead, the same contribution is evaluated using the conditional distribution from group 1.

Per default, the distributional statistics for which the between-group differences are decomposed are quantiles, the mean, the variance, the Gini coefficient, and the interquantile ranges between the 9th and the 1st decile, the 9th decile and the median, and the median and the 1st decile, respectively. The interquantile ratios between the same quantiles are implemented as well.

The quantiles can be specified with probs, which sets the corresponding probabilities of the quantiles of interest. For other distributional statistics, please use custom_statistic_function.
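
A custom statistic only needs the two parameters dep_var and weights. As a minimal sketch, the hypothetical function weighted_cv below computes a weighted coefficient of variation; it is not part of the package and is shown only to illustrate the required signature.

```r
# Hypothetical custom statistic: weighted coefficient of variation
weighted_cv <- function(dep_var, weights) {
  w <- weights / sum(weights)            # normalize weights to sum to one
  m <- sum(w * dep_var)                  # weighted mean
  sqrt(sum(w * (dep_var - m)^2)) / m     # weighted standard deviation over mean
}

# Stand-alone check on toy data
weighted_cv(dep_var = c(1, 2, 3), weights = c(1, 1, 1))
```

Such a function could then be passed as custom_statistic_function = weighted_cv. Note that it receives the outcome as specified in formula, so with log(wage) as outcome the statistic would refer to log wages.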

The function bootstraps standard errors and derives a bootstrapped Kolmogorov-Smirnov distribution to construct uniform confidence bands. The Kolmogorov-Smirnov distribution is estimated as in Chen et al. (2017).
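
The two standard error estimators selected by bootstrap_robust can be sketched in base R as follows. The vector boot_estimates is simulated and stands in for the bootstrap replicates of a single decomposition term; the rescaling by the interquartile range of the standard normal distribution follows the description of the argument, but the internals of the function may differ.

```r
set.seed(3)
# Hypothetical bootstrap replicates of one decomposition term, with one outlier
boot_estimates <- c(rnorm(99, mean = 0.1, sd = 0.02), 5)

# bootstrap_robust = FALSE: standard deviation of the bootstrap estimates
se_sd <- sd(boot_estimates)

# bootstrap_robust = TRUE: interquartile range rescaled by the
# interquartile range of the standard normal distribution
se_robust <- IQR(boot_estimates) / (qnorm(0.75) - qnorm(0.25))

c(sd = se_sd, robust = se_robust)
```

The robust variant is much less sensitive to single extreme bootstrap draws, which can occur when a replicate yields very large reweighting factors.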

References

Chen, Mingli, Victor Chernozhukov, Iván Fernández-Val, and Blaise Melly. 2017. "Counterfactual: An R Package for Counterfactual Analysis." The R Journal, 9(1), 370-384.

DiNardo, John, Nicole M. Fortin, and Thomas Lemieux. 1996. "Labor Market Institutions and the Distribution of Wages, 1973-1992: A Semiparametric Approach." Econometrica, 64(5), 1001-1044.

Firpo, Sergio P., Nicole M. Fortin, and Thomas Lemieux. 2018. "Decomposing Wage Distributions Using Recentered Influence Function Regressions." Econometrics, 6(2), 28.

Fortin, Nicole M., Thomas Lemieux, and Sergio Firpo. 2011. "Decomposition methods in economics." In Orley Ashenfelter and David Card, eds., Handbook of Labor Economics. Vol. 4. Elsevier, 1-102.

Firpo, Sergio P., and Cristine Pinto. 2016. "Identification and Estimation of Distributional Impacts of Interventions Using Changes in Inequality Measures." Journal of Applied Econometrics, 31(3), 457-486.

Huber, Martin, Michael Lechner, and Conny Wunsch. 2013. "The performance of estimators based on the propensity score." Journal of Econometrics, 175(1), 1-21.

Examples


## Example from handbook chapter of Fortin, Lemieux, and Firpo (2011: 67)
## with a sample of the original data

# \donttest{
data("men8305")

flf_model <- log(wage) ~ union * (education + experience) + education * experience

# Reweighting sample from 1983-85
flf_male_inequality <- dfl_decompose(flf_model,
  data = men8305,
  weights = weights,
  group = year
)

# Summarize results
summary(flf_male_inequality)

# Plot decomposition of quantile differences
plot(flf_male_inequality)

# Use alternative reference group (i.e., reweight sample from 2003-05)
flf_male_inequality_reference_0305 <- dfl_decompose(flf_model,
  data = men8305,
  weights = weights,
  group = year,
  reference_0 = FALSE
)
summary(flf_male_inequality_reference_0305)

# Bootstrap standard errors (using smaller sample for the sake of illustration)

set.seed(123)
flf_male_inequality_boot <- dfl_decompose(flf_model,
  data = men8305[1:1000, ],
  weights = weights,
  group = year,
  bootstrap = TRUE,
  bootstrap_iterations = 100,
  cores = 1
)

# Get standard errors and confidence intervals
summary(flf_male_inequality_boot)

# Plot quantile differences with pointwise confidence intervals
plot(flf_male_inequality_boot)

# Plot quantile differences with uniform confidence intervals
plot(flf_male_inequality_boot, uniform_bands = TRUE)



## Sequential decomposition

# Here we distinguish the contribution of education and experience
# from the contribution of unionization conditional on education and experience.


model_sequential <- log(wage) ~ union * (education + experience) +
  education * experience |
  education * experience

# First variant:
# Contribution of union is evaluated using composition of
# education and experience from 2003-2005 (group 1)

male_inequality_sequential <- dfl_decompose(model_sequential,
  data = men8305,
  weights = weights,
  group = year
)

# Summarize results
summary(male_inequality_sequential)

# Second variant:
# Contribution of union is evaluated using composition of
# education and experience from 1983-1985 (group 0)

male_inequality_sequential_2 <- dfl_decompose(model_sequential,
  data = men8305,
  weights = weights,
  group = year,
  right_to_left = FALSE
)

# Summarize results
summary(male_inequality_sequential_2)

# The decomposition effects associated with (conditional) unionization for deciles
cbind(
  male_inequality_sequential$decomposition_quantiles$prob,
  male_inequality_sequential$decomposition_quantiles$`Comp. eff. X1|X2`,
  male_inequality_sequential_2$decomposition_quantiles$`Comp. eff. X1|X2`
)


## Trim observations with weak common support
## (i.e., observations with relative factor weights > sqrt(N)/N)

set.seed(123)
data_weak_common_support <- data.frame(
  d = factor(c(
    c("A", "A", rep("B", 98)),
    c(rep("A", 90), rep("B", 10))
  )),
  group = rep(c(0, 1), each = 100)
)
data_weak_common_support$y <- ifelse(data_weak_common_support$d == "A", 1, 2) +
  data_weak_common_support$group +
  rnorm(200, 0, 0.5)

decompose_results_trimmed <- dfl_decompose(y ~ d,
  data_weak_common_support,
  group = group,
  trimming = TRUE
)

identical(
  decompose_results_trimmed$trimmed_observations,
  which(data_weak_common_support$d == "A")
)



## Pass a custom statistic function to decompose income share of top 10%

top_share <- function(dep_var,
                      weights,
                      top_percent = 0.1) {
  threshold <- Hmisc::wtd.quantile(dep_var, weights = weights, probs = 1 - top_percent)
  share <- sum(weights[which(dep_var > threshold)] *
    dep_var[which(dep_var > threshold)]) /
    sum(weights * dep_var)
  return(share)
}

flf_male_inequality_custom_stat <- dfl_decompose(flf_model,
  data = men8305,
  weights = weights,
  group = year,
  custom_statistic_function = top_share
)
summary(flf_male_inequality_custom_stat)
# }
