oaxaca: Blinder-Oaxaca Decomposition

Description

oaxaca performs a Blinder-Oaxaca decomposition for linear regression models (Blinder, 1973; Oaxaca, 1973). This statistical method decomposes the difference in mean outcomes between two groups into a part that is due to cross-group differences in explanatory variables and a part that is due to differences in group-specific coefficients. Economists have used Blinder-Oaxaca decompositions extensively to study labor market discrimination. In principle, however, the method is appropriate for the exploration of cross-group differences in any outcome variable. The oaxaca function allows users to estimate both a threefold and a twofold variant of the decomposition, as described and implemented by Jann (2008). It supports a variety of reference coefficient weights, as well as pooled model estimation. It can also adjust coefficients on indicator variables to be invariant to the choice of the omitted reference category. Standard errors are obtained by bootstrapping (Efron, 1979). The function returns an object of class "oaxaca" that can be visualized using the plot.oaxaca method.

Usage

oaxaca(formula, data, group.weights = NULL, R = 100, reg.fun = lm, ...)

Arguments

formula

a formula that specifies the model that the function will run. Typically, the formula is of the following form: y ~ x1 + x2 + x3 + ... | z where y is the dependent variable, x1 + x2 + x3 + ... are explanatory variables and z is an indicator variable that is TRUE (or equal to 1) when an observation belongs to Group B, and FALSE (or equal to 0) when it belongs to Group A. The formula can also take on an alternative form: y ~ x1 + x2 + x3 + ... | z | d1 + d2 + d3 + ... Here, d1 + d2 + d3 + ... are indicator ("dummy") variables that will be adjusted so that the decomposition results do not change depending on the user's choice of the reference category (Gardeazabal and Ugidos, 2004).
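
The adjustment of dummy variable coefficients can be illustrated with a minimal base R sketch. It assumes the normalization discussed by Gardeazabal and Ugidos (2004) and Jann (2008): each set of dummy coefficients, with the omitted category counted as zero, is re-expressed as deviations from its mean, and the intercept is shifted by that mean. The data are simulated, and the helper normalize_dummies is hypothetical; it is not part of the package and may differ from the package's internal computation.

```r
# Simulated data; every name below is illustrative, not part of the package.
set.seed(2)
n <- 300
edu <- factor(sample(c("low", "mid", "high"), n, replace = TRUE),
              levels = c("low", "mid", "high"))
effect <- c(low = 0, mid = 0.4, high = 0.9)
y <- 1 + unname(effect[as.character(edu)]) + rnorm(n, sd = 0.2)

d_low  <- data.frame(y = y, edu = edu)                        # "low" omitted
d_high <- transform(d_low, edu = relevel(edu, ref = "high"))  # "high" omitted

# Hypothetical helper: express one set of dummy coefficients as deviations
# from their mean (omitted category counted as 0) and shift the intercept.
normalize_dummies <- function(fit, factor_name) {
  lev <- fit$xlevels[[factor_name]]
  b <- coef(fit)
  cat_coef <- setNames(numeric(length(lev)), lev)
  cat_coef[lev[-1]] <- b[paste0(factor_name, lev[-1])]
  shift <- mean(cat_coef)
  list(intercept = unname(b["(Intercept)"]) + shift,
       categories = cat_coef - shift)
}

adj_low  <- normalize_dummies(lm(y ~ edu, data = d_low),  "edu")
adj_high <- normalize_dummies(lm(y ~ edu, data = d_high), "edu")

# The adjusted coefficients should not depend on the omitted category.
lv <- c("low", "mid", "high")
print(rbind(omit_low = adj_low$categories[lv],
            omit_high = adj_high$categories[lv]))
```

In this sketch both rows of the printed matrix should agree, which is the invariance property that the third part of the formula is meant to deliver.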

data

a data frame containing the data to be used in the Blinder-Oaxaca decomposition.

group.weights

a vector of numeric values between 0 and 1. These values specify the weight given to Group A relative to Group B in determining the reference set of coefficients (Oaxaca and Ransom, 1994). By default, the following weights are included in each estimation:

0: Group A coefficients used as reference.
1: Group B coefficients used as reference.
0.5: Equally weighted average (each 0.5) of Group A and B coefficients used as reference, as in Reimers (1983).
an average of Group A and B coefficients weighted by the number of observations in Group A and B, following Cotton (1988).
-1: Coefficients from a pooled regression (that does not include the group indicator variable) used as reference, as suggested by Neumark (1988).
-2: Coefficients from a pooled regression (that includes the group indicator) used as reference. See Jann (2008).
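
As a rough illustration of how several of these reference coefficient sets could be built, the base R sketch below uses simulated data and lm only. All object names are illustrative, and the construction is an assumption based on the descriptions above rather than the package's internal code. The two single-group cases (weights 0 and 1) are left out because they simply reuse one group's coefficients.

```r
# Simulated data; z is TRUE for Group B and FALSE for Group A.
set.seed(3)
n_a <- 150
n_b <- 250
d <- data.frame(z = rep(c(FALSE, TRUE), times = c(n_a, n_b)),
                x = rnorm(n_a + n_b))
d$y <- ifelse(d$z, 0.5 + 0.8 * d$x, 1.0 + 1.2 * d$x) +
  rnorm(n_a + n_b, sd = 0.3)

# Group-specific coefficients
beta_a <- coef(lm(y ~ x, data = d[!d$z, ]))
beta_b <- coef(lm(y ~ x, data = d[d$z, ]))

# Equal weights, in the spirit of Reimers (1983)
beta_reimers <- 0.5 * beta_a + 0.5 * beta_b

# Weights proportional to group sizes, in the spirit of Cotton (1988)
beta_cotton <- (n_a * beta_a + n_b * beta_b) / (n_a + n_b)

# Pooled regression without the group indicator (Neumark, 1988)
beta_neumark <- coef(lm(y ~ x, data = d))

# Pooled regression with the group indicator (Jann, 2008);
# the indicator's own coefficient is dropped from the reference set
beta_jann <- coef(lm(y ~ x + z, data = d))[names(beta_a)]

beta_ref <- rbind(reimers = beta_reimers, cotton = beta_cotton,
                  neumark = beta_neumark, jann = beta_jann)
print(beta_ref)
```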

R

number of bootstrapping replicates for the calculation of standard errors. No bootstrapping is performed when the value of R is set to NULL.
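
The idea behind the bootstrapped standard errors can be sketched in base R as follows. This is a simplified illustration under stated assumptions: rows are resampled with replacement from the full data set, the statistic is the endowments term written as in Jann (2008), and the standard error is the standard deviation across replicates. The helper endowments is hypothetical, and the package's actual resampling scheme may differ.

```r
# Simulated data; z is TRUE for Group B and FALSE for Group A.
set.seed(4)
n <- 200
d <- data.frame(z = rep(c(FALSE, TRUE), each = n))
d$x <- rnorm(2 * n, mean = ifelse(d$z, 1, 2))
d$y <- ifelse(d$z, 0.5 + 0.8 * d$x, 1.0 + 1.2 * d$x) + rnorm(2 * n, sd = 0.3)

# Hypothetical helper: endowments term, (mean X in A - mean X in B)' beta_B
endowments <- function(data) {
  fit_a <- lm(y ~ x, data = data[!data$z, ])
  fit_b <- lm(y ~ x, data = data[data$z, ])
  x_diff <- colMeans(model.matrix(fit_a)) - colMeans(model.matrix(fit_b))
  sum(x_diff * coef(fit_b))
}

# Recompute the statistic on R resampled data sets
R <- 100
boot_est <- replicate(R, endowments(d[sample(nrow(d), replace = TRUE), ]))

# Bootstrap standard error: spread of the estimates across replicates
se_endowments <- sd(boot_est)
print(c(coef = endowments(d), se = se_endowments))
```

A larger R gives a more stable standard error at the cost of more regressions.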

reg.fun

a function that estimates the desired regression model. The function must accept the arguments formula and data, and the objects it returns must be handled by model.frame and model.matrix in the same way as those returned by the standard functions lm and glm. Additional arguments can be passed on via the … argument. By default, an Ordinary Least Squares (OLS) regression is performed via the lm function.

…

additional arguments that will be passed on to the regression function specified by reg.fun.

Value

oaxaca returns an object of class "oaxaca". The corresponding summary function (i.e., summary.oaxaca) returns the same object.

An object of class "oaxaca" is a list containing the following components:

beta

a list that contains information about the regression coefficients used in estimating the decomposition. If dummy variables d1 + d2 + d3 + ... are specified in the formula argument, this list contains coefficients that have been adjusted to make estimation results invariant to the choice of the omitted baseline category (Gardeazabal and Ugidos, 2004). The beta list contains the following components:

beta.A: coefficients from a regression on observations in Group A
beta.B: coefficients from a regression on observations in Group B
beta.diff: equal to beta.A - beta.B
beta.R: a matrix that contains the reference coefficients for each of the estimated twofold decompositions

call

the matched call.

n

a list that contains information about the number of observations used in the analysis. It contains the following components:

n.A: the number of observations in Group A
n.B: the number of observations in Group B
n.pooled: the number of observations in the pooled model that includes both Group A and Group B

R

a numeric vector that contains the number of bootstrapping replicates.

reg

a list that contains estimated regression objects:

reg.A: a regression on observations in Group A
reg.B: a regression on observations in Group B
reg.pooled.1: a pooled regression that does not include the group indicator variable (Neumark, 1988)
reg.pooled.2: a pooled regression that includes the group indicator variable (Jann, 2008)

threefold

a list that contains the result of the threefold Blinder-Oaxaca decomposition. It decomposes the difference in mean outcomes into three parts:

endowments: the contribution of differences in explanatory variables across groups.
coefficients: part that is due to group differences in the coefficients (or "effect size"). Includes differences in the model intercept.
interaction: part that accounts for the fact that cross-group differences in explanatory variables and coefficients occur at the same time.

The list threefold contains two sub-components: overall and variables. The former is a numeric vector that stores results - coefficients (coef) and standard errors (se) - for the overall decomposition of the difference in outcomes into the three parts described above. The latter is a numeric matrix that contains the results of a variable-by-variable threefold Blinder-Oaxaca decomposition.
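
The three parts can be reproduced by hand in base R. The sketch below uses simulated data and the formulation in Jann (2008), in which the endowments term is weighted by Group B coefficients and the coefficients term by Group B means. Treating this as the viewpoint used by the package is an assumption, and all object names are illustrative.

```r
# Simulated data; z is TRUE for Group B and FALSE for Group A.
set.seed(1)
n <- 200
d <- data.frame(z = rep(c(FALSE, TRUE), each = n))
d$x <- rnorm(2 * n, mean = ifelse(d$z, 1, 2))
d$y <- ifelse(d$z, 0.5 + 0.8 * d$x, 1.0 + 1.2 * d$x) + rnorm(2 * n, sd = 0.3)

fit_a <- lm(y ~ x, data = d[!d$z, ])
fit_b <- lm(y ~ x, data = d[d$z, ])

# Mean regressor vectors, including the intercept column of ones
xbar_a <- colMeans(model.matrix(fit_a))
xbar_b <- colMeans(model.matrix(fit_b))
beta_a <- coef(fit_a)
beta_b <- coef(fit_b)

# Threefold decomposition of the gap in mean outcomes
part_endowments   <- sum((xbar_a - xbar_b) * beta_b)
part_coefficients <- sum(xbar_b * (beta_a - beta_b))
part_interaction  <- sum((xbar_a - xbar_b) * (beta_a - beta_b))

y_diff <- mean(d$y[!d$z]) - mean(d$y[d$z])
print(c(endowments = part_endowments, coefficients = part_coefficients,
        interaction = part_interaction, y.diff = y_diff))
```

Because both regressions include an intercept, the three parts add up to the difference in mean outcomes.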

twofold

a list that contains the result of the twofold Blinder-Oaxaca decomposition. It decomposes the difference in mean outcomes into two parts:

explained: the portion that is explained by cross-group differences in the explanatory variables.
unexplained: the remaining part that is not explained by differences in the explanatory variables. Often attributed to discrimination, but may also result from the influence of unobserved variables.

The unexplained part can be further decomposed into two sub-parts, unexplained A and unexplained B, that represent discrimination in favor of Group A and against Group B, respectively. See Jann (2008) for details on these sub-parts' interpretation. The list twofold contains two sub-components: overall and variables. The former is a numeric matrix that stores results - coefficients (coef) and standard errors (se) - for the overall decomposition of the difference in outcomes into the two parts described above. The latter is a list of numeric matrices that contains the results of a variable-by-variable twofold Blinder-Oaxaca decomposition. In all matrices, the weight column indicates the weight given to Group A relative to Group B in determining the reference coefficients.
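
A hand computation of the twofold decomposition, following the formulation in Jann (2008), might look like the base R sketch below. The data are simulated, and the reference coefficients beta_r are an equally weighted average of the two groups' coefficients, so the result does not depend on which group the weight refers to. All object names are illustrative and not part of the package.

```r
# Simulated data; z is TRUE for Group B and FALSE for Group A.
set.seed(5)
n <- 200
d <- data.frame(z = rep(c(FALSE, TRUE), each = n))
d$x <- rnorm(2 * n, mean = ifelse(d$z, 1, 2))
d$y <- ifelse(d$z, 0.5 + 0.8 * d$x, 1.0 + 1.2 * d$x) + rnorm(2 * n, sd = 0.3)

fit_a <- lm(y ~ x, data = d[!d$z, ])
fit_b <- lm(y ~ x, data = d[d$z, ])
xbar_a <- colMeans(model.matrix(fit_a))
xbar_b <- colMeans(model.matrix(fit_b))
beta_a <- coef(fit_a)
beta_b <- coef(fit_b)

# Reference coefficients: equally weighted average of both groups
beta_r <- 0.5 * beta_a + 0.5 * beta_b

# Explained part: differences in explanatory variables, valued at beta_r
explained <- sum((xbar_a - xbar_b) * beta_r)

# Unexplained part, split into the two sub-parts
unexplained_a <- sum(xbar_a * (beta_a - beta_r))
unexplained_b <- sum(xbar_b * (beta_r - beta_b))
unexplained   <- unexplained_a + unexplained_b

y_diff <- mean(d$y[!d$z]) - mean(d$y[d$z])
print(c(explained = explained, unexplained = unexplained, y.diff = y_diff))
```

The explained and unexplained parts sum to the difference in mean outcomes for any choice of reference coefficients; only the split between them changes.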

x

a list that contains the mean values of the explanatory variables:

x.mean.A: the mean values of explanatory variables for Group A
x.mean.B: the mean values of explanatory variables for Group B
x.mean.diff: equal to x.mean.A - x.mean.B

y

a list that contains the mean values of the dependent variable (i.e., the outcome variable). It contains the following components:

y.A: the mean outcome value for observations in Group A
y.B: the mean outcome value for observations in Group B
y.diff: the difference between the mean outcome values in Groups A and B. Equal to y.A - y.B.

Please cite as:

Hlavac, Marek (2022). oaxaca: Blinder-Oaxaca Decomposition in R. R package version 0.1.5. https://CRAN.R-project.org/package=oaxaca

References

Blinder, Alan S. (1973). Wage Discrimination: Reduced Form and Structural Estimates. Journal of Human Resources, 8(4), 436-455.

Cotton, Jeremiah. (1988). On the Decomposition of Wage Differentials. Review of Economics and Statistics, 70(2), 236-243.

Efron, Bradley. (1979). Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics, 7(1), 1-26.

Gardeazabal, Javier and Arantza Ugidos. (2004). More on Identification in Detailed Wage Decompositions. Review of Economics and Statistics, 86(4), 1034-1036.

Jann, Ben. (2008). The Blinder-Oaxaca Decomposition for Linear Regression Models. Stata Journal, 8(4), 453-479.

Neumark, David. (1988). Employers' Discriminatory Behavior and the Estimation of Wage Discrimination. Journal of Human Resources, 23(3), 279-295.

Oaxaca, Ronald L. (1973). Male-Female Wage Differentials in Urban Labor Markets. International Economic Review, 14(3), 693-709.

Oaxaca, Ronald L. and Michael R. Ransom. (1994). On Discrimination and the Decomposition of Wage Differentials. Journal of Econometrics, 61(1), 5-21.

Reimers, Cordelia W. (1983). Labor Market Discrimination Against Hispanic and Black Men. Review of Economics and Statistics, 65(4), 570-579.

Examples

# NOT RUN {
# set random seed
set.seed(03104)

# load the package and the data set of Hispanic workers in Chicago
library("oaxaca")
data("chicago")

# perform Blinder-Oaxaca Decomposition:
# explain differences in log real wages across native and foreign-born groups
oaxaca.results.1 <- oaxaca(ln.real.wage ~ age + female + LTHS + some.college + 
                                          college + advanced.degree | foreign.born, 
                           data = chicago, R = 30)

# print the results
print(oaxaca.results.1)

# Next:
# - adjust education dummy variable coefficients to make results
#   invariant to the choice of omitted baseline (reference category)
# - include additional weights for the twofold decomposition that give
#   weights of 0.2 and 0.4 to Group A relative to Group B in the choice
#   of reference coefficients

oaxaca.results.2 <- oaxaca(ln.real.wage ~ age + female + LTHS + some.college + 
                                          college + advanced.degree | foreign.born |
                                          LTHS + some.college + college + advanced.degree,
                           data = chicago, group.weights = c(0.2, 0.4), R = 30)

# plot the results
plot(oaxaca.results.2)

# }