oaxaca performs a Blinder-Oaxaca decomposition for linear regression models (Blinder, 1973; Oaxaca, 1973). This statistical method decomposes the difference in the means of outcome variables across two groups into a part that is due to cross-group differences in explanatory variables and a part that is due to differences in group-specific coefficients. Economists have used Blinder-Oaxaca decompositions extensively to study labor market discrimination. In principle, however, the method is appropriate for the exploration of cross-group differences in any outcome variable.
The oaxaca function allows users to estimate both a threefold and a twofold variant of the decomposition, as described and implemented by Jann (2008). It supports a variety of reference coefficient weights, as well as pooled model estimation. It can also adjust coefficients on indicator variables to be invariant to the choice of the omitted reference category. Bootstrapped standard errors are calculated (e.g., Efron, 1979). The function returns an object of class "oaxaca" that can be visualized using the plot.oaxaca method.
oaxaca(formula, data, group.weights = NULL, R = 100, reg.fun = lm, ...)
a formula that specifies the model that the function will run. Typically, the formula is of the following form:
y ~ x1 + x2 + x3 + ... | z
where y is the dependent variable, x1 + x2 + x3 + ... are explanatory variables, and z is an indicator variable that is TRUE (or equal to 1) when an observation belongs to Group B, and FALSE (or equal to 0) when it belongs to Group A. The formula can also take on an alternative form:
y ~ x1 + x2 + x3 + ... | z | d1 + d2 + d3 + ...
Here, d1 + d2 + d3 + ... are indicator ("dummy") variables that will be adjusted so that the decomposition results do not change depending on the user's choice of the reference category (Gardeazabal and Ugidos, 2004).
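The adjustment of dummy-variable coefficients can be illustrated in base R. The sketch below uses the deviation-from-mean normalization described by Jann (2008) in connection with Gardeazabal and Ugidos (2004); the simulated data, the helper normalize_dummies, and all variable names are hypothetical, and the package's internal computation may differ in detail.

```r
# Simulated data: one continuous regressor and a three-category factor.
set.seed(2)
n <- 300
edu <- factor(sample(c("low", "mid", "high"), n, replace = TRUE))
x <- rnorm(n)
effect <- c(low = 0, mid = 0.4, high = 1)
y <- 1 + 0.5 * x + unname(effect[as.character(edu)]) + rnorm(n)
d <- data.frame(y = y, x = x, edu = edu)

# Hypothetical helper: express the category effects as deviations from their
# mean (the omitted category counts as 0) and move that mean to the intercept.
normalize_dummies <- function(fit, lev, prefix) {
  b <- coef(fit)
  nm <- paste0(prefix, lev)
  dummies <- setNames(rep(0, length(lev)), lev)
  present <- nm %in% names(b)
  dummies[present] <- b[nm[present]]
  shift <- mean(dummies)
  c("(Intercept)" = unname(b["(Intercept)"]) + shift, dummies - shift)
}

# Fit the same model with two different omitted (reference) categories.
fit_low  <- lm(y ~ x + edu, data = transform(d, edu = relevel(edu, "low")))
fit_high <- lm(y ~ x + edu, data = transform(d, edu = relevel(edu, "high")))

norm_low  <- normalize_dummies(fit_low,  levels(d$edu), "edu")
norm_high <- normalize_dummies(fit_high, levels(d$edu), "edu")

# The normalized coefficients do not depend on the omitted category.
print(all.equal(norm_low, norm_high))
```

After normalization every category, including the one that was omitted, has its own coefficient, which is why the results no longer depend on the choice of baseline.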
a data frame containing the data to be used in the Blinder-Oaxaca decomposition.
a vector of numeric values between 0 and 1. These values specify the weight given to Group A relative to Group B in determining the reference set of coefficients (Oaxaca and Ransom, 1994). By default, the following weights are included in each estimation:
0: Group A coefficients used as reference.
1: Group B coefficients used as reference.
0.5: Equally weighted average (each 0.5) of Group A and B coefficients used as reference, as in Reimers (1983).
An average of Group A and B coefficients weighted by the number of observations in Group A and B, following Cotton (1988).
-1: Coefficients from a pooled regression (that does not include the group indicator variable) used as reference, as suggested by Neumark (1988).
-2: Coefficients from a pooled regression (that includes the group indicator) used as reference. See Jann (2008).
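Several of the reference coefficient vectors listed above can be reproduced by hand. The following base R sketch is illustrative only: the simulated data and all object names are assumptions, and it does not call the package itself.

```r
# Simulated data with unequal group sizes (z = TRUE marks Group B).
set.seed(1)
n_A <- 150
n_B <- 250
d <- data.frame(
  z  = rep(c(FALSE, TRUE), times = c(n_A, n_B)),
  x1 = rnorm(n_A + n_B),
  x2 = rnorm(n_A + n_B)
)
d$y <- 1 + 0.3 * d$z + (1 + 0.5 * d$z) * d$x1 - 0.5 * d$x2 + rnorm(n_A + n_B)

# Group-specific coefficients.
beta_A <- coef(lm(y ~ x1 + x2, data = d[!d$z, ]))
beta_B <- coef(lm(y ~ x1 + x2, data = d[d$z, ]))

# Reimers (1983): equally weighted average of the two coefficient vectors.
beta_reimers <- 0.5 * beta_A + 0.5 * beta_B

# Cotton (1988): average weighted by the number of observations per group.
beta_cotton <- (n_A * beta_A + n_B * beta_B) / (n_A + n_B)

# Neumark (1988): pooled regression without the group indicator.
beta_pooled_1 <- coef(lm(y ~ x1 + x2, data = d))

# Jann (2008): pooled regression with the group indicator; the indicator's
# own coefficient is dropped from the reference vector.
beta_pooled_2 <- coef(lm(y ~ x1 + x2 + z, data = d))[names(beta_A)]

beta_ref <- rbind(reimers = beta_reimers, cotton = beta_cotton,
                  pooled_1 = beta_pooled_1, pooled_2 = beta_pooled_2)
print(round(beta_ref, 3))
```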
number of bootstrapping replicates for the calculation of standard errors. No bootstrapping is performed when the value of R is set to NULL.
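As a rough illustration of how bootstrapped standard errors arise, the sketch below resamples observations with replacement and recomputes one decomposition component each time. The resampling scheme, the simulated data, and the helper endowments_part are assumptions made for illustration; the package's own bootstrap procedure may differ.

```r
# Simulated data (z = TRUE marks Group B).
set.seed(1)
n <- 400
d <- data.frame(
  z  = rep(c(FALSE, TRUE), each = n / 2),
  x1 = rnorm(n),
  x2 = rnorm(n)
)
d$x1 <- d$x1 + 0.5 * d$z
d$y <- 1 + 0.3 * d$z + (1 + 0.5 * d$z) * d$x1 - 0.5 * d$x2 + rnorm(n)

# Hypothetical helper: the endowments part of a threefold decomposition,
# i.e. the difference in mean regressors valued at Group B coefficients.
endowments_part <- function(data) {
  fit_A <- lm(y ~ x1 + x2, data = data[!data$z, ])
  fit_B <- lm(y ~ x1 + x2, data = data[data$z, ])
  x_diff <- colMeans(model.matrix(fit_A)) - colMeans(model.matrix(fit_B))
  sum(x_diff * coef(fit_B))
}

# Recompute the statistic on R resampled data sets; the standard deviation
# of the replicates serves as the bootstrapped standard error.
R <- 100
boot_est <- replicate(R, endowments_part(d[sample(nrow(d), replace = TRUE), ]))
result <- c(coef = endowments_part(d), se = sd(boot_est))
print(result)
```

Larger values of R give more stable standard errors at the cost of longer run time.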
a function that estimates the desired regression model. The function must accept the arguments formula and data, and the fitted object it returns must work with model.frame and model.matrix in the same way that objects returned by the standard functions lm and glm do. Additional arguments can be passed on via the ... argument. By default, an Ordinary Least Squares (OLS) regression is performed via the lm function.
additional arguments that will be passed on to the regression function specified by reg.fun.
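The requirements on a custom regression function can be checked in base R before it is supplied as reg.fun. The wrapper my_reg below is hypothetical and the data are simulated; the sketch only verifies the contract described above and does not call oaxaca itself.

```r
# Simulated data.
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + d$x1 - 0.5 * d$x2 + rnorm(n)

# Hypothetical custom estimator: accepts formula and data, and forwards any
# additional arguments (here, family) to glm.
my_reg <- function(formula, data, ...) {
  glm(formula, data = data, ...)
}

fit <- my_reg(y ~ x1 + x2, data = d, family = gaussian())

# The fitted object should work with model.frame and model.matrix,
# just as objects returned by lm and glm do.
mf <- model.frame(fit)
mm <- model.matrix(fit)
print(dim(mf))
print(colnames(mm))

# With a Gaussian family and identity link, the estimates match OLS.
print(all.equal(coef(fit), coef(lm(y ~ x1 + x2, data = d))))
```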
oaxaca returns an object of class "oaxaca". The corresponding summary function (i.e., summary.oaxaca) returns the same object.
An object of class "oaxaca" is a list containing the following components:
a list that contains information about the regression coefficients used in estimating the decomposition. If dummy variables d1 + d2 + d3 + ... are specified in the formula argument, this list contains coefficients that have been adjusted to make estimation results invariant to the choice of the omitted baseline category (Gardeazabal and Ugidos, 2004). The beta list contains the following components:
beta.A: coefficients from a regression on observations in Group A
beta.B: coefficients from a regression on observations in Group B
beta.diff: equal to beta.A - beta.B
beta.R: a matrix that contains the reference coefficients for each of the estimated twofold decompositions
the matched call.
a list that contains information about the number of observations used in the analysis. It contains the following components:
n.A: the number of observations in Group A
n.B: the number of observations in Group B
n.pooled: the number of observations in the pooled model that includes both Group A and Group B
a numeric vector that contains the number of bootstrapping replicates.
a list that contains estimated regression objects:
reg.A: a regression on observations in Group A
reg.B: a regression on observations in Group B
reg.pooled.1: a pooled regression that does not include the group indicator variable (Neumark, 1988)
reg.pooled.2: a pooled regression that includes the group indicator variable (Jann, 2008)
a list that contains the result of the threefold Blinder-Oaxaca decomposition. It decomposes the difference in mean outcomes into three parts:
endowments: the contribution of differences in explanatory variables across groups.
coefficients: the part that is due to group differences in the coefficients (or "effect size"). Includes differences in the model intercept.
interaction: the part that accounts for the fact that cross-group differences in explanatory variables and coefficients occur at the same time.
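The three parts can be computed by hand from two group-specific regressions. The base R sketch below follows the threefold formulas in Jann (2008), written from the viewpoint of Group B; the simulated data and all object names are assumptions, and the package's internal computation may differ in detail.

```r
# Simulated data (z = TRUE marks Group B).
set.seed(1)
n <- 400
d <- data.frame(
  z  = rep(c(FALSE, TRUE), each = n / 2),
  x1 = rnorm(n),
  x2 = rnorm(n)
)
d$x1 <- d$x1 + 0.5 * d$z
d$y <- 1 + 0.3 * d$z + (1 + 0.5 * d$z) * d$x1 - 0.5 * d$x2 + rnorm(n)

# Group-specific OLS regressions.
fit_A <- lm(y ~ x1 + x2, data = d[!d$z, ])
fit_B <- lm(y ~ x1 + x2, data = d[d$z, ])
beta_A <- coef(fit_A)
beta_B <- coef(fit_B)

# Mean regressor values, including the constant term.
xbar_A <- colMeans(model.matrix(fit_A))
xbar_B <- colMeans(model.matrix(fit_B))

# Threefold decomposition of the gap in mean outcomes.
endow <- sum((xbar_A - xbar_B) * beta_B)             # endowments
coefs <- sum(xbar_B * (beta_A - beta_B))             # coefficients
inter <- sum((xbar_A - xbar_B) * (beta_A - beta_B))  # interaction

y_diff <- mean(d$y[!d$z]) - mean(d$y[d$z])
print(c(endowments = endow, coefficients = coefs, interaction = inter))

# The three parts add up to the difference in mean outcomes.
print(all.equal(endow + coefs + inter, y_diff))
```

Dropping the sum() calls gives the corresponding variable-by-variable contributions.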
The list threefold contains two sub-components: overall and variables. The former is a numeric vector that stores results - coefficients (coef) and standard errors (se) - for the overall decomposition of the difference in outcomes into the three parts described above. The latter is a numeric matrix that contains the results of a variable-by-variable threefold Blinder-Oaxaca decomposition.
a list that contains the result of the twofold Blinder-Oaxaca decomposition. It decomposes the difference in mean outcomes into two parts:
explained: the portion that is explained by cross-group differences in the explanatory variables.
unexplained: the remaining part that is not explained by differences in the explanatory variables. Often attributed to discrimination, but may also result from the influence of unobserved variables.
The unexplained part can be further decomposed into two sub-parts, unexplained A and unexplained B, that represent discrimination in favor of Group A and against Group B, respectively. See Jann (2008) for details on these sub-parts' interpretation.
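The twofold split, including the two unexplained sub-parts, can be sketched in base R. The example below uses coefficients from a pooled regression without the group indicator as the reference (Neumark, 1988) and follows the formulas in Jann (2008); the simulated data and all object names are assumptions rather than the package's internals.

```r
# Simulated data (z = TRUE marks Group B).
set.seed(1)
n <- 400
d <- data.frame(
  z  = rep(c(FALSE, TRUE), each = n / 2),
  x1 = rnorm(n),
  x2 = rnorm(n)
)
d$x1 <- d$x1 + 0.5 * d$z
d$y <- 1 + 0.3 * d$z + (1 + 0.5 * d$z) * d$x1 - 0.5 * d$x2 + rnorm(n)

# Group-specific and pooled regressions.
fit_A <- lm(y ~ x1 + x2, data = d[!d$z, ])
fit_B <- lm(y ~ x1 + x2, data = d[d$z, ])
beta_A <- coef(fit_A)
beta_B <- coef(fit_B)
beta_R <- coef(lm(y ~ x1 + x2, data = d))  # reference coefficients

xbar_A <- colMeans(model.matrix(fit_A))
xbar_B <- colMeans(model.matrix(fit_B))

# Explained part: differences in regressors valued at reference coefficients.
explained <- sum((xbar_A - xbar_B) * beta_R)

# Unexplained part, split by how far each group's coefficients lie from
# the reference coefficients.
unexplained_A <- sum(xbar_A * (beta_A - beta_R))
unexplained_B <- sum(xbar_B * (beta_R - beta_B))
unexplained <- unexplained_A + unexplained_B

y_diff <- mean(d$y[!d$z]) - mean(d$y[d$z])
print(c(explained = explained, unexplained = unexplained))

# Explained and unexplained parts add up to the gap in mean outcomes.
print(all.equal(explained + unexplained, y_diff))
```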
The list twofold contains two sub-components: overall and variables. The former is a numeric matrix that stores results - coefficients (coef) and standard errors (se) - for the overall decomposition of the difference in outcomes into the two parts described above. The latter is a list of numeric matrices that contains the results of a variable-by-variable twofold Blinder-Oaxaca decomposition. In all matrices, the weight column indicates the weight given to Group A relative to Group B in determining the reference coefficients.
a list that contains:
x.mean.A: the mean values of explanatory variables for Group A
x.mean.B: the mean values of explanatory variables for Group B
x.mean.diff: equal to x.mean.A - x.mean.B
a list that contains the mean values of the dependent variable (i.e., the outcome variable). It contains the following components:
y.A: the mean outcome value for observations in Group A
y.B: the mean outcome value for observations in Group B
y.diff: the difference between the mean outcome values in Groups A and B. Equal to y.A - y.B.
Hlavac, Marek (2022). oaxaca: Blinder-Oaxaca Decomposition in R. R package version 0.1.5. https://CRAN.R-project.org/package=oaxaca
Blinder, Alan S. (1973). Wage Discrimination: Reduced Form and Structural Estimates. Journal of Human Resources, 8(4), 436-455.
Cotton, Jeremiah. (1988). On the Decomposition of Wage Differentials. Review of Economics and Statistics, 70(2), 236-243.
Efron, Bradley. (1979). Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics, 7(1), 1-26.
Gardeazabal, Javier and Arantza Ugidos. (2004). More on Identification in Detailed Wage Decompositions. Review of Economics and Statistics, 86(4), 1034-1036.
Jann, Ben. (2008). The Blinder-Oaxaca Decomposition for Linear Regression Models. Stata Journal, 8(4), 453-479.
Neumark, David. (1988). Employers' Discriminatory Behavior and the Estimation of Wage Discrimination. Journal of Human Resources, 23(3), 279-295.
Oaxaca, Ronald L. (1973). Male-Female Wage Differentials in Urban Labor Markets. International Economic Review, 14(3), 693-709.
Oaxaca, Ronald L. and Michael R. Ransom. (1994). On Discrimination and the Decomposition of Wage Differentials. Journal of Econometrics, 61(1), 5-21.
Reimers, Cordelia W. (1983). Labor Market Discrimination Against Hispanic and Black Men. Review of Economics and Statistics, 65(4), 570-579.
# NOT RUN {
# set random seed
set.seed(03104)
# load data set of Hispanic workers in Chicago
data("chicago")
# perform Blinder-Oaxaca Decomposition:
# explain differences in log real wages across native and foreign-born groups
oaxaca.results.1 <- oaxaca(ln.real.wage ~ age + female + LTHS + some.college +
                             college + advanced.degree | foreign.born,
                           data = chicago, R = 30)
# print the results
print(oaxaca.results.1)
# Next:
# - adjust gender and education dummy variable coefficients to make results
# invariant to the choice of omitted baseline (reference category)
# - include additional weights for the twofold decomposition that give
# weights of 0.2 and 0.4 to Group A relative to Group B in the choice
# of reference coefficients
oaxaca.results.2 <- oaxaca(ln.real.wage ~ age + female + LTHS + some.college +
                             college + advanced.degree | foreign.born |
                             LTHS + some.college + college + advanced.degree,
                           data = chicago, group.weights = c(0.2, 0.4), R = 30)
# plot the results
plot(oaxaca.results.2)
# }