Fits one of three possible Bayesian Instrumental Regression for Disparity
Estimation (BIRDiE) models to BISG probabilities and covariates. The simplest
Categorical-Dirichlet model (cat_dir()
) is appropriate when there are no
covariates or when all covariates are discrete and fully interacted with
another. The more general Categorical mixed-effects model (cat_mixed()
) is
a supports any number of fixed effects and up to one random intercept. For
continuous outcomes a Normal linear model is available (gaussian()
).
birdie(
r_probs,
formula,
data,
family = cat_dir(),
prior = NULL,
weights = NULL,
algorithm = c("em", "gibbs", "em_boot"),
iter = 400,
warmup = 50,
prefix = "pr_",
ctrl = birdie.ctrl()
)
An object of class birdie
, for which many
methods are available. The model estimates may be accessed with
coef.birdie()
, and updated BISG probabilities (conditioning on the
outcome) may be accessed with fitted.birdie()
. Uncertainty estimates, if
available, can be accessed with $se
and vcov.birdie()
.
A data frame or matrix of BISG probabilities, with one row per
individual. The output of bisg()
can be used directly here.
A two-sided formula object describing the model structure. The
left-hand side is the outcome variable, which must be discrete. A single
random intercept term, denoted with a vertical bar ("(1 | <term>)"
), is
supported on the right-hand side.
An optional data frame containing the variables named in formula
.
A description of the complete-data model type to fit. Options are:
cat_dir()
: Categorical-Dirichlet model. All covariates must be fully
interacted.
cat_mixed()
: Categorical mixed-effects model. Up to one random effect
is supported.
gaussian()
: Linear model.
See the Details section below for more information on the various models.
A list with entries specifying the model prior.
For the cat_dir
model, the only entry is alpha
, which should be a matrix
of Dirichlet hyperparameters. The matrix should have one row for every
level of the outcome variable and one column for every racial group. The
default prior (used when prior=NULL
) is an empirical Bayes prior equal to
the weighted-mean estimate of the outcome-race table. A fully
noninformative prior with all entries set to \(\epsilon\) can be obtained
by setting prior=NA
. When prior=NULL
and algorithm="em"
or
"em_boot"
, 1 is added to the prior so that the posterior mode, rather
than the mean, is shrunk toward these values.
For the cat_mixed
model, the prior
list should contain three scalar entries:
scale_int
, the standard deviation on the Normal prior for the intercepts
(which control the global estimates of Y|R
), scale_beta
, the standard
deviation on the Normal prior for the fixed effects, and scale_sigma
, the
prior mean of the standard deviation of the random intercepts. These can be
a single scalar or a vector with an entry for each racial group.
For the gaussian
model, the prior
list should contain two entries:
scale_int
, controlling, the standard deviation on the Normal prior for
the intercepts (which control the global estimates of Y|R
), and
scale_beta
, controlling the standard deviation on the Normal prior for
the fixed effects. These must be a single scalar. Each is expressed in
terms of the estimated residual standard deviation (i.e., they are
multiplied together to form the "true" prior).
The prior is stored after model fitting in the $prior
element of the
fitted model object.
An optional numeric vector specifying likelihood weights.
The inference algorithm to use. One of 3 options:
"em"
: An expectation-maximization algorithm which will perform inference
for the maximum a posteriori (MAP) parameter values. Computationally
efficient and supported by all the model families. No uncertainty
quantification.
"gibbs"
: A Gibbs sampler for performing full Bayesian inference.
Generally more computationally demanding than the EM algorithm, but
provides uncertainty quantification. Currently supported for cat_dir()
and
gaussian()
model families. Computation-reliability tradeoff can
be controlled with iter
argument.
"em_boot"
: Bootstrapped version of EM algorithm. Number of bootstrap
replicates controlled by iter
parameter. Provides approximate uncertainty
quantification. Currently supported for cat_dir()
and
gaussian()
model families.
The number of post-warmup Gibbs samples, or the number of
bootstrap replicates to use to compute approximate standard errors for the
main model estimates. Only available when family=cat_dir()
or
gaussian()
. Ignored if algorithm="em"
.
For bootstrapping, when there are fewer than 1,000 individuals or 100 or fewer replicates, a Bayesian bootstrap is used instead (i.e., weights are drawn from a \(\text{Dirichlet}(1, 1, ..., 1)\) distribution, which produces more reliable estimates.
Number of warmup iterations for Gibbs sampling.
Ignored unless algorithm="gibbs"
.
If r_probs
is a data frame, the columns containing racial
probabilities will be selected as those with names starting with prefix
.
The default will work with the output of bisg()
.
A list containing control parameters for the EM algorithm and
optimization routines. A list in the proper format can be made using
birdie.ctrl()
.
By default, birdie()
uses an expectation-maximization (EM) routine to find
the maximum a posteriori (MAP) estimate for the specified model. Asymptotic
variance-covariance matrices for the MAP estimate are available for the
Categorical-Dirichlet and Normal linear models via bootstrapping.
Full Bayesian inference is supported via Gibbs sampling for the
Categorical-Dirichlet and Normal linear models as well.
Whatever model or method is used, a finite-population estimate of the
outcome-given-race distribution for the entire observed sample is always
calculated and stored as $est
in the returned object, which can be accessed
with coef.birdie()
as well.
The Categorical-Dirichlet model is specified as follows: $$
Y_i \mid R_i, X_i, \Theta \sim \text{Categorical}(\theta_{R_iX_i}) \\
\theta_{rx} \sim \text{Dirichlet}(\alpha_r),
$$ where \(Y\) is the outcome variable, \(R\) is race, \(X\) are
covariates (fixed effects), and \(\theta_{rx}\) and \(\alpha_r\) are
vectors with length matching the number of levels of the outcome variable.
There is one vector \(\theta_{rx}\) for every combination of race and
covariates, hence the need for formula
to either have no covariates or a
fully interacted structure.
The Categorical mixed-effects model is specified as follows: $$
Y_i \mid R_i, X_i, \Theta \sim \text{Categorical}(g^{-1}(\mu_{R_iX_i})) \\
\mu_{rxy} = W\beta_{ry} + Zu_{ry} \\
u_{r} \mid \vec\sigma_{r}, L_r \sim \mathcal{N}(0,
\text{diag}(\vec\sigma_{r})C_r\text{diag}(\vec\sigma_{r})) \\
\beta_{ry} \sim \mathcal{N}(0, s^2_{r\beta}) \\
\sigma_{ry} \sim \text{Inv-Gamma}(4, 3s_{r\sigma}) \\
C_r \sim \text{LKJ}(2),
$$ where \(\beta_{ry}\) are the fixed effects, \(u_{ry}\) is the random
intercept, and \(g\) is a softmax link function.
Estimates for \(\beta_{ry}\) and \(\sigma_{ry}\) are stored in the
$beta
and $sigma
elements of the fitted model object.
The Normal linear model is specified as follows: $$
Y_i \mid R_i, \vec X_i, \Theta \sim \mathcal{N}(\vec X_i^\top\vec\theta, \sigma^2) \\
\sigma^2 \sim \text{Inv-Gamma}(n_\sigma/2, l_\sigma^2 n_\sigma/2) \\
\beta_{\text{intercept}} \sim \mathcal{N}(0, s^2_\text{int}) \\
\beta_k \sim \mathcal{N}(0, s^2_\beta), \\
$$ where \(\vec\theta\) is a vector of linear model coefficients.
Estimates for \(\theta\) and \(\sigma\) are stored in the
$beta
and $sigma
elements of the fitted model object.
More details on the models and their properties may be found in the paper referenced below.
McCartan, C., Fisher, R., Goldin, J., Ho, D.E., & Imai, K. (2025). Estimating Racial Disparities when Race is Not Observed. Journal of the American Statistical Association. Available at tools:::Rd_expr_doi("10.1080/01621459.2025.2526695").
# \donttest{
data(pseudo_vf)
r_probs = bisg(~ nm(last_name) + zip(zip), data=pseudo_vf)
# Process zip codes to remove missing values
pseudo_vf$zip = proc_zip(pseudo_vf$zip)
fit = birdie(r_probs, turnout ~ 1, data=pseudo_vf)
print(fit)
fit$se # uncertainty quantification
fit = birdie(r_probs, turnout ~ zip, data=pseudo_vf, algorithm="gibbs")
fit = birdie(r_probs, turnout ~ (1 | zip), data=pseudo_vf,
family=cat_mixed(), ctrl=birdie.ctrl(abstol=1e-3))
summary(fit)
coef(fit)
fitted(fit)
# }
Run the code above in your browser using DataLab