did Function Arguments
Function to process arguments passed to the main methods in the
did package. It also runs some tests to make sure the
data is in the proper format and tries to throw helpful error messages.
pre_process_did2(
yname,
tname,
idname,
gname,
xformla = NULL,
data,
panel = TRUE,
allow_unbalanced_panel,
control_group = c("nevertreated", "notyettreated"),
anticipation = 0,
weightsname = NULL,
fix_weights = NULL,
alp = 0.05,
bstrap = FALSE,
cband = FALSE,
biters = 1000,
clustervars = NULL,
est_method = "dr",
base_period = "varying",
print_details = TRUE,
faster_mode = FALSE,
pl = FALSE,
cores = 1,
call = NULL
)
A DIDparams object
The name of the outcome variable
The name of the column containing the time periods
The individual (cross-sectional unit) id name
The name of the variable in data that
contains the first period when a particular observation is treated.
This should be a positive number for all observations in treated groups.
It defines which "group" a unit belongs to. It should be 0 for units
in the untreated group.
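As an illustration of the coding described above, the following minimal sketch builds such a group variable in base R. The data frame and the column names `id`, `year`, `treat`, and `G` are hypothetical and are not part of the package.

```r
# Hypothetical long-format panel; `treat` is 1 in periods where the unit is treated.
df <- data.frame(
  id    = rep(1:3, each = 4),
  year  = rep(2001:2004, times = 3),
  treat = c(0, 0, 1, 1,   # unit 1: first treated in 2003
            0, 1, 1, 1,   # unit 2: first treated in 2002
            0, 0, 0, 0)   # unit 3: never treated
)

# First treated period per unit; 0 for units that are never treated.
first_treated <- sapply(split(df, df$id), function(u) {
  if (any(u$treat == 1)) min(u$year[u$treat == 1]) else 0
})

# Attach the unit-level value to every row of that unit.
df$G <- unname(first_treated[as.character(df$id)])
```

The resulting column `G` could then be supplied by name as gname.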
A formula for the covariates to include in the
model. It should be of the form ~ X1 + X2. Default
is NULL which is equivalent to xformla=~1. This is
used to create a matrix of covariates which is then passed
to the 2x2 DID estimator chosen in est_method.
For time-varying covariates: (1) With balanced panel data,
in each 2x2 comparison, the covariates
are taken to be the value of the covariates in the earlier time
period, and all of the underlying computations involve changes in Y
as a function of those values of covariates. (2) With repeated cross
sections data and unbalanced panel data, the covariates are taken
from each time period and computations involve Y_post conditional
on X_post minus Y_pre conditional on X_pre. A byproduct of this
is that, with balanced panel data and in the presence of
time-varying covariates, it is possible to get different numerical
results depending on whether allow_unbalanced_panel is TRUE or
FALSE.
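To make the formula-to-matrix step concrete, here is a small base R sketch using `model.matrix()`. The data frame and the covariate names `X1` and `X2` are hypothetical, and whether the package builds its covariate matrix with exactly this call is an assumption.

```r
# Hypothetical covariate data.
df <- data.frame(X1 = c(0.5, 1.2, -0.3), X2 = c(1, 0, 1))

# A formula of the form ~ X1 + X2 yields an intercept column plus one column per covariate.
covs <- model.matrix(~ X1 + X2, data = df)

# The intercept-only formula ~1 yields a single column of ones.
covs_null <- model.matrix(~ 1, data = df)
```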
The name of the data.frame that contains the data
Whether or not the data is a panel dataset.
The panel dataset should be provided in long format -- that
is, where each row corresponds to a unit observed at a
particular point in time. The default is TRUE. When
using a panel dataset, the variable idname must
be set. When panel=FALSE, the data is treated
as repeated cross sections.
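For readers whose data is stored in wide format, the sketch below shows one way to reach the long format described above using base R's `reshape()`. The data frame and the column names `id`, `y.2001`, `y.2002`, and `year` are hypothetical.

```r
# Hypothetical wide data: one row per unit, one outcome column per period.
wide <- data.frame(id = 1:2, y.2001 = c(1.0, 0.5), y.2002 = c(1.4, 0.6))

# Long format: one row per unit and period.
long <- reshape(
  wide,
  direction = "long",
  varying   = c("y.2001", "y.2002"),
  idvar     = "id",
  timevar   = "year",
  sep       = "."
)
```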
Whether or not the function should
"balance" the panel with respect to time and id. The default
value is FALSE, which means that att_gt() will drop
all units that are not observed in all periods.
The advantage of this is that the computations are faster
(sometimes substantially).
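The balancing step described above can be sketched in base R as follows. This is an illustration only, not the package's internal code; the data frame and column names are hypothetical, and it assumes at most one row per unit and period.

```r
# Hypothetical unbalanced panel: unit 2 is not observed in 2002.
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3),
  year = c(2001, 2002, 2003, 2001, 2003, 2001, 2002, 2003),
  y    = c(1.0, 1.5, 2.0, 0.7, 1.1, 0.2, 0.4, 0.9)
)

# Keep only units observed in every period.
n_periods    <- length(unique(df$year))
obs_per_unit <- table(df$id)
complete_ids <- names(obs_per_unit)[obs_per_unit == n_periods]
balanced     <- df[as.character(df$id) %in% complete_ids, ]
```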
Which units to use as the control group.
The default is "nevertreated" which sets the control group
to be the group of units that never participate in the
treatment. This group does not change across groups or
time periods. The other option is to set
control_group="notyettreated". In this case, the control group
is set to the group of units that have not yet participated
in the treatment in that time period. This includes all
never treated units, but it includes additional units that
eventually participate in the treatment, but have not
participated yet.
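The difference between the two options can be illustrated with a small base R sketch. The helper `control_units()` and the unit labels are hypothetical, and the sketch ignores anticipation for simplicity, so it should not be read as the package's selection rule.

```r
# Hypothetical unit-level groups: first treated period, with 0 for never treated.
G <- c(a = 0, b = 0, c = 2003, d = 2004, e = 2006)

# Names of the units that serve as controls for group g in period t.
control_units <- function(G, g, t, control_group = "nevertreated") {
  never <- G == 0
  if (control_group == "nevertreated") {
    keep <- never
  } else {
    # Never-treated units plus units from other groups not yet treated by period t.
    keep <- never | (G > t & G != g)
  }
  names(G)[keep]
}

never_ctrl  <- control_units(G, g = 2003, t = 2004)
notyet_ctrl <- control_units(G, g = 2003, t = 2004, control_group = "notyettreated")
```

With these inputs, the never-treated control group is the same in every period, while the not-yet-treated control group also picks up unit `e`, which is first treated in 2006.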
The number of time periods before treatment in which units can anticipate participating in the treatment, so that anticipation can affect their untreated potential outcomes.
The name of the column containing the sampling weights.
If not set, all observations have same weight. When weights are
time-invariant (constant within each unit across periods), all
fix_weights options produce identical results and no special
handling is needed.
When weights vary across time (e.g., time-varying population sizes), the default behavior differs by panel type:
Balanced panel: Each 2x2 DiD comparison uses the weight from the
earlier of the two time periods involved. For post-treatment cells,
this is the base period (g-1). For pre-treatment cells with
base_period="varying", this is the pre-treatment period itself.
The panel DRDID estimators are used.
Repeated cross sections / unbalanced panel: Both periods' per-observation weights are passed directly to the RC DRDID estimators, so each observation carries its own period-specific weight.
Use the fix_weights argument to override the default behavior.
Controls how time-varying sampling weights are resolved. Only relevant when weights vary across time; with time-invariant weights, all options produce identical results. Options:
NULL (default): For a balanced panel, uses the weight from
the earlier of the two time periods in each 2x2 comparison. For
post-treatment cells, this is the base period (g-1). For
pre-treatment cells, this depends on the base_period setting.
For repeated cross sections or an unbalanced panel, uses per-observation weights from both
periods.
"varying": Uses per-observation, period-specific weights
for all panel types. For balanced panel data, this switches to the
repeated cross-section DRDID estimators so that pre-period and
post-period observations each carry their own weight. Covariates
are held fixed at their pre-period values (same as the default
panel estimator). This is the most flexible option for weights but
sacrifices the efficiency of the panel estimator. For RC/unbalanced
panel, this is identical to the default. Not supported with custom
est_method functions.
"base_period": Fixes weights at the base period (g-1) for
all (g,t) cells within a group, for both pre-treatment and
post-treatment comparisons. Ensures all cells within a group use the
same weights. For unbalanced panels, units not observed in the base
period are dropped with a warning. Not supported for repeated cross
sections (panel = FALSE).
"first_period": Fixes weights at the first time period in
the dataset for all (g,t) cells. For unbalanced panels, units not
observed in the first period are dropped with a warning. Not supported
for repeated cross sections (panel = FALSE).
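The idea of fixing weights at a single period can be sketched in base R as below. The helper `weight_at()`, the data frame, and the column names are hypothetical. The sketch only shows which period's weight each option points to and does not reproduce the package's handling of dropped units or estimators.

```r
# Hypothetical balanced panel with time-varying weights in column `w`.
df <- data.frame(
  id   = rep(1:2, each = 3),
  year = rep(2001:2003, times = 2),
  w    = c(10, 12, 15, 20, 21, 25)
)

# One weight per unit, taken from a single period.
weight_at <- function(df, period) {
  sub <- df[df$year == period, ]
  setNames(sub$w, sub$id)
}

g <- 2003                                # a group first treated in 2003
w_base  <- weight_at(df, g - 1)          # "base_period": weights from period g-1
w_first <- weight_at(df, min(df$year))   # "first_period": weights from the first period
```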
The significance level. The default is 0.05.
Boolean for whether or not to compute standard errors using
the multiplier bootstrap. The default in the signature above is FALSE
(cband also defaults to FALSE there, so uniform confidence bands
are not returned). If bstrap=FALSE, analytical standard errors are
reported; these are cluster-robust when clustervars is supplied.
Boolean for whether or not to compute a uniform confidence
band that covers all of the group-time average treatment effects
with fixed probability 1-alp. In order to compute uniform confidence
bands, bstrap must also be set to TRUE. The default in the signature above is
FALSE.
The number of bootstrap iterations to use. The default is 1000,
and this is only applicable if bstrap=TRUE.
A vector of variable names to cluster on. At most two
variables can be supplied (otherwise an error is thrown), and one of these
must be the same as idname, which allows for clustering at the individual
level. Clustered standard errors are available with the multiplier bootstrap
(bstrap=TRUE) or analytically (bstrap=FALSE).
The method used to compute group-time average treatment effects. The default is "dr", which uses the doubly robust
approach in the DRDID package. Other built-in methods
include "ipw" for inverse probability weighting and "reg" for
first step regression estimators. The user can also pass their
own function for estimating group time average treatment
effects. The required signature depends on the data structure:
Panel data (panel=TRUE): f(y1, y0, D, covariates, i.weights, inffunc, ...) where y1 is an n x 1 vector of
post-treatment outcomes, y0 is an n x 1 vector of
pre-treatment outcomes, D is a binary vector indicating
treatment group membership, covariates is an n x k matrix,
i.weights is a vector of sampling weights, and inffunc is a
logical requesting influence-function computation.
Repeated cross sections / unbalanced panel (panel=FALSE):
f(y, post, D, covariates, i.weights, inffunc, ...) where y is
the outcome vector (length n), post is a binary indicator for
the post-treatment period, D is a binary treatment indicator,
covariates is an n x k matrix, i.weights is a vector of
sampling weights, and inffunc is a logical.
In both cases the function should return a list that includes
ATT (the estimated group-time average treatment effect) and
att.inf.func (an n x 1 influence function — one entry per
observation passed into the estimator).
The function can return other things as well, but these are
the only two that are required. With no covariates
(xformla = NULL), the built-in methods ("dr", "ipw", "reg") all
reduce to the unconditional difference-in-differences estimator, so
the choice among them is irrelevant; a custom est_method function
is still called (with an intercept-only covariates matrix) and
determines the estimates.
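A minimal sketch of a custom panel estimator that follows the documented signature and return value is given below. The function name `my_did` and the toy inputs are hypothetical. The estimator is a simple weighted difference in mean outcome changes that accepts but ignores `covariates`, so it illustrates the interface rather than any of the package's built-in methods.

```r
# Hypothetical custom estimator for panel data (panel=TRUE).
my_did <- function(y1, y0, D, covariates, i.weights, inffunc = TRUE, ...) {
  dy <- y1 - y0                          # change in outcome for each unit
  w  <- i.weights / mean(i.weights)      # normalized sampling weights
  w_treat <- w * D
  w_cont  <- w * (1 - D)

  # Weighted mean outcome changes for the treated and comparison groups.
  mean_treat <- mean(w_treat * dy) / mean(w_treat)
  mean_cont  <- mean(w_cont * dy) / mean(w_cont)
  att <- mean_treat - mean_cont

  # Influence function of the difference of the two weighted means.
  inf_treat <- w_treat * (dy - mean_treat) / mean(w_treat)
  inf_cont  <- w_cont * (dy - mean_cont) / mean(w_cont)

  list(ATT = att, att.inf.func = if (inffunc) inf_treat - inf_cont else NULL)
}

# Toy inputs: two treated and two comparison units, intercept-only covariates.
res <- my_did(
  y1 = c(3, 5, 4, 5), y0 = c(1, 2, 3, 3), D = c(1, 1, 0, 0),
  covariates = matrix(1, nrow = 4, ncol = 1), i.weights = rep(1, 4)
)
```

A function like this would be supplied as est_method = my_did. For repeated cross sections the signature with `y` and `post` would be needed instead.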
Whether to use a "varying" base period or a "universal" base period. Either choice results in the same post-treatment estimates of ATT(g,t)'s. In pre-treatment periods, using a varying base period amounts to computing a pseudo-ATT in each treatment period by comparing the change in outcomes for a particular group relative to its comparison group in the pre-treatment periods (i.e., in pre-treatment periods this setting computes changes from period t-1 to period t, but repeatedly changes the value of t).
A universal base period fixes the base period to always be (g-anticipation-1). This does not compute pseudo-ATT(g,t)'s in pre-treatment periods, but rather reports average changes in outcomes from period t to (g-anticipation-1) for a particular group relative to its comparison group. This is analogous to what is often reported in event study regressions.
Using a varying base period results in an estimate of ATT(g,t) being reported in the period immediately before treatment. Using a universal base period normalizes the estimate in the period right before treatment (or earlier when the user allows for anticipation) to be equal to 0, but reports one extra estimate in an earlier period.
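The rule for which period serves as the base can be sketched as a small helper. The function `base_period_for()` is hypothetical, and the sketch assumes consecutive, unit-spaced time periods.

```r
# Base period used for the (g, t) cell under each setting.
base_period_for <- function(g, t, anticipation = 0, base_period = "varying") {
  universal_base <- g - anticipation - 1
  if (base_period == "universal" || t >= g - anticipation) {
    universal_base   # post-treatment cells use g - anticipation - 1 under both settings
  } else {
    t - 1            # varying base: compare period t with period t-1
  }
}

pre_varying   <- base_period_for(g = 2004, t = 2002)
pre_universal <- base_period_for(g = 2004, t = 2002, base_period = "universal")
post_varying  <- base_period_for(g = 2004, t = 2005)
```

Under the universal setting, the cell with t = g-anticipation-1 is compared with itself, which is why its estimate is normalized to 0.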
Whether or not to show details/progress of computations.
The default in the signature above is TRUE.
This option enables a faster version of did, optimizing
computation time for large datasets by improving data management within the package.
The default in the signature above is FALSE. Both modes produce identical results up to
numerical precision; while the difference is minimal for small datasets,
the speedup is substantial for large ones.
Whether or not to use parallel processing
The number of cores to use for parallel processing
Function call to att_gt