ca conducts classification analysis (CA) estimation and inference on
user-specified objects of interest: the first (weighted) moment or the
(weighted) distribution. Use t to specify the variables of interest.
When the object of interest is the moment, use cl to specify whether to
report the averages of the two groups or their difference.
ca(
fm,
data,
method = c("ols", "logit", "probit", "QR"),
var_type = c("binary", "continuous", "categorical"),
var,
compare,
subgroup = NULL,
samp_weight = NULL,
taus = c(5:95)/100,
u = 0.1,
interest = c("moment", "dist"),
t = c(1, 1, rep(0, dim(data)[2] - 2)),
cl = c("both", "diff"),
cat = NULL,
alpha = 0.1,
b = 500,
parallel = FALSE,
ncores = detectCores(),
seed = 1,
bc = TRUE,
range_cb = c(1:99)/100,
boot_type = c("nonpar", "weighted")
)
fm
Regression formula.
data
The data in use: the full sample or the subpopulation of interest.
method
Model used to estimate partial effects. Four options: "logit" (binary
response), "probit" (binary response), "ols" (interactive linear with
additive errors), and "QR" (linear model with non-additive errors).
Default is "ols".
var_type
The type of the variable of interest. Three options: "binary",
"categorical", "continuous". Default is "binary".
var
Name of the variable T of interest. Should be a character string.
compare
If the variable of interest is categorical, the user must specify which
two categories to compare. Should be a 1 by 2 character vector. For
example, if the two levels to compare are 1 and 3, then
compare = c("1", "3"), which calculates the partial effect of moving
from 1 to 3. To use this option, the variable given in var must first be
a factor in the data.
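
The preparation step for a categorical variable can be sketched in base R as follows. The data frame toy and the column educ are hypothetical and exist only for illustration; the snippet converts the column to a factor and checks that both levels named in compare are present.

```r
# Hypothetical toy data; `educ` stands in for the categorical variable T.
toy <- data.frame(educ = c(1, 2, 3, 1, 3, 2))

# The variable must be a factor before its name is passed as `var`.
toy$educ <- factor(toy$educ)

# Compare level "1" with level "3" (partial effect of moving from 1 to 3).
compare <- c("1", "3")

# Sanity check: both levels must exist in the factor.
stopifnot(all(compare %in% levels(toy$educ)))
```

Checking the levels up front is a design choice of this sketch; it surfaces a misspelled level before any estimation starts.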
subgroup
Subgroup of interest. Default is NULL. The specification should be a
logical vector. For example, suppose the data contain an indicator
variable for women (1 if female, 0 if male). If users are interested in
the SPE for women, they should specify subgroup = data[, "female"] == 1.
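
As a minimal illustration of the logical vector described above, the snippet below builds one from a made-up data frame. The names toy, female and wage are assumptions for the example, not part of any shipped dataset.

```r
# Hypothetical toy data with a 0/1 indicator for women.
toy <- data.frame(female = c(1, 0, 1, 1, 0),
                  wage   = c(12, 15, 11, 14, 13))

# Logical vector of length nrow(toy): TRUE for rows in the subgroup.
subgroup <- toy[, "female"] == 1

# Number of observations that fall in the subgroup.
sum(subgroup)
```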
samp_weight
Sampling weights of the data. Input should be an n by 1 vector, where n
denotes the sample size. Default is NULL.
taus
Quantile indexes for quantile regression. Default is c(5:95)/100.
u
Percentile that defines the most and least affected groups. Default is
0.1.
interest
Generic objects in the least and most affected subpopulations. Two
options:
(1) "moment": weighted mean of Z in the u-least/most affected
subpopulation.
(2) "dist": distribution of Z in the u-least/most affected
subpopulation.
Default is interest = "moment".
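
To make the "moment" object concrete, here is a rough base-R sketch of a weighted mean of Z in the u-least and u-most affected groups. It is not the package's implementation: the partial effects pe are simulated rather than estimated, the cut-offs use unweighted quantiles, and the direction of the difference (most minus least) is chosen only for illustration.

```r
set.seed(1)
n  <- 200
pe <- rnorm(n)    # stand-in for estimated partial effects (assumption)
z  <- runif(n)    # a characteristic Z of interest
w  <- rep(1, n)   # sampling weights (all equal here)
u  <- 0.1

# Unweighted quantile cut-offs; the package may use weighted quantiles.
least <- pe <= quantile(pe, u)
most  <- pe >= quantile(pe, 1 - u)

# Weighted mean of Z within each group.
m_least <- weighted.mean(z[least], w[least])
m_most  <- weighted.mean(z[most],  w[most])

c(least = m_least, most = m_most, diff = m_most - m_least)
```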
t
An index for the ca object. Users can either (1) specify the names of
the variables of interest directly, or (2) supply a 1 by ncol(data)
indicator vector that uses 1 to mark each variable of interest. For
example, if the total number of variables is 5 and the 1st and 3rd
variables are of interest, specify t = c(1, 0, 1, 0, 0).
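
The two ways of specifying t can be related to each other with a few lines of base R. The data frame toy is hypothetical; the snippet shows how a set of names maps to the indicator form and what the default in the signature selects.

```r
# Hypothetical data frame with five variables.
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 13:15)

# Option 1: name the variables of interest directly.
t_names <- c("a", "c")

# Option 2: a 0/1 indicator vector of length ncol(toy).
t_index <- as.integer(names(toy) %in% t_names)
t_index  # 1 0 1 0 0

# The default in the signature marks the first two columns.
t_default <- c(1, 1, rep(0, dim(toy)[2] - 2))
names(toy)[t_default == 1]  # "a" "b"
```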
cl
If interest = "moment", cl controls how the variables of interest
(specified in the t option) are reported for the most and least affected
groups. The default is "both", which shows the variables for the two
groups; the alternative is "diff", which shows the difference between
the two groups. The user can use summary.ca to tabulate the results,
which also contain the standard errors and p-values. If
interest = "dist", this option has no effect and can be left at its
default value.
cat
P-values in classification analysis are adjusted for multiplicity to
account for joint testing of zero coefficients for all variables within
a category. Suppose 3 variables of interest have been specified:
t = c("a", "b", "c"). Without loss of generality, assume "a" is not a
factor, while "b" and "c" are two factors. Then users need to specify
cat = c("b", "c"). Default is NULL.
alpha
Size (significance level) of the confidence interval. Should be between
0 and 1. Default is 0.1.
b
Number of bootstrap draws. Default is 500.
parallel
Whether to use parallel computation. The default is FALSE, in which case
only 1 CPU is used. If TRUE, the user can specify the number of CPUs in
the ncores option.
ncores
Number of cores for computation. Default is detectCores(), a function
from the parallel package that detects the number of CPUs on the current
host. For large datasets, parallel computing is highly recommended
because the bootstrap is time-consuming.
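
A cautious way to pick a value for ncores is sketched below. Leaving one core free is a common convention rather than a requirement of this function, and the NA guard is there because detectCores() can return NA on some platforms.

```r
library(parallel)  # ships with R; provides detectCores()

n_detected <- detectCores()

# Fall back to a single core if the count is unknown; otherwise leave
# one core free for the rest of the system (an illustrative choice).
ncores <- if (is.na(n_detected)) 1L else max(1L, n_detected - 1L)

ncores
```

The resulting value could then be passed together with parallel = TRUE.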
seed
Seed for pseudo-random number generation, for reproducibility. Default
is 1.
bc
Whether the estimates should be bias-corrected. Default is TRUE. If
FALSE, uncorrected estimates and the corresponding confidence bands are
reported.
range_cb
When interest = "dist", the package sorts and deduplicates the variables
of interest to estimate the weighted CDF. For large datasets, storing a
very large number of observations can cause memory problems, so users
can provide a vector of quantile indexes here and the package will sort
and deduplicate based on the corresponding weighted quantiles. To
disable this feature, set range_cb = NULL. Default is c(1:99)/100.
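
The memory saving described above can be illustrated with a small base-R sketch. It is only an approximation of the idea: the variable z is simulated, and unweighted quantiles are used where the description refers to weighted quantiles.

```r
set.seed(1)
z <- rnorm(1e5)   # a variable of interest with many distinct values

# Without thinning: one evaluation point per distinct observed value.
full_grid <- sort(unique(z))

# With thinning: evaluate only at the quantiles indexed by range_cb.
# Unweighted quantiles are an assumption of this sketch.
range_cb  <- c(1:99) / 100
thin_grid <- unique(quantile(z, probs = range_cb, names = FALSE))

c(full = length(full_grid), thinned = length(thin_grid))
```

The thinned grid has at most 99 points regardless of the sample size, which is what keeps the stored object small.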
boot_type
Type of bootstrap. The default is "nonpar", which implements the
nonparametric bootstrap. The alternative is "weighted", which implements
the weighted bootstrap.
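
The difference between the two bootstrap types can be sketched for a simple sample mean. This is a generic illustration, not the package's internal code; in particular, the use of standard exponential weights in the weighted bootstrap is an assumption of the sketch.

```r
set.seed(1)
x <- rnorm(100)   # toy sample
b <- 500          # number of bootstrap draws, matching the default above

# Nonparametric bootstrap: resample observations with replacement.
boot_nonpar <- replicate(b, mean(sample(x, replace = TRUE)))

# Weighted bootstrap: keep the sample fixed and draw random weights.
# Standard exponential weights are an assumption of this sketch.
boot_weighted <- replicate(b, weighted.mean(x, rexp(length(x))))

# Bootstrap standard errors from the two schemes.
c(nonpar = sd(boot_nonpar), weighted = sd(boot_weighted))
```

Because the weighted scheme never drops an observation, it can be convenient when resampling could leave a small category empty.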
If subgroup = NULL, all outputs are for the whole sample. Otherwise, the
outputs are subgroup results. When interest = "moment", the output is a
list containing
est
Estimates of variables in interest.
bse
Bootstrap standard errors.
joint_p
P-values that are adjusted for multiplicity to
account for joint testing for all variables.
pointwise_p
P-values that are not adjusted for joint testing.
If users have further specified cat (i.e., !is.null(cat)), the fourth
component will be replaced with p_cat: P-values that are adjusted for
multiplicity to account for joint testing for all variables within a
category. Users can use summary.ca to tabulate the results.
When interest = "dist", the output is a list of two components:
infresults
A list that stores estimates, upper and lower
confidence bounds for all variables in interest for least and most
affected groups.
sortvar
A list that stores sorted and unique variables in
interest.
We recommend using the plot.ca command to visualize the results.
Unless bc = FALSE, all estimates are bias-corrected; all confidence bands are monotonized. The bootstrap procedures follow Algorithm 2.2 in Chernozhukov, Fernandez-Val and Luo (2018).
# NOT RUN {
data("mortgage")
### Regression Specification
fm <- deny ~ black + p_irat + hse_inc + ccred + mcred + pubrec +
  ltv_med + ltv_high + denpmi + selfemp + single + hischl
### Specify characteristics of interest
t <- c("deny", "p_irat", "black", "hse_inc", "ccred", "mcred", "pubrec",
       "denpmi", "selfemp", "single", "hischl", "ltv_med", "ltv_high")
### Issue the ca command
CA <- ca(fm = fm, data = mortgage, var = "black", method = "logit",
         cl = "diff", t = t, b = 50, bc = TRUE)
# }