ca conducts classification analysis (CA) estimation and inference on
user-specified objects of interest: the first (weighted) moment or the
(weighted) distribution. Users can use t to specify the variables of
interest. When the object of interest is a moment, use cl to specify
whether to report the averages of the two groups or their difference.
ca(
  fm,
  data,
  method = c("ols", "logit", "probit", "QR"),
  var_type = c("binary", "continuous", "categorical"),
  var,
  compare,
  subgroup = NULL,
  samp_weight = NULL,
  taus = c(5:95)/100,
  u = 0.1,
  interest = c("moment", "dist"),
  t = c(1, 1, rep(0, dim(data)[2] - 2)),
  cl = c("both", "diff"),
  cat = NULL,
  alpha = 0.1,
  b = 500,
  parallel = FALSE,
  ncores = detectCores(),
  seed = 1,
  bc = TRUE,
  range_cb = c(1:99)/100,
  boot_type = c("nonpar", "weighted")
)
fm Regression formula.
data The data in use: the full sample or the subpopulation of interest.
method Model used to estimate partial effects. Four
options: "logit" (binary response),
"probit" (binary response), "ols"
(interactive linear model with additive errors), "QR"
(linear model with non-additive errors). Default is
"ols".
var_type The type of the parameter of interest. Three options:
"binary", "categorical",
"continuous". Default is "binary".
var Variable T of interest. Should be a character string.
compare If the parameter of interest is categorical, the user needs
to specify which two categories to compare. Should be
a 1 by 2 character vector. For example, if the two levels
to compare are 1 and 3, then compare = c("1", "3"),
which calculates the partial effect from 1 to 3. To use
this option, users first need to declare var as
a factor variable.
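As a minimal sketch of the preparation step described above, the following base R code converts a variable to a factor and builds a two-level comparison vector. The data frame toy and its column educ are hypothetical and serve only as illustration.

```r
# Hypothetical toy data: `educ` is coded 1, 2, 3 but stored as numeric.
toy <- data.frame(educ = c(1, 3, 2, 1, 3, 2))

# Declare the variable as a factor before passing its name as `var`.
toy$educ <- factor(toy$educ)

# The two levels to compare, given as character strings (from "1" to "3").
compare <- c("1", "3")

# Check that both requested levels exist in the factor.
all(compare %in% levels(toy$educ))
```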
subgroup Subgroup of interest. Default is NULL.
The specification should be a logical vector. For example,
suppose data contains an indicator variable for women (female
if 1, male if 0). If users are interested in the SPE for women,
then they should specify
subgroup = data[, "female"] == 1.
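The subgroup specification can be checked before calling ca. The sketch below uses a hypothetical data frame toy with a female indicator; only the construction of the logical vector mirrors the text above.

```r
# Hypothetical toy data with an indicator for women (1 = female, 0 = male).
toy <- data.frame(female = c(1, 0, 1, 1, 0), y = c(0, 1, 0, 1, 1))

# Logical vector with one entry per row of the data.
subgroup <- toy[, "female"] == 1

is.logical(subgroup)   # TRUE
sum(subgroup)          # number of women in the toy sample: 3
```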
samp_weight Sampling weights of the data. Input should be an n by 1 vector,
where n denotes the sample size. Default is NULL.
taus Quantile indexes for quantile regression. Default is
c(5:95)/100.
u Percentile that defines the most and least affected groups. Default is 0.1.
interest Generic object of interest in the least and most affected
subpopulations. Two options:
(1) "moment": weighted mean of Z in the
u-least/most affected subpopulation.
(2) "dist": distribution of Z in the u-least/most
affected subpopulation.
Default is interest = "moment".
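To make the two objects concrete, here is a rough base R sketch. It is not the package's implementation: the partial effects pe, the characteristic z and the equal weights w are simulated, and the groups are formed with unweighted quantiles for simplicity.

```r
set.seed(1)
n  <- 200
pe <- rnorm(n)            # hypothetical estimated partial effects
z  <- 2 * pe + rnorm(n)   # hypothetical characteristic Z
w  <- rep(1, n)           # sampling weights (all equal in this sketch)
u  <- 0.1

# Classify observations by quantiles of the partial effects.
least <- pe <= quantile(pe, u)
most  <- pe >= quantile(pe, 1 - u)

# "moment": weighted mean of Z within each group.
mean_least <- weighted.mean(z[least], w[least])
mean_most  <- weighted.mean(z[most],  w[most])
c(least = mean_least, most = mean_most)

# "dist": empirical CDF of Z within each group (equal weights here).
cdf_least <- ecdf(z[least])
cdf_most  <- ecdf(z[most])
```

With u = 0.1, each group contains roughly the 10% of observations with the smallest or largest partial effects.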
t An index for the CA object. Should be a 1 by ncol(data)
indicator vector. Users can either (1) specify the names of
the variables of interest directly, or (2) use 1 to indicate
a variable of interest. For example, if the total number of
variables is 5 and the 1st and 3rd variables are of interest,
then specify t = c(1, 0, 1, 0, 0).
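The two ways of specifying t can be compared on a small example. The data frame toy below is hypothetical; the last lines evaluate the default shown in the usage, which marks the first two columns of the data.

```r
# Hypothetical data with five variables.
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 13:15)

# Option (2): indicator vector marking the 1st and 3rd variables.
t_ind <- c(1, 0, 1, 0, 0)

# Option (1): the same selection by name.
t_names <- c("a", "c")

# Both forms pick out the same columns.
identical(names(toy)[t_ind == 1], t_names)

# Default from the usage: the first two columns of the data.
t_default <- c(1, 1, rep(0, dim(toy)[2] - 2))
names(toy)[t_default == 1]
```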
cl If interest = "moment", cl lets the user
choose how the variables of interest (specified in the t
option) are reported for the most and least affected groups. The
default is "both", which shows the variables for
the two groups; the alternative is "diff", which
shows the difference between the two groups. The user can
use summary.ca to tabulate the
results, which also contain the standard errors and
p-values. If interest = "dist", this option has no
effect and the user can leave it at the default
value.
cat P-values in classification analysis are adjusted for
multiplicity to account for joint testing of zero
coefficients for all variables within a category.
Suppose we have specified 3 variables of
interest: t = c("a", "b", "c"). Without loss of
generality, assume "a" is not a factor, while
"b" and "c" are two factors. Then users
need to specify cat = c("b", "c"). Default is
NULL.
alpha Size for the confidence interval. Should be between 0 and 1. Default is 0.1.
b Number of bootstrap draws. Default is 500.
parallel Whether to use parallel computation.
The default is FALSE, in which case only 1 CPU is used.
The other option is TRUE, in which case the user can specify
the number of CPUs in the ncores option.
ncores Number of cores for computation. Default is
detectCores(), a function from the package
parallel that detects the number of CPUs on the
current host. For large datasets, parallel computing is
highly recommended since the bootstrap is time-consuming.
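One way to choose values for parallel and ncores is sketched below. Leaving one core free is a common convention and an assumption of this sketch, not a requirement of the package; the names n_avail and use_parallel are illustrative.

```r
library(parallel)  # ships with R

# detectCores() may return NA on some platforms, so fall back to 1.
n_avail <- detectCores()
if (is.na(n_avail)) n_avail <- 1L

# Leave one core free for the rest of the system.
ncores <- max(1L, n_avail - 1L)

# Only switch on parallel computation when more than one core is used.
use_parallel <- ncores > 1L
```

These values could then be passed as parallel = use_parallel and ncores = ncores.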
seed Seed for pseudo-random number generation, for reproducibility. Default is 1.
bc Whether the estimate should be bias-corrected. Default
is TRUE. If FALSE, the uncorrected estimate and the
corresponding confidence bands are reported.
range_cb When interest = "dist", the package sorts the variables
of interest and keeps their unique values to estimate the weighted CDF. For large
datasets, storing a very large number of
observations can cause memory problems, so users can provide a range_cb value and
the package will sort and keep unique values based on the corresponding weighted
quantiles. If users don't want this feature, set
range_cb = NULL. Default is c(1:99)/100.
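The memory saving can be illustrated with a simplified sketch. It uses unweighted quantiles from base R, whereas the text above refers to weighted quantiles, and the variable z is simulated; it is not the package's internal code.

```r
set.seed(1)
z <- rnorm(1e5)            # hypothetical variable with many distinct values
range_cb <- c(1:99) / 100

# Full support: one evaluation point per distinct value.
full_grid <- sort(unique(z))

# Reduced support: evaluate only at the selected quantiles.
small_grid <- sort(unique(quantile(z, probs = range_cb, names = FALSE)))

length(full_grid)
length(small_grid)   # at most 99 points
```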
boot_type Type of bootstrap. Default is "nonpar", in which case the
package implements the nonparametric bootstrap. The
alternative is "weighted", in which case the package
implements the weighted bootstrap.
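The difference between the two schemes can be sketched for a single bootstrap draw of a sample mean. The data x are made up, and the use of standard exponential weights is an assumption of this sketch rather than something stated in this page.

```r
set.seed(1)
x <- c(2.1, 3.4, 1.9, 5.0, 4.2, 3.3)
n <- length(x)

# Nonparametric bootstrap: resample observations with replacement.
idx <- sample.int(n, n, replace = TRUE)
mean_nonpar <- mean(x[idx])

# Weighted bootstrap: keep every observation and draw random weights.
# Standard exponential weights are assumed here for illustration.
w <- rexp(n)
mean_weighted <- weighted.mean(x, w)

c(nonpar = mean_nonpar, weighted = mean_weighted)
```

Repeating either draw b times gives the bootstrap distribution of the statistic.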
If subgroup = NULL, all outputs are for the whole sample. Otherwise, the
outputs are subgroup results. When interest = "moment", the output is a list
containing
est Estimates of variables in interest.
bse Bootstrap standard errors.
joint_p P-values that are adjusted for multiplicity to
account for joint testing for all variables.
pointwise_p P-values that are not adjusted for joint testing.
If users have further specified cat (i.e., !is.null(cat)), the
fourth component is replaced with p_cat: P-values that are
adjusted for multiplicity to account for joint testing for all variables
within a category. Users can use summary.ca to tabulate the
results.
When interest = "dist", the output is a list of two components:
infresults A list that stores estimates, upper and lower
confidence bounds for all variables in interest for least and most
affected groups.
sortvar A list that stores sorted and unique variables in
interest.
We recommend using the plot.ca command to visualize the results.
With bc = TRUE (the default), estimates are bias-corrected; all confidence bands are monotonized. The bootstrap procedures follow Algorithm 2.2 in Chernozhukov, Fernandez-Val and Luo (2018).
# NOT RUN {
data("mortgage")
### Regression Specification
fm <- deny ~ black + p_irat + hse_inc + ccred + mcred + pubrec +
  ltv_med + ltv_high + denpmi + selfemp + single + hischl
### Specify characteristics of interest
t <- c("deny", "p_irat", "black", "hse_inc", "ccred", "mcred", "pubrec",
       "denpmi", "selfemp", "single", "hischl", "ltv_med", "ltv_high")
### issue ca command
CA <- ca(fm = fm, data = mortgage, var = "black", method = "logit",
         cl = "diff", t = t, b = 50, bc = TRUE)
# }