CA: Empirical classification analysis (CA) and inference

Description

CA conducts classification analysis (CA) estimation and inference on user-specified objects of interest: the first (weighted) moment or the (weighted) distribution. Use t to specify the variables of interest. When the object of interest is a moment, use cl to specify linear combinations for hypothesis testing. Estimates are bias-corrected by default (see bc) and all confidence bands are monotonized. The bootstrap procedures follow Algorithm 2.2 in Chernozhukov, Fernandez-Val and Luo (2018).

Usage

CA(fm, data, method = "ols", var.type = "binary", var.T, compare,
  subgroup = NULL, samp_weight = NULL, taus = c(1:9)/10, u = 0.1,
  cl = matrix(c(1, 0), nrow = 2), t = c(1, 1, rep(0, dim(data)[2] -
  2)), interest = "moment", cat = NULL, alpha = 0.1, B = 10,
  ncores = 1, seed = 1, bc = TRUE, range.cb = c(0.5:99.5)/100,
  boot.type = "nonpar")

Arguments

fm

Regression formula.

data

The data in use (full sample or subpopulation of interest).

method

Models to be used for estimating partial effects. Four options: "logit" (binary response), "probit" (binary response), "ols" (interactive linear with additive errors), "QR" (linear model with non-additive errors). Default is "ols".

var.type

The type of the parameter of interest. Three options: "binary", "categorical", "continuous". Default is "binary".

var.T

Variable T of interest. Should be a character string.

compare

If the parameter of interest is categorical, the user needs to specify which two categories to compare. Should be a 1 by 2 character vector. For example, if the two levels to compare are 1 and 3, then compare = c("1", "3"), which calculates the partial effect of moving from 1 to 3. To use this option, users first need to specify var.T as a factor variable.
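
As a minimal sketch of preparing these inputs, the snippet below converts a variable to a factor and checks that the two levels to compare exist. The data frame toy and its column region are hypothetical and only serve as an illustration.

```r
# Hypothetical data: `toy` and its column `region` are illustrative only.
toy <- data.frame(y = c(0, 1, 1, 0, 1, 0),
                  region = c(1, 3, 2, 1, 3, 2))

# var.T should refer to a factor variable when var.type = "categorical".
toy$region <- factor(toy$region)

# Compare level "1" with level "3" (partial effect of moving from 1 to 3).
compare <- c("1", "3")

# Check that both requested levels exist before passing them on.
stopifnot(is.character(compare), length(compare) == 2,
          all(compare %in% levels(toy$region)))
```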

subgroup

Subgroup of interest. Default is NULL. The specification should be a logical vector. For example, suppose data contains an indicator variable for women (1 if female, 0 if male). If users are interested in the SPE for women, they should specify subgroup = data[, "female"] == 1.
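
The following sketch shows what such a logical vector looks like. The data frame toy and its female indicator are hypothetical stand-ins for the user's data.

```r
# Hypothetical data: `toy` and its `female` indicator are illustrative only.
toy <- data.frame(female = c(1, 0, 1, 1, 0),
                  wage   = c(12.5, 15.0, 11.0, 14.2, 13.3))

# One TRUE/FALSE entry per row; TRUE marks membership in the subgroup.
subgroup <- toy[, "female"] == 1

# The vector should be logical and as long as the data have rows.
stopifnot(is.logical(subgroup), length(subgroup) == nrow(toy))
sum(subgroup)  # number of rows in the subgroup
```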

samp_weight

Sampling weight of data. If NULL, the function implements the empirical bootstrap. If the data include sampling weights, supply them here and the function implements the weighted bootstrap (i.i.d. exponential weights). Default is NULL.
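
To illustrate the idea of a weighted bootstrap with i.i.d. exponential weights, here is a rough sketch for a simple weighted mean. It is not the package's implementation: the vectors y and sw are hypothetical, and multiplying the sampling weights by the exponential draws is an assumption made for this example.

```r
set.seed(1)
y  <- c(2.1, 3.4, 1.9, 4.0, 2.8, 3.1)  # hypothetical outcome
sw <- c(1, 2, 1, 1, 2, 1)              # hypothetical sampling weights
n_draws <- 200                         # number of bootstrap draws

boot_means <- replicate(n_draws, {
  e <- rexp(length(y))       # i.i.d. standard exponential draws
  weighted.mean(y, sw * e)   # re-weighted estimate for this draw (assumed scheme)
})

# Bootstrap standard error of the weighted mean.
boot_se <- sd(boot_means)
```

Each draw perturbs the weights rather than resampling rows, so every observation stays in the sample with a random positive weight.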

taus

Indexes for quantile regression. Default is c(1:9)/10.

u

Percentile of most and least affected. Default is 0.1.

cl

A pre-specified linear combination. Should be a 2 by L matrix, where L is the number of hypotheses and the l-th column denotes the l-th hypothesis. Used when interest = "moment". cl must be specified as a matrix. Default is matrix(c(1, 0), nrow = 2).
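
A short sketch of how such matrices can be built, one hypothesis per column. The third column below is a hypothetical contrast added purely to show the shape of a 2 by 3 matrix; its interpretation is an assumption, not something stated in this page.

```r
# Default: a single hypothesis, stored as one column of length 2.
cl_default <- matrix(c(1, 0), nrow = 2)

# Two hypotheses, one per column (the matrix used in the Examples section).
cl_both <- matrix(c(1, 0, 0, 1), nrow = 2)

# Hypothetical third column: a contrast between the two rows.
cl_three <- cbind(cl_both, c(1, -1))

# Each object is a matrix with exactly two rows.
stopifnot(is.matrix(cl_default), is.matrix(cl_both), is.matrix(cl_three))
stopifnot(nrow(cl_default) == 2, nrow(cl_both) == 2, nrow(cl_three) == 2)
```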

t

An index for the CA object. Should be a 1 by ncol(data) indicator vector. Users can either (1) specify the names of the variables of interest directly, or (2) use 1 to indicate a variable of interest. For example, if data has 5 variables and the 1st and 3rd are of interest, specify t = c(1, 0, 1, 0, 0).
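
If it is easier to think in terms of variable names, an indicator vector of this form can be derived from them. The names below are hypothetical placeholders for names(data).

```r
# Hypothetical column names standing in for names(data).
vars <- c("x1", "x2", "x3", "x4", "x5")

# Variables whose moments or distributions should be reported.
wanted <- c("x1", "x3")

# 1 marks a variable of interest, 0 marks everything else.
t_index <- as.numeric(vars %in% wanted)

# The indicator has one entry per column of the data.
stopifnot(length(t_index) == length(vars))
```

The resulting vector matches the c(1, 0, 1, 0, 0) pattern described above and could be passed as the t argument.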

interest

Generic objects in the least and most affected subpopulations. Two options: (1) "moment": weighted mean of Z in the u-least/most affected subpopulation. (2) "dist": distribution of Z in the u-least/most affected subpopulation. Default is interest = "moment".
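
The "moment" option can be pictured with the rough sketch below: classify units by their estimated effect, then average a characteristic Z within the u-least and u-most affected groups. All inputs are hypothetical, and the plain (unweighted) quantile cut-offs are an assumption for illustration; the package's own classification may differ in detail.

```r
# Hypothetical inputs: estimated effects, a characteristic Z, and weights.
effect <- c(0.02, 0.10, 0.05, 0.30, 0.01, 0.22, 0.08, 0.15, 0.04, 0.27)
z      <- c(1, 0, 1, 0, 1, 0, 0, 1, 1, 0)
w      <- rep(1, length(effect))
u      <- 0.2

# Cut-offs: the u-th and (1 - u)-th quantiles of the estimated effects.
lo_cut <- quantile(effect, u)
hi_cut <- quantile(effect, 1 - u)

# Group membership for the least and most affected units.
least <- effect <= lo_cut
most  <- effect >= hi_cut

# Weighted mean of Z in each group.
mean_least <- weighted.mean(z[least], w[least])
mean_most  <- weighted.mean(z[most],  w[most])
```

The "dist" option replaces the weighted mean with the weighted distribution of Z within the same two groups.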

cat

P-values in classification analysis are adjusted for multiplicity to account for joint testing of zero coefficients for all variables within a category. Specify all variables of interest in a list, using numbers to denote relative positions. For example, if the variables of interest are "educ", "male", "female", "low income", "middle income", and "high income", cat should be specified as cat = list(a = 1, b = c(2, 3), c = c(4, 5, 6)). Default of cat is NULL.
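
The grouping in this example can be checked with a few lines of base R. The object names vars and cat_groups are chosen for this sketch only; cat_groups is what would be passed as the cat argument.

```r
# Variables of interest, in the order used for the relative positions.
vars <- c("educ", "male", "female",
          "low income", "middle income", "high income")

# Each list element groups the positions that belong to one category.
cat_groups <- list(a = 1, b = c(2, 3), c = c(4, 5, 6))

# Look up the variable names behind each category.
members <- lapply(cat_groups, function(pos) vars[pos])

# Every position should appear exactly once across the categories.
positions <- sort(unlist(cat_groups, use.names = FALSE))
stopifnot(identical(positions, as.numeric(seq_along(vars))))
```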

alpha

Significance level for the confidence interval. Should be between 0 and 1. Default is 0.1.

B

Number of bootstrap draws. Default is 10. For more accurate results, we recommend 500.

ncores

Number of cores for computation. Default is 1. For large datasets, parallel computing is highly recommended since the bootstrap is time-consuming.

seed

Seed for pseudo-random number generation, used for reproducibility. Default is 1.

bc

Whether the estimate should be bias-corrected. Default is TRUE. If FALSE, the uncorrected estimate and the corresponding confidence bands are reported.

range.cb

When interest = "dist", the package sorts the variables of interest and keeps their unique values to estimate the weighted CDF. For large datasets, storing that many observations can cause memory problems, so users can provide a Sort value and the package will sort and take unique values based on the weighted quantiles of Sort. To disable this feature, set range.cb = NULL. Default is c(0.5:99.5)/100. To see how range.cb makes a difference in the plot, refer to the examples in the companion vignette.
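
The default grid and the memory saving it is meant to provide can be sketched as follows. The vector x is simulated, and the unweighted quantile() call is only a stand-in assumed for illustration; it is not the package's weighted-quantile routine.

```r
# The default grid: 100 evenly spaced probabilities strictly inside (0, 1).
range.cb <- c(0.5:99.5) / 100

# Hypothetical large variable with many distinct values.
set.seed(1)
x <- rnorm(1e5)

# Evaluate the CDF on a quantile grid instead of on every unique value.
grid <- unique(sort(quantile(x, probs = range.cb, names = FALSE)))

length(unique(x))  # number of points without the grid
length(grid)       # number of points with the grid
```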

boot.type

Type of bootstrap. Default is boot.type = "nonpar", and the package implements nonparametric bootstrap. An alternative is boot.type = "weighted", and the package implements weighted bootstrap.

Value

If subgroup = NULL, all outputs are for the whole sample. Otherwise, the outputs are subgroup results. When interest = "moment", the output is a list with the following components:

est Estimates of variables in interest.
bse Bootstrap standard errors.
joint_p P-values that are adjusted for multiplicity to account for joint testing for all variables.

If users have further specified cat (i.e., !is.null(cat)), the output has a fourth component:

p_cat P-values that are adjusted for multiplicity to account for joint testing for all variables within a category.

When interest = "dist", the output is a list of two components:

infresults A list that stores estimates, upper and lower confidence bounds for all variables in interest for least and most affected groups.
sortvar A list that stores sorted and unique variables in interest.

We recommend using the CAplot command to visualize the results.

Examples

# NOT RUN {
library(SortedEffects) # provides CA() and the mortgage data
data("mortgage")
fm <- deny ~ black + p_irat
t <- c(rep(1, 2), rep(0, 14)) # Indicator vector: the first two variables are of interest
cl <- matrix(c(1, 0, 0, 1), nrow = 2) # Meaning: show variables of interest for both groups
# Store the result under a new name so that the function CA is not masked
ca_fit <- CA(fm = fm, data = mortgage, var.T = "black", method = "logit", cl = cl, t = t)

# Tabulate the results
est <- matrix(ca_fit$est, ncol = 2)
se <- matrix(ca_fit$bse, ncol = 2)
Table <- matrix(0, ncol = 4, nrow = 2)
Table[, 1] <- est[, 1] # Least affected: bias-corrected estimate
Table[, 2] <- se[, 1] # Corresponding SE
Table[, 3] <- est[, 2] # Most affected: bias-corrected estimate
Table[, 4] <- se[, 2] # Corresponding SE
rownames(Table) <- colnames(ca_fit$est)[1:2] # assign names to each row
colnames(Table) <- rep(c("Estimate", "SE"), 2)
Table # print the table

# }