binsregselect: Data-Driven IMSE-Optimal Partitioning/Binning Selection for Binscatter

Description

binsregselect implements data-driven procedures for selecting the number of bins for binscatter estimation. The selected number of bins is optimal in the sense that it minimizes the integrated mean squared error (IMSE).

Usage

binsregselect(y, x, w = NULL, data = NULL, deriv = 0, bins = c(0, 0),
  binspos = "qs", binsmethod = "dpi", nbinsrot = NULL, simsgrid = 20,
  savegrid = F, vce = "HC1", useeffn = NULL, randcut = NULL,
  cluster = NULL, dfcheck = c(20, 30), masspoints = "on",
  weights = NULL, subset = NULL, norotnorm = F, numdist = NULL,
  numclust = NULL)

Value

nbinsrot.poly: ROT number of bins, unregularized.
nbinsrot.regul: ROT number of bins, regularized.
nbinsrot.uknot: ROT number of bins, unique knots.
nbinsdpi: DPI number of bins.
nbinsdpi.uknot: DPI number of bins, unique knots.
imse.v.rot: variance constant in IMSE expansion, ROT selection.
imse.b.rot: bias constant in IMSE expansion, ROT selection.
imse.v.dpi: variance constant in IMSE expansion, DPI selection.
imse.b.dpi: bias constant in IMSE expansion, DPI selection.
opt: A list containing options passed to the function, as well as total sample size n, number of distinct values Ndist in x, and number of clusters Nclust.
data.grid: A data frame containing the grid (produced when savegrid is true).

Arguments

y

outcome variable. A vector.

x

independent variable of interest. A vector.

w

control variables. A matrix, a vector or a formula.

data

an optional data frame containing variables used in the model.

deriv

derivative order of the regression function for estimation, testing and plotting. The default is deriv=0, which corresponds to the function itself.

bins

a vector. bins=c(p,s) sets a piecewise polynomial of degree p with s smoothness constraints for data-driven (IMSE-optimal) selection of the partitioning/binning scheme. The default is bins=c(0, 0), which corresponds to a piecewise constant fit (canonical binscatter).

binspos

position of binning knots. The default is binspos="qs", which corresponds to quantile-spaced binning (canonical binscatter). The other option is "es" for evenly-spaced binning.
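
To illustrate the two knot placements, the following base R sketch builds both sets of knots for a fixed number of bins. The variable names and the choice of J = 5 are assumptions made for the example; this is not the package's internal code.

```r
set.seed(1)
x <- rexp(500)   # skewed covariate, so the two schemes differ visibly
J <- 5           # number of bins (fixed here for illustration only)

# Quantile-spaced knots ("qs"): each bin holds roughly the same number of points
knots_qs <- quantile(x, probs = seq(0, 1, length.out = J + 1), names = FALSE)

# Evenly-spaced knots ("es"): each bin has the same width
knots_es <- seq(min(x), max(x), length.out = J + 1)

# Observations per bin under each scheme
table(cut(x, breaks = knots_qs, include.lowest = TRUE))
table(cut(x, breaks = knots_es, include.lowest = TRUE))
```

With a skewed x, evenly-spaced bins in the tail contain few observations, while quantile-spaced bins stay balanced.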

binsmethod

method for data-driven selection of the number of bins. The default is binsmethod="dpi", which corresponds to the IMSE-optimal direct plug-in rule. The other option is "rot", the rule-of-thumb implementation.
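
A minimal sketch comparing the two selectors, assuming the binsreg package is installed and matches the signature documented on this page. The simulated data and object names are made up for the example.

```r
library(binsreg)

set.seed(42)
x <- runif(1000)
y <- sin(2 * pi * x) + rnorm(1000)

# Rule-of-thumb selector
est_rot <- binsregselect(y, x, binsmethod = "rot")

# Direct plug-in selector (the default)
est_dpi <- binsregselect(y, x, binsmethod = "dpi")

# Selected numbers of bins, using the element names listed under Value
c(rot = est_rot$nbinsrot.regul, dpi = est_dpi$nbinsdpi)
```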

nbinsrot

initial number of bins value used to construct the DPI number of bins selector. If not specified, the data-driven ROT selector is used instead.

simsgrid

number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum (infimum or Lp metric) operation needed to construct confidence bands and hypothesis testing procedures. The default is simsgrid=20, which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum (infimum or Lp metric) operator.

savegrid

If true, a data frame containing the grid is produced.
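
A short sketch of requesting the grid, assuming the binsreg package is installed. The structure of the returned data frame is not described on this page, so the example only inspects it rather than relying on specific column names.

```r
library(binsreg)

set.seed(7)
x <- runif(500)
y <- sin(x) + rnorm(500)

# Ask for the grid with 10 evaluation points per bin
est <- binsregselect(y, x, savegrid = TRUE, simsgrid = 10)

# Inspect whatever the function stored in data.grid
str(est$data.grid)
```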

vce

procedure to compute the variance-covariance matrix estimator. Options are

"const" homoskedastic variance estimator.
"HC0" heteroskedasticity-robust plug-in residuals variance estimator without weights.
"HC1" heteroskedasticity-robust plug-in residuals variance estimator with hc1 weights. Default.
"HC2" heteroskedasticity-robust plug-in residuals variance estimator with hc2 weights.
"HC3" heteroskedasticity-robust plug-in residuals variance estimator with hc3 weights.

useeffn

effective sample size to be used when computing the (IMSE-optimal) number of bins. This option is useful for extrapolating the optimal number of bins to larger (or smaller) datasets than the one used to compute it.
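
The idea of extrapolating a selected number of bins can be sketched in base R. This assumes the IMSE-optimal number of bins grows like n^(1/(2p+3)) for a piecewise polynomial of degree p, a rate taken from the binscatter literature and not stated on this page. The helper rescale_nbins is hypothetical and not part of binsreg.

```r
# Hypothetical helper (not part of binsreg): rescale a selected number of bins
# from sample size n_old to n_new, ASSUMING J grows like n^(1 / (2p + 3)).
rescale_nbins <- function(J_old, n_old, n_new, p = 0) {
  ceiling(J_old * (n_new / n_old)^(1 / (2 * p + 3)))
}

# Example: 12 bins selected on 1,000 observations, extrapolated to 50,000
rescale_nbins(J_old = 12, n_old = 1000, n_new = 50000, p = 0)
```

With the package itself, the same kind of extrapolation would be requested by passing useeffn = 50000 in the call to binsregselect.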

randcut

upper bound on a uniformly distributed variable used to draw a subsample for bins selection. Observations for which runif() <= randcut are used. The value of randcut must be between 0 and 1.
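
The subsampling rule can be sketched in base R as follows. The sample size and the cutoff of 0.2 are assumptions chosen for illustration, and the code mimics the rule described above rather than reproducing the package's internals.

```r
set.seed(123)
n <- 10000
x <- runif(n)
y <- sin(x) + rnorm(n)

randcut <- 0.2                 # keep roughly 20% of the observations
keep <- runif(n) <= randcut    # one uniform draw per observation

x_sub <- x[keep]
y_sub <- y[keep]

mean(keep)                     # realized share of retained observations
```

In practice, a call such as binsregselect(y, x, randcut = 0.2) asks the function to draw the subsample itself, which can reduce computation time on large datasets.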

cluster

cluster ID. Used to compute cluster-robust standard errors.
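
A minimal sketch of passing a cluster identifier, assuming the binsreg package is installed. The data-generating process, the number of clusters G, and the object names are made up for the example.

```r
library(binsreg)

set.seed(2024)
G  <- 100                          # number of clusters (assumed)
id <- rep(seq_len(G), each = 20)   # 20 observations per cluster
x  <- runif(G * 20)

# Outcome with a cluster-level shock plus idiosyncratic noise
y  <- sin(x) + rnorm(G)[id] + rnorm(G * 20)

est_cl <- binsregselect(y, x, cluster = id)
summary(est_cl)
```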

dfcheck

adjustments for minimum effective sample size checks, which take into account number of unique values of x (i.e., number of mass points), number of clusters, and degrees of freedom of the different statistical models considered. The default is dfcheck=c(20, 30). See Cattaneo, Crump, Farrell and Feng (2021b) for more details.

masspoints

how mass points in x are handled. Available options:

"on" all mass point and degrees of freedom checks are implemented. Default.
"noadjust" mass point checks and the corresponding effective sample size adjustments are omitted.
"nolocalcheck" within-bin mass point and degrees of freedom checks are omitted.
"off" "noadjust" and "nolocalcheck" are set simultaneously.
"veryfew" forces the function to proceed as if x has only a few mass points (i.e., distinct values). In other words, it forces the function to proceed as if the mass point and degrees of freedom checks had failed.

weights

an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. For more details, see lm.

subset

optional rule specifying a subset of observations to be used.

norotnorm

if true, a uniform density rather than a normal density is used for ROT selection.

numdist

number of distinct values of x for selection. Used to speed up computation.

numclust

number of clusters for selection. Used to speed up computation.

Author

Matias D. Cattaneo, Princeton University, Princeton, NJ. cattaneo@princeton.edu.

Richard K. Crump, Federal Reserve Bank of New York, New York, NY. richard.crump@ny.frb.org.

Max H. Farrell, University of Chicago, Chicago, IL. max.farrell@chicagobooth.edu.

Yingjie Feng (maintainer), Tsinghua University, Beijing, China. fengyingjiepku@gmail.com.

References

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2021a: On Binscatter. Working Paper.

Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2021b: Binscatter Regressions. Working Paper.

Examples

 library(binsreg)
 x <- runif(500); y <- sin(x)+rnorm(500)
 est <- binsregselect(y,x)
 summary(est)
