binsreg implements binscatter estimation with robust inference procedures and plots, following the
results in Cattaneo, Crump, Farrell and Feng (2019a).
Binscatter provides a flexible way of describing
the mean relationship between two variables, after possibly adjusting for other covariates, based on
partitioning/binning of the independent variable of interest. The main purpose of this function is to
generate binned scatter plots with curve estimation, robust pointwise confidence intervals, and a
uniform confidence band. If the binning scheme is not set by the user, the companion function
binsregselect is used to implement binscatter in a data-driven (optimal)
way. Hypothesis testing about the regression function can also be conducted via the companion
function binsregtest.
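As a rough illustration of the binning idea described above, the base R sketch below computes a canonical binscatter by hand: it splits x into quantile-spaced bins and takes the mean of x and y within each bin. It does not call binsreg. The bin count J and the names edges, bin and dots are choices made for this example, and the package's own knot placement and estimation details may differ.

```r
set.seed(42)
n <- 500
x <- runif(n)
y <- sin(x) + rnorm(n)

J <- 10  # hypothetical number of bins, fixed by hand for illustration

# Quantile-spaced bin edges: each bin holds roughly n / J observations
edges <- quantile(x, probs = seq(0, 1, length.out = J + 1), names = FALSE)
bin <- cut(x, breaks = edges, include.lowest = TRUE, labels = FALSE)

# One "dot" per bin: the mean of x and the mean of y within the bin
dots <- data.frame(
  bin    = seq_len(J),
  x_mean = as.numeric(tapply(x, bin, mean)),
  y_mean = as.numeric(tapply(y, bin, mean))
)
print(dots)
```

Plotting dots$y_mean against dots$x_mean gives the piecewise-constant picture that the default dots=c(0,0) setting is described as producing.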
binsreg(y, x, w = NULL, deriv = 0, dots = c(0, 0), dotsgrid = 0,
dotsgridmean = T, line = NULL, linegrid = 20, ci = NULL,
cigrid = 0, cigridmean = T, cb = NULL, cbgrid = 20,
polyreg = NULL, polyreggrid = 20, polyregcigrid = 0, by = NULL,
bycolors = NULL, bysymbols = NULL, bylpatterns = NULL,
legendTitle = NULL, legendoff = F, testmodel = c(3, 3),
testmodelparfit = NULL, testmodelpoly = NULL, testshape = c(3, 3),
testshapel = NULL, testshaper = NULL, testshape2 = NULL,
nbins = NULL, binspos = "qs", binsmethod = "dpi",
nbinsrot = NULL, samebinsby = F, nsims = 500, simsgrid = 20,
simsseed = 666, vce = "HC1", cluster = NULL, level = 95,
noplot = F, dfcheck = c(20, 30), masspoints = "on",
weights = NULL, subset = NULL)
y: outcome variable. A vector.
x: independent variable of interest. A vector.
w: control variables. A matrix or a vector.
deriv: derivative order of the regression function for estimation, testing and plotting.
The default is deriv=0, which corresponds to the function itself.
dots: a vector. dots=c(p,s) sets a piecewise polynomial of degree p with s smoothness constraints for
point estimation and plotting as "dots". The default is dots=c(0,0), which corresponds to
piecewise constant (canonical binscatter).
dotsgrid: number of dots within each bin to be plotted. Given the choice, these dots are point estimates
evaluated over an evenly-spaced grid within each bin. The default is dotsgrid=0, and only
the point estimates at the mean of x within each bin are presented.
dotsgridmean: If true, the dots corresponding to the point estimates evaluated at the mean of x within each bin
are presented. By default, they are presented, i.e., dotsgridmean=T.
line: a vector. line=c(p,s) sets a piecewise polynomial of degree p with s smoothness constraints
for plotting as a "line". By default, the line is not included in the plot unless explicitly
specified. Recommended specification is line=c(3,3), which adds a cubic B-spline estimate
of the regression function of interest to the binned scatter plot.
linegrid: number of evaluation points of an evenly-spaced grid within each bin used for evaluation of
the point estimate set by the line=c(p,s) option. The default is linegrid=20,
which corresponds to 20 evenly-spaced evaluation points within each bin for fitting/plotting the line.
ci: a vector. ci=c(p,s) sets a piecewise polynomial of degree p with s smoothness constraints used for
constructing confidence intervals. By default, the confidence intervals are not included in the plot
unless explicitly specified. Recommended specification is ci=c(3,3), which adds confidence
intervals based on a cubic B-spline estimate of the regression function of interest to the binned scatter plot.
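cigrid: number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point
estimate set by the ci=c(p,s) option. The signature above sets cigrid=0 by default, so no additional grid points
are used and intervals are reported only at the mean of x within each bin (see cigridmean). Setting cigrid=1
corresponds to 1 evenly-spaced evaluation point within each bin for confidence interval construction.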
cigridmean: If true, the confidence intervals corresponding to the point estimates evaluated at the mean of x within each bin
are presented. The default is cigridmean=T.
cb: a vector. cb=c(p,s) sets a piecewise polynomial of degree p with s smoothness constraints used for
constructing the confidence band. By default, the confidence band is not included in the plot unless
explicitly specified. Recommended specification is cb=c(3,3), which adds a confidence band
based on a cubic B-spline estimate of the regression function of interest to the binned scatter plot.
cbgrid: number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the point
estimate set by the cb=c(p,s) option. The default is cbgrid=20, which corresponds
to 20 evenly-spaced evaluation points within each bin for confidence band construction.
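The recommended specifications for line, ci and cb can be combined in a single call. The sketch below uses only arguments listed in the usage above and simulated data like that in the example at the end of this page. It assumes the binsreg package is installed and skips the call otherwise. The object name est is arbitrary.

```r
set.seed(1)
x <- runif(500)
y <- sin(x) + rnorm(500)

# Runs only if the binsreg package is available
if (requireNamespace("binsreg", quietly = TRUE)) {
  est <- binsreg::binsreg(
    y, x,
    line = c(3, 3),  # cubic B-spline line added to the dots
    ci   = c(3, 3),  # pointwise confidence intervals
    cb   = c(3, 3)   # uniform confidence band
  )
}
```

Because nbins is left unspecified here, the number of bins is chosen by the data-driven selector described under nbins and binsmethod.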
polyreg: degree of a global polynomial regression model for plotting. By default, this fit is not included
in the plot unless explicitly specified. Recommended specification is polyreg=3, which
adds a cubic (global) polynomial fit of the regression function of interest to the binned scatter plot.
polyreggrid: number of evaluation points of an evenly-spaced grid within each bin used for evaluation of
the point estimate set by the polyreg=p option. The default is polyreggrid=20,
which corresponds to 20 evenly-spaced evaluation points within each bin for plotting the
polynomial fit.
polyregcigrid: number of evaluation points of an evenly-spaced grid within each bin used for constructing
confidence intervals based on polynomial regression set by the polyreg=p option.
The default is polyregcigrid=0, which corresponds to not plotting confidence
intervals for the global polynomial regression approximation.
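To make the idea of a global polynomial fit concrete, here is a minimal base R sketch of a cubic fit with pointwise intervals, using lm and predict. It is not the package's implementation: the grid runs over the whole range of x rather than within each bin, and the intervals from predict assume homoskedastic errors rather than the vce options documented below.

```r
set.seed(7)
x <- runif(500)
y <- sin(x) + rnorm(500)

p <- 3  # polynomial degree, mirroring polyreg = 3
fit <- lm(y ~ poly(x, degree = p, raw = TRUE))

# Evaluate the fitted curve and pointwise 95% intervals on an even grid
grid <- data.frame(x = seq(min(x), max(x), length.out = 50))
pred <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
print(head(cbind(grid, pred)))
```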
by: a vector containing the group indicator for subgroup analysis; both numeric and string variables
are supported. When by is specified, binsreg implements estimation and inference by each subgroup
separately, but produces a common binned scatter plot. By default, the binning structure is selected for each
subgroup separately, but see the option samebinsby below for imposing a common binning structure across subgroups.
bycolors: an ordered list of colors for plotting each subgroup series defined by the option by.
bysymbols: an ordered list of symbols for plotting each subgroup series defined by the option by.
bylpatterns: an ordered list of line patterns for plotting each subgroup series defined by the option by.
legendTitle: String, title of legend.
legendoff: If true, no legend is added.
testmodel: a vector. testmodel=c(p,s) sets a piecewise polynomial of degree p with s
smoothness constraints for parametric model specification testing. The default is
testmodel=c(3,3), which corresponds to a cubic B-spline estimate of the regression
function of interest for testing against the fit from a parametric model specification.
testmodelparfit: a data frame or matrix which contains the evaluation grid and fitted values of the model(s) to be tested against. The first column contains a series of evaluation points at which the binscatter model and the parametric model of interest are compared with each other. Each remaining column represents one parametric model and must contain its fitted values at the corresponding evaluation points.
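The layout described for testmodelparfit can be assembled with base R. The sketch below builds such a data frame for two illustrative parametric models, a linear and a quadratic fit. The model choices, the grid size of 100, and the names grid and parfit are assumptions made for this example.

```r
set.seed(3)
x <- runif(500)
y <- sin(x) + rnorm(500)

# Two candidate parametric models (illustrative choices)
m_lin  <- lm(y ~ x)
m_quad <- lm(y ~ x + I(x^2))

# Column 1: evaluation points. Remaining columns: fitted values, one per model.
grid <- seq(min(x), max(x), length.out = 100)
parfit <- data.frame(
  grid      = grid,
  linear    = predict(m_lin,  newdata = data.frame(x = grid)),
  quadratic = predict(m_quad, newdata = data.frame(x = grid))
)
print(head(parfit))
```

An object shaped like parfit is what the description above asks for, so it could be supplied as testmodelparfit = parfit.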
testmodelpoly: degree of a global polynomial model to be tested against.
testshape: a vector. testshape=c(p,s) sets a piecewise polynomial of degree p with s
smoothness constraints for nonparametric shape restriction testing. The default is
testshape=c(3,3), which corresponds to a cubic B-spline estimate of the regression
function of interest for one-sided or two-sided testing.
testshapel: a vector of null boundary values for hypothesis testing. Each number a in the vector
corresponds to one boundary of a one-sided hypothesis test to the left of the form
H0: sup_x mu(x)<=a.
testshaper: a vector of null boundary values for hypothesis testing. Each number a in the vector
corresponds to one boundary of a one-sided hypothesis test to the right of the form
H0: inf_x mu(x)>=a.
testshape2: a vector of null boundary values for hypothesis testing. Each number a in the vector
corresponds to one boundary of a two-sided hypothesis test of the form
H0: sup_x |mu(x)-a|=0.
nbins: number of bins for partitioning/binning of x. If not specified, the number of bins is
selected via the companion function binsregselect in a data-driven, optimal way whenever possible.
binspos: position of binning knots. The default is binspos="qs", which corresponds to quantile-spaced
binning (canonical binscatter). The other options are "es" for evenly-spaced binning, or
a vector for manual specification of the positions of inner knots (which must be within the range of
x).
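The difference between quantile-spaced and evenly-spaced binning is easiest to see on skewed data. The base R sketch below computes inner knots under both schemes for a hypothetical J = 5 bins. The quantile definition used by quantile() is an assumption here and need not match the one used inside the package.

```r
set.seed(11)
x <- rexp(500)  # skewed x makes the two schemes visibly different
J <- 5          # hypothetical number of bins

# Quantile-spaced inner knots: roughly equal share of observations per bin
probs    <- seq(0, 1, length.out = J + 1)
knots_qs <- quantile(x, probs = probs[-c(1, J + 1)], names = FALSE)

# Evenly-spaced inner knots: equal bin width over the range of x
edges_es <- seq(min(x), max(x), length.out = J + 1)
knots_es <- edges_es[-c(1, J + 1)]

# Observations per bin under each scheme
print(table(findInterval(x, knots_qs)))
print(table(findInterval(x, knots_es)))
```

Either vector of inner knots lies within the range of x, which is the requirement stated above for a manually specified binspos.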
binsmethod: method for data-driven selection of the number of bins. The default is binsmethod="dpi",
which corresponds to the IMSE-optimal direct plug-in rule. The other option is "rot",
the rule-of-thumb implementation.
nbinsrot: initial number of bins value used to construct the DPI number of bins selector. If not specified, the data-driven ROT selector is used instead.
samebinsby: if true, a common partitioning/binning structure across all subgroups specified by the option by is forced.
The knot positions are selected according to the option binspos and using the full sample. If nbins
is not specified, then the number of bins is selected via the companion function binsregselect and
using the full sample.
nsims: number of random draws for constructing confidence bands and hypothesis testing. The default is
nsims=500, which corresponds to 500 draws from a standard Gaussian random vector of size
[(p+1)*J - (J-1)*s].
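The vector size quoted above follows from counting coefficients: J bins with p+1 polynomial coefficients each, minus s constraints at each of the J-1 inner knots. A small sketch of that arithmetic, with a hypothetical helper named binscatter_dim, is shown below together with a matrix of standard Gaussian draws of the matching size. How the package uses such draws internally is not shown here.

```r
# Number of coefficients for a piecewise polynomial of degree p with
# s smoothness constraints on J bins: (p + 1) * J - (J - 1) * s
binscatter_dim <- function(p, s, J) (p + 1) * J - (J - 1) * s

binscatter_dim(p = 0, s = 0, J = 10)  # piecewise constant: 10, one per bin
binscatter_dim(p = 3, s = 3, J = 10)  # cubic B-spline: 40 - 27 = 13

# nsims draws of a standard Gaussian vector of that size, one draw per row
set.seed(666)
nsims <- 500
K <- binscatter_dim(p = 3, s = 3, J = 10)
draws <- matrix(rnorm(nsims * K), nrow = nsims, ncol = K)
print(dim(draws))
```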
simsgrid: number of evaluation points of an evenly-spaced grid within each bin used for evaluation of
the supremum (or infimum) operation needed to construct confidence bands and hypothesis testing
procedures. The default is simsgrid=20, which corresponds to 20 evenly-spaced
evaluation points within each bin for approximating the supremum (or infimum) operator.
simsseed: seed for simulation.
vce: procedure to compute the variance-covariance matrix estimator. Options are:
"const" homoskedastic variance estimator.
"HC0" heteroskedasticity-robust plug-in residuals variance estimator
without weights.
"HC1" heteroskedasticity-robust plug-in residuals variance estimator
with hc1 weights. Default.
"HC2" heteroskedasticity-robust plug-in residuals variance estimator
with hc2 weights.
"HC3" heteroskedasticity-robust plug-in residuals variance estimator
with hc3 weights.
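For readers unfamiliar with these labels, the sketch below computes the usual textbook versions of the five estimators for a simple linear regression in base R. It assumes the standard definitions (HC1 rescales HC0 by n/(n-k), while HC2 and HC3 divide squared residuals by 1-h and (1-h)^2, with h the leverage). It is not taken from the package source, and the helper name sandwich is hypothetical.

```r
set.seed(5)
n <- 200
x <- runif(n)
y <- sin(x) + rnorm(n, sd = 0.5 + x)  # heteroskedastic noise

X <- cbind(1, x)
k <- ncol(X)
XtX_inv <- solve(crossprod(X))
beta <- XtX_inv %*% crossprod(X, y)
e <- as.numeric(y - X %*% beta)    # residuals
h <- rowSums((X %*% XtX_inv) * X)  # leverages (diagonal of the hat matrix)

# Sandwich form (X'X)^{-1} X' diag(w) X (X'X)^{-1} for observation weights w
sandwich <- function(w) XtX_inv %*% crossprod(X * sqrt(w)) %*% XtX_inv

V <- list(
  const = XtX_inv * sum(e^2) / (n - k),  # homoskedastic
  HC0   = sandwich(e^2),
  HC1   = sandwich(e^2) * n / (n - k),
  HC2   = sandwich(e^2 / (1 - h)),
  HC3   = sandwich(e^2 / (1 - h)^2)
)

# Standard errors of the intercept and slope under each estimator
print(sapply(V, function(v) sqrt(diag(v))))
```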
cluster: cluster ID. Used to compute cluster-robust standard errors.
level: nominal confidence level for confidence interval and confidence band estimation. Default is level=95.
noplot: If true, no plot is produced.
dfcheck: adjustments for minimum effective sample size checks, which take into account the number of unique
values of x (i.e., number of mass points), the number of clusters, and the degrees of freedom of
the different statistical models considered. The default is dfcheck=c(20, 30).
See Cattaneo, Crump, Farrell and Feng (2019b) for more details.
masspoints: how mass points in x are handled. Available options:
"on" all mass point and degrees of freedom checks are implemented. Default.
"noadjust" mass point checks and the corresponding effective sample size adjustments are omitted.
"nolocalcheck" within-bin mass point and degrees of freedom checks are omitted.
"off" "noadjust" and "nolocalcheck" are set simultaneously.
"veryfew" forces the function to proceed as if x has only a few mass points (i.e., distinct values).
In other words, it forces the function to proceed as if the mass point and degrees of freedom checks had failed.
weights: an optional vector of weights to be used in the fitting process. Should be NULL or
a numeric vector. For more details, see lm.
subset: Optional rule specifying a subset of observations to be used.
bins_plot: A ggplot object for binscatter plot.
data.plot: A list containing data for plotting. Each item is a sublist of data frames for each group. Each sublist may contain the following data frames:
data.dots Data for dots. It contains: x, evaluation points; bin, the indicator of bins;
isknot, indicator of inner knots; mid, midpoint of each bin; and fit, fitted values.
data.line Data for line. It contains: x, evaluation points; bin, the indicator of bins;
isknot, indicator of inner knots; mid, midpoint of each bin; and fit, fitted values.
data.ci Data for CI. It contains: x, evaluation points; bin, the indicator of bins;
isknot, indicator of inner knots; mid, midpoint of each bin;
ci.l and ci.r, left and right boundaries of each confidence interval.
data.cb Data for CB. It contains: x, evaluation points; bin, the indicator of bins;
isknot, indicator of inner knots; mid, midpoint of each bin;
cb.l and cb.r, left and right boundaries of the confidence band.
data.poly Data for polynomial regression. It contains: x, evaluation points;
bin, the indicator of bins;
isknot, indicator of inner knots; mid, midpoint of each bin; and
fit, fitted values.
data.polyci Data for confidence intervals based on polynomial regression. It contains: x, evaluation points;
bin, the indicator of bins;
isknot, indicator of inner knots; mid, midpoint of each bin;
polyci.l and polyci.r, left and right boundaries of each confidence interval.
cval.by: A vector of critical values for constructing the confidence band for each group.
test: Return of binsregtest.
opt: A list containing options passed to the function, as well as N.by (total sample size for each group),
Ndist.by (number of distinct values in x for each group), Nclust.by (number of clusters for each group),
nbins.by (number of bins for each group), and byvals (number of distinct values in by).
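The returned components can be inspected directly. The sketch below follows the names listed above (data.plot, data.dots, opt, nbins.by) and assumes the binsreg package is installed, skipping the call otherwise. Indexing the first group with [[1]] and the object names est and grp are choices made for this example.

```r
set.seed(2)
x <- runif(500)
y <- sin(x) + rnorm(500)

# Runs only if the binsreg package is available
if (requireNamespace("binsreg", quietly = TRUE)) {
  est <- binsreg::binsreg(y, x, line = c(3, 3), noplot = TRUE)

  grp <- est$data.plot[[1]]    # sublist for the first (here, only) group
  str(grp, max.level = 1)      # which data frames are present
  print(head(grp$data.dots))   # dot data: evaluation points, bins, fitted values
  print(est$opt$nbins.by)      # number of bins for each group
}
```

With noplot = TRUE no plot is produced, so this pattern suits cases where the plotting data are wanted for a custom figure.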
Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2019a: On Binscatter. Working Paper.
Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2019b: Binscatter Regressions. Working Paper.
# NOT RUN {
library(binsreg)
x <- runif(500); y <- sin(x) + rnorm(500)
## Binned scatterplot
binsreg(y, x)
# }