lpdensity
implements the local polynomial regression based density (and derivatives)
estimator proposed in Cattaneo, Jansson and Ma (2020). Robust bias-corrected inference methods,
both pointwise (confidence intervals) and uniform (confidence bands), are also implemented
following the results in Cattaneo, Jansson and Ma (2020, 2023).
See Cattaneo, Jansson and Ma (2022) for more implementation details and illustrations.
Companion command: lpbwdensity
for bandwidth selection.
Related Stata
and R
packages useful for nonparametric estimation and inference are
available at https://nppackages.github.io/.
lpdensity(
data,
grid = NULL,
bw = NULL,
p = NULL,
q = NULL,
v = NULL,
kernel = c("triangular", "uniform", "epanechnikov"),
scale = NULL,
massPoints = TRUE,
bwselect = c("mse-dpi", "imse-dpi", "mse-rot", "imse-rot"),
stdVar = TRUE,
regularize = TRUE,
nLocalMin = NULL,
nUniqueMin = NULL,
Cweights = NULL,
Pweights = NULL
)
A matrix containing (1) grid
(grid points), (2) bw
(bandwidths),
(3) nh
(number of observations in each local neighborhood),
(4) nhu
(number of unique observations in each local neighborhood),
(5) f_p
(point estimates with p-th order local polynomial),
(6) f_q
(point estimates with q-th order local polynomial, only if option q
is nonzero),
(7) se_p
(standard error corresponding to f_p
), and (8) se_q
(standard error
corresponding to f_q
).
The variance-covariance matrix corresponding to f_p
.
The variance-covariance matrix corresponding to f_q
.
A list containing options passed to the function.
Numeric vector or one dimensional matrix/data frame, the raw data.
Numeric, specifies the grid of evaluation points. When set to default, grid points will be chosen as 0.05-0.95 percentiles of the data, with a step size of 0.05.
Numeric, specifies the bandwidth
used for estimation. Can be (1) a positive scalar (common
bandwidth for all grid points); or (2) a positive numeric vector specifying bandwidths for
each grid point (should be the same length as grid
).
Nonnegative integer, specifies the order of the local polynomial used to construct point
estimates. (Default is 2
.)
Nonnegative integer, specifies the order of the local polynomial used to construct
confidence intervals/bands (a.k.a. the bias correction order). Default is p+1
. When set to be
the same as p
, no bias correction will be performed. Otherwise it should be
strictly larger than p
.
Nonnegative integer, specifies the derivative of the distribution function to be estimated. 0
for
the distribution function, 1
(default) for the density funtion, etc.
String, specifies the kernel function, should be one of "triangular"
, "uniform"
, and
"epanechnikov"
.
Numeric, specifies how
estimates are scaled. For example, setting this parameter to 0.5 will scale down both the
point estimates and standard errors by half. Default is 1
. This parameter is useful if only
part of the sample is employed for estimation, and should not be confused with Cweights
or Pweights
.
TRUE
(default) or FALSE
, specifies whether point estimates and standard errors
should be adjusted if there are mass points in the data.
String, specifies the method for data-driven bandwidth selection. This option will be
ignored if bw
is provided. Options are (1) "mse-dpi"
(default, mean squared error-optimal
bandwidth selected for each grid point); (2) "imse-dpi"
(integrated MSE-optimal bandwidth,
common for all grid points); (3) "mse-rot"
(rule-of-thumb bandwidth with Gaussian
reference model); and (4) "imse-rot"
(integrated rule-of-thumb bandwidth with Gaussian
reference model).
TRUE
(default) or FALSE
, specifies whether the data should be standardized for
bandwidth selection.
TRUE
(default) or FALSE
, specifies whether the bandwidth should be
regularized. When set to TRUE
, the bandwidth is chosen such that the local region includes
at least nLocalMin
observations and at least nUniqueMin
unique observations.
Nonnegative integer, specifies the minimum number of observations in each local neighborhood. This option
will be ignored if regularize=FALSE
. Default is 20+p+1
.
Nonnegative integer, specifies the minimum number of unique observations in each local neighborhood. This option
will be ignored if regularize=FALSE
. Default is 20+p+1
.
Numeric, specifies the weights used for counterfactual distribution construction. Should have the same length as the data.
Numeric, specifies the weights used in sampling. Should have the same length as the data.
Matias D. Cattaneo, Princeton University. cattaneo@princeton.edu.
Michael Jansson, University of California Berkeley. mjansson@econ.berkeley.edu.
Xinwei Ma (maintainer), University of California San Diego. x1ma@ucsd.edu.
Bias correction is only used for the construction of confidence intervals/bands, but not for point
estimation. The point estimates, denoted by f_p
, are constructed using local polynomial estimates
of order p
, while the centering of the confidence intervals/bands, denoted by f_q
, are constructed
using local polynomial estimates of order q
. The confidence intervals/bands take the form:
[f_q - cv * SE(f_q) , f_q + cv * SE(f_q)]
, where cv
denotes the appropriate critical value and SE(f_q)
denotes an standard error estimate for the centering of the confidence interval/band. As a result,
the confidence intervals/bands may not be centered at the point estimates because they have been bias-corrected.
Setting q
and p
to be equal results on centered at the point estimate confidence intervals/bands,
but requires undersmoothing for valid inference (i.e., (I)MSE-optimal bandwdith for the density point estimator
cannot be used). Hence the bandwidth would need to be specified manually when q=p
, and the
point estimates will not be (I)MSE optimal. See Cattaneo, Jansson and Ma (2020, 2023) for details, and also
Calonico, Cattaneo, and Farrell (2018, 2022) for robust bias correction methods.
Sometimes the density point estimates may lie outside of the confidence intervals/bands, which can happen
if the underlying distribution exhibits high curvature at some evaluation point(s). One possible solution
in this case is to increase the polynomial order p
or to employ a smaller bandwidth.
Calonico, S., M. D. Cattaneo, and M. H. Farrell. 2018. On the Effect of Bias Estimation on Coverage Accuracy in Nonparametric Inference. Journal of the American Statistical Association, 113(522): 767-779. tools:::Rd_expr_doi("10.1080/01621459.2017.1285776")
Calonico, S., M. D. Cattaneo, and M. H. Farrell. 2022. Coverage Error Optimal Confidence Intervals for Local Polynomial Regression. Bernoulli, 28(4): 2998-3022. tools:::Rd_expr_doi("10.3150/21-BEJ1445")
Cattaneo, M. D., M. Jansson, and X. Ma. 2020. Simple Local Polynomial Density Estimators. Journal of the American Statistical Association, 115(531): 1449-1455. tools:::Rd_expr_doi("10.1080/01621459.2019.1635480")
Cattaneo, M. D., M. Jansson, and X. Ma. 2022. lpdensity: Local Polynomial Density Estimation and Inference. Journal of Statistical Software, 101(2), 1–25. tools:::Rd_expr_doi("10.18637/jss.v101.i02")
Cattaneo, M. D., M. Jansson, and X. Ma. 2023. Local Regression Distribution Estimators. Journal of Econometrics, forthcoming. tools:::Rd_expr_doi("10.1016/j.jeconom.2021.01.006")
Supported methods: coef.lpdensity
, confint.lpdensity
, plot.lpdensity
, print.lpdensity
, summary.lpdensity
, vcov.lpdensity
.
# Generate a random sample
set.seed(42); X <- rnorm(2000)
# Estimate density and report results
est1 <- lpdensity(data = X, bwselect = "imse-dpi")
summary(est1)
# Report results for a subset of grid points
summary(est1, grid=est1$Estimate[4:10, "grid"])
summary(est1, gridIndex=4:10)
# Report the 99% uniform confidence band
set.seed(42) # fix the seed for simulating critical values
summary(est1, alpha=0.01, CIuniform=TRUE)
# Plot the estimates and confidence intervals
plot(est1, legendTitle="My Plot", legendGroups=c("X"))
# Plot the estimates and the 99% uniform confidence band
set.seed(42) # fix the seed for simulating critical values
plot(est1, alpha=0.01, CIuniform=TRUE, legendTitle="My Plot", legendGroups=c("X"))
# Adding a histogram to the background
plot(est1, legendTitle="My Plot", legendGroups=c("X"),
hist=TRUE, histData=X, histBreaks=seq(-1.5, 1.5, 0.25))
Run the code above in your browser using DataLab