lpdensity: Local Polynomial Density Estimation and Inference

Description

lpdensity implements the local polynomial regression based density (and derivatives) estimator proposed in Cattaneo, Jansson and Ma (2020). Robust bias-corrected inference methods, both pointwise (confidence intervals) and uniform (confidence bands), are also implemented following the results in Cattaneo, Jansson and Ma (2020, 2023). See Cattaneo, Jansson and Ma (2022) for more implementation details and illustrations.

Companion command: lpbwdensity for bandwidth selection.

Related Stata and R packages useful for nonparametric estimation and inference are available at https://nppackages.github.io/.

Usage

lpdensity(
  data,
  grid = NULL,
  bw = NULL,
  p = NULL,
  q = NULL,
  v = NULL,
  kernel = c("triangular", "uniform", "epanechnikov"),
  scale = NULL,
  massPoints = TRUE,
  bwselect = c("mse-dpi", "imse-dpi", "mse-rot", "imse-rot"),
  stdVar = TRUE,
  regularize = TRUE,
  nLocalMin = NULL,
  nUniqueMin = NULL,
  Cweights = NULL,
  Pweights = NULL
)

Value

Estimate: A matrix containing (1) grid (grid points), (2) bw (bandwidths), (3) nh (number of observations in each local neighborhood), (4) nhu (number of unique observations in each local neighborhood), (5) f_p (point estimates with p-th order local polynomial), (6) f_q (point estimates with q-th order local polynomial, only if option q is nonzero), (7) se_p (standard error corresponding to f_p), and (8) se_q (standard error corresponding to f_q).
CovMat_p: The variance-covariance matrix corresponding to f_p.
CovMat_q: The variance-covariance matrix corresponding to f_q.
opt: A list containing options passed to the function.

Arguments

data: Numeric vector or one dimensional matrix/data frame, the raw data.
grid: Numeric, specifies the grid of evaluation points. When set to default, grid points will be chosen as 0.05-0.95 percentiles of the data, with a step size of 0.05.
bw: Numeric, specifies the bandwidth used for estimation. Can be (1) a positive scalar (common bandwidth for all grid points); or (2) a positive numeric vector specifying bandwidths for each grid point (should be the same length as grid).
p: Nonnegative integer, specifies the order of the local polynomial used to construct point estimates. (Default is 2.)
q: Nonnegative integer, specifies the order of the local polynomial used to construct confidence intervals/bands (a.k.a. the bias correction order). Default is p+1. When set to be the same as p, no bias correction will be performed. Otherwise it should be strictly larger than p.
v: Nonnegative integer, specifies the derivative of the distribution function to be estimated. 0 for the distribution function, 1 (default) for the density funtion, etc.
kernel: String, specifies the kernel function, should be one of "triangular", "uniform", and "epanechnikov".
scale: Numeric, specifies how estimates are scaled. For example, setting this parameter to 0.5 will scale down both the point estimates and standard errors by half. Default is 1. This parameter is useful if only part of the sample is employed for estimation, and should not be confused with Cweights or Pweights.
massPoints: TRUE (default) or FALSE, specifies whether point estimates and standard errors should be adjusted if there are mass points in the data.
bwselect: String, specifies the method for data-driven bandwidth selection. This option will be ignored if bw is provided. Options are (1) "mse-dpi" (default, mean squared error-optimal bandwidth selected for each grid point); (2) "imse-dpi" (integrated MSE-optimal bandwidth, common for all grid points); (3) "mse-rot" (rule-of-thumb bandwidth with Gaussian reference model); and (4) "imse-rot" (integrated rule-of-thumb bandwidth with Gaussian reference model).
stdVar: TRUE (default) or FALSE, specifies whether the data should be standardized for bandwidth selection.
regularize: TRUE (default) or FALSE, specifies whether the bandwidth should be regularized. When set to TRUE, the bandwidth is chosen such that the local region includes at least nLocalMin observations and at least nUniqueMin unique observations.
nLocalMin: Nonnegative integer, specifies the minimum number of observations in each local neighborhood. This option will be ignored if regularize=FALSE. Default is 20+p+1.
nUniqueMin: Nonnegative integer, specifies the minimum number of unique observations in each local neighborhood. This option will be ignored if regularize=FALSE. Default is 20+p+1.
Cweights,: Numeric, specifies the weights used for counterfactual distribution construction. Should have the same length as the data.
Pweights: Numeric, specifies the weights used in sampling. Should have the same length as the data.

Author

Matias D. Cattaneo, Princeton University. cattaneo@princeton.edu.

Michael Jansson, University of California Berkeley. mjansson@econ.berkeley.edu.

Xinwei Ma (maintainer), University of California San Diego. x1ma@ucsd.edu.

Details

Bias correction is only used for the construction of confidence intervals/bands, but not for point estimation. The point estimates, denoted by f_p, are constructed using local polynomial estimates of order p, while the centering of the confidence intervals/bands, denoted by f_q, are constructed using local polynomial estimates of order q. The confidence intervals/bands take the form: [f_q - cv * SE(f_q) , f_q + cv * SE(f_q)], where cv denotes the appropriate critical value and SE(f_q) denotes an standard error estimate for the centering of the confidence interval/band. As a result, the confidence intervals/bands may not be centered at the point estimates because they have been bias-corrected. Setting q and p to be equal results on centered at the point estimate confidence intervals/bands, but requires undersmoothing for valid inference (i.e., (I)MSE-optimal bandwdith for the density point estimator cannot be used). Hence the bandwidth would need to be specified manually when q=p, and the point estimates will not be (I)MSE optimal. See Cattaneo, Jansson and Ma (2020, 2023) for details, and also Calonico, Cattaneo, and Farrell (2018, 2022) for robust bias correction methods.

Sometimes the density point estimates may lie outside of the confidence intervals/bands, which can happen if the underlying distribution exhibits high curvature at some evaluation point(s). One possible solution in this case is to increase the polynomial order p or to employ a smaller bandwidth.

References

Calonico, S., M. D. Cattaneo, and M. H. Farrell. 2018. On the Effect of Bias Estimation on Coverage Accuracy in Nonparametric Inference. Journal of the American Statistical Association, 113(522): 767-779. tools:::Rd_expr_doi("10.1080/01621459.2017.1285776")

Calonico, S., M. D. Cattaneo, and M. H. Farrell. 2022. Coverage Error Optimal Confidence Intervals for Local Polynomial Regression. Bernoulli, 28(4): 2998-3022. tools:::Rd_expr_doi("10.3150/21-BEJ1445")

Cattaneo, M. D., M. Jansson, and X. Ma. 2020. Simple Local Polynomial Density Estimators. Journal of the American Statistical Association, 115(531): 1449-1455. tools:::Rd_expr_doi("10.1080/01621459.2019.1635480")

Cattaneo, M. D., M. Jansson, and X. Ma. 2022. lpdensity: Local Polynomial Density Estimation and Inference. Journal of Statistical Software, 101(2), 1–25. tools:::Rd_expr_doi("10.18637/jss.v101.i02")

Cattaneo, M. D., M. Jansson, and X. Ma. 2023. Local Regression Distribution Estimators. Journal of Econometrics, forthcoming. tools:::Rd_expr_doi("10.1016/j.jeconom.2021.01.006")

Examples

Run this code

# Generate a random sample
set.seed(42); X <- rnorm(2000)

# Estimate density and report results
est1 <- lpdensity(data = X, bwselect = "imse-dpi")
summary(est1)

# Report results for a subset of grid points
summary(est1, grid=est1$Estimate[4:10, "grid"])
summary(est1, gridIndex=4:10)

# Report the 99% uniform confidence band
set.seed(42) # fix the seed for simulating critical values
summary(est1, alpha=0.01, CIuniform=TRUE)

# Plot the estimates and confidence intervals
plot(est1, legendTitle="My Plot", legendGroups=c("X"))

# Plot the estimates and the 99% uniform confidence band
set.seed(42) # fix the seed for simulating critical values
plot(est1, alpha=0.01, CIuniform=TRUE, legendTitle="My Plot", legendGroups=c("X"))

# Adding a histogram to the background
plot(est1, legendTitle="My Plot", legendGroups=c("X"),
  hist=TRUE, histData=X, histBreaks=seq(-1.5, 1.5, 0.25))

Run the code above in your browser using DataLab