Smoothed bootstrap is an extension of the standard bootstrap using kernel densities.

Usage
kernelboot(
data,
statistic,
R = 500L,
bw = "default",
kernel = c("multivariate", "gaussian", "epanechnikov", "rectangular", "triangular",
"biweight", "cosine", "optcosine", "none"),
weights = NULL,
adjust = 1,
shrinked = TRUE,
ignore = NULL,
parallel = FALSE,
workers = 1L
)
Arguments

data: vector, matrix, or data.frame. For non-numeric values, the standard bootstrap is applied (see below).

statistic: a function that is applied to the data. The first argument of the function will always be the original data.

R: the number of bootstrap replicates.

bw: the smoothing bandwidth to be used (see density). The kernels are scaled so that this is the standard deviation, or the covariance matrix, of the smoothing kernel. By default bw.nrd0 is used for univariate data, and bw.silv is used for multivariate data. When using kernel = "multivariate", this parameter should be a covariance matrix of the smoothing kernel.

kernel: a character string giving the smoothing kernel to be used. This must partially match one of "multivariate", "gaussian", "rectangular", "triangular", "epanechnikov", "biweight", "cosine", "optcosine", or "none", with default "multivariate", and may be abbreviated. Using kernel = "multivariate" forces a multivariate Gaussian kernel (or a univariate Gaussian for univariate data). Using kernel = "none" forces the standard bootstrap (no kernel smoothing).

weights: vector of importance weights. It should have as many elements as there are observations in data. It defaults to uniform weights.

adjust: scalar; the bandwidth used is actually adjust*bw. This makes it easy to specify values like 'half the default' bandwidth.

shrinked: logical; if TRUE, the random generation algorithm preserves the means and variances of the variables. This parameter is ignored for the "multivariate" kernel.

ignore: vector of names of columns to be ignored during the smoothing phase of the bootstrap procedure (their values are not altered with random noise).

parallel: if TRUE, parallel computing is used (see future_lapply). Warning: using parallel computing does not necessarily lead to improved performance.

workers: the number of workers used for parallel computing.
Details

Smoothed bootstrap is an extension of the standard bootstrap procedure where, instead of drawing samples with replacement from the empirical distribution, they are drawn from a kernel density estimate of the distribution.

In the smoothed bootstrap, points (in the univariate case) or rows (in the multivariate case) are drawn with replacement to obtain samples of size \(n\) from the initial dataset of size \(n\), as in the standard bootstrap. Next, random noise from the kernel density \(K\) is added to each of the drawn values. The procedure is repeated \(R\) times, and statistic is evaluated on each of the samples.
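As a minimal sketch of this procedure (assuming a univariate sample, a Gaussian kernel, and no variance correction; smooth_boot is an illustrative name, not part of the package API):

# Minimal sketch of the smoothed bootstrap for univariate data:
# resample with replacement, add kernel noise, evaluate the statistic.
smooth_boot <- function(y, statistic, R = 500, h = bw.nrd0(y)) {
  n <- length(y)
  replicate(R, {
    idx <- sample.int(n, n, replace = TRUE)  # draw n points with replacement
    statistic(y[idx] + h * rnorm(n))         # add Gaussian kernel noise
  })
}
set.seed(1)
est <- smooth_boot(mtcars$mpg, median, R = 250)
quantile(est, c(0.025, 0.975))  # basic percentile interval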
Noise is added only to the numeric columns; non-numeric columns (e.g. character, factor, logical) are not altered. Consequently, the standard bootstrap procedure is applied to the non-numeric columns and to the columns listed in the ignore parameter.
Univariate kernel densities
The univariate kernel density estimator is defined as
$$ \hat{f_h}(x) = \sum_{i=1}^n w_i \, K_h(x-y_i) $$
where \(w\) is a vector of weights such that all \(w_i \ge 0\) and \(\sum_i w_i = 1\) (by default uniform \(1/n\) weights are used), \(K_h(x) = K(x/h)/h\) is the kernel \(K\) parametrized by bandwidth \(h\), and \(y\) is a vector of data points used for estimating the kernel density.
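For illustration, the estimator above can be evaluated directly (a sketch assuming a Gaussian kernel, for which \(K_h(x)\) equals dnorm(x, sd = h); the helper kde is hypothetical):

# Direct evaluation of the weighted univariate KDE defined above,
# with a Gaussian kernel of standard deviation h.
kde <- function(x, y, h, w = rep(1, length(y)) / length(y)) {
  vapply(x, function(xi) sum(w * dnorm(xi - y, sd = h)), numeric(1))
}
set.seed(1)
y <- rnorm(100)
kde(c(-1, 0, 1), y, h = bw.nrd0(y))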
To draw samples from the univariate kernel density, the following procedure can be applied (Silverman, 1986):
Step 1 Sample \(i\) uniformly with replacement from \(1,\dots,n\).
Step 2 Generate \(\varepsilon\) to have probability density \(K\).
Step 3 Set \(x = y_i + h\varepsilon\).
If samples are required to have the same variance as the data (i.e. shrinked = TRUE), then Step 3 is modified as follows:
Step 3' \( x = \bar y + (y_i - \bar y + h\varepsilon)/(1 + h^2 \sigma^2_K/\sigma^2_y)^{1/2} \)
where \(\sigma_K^2\) is the variance of the kernel (fixed to 1 for the kernels used in this package) and \(\sigma_y^2\) is the variance of the data.
When the shrinkage described in Step 3' is applied, the smoothed bootstrap density function changes its form to
$$ \hat{f}_{h,b}(x) = (1 + r) \; \hat{f_h}(x + r(x - \bar{y})) $$
where \(r = \left(1 + h^2 \sigma_K^2 / \sigma_y^2 \right)^{1/2}-1\).
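A sketch of Steps 1-3 with the correction of Step 3' (assuming a Gaussian kernel, so \(\sigma_K^2 = 1\); rsmooth is a hypothetical helper, not the package's internal code):

# Draw m values from the shrinked smoothed-bootstrap density (Steps 1-3').
rsmooth <- function(m, y, h) {
  yi  <- sample(y, m, replace = TRUE)  # Step 1: resample with replacement
  eps <- rnorm(m)                      # Step 2: noise from the kernel
  # Step 3': shrink towards the mean so the variance matches the data
  mean(y) + (yi - mean(y) + h * eps) / sqrt(1 + h^2 / var(y))
}
set.seed(1)
x <- rsmooth(1e5, mtcars$mpg, h = bw.nrd0(mtcars$mpg))
c(var(x), var(mtcars$mpg))  # the two variances should roughly agree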
This package offers the following univariate kernels:
Gaussian | \(\frac{1}{\sqrt{2\pi}} e^{-u^2/2}\)
Rectangular | \(\frac{1}{2} \ \mathbf{1}_{(|u|\leq1)}\)
Triangular | \((1-|u|) \ \mathbf{1}_{(|u|\leq1)}\)
Epanechnikov | \(\frac{3}{4}(1-u^2) \ \mathbf{1}_{(|u|\leq1)}\)
Biweight | \(\frac{15}{16}(1-u^2)^2 \ \mathbf{1}_{(|u|\leq1)}\)
Cosine | \(\frac{1}{2} \left(1 + \cos(\pi u)\right) \ \mathbf{1}_{(|u|\leq1)}\)
Optcosine | \(\frac{\pi}{4}\cos\left(\frac{\pi}{2}u\right) \ \mathbf{1}_{(|u|\leq1)}\)
All the kernels are rescaled to have unit standard deviation, so that the bandwidth parameter controls their standard deviation.
Random generation from the Epanechnikov kernel is done using the algorithm described by Devroye (1986). For the optcosine kernel, inverse transform sampling is used. For the biweight kernel, random values are drawn from a \(\mathrm{Beta}(3, 3)\) distribution, and a \(\mathrm{Beta}(3.3575, 3.3575)\) distribution serves as a close approximation of the cosine kernel. Random generation for the triangular kernel is done by taking the difference of two i.i.d. uniform random variates. To sample from the rectangular and Gaussian kernels, standard random generation algorithms are used (see runif and rnorm).
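For illustration, two of these samplers can be sketched as follows (repan and rbiweight are hypothetical names; the \(\sqrt{5}\) and \(\sqrt{7}\) factors rescale the raw kernels, which have variances 1/5 and 1/7 on \([-1, 1]\), to unit standard deviation):

# Epanechnikov sampler via the three-uniforms algorithm (Devroye, 1986).
repan <- function(n) {
  v1 <- runif(n, -1, 1); v2 <- runif(n, -1, 1); v3 <- runif(n, -1, 1)
  ifelse(abs(v3) >= abs(v2) & abs(v3) >= abs(v1), v2, v3) * sqrt(5)
}
# Biweight sampler via Beta(3, 3), shifted from [0, 1] to [-1, 1].
rbiweight <- function(n) (2 * rbeta(n, 3, 3) - 1) * sqrt(7)
set.seed(1)
c(sd(repan(1e5)), sd(rbiweight(1e5)))  # both should be close to 1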
Product kernel densities
Univariate kernels may easily be extended to multiple dimensions by using a product kernel
$$ \hat{f_H}(\mathbf{x}) = \sum_{i=1}^n w_i \prod_{j=1}^m K_{h_j}(x_j - y_{ij}) $$
where \(w\) is a vector of weights such that all \(w_i \ge 0\) and \(\sum_i w_i = 1\) (by default uniform \(1/n\) weights are used), \(K_{h_j}\) are univariate kernels \(K\) parametrized by bandwidths \(h_j\), and \(\boldsymbol{y}\) is a matrix of data points used for estimating the kernel density.
Random generation from the product kernel is done by drawing with replacement rows of \(\boldsymbol{y}\), and then adding to the sampled values random noise from the univariate kernels \(K\), parametrized by the corresponding bandwidth parameters \(h_j\).
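A sketch of this sampling scheme (assuming Gaussian univariate kernels; rproduct is a hypothetical helper):

# Product-kernel sampling: resample rows of y, then add independent
# univariate Gaussian noise to each column, scaled by its bandwidth.
rproduct <- function(m, y, h) {
  idx   <- sample.int(nrow(y), m, replace = TRUE)
  noise <- sapply(seq_along(h), function(j) h[j] * rnorm(m))
  y[idx, , drop = FALSE] + noise
}
set.seed(1)
y <- as.matrix(mtcars[, c("mpg", "wt")])
h <- apply(y, 2, bw.nrd0)  # one bandwidth per column
head(rproduct(10, y, h))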
Multivariate kernel densities
The multivariate kernel density estimator may also be defined in terms of multivariate kernels \(K_H\) (e.g. the multivariate normal distribution, as in this package)
$$ \hat{f_H}(\mathbf{x}) = \sum_{i=1}^n w_i \, K_H( \mathbf{x}-\boldsymbol{y}_i) $$
where \(w\) is a vector of weights such that all \(w_i \ge 0\) and \(\sum_i w_i = 1\) (by default uniform \(1/n\) weights are used), \(K_H\) is the kernel \(K\) parametrized by the bandwidth matrix \(H\), and \(\boldsymbol{y}\) is a matrix of data points used for estimating the kernel density.
Notice: when using the multivariate normal (Gaussian) distribution as the kernel \(K\), the bandwidth parameter \(H\) is a covariance matrix, as opposed to the standard deviations used in the univariate and product kernels.
Random generation from the multivariate kernel is done by drawing with replacement rows of \(\boldsymbol{y}\), and then adding multivariate normal noise with covariance matrix \(H\) to the sampled rows (equivalently, drawing each new point from a multivariate normal centered at the sampled data point). For further details see rmvg.
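A sketch of this scheme, with MASS::mvrnorm standing in for the package's rmvg (rmvk is a hypothetical helper; bw.silv comes from this package):

# Multivariate-kernel sampling: resample rows of y, then add
# multivariate Gaussian noise with covariance matrix H.
library(MASS)
rmvk <- function(m, y, H) {
  idx <- sample.int(nrow(y), m, replace = TRUE)
  y[idx, , drop = FALSE] + mvrnorm(m, mu = rep(0, ncol(y)), Sigma = H)
}
set.seed(1)
y <- as.matrix(mtcars[, c("mpg", "wt")])
H <- bw.silv(y)  # rule-of-thumb bandwidth matrix (Silverman's rule)
head(rmvk(10, y, H))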
References

Silverman, B. W. (1986). Density estimation for statistics and data analysis. Chapman and Hall/CRC.
Scott, D. W. (1992). Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons.
Efron, B. (1981). Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika, 589-599.
Hall, P., DiCiccio, T.J. and Romano, J.P. (1989). On smoothing and the bootstrap. The Annals of Statistics, 692-704.
Silverman, B.W. and Young, G.A. (1987). The bootstrap: To smooth or not to smooth? Biometrika, 469-479.
Wang, S. (1995). Optimizing the smoothed bootstrap. Annals of the Institute of Statistical Mathematics, 47(1), 65-80.
Young, G.A. (1990). Alternative smoothed bootstraps. Journal of the Royal Statistical Society. Series B (Methodological), 477-484.
De Angelis, D. and Young, G.A. (1992). Smoothing the bootstrap. International Statistical Review/Revue Internationale de Statistique, 45-56.
Polansky, A.M. and Schucany, W. (1997). Kernel smoothing to improve bootstrap confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4), 821-838.
Devroye, L. (1986). Non-uniform random variate generation. New York: Springer-Verlag.
Parzen, E. (1962). On estimation of a probability density function and mode. The annals of mathematical statistics, 33(3), 1065-1076.
Jones, M.C. (1991). On correcting for variance inflation in kernel density estimation. Computational Statistics & Data Analysis, 11, 3-15.
See Also

bw.silv, density, bandwidth, kernelboot-class
Examples

set.seed(1)
# smooth bootstrap of parameters of linear regression
b1 <- kernelboot(mtcars, function(data) coef(lm(mpg ~ drat + wt, data = data)), R = 250)
b1
summary(b1)
# the same, using the Epanechnikov kernel
b2 <- kernelboot(mtcars, function(data) coef(lm(mpg ~ drat + wt, data = data)), R = 250,
                 kernel = "epanechnikov")
b2
summary(b2)
# smooth bootstrap of parameters of linear regression
# smoothing phase is not applied to "am" and "cyl" variables
b3 <- kernelboot(mtcars, function(data) coef(lm(mpg ~ drat + wt + am + cyl, data = data)), R = 250,
                 ignore = c("am", "cyl"))
b3
summary(b3)
# standard bootstrap (without kernel smoothing)
b4 <- kernelboot(mtcars, function(data) coef(lm(mpg ~ drat + wt + am + cyl, data = data)), R = 250,
                 ignore = colnames(mtcars))
b4
summary(b4)
# smooth bootstrap for median of univariate data
b5 <- kernelboot(mtcars$mpg, function(data) median(data), R = 250)
b5
summary(b5)