np.boot: Nonparametric Bootstrap Resampling

Description

Nonparametric bootstrap resampling for univariate and multivariate statistics. Computes bootstrap estimates of the standard error, bias, and covariance. Also computes five different types of bootstrap confidence intervals: normal approximation interval, basic (reverse percentile) interval, percentile interval, studentized (bootstrap-t) interval, and bias-corrected and accelerated (BCa) interval.

Usage

np.boot(x, statistic, ..., R = 9999, level = c(0.9, 0.95, 0.99),
        method = c("norm", "basic", "perc", "stud", "bca")[-4], 
        sdfun = NULL, sdrep = 99, jackknife = NULL, 
        parallel = FALSE, cl = NULL, boot.dist = TRUE)

Value

t0: Observed statistic, computed using statistic(x, ...)
se: Bootstrap estimate of the standard error.
bias: Bootstrap estimate of the bias.
cov: Bootstrap estimate of the covariance (for multivariate statistics).
normal: Normal approximation confidence interval(s).
basic: Basic (reverse percentile) confidence interval(s).
percent: Percentile confidence interval(s).
student: Studentized (bootstrap-t) confidence interval(s).
bca: Bias-corrected and accelerated (BCa) confidence interval(s).
z0: Bias-correction factor(s). Only provided if bca %in% method.
acc: Acceleration factor(s). Only provided if bca %in% method.
boot.dist: Bootstrap distribution of statistic(s). Only provided if boot.dist = TRUE.
statistic: Statistic function (same as input).
R: Number of bootstrap replicates (same as input).
level: Confidence level (same as input).
sdfun: Standard deviation function for statistic (same as input).
sdrep: Number of inner bootstrap replicates (same as input).
jackknife: Jackknife function (same as input).

Arguments

x: vector of data (for univariate data), data frame (for basic multivariate data), or vector of row indices (for advanced multivariate data). See examples.
statistic: function that takes in x (and possibly additional arguments passed using ...) and returns a vector containing the statistic(s). See examples.
...: additional named arguments for the statistic function.
R: number of bootstrap replicates
level: desired confidence level(s) for the computed intervals. Default computes 90%, 95%, and 99% confidence intervals.
method: method(s) for computing confidence intervals. Partial matching is allowed. Any subset of allowable methods is permitted (default computes all intervals except studentized). Set method = NULL to produce no confidence intervals.
sdfun: function for computing the standard deviation of statistic. Should produce a vector the same length as the output of statistic. Only applicable if "stud" %in% method. If NULL, an inner bootstrap is used to estimate the standard deviation.
sdrep: number of bootstrap replicates for the inner bootstrap used to estimate the standard deviation of statistic. Only applicable if "stud" %in% method and sdfun = NULL. Larger values produce more accurate estimates (see Note).
jackknife: function that takes in x (and possibly additional arguments passed using ...) and returns a vector containing the jackknife statistic(s). Should produce a vector the same length as the output of statistic. Only applicable if "bca" %in% method. If NULL, the jackknife function is defined as the statistic function (default). See the last example for a case when statistic and jackknife are different.
parallel: Logical indicating if the parallel package should be used for parallel computing (of the bootstrap distribution). Defaults to FALSE, which implements sequential computing.
cl: Cluster for parallel computing, which is used when parallel = TRUE. Note that if parallel = TRUE and cl = NULL, then the cluster is defined as makeCluster(2L) to use two cores. To make use of all available cores, use the code cl = makeCluster(detectCores()).
boot.dist: Logical indicating if the bootstrap distribution should be returned (see Note).

Author

Nathaniel E. Helwig <helwig@umn.edu>

Details

The first three intervals (normal, basic, and percentile) are only first-order accurate, whereas the last two intervals (studentized and BCa) are both second-order accurate. Thus, the results from the studentized and BCa intervals tend to provide more accurate coverage rates.

Unless the standard deviation function for the studentized interval is input via the sdfun argument, the studentized interval can be quite computationally costly. This is because an inner bootstrap is needed to estimate the standard deviation of the statistic for each (outer) bootstrap replicate---and you may want to increase the default number of inner bootstrap replicates (see Note).

The efficiency of the BCa interval will depend on the sample size \(n\) and the computational complexity of the (jackknife) statistic estimate. Assuming that \(n\) is not too large and the jackknife statistic is not too difficult to compute, the BCa interval can be computed reasonably quickly---especially in comparison the studentized interval with an inner bootstrap.

Computational details of the various confidence intervals are described in Efron and Tibshirani (1994) and in Davison and Hinkley (1997). For a useful and concise discussion of the various intervals, see Carpenter and Bithell (2000).

References

Carpenter, J., & Bithell, J. (2000). Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Statistics in Medicine, 19(9), 1141-1164. doi: 10.1002/(SICI)1097-0258(20000515)19:9%3C1141::AID-SIM479%3E3.0.CO;2-F

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge University Press. doi: 10.1017/CBO9780511802843

Efron, B., & Tibshirani, R. J. (1994). An Introduction to the Boostrap. Chapman & Hall/CRC. doi: 10.1201/9780429246593

Examples

Run this code

######***######   UNIVARIATE DATA   ######***######

### Example 1: univariate statistic (median)

# generate 100 standard normal observations
set.seed(1)
n <- 100
x <- rnorm(n)

# nonparametric bootstrap
npbs <- np.boot(x = x, statistic = median)
npbs


### Example 2: multivariate statistic (quartiles)

# generate 100 standard normal observations
set.seed(1)
n <- 100
x <- rnorm(n)

# nonparametric bootstrap
npbs <- np.boot(x = x, statistic = quantile, 
                probs = c(0.25, 0.5, 0.75))
npbs



if (FALSE) {

######***######   MULTIVARIATE DATA   ######***######

### Example 1: univariate statistic (correlation)

## Generate bivariate data with population var = 1 and cor = 0.5

# correlation matrix square root (with rho = 0.5)
rho <- 0.5
val <- c(sqrt(1 + rho), sqrt(1 - rho))
corsqrt <- matrix(c(val[1], -val[2], val), 2, 2) / sqrt(2)

# generate 100 bivariate observations (with rho = 0.5)
n <- 100
set.seed(1)
data <- cbind(rnorm(n), rnorm(n)) %*% corsqrt

## Method A: x = data frame;  statistic of x

# define statistic function
statfun <- function(data) cor(data[,1], data[,2])

# nonparametric bootstrap
set.seed(2)
npbs <- np.boot(x = data, statistic = statfun)
npbs

## Method B: x = 1:n;  statistic of data[ix,] with data passed via ...

# define statistic function
statfun <- function(ix, data) cor(data[ix,1], data[ix,2])

# nonparametric bootstrap
set.seed(2)
npbs <- np.boot(x = 1:n, statistic = statfun, data = data)
npbs


### Example 2: multivariate statistic (variances and covariance)

## Generate bivariate data with population var = 1 and cor = 0.5

# correlation matrix square root (with rho = 0.5)
rho <- 0.5
val <- c(sqrt(1 + rho), sqrt(1 - rho))
corsqrt <- matrix(c(val[1], -val[2], val), 2, 2) / sqrt(2)

# generate 100 bivariate observations (with rho = 0.5)
n <- 100
set.seed(1)
data <- cbind(rnorm(n), rnorm(n)) %*% corsqrt

## Method A: x = data frame;  statistic of x

# define statistic function
statfun <- function(data) {
  cmat <- cov(data)
  ltri <- lower.tri(cmat, diag = TRUE)
  cvec <- cmat[ltri]
  names(cvec) <- c("var(x1)", "cov(x1,x2)", "var(x2)")
  cvec
}

# nonparametric bootstrap
set.seed(2)
npbs <- np.boot(x = data, statistic = statfun)
npbs

## Method B: x = 1:n;  statistic of data[ix,] with data passed via ...

# define statistic function
statfun <- function(ix, data) {
  cmat <- cov(data[ix,])
  ltri <- lower.tri(cmat, diag = TRUE)
  cvec <- cmat[ltri]
  names(cvec) <- c("var(x1)", "cov(x1,x2)", "var(x2)")
  cvec
}

# nonparametric bootstrap
set.seed(2)
npbs <- np.boot(x = 1:n, statistic = statfun, data = data)
npbs



######***######   REGRESSION   ######***######

### Example 1: bootstrap cases

## Generate bivariate data with E(y|x) = 1 + 2 * x

# generate 100 observations
n <- 100
set.seed(1)
x <- seq(0, 1, length.out = n)
y <- 1 + 2 * x + rnorm(n)
data <- data.frame(x = x, y = y)

## Method A: x = data frame;  statistic of x

# define statistic function
statfun <- function(data) {
  lm(y ~ x, data = data)$coefficients
}

# nonparametric bootstrap
set.seed(2)
npbs <- np.boot(x = data, statistic = statfun)
npbs

## Method B: x = 1:n;  statistic of data[ix,] with data passed via ...

# define statistic function
statfun <- function(ix, data) {
  lm(y ~ x, data = data[ix,])$coefficients
}

# nonparametric bootstrap
set.seed(2)
npbs <- np.boot(x = 1:n, statistic = statfun, data = data)
npbs


### Example 2: bootstrap residuals

# generate 100 observations
n <- 100
set.seed(1)
x <- seq(0, 1, length.out = n)
y <- 1 + 2 * x + rnorm(n)
data <- data.frame(x = x, y = y)

# prepare data
mod0 <- lm(y ~ x, data = data)
df <- data.frame(x = x, y = y, fit = mod0$fitted.values, resid = mod0$residuals)

# define statistic function
statfun <- function(ix, data) {
  data$y <- data$fit + data$resid[ix]
  lm(y ~ x, data = data)$coefficients
}

# define jackknife function
jackfun <- function(ix, data){
  lm(y ~ x, data = data[ix,])$coefficients
}

# nonparametric bootstrap
npbs <- np.boot(x = 1:n, statistic = statfun, data = df, jackknife = jackfun)
npbs
}

Run the code above in your browser using DataLab