bigKRLS (version 3.0.5.1)

bigKRLS: Runtime and Memory Optimized Kernel Regularized Least Squares

Description

Runtime and memory optimized Kernel Regularized Least Squares, by Pete Mohanty (Stanford University) and Robert Shaffer (University of Texas at Austin).

Usage

bigKRLS(y = NULL, X = NULL, sigma = NULL, derivative = TRUE,
  which.derivatives = NULL, vcov.est = TRUE, Neig = NULL,
  eigtrunc = NULL, lambda = NULL, L = NULL, U = NULL, tol = NULL,
  model_subfolder_name = NULL, overwrite.existing = FALSE,
  Ncores = NULL, acf = FALSE, noisy = NULL, instructions = TRUE)

Arguments

y

A vector of numeric observations on the dependent variable. Missing values are not allowed. May be a base R matrix or a big.matrix.

X

A matrix of numeric observations of the independent variables. Factors, missing values, and constant vectors are not allowed. May be a base R matrix or a big.matrix.

sigma

Bandwidth parameter; the value supplied is sigma squared in the Gaussian kernel k(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2). Default: sigma <- ncol(X), i.e., the number of predictors. See the combined example at the end of this list.

derivative

Logical. Estimate derivatives (as opposed to just coefficients)? Recommended for interpretability.

which.derivatives

Optional. For which columns of X should marginal effects be estimated? If derivative = TRUE and which.derivatives = NULL (default), all marginal effects will be estimated.

vcov.est

Logical. Estimate variance covariance matrix? Required to obtain derivatives and standard errors on predictions.

Neig

Number of eigenvectors and eigenvalues to calculate. The default is to calculate all N and use only those for which eigval >= 0.001 * max(eigval). Smaller values reduce runtime but decrease precision.

eigtrunc

Eigentruncation parameter. If NULL, defaults to 0.001 if N > 3000 and 0 otherwise. eigtrunc = 0.25 keeps only those eigenvectors/values whose eigenvalue is at least 25% of the max. If eigtrunc == 0, all Neig are used to select lambda and to estimate variances. Larger values reduce runtime but decrease precision.

lambda

Regularization parameter; must be a positive real number. Default: selected based (in part) on the eigenvalues of the kernel via Golden Search.

L

Lower bound of Golden Search for lambda.

U

Upper bound of Golden Search for lambda.

tol

Tolerance parameter for the Golden Search for lambda. Default: N / 1000.

model_subfolder_name

If not NULL, saves the estimates to this subfolder of your current working directory. Alternatively, use save.bigKRLS() on the returned object.

overwrite.existing

Logical. Overwrite contents of the folder model_subfolder_name? If FALSE, the lowest available number is appended to model_subfolder_name (e.g., ../myresults3/).

Ncores

Number of processor cores to use. Default: ncol(X) or the number of available cores minus 2, whichever is smaller. Using more than (available cores - 2) is NOT recommended. Uses library(parallel) unless Ncores = 1.

acf

Logical; experimental. Default: FALSE. Calculate Neffective as a function of the mean absolute autocorrelation in X to correct p-values?

noisy

Logical. Display detailed progress in the console (intermediate output, time stamps, etc.), as opposed to a minimal display? Default: TRUE if N > 2000, FALSE otherwise. SSH users should use X11 forwarding to see the Rcpp progress display.

instructions

Logical. After estimation, display syntax for other library(bigKRLS) functions that can be used on the output?
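
To see how these arguments combine, here is a hedged sketch of a customized call. Every value below is illustrative rather than a recommendation; the defaults are sensible for most applications.

out <- bigKRLS(y, X,
               sigma = ncol(X),              # Gaussian kernel bandwidth (sigma squared); this is the default
               Neig = 1000,                  # compute only the leading 1000 eigenvalues/vectors
               eigtrunc = 0.01,              # drop eigenvalues below 1% of the largest
               which.derivatives = c(1, 3),  # marginal effects for columns 1 and 3 of X only
               Ncores = 4,                   # parallelize across 4 cores
               noisy = TRUE)                 # detailed progress output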

Value

An object of class bigKRLS containing slope and uncertainty estimates; summary() and predict() methods are defined for class bigKRLS, as is shiny.bigKRLS().
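
For example (a minimal sketch; X_new is a hypothetical matrix with the same columns as X):

fit <- bigKRLS(y, X)
summary(fit)                  # slope estimates with uncertainty
preds <- predict(fit, X_new)  # predictions for new observations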

Details

Kernel Regularized Least Squares (KRLS) is a kernel-based, complexity-penalized method developed by Hainmueller and Hazlett (2014) to minimize parametric assumptions while maintaining interpretive clarity. Here, we introduce bigKRLS, an updated version of the original KRLS R package with algorithmic and implementation improvements designed to optimize speed and memory usage. These improvements allow users to straightforwardly fit KRLS models to medium and large datasets (N > ~2500).
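
For intuition: with Gaussian kernel matrix K, where K_ij = exp(-||x_i - x_j||^2 / sigma^2), KRLS computes coefficients c = (K + lambda*I)^{-1} y and fitted values Kc. The deliberately naive sketch below illustrates the estimator with lambda held fixed; bigKRLS itself selects lambda by Golden Search and performs these computations in optimized C++.

naive_krls <- function(y, X, lambda = 1) {
  X <- scale(X)                                  # KRLS standardizes the data
  y <- scale(y)
  K <- exp(-as.matrix(dist(X))^2 / ncol(X))      # Gaussian kernel, sigma^2 = ncol(X)
  c_hat <- solve(K + lambda * diag(nrow(K)), y)  # c = (K + lambda*I)^{-1} y
  list(coefficients = c_hat, fitted = K %*% c_hat)
}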

Major Updates:

1. C++ integration. We re-implement most major computations in the model in C++ via Rcpp and RcppArmadillo. These changes produce up to a 50% runtime decrease compared to the original R implementation even on a single core.

2. Leaner algorithm. Because of the Tikhonov regularization and parameter tuning strategies used in KRLS, this method of estimation is inherently memory-heavy (memory usage scales as O(N^2)), making memory savings important even for small- and medium-sized applications. We develop and implement a new local derivatives algorithm, which reduces peak memory usage by approximately an order of magnitude, and cut the number of computations needed to find the regularization parameter in half.

3. Improved memory management. Most data objects in R perform poorly in memory-intensive applications. We use a series of packages in the bigmemory environment to ease this constraint, allowing our implementation to handle larger datasets more smoothly.

4. Parallel Processing. We parallelize most major calculations in the model. Time savings are especially noticeable in the derivative estimation portion of the algorithm.

5. Interactive data visualization. We include an R Shiny app that allows bigKRLS users to easily share results with collaborators or more general audiences. Simply call shiny.bigKRLS() on the outputted regression object.

6. Improved uncertainty estimates. bigKRLS uses an adjusted degrees-of-freedom estimator that reflects both the regularization process and the number of predictors. For details and other options, see help(summary.bigKRLS).

7. Cross-validation. crossvalidate.bigKRLS performs cross-validation and stores performance results and parameter settings for each fold (see the sketch following this list). See vignette("bigKRLS_basics") or help("crossvalidate.bigKRLS") for syntax.
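
A hedged sketch of the cross-validation workflow from item 7; the argument values are illustrative, and help("crossvalidate.bigKRLS") documents the definitive interface.

cv <- crossvalidate.bigKRLS(y, X, seed = 1234, Kfolds = 5)  # 5-fold cross-validation
summary(cv)                                                 # out-of-sample performance by fold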

Requirements. bigKRLS is under active development. bigKRLS, as well as its dependencies, requires current versions of R and its compilers (and RStudio, if used). For details, see https://github.com/rdrr1990/code/blob/master/bigKRLS_installation.md.

For details on syntax, load the library and then open our vignette with vignette("bigKRLS_basics"). Because of the quadratic memory requirement, users working on a typical laptop (8-16 gigabytes of RAM) may wish to start at N = 2,500 or 5,000, particularly if the number of x variables is large. Once you have a sense of how bigKRLS runs on your system, you may wish to estimate only a subset of the marginal effects at N = 10,000-15,000 by setting bigKRLS(..., which.derivatives = c(1, 3, 5)), i.e., requesting marginal effects for only the first, third, and fifth x variables (see the example below).
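
For example, to estimate marginal effects for only the first, third, and fifth columns of X (the indices are illustrative):

out <- bigKRLS(y, X, which.derivatives = c(1, 3, 5))
summary(out)  # reports marginal effects only for the requested columns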

References

Mohanty, Pete and Robert Shaffer. 2018. "Messy Data, Robust Inference? Navigating Obstacles to Inference with bigKRLS." Political Analysis. Cambridge University Press. Pages 1-18. DOI: 10.1017/pan.2018.33. See also: Mohanty, Pete and Robert Shaffer. 2018. "Replication Data for: Messy Data, Robust Inference? Navigating Obstacles to Inference with bigKRLS." Harvard Dataverse, V1. https://doi.org/10.7910/DVN/CYYLOK.

Hainmueller, Jens and Chad Hazlett. 2014. "Kernel Regularized Least Squares: Reducing Misspecification Bias with a Flexible and Interpretable Machine Learning Approach." Political Analysis 22:143-168. https://web.stanford.edu/~jhain/Paper/PA2014a.pdf (accessed May 20, 2016).

Recent papers, presentations, and other code available at https://github.com/rdrr1990/code/

License

Code released under GPL (>= 2).

See Also

summary.bigKRLS, predict.bigKRLS, shiny.bigKRLS, crossvalidate.bigKRLS, save.bigKRLS

Examples

# NOT RUN {
# Analyzing chicken weights (the ChickWeight data ships with base R)
library(bigKRLS)

y <- as.matrix(ChickWeight$weight)
X <- matrix(cbind(ChickWeight$Time, ChickWeight$Diet == 1), ncol = 2)  # logical column coerced to 0/1

out <- bigKRLS(y, X)
out$R2                                  # in-sample R-squared
summary(out, labs = c("Time", "Diet"))  # label the columns of X in the output

# Don't use save() unless out$has.big.matrices == FALSE; big.matrix objects
# are external pointers and do not survive save()/load(). Use save.bigKRLS().
save.bigKRLS(out, "exciting_results")
# }
