gKRLS: Generalized Kernel Regularized Least Squares

Description

This page documents how to use gKRLS as part of a model estimated with mgcv. Post-estimation functions for calculating marginal effects are documented elsewhere, e.g., calculate_effects.

Usage

gKRLS(
  sketch_method = "subsampling",
  standardize = "Mahalanobis",
  bandwidth = NULL,
  sketch_multiplier = 5,
  sketch_size_raw = NULL,
  sketch_prob = NULL,
  rescale_penalty = TRUE,
  truncate.eigen.tol = sqrt(.Machine$double.eps),
  demean_kernel = FALSE,
  remove_instability = TRUE
)
get_calibration_information(object)

Value

gKRLS returns a named list with the elements in "Arguments".

Arguments

sketch_method

A string that specifies which kernel sketching method should be used (default of "subsampling"). Options include "subsampling", "gaussian", "bernoulli", or "none" (no sketching). Drineas et al. (2005) and Yang et al. (2017) provide more details on these options.

To force "subsampling" to select a specific set of observations, you can provide a vector of row positions to sketch_method. This manually sets the size of the sketching multiplier, implicitly overriding other options in gKRLS. The examples provide an illustration.

standardize

A string that specifies how the data is standardized before calculating the distance between observations. The default is "Mahalanobis" (i.e., demeaned and transformed to have an identity covariance matrix). Other options are "scaled" (all columns are scaled to have mean zero and variance of one) or "none" (no standardization).
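The two non-trivial standardization options can be illustrated in base R. This is a minimal sketch on simulated data, not the package's internal code: the Cholesky-based whitening is an assumption (the package may whiten differently), although any whitening that yields an identity covariance gives the same pairwise distances.

```r
set.seed(1)
# Toy design matrix with correlated columns (illustrative data only)
X <- matrix(rnorm(200 * 3), ncol = 3)
X[, 2] <- X[, 1] + 0.5 * X[, 2]

# "scaled": each column has mean zero and variance one
X_scaled <- scale(X)

# "Mahalanobis": demean, then whiten so the covariance is the identity
X_centered <- scale(X, center = TRUE, scale = FALSE)
R <- chol(cov(X))                    # cov(X) = t(R) %*% R
X_mahal <- X_centered %*% solve(R)   # cov(X_mahal) is the identity matrix

round(cov(X_mahal), 10)
```

After this transformation, the squared Euclidean distance between two rows of X_mahal equals the squared Mahalanobis distance between the corresponding rows of X.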

bandwidth

A bandwidth $P$ for the kernel where each element of the kernel $(i,j)$ is defined by $\exp(-||x_i - x_j||^2_2/P)$. The default (NULL) uses the number of covariates in the kernel or the rank of the corresponding design matrix. An additional option ("calibrate") chooses $P$ to maximize the variance of the kernel, e.g., $var(vec(K))$ for the unsketched case. This follows Hartman et al. (2024) with modifications when the kernel is sketched. Please see gKRLS_addendum.pdf for a formal exposition.
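The kernel definition above can be written out in a few lines of base R. This is a minimal sketch for the unsketched case on simulated data. The helper kernel_variance and the grid of candidate values are hypothetical and only illustrate the quantity that "calibrate" is described as maximizing; the package's actual calibration procedure is the one set out in the addendum.

```r
set.seed(2)
X <- matrix(rnorm(50 * 3), ncol = 3)   # illustrative, already-standardized data

# Default bandwidth described in the text: the number of covariates in the kernel
P <- ncol(X)

# Squared Euclidean distances between all pairs of rows
D2 <- as.matrix(dist(X))^2

# Gaussian kernel with element (i, j) = exp(-||x_i - x_j||^2 / P)
K <- exp(-D2 / P)

# var(vec(K)) as a function of the bandwidth (hypothetical helper)
kernel_variance <- function(P) var(as.vector(exp(-D2 / P)))

# Coarse, hypothetical grid of candidate bandwidths; pick the maximizer
grid <- c(0.5, 1, 2, 3, 5, 10, 20)
P_grid <- grid[which.max(sapply(grid, kernel_variance))]
```

Very small bandwidths push the off-diagonal entries toward 0 and very large ones push every entry toward 1, so the variance of the kernel entries is low at both extremes and peaks somewhere in between.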

sketch_multiplier

A number that sets the size of the sketching dimension: sketch_multiplier * ceiling(N^(1/3)) where N is the number of observations. The default is 5; Chang and Goplerud (2024) find that increasing this to 15 may improve stability for certain complex kernels. sketch_size_raw can directly set the size of the sketching dimension.
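The formula for the sketching dimension is easy to evaluate by hand. In the sketch below, sketch_dimension is a hypothetical helper written for illustration; it is not a function exported by gKRLS.

```r
# Sketching dimension as described: sketch_multiplier * ceiling(N^(1/3))
# (sketch_dimension is a hypothetical helper, not part of gKRLS)
sketch_dimension <- function(N, sketch_multiplier = 5) {
  sketch_multiplier * ceiling(N^(1/3))
}

sketch_dimension(100)       # 5 * ceiling(4.64) = 25
sketch_dimension(100, 15)   # 15 * 5 = 75
sketch_dimension(5000)      # 5 * ceiling(17.1) = 90
```

Because of the cube root, the sketching dimension grows slowly: moving from 100 to 5000 observations raises it only from 25 to 90 under the default multiplier.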

sketch_size_raw

A number to set the exact size of the sketching dimension. The default, NULL, means that this argument is not used and the size depends on the number of observations; see sketch_multiplier. Exactly one of sketch_size_raw or sketch_multiplier must be NULL.

sketch_prob

A probability for an element of the sketching matrix to equal 1 when using Bernoulli sketching. Yang et al. (2017) provide more details.

rescale_penalty

A logical value for whether the penalty should be rescaled for numerical stability. See documentation for mgcv::smooth.spec on the meaning of this term. The default is TRUE.

truncate.eigen.tol

A threshold for removing columns of the penalty $S K S^T$ whose eigenvalues are small (below truncate.eigen.tol). Removing these columns from the sketched kernel avoids instability due to numerically very small eigenvalues. The default is sqrt(.Machine$double.eps). This adjustment can be disabled by setting remove_instability = FALSE.
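The idea behind the truncation can be sketched in base R. The code below is an illustration under stated assumptions, not the package's implementation: it assumes subsampling sketching (so that $S K S^T$ is, up to scaling, the kernel among the sampled rows), an absolute threshold on the eigenvalues, and a reparameterization through the retained eigenvectors. The object names (ids, penalty, basis) are hypothetical.

```r
set.seed(3)
X <- matrix(rnorm(30 * 2), ncol = 2)
X[2, ] <- X[1, ]                         # duplicated row makes the kernel rank-deficient
K <- exp(-as.matrix(dist(X))^2 / ncol(X))

ids <- 1:10                              # hypothetical subsample (includes the duplicate)
penalty <- K[ids, ids]                   # S K S^T under subsampling, up to scaling

tol <- sqrt(.Machine$double.eps)
eig <- eigen(penalty, symmetric = TRUE)
keep <- eig$values > tol                 # drop numerically zero directions

# Reduced basis and diagonal penalty after truncation (assumed reparameterization)
basis <- K[, ids] %*% eig$vectors[, keep, drop = FALSE]
penalty_reduced <- diag(eig$values[keep], nrow = sum(keep))
```

The duplicated observation guarantees at least one eigenvalue that is zero up to rounding error, which is the kind of direction the threshold is meant to remove.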

demean_kernel

A logical value that indicates whether columns of the (sketched) kernel should be demeaned before estimation. The default is FALSE.

remove_instability

A logical value that indicates whether numerical zeros (set via truncate.eigen.tol) should be removed when building the penalty matrix. The default is TRUE.

object

A model estimated using mgcv::gam or mgcv::bam.

Details

Overview: The gKRLS function should not be called directly. Its options, described above, control how gKRLS is estimated. It should be passed to mgcv as follows: s(x1, x2, x3, bs = "gKRLS", xt = gKRLS(...)). Multiple kernels can be specified and have different gKRLS arguments. It can also be used alongside the existing options for s() in mgcv.

If bandwidth = "calibrate", the function get_calibration_information reports the estimated bandwidth and the time (in minutes) needed to estimate it.

Default Settings: By default, bs = "gKRLS" uses the Mahalanobis distance between observations, subsampling sketching (i.e., the kernel is constructed using a random sample of the observations; Yang et al. 2017), and a sketching dimension of 5 * ceiling(N^(1/3)), where N is the number of observations. Chang and Goplerud (2024) explore alternative options.

Notes: Please note that variables must be separated with commas inside of s(...) and that character variables should usually be passed as factors to work smoothly with mgcv. When using this function with bam, the sketching dimension uses chunk.size in place of N and thus either chunk.size or sketch_size_raw must be used to cause the sketching dimension to increase with N.
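One way to act on the note about bam is to fix the sketching dimension directly with sketch_size_raw. The following is a minimal sketch on simulated data, assuming that sketch_multiplier must be set to NULL whenever sketch_size_raw is supplied (as the argument description states); the value 40 is arbitrary. The code is guarded so that it only runs when both packages are installed.

```r
# Guarded so the sketch only runs when both packages are installed
if (requireNamespace("gKRLS", quietly = TRUE) &&
    requireNamespace("mgcv", quietly = TRUE)) {
  library(gKRLS)
  set.seed(4)
  n <- 500
  dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  dat$y <- sin(dat$x1) + 0.5 * dat$x2 + rnorm(n)

  # Fix the sketching dimension directly; sketch_multiplier is set to NULL
  # because exactly one of the two arguments must be NULL
  fit_bam <- mgcv::bam(
    y ~ s(x1, x2,
      bs = "gKRLS",
      xt = gKRLS(sketch_size_raw = 40, sketch_multiplier = NULL)
    ),
    data = dat
  )
}
```

With sketch_size_raw, the sketching dimension no longer depends on chunk.size, so it has to be chosen by the analyst to suit the size of the data.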

References

Chang, Qing, and Max Goplerud. 2024. "Generalized Kernel Regularized Least Squares." Political Analysis 32(2):157-171.

Hartman, Erin, Chad Hazlett, and Ciara Sterbenz. 2024. "kpop: A Kernel Balancing Approach for Reducing Specification Assumptions in Survey Weighting." Journal of the Royal Statistical Society Series A: Statistics in Society. doi:10.1093/jrsssa/qnae082.

Drineas, Petros, Michael W. Mahoney, and Nello Cristianini. 2005. "On the Nyström Method for Approximating a Gram Matrix For Improved Kernel-Based Learning." Journal of Machine Learning Research 6(12):2153-2175.

Yang, Yun, Mert Pilanci, and Martin J. Wainwright. 2017. "Randomized Sketches for Kernels: Fast and Optimal Nonparametric Regression." Annals of Statistics 45(3):991-1023.

Examples

set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
state <- sample(letters[1:5], n, replace = TRUE)
y <- 0.3 * x1 + 0.4 * x2 + 0.5 * x3 + rnorm(n)
data <- data.frame(y, x1, x2, x3, state)
data$state <- factor(data$state)
# A gKRLS model without fixed effects
fit_gKRLS <- mgcv::gam(y ~ s(x1, x2, x3, bs = "gKRLS"), data = data)
summary(fit_gKRLS)
# A gKRLS model with fixed effects outside of the kernel
fit_gKRLS_FE <- mgcv::gam(y ~ state + s(x1, x2, x3, bs = "gKRLS"), data = data)

# HC3 is not available for mgcv; this uses the effective degrees of freedom
# instead of the number of columns; see ?estfun.gam for details
robust <- sandwich::vcovHC(fit_gKRLS, type = 'HC1')
cluster <- sandwich::vcovCL(fit_gKRLS, cluster = data$state)

# Change default standardization to "scaled", sketch method to Gaussian,
# and alter sketching multiplier
fit_gKRLS_alt <- mgcv::gam(y ~ s(x1, x2, x3,
  bs = "gKRLS",
  xt = gKRLS(
    standardize = "scaled",
    sketch_method = "gaussian",
    sketch_multiplier = 2
  )
),
data = data
)
# A model with multiple kernels
fit_gKRLS_2 <- mgcv::gam(y ~ s(x1, x2, bs = 'gKRLS') + s(x1, x3, bs = 'gKRLS'), data = data)
# A model with a custom set of ids for sketching
id <- sample(1:n, 5)
fit_gKRLS_custom <- mgcv::gam(y ~ s(x1, bs = 'gKRLS', xt = gKRLS(sketch_method = id)), data = data)
# Note that the ids of the sampled observations can be extracted 
# from the fitted mgcv object
stopifnot(identical(id, fit_gKRLS_custom$smooth[[1]]$subsampling_id))
# calculate marginal effect (see ?calculate_effects for more examples)
calculate_effects(fit_gKRLS, variables = "x1")