cvLM provides a high-performance framework for cross-validating linear models using
RcppArmadillo and RcppParallel. It implements numerically stable algorithms to compute K-fold,
Leave-One-Out (LOO), and Generalized Cross-Validation (GCV) metrics for both linear and ridge regression.
cvLM(object, ...)

# S3 method for formula
cvLM(object, data, subset, na.action, K.vals = 10L,
     lambda = 0, generalized = FALSE, seed = 1L, n.threads = 1L,
     tol = 1e-7, center = TRUE, ...)

# S3 method for lm
cvLM(object, data, K.vals = 10L, lambda = 0,
     generalized = FALSE, seed = 1L, n.threads = 1L,
     tol = 1e-7, center = TRUE, ...)

# S3 method for glm
cvLM(object, data, K.vals = 10L, lambda = 0,
     generalized = FALSE, seed = 1L, n.threads = 1L,
     tol = 1e-7, center = TRUE, ...)
A data.frame with the following columns:

K: the number of folds requested by the user. Note that if the internal algorithm determines a more appropriate number of folds (based on the number of rows), a warning is issued and the modified fold count is used for the computation; this column nevertheless retains the input value.
CV: the cross-validation metric calculated during the procedure.
seed: the integer seed used for fold randomization.
data: a data frame, list or environment containing the variables in the model. See model.frame.
subset: a specification of the rows/observations to be used. For the lm and glm methods, this is inherited from object. See model.frame.
na.action: a function indicating how NA values should be handled. For the lm and glm methods, this is inherited from object. See model.frame.
K.vals: an integer vector specifying the numbers of folds. For Leave-One-Out CV, set K.vals equal to the number of observations.
lambda: a non-negative numeric scalar giving the ridge regularization parameter. Defaults to 0 (OLS).
generalized: if TRUE, the Generalized Cross-Validation (GCV) statistic is computed and K.vals is ignored.
seed: an integer used to initialize the random number generator for reproducible fold splits.
n.threads: an integer specifying the number of threads for parallel computation. Set to -1 to use the RcppParallel default (defaultNumThreads). If n.threads > 1, the package attempts to restrict the underlying BLAS to a single thread using RhpcBLASctl to prevent oversubscription.
tol: a numeric scalar specifying the tolerance for rank estimation in the underlying orthogonal and singular value decompositions. See 'Details' below.
center: if TRUE (the default), the predictors and response are mean-centered prior to fitting. See 'Details' below.
...: additional arguments passed to or from other methods (currently disregarded).
Philip Nye, phipnye@gmail.com
Numerical Stability and Underdetermined Systems
The newest release of cvLM prioritizes numerical robustness. For Ordinary Least Squares
(\(\lambda = 0\)), the package employs Complete Orthogonal Decomposition (COD). In the presence of column
rank-deficiency---where predictors are perfectly collinear or the number of parameters \(p\) exceeds the
number of observations \(n\)---COD allows for the identification of the unique minimum \(L_2\) norm
solution, providing a stable platform for cross-validation in high-dimensional or ill-conditioned settings.
The numerical rank is determined by the tol argument; a pivot is considered nonzero if its absolute
value is strictly greater than the threshold relative to the largest pivot:
$$|r_{ii}| > \text{tol} \times |r_{\text{max}}|$$
This approach differs from R's lm, which handles column rank-deficiency by zeroing out the coefficients of redundant columns; the two can therefore yield different coefficient estimates and subsequent predictions when the design matrix is not of full column rank.
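To see the difference concretely, the following base R sketch (illustrative only, not part of the cvLM API; min_norm_ls is a hypothetical helper) contrasts lm's treatment of a duplicated column with the minimum L2-norm solution obtained from a tolerance-thresholded SVD:

set.seed(1)
X <- cbind(x1 = rnorm(20), x2 = rnorm(20))
X <- cbind(X, x3 = X[, "x1"])          # x3 duplicates x1: rank-deficient design
y <- X[, "x1"] + rnorm(20)

coef(lm(y ~ X - 1))                    # lm reports NA for the redundant column

# Minimum L2-norm least-squares solution via the SVD pseudoinverse
min_norm_ls <- function(X, y, tol = 1e-7) {
  s <- svd(X)
  keep <- s$d > tol * max(s$d)         # same relative-threshold idea as 'tol'
  s$v[, keep, drop = FALSE] %*% ((t(s$u[, keep, drop = FALSE]) %*% y) / s$d[keep])
}
min_norm_ls(X, y)                      # spreads the effect across the collinear columns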
For ridge regression (\(\lambda > 0\)), the package utilizes Singular Value Decomposition (SVD). By
decomposing the design matrix into \(X = UDV^T\), the ridge solution can be computed with high precision,
avoiding the potential information loss associated with forming the cross-product matrix \(X^TX\) required
for Cholesky-based methods used in previous versions. As with the COD, a singular value is considered
nonzero only if it satisfies:
$$\sigma_i > \text{tol} \times \sigma_{\text{max}}$$
While this numerical stability comes at some cost in speed, it keeps the results consistent with the
efficient shrinkage-parameter search performed by grid.search.
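As an illustration of this decomposition-based approach, the minimal base R sketch below computes ridge coefficients directly from the SVD under the stated thresholding rule; ridge_svd is a hypothetical helper, not the package's internal C++ routine:

# Ridge coefficients from X = U D V': beta = V diag(d / (d^2 + lambda)) U' y
ridge_svd <- function(X, y, lambda, tol = 1e-7) {
  s <- svd(X)
  keep <- s$d > tol * max(s$d)                   # relative singular-value threshold
  d <- s$d[keep]
  shrink <- d / (d^2 + lambda)                   # per-component shrinkage factors
  s$v[, keep, drop = FALSE] %*% (shrink * (t(s$u[, keep, drop = FALSE]) %*% y))
}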
Regularization and Data Centering
Following the methodology in The Elements of Statistical Learning (Hastie et al.), cvLM now
defaults to centering the data (center = TRUE). Centering the predictors \(X\) and the response
\(y\) by their respective means is mathematically equivalent to excluding the intercept term from the
regularization penalty.
In the uncentered case (center = FALSE), the objective function penalizes all coefficients,
including the intercept \(\beta_0\):
$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \|\beta\|^2$$
In the centered case (center = TRUE), the optimization problem is effectively reformulated as:
$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}
\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$
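A minimal base R sketch of this equivalence (illustrative only, not cvLM's internals): center X and y, solve the penalized normal equations for the slopes, and recover the intercept from the means, so that lambda never shrinks the intercept.

set.seed(1)
X <- matrix(rnorm(100 * 3), 100, 3)
y <- X %*% c(1, -2, 0.5) + rnorm(100)
lambda <- 2

Xc <- scale(X, center = TRUE, scale = FALSE)     # mean-center predictors
yc <- y - mean(y)                                # mean-center response
beta  <- solve(crossprod(Xc) + lambda * diag(ncol(X)), crossprod(Xc, yc))
beta0 <- mean(y) - colMeans(X) %*% beta          # intercept recovered, unpenalized
c(beta0, beta)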
Efficiency of LOOCV and GCV
For Leave-One-Out (LOOCV) and Generalized Cross-Validation (GCV), cvLM avoids the computational cost
of \(n\) refits by using closed-form expressions based on the hat matrix.
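These are the standard shortcut formulas; the base R sketch below illustrates them for OLS, where a single full fit suffices, rather than reproducing cvLM's C++ implementation:

X   <- model.matrix(mpg ~ ., data = mtcars)
y   <- mtcars$mpg
H   <- X %*% solve(crossprod(X), t(X))    # hat matrix
res <- y - H %*% y                        # residuals from one full fit
h   <- diag(H)

loocv <- mean((res / (1 - h))^2)          # LOOCV without n refits
gcv   <- mean((res / (1 - mean(h)))^2)    # GCV replaces h_ii with its average
c(LOOCV = loocv, GCV = gcv)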
Parallel Computation and Oversubscription
cvLM uses RcppParallel to distribute K-fold cross-validation tasks across multiple threads.
However, many modern R installations link to multi-threaded BLAS libraries. Running multi-threaded BLAS
operations within multi-threaded folds can lead to oversubscription, where the total number of active
threads exceeds the physical CPU capacity, causing significant performance penalties.
To mitigate this, if the RhpcBLASctl package is installed, cvLM will automatically attempt to
set the BLAS thread count to 1 during parallel execution. It is highly recommended to install
RhpcBLASctl when using n.threads > 1.
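If you prefer to manage the BLAS thread count yourself, a wrapper along the following lines can be used (a sketch; run_cv_single_blas is a hypothetical helper, and the RhpcBLASctl functions blas_get_num_procs and blas_set_num_threads are assumed to be available):

run_cv_single_blas <- function(...) {
  if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
    old <- RhpcBLASctl::blas_get_num_procs()
    RhpcBLASctl::blas_set_num_threads(1)             # pin BLAS to one thread
    on.exit(RhpcBLASctl::blas_set_num_threads(old), add = TRUE)
  }
  cvLM(...)                                          # folds still run in parallel
}
run_cv_single_blas(mpg ~ ., data = mtcars, K.vals = 10L, n.threads = 2L)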
Eddelbuettel, D., & Sanderson, C. (2014). RcppArmadillo: Accelerating R with high-performance C++ linear algebra. Computational Statistics and Data Analysis, 71, 1054-1063. doi:10.1016/j.csda.2013.02.005.
Aggarwal, C. C. (2020). Linear Algebra and Optimization for Machine Learning: A Textbook. Springer Cham. doi:10.1007/978-3-030-40344-7.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer New York, NY. doi:10.1007/978-0-387-84858-7.
Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press, Baltimore, MD. doi:10.56021/9781421407944.
grid.search
library(cvLM)

data(mtcars)
n <- nrow(mtcars)

# 10-fold CV for a linear regression model
cvLM(mpg ~ ., data = mtcars, K.vals = 10L)

# Comparing 5-fold, 10-fold, and Leave-One-Out CV configurations using 2 threads
cvLM(mpg ~ ., data = mtcars, K.vals = c(5L, 10L, n), n.threads = 2L)

# Ridge regression with analytic GCV (using the lm interface)
fitted.lm <- lm(mpg ~ ., data = mtcars)
cvLM(fitted.lm, data = mtcars, lambda = 0.5, generalized = TRUE)