The core function for fitting Lagrangian smoothing splines with less user-friendliness.
lgspline.fit(predictors, y = NULL, standardize_response = TRUE,
standardize_predictors_for_knots = TRUE,
standardize_expansions_for_fitting = TRUE, family = gaussian(),
glm_weight_function = function(mu, y, order_indices, family,
dispersion, observation_weights,
...) {
if(any(!is.null(observation_weights))){
family$variance(mu) * observation_weights
} else {
family$variance(mu)
}
},
shur_correction_function = function(X, y, B, dispersion, order_list,
K, family, observation_weights,
...) {
lapply(1:(K+1), function(k) 0)
},
need_dispersion_for_estimation = FALSE,
dispersion_function = function(mu, y, order_indices, family,
observation_weights, ...) { 1 },
K = NULL, custom_knots = NULL, cluster_on_indicators = FALSE,
make_partition_list = NULL, previously_tuned_penalties = NULL,
smoothing_spline_penalty = NULL, opt = TRUE, use_custom_bfgs = TRUE,
delta = NULL, tol = 10*sqrt(.Machine$double.eps),
invsoftplus_initial_wiggle = c(-25, 20, -15, -10, -5),
invsoftplus_initial_flat = c(-14, -7), wiggle_penalty = 2e-07,
flat_ridge_penalty = 0.5, unique_penalty_per_partition = TRUE,
unique_penalty_per_predictor = TRUE, meta_penalty = 1e-08,
predictor_penalties = NULL, partition_penalties = NULL,
include_quadratic_terms = TRUE, include_cubic_terms = TRUE,
include_quartic_terms = FALSE, include_2way_interactions = TRUE,
include_3way_interactions = TRUE,
include_quadratic_interactions = FALSE,
offset = c(), just_linear_with_interactions = NULL,
just_linear_without_interactions = NULL,
exclude_interactions_for = NULL,
exclude_these_expansions = NULL, custom_basis_fxn = NULL,
include_constrain_fitted = TRUE,
include_constrain_first_deriv = TRUE,
include_constrain_second_deriv = TRUE,
include_constrain_interactions = TRUE, cl = NULL, chunk_size = NULL,
parallel_eigen = TRUE, parallel_trace = FALSE, parallel_aga = FALSE,
parallel_matmult = FALSE, parallel_unconstrained = FALSE,
parallel_find_neighbors = FALSE, parallel_penalty = FALSE,
parallel_make_constraint = FALSE,
unconstrained_fit_fxn = unconstrained_fit_default,
keep_weighted_Lambda = FALSE, iterate_tune = TRUE,
iterate_final_fit = TRUE, blockfit = FALSE,
qp_score_function = function(X, y, mu, order_list, dispersion,
VhalfInv, observation_weights, ...) {
if(!is.null(observation_weights)) {
crossprod(X, cbind((y - mu)*observation_weights))
} else {
crossprod(X, cbind(y - mu))
}
},
qp_observations = NULL, qp_Amat = NULL, qp_bvec = NULL, qp_meq = 0,
qp_positive_derivative = FALSE, qp_negative_derivative = FALSE,
qp_monotonic_increase = FALSE, qp_monotonic_decrease = FALSE,
qp_range_upper = NULL, qp_range_lower = NULL, qp_Amat_fxn = NULL,
qp_bvec_fxn = NULL, qp_meq_fxn = NULL, constraint_values = cbind(),
constraint_vectors = cbind(), return_G = TRUE, return_Ghalf = TRUE,
return_U = TRUE, estimate_dispersion = TRUE,
unbias_dispersion = TRUE,
return_varcovmat = TRUE, custom_penalty_mat = NULL,
cluster_args = c(custom_centers = NA, nstart = 10),
dummy_dividor = 1.2345672152894e-22,
dummy_adder = 2.234567210529e-18, verbose = FALSE,
verbose_tune = FALSE, expansions_only = FALSE,
observation_weights = NULL, do_not_cluster_on_these = c(),
neighbor_tolerance = 1 + 1e-16, no_intercept = FALSE,
VhalfInv = NULL, Vhalf = NULL, include_warnings = TRUE, ...)
A list containing the fitted model components, forming the core
structure used internally by lgspline
and its associated methods.
This function is primarily intended for internal use or advanced users needing
direct access to fitting components. The returned list contains numerous elements,
typically including:
The original response vector provided.
The fitted values on the original response scale.
A list, with each element the design matrix (\(\textbf{X}_k\)) for partition k.
The constraint matrix (\(\textbf{A}\)) encoding smoothness and any other linear equality constraints.
A list of the final fitted coefficient vectors (\(\boldsymbol{\beta}_k\)) for each partition k, on the original predictor/response scale.
A list of fitted coefficient vectors on the internally standardized scale used during fitting.
Key dimensions: number of internal knots (K), basis functions per partition (p), original predictors (q), total coefficients (P), and sample size (N).
A list containing the final penalty components used (e.g., Lambda
, L1
, L2
, L_predictor_list
, L_partition_list
). See compute_Lambda
.
Functions to transform predictors to/from the scale used for knot placement.
Matrix or vector of knot locations on the original predictor scale (NULL if K=0 or q > 1).
Vector assigning each original observation to a partition.
Internal representation of partition boundaries.
List containing centers, knot midpoints, neighbor info, and assignment function from partitioning (NULL if K=0 or 1D). See make_partitions
.
Internal functions for partitioning data. See knot_expand_list
.
The primary function embedded in the object for generating predictions on new data. See predict.lgspline
.
The family
object or custom list used.
Logical flags related to dispersion estimation settings.
The estimated (or fixed, if estimate_dispersion=FALSE
) dispersion parameter \(\tilde{\sigma}^2\).
Functions to convert coefficients between standardized and original scales.
Mean and standard deviation used for standardizing the response.
Information mapping original data order to partitioned order.
User-supplied additional linear equality constraints.
Scaling factors applied to basis expansions during fitting (if standardize_expansions_for_fitting=TRUE
).
Functions related to computing derivatives of the fitted spline. See take_derivative
, take_interaction_2ndderivative
, make_derivative_matrix
.
Integer vectors storing column indices identifying different types of terms in the basis expansion.
Logical indicating if variance matrix calculation was requested.
Original generated names for basis expansion columns (before potential renaming if input predictors had names).
Functions to standardize/unstandardize design matrices according to expansion_scales
.
Logical indicating if a parallel cluster was used.
The original observation weights provided (potentially reformatted).
The fixed \(\mathbf{V}^{-1/2}\) matrix if supplied for GEEs.
List containing components related to quadratic programming constraints, if used.
Matrices related to the variance-covariance structure (\(\mathbf{G}\), \(\mathbf{G}^{1/2}\), \(\mathbf{U}\)), returned if requested via corresponding arguments. See compute_G_eigen
and get_U
.
The trace term \(\text{trace}(\mathbf{X}\mathbf{U}\mathbf{G}\mathbf{X}^{T})\), used for effective degrees of freedom. See compute_trace_UGXX_wrapper
.
The final variance-covariance matrix of the estimated coefficients, \(\sigma^2 \mathbf{U}\mathbf{G}\), returned if return_varcovmat = TRUE
.
Note that the exact components returned depend heavily on the function
arguments (e.g., values of return_G
, return_varcovmat
, etc.).
If expansions_only = TRUE
, a much smaller list is returned containing
only pre-fitting components needed for inspection or setup (see lgspline
).
Default: NULL. Numeric matrix or data frame of predictor variables. Supports direct matrix input or formula interface when used with `data` argument. Must contain numeric predictors, with categorical variables pre-converted to numeric indicators.
Default: NULL. Numeric response variable vector representing the target/outcome/dependent variable etc. to be modeled.
Default: TRUE. Logical indicator controlling whether the response variable should be centered by mean and scaled by standard deviation before model fitting. When TRUE, tends to improve numerical stability. Only offered for identity link functions (when family$link == 'identity')
Default: TRUE. Logical flag determining whether predictor variables should be standardized before knot placement. Ensures consistent knot selection across different predictor scales, and that no one predictor dominates in terms of influence on knot placement. For all expansions (x), standardization corresponds to dividing by the difference in 69 and 31st percentiles = x / (quantile(x, 0.69) - quantile(x, 0.31)).
Default: TRUE. Logical switch to standardize polynomial basis expansions during model fitting. Provides computational stability during penalty tuning without affecting statistical inference, as design matrices, variance-covariance matrices, and coefficient estimates are systematically backtransformed after fitting to account for the standardization. If TRUE, \(\textbf{U}\) and \(\textbf{G}\) will remain on the transformed scale, and B_raw as returned will correspond to the coefficients fitted on the expansion-standardized scale.
Default: gaussian(). Generalized linear model (GLM) distribution family specifying the error distribution and link function for model fitting. Defaults to Gaussian distribution with identity link. Supports custom family specifications, including user-defined link functions and optional custom tuning loss criteria. Minimally requires 1) family name (family) 2) link name (link) 3) linkfun (link function) 4) linkinv (link function inverse) 5) variance (mean variance relationship function, \(\text{Var}(Y|\mu)\)).
Default: function that returns family$variance(mu) * observation_weights if weights exist, family$variance(mu) otherwise. Codes the mean-variance relationship of a GLM or GLM-like model, the diagonal \(\textbf{W}\) matrix of \(\textbf{X}^T\textbf{W}\textbf{X}\) that appears in the information. This can be replaced with a user-specified function. It is used for updating \(\textbf{G} = (\textbf{X}^{T}\textbf{W}\textbf{X} + \textbf{L})^{-1}\) after obtaining constrained estimates of coefficients. This is not used for fitting unconstrained models, but for iterating between updates of \(\textbf{U}\), \(\textbf{G}\), and beta coefficients afterwards.
Default: function that returns list of zeros. Advanced function for computing Schur complements \(\textbf{S}\) to add to \(\textbf{G}\) to properly account for uncertainty in dispersion or other nuisance parameter estimation. The effective information becomes \(\textbf{G}^* = (\textbf{G}^{-1} + \textbf{S})^{-1}\).
Default: FALSE. Logical indicator specifying whether a dispersion parameter is required for coefficient estimation. This is not needed for canonical regular exponential family models, but is often needed otherwise (such as fitting Weibull AFT models).
Default: function that returns 1. Custom function for estimating the dispersion parameter. Unless need_dispersion_for_estimation
is TRUE, this will not affect coefficient estimates.
Default: NULL. Integer specifying the number of knot locations for spline partitions. This can intuitively be considered the total number of partitions - 1.
Default: NULL. Optional matrix providing user-specified knot locations in 1-D.
Default: FALSE. Logical flag determining whether indicator variables should be used for clustering knot locations.
Default: NULL. Optional list allowing direct specification of custom partition assignments. The intent is that the make_partition_list returned by one model can be supplied here to keep the same knot locations for another.
Default: NULL. Optional list of pre-computed penalty components from previous model fits.
Default: NULL. Optional custom smoothing spline penalty matrix for fine-tuned complexity control.
Default: TRUE. Logical switch controlling whether model penalties should be automatically optimized via generalized cross-validation. Turn this off if previously_tuned_penalties are supplied AND desired, otherwise, the previously_tuned_penalties will be ignored.
Default: TRUE. Logical indicator selecting between a native implementation of damped-BFGS optimization method with analytical gradients or base R's BFGS implementation with finite-difference approximation of gradients.
Default: NULL. Numeric pseudocount used for stabilizing optimization in non-identity link function scenarios.
Default: 10*sqrt(.Machine$double.eps). Numeric convergence tolerance controlling the precision of optimization algorithms.
Default: c(-25, 20, -15, -10, -5). Numeric vector of initial grid points for wiggle penalty optimization, specified on the inverse-softplus (\(\text{softplus}(x) = \log(1+e^x)\)) scale.
Default: c(-7, 0). Numeric vector of initial grid points for ridge penalty optimization, specified on the inverse-softplus (\(\text{softplus}(x) = \log(1+e^x)\)) scale.
Default: 2e-7. Numeric penalty controlling the integrated squared second derivative, governing function smoothness. Applied to both smoothing spline penalty (alone) and is multiplied by flat_ridge_penalty
for penalizing linear terms.
Default: 0.5. Numeric flat ridge penalty for additional regularization on only intercepts and linear terms (won't affect interactions or quadratic/cubic/quartic terms by default). If custom_penalty_mat
is supplied, the penalty will be for the custom penalty matrix instead. This penalty is multiplied with wiggle_penalty
to obtain the total ridge penalty - hence, by default, the ridge penalization on linear terms is half of the magnitude of non-linear terms.
Default: TRUE. Logical flag allowing the magnitude of the smoothing spline penalty to differ across partition.
Default: TRUE. Logical flag allowing the magnitude of the smoothing spline penalty to differ between predictors.
Default: 1e-8. Numeric "meta-penalty" applied to predictor and partition penalties during tuning. The minimization of GCV is modified to be a penalized minimization problem, with penalty \(0.5 \times \text{meta\_penalty} \times (\sum \log(\text{penalty}))^2\), such that penalties are pulled towards 1 on the absolute scale and thus, their multiplicative effect towards 0.
Default: NULL. Optional list of custom penalties specified per predictor.
Default: NULL. Optional list of custom penalties specified per partition.
Default: TRUE. Logical switch to include squared predictor terms in basis expansions.
Default: TRUE. Logical switch to include cubic predictor terms in basis expansions.
Default: NULL. Logical switch to include quartic predictor terms in basis expansions. This is highly recommended for fitting models with multiple predictors to avoid over-specified constraints. When NULL (by default), will internally set to FALSE if only one predictor present, and TRUE otherwise.
Default: TRUE. Logical switch to include linear two-way interactions between predictors.
Default: TRUE. Logical switch to include three-way interactions between predictors.
Default: FALSE. Logical switch to include linear-quadratic interaction terms.
Default: Empty vector. When non-missing, this is a vector of column indices/names to include as offsets. lgspline
will automatically introduce constraints such that the coefficient for offset terms are 1.
Default: NULL. Integer vector specifying columns to retain linear terms with interactions.
Default: NULL. Integer vector specifying columns to retain only linear terms without interactions.
Default: NULL. Integer vector indicating columns to exclude from all interaction terms.
Default: NULL. Character vector specifying basis expansions to be excluded from the model. These must be named columns of the data, or in the form "_1_", "_2_", "_1_x_2_", "_2_^2" etc. where "1" and "2" indicate column indices of predictor matrix input.
Default: NULL. Optional user-defined function for generating custom basis expansions. See get_polynomial_expansions
.
Default: TRUE. Logical switch to constrain fitted values at knot points.
Default: TRUE. Logical switch to constrain first derivatives at knot points.
Default: TRUE. Logical switch to constrain second derivatives at knot points.
Default: TRUE. Logical switch to constrain interaction terms at knot points.
Default: NULL. Parallel processing cluster object for distributed computation (use parallel::makeCluster()
).
Default: NULL. Integer specifying custom fixed chunk size for parallel processing.
Default: TRUE. Logical flag to enable parallel processing for eigenvalue decomposition computations.
Default: FALSE. Logical flag to enable parallel processing for trace computation.
Default: FALSE. Logical flag to enable parallel processing for specific matrix operations involving G and A.
Default: FALSE. Logical flag to enable parallel processing for block-diagonal matrix multiplication.
Default: TRUE. Logical flag to enable parallel processing for unconstrained maximum likelihood estimation.
Default: FALSE. Logical flag to enable parallel processing for neighbor identification (which partitions are neighbors).
Default: FALSE. Logical flag to enable parallel processing for penalty matrix construction.
Default: FALSE. Logical flag to enable parallel processing for constraint matrix generation.
Default: unconstrained_fit_default
. Custom function for fitting unconstrained models per partition.
Default: FALSE. Logical flag to retain generalized linear model weights in penalty constraints using Tikhonov parameterization. It is advised to turn this to TRUE when fitting non-canonical GLMs. The default unconstrained_fit_default
by default assumes canonical GLMs for setting up estimating equations; this is not valid with non-canonical GLMs. With keep_weighted_Lambda = TRUE
, the Tikhonov parameterization binds \(\boldsymbol{\Lambda}^{1/2}\), the square-root penalty matrix, to the design matrix \(\textbf{X}_k\) for each partition k, and family$linkinv(0) to the response vector \(\textbf{y}_k\) for each partition before finding unconstrained estimates using base R's glm.fit
function. The potential issue is that the weights of the information matrix will appear in the penalty, such that the effective penalty is \(\boldsymbol{\Lambda}_\text{eff} = \textbf{L}^{1/2}\textbf{W}\textbf{L}^{1/2}\) rather than just \(\textbf{L}^{1/2}\textbf{L}^{1/2}\). If FALSE, this approach will only be used to supply initial values to a native implementation of damped Newton-Rapshon for fitting GLM models (see damped_newton_r
and unconstrained_fit_default
). For Gamma with log-link, this is fortunately a non-issue since the mean-variance relationship is essentially stabilized, so keep_weighted_Lambda = TRUE
is strongly advised.
Default: TRUE. Logical switch to use iterative optimization during penalty tuning. If FALSE, \(\textbf{G}\) and \(\textbf{U}\) are constructed from unconstrained \(\boldsymbol{\beta}\) estimates when tuning.
Default: TRUE. Logical switch to use iterative optimization for final model fitting. If FALSE, \(\textbf{G}\) and \(\textbf{U}\) are constructed from unconstrained \(\boldsymbol{\beta}\) estimates when fitting the final model after tuning.
Default: FALSE. Logical switch to abandon per-partition fitting for non-spline effects without interactions, collapse all matrices into block-diagonal single-matrix form, and fit agnostic to partition. This would be more efficient for many non-spline effects without interactions and relatively few spline effects or non-spline effects with interactions. Ignored if length(just_linear_without_interactions) = 0
after processing formulas and input.
Default: \(\textbf{X}^{T}(\textbf{y} - \text{E}[\textbf{y}])\), where \(\text{E}[\textbf{y}] = \boldsymbol{\mu}\). A function returning the score of the log-likelihood for optimization (excluding penalization/priors involving \(\boldsymbol{\Lambda}\)), which is needed for the formulation of quadratic programming problems, when blockfit = TRUE
, and correlation-structure fitting for GLMs, all relying on solve.QP
. Accepts arguments "X, y, mu, order_list, dispersion, VhalfInv, observation_weights, ..." in order. As shown in the examples below, a gamma log-link model requires \(\textbf{X}^{T}\textbf{W}(\textbf{y} - \text{E}[\textbf{y}])\) instead, with \(\textbf{W}\) a diagonal matrix of \(\text{E}[\textbf{y}]^2\) (Note: This example might be incorrect; check the specific score equation for Gamma log-link). This argument is not needed when fitting non-canonical GLMs without quadratic programming constraints or correlation structures, situations for which keep_weighted_Lambda=TRUE
is sufficient.
Default: NULL. Numeric vector of observations to apply constraints to for monotonic and range quadratic programming constraints. Useful for saving computational resources.
Default: NULL. Constraint matrix for quadratic programming formulation. The Amat
argument of solve.QP
.
Default: NULL. Constraint vector for quadratic programming formulation. The bvec
argument of solve.QP
.
Default: 0. Number of equality constraints in quadratic programming setup. The meq
argument of solve.QP
.
Default: FALSE. Logical flags to constrain the function to have positive first derivatives/be monotonically increasing using quadratic programming with respect to the order (ascending rows) of the input data set.
Default: FALSE. Logical flags to constrain the function to have negative first derivatives/be monotonically decreasing using quadratic programming with respect to the order (ascending rows) of the input data set.
Default: NULL. Numeric upper bound for constrained fitted values using quadratic programming.
Default: NULL. Numeric lower bound for constrained fitted values using quadratic programming.
Default: NULL. Custom function for generating Amat matrix in quadratic programming.
Default: NULL. Custom function for generating bvec vector in quadratic programming.
Default: NULL. Custom function for determining meq equality constraints in quadratic programming.
Default: cbind()
. Matrix of constraint values for sum constraints. The constraint enforces \(\textbf{C}^T(\boldsymbol{\beta} - \textbf{c}) = \boldsymbol{0}\) in addition to smoothing constraints, where \(\textbf{C}\) = constraint_vectors
and \(\textbf{c}\) = constraint_values
.
Default: cbind()
. Matrix of vectors for sum constraints. The constraint enforces \(\textbf{C}^T(\boldsymbol{\beta} - \textbf{c}) = \boldsymbol{0}\) in addition to smoothing constraints, where \(\textbf{C}\) = constraint_vectors
and \(\textbf{c}\) = constraint_values
.
Default: TRUE. Logical switch to return the unscaled variance-covariance matrix without smoothing constraints (\(\textbf{G}\)).
Default: TRUE. Logical switch to return the matrix square root of the unscaled variance-covariance matrix without smoothing constraints (\(\textbf{G}^{1/2}\)).
Default: TRUE. Logical switch to return the constraint projection matrix \(\textbf{U}\).
Default: TRUE. Logical flag to estimate the dispersion parameter after fitting.
Default NULL. Logical switch to multiply final dispersion estimates by \(N/(N-\text{trace}(\textbf{X}\textbf{U}\textbf{G}\textbf{X}^{T}))\), which in the case of Gaussian-distributed errors with identity link function, provides unbiased estimates of variance. When NULL (by default), gets set to TRUE for Gaussian + identity link and FALSE otherwise.
Default: TRUE. Logical switch to return the variance-covariance matrix of the estimated coefficients. This is needed for performing Wald inference.
Default: NULL. Optional \(p \times p\) custom penalty matrix for individual partitions to replace the default ridge penalty applied to linear-and-intercept terms only. This can be interpreted as proportional to the prior correlation matrix of coefficients for non-spline effects, and will appear in the penalty matrix for all partitions. It is recommended to first run the function using expansions_only = TRUE
so you have an idea of where the expansions appear in each partition, what "p" is, and you can carefully customize your penalty matrix after.
Default: c(custom_centers = NA, nstart = 10)
. Named vector of arguments controlling clustering procedures. If the first argument is not NA, this will be treated as custom cluster centers and all other arguments ignored. Otherwise, default base R k-means clustering will be used with all other arguments supplied to kmeans
(for example, by default, the "nstart" argument as provided). Custom centers must be a \(K \times q\) matrix with one column for each predictor in order of their appearance in input predictor/data, and one row for each center.
Default: 0.00000000000000000000012345672152894. Small numeric constant to prevent division by zero in computational routines.
Default: 0.000000000000000002234567210529. Small numeric constant to prevent division by zero in computational routines.
Default: FALSE. Logical flag to print general progress messages during model fitting (does not include during tuning).
Default: FALSE. Logical flag to print detailed progress messages during penalty tuning specifically.
Default: FALSE. Logical switch to return only basis expansions without full model fitting. Useful for setting up custom constraints and penalties.
Default: NULL. Numeric vector of observation-specific weights for generalized least squares estimation.
Default: c(). Vector specifying predictor columns to exclude from clustering procedures, in addition to the non-spline effects by default.
Default: 1 + 1e-8. Numeric tolerance for determining neighboring partitions using k-means clustering. Greater values means more partitions are likely to be considered neighbors. Intended for internal use only (modify at your own risk!).
Default: FALSE. Logical flag to remove intercept, constraining it to 0. The function automatically constructs constraint_vectors and constraint_values to achieve this. Calling formulas with a "0+" in it like y ~ 0 + .
will set this option to TRUE.
Default: NULL. Matrix representing a fixed, custom square-root-inverse covariance structure for the response variable of longitudinal and spatial modeling. Must be an \(N \times N\) matrix where N is number of observations. This matrix \(\textbf{V}^{-1/2}\) serves as a fixed transformation matrix for the response, equivalent to GLS with known covariance \(\textbf{V}\). This is known as "whitening" in some literature.
Default: NULL. Matrix representing a fixed, custom square-root covariance structure for the response variable of longitudinal and spatial modeling. Must be an \(N \times N\) matrix where N is number of observations. This matrix \(\textbf{V}^{1/2}\) is used when backtransforming coefficients for fitting GLMs with arbitrary correlation structure.
Default: TRUE. Logical switch to control display of warning messages during model fitting.
Additional arguments passed to the unconstrained model fitting function.