This document provides the mathematical and implementation details for Lagrangian Multiplier Smoothing Splines.
Consider a dataset with observed predictors \(\mathbf{T}\) (an \(N \times q\) matrix) and corresponding response values \(\mathbf{y}\) (an \(N \times 1\) vector). We assume the relationship follows a generalized linear model with an unknown smooth function \(f\):
$$y_i \sim \mathcal{D}(g^{-1}(f(\mathbf{t}_i)), \sigma^2)$$
where \(\mathcal{D}\) is a distribution, \(g\) is a link function, \(\mathbf{t}_i\) is the \(i\)th row of \(\mathbf{T}\), and \(\sigma^2\) is a dispersion parameter.
The objective is to estimate the unknown function \(f\) that balances:
- Goodness of fit: How well the model fits the observed data
- Smoothness: Avoiding excessive fluctuations in the estimated function
- Interpretability: Understanding the relationship between predictors and response
Lagrangian Multiplier Smoothing Splines address this by:
- Partitioning the predictor space into \(K+1\) regions
- Fitting local polynomial models within each partition
- Explicitly enforcing smoothness where partitions meet using Lagrangian multipliers
- Penalizing the integrated squared second derivative of the estimated function
Unlike other smoothing spline formulations, this technique ensures that no post-fitting algebraic rearrangement or disentanglement of a spline basis is needed to obtain interpretable models. The relationship between predictor and response is explicit, and the basis expansions for each partition are homogeneous.
Lagrangian Multiplier Smoothing Splines fit piecewise polynomial regression models to partitioned data, with smoothness at partition boundaries enforced through Lagrangian multipliers. The fit is penalized by the integrated squared second derivative of the estimated function.
Unlike traditional smoothing splines that implicitly derive piecewise polynomials through optimization, or regression splines using specialized bases (e.g., B-splines), this method:
- Explicitly represents polynomial basis functions in a natural form
- Uses Lagrangian multipliers to enforce smoothness constraints
- Maintains interpretability of coefficient estimates
The fitted model is directly interpretable without the need for reparameterization following fitting. This implementation accommodates non-spline terms and interactions, GLMs, correlation structures, and inequality constraints in addition to linear regression assuming Gaussian response. Extensive customization is offered for users to adapt lgspline for their own modelling frameworks.
Core notation:
- \(\mathbf{y}_{(N \times 1)}\): Response vector
- \(\mathbf{T}_{(N \times q)}\): Matrix of predictors
- \(\mathbf{X}_{(N \times P)}\): Block-diagonal matrix of polynomial expansions
- \(\mathbf{W}_{(N \times N)}\): Weight matrix
- \(\boldsymbol{\Lambda}_{(P \times P)}\): Penalty matrix
- \(\tilde{\boldsymbol{\beta}}_{(P \times 1)}\): Constrained coefficient estimates
- \(\mathbf{G}_{(P \times P)}\): \((\mathbf{X}^T\mathbf{W}\mathbf{X} + \boldsymbol{\Lambda})^{-1}\)
- \(\mathbf{A}_{(P \times r)}\): Constraint matrix ensuring smoothness
- \(\mathbf{U}_{(P \times P)}\): \(\mathbf{I} - \mathbf{G}\mathbf{A}(\mathbf{A}^T\mathbf{G}\mathbf{A})^{-1}\mathbf{A}^T\)
The method decomposes the predictor space into K+1 partitions and fits polynomial regression models within each partition, constraining the fitted function to be smooth at partition boundaries.
For a single predictor, the function within each partition k is represented as: $$f_k(t) = \beta_k^{(0)} + \beta_k^{(1)}t + \beta_k^{(2)}t^2 + \beta_k^{(3)}t^3 + ...$$
More generally, for each partition k, the model takes the form: $$f_k(\mathbf{t}) = \mathbf{x}^T\boldsymbol{\beta}_k$$
where \(\mathbf{x}\) contains polynomial basis functions (intercept, linear, quadratic, cubic terms, and their interactions) and \(\boldsymbol{\beta}_k\) are the corresponding coefficients.
Smoothness constraints enforce that the function value, first and second derivatives match at adjacent partition boundaries: $$f_k(t_k) = f_{k+1}(t_k)$$ $$f'_k(t_k) = f'_{k+1}(t_k)$$ $$f''_k(t_k) = f''_{k+1}(t_k)$$
These constraints are expressed as linear equations in the \(\mathbf{A}\) matrix such that \(\mathbf{A}^T\boldsymbol{\beta} = \mathbf{0}\) implies the smoothness conditions are satisfied.
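As a concrete illustration (not the package's internal construction), the rows of \(\mathbf{A}^T\) enforcing these three conditions at a knot \(t_k\), for two adjacent cubic partitions with stacked coefficients \((\boldsymbol{\beta}_k, \boldsymbol{\beta}_{k+1})\), can be written as:

```r
## Minimal sketch: constraint rows enforcing continuity of f, f', f'' at a knot t_k
## for two adjacent cubic partitions with basis (1, t, t^2, t^3).
## Illustrates the linear-algebra idea only; lgspline builds A internally.
smoothness_constraints <- function(t_k) {
  basis    <- c(1, t_k, t_k^2, t_k^3)      # f(t_k)
  d_basis  <- c(0, 1, 2 * t_k, 3 * t_k^2)  # f'(t_k)
  dd_basis <- c(0, 0, 2, 6 * t_k)          # f''(t_k)
  # Each row dotted with c(beta_k, beta_{k+1}) equals the jump across the knot
  rbind(c(basis,    -basis),
        c(d_basis,  -d_basis),
        c(dd_basis, -dd_basis))
}
smoothness_constraints(t_k = 0.5)
```

Setting each row's product with the stacked coefficient vector to zero enforces the corresponding smoothness condition.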
The estimation procedure follows these key steps (a dense-matrix R sketch of the Gaussian case follows the list):
1. Unconstrained estimation: $$\hat{\boldsymbol{\beta}} = \mathbf{G}\mathbf{X}^T\mathbf{y}$$ where \(\mathbf{G} = (\mathbf{X}^T\mathbf{W}\mathbf{X} + \boldsymbol{\Lambda})^{-1}\) for weight matrix \(\mathbf{W}\) and penalty matrix \(\boldsymbol{\Lambda}\).
2. Apply smoothness constraints: $$\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}} - \mathbf{G}\mathbf{A}(\mathbf{A}^T\mathbf{G}\mathbf{A})^{-1}\mathbf{A}^T\hat{\boldsymbol{\beta}}$$ This can be computed efficiently as: $$\tilde{\boldsymbol{\beta}} = \mathbf{U}\hat{\boldsymbol{\beta}}$$ where \(\mathbf{U} = \mathbf{I} - \mathbf{G}\mathbf{A}(\mathbf{A}^T\mathbf{G}\mathbf{A})^{-1}\mathbf{A}^T\).
3. For GLMs, iterative refinement:
   - Update \(\mathbf{G}\) based on current estimates
   - Recompute \(\tilde{\boldsymbol{\beta}}\) using the new \(\mathbf{G}\)
   - Continue until convergence
4. For inequality constraints (shape or range constraints):
   - Sequential quadratic programming using `solve.QP`
   - Initial values from the equality-constrained estimates
5. Variance estimation: $$\tilde{\sigma}^2 = \frac{\|\mathbf{y} - \mathbf{X}\tilde{\boldsymbol{\beta}}\|^2}{N - \text{trace}(\mathbf{X}\mathbf{U}\mathbf{G}\mathbf{X}^T)}$$
The variance-covariance matrix of \(\tilde{\boldsymbol{\beta}}\) is estimated as: $$\text{Var}(\tilde{\boldsymbol{\beta}}) = \tilde{\sigma}^2\mathbf{U}\mathbf{G}$$
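Below is a minimal dense-matrix sketch of steps 1, 2, and 5 for the unweighted Gaussian case (\(\mathbf{W} = \mathbf{I}\)), assuming \(\mathbf{X}\), \(\mathbf{y}\), \(\boldsymbol{\Lambda}\), and \(\mathbf{A}\) have already been assembled; it ignores the partition-wise blocking and parallelism the package exploits:

```r
## Dense-matrix sketch of steps 1, 2, and 5 for the unweighted Gaussian case (W = I).
fit_constrained <- function(X, y, Lambda, A) {
  G <- solve(crossprod(X) + Lambda)                     # (X'X + Lambda)^{-1}
  beta_hat <- G %*% crossprod(X, y)                     # step 1: unconstrained estimate
  U <- diag(ncol(X)) - G %*% A %*% solve(t(A) %*% G %*% A, t(A))
  beta_tilde <- U %*% beta_hat                          # step 2: apply smoothness constraints
  edf <- sum(diag(X %*% U %*% G %*% t(X)))              # trace(X U G X')
  sigma2 <- sum((y - X %*% beta_tilde)^2) / (nrow(X) - edf)  # step 5: dispersion
  list(beta = beta_tilde, sigma2 = sigma2,
       vcov = sigma2 * U %*% G)                         # Var(beta_tilde)
}
```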
This approach offers computational efficiency through:
- Parallel computation for each partition
- Inversion of only small matrices (one per partition)
- Efficient calculation of the \(\mathbf{X}\mathbf{U}\mathbf{G}\mathbf{X}^T\) trace
For a single predictor, knots are placed at evenly-spaced quantiles of the predictor variable (see the sketch after this list):
- The predictor range is divided into K+1 regions using K interior knots
- Each observation is assigned to a partition based on these knot boundaries
- Custom knots can be specified through the `custom_knots` parameter
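A small illustration of the default quantile-based placement (illustrative only; the package handles this internally):

```r
## Illustrative placement of K interior knots at evenly spaced quantiles of x
K <- 4
x <- rnorm(200)
knots <- quantile(x, probs = seq(0, 1, length.out = K + 2)[-c(1, K + 2)])
partition <- findInterval(x, knots) + 1   # partition index in 1..(K+1)
table(partition)
```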
For multiple predictors, a k-means clustering approach is used (the midpoint criterion is sketched after this list):
- K+1 cluster centers are identified using k-means on standardized predictors
- Neighboring centers are determined using a midpoint criterion: centers i and j are neighbors if the midpoint between them is closer to both i and j than to any other center
- Observations are assigned to the nearest cluster center
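The midpoint criterion itself can be sketched as follows (a simplified illustration, not the package's implementation):

```r
## Sketch of the midpoint neighbor criterion for k-means centers.
## centers: (K+1) x q matrix of cluster centers on standardized predictors.
are_neighbors <- function(centers, i, j) {
  mid <- (centers[i, ] + centers[j, ]) / 2
  d <- sqrt(rowSums(sweep(centers, 2, mid)^2))  # distance from midpoint to every center
  # neighbors if no other center is closer to the midpoint than centers i and j are
  all(d[-c(i, j)] >= max(d[i], d[j]))
}

set.seed(1)
centers <- matrix(rnorm(10), ncol = 2)  # 5 centers in 2 standardized predictors
are_neighbors(centers, 1, 2)
```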
The default number of knots (K) is determined adaptively based on:
- Sample size (N)
- Number of basis terms per partition (p)
- Number of predictors (q)
- Model complexity (GLM family)
The package provides interfaces for extending to custom distributions and estimation procedures.
A list containing GLM components (a hypothetical skeleton is sketched after this list):
- Character name of the distribution
- Character name of the link function
- Function transforming the response to the linear predictor scale
- Function transforming the linear predictor to the response scale
- Optional function for the GCV optimization criterion
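A hypothetical skeleton of such a list is shown below. The element names used here (`family`, `link`, `linkfun`, `linkinv`) mirror base R's family objects and are assumptions rather than lgspline's documented names:

```r
## Hypothetical skeleton of a custom GLM specification; element names follow
## base R family objects and are assumptions, not lgspline's documented API.
custom_family <- list(
  family  = "gaussian",       # character name of the distribution
  link    = "identity",       # character name of the link function
  linkfun = function(mu) mu,  # response scale -> linear predictor scale
  linkinv = function(eta) eta # linear predictor scale -> response scale
)
```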
Core function for unconstrained coefficient estimation per partition:
```r
function(X, y, LambdaHalf, Lambda, keep_weighted_Lambda, family,
         tol, K, parallel, cl, chunk_size, num_chunks, rem_chunks,
         order_indices, weights, status, ...) {
  # Returns a p-length coefficient vector
}
```
Computes scale/dispersion parameter:
```r
function(mu, y, order_indices, family, observation_weights, ...) {
  # Returns a scalar dispersion estimate
  # Defaults to 1 (no dispersion)
}
```
Computes observation weights:
```r
function(mu, y, order_indices, family, dispersion,
         observation_weights, ...) {
  # Returns a vector of weights
  # Defaults to the family variance with optional weights
}
```
Adjusts covariance matrix for parameter uncertainty:
```r
function(X, y, B, dispersion, order_list, K, family,
         observation_weights, ...) {
  # Required for valid standard errors when dispersion affects coefficient estimation
}
```
Linear constraints enforce exact relationships between coefficients, implemented through the Lagrangian multiplier approach and integrated with smoothness constraints.
Constraints are specified as \(\mathbf{h}^T\boldsymbol{\beta} = \mathbf{h}^T\boldsymbol{\beta}_0\) via:
```r
lgspline(
  y ~ spl(x),
  constraint_vectors = h,      # Matrix of constraint vectors
  constraint_values  = beta0   # Target values
)
```
Common applications include:
- No-intercept models (via `no_intercept = TRUE`)
- Offset terms with coefficients constrained to 1 (sketched below)
- Hypothesis testing with nested models
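As a schematic example of the second case, one coefficient can be pinned to 1 by building \(\mathbf{h}\) and \(\boldsymbol{\beta}_0\) as below. The coefficient index used is purely illustrative (it depends on the model's coefficient layout), and whether `constraint_values` expects the full \(\boldsymbol{\beta}_0\) vector or the targets \(\mathbf{h}^T\boldsymbol{\beta}_0\) should be checked against the package documentation:

```r
## Schematic only: pin the j-th coefficient at 1 (as for an offset term).
## The index j is illustrative; it depends on how coefficients are laid out.
P <- 20; j <- 2
h     <- matrix(0, nrow = P, ncol = 1); h[j, 1] <- 1  # constraint vector
beta0 <- rep(0, P);                     beta0[j] <- 1 # so h' beta = h' beta0 = 1

# fit <- lgspline(y ~ spl(x), constraint_vectors = h, constraint_values = beta0)
```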
Inequality constraints enforce bounds or directional relationships on the fitted function through sequential quadratic programming.
Built-in constraints include:
- Range constraints: `qp_range_lower` and `qp_range_upper`
- Monotonicity: `qp_monotonic_increase` or `qp_monotonic_decrease`
- Derivative constraints: `qp_positive_derivative` or `qp_negative_derivative`
Custom inequality constraints can be defined through the following functions (a minimal sketch follows this list):
- `qp_Amat_fxn`: Function returning the constraint matrix
- `qp_bvec_fxn`: Function returning the constraint vector
- `qp_meq_fxn`: Function returning the number of equality constraints
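A minimal sketch of the shapes these might take is below, encoding a simple nonnegativity constraint on all coefficients in `solve.QP`'s convention (constraints read as `t(Amat) %*% beta >= bvec`, with the first `meq` rows treated as equalities). The argument lists shown are placeholders; the exact signatures expected by `lgspline` should be taken from the package documentation:

```r
## Sketch: nonnegativity of all P coefficients in solve.QP's convention,
## t(Amat) %*% beta >= bvec, with the first `meq` constraints as equalities.
## Argument lists are placeholders, not lgspline's documented signatures.
qp_Amat_fxn <- function(P, ...) diag(P)    # one constraint column per coefficient
qp_bvec_fxn <- function(P, ...) rep(0, P)  # lower bound of 0 for each coefficient
qp_meq_fxn  <- function(...) 0             # no equality rows; all are inequalities
```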
Correlation patterns in clustered data are modeled using a matrix whitening approach based on restricted maximum likelihood (REML). The correlation structure is parameterized through the square root inverse matrix \(\mathbf{V}^{-1/2}\).
Built-in correlation structures include:
- Exchangeable: Constant correlation within clusters
- Spatial Exponential: Exponential decay with distance
- AR(1): Correlation based on rank differences
- Gaussian/Squared Exponential: Smooth decay with squared distance
- Spherical: Polynomial decay with exact cutoff
- Matern: Flexible correlation with adjustable smoothness
- Gamma-Cosine: Combined gamma decay with oscillation
- Gaussian-Cosine: Combined Gaussian decay with oscillation
The REML objective function combines correlation structure, parameter estimates, and smoothness constraints:
$$\frac{1}{N}\big(-\log|\mathbf{V}^{-1/2}| + \frac{N}{2}\log(\sigma^2) + \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\tilde{\boldsymbol{\beta}})^{T}\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\tilde{\boldsymbol{\beta}}) + \frac{1}{2}\log|\sigma^2\mathbf{U}\mathbf{G} | \big)$$
Analytical gradients are provided for efficient optimization of correlation parameters.
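For reference, the objective transcribes directly into R for a dense \(\mathbf{V}^{-1/2}\) (a sketch only; the package uses more efficient representations and analytical gradients):

```r
## Dense-matrix transcription of the REML objective above.
## VhalfInv = V^{-1/2}, UG = U %*% G, beta = constrained coefficients, sigma2 = dispersion.
reml_objective <- function(y, X, beta, sigma2, VhalfInv, UG) {
  N <- length(y)
  r <- VhalfInv %*% (y - X %*% beta)  # whitened residuals, so sum(r^2) = e' V^{-1} e
  (1 / N) * (
    -as.numeric(determinant(VhalfInv, logarithm = TRUE)$modulus) +
      (N / 2) * log(sigma2) +
      sum(r^2) / (2 * sigma2) +
      0.5 * as.numeric(determinant(sigma2 * UG, logarithm = TRUE)$modulus)
  )
}
```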
Custom correlation structures can be specified through the following arguments (a standalone sketch of an exchangeable \(\mathbf{V}^{-1/2}\) follows this list):
- `VhalfInv_fxn`: Creates \(\mathbf{V}^{-1/2}\)
- `Vhalf_fxn`: Creates \(\mathbf{V}^{1/2}\) (for non-Gaussian responses)
- `REML_grad`: Provides an analytical gradient
- `VhalfInv_logdet`: Efficient determinant computation
- `custom_VhalfInv_loss`: Alternative optimization objective
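As an example, an exchangeable (compound-symmetry) \(\mathbf{V}^{-1/2}\) can be built block-by-block from cluster labels via eigendecomposition. The helper below is a standalone sketch; adapting it to the argument list expected by `VhalfInv_fxn` is left to the package documentation:

```r
## Standalone sketch: V^{-1/2} for exchangeable (compound-symmetry) correlation,
## built block-by-block from cluster labels via eigendecomposition.
exchangeable_VhalfInv <- function(cluster, rho) {
  N <- length(cluster)
  VhalfInv <- matrix(0, N, N)
  for (g in unique(cluster)) {
    idx <- which(cluster == g)
    n <- length(idx)
    V <- (1 - rho) * diag(n) + rho * matrix(1, n, n)  # within-cluster correlation block
    e <- eigen(V, symmetric = TRUE)
    VhalfInv[idx, idx] <- e$vectors %*% diag(1 / sqrt(e$values), n) %*% t(e$vectors)
  }
  VhalfInv
}

VhalfInv <- exchangeable_VhalfInv(cluster = rep(1:3, each = 4), rho = 0.4)
```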
The penalty matrix \(\boldsymbol{\Lambda}\) is constructed as the integrated squared second derivative of the estimated function over the support of the predictors:
$$\int_a^b \|f''(t)\|^2 dt = \int_a^b \|\mathbf{x}''^\top \boldsymbol{\beta}_k\|^2 dt = \boldsymbol{\beta}_k^{T}\left\{\int_a^b \mathbf{x}''\mathbf{x}''^{\top} dt\right\} \boldsymbol{\beta}_k = \boldsymbol{\beta}_k^{T}\boldsymbol{\Lambda}_s\boldsymbol{\beta}_k$$
For a one-dimensional cubic function, the second derivative basis is \(\mathbf{x}'' = (0, 0, 2, 6t)^{T}\) and the penalty matrix has a closed-form expression:
$$\boldsymbol{\Lambda}_s = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 4(b-a) & 6(b^2-a^2) \\ 0 & 0 & 6(b^2-a^2) & 12(b^3-a^3) \end{pmatrix}$$
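This expression is easy to check numerically; the sketch below builds \(\boldsymbol{\Lambda}_s\) for a cubic basis on \([a, b]\) directly from the closed form:

```r
## Closed-form smoothing penalty for the cubic basis (1, t, t^2, t^3) on [a, b],
## i.e. the integral of x'' x''^T with x'' = (0, 0, 2, 6t).
cubic_penalty <- function(a, b) {
  matrix(c(0, 0, 0,               0,
           0, 0, 0,               0,
           0, 0, 4 * (b - a),     6 * (b^2 - a^2),
           0, 0, 6 * (b^2 - a^2), 12 * (b^3 - a^3)),
         nrow = 4, byrow = TRUE)
}
cubic_penalty(0, 1)
# The (4, 4) entry matches integrate(function(t) (6 * t)^2, 0, 1)$value, i.e. 12.
```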
The full penalty matrix combines the smoothing spline penalty with a ridge penalty on non-spline terms and optional predictor-specific or partition-specific penalties:
$$\boldsymbol{\Lambda} = \lambda_s (\boldsymbol{\Lambda}_s + \lambda_r\boldsymbol{\Lambda}_r + \sum_{l=1}^L \lambda_l \boldsymbol{\Lambda}_l)$$
where:
- \(\lambda_s\) is the global smoothing parameter (`wiggle_penalty`)
- \(\boldsymbol{\Lambda}_s\) is the smoothing spline penalty
- \(\boldsymbol{\Lambda}_r\) is a ridge penalty on lower-order terms, weighted by \(\lambda_r\) (`flat_ridge_penalty`)
- \(\lambda_l\) are optional predictor/partition-specific penalties
This penalty structure defines a meaningful metric for function smoothness, pulling estimates toward linear functions rather than simply shrinking coefficients toward zero.
A penalized variant of Generalized Cross Validation (GCV) is used to find optimal penalties:
$$\text{GCV} = \frac{\sum_{i=1}^N r_i^2}{N \big(1-\frac{1}{N}\text{trace}(\mathbf{X}\mathbf{U}\mathbf{G}\mathbf{X}^T) \big)^2} + \frac{mp}{2} \sum_{l=1}^{L}(\log (1+\lambda_{l})-1)^2$$
where:
- \(r_i = y_i - \tilde{y}_i\) are residuals (or a custom alternative for GLMs)
- \(\text{trace}(\mathbf{X}\mathbf{U}\mathbf{G}\mathbf{X}^T)\) represents the effective degrees of freedom
- \(mp\) is the "meta-penalty" weight that regularizes predictor/partition-specific penalties
For GLMs, a pseudo-count approach is used to ensure valid link transformations, or `custom_dev.resids` can be provided to replace the sum of squared errors.
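A direct transcription of the criterion for the Gaussian case (residuals on the response scale) is shown below; `edf` stands for \(\text{trace}(\mathbf{X}\mathbf{U}\mathbf{G}\mathbf{X}^T)\) and `mp` for the meta-penalty weight:

```r
## Direct transcription of the penalized GCV criterion (Gaussian case).
## r: residuals, edf: trace(XUGX'), lambdas: lambda_l values, mp: meta-penalty weight.
penalized_gcv <- function(r, edf, lambdas, mp) {
  N <- length(r)
  sum(r^2) / (N * (1 - edf / N)^2) +
    (mp / 2) * sum((log(1 + lambdas) - 1)^2)
}
```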
Optimization employs:
- Grid search for initial values
- Damped BFGS with analytical gradients
- Automated restarts and error handling
- An inflation factor of \(((N+2)/(N-2))^2\) applied to the final penalties