smooth.spline
Fit a Smoothing Spline
Fits a cubic smoothing spline to the supplied data.
- Keywords
- smooth
Usage
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL, lambda = NULL, cv = FALSE,
all.knots = FALSE, nknots = .nknots.smspl,
keep.data = TRUE, df.offset = 0, penalty = 1,
control.spar = list(), tol = 1e-6 * IQR(x), keep.stuff = FALSE)
Arguments
- x
a vector giving the values of the predictor variable, or a list or a two-column matrix specifying x and y.
- y
responses. If
y
is missing orNULL
, the responses are assumed to be specified byx
, withx
the index vector.- w
optional vector of weights of the same length as
x
; defaults to all 1.- df
the desired equivalent number of degrees of freedom (trace of the smoother matrix). Must be in \((1,n_x]\), \(n_x\) the number of unique x values, see below.
- spar
smoothing parameter, typically (but not necessarily) in \((0,1]\). When
spar
is specified, the coefficient \(\lambda\) of the integral of the squared second derivative in the fit (penalized log likelihood) criterion is a monotone function ofspar
, see the details below. Alternativelylambda
may be specified instead of the scale freespar
=\(s\).- lambda
if desired, the internal (design-dependent) smoothing parameter \(\lambda\) can be specified instead of
spar
. This may be desirable for resampling algorithms such as cross validation or the bootstrap.- cv
ordinary leave-one-out (
TRUE
) or ‘generalized’ cross-validation (GCV) whenFALSE
; is used for smoothing parameter computation only when bothspar
anddf
are not specified; it is used however to determinecv.crit
in the result. Setting it toNA
for speedup skips the evaluation of leverages and any score.- all.knots
if
TRUE
, all distinct points inx
are used as knots. IfFALSE
(default), a subset ofx[]
is used, specificallyx[j]
where thenknots
indices are evenly spaced in1:n
, see also the next argumentnknots
.Alternatively, a strictly increasing
numeric
vector specifying “all the knots” to be used; must be rescaled to \([0, 1]\) already such that it corresponds to theans $ fit$knots
sequence returned, not repeating the boundary knots.- nknots
integer or
function
giving the number of knots to use whenall.knots = FALSE
. If a function (as by default), the number of knots isnknots(nx)
. By default for \(n_x > 49\) this is less than \(n_x\), the number of uniquex
values, see the Note.- keep.data
logical specifying if the input data should be kept in the result. If
TRUE
(as per default), fitted values and residuals are available from the result.- df.offset
allows the degrees of freedom to be increased by
df.offset
in the GCV criterion.- penalty
the coefficient of the penalty for degrees of freedom in the GCV criterion.
- control.spar
optional list with named components controlling the root finding when the smoothing parameter
spar
is computed, i.e., missing orNULL
, see below.Note that this is partly experimental and may change with general spar computation improvements!
- low:
lower bound for
spar
; defaults to -1.5 (used to implicitly default to 0 in R versions earlier than 1.4).- high:
upper bound for
spar
; defaults to +1.5.- tol:
the absolute precision (tolerance) used; defaults to 1e-4 (formerly 1e-3).
- eps:
the relative precision used; defaults to 2e-8 (formerly 0.00244).
- trace:
logical indicating if iterations should be traced.
- maxit:
integer giving the maximal number of iterations; defaults to 500.
Note that
spar
is only searched for in the interval \([low, high]\).- tol
a tolerance for same-ness or uniqueness of the
x
values. The values are binned into bins of sizetol
and values which fall into the same bin are regarded as the same. Must be strictly positive (and finite).- keep.stuff
an experimental
logical
indicating if the result should keep extras from the internal computations. Should allow to reconstruct the \(X\) matrix and more.
Details
Neither x
nor y
are allowed to containing missing or
infinite values.
The x
vector should contain at least four distinct values.
‘Distinct’ here is controlled by tol
: values which are
regarded as the same are replaced by the first of their values and the
corresponding y
and w
are pooled accordingly.
Unless lambda
has been specified instead of spar
,
the computational \(\lambda\) used (as a function of
\(s=spar\)) is
\(\lambda = r * 256^{3 s - 1}\)
where
\(r = tr(X' W X) / tr(\Sigma)\),
\(\Sigma\) is the matrix given by
\(\Sigma_{ij} = \int B_i''(t) B_j''(t) dt\),
\(X\) is given by \(X_{ij} = B_j(x_i)\),
\(W\) is the diagonal matrix of weights (scaled such that
its trace is \(n\), the original number of observations)
and \(B_k(.)\) is the \(k\)-th B-spline.
Note that with these definitions, \(f_i = f(x_i)\), and the B-spline basis representation \(f = X c\) (i.e., \(c\) is the vector of spline coefficients), the penalized log likelihood is \(L = (y - f)' W (y - f) + \lambda c' \Sigma c\), and hence \(c\) is the solution of the (ridge regression) \((X' W X + \lambda \Sigma) c = X' W y\).
If spar
and lambda
are missing or NULL
, the value
of df
is used to determine the degree of smoothing. If
df
is missing as well, leave-one-out cross-validation (ordinary
or ‘generalized’ as determined by cv
) is used to
determine \(\lambda\).
Note that from the above relation,
spar
is \(s = s0 + 0.0601 * \bold{\log}\lambda\),
which is intentionally different from the S-PLUS implementation
of smooth.spline
(where spar
is proportional to
\(\lambda\)). In R's (\(\log \lambda\)) scale, it makes more
sense to vary spar
linearly.
Note however that currently the results may become very unreliable
for spar
values smaller than about -1 or -2. The same may
happen for values larger than 2 or so. Don't think of setting
spar
or the controls low
and high
outside such a
safe range, unless you know what you are doing!
Similarly, specifying lambda
instead of spar
is
delicate, notably as the range of “safe” values for
lambda
is not scale-invariant and hence entirely data dependent.
The ‘generalized’ cross-validation method GCV will work correctly when
there are duplicated points in x
. However, it is ambiguous what
leave-one-out cross-validation means with duplicated points, and the
internal code uses an approximation that involves leaving out groups
of duplicated points. cv = TRUE
is best avoided in that case.
Value
An object of class "smooth.spline"
with components
the distinct x
values in increasing order, see
the ‘Details’ above.
the fitted values corresponding to x
.
the weights used at the unique values of x
.
the y values used at the unique y
values.
the tol
argument (whose default depends on x
).
only if keep.data = TRUE
: itself a
list
with components x
, y
and w
of the same length. These are the original \((x_i,y_i,w_i),
i = 1, \dots, n\), values where data$x
may have repeated values and
hence be longer than the above x
component; see details.
(when cv
was not NA
) leverages, the diagonal
values of the smoother matrix.
cross-validation score, ‘generalized’ or true, depending
on cv
. The CV score is often called “PRESS” (and
labeled on print()
), for ‘PREdiction
Sum of Squares’.
the penalized criterion, a non-negative number; simply
the (weighted) residual sum of squares (RSS), sum(.$w * residuals(.)^2)
.
the criterion value minimized in the underlying
.Fortran
routine sslvrg
. When df
has been specified,
the criterion is \(3 + (tr(S_\lambda) - df)^2\),
where the \(3 +\) is there for numerical (and historical) reasons.
equivalent degrees of freedom used. Note that (currently)
this value may become quite imprecise when the true df
is
between and 1 and 2.
the value of spar
computed or given, unless it has been
given as c(lambda = *)
, when it set to NA
here.
(when spar
above is not NA
), the value
\(r\), the ratio of two matrix traces.
the value of \(\lambda\) corresponding to spar
,
see the details above.
named integer(3) vector where ..$ipars["iter"]
gives number of spar computing iterations used.
experimental; when keep.stuff
was true, a
“flat” numeric vector containing parts of the internal computations.
list for use by predict.smooth.spline
, with
components
- knot:
the knot sequence (including the repeated boundary knots), scaled into \([0, 1]\) (via
min
andrange
).- nk:
number of coefficients or number of ‘proper’ knots plus 2.
- coef:
coefficients for the spline basis used.
- min, range:
numbers giving the corresponding quantities of
x
.
the matched call.
method(class = "smooth.spline") shows a hatvalues() method based on the lev vector above.
Note
The number of unique x
values, \(\code{nx} = n_x\), are
determined by the tol
argument, equivalently to
nx <- length(x) - sum(duplicated( round((x - mean(x)) / tol) ))
The default all.knots = FALSE
and nknots = .nknots.smspl
,
entails using only \(O({n_x}^{0.2})\)
knots instead of \(n_x\) for \(n_x > 49\). This cuts
speed and memory requirements, but not drastically anymore since R
version 1.5.1 where it is only \(O(n_k) + O(n)\) where
\(n_k\) is the number of knots.
In this case where not all unique x
values are
used as knots, the result is not a smoothing spline in the strict
sense, but very close unless a small smoothing parameter (or large
df
) is used.
References
Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S, Wadsworth & Brooks/Cole.
Green, P. J. and Silverman, B. W. (1994) Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall.
Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models. Chapman and Hall.
See Also
predict.smooth.spline
for evaluating the spline
and its derivatives.
Examples
library(stats)
# NOT RUN {
require(graphics)
plot(dist ~ speed, data = cars, main = "data(cars) & smoothing splines")
cars.spl <- with(cars, smooth.spline(speed, dist))
cars.spl
## This example has duplicate points, so avoid cv = TRUE
# }
# NOT RUN {
lines(cars.spl, col = "blue")
ss10 <- smooth.spline(cars[,"speed"], cars[,"dist"], df = 10)
lines(ss10, lty = 2, col = "red")
legend(5,120,c(paste("default [C.V.] => df =",round(cars.spl$df,1)),
"s( * , df = 10)"), col = c("blue","red"), lty = 1:2,
bg = 'bisque')
## Residual (Tukey Anscombe) plot:
plot(residuals(cars.spl) ~ fitted(cars.spl))
abline(h = 0, col = "gray")
## consistency check:
stopifnot(all.equal(cars$dist,
fitted(cars.spl) + residuals(cars.spl)))
## The chosen inner knots in original x-scale :
with(cars.spl$fit, min + range * knot[-c(1:3, nk+1 +1:3)]) # == unique(cars$speed)
## Visualize the behavior of .nknots.smspl()
nKnots <- Vectorize(.nknots.smspl) ; c.. <- adjustcolor("gray20",.5)
curve(nKnots, 1, 250, n=250)
abline(0,1, lty=2, col=c..); text(90,90,"y = x", col=c.., adj=-.25)
abline(h=100,lty=2); abline(v=200, lty=2)
n <- c(1:799, seq(800, 3490, by=10), seq(3500, 10000, by = 50))
plot(n, nKnots(n), type="l", main = "Vectorize(.nknots.smspl) (n)")
abline(0,1, lty=2, col=c..); text(180,180,"y = x", col=c..)
n0 <- c(50, 200, 800, 3200); c0 <- adjustcolor("blue3", .5)
lines(n0, nKnots(n0), type="h", col=c0)
axis(1, at=n0, line=-2, col.ticks=c0, col=NA, col.axis=c0)
axis(4, at=.nknots.smspl(10000), line=-.5, col=c..,col.axis=c.., las=1)
##-- artificial example
y18 <- c(1:3, 5, 4, 7:3, 2*(2:5), rep(10, 4))
xx <- seq(1, length(y18), len = 201)
(s2 <- smooth.spline(y18)) # GCV
(s02 <- smooth.spline(y18, spar = 0.2))
(s02. <- smooth.spline(y18, spar = 0.2, cv = NA))
plot(y18, main = deparse(s2$call), col.main = 2)
lines(s2, col = "gray"); lines(predict(s2, xx), col = 2)
lines(predict(s02, xx), col = 3); mtext(deparse(s02$call), col = 3)
## Specifying 'lambda' instead of usual spar :
(s2. <- smooth.spline(y18, lambda = s2$lambda, tol = s2$tol))
# }
# NOT RUN {
## The following shows the problematic behavior of 'spar' searching:
(s2 <- smooth.spline(y18, control =
list(trace = TRUE, tol = 1e-6, low = -1.5)))
(s2m <- smooth.spline(y18, cv = TRUE, control =
list(trace = TRUE, tol = 1e-6, low = -1.5)))
## both above do quite similarly (Df = 8.5 +- 0.2)
# }