# smooth.spline

##### Fit a Smoothing Spline

Fits a cubic smoothing spline to the supplied data.

- Keywords
- smooth

##### Usage

```
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL, lambda = NULL, cv = FALSE,
all.knots = FALSE, nknots = .nknots.smspl,
keep.data = TRUE, df.offset = 0, penalty = 1,
control.spar = list(), tol = 1e-6 * IQR(x), keep.stuff = FALSE)
```

##### Arguments

- x
a vector giving the values of the predictor variable, or a list or a two-column matrix specifying x and y.

- y
responses. If

`y`

is missing or`NULL`

, the responses are assumed to be specified by`x`

, with`x`

the index vector.- w
optional vector of weights of the same length as

`x`

; defaults to all 1.- df
the desired equivalent number of degrees of freedom (trace of the smoother matrix). Must be in \((1,n_x]\), \(n_x\) the number of unique x values, see below.

- spar
smoothing parameter, typically (but not necessarily) in \((0,1]\). When

`spar`

is specified, the coefficient \(\lambda\) of the integral of the squared second derivative in the fit (penalized log likelihood) criterion is a monotone function of`spar`

, see the details below. Alternatively`lambda`

may be specified instead of the*scale free*`spar`

=\(s\).- lambda
if desired, the internal (design-dependent) smoothing parameter \(\lambda\) can be specified instead of

`spar`

. This may be desirable for resampling algorithms such as cross validation or the bootstrap.- cv
ordinary leave-one-out (

`TRUE`

) or ‘generalized’ cross-validation (GCV) when`FALSE`

; is used for smoothing parameter computation only when both`spar`

and`df`

are not specified; it is used however to determine`cv.crit`

in the result. Setting it to`NA`

for speedup skips the evaluation of leverages and any score.- all.knots
if

`TRUE`

, all distinct points in`x`

are used as knots. If`FALSE`

(default), a subset of`x[]`

is used, specifically`x[j]`

where the`nknots`

indices are evenly spaced in`1:n`

, see also the next argument`nknots`

.Alternatively, a strictly increasing

`numeric`

vector specifying “all the knots” to be used; must be rescaled to \([0, 1]\) already such that it corresponds to the`ans $ fit$knots`

sequence returned, not repeating the boundary knots.- nknots
integer or

`function`

giving the number of knots to use when`all.knots = FALSE`

. If a function (as by default), the number of knots is`nknots(nx)`

. By default for \(n_x > 49\) this is less than \(n_x\), the number of unique`x`

values, see the Note.- keep.data
logical specifying if the input data should be kept in the result. If

`TRUE`

(as per default), fitted values and residuals are available from the result.- df.offset
allows the degrees of freedom to be increased by

`df.offset`

in the GCV criterion.- penalty
the coefficient of the penalty for degrees of freedom in the GCV criterion.

- control.spar
optional list with named components controlling the root finding when the smoothing parameter

`spar`

is computed, i.e., missing or`NULL`

, see below.**Note**that this is partly*experimental*and may change with general spar computation improvements!- low:
lower bound for

`spar`

; defaults to -1.5 (used to implicitly default to 0 in R versions earlier than 1.4).- high:
upper bound for

`spar`

; defaults to +1.5.- tol:
the absolute precision (

**tol**erance) used; defaults to 1e-4 (formerly 1e-3).- eps:
the relative precision used; defaults to 2e-8 (formerly 0.00244).

- trace:
logical indicating if iterations should be traced.

- maxit:
integer giving the maximal number of iterations; defaults to 500.

Note that

`spar`

is only searched for in the interval \([low, high]\).- tol
a tolerance for same-ness or uniqueness of the

`x`

values. The values are binned into bins of size`tol`

and values which fall into the same bin are regarded as the same. Must be strictly positive (and finite).- keep.stuff
an experimental

`logical`

indicating if the result should keep extras from the internal computations. Should allow to reconstruct the \(X\) matrix and more.

##### Details

Neither `x`

nor `y`

are allowed to containing missing or
infinite values.

The `x`

vector should contain at least four distinct values.
‘Distinct’ here is controlled by `tol`

: values which are
regarded as the same are replaced by the first of their values and the
corresponding `y`

and `w`

are pooled accordingly.

Unless `lambda`

has been specified instead of `spar`

,
the computational \(\lambda\) used (as a function of
\(s=spar\)) is
\(\lambda = r * 256^{3 s - 1}\)
where
\(r = tr(X' W X) / tr(\Sigma)\),
\(\Sigma\) is the matrix given by
\(\Sigma_{ij} = \int B_i''(t) B_j''(t) dt\),
\(X\) is given by \(X_{ij} = B_j(x_i)\),
\(W\) is the diagonal matrix of weights (scaled such that
its trace is \(n\), the original number of observations)
and \(B_k(.)\) is the \(k\)-th B-spline.

Note that with these definitions, \(f_i = f(x_i)\), and the B-spline basis representation \(f = X c\) (i.e., \(c\) is the vector of spline coefficients), the penalized log likelihood is \(L = (y - f)' W (y - f) + \lambda c' \Sigma c\), and hence \(c\) is the solution of the (ridge regression) \((X' W X + \lambda \Sigma) c = X' W y\).

If `spar`

and `lambda`

are missing or `NULL`

, the value
of `df`

is used to determine the degree of smoothing. If
`df`

is missing as well, leave-one-out cross-validation (ordinary
or ‘generalized’ as determined by `cv`

) is used to
determine \(\lambda\).

Note that from the above relation,
`spar`

is \(s = s0 + 0.0601 * \bold{\log}\lambda\),
which is intentionally *different* from the S-PLUS implementation
of `smooth.spline`

(where `spar`

is proportional to
\(\lambda\)). In R's (\(\log \lambda\)) scale, it makes more
sense to vary `spar`

linearly.

Note however that currently the results may become very unreliable
for `spar`

values smaller than about -1 or -2. The same may
happen for values larger than 2 or so. Don't think of setting
`spar`

or the controls `low`

and `high`

outside such a
safe range, unless you know what you are doing!
Similarly, specifying `lambda`

instead of `spar`

is
delicate, notably as the range of “safe” values for
`lambda`

is not scale-invariant and hence entirely data dependent.

The ‘generalized’ cross-validation method GCV will work correctly when
there are duplicated points in `x`

. However, it is ambiguous what
leave-one-out cross-validation means with duplicated points, and the
internal code uses an approximation that involves leaving out groups
of duplicated points. `cv = TRUE`

is best avoided in that case.

##### Value

An object of class `"smooth.spline"`

with components

the *distinct* `x`

values in increasing order, see
the ‘Details’ above.

the fitted values corresponding to `x`

.

the weights used at the unique values of `x`

.

the y values used at the unique `y`

values.

the `tol`

argument (whose default depends on `x`

).

only if `keep.data = TRUE`

: itself a
`list`

with components `x`

, `y`

and `w`

of the same length. These are the original \((x_i,y_i,w_i),
i = 1, \dots, n\), values where `data$x`

may have repeated values and
hence be longer than the above `x`

component; see details.

(when `cv`

was not `NA`

) leverages, the diagonal
values of the smoother matrix.

cross-validation score, ‘generalized’ or true, depending
on `cv`

. The CV score is often called “PRESS” (and
labeled on `print()`

), for ‘**PRE**diction
**S**um of **S**quares’.

the penalized criterion, a non-negative number; simply
the (weighted) residual sum of squares (RSS), ` sum(.$w * residuals(.)^2) `

.

the criterion value minimized in the underlying
`.Fortran`

routine `sslvrg`

. When `df`

has been specified,
the criterion is \(3 + (tr(S_\lambda) - df)^2\),
where the \(3 +\) is there for numerical (and historical) reasons.

equivalent degrees of freedom used. Note that (currently)
this value may become quite imprecise when the true `df`

is
between and 1 and 2.

the value of `spar`

computed or given, unless it has been
given as `c(lambda = *)`

, when it set to `NA`

here.

(when `spar`

above is not `NA`

), the value
\(r\), the ratio of two matrix traces.

the value of \(\lambda\) corresponding to `spar`

,
see the details above.

named integer(3) vector where `..$ipars["iter"]`

gives number of spar computing iterations used.

experimental; when `keep.stuff`

was true, a
“flat” numeric vector containing parts of the internal computations.

list for use by `predict.smooth.spline`

, with
components

- knot:
the knot sequence (including the repeated boundary knots), scaled into \([0, 1]\) (via

`min`

and`range`

).- nk:
number of coefficients or number of ‘proper’ knots plus 2.

- coef:
coefficients for the spline basis used.

- min, range:
numbers giving the corresponding quantities of

`x`

.

the matched call.

method(class = "smooth.spline") shows a hatvalues() method based on the lev vector above.

##### Note

The number of unique `x`

values, \(\code{nx} = n_x\), are
determined by the `tol`

argument, equivalently to

nx <- length(x) - sum(duplicated( round((x - mean(x)) / tol) ))

The default `all.knots = FALSE`

and `nknots = .nknots.smspl`

,
entails using only \(O({n_x}^{0.2})\)
knots instead of \(n_x\) for \(n_x > 49\). This cuts
speed and memory requirements, but not drastically anymore since R
version 1.5.1 where it is only \(O(n_k) + O(n)\) where
\(n_k\) is the number of knots.

In this case where not all unique `x`

values are
used as knots, the result is not a smoothing spline in the strict
sense, but very close unless a small smoothing parameter (or large
`df`

) is used.

##### References

Chambers, J. M. and Hastie, T. J. (1992)
*Statistical Models in S*, Wadsworth & Brooks/Cole.

Green, P. J. and Silverman, B. W. (1994)
*Nonparametric Regression and Generalized Linear Models:
A Roughness Penalty Approach.* Chapman and Hall.

Hastie, T. J. and Tibshirani, R. J. (1990)
*Generalized Additive Models.* Chapman and Hall.

##### See Also

`predict.smooth.spline`

for evaluating the spline
and its derivatives.

##### Examples

`library(stats)`

```
# NOT RUN {
require(graphics)
plot(dist ~ speed, data = cars, main = "data(cars) & smoothing splines")
cars.spl <- with(cars, smooth.spline(speed, dist))
cars.spl
## This example has duplicate points, so avoid cv = TRUE
# }
# NOT RUN {
lines(cars.spl, col = "blue")
ss10 <- smooth.spline(cars[,"speed"], cars[,"dist"], df = 10)
lines(ss10, lty = 2, col = "red")
legend(5,120,c(paste("default [C.V.] => df =",round(cars.spl$df,1)),
"s( * , df = 10)"), col = c("blue","red"), lty = 1:2,
bg = 'bisque')
## Residual (Tukey Anscombe) plot:
plot(residuals(cars.spl) ~ fitted(cars.spl))
abline(h = 0, col = "gray")
## consistency check:
stopifnot(all.equal(cars$dist,
fitted(cars.spl) + residuals(cars.spl)))
## The chosen inner knots in original x-scale :
with(cars.spl$fit, min + range * knot[-c(1:3, nk+1 +1:3)]) # == unique(cars$speed)
## Visualize the behavior of .nknots.smspl()
nKnots <- Vectorize(.nknots.smspl) ; c.. <- adjustcolor("gray20",.5)
curve(nKnots, 1, 250, n=250)
abline(0,1, lty=2, col=c..); text(90,90,"y = x", col=c.., adj=-.25)
abline(h=100,lty=2); abline(v=200, lty=2)
n <- c(1:799, seq(800, 3490, by=10), seq(3500, 10000, by = 50))
plot(n, nKnots(n), type="l", main = "Vectorize(.nknots.smspl) (n)")
abline(0,1, lty=2, col=c..); text(180,180,"y = x", col=c..)
n0 <- c(50, 200, 800, 3200); c0 <- adjustcolor("blue3", .5)
lines(n0, nKnots(n0), type="h", col=c0)
axis(1, at=n0, line=-2, col.ticks=c0, col=NA, col.axis=c0)
axis(4, at=.nknots.smspl(10000), line=-.5, col=c..,col.axis=c.., las=1)
##-- artificial example
y18 <- c(1:3, 5, 4, 7:3, 2*(2:5), rep(10, 4))
xx <- seq(1, length(y18), len = 201)
(s2 <- smooth.spline(y18)) # GCV
(s02 <- smooth.spline(y18, spar = 0.2))
(s02. <- smooth.spline(y18, spar = 0.2, cv = NA))
plot(y18, main = deparse(s2$call), col.main = 2)
lines(s2, col = "gray"); lines(predict(s2, xx), col = 2)
lines(predict(s02, xx), col = 3); mtext(deparse(s02$call), col = 3)
## Specifying 'lambda' instead of usual spar :
(s2. <- smooth.spline(y18, lambda = s2$lambda, tol = s2$tol))
# }
# NOT RUN {
## The following shows the problematic behavior of 'spar' searching:
(s2 <- smooth.spline(y18, control =
list(trace = TRUE, tol = 1e-6, low = -1.5)))
(s2m <- smooth.spline(y18, cv = TRUE, control =
list(trace = TRUE, tol = 1e-6, low = -1.5)))
## both above do quite similarly (Df = 8.5 +- 0.2)
# }
```

*Documentation reproduced from package stats, version 3.5.0, License: Part of R 3.5.0*