# smooth.spline

##### Fit a Smoothing Spline

Fits a cubic smoothing spline to the supplied data.

- Keywords
- smooth

##### Usage

```
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL, cv = FALSE,
all.knots = FALSE, nknots = .nknots.smspl,
keep.data = TRUE, df.offset = 0, penalty = 1,
control.spar = list(), tol = 1e-6 * IQR(x))
```

##### Arguments

- x
- a vector giving the values of the predictor variable, or a list or a two-column matrix specifying x and y.
- y
- responses. If
`y`

is missing or`NULL`

, the responses are assumed to be specified by`x`

, with`x`

the index vector. - w
- optional vector of weights of the same length as
`x`

; defaults to all 1. - df
- the desired equivalent number of degrees of freedom (trace of the smoother matrix). Must be in \((1,n_x]\), \(n_x\) the number of unique x values, see below.
- spar
- smoothing parameter, typically (but not necessarily) in
\((0,1]\). The coefficient \(\lambda\) of the integral of the
squared second derivative in the fit (penalized log likelihood)
criterion is a monotone function of
`spar`

, see the details below. - cv
- ordinary (
`TRUE`

) or ‘generalized’ cross-validation (GCV) when`FALSE`

; is used for smoothing parameter computation only when`df`

is not specified; it is used however to determine`cv.crit`

in the result. Setting it to`NA`

for speedup skips the evaluation of leverages and any score. - all.knots
- if
`TRUE`

, all distinct points in`x`

are used as knots. If`FALSE`

(default), a subset of`x[]`

is used, specifically`x[j]`

where the`nknots`

indices are evenly spaced in`1:n`

, see also the next argument`nknots`

. - nknots
- integer or
`function`

giving the number of knots to use when`all.knots = FALSE`

. If a function (as by default), the number of knots is`nknots(nx)`

. By default for \(n_x > 49\) this is less than \(n_x\), the number of unique`x`

values, see the Note. - keep.data
- logical specifying if the input data should be kept
in the result. If
`TRUE`

(as per default), fitted values and residuals are available from the result. - df.offset
- allows the degrees of freedom to be increased by
`df.offset`

in the GCV criterion. - penalty
- the coefficient of the penalty for degrees of freedom in the GCV criterion.
- control.spar
- optional list with named components controlling the
root finding when the smoothing parameter
`spar`

is computed, i.e., missing or`NULL`

, see below.**Note**that this is partly*experimental*and may change with general spar computation improvements!- low:
- lower bound for
`spar`

; defaults to -1.5 (used to implicitly default to 0 in R versions earlier than 1.4). - high:
- upper bound for
`spar`

; defaults to +1.5. - tol:
- the absolute precision (
**tol**erance) used; defaults to 1e-4 (formerly 1e-3). - eps:
- the relative precision used; defaults to 2e-8 (formerly 0.00244).
- trace:
- logical indicating if iterations should be traced.
- maxit:
- integer giving the maximal number of iterations; defaults to 500.

Note that

`spar`

is only searched for in the interval \([low, high]\). - tol
- a tolerance for same-ness or uniqueness of the
`x`

values. The values are binned into bins of size`tol`

and values which fall into the same bin are regarded as the same. Must be strictly positive (and finite).

##### Details

Neither `x`

nor `y`

are allowed to containing missing or
infinite values. The `x`

vector should contain at least four distinct values.
‘Distinct’ here is controlled by `tol`

: values which are
regarded as the same are replaced by the first of their values and the
corresponding `y`

and `w`

are pooled accordingly. The computational \(\lambda\) used (as a function of
\(s=spar\)) is
\(\lambda = r * 256^{3 s - 1}\)
where
\(r = tr(X' W X) / tr(\Sigma)\),
\(\Sigma\) is the matrix given by
\(\Sigma_{ij} = \int B_i''(t) B_j''(t) dt\),
\(X\) is given by \(X_{ij} = B_j(x_i)\),
\(W\) is the diagonal matrix of weights (scaled such that
its trace is \(n\), the original number of observations)
and \(B_k(.)\) is the \(k\)-th B-spline. Note that with these definitions, \(f_i = f(x_i)\), and the B-spline
basis representation \(f = X c\) (i.e., \(c\) is
the vector of spline coefficients), the penalized log likelihood is
\(L = (y - f)' W (y - f) + \lambda c' \Sigma c\), and hence
\(c\) is the solution of the (ridge regression)
\((X' W X + \lambda \Sigma) c = X' W y\). If `spar`

is missing or `NULL`

, the value of `df`

is used to
determine the degree of smoothing. If both are missing, leave-one-out
cross-validation (ordinary or ‘generalized’ as determined by
`cv`

) is used to determine \(\lambda\).
Note that from the above relation,
`spar`

is \(s = s0 + 0.0601 * \bold{\log}\lambda\),
which is intentionally *different* from the S-PLUS implementation
of `smooth.spline`

(where `spar`

is proportional to
\(\lambda\)). In R's (\(\log \lambda\)) scale, it makes more
sense to vary `spar`

linearly. Note however that currently the results may become very unreliable
for `spar`

values smaller than about -1 or -2. The same may
happen for values larger than 2 or so. Don't think of setting
`spar`

or the controls `low`

and `high`

outside such a
safe range, unless you know what you are doing! The ‘generalized’ cross-validation method will work correctly when
there are duplicated points in `x`

. However, it is ambiguous what
leave-one-out cross-validation means with duplicated points, and the
internal code uses an approximation that involves leaving out groups
of duplicated points. `cv = TRUE`

is best avoided in that case.

##### Value

An object of class `"smooth.spline"`

with components

*distinct*

`x`

values in increasing order, see
the ‘Details’ above.`x`

.`x`

.`y`

values.`keep.data = TRUE`

: itself a
`list`

with components `x`

, `y`

and `w`

of the same length. These are the original \((x_i,y_i,w_i),
i = 1, \dots, n\), values where `data$x`

may have repeated values and
hence be longer than the above `x`

component; see details.
`cv`

was not `NA`

) leverages, the diagonal
values of the smoother matrix.`cv`

.`.Fortran`

routine `sslvrg`

. When `df`

has been specified,
the criterion is \(3 + (tr(S_\lambda) - df)^2\),
where the \(3 +\) is there for numerical (and historical) reasons.`df`

is
between and 1 and 2.
`spar`

computed or given.`spar`

,
see the details above.`..$ipars["iter"]`

gives number of spar computing iterations used.`predict.smooth.spline`

, with
components
- knot:
- the knot sequence (including the repeated boundary knots).
- nk:
- number of coefficients or number of ‘proper’ knots plus 2.
- coef:
- coefficients for the spline basis used.
- min, range:
- numbers giving the corresponding quantities of
`x`

.

##### Note

The number of unique `x`

values, \(\code{nx} = n_x\), are
determined by the `tol`

argument, equivalently to

nx <- length(x) - sum(duplicated( round((x - mean(x)) / tol) ))The default

`all.knots = FALSE`

and `nknots = .nknots.smspl`

,
entails using only \(O({n_x}^{0.2})\)
knots instead of \(n_x\) for \(n_x > 49\). This cuts
speed and memory requirements, but not drastically anymore since R
version 1.5.1 where it is only \(O(n_k) + O(n)\) where
\(n_k\) is the number of knots. In this case where not all unique `x`

values are
used as knots, the result is not a smoothing spline in the strict
sense, but very close unless a small smoothing parameter (or large
`df`

) is used.
##### References

Chambers, J. M. and Hastie, T. J. (1992)
*Statistical Models in S*, Wadsworth & Brooks/Cole. Green, P. J. and Silverman, B. W. (1994)
*Nonparametric Regression and Generalized Linear Models:
A Roughness Penalty Approach.* Chapman and Hall. Hastie, T. J. and Tibshirani, R. J. (1990)
*Generalized Additive Models.* Chapman and Hall.

##### See Also

`predict.smooth.spline`

for evaluating the spline
and its derivatives.

##### Examples

`library(stats)`

```
require(graphics)
attach(cars)
plot(speed, dist, main = "data(cars) & smoothing splines")
cars.spl <- smooth.spline(speed, dist)
(cars.spl)
## This example has duplicate points, so avoid cv = TRUE
lines(cars.spl, col = "blue")
lines(smooth.spline(speed, dist, df = 10), lty = 2, col = "red")
legend(5,120,c(paste("default [C.V.] => df =",round(cars.spl$df,1)),
"s( * , df = 10)"), col = c("blue","red"), lty = 1:2,
bg = 'bisque')
detach()
## Residual (Tukey Anscombe) plot:
plot(residuals(cars.spl) ~ fitted(cars.spl))
abline(h = 0, col = "gray")
## consistency check:
stopifnot(all.equal(cars$dist,
fitted(cars.spl) + residuals(cars.spl)))
## Visualize the behavior of .nknots.smspl()
nKnots <- Vectorize(.nknots.smspl) ; c.. <- adjustcolor("gray20",.5)
curve(nKnots, 1, 250, n=250)
abline(0,1, lty=2, col=c..); text(90,90,"y = x", col=c.., adj=-.25)
abline(h=100,lty=2); abline(v=200, lty=2)
n <- c(1:799, seq(800, 3490, by=10), seq(3500, 10000, by = 50))
plot(n, nKnots(n), type="l", main = "Vectorize(.nknots.smspl) (n)")
abline(0,1, lty=2, col=c..); text(180,180,"y = x", col=c..)
n0 <- c(50, 200, 800, 3200); c0 <- adjustcolor("blue3", .5)
lines(n0, nKnots(n0), type="h", col=c0)
axis(1, at=n0, line=-2, col.ticks=c0, col=NA, col.axis=c0)
axis(4, at=.nknots.smspl(10000), line=-.5, col=c..,col.axis=c.., las=1)
##-- artificial example
y18 <- c(1:3, 5, 4, 7:3, 2*(2:5), rep(10, 4))
xx <- seq(1, length(y18), len = 201)
(s2 <- smooth.spline(y18)) # GCV
(s02 <- smooth.spline(y18, spar = 0.2))
(s02. <- smooth.spline(y18, spar = 0.2, cv = NA))
plot(y18, main = deparse(s2$call), col.main = 2)
lines(s2, col = "gray"); lines(predict(s2, xx), col = 2)
lines(predict(s02, xx), col = 3); mtext(deparse(s02$call), col = 3)
## The following shows the problematic behavior of 'spar' searching:
(s2 <- smooth.spline(y18, control =
list(trace = TRUE, tol = 1e-6, low = -1.5)))
(s2m <- smooth.spline(y18, cv = TRUE, control =
list(trace = TRUE, tol = 1e-6, low = -1.5)))
## both above do quite similarly (Df = 8.5 +- 0.2)
```

*Documentation reproduced from package stats, version 3.3.3, License: Part of R 3.3.3*