# smooth.spline

##### Fit a Smoothing Spline

Fits a cubic smoothing spline to the supplied data.

- Keywords
- smooth

##### Usage

```
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL, cv = FALSE,
all.knots = FALSE, nknots = .nknots.smspl,
keep.data = TRUE, df.offset = 0, penalty = 1,
control.spar = list(), tol = 1e-6 * IQR(x))
```

##### Arguments

- x
- a vector giving the values of the predictor variable, or a list or a two-column matrix specifying x and y.
- y
- responses. If
`y`

is missing or`NULL`

, the responses are assumed to be specified by`x`

, with`x`

the index vector. - w
- optional vector of weights of the same length as
`x`

; defaults to all 1. - df
- the desired equivalent number of degrees of freedom (trace of the smoother matrix). Must be in $(1,n_x]$, $n_x$ the number of unique x values, see below.
- spar
- smoothing parameter, typically (but not necessarily) in
$(0,1]$. The coefficient $\lambda$ of the integral of the
squared second derivative in the fit (penalized log likelihood)
criterion is a monotone function of
`spar`

, see the details below. - cv
- ordinary (
`TRUE`

) orgeneralized cross-validation (GCV) when`FALSE`

; is used for smoothing parameter computation only when`df`

is not specified; it is used however to determine`cv.crit`

in the result. Setting it to`NA`

for speedup skips the evaluation of leverages and any score. - all.knots
- if
`TRUE`

, all distinct points in`x`

are used as knots. If`FALSE`

(default), a subset of`x[]`

is used, specifically`x[j]`

where the`nknots`

indices are evenly spaced in`1:n`

, see also the next argument`nknots`

. - nknots
- integer or
`function`

giving the number of knots to use when`all.knots = FALSE`

. If a function (as by default), the number of knots is`nknots(nx)`

. By default for $n_x > 49$ this is less than $n_x$, the number of unique`x`

values, see the Note. - keep.data
- logical specifying if the input data should be kept
in the result. If
`TRUE`

(as per default), fitted values and residuals are available from the result. - df.offset
- allows the degrees of freedom to be increased by
`df.offset`

in the GCV criterion. - penalty
- the coefficient of the penalty for degrees of freedom in the GCV criterion.
- control.spar
- optional list with named components controlling the
root finding when the smoothing parameter
`spar`

is computed, i.e., missing or`NULL`

, see below.**Note**that this is partly*experimental*and may change with general spar computation improvements![object Object],[object Object],[object Object],[object Object],[object Object],[object Object] Note that

`spar`

is only searched for in the interval $[low, high]$. - tol
- a tolerance for same-ness or uniqueness of the
`x`

values. The values are binned into bins of size`tol`

and values which fall into the same bin are regarded as the same. Must be strictly positive (and finite).

##### Details

Neither `x`

nor `y`

are allowed to containing missing or
infinite values.

The `x`

vector should contain at least four distinct values.
`tol`

: values which are
regarded as the same are replaced by the first of their values and the
corresponding `y`

and `w`

are pooled accordingly.

The computational $\lambda$ used (as a function of $s=spar$) is $\lambda = r * 256^{3 s - 1}$ where $r = tr(X' W X) / tr(\Sigma)$, $\Sigma$ is the matrix given by $\Sigma_{ij} = \int B_i''(t) B_j''(t) dt$, $X$ is given by $X_{ij} = B_j(x_i)$, $W$ is the diagonal matrix of weights (scaled such that its trace is $n$, the original number of observations) and $B_k(.)$ is the $k$-th B-spline.

Note that with these definitions, $f_i = f(x_i)$, and the B-spline basis representation $f = X c$ (i.e., $c$ is the vector of spline coefficients), the penalized log likelihood is $L = (y - f)' W (y - f) + \lambda c' \Sigma c$, and hence $c$ is the solution of the (ridge regression) $(X' W X + \lambda \Sigma) c = X' W y$.

If `spar`

is missing or `NULL`

, the value of `df`

is used to
determine the degree of smoothing. If both are missing, leave-one-out
cross-validation (ordinary or `cv`

) is used to determine $\lambda$.
Note that from the above relation, `spar`

is $s = s0 + 0.0601 * \bold{\log}\lambda$,
which is intentionally *different* from the S-PLUS implementation
of `smooth.spline`

(where `spar`

is proportional to
$\lambda$). In R's ($\log \lambda$) scale, it makes more
sense to vary `spar`

linearly.

Note however that currently the results may become very unreliable
for `spar`

values smaller than about -1 or -2. The same may
happen for values larger than 2 or so. Don't think of setting
`spar`

or the controls `low`

and `high`

outside such a
safe range, unless you know what you are doing!

The `x`

. However, it is ambiguous what
leave-one-out cross-validation means with duplicated points, and the
internal code uses an approximation that involves leaving out groups
of duplicated points. `cv = TRUE`

is best avoided in that case.

##### Value

- An object of class
`"smooth.spline"`

with components x the *distinct*`x`

values in increasing order, see theDetails above.y the fitted values corresponding to `x`

.w the weights used at the unique values of `x`

.yin the y values used at the unique `y`

values.data only if `keep.data = TRUE`

: itself a`list`

with components`x`

,`y`

and`w`

of the same length. These are the original $(x_i,y_i,w_i), i = 1, \dots, n$, values where`data$x`

may have repeated values and hence be longer than the above`x`

component; see details.lev (when `cv`

was not`NA`

) leverages, the diagonal values of the smoother matrix.cv.crit cross-validation score, generalized or true, depending on`cv`

.pen.crit penalized criterion crit the criterion value minimized in the underlying `.Fortran`

routinesslvrg . When`df`

has been specified, the criterion is $3 + (tr(S_\lambda) - df)^2$, where the $3 +$ is there for numerical (and historical) reasons.df equivalent degrees of freedom used. Note that (currently) this value may become quite imprecise when the true `df`

is between and 1 and 2.spar the value of `spar`

computed or given.lambda the value of $\lambda$ corresponding to `spar`

, see the details above.iparms named integer(3) vector where `..$ipars["iter"]`

gives number of spar computing iterations used.fit list for use by `predict.smooth.spline`

, with components [object Object],[object Object],[object Object],[object Object]call the matched call.

##### Note

The number of unique `x`

values, $\code{nx} = n_x$, are
determined by the `tol`

argument, equivalently to nx <- length(x) - sum(duplicated( round((x - mean(x)) / tol) ))

The default `all.knots = FALSE`

and `nknots = .nknots.smspl`

,
entails using only $O({n_x}^{0.2})$
knots instead of $n_x$ for $n_x > 49$. This cuts
speed and memory requirements, but not drastically anymore since Rversion 1.5.1 where it is only $O(n_k) + O(n)$ where
$n_k$ is the number of knots.

In this case where not all unique `x`

values are
used as knots, the result is not a smoothing spline in the strict
sense, but very close unless a small smoothing parameter (or large
`df`

) is used.

##### source

This function is based on code in the `GAMFIT`

Fortran program by
T. Hastie and R. Tibshirani (`smooth.spline`

function of Chambers & Hastie (1992).

##### References

Chambers, J. M. and Hastie, T. J. (1992)
*Statistical Models in S*, Wadsworth & Brooks/Cole.

Green, P. J. and Silverman, B. W. (1994)
*Nonparametric Regression and Generalized Linear Models:
A Roughness Penalty Approach.* Chapman and Hall.

Hastie, T. J. and Tibshirani, R. J. (1990)
*Generalized Additive Models.* Chapman and Hall.

##### See Also

`predict.smooth.spline`

for evaluating the spline
and its derivatives.

##### Examples

`library(stats)`

```
require(graphics)
attach(cars)
plot(speed, dist, main = "data(cars) & smoothing splines")
cars.spl <- smooth.spline(speed, dist)
(cars.spl)
## This example has duplicate points, so avoid cv = TRUE
stopifnot(cars.spl $ w == table(speed)) # weights = multiplicities
utils::str(cars.spl, digits = 5, vec.len = 6)
cars.spl$fit
lines(cars.spl, col = "blue")
lines(smooth.spline(speed, dist, df = 10), lty = 2, col = "red")
legend(5,120,c(paste("default [C.V.] => df =",round(cars.spl$df,1)),
"s( * , df = 10)"), col = c("blue","red"), lty = 1:2,
bg = 'bisque')
detach()
## Residual (Tukey Anscombe) plot:
plot(residuals(cars.spl) ~ fitted(cars.spl))
abline(h = 0, col = "gray")
## consistency check:
stopifnot(all.equal(cars$dist,
fitted(cars.spl) + residuals(cars.spl)))
## Visualize the behavior of .nknots.smspl()
nKnots <- Vectorize(.nknots.smspl) ; c.. <- adjustcolor("gray20",.5)
curve(nKnots, 1, 250, n=250)
abline(0,1, lty=2, col=c..); text(90,90,"y = x", col=c.., adj=-.25)
abline(h=100,lty=2); abline(v=200, lty=2)
n <- c(1:799, seq(800, 3490, by=10), seq(3500, 10000, by = 50))
plot(n, nKnots(n), type="l", main = "Vectorize(.nknots.smspl) (n)")
abline(0,1, lty=2, col=c..); text(180,180,"y = x", col=c..)
n0 <- c(50, 200, 800, 3200); c0 <- adjustcolor("blue3", .5)
lines(n0, nKnots(n0), type="h", col=c0)
axis(1, at=n0, line=-2, col.ticks=c0, col=NA, col.axis=c0)
axis(4, at=.nknots.smspl(10000), line=-.5, col=c..,col.axis=c.., las=1)
##-- artificial example
y18 <- c(1:3, 5, 4, 7:3, 2*(2:5), rep(10, 4))
xx <- seq(1, length(y18), len = 201)
(s2 <- smooth.spline(y18)) # GCV
(s02 <- smooth.spline(y18, spar = 0.2))
(s02. <- smooth.spline(y18, spar = 0.2, cv = NA))
plot(y18, main = deparse(s2$call), col.main = 2)
lines(s2, col = "gray"); lines(predict(s2, xx), col = 2)
lines(predict(s02, xx), col = 3); mtext(deparse(s02$call), col = 3)
stopifnot(all.equal(predict(s02 , xx),
predict(s02., xx), tolerance = 1e-15))
## The following shows the problematic behavior of 'spar' searching:
(s2 <- smooth.spline(y18, control =
list(trace = TRUE, tol = 1e-6, low = -1.5)))
(s2m <- smooth.spline(y18, cv = TRUE, control =
list(trace = TRUE, tol = 1e-6, low = -1.5)))
## both above do quite similarly (Df = 8.5 +- 0.2)
```

*Documentation reproduced from package stats, version 3.3, License: Part of R 3.3*