smooth.spline: Fit a Smoothing Spline

Description

Fits a cubic smoothing spline to the supplied data.

Usage

smooth.spline(x, y = NULL, w = NULL, df, spar = NULL, cv = FALSE,
              all.knots = FALSE, nknots = .nknots.smspl,
              keep.data = TRUE, df.offset = 0, penalty = 1,
              control.spar = list(), tol = 1e-6 * IQR(x))

Arguments

a vector giving the values of the predictor variable, or a list or a two-column matrix specifying x and y.

responses. If y is missing or NULL, the responses are assumed to be specified by x, with x the index vector.

optional vector of weights of the same length as x; defaults to all 1.

the desired equivalent number of degrees of freedom (trace of the smoother matrix). Must be in $(1,n_x]$, $n_x$ the number of unique x values, see below.

spar

smoothing parameter, typically (but not necessarily) in $(0,1]$. The coefficient $\lambda$ of the integral of the squared second derivative in the fit (penalized log likelihood) criterion is a monotone function of spar, see the details below.

ordinary (TRUE) or generalized cross-validation (GCV) when FALSE; is used for smoothing parameter computation only when df is not specified; it is used however to determine cv.crit in the result. Setting it to NA for speedup skips the evaluation of leverages and any score.

all.knots

if TRUE, all distinct points in x are used as knots. If FALSE (default), a subset of x[] is used, specifically x[j] where the nknots indices are evenly spaced in 1:n, see also the next argument nknots.

nknots

integer or function giving the number of knots to use when all.knots = FALSE. If a function (as by default), the number of knots is nknots(nx). By default for $n_x > 49$ this is less than $n_x$, the number of unique x values, see the Note.

keep.data

logical specifying if the input data should be kept in the result. If TRUE (as per default), fitted values and residuals are available from the result.

df.offset

allows the degrees of freedom to be increased by df.offset in the GCV criterion.

penalty

the coefficient of the penalty for degrees of freedom in the GCV criterion.

control.spar

optional list with named components controlling the root finding when the smoothing parameter spar is computed, i.e., missing or NULL, see below.

Note that this is partly experimental and may change with general spar computation improvements!

[object Object],[object Object],[object Object],[object Object],[object Object],[object Object] Note that spar is only searched for in the interval $[low, high]$.

tol

a tolerance for same-ness or uniqueness of the x values. The values are binned into bins of size tol and values which fall into the same bin are regarded as the same. Must be strictly positive (and finite).

Value

An object of class "smooth.spline" with components
xthe distinct x values in increasing order, see the Details above.
ythe fitted values corresponding to x.
wthe weights used at the unique values of x.
yinthe y values used at the unique y values.
dataonly if keep.data = TRUE: itself a list with components x, y and w of the same length. These are the original $(x_i,y_i,w_i), i = 1, \dots, n$, values where data$x may have repeated values and hence be longer than the above x component; see details.
lev(when cv was not NA) leverages, the diagonal values of the smoother matrix.
cv.critcross-validation score, generalized or true, depending on cv.
pen.critpenalized criterion
critthe criterion value minimized in the underlying .Fortran routine sslvrg. When df has been specified, the criterion is $3 + (tr(S_\lambda) - df)^2$, where the $3 +$ is there for numerical (and historical) reasons.
dfequivalent degrees of freedom used. Note that (currently) this value may become quite imprecise when the true df is between and 1 and 2.
sparthe value of spar computed or given.
lambdathe value of $\lambda$ corresponding to spar, see the details above.
iparmsnamed integer(3) vector where ..$ipars["iter"] gives number of spar computing iterations used.
fitlist for use by predict.smooth.spline, with components [object Object],[object Object],[object Object],[object Object]
callthe matched call.

source

This function is based on code in the GAMFIT Fortran program by T. Hastie and R. Tibshirani (http://lib.stat.cmu.edu/general/), which makes use of spline code by Finbarr O'Sullivan. Its design parallels the smooth.spline function of Chambers & Hastie (1992).

Details

Neither x nor y are allowed to containing missing or infinite values.

The x vector should contain at least four distinct values. Distinct here is controlled by tol: values which are regarded as the same are replaced by the first of their values and the corresponding y and w are pooled accordingly.

The computational $\lambda$ used (as a function of $s=spar$) is $\lambda = r * 256^{3 s - 1}$ where $r = tr(X' W X) / tr(\Sigma)$, $\Sigma$ is the matrix given by $\Sigma_{ij} = \int B_i''(t) B_j''(t) dt$, $X$ is given by $X_{ij} = B_j(x_i)$, $W$ is the diagonal matrix of weights (scaled such that its trace is $n$, the original number of observations) and $B_k(.)$ is the $k$-th B-spline.

Note that with these definitions, $f_i = f(x_i)$, and the B-spline basis representation $f = X c$ (i.e., $c$ is the vector of spline coefficients), the penalized log likelihood is $L = (y - f)' W (y - f) + \lambda c' \Sigma c$, and hence $c$ is the solution of the (ridge regression) $(X' W X + \lambda \Sigma) c = X' W y$.

If spar is missing or NULL, the value of df is used to determine the degree of smoothing. If both are missing, leave-one-out cross-validation (ordinary or generalized as determined by cv) is used to determine $\lambda$. Note that from the above relation, spar is $s = s0 + 0.0601 * \bold{\log}\lambda$, which is intentionally different from the S-PLUS implementation of smooth.spline (where spar is proportional to $\lambda$). In R's ($\log \lambda$) scale, it makes more sense to vary spar linearly.

Note however that currently the results may become very unreliable for spar values smaller than about -1 or -2. The same may happen for values larger than 2 or so. Don't think of setting spar or the controls low and high outside such a safe range, unless you know what you are doing!

The generalized cross-validation method will work correctly when there are duplicated points in x. However, it is ambiguous what leave-one-out cross-validation means with duplicated points, and the internal code uses an approximation that involves leaving out groups of duplicated points. cv = TRUE is best avoided in that case.

References

Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S, Wadsworth & Brooks/Cole.

Green, P. J. and Silverman, B. W. (1994) Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall.

Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models. Chapman and Hall.

Examples

Run this code

require(graphics)

attach(cars)
plot(speed, dist, main = "data(cars)  &  smoothing splines")
cars.spl <- smooth.spline(speed, dist)
(cars.spl)
## This example has duplicate points, so avoid cv = TRUE
stopifnot(cars.spl $ w == table(speed)) # weights = multiplicities
  utils::str(cars.spl, digits = 5, vec.len = 6)
  cars.spl$fit
lines(cars.spl, col = "blue")
lines(smooth.spline(speed, dist, df = 10), lty = 2, col = "red")
legend(5,120,c(paste("default [C.V.] => df =",round(cars.spl$df,1)),
               "s( * , df = 10)"), col = c("blue","red"), lty = 1:2,
       bg = 'bisque')
detach()


## Residual (Tukey Anscombe) plot:
plot(residuals(cars.spl) ~ fitted(cars.spl))
abline(h = 0, col = "gray")

## consistency check:
stopifnot(all.equal(cars$dist,
                    fitted(cars.spl) + residuals(cars.spl)))

## Visualize the behavior of  .nknots.smspl()
nKnots <- Vectorize(.nknots.smspl) ; c.. <- adjustcolor("gray20",.5)
curve(nKnots, 1, 250, n=250)
abline(0,1, lty=2, col=c..); text(90,90,"y = x", col=c.., adj=-.25)
abline(h=100,lty=2); abline(v=200, lty=2)

n <- c(1:799, seq(800, 3490, by=10), seq(3500, 10000, by = 50))
plot(n, nKnots(n), type="l", main = "Vectorize(.nknots.smspl) (n)")
abline(0,1, lty=2, col=c..); text(180,180,"y = x", col=c..)
n0 <- c(50, 200, 800, 3200); c0 <- adjustcolor("blue3", .5)
lines(n0, nKnots(n0), type="h", col=c0)
axis(1, at=n0, line=-2, col.ticks=c0, col=NA, col.axis=c0)
axis(4, at=.nknots.smspl(10000), line=-.5, col=c..,col.axis=c.., las=1)

##-- artificial example
y18 <- c(1:3, 5, 4, 7:3, 2*(2:5), rep(10, 4))
xx  <- seq(1, length(y18), len = 201)
(s2  <- smooth.spline(y18)) # GCV
(s02  <- smooth.spline(y18, spar = 0.2))
(s02. <- smooth.spline(y18, spar = 0.2, cv = NA))
plot(y18, main = deparse(s2$call), col.main = 2)
lines(s2, col = "gray"); lines(predict(s2, xx), col = 2)
lines(predict(s02, xx), col = 3); mtext(deparse(s02$call), col = 3)

stopifnot(all.equal(predict(s02 , xx),
		    predict(s02., xx), tolerance = 1e-15))
## The following shows the problematic behavior of 'spar' searching:
(s2  <- smooth.spline(y18, control =
                      list(trace = TRUE, tol = 1e-6, low = -1.5)))
(s2m <- smooth.spline(y18, cv = TRUE, control =
                      list(trace = TRUE, tol = 1e-6, low = -1.5)))
## both above do quite similarly (Df = 8.5 +- 0.2)

Run the code above in your browser using DataLab