fda (version 5.1.9)

Data2fd: Create a functional data object from data

Description

This function converts an array y of function values, plus an array argvals of argument values, into a functional data object. It tries to do as much for the user as possible in setting up a call to function smooth.basis. Be warned that the result may not be a satisfactory smooth of the data, in which case it may be necessary to use function smooth.basis instead; its help file provides a great deal more information than is given here. Also, Data2fd can swap the first two arguments, argvals and y, if they appear to have been supplied in reverse order. A warning message is issued when this swap takes place. Any such automatic decision can be wrong, however, and the results should be checked carefully. Preferably, the order of the arguments should be respected: argvals comes first and y comes second.
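
For example, a minimal sketch of the preferred calling sequence (the object names are illustrative, and the fda package is assumed to be attached):

argchk <- seq(0, 1, length.out=11)           # argument values
ychk   <- cos(2*pi*argchk)                   # sampled function values
bchk   <- create.bspline.basis(c(0, 1), nbasis=5)
fdchk  <- Data2fd(argchk, ychk, bchk)        # argvals first, y second
#  reversing argchk and ychk may still work, but only because Data2fd
#  guesses the intended order and warns; check such results carefully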

Usage

Data2fd(argvals=NULL, y=NULL, basisobj=NULL, nderiv=NULL,
        lambda=3e-8/diff(as.numeric(range(argvals))),
        fdnames=NULL, covariates=NULL, method="chol",
        dfscale=1)

Arguments

argvals

a set of argument values. If this is a vector, the same set of argument values is used for all columns of y. If argvals is a matrix, the columns correspond to the columns of y, and contain the argument values for that replicate or case.

Dimensions for argvals must match the first dimensions of y, though y can have more dimensions. For example, if dim(y) = c(9, 5, 2), argvals can be a vector of length 9 or a matrix of dimensions c(9, 5) or an array of dimensions c(9, 5, 2).
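
A minimal sketch of two accepted forms for a 9 by 5 matrix y (illustrative names; fda attached):

avec <- seq(0, 1, length.out=9)      # one argvals vector for all columns
amat <- matrix(avec, 9, 5)           # or one column of argument values per case
ymat <- matrix(rnorm(45), 9, 5)      # 9 argument values, 5 replicates
bsp5 <- create.bspline.basis(c(0, 1), nbasis=5)
fd.vec <- Data2fd(avec, ymat, bsp5)
fd.mat <- Data2fd(amat, ymat, bsp5)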

y

an array containing sampled values of curves.

If y is a vector, a single replicate and a single variable are assumed. If y is a matrix, rows must correspond to argument values and columns to replications or cases, and a single variable per observation is assumed. If y is a three-dimensional array, the first dimension (rows) corresponds to argument values, the second (columns) to replications, and the third (layers) to variables within replications. Missing values are permitted, and the number of values may vary from one replication to another. In that case, the number of rows must equal the maximum number of argument values, and columns of y with fewer values must be padded out with NAs.
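
A sketch of the padding convention (illustrative names; fda attached), with 9 observations in the first replicate and only 6 in the second:

apad <- seq(0, 1, length.out=9)
ypad <- cbind(sin(2*pi*apad),                      # replicate 1: 9 values
              c(cos(2*pi*apad[1:6]), rep(NA, 3)))  # replicate 2: padded with NAs
bpad <- create.bspline.basis(c(0, 1), nbasis=4)
fd.pad <- Data2fd(apad, ypad, bpad)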

basisobj

One of the following (a brief sketch exercising these options follows the list):

  • basisfd a functional basis object (class basisfd).

  • fd a functional data object (class fd), from which its basis component is extracted.

  • fdPar a functional parameter object (class fdPar), from which its basis component is extracted.

  • integer an integer giving the order of a B-spline basis, create.bspline.basis(argvals, norder=basisobj).

  • numeric a numeric vector specifying the knots for a B-spline basis, create.bspline.basis(basisobj).

  • NULL Defaults to create.bspline.basis(argvals).
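
A brief sketch exercising three of these options (illustrative data; fda attached):

abas <- seq(0, 1, length.out=21)
ybas <- abas*(1 - abas)
fd.null <- Data2fd(abas, ybas)                    # NULL: default B-spline basis
fd.int  <- Data2fd(abas, ybas, 6)                 # integer: B-splines of order 6
fd.knot <- Data2fd(abas, ybas, seq(0, 1, 0.25))   # numeric vector: knot sequence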

nderiv

The smoothing penalty, typically specified as an integer: the order of the derivative whose squared integral, weighted by lambda, defines the roughness penalty. By default, if basisobj[['type']] == 'bspline', the smoothing operator is int2Lfd(max(0, norder-2)).

A general linear differential operator can also be supplied.
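
For example, a harmonic acceleration operator is a common choice for periodic data; a sketch using the CanadianWeather and day.5 objects supplied with fda (the lambda value here is arbitrary):

harmLfd <- vec2Lfd(c(0, (2*pi/365)^2, 0), c(0, 365))  # harmonic acceleration
daybas  <- create.fourier.basis(c(0, 365), nbasis=65)
tempharm <- Data2fd(day.5, CanadianWeather$dailyAv[,,"Temperature.C"],
                    daybas, nderiv=harmLfd, lambda=0.01)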

lambda

weight on the smoothing operator specified by nderiv.

fdnames

Either a character vector of length 3 or a named list of length 3. In either case, the three elements correspond to the following:

  • argname the name of the argument, e.g. "time" or "age".

  • repname a description of the cases, e.g. "reps" or "weather stations".

  • value the name of the observed function value, e.g. "temperature".

If fdnames is a list, the components provide labels for the levels of the corresponding dimension of y.
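
A sketch of both forms (the labels are illustrative):

fdn.chr <- c("day", "station", "temperature")     # character vector of length 3
fdn.lst <- list(time="day",                       # named list of length 3; a vector
                station=paste("station", 1:5),    # component labels the levels of
                value="temperature")              # that dimension of y
afn <- seq(0, 1, length.out=9)
yfn <- matrix(rnorm(45), 9, 5)
fd.named <- Data2fd(afn, yfn, create.bspline.basis(c(0, 1), nbasis=5),
                    fdnames=fdn.lst)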

covariates

The observed values in y are assumed to be primarily determined by the height of the curve being estimated. From time to time, however, certain values may also be influenced by other known variables. For example, multi-year sets of climate variables may also be affected by the presence or absence of an El Niño event, or by a volcanic eruption. One or more such covariates can be supplied as an n by p matrix, where p is the number of covariates. When covariates are available, the smoothing is called "semi-parametric." Matrices or arrays of regression coefficients are then estimated that define the impact of each covariate on each curve and each variable.
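
A sketch with a single hypothetical indicator covariate (all names and values are illustrative):

acv <- seq(0, 1, length.out=9)
ycv <- matrix(rnorm(45), 9, 5)
zcv <- matrix(rep(c(0, 1), length.out=9), 9, 1)   # n by p = 9 by 1 indicator
fd.cov <- Data2fd(acv, ycv, create.bspline.basis(c(0, 1), nbasis=4),
                  covariates=zcv)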

method

By default the function uses the usual textbook equations for computing the coefficients of the basis function expansions. But, as in regression analysis, a price is paid in rounding error for such computations, since they involve cross-products of basis function values. Optionally, if method is set to the string "qr", the computation uses an algorithm based on the QR decomposition, which is more accurate but requires substantially more computing time when n is large, meaning more than 500 or so. The default is "chol", referring to the Cholesky decomposition of a symmetric positive definite matrix.
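
The two methods should agree to within rounding error; a quick sketch (illustrative data):

aqr <- seq(0, 1, length.out=101)
yqr <- sin(2*pi*aqr) + rnorm(101, sd=0.1)
bqr <- create.bspline.basis(c(0, 1), nbasis=23)
fd.chol <- Data2fd(aqr, yqr, bqr)                 # default: Cholesky
fd.qr   <- Data2fd(aqr, yqr, bqr, method="qr")    # slower, better conditioned
all.equal(fd.chol$coefs, fd.qr$coefs)             # expect near-equality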

dfscale

The generalized cross-validation or "gcv" criterion that is often used to determine the size of the smoothing parameter involves the subtraction of a measure of degrees of freedom from n. Chong Gu has argued that multiplying this degrees-of-freedom measure by a constant slightly greater than 1, such as 1.2, can produce better decisions about the level of smoothing. The default value is, however, 1.0.

Value

an object of the fd class containing:

  • coefs the coefficient array

  • basis a basis object

  • fdnames a list containing names for the arguments, function values and variables
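
These components can be inspected directly; a brief sketch:

fdobj <- Data2fd(seq(0, 1, length.out=9), rnorm(9))
dim(fdobj$coefs)    # nbasis by (number of curves)
fdobj$basis         # the basis object used in the expansion
fdobj$fdnames       # names for arguments, cases, and values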

Details

This function tends to be used in rather simple applications where there is no need to control the roughness of the resulting curve with any great finesse. The roughness is essentially controlled by how many basis functions are used. In more sophisticated applications, it would be better to use the function smooth.basisPar.
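
For comparison, a sketch of the finer control available through smooth.basisPar (the roughness penalty and lambda value here are arbitrary):

asb <- seq(0, 1, length.out=21)
ysb <- asb*(1 - asb) + rnorm(21, sd=0.02)
bsb <- create.bspline.basis(c(0, 1), nbasis=13)
fd.sb <- smooth.basisPar(asb, ysb, bsb, Lfdobj=2, lambda=1e-4)$fd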

References

Ramsay, James O., and Silverman, Bernard W. (2006), Functional Data Analysis, 2nd ed., Springer, New York.

Ramsay, James O., and Silverman, Bernard W. (2002), Applied Functional Data Analysis, Springer, New York.

See Also

smooth.basisPar, smooth.basis, project.basis, smooth.fd, smooth.monotone, smooth.pos, day.5

Examples

##
## Simplest possible example:  constant function
##
# 1 basis, order 1 = degree 0 = constant function
b1.1 <- create.bspline.basis(nbasis=1, norder=1)
# data values: 1 and 2, with a mean of 1.5
y12 <- 1:2
# smooth data, giving a constant function with value 1.5
fd1.1 <- Data2fd(y12, basisobj=b1.1)
plot(fd1.1)
# now repeat the analysis with some smoothing, which shrinks the
# estimate toward 0.
fd1.1.5 <- Data2fd(y12, basisobj=b1.1, lambda=0.5)
#  values of the smooth:
# fd1.1.5 = sum(y12)/(n+lambda*integral(over arg=0 to 1 of 1))
#         = 3 / (2+0.5) = 1.2
eval.fd(seq(0, 1, .2), fd1.1.5)
##
## step function smoothing
##
# 2 step basis functions: order 1 = degree 0 = step functions
b1.2 <- create.bspline.basis(nbasis=2, norder=1)
#  fit the data without smoothing
fd1.2 <- Data2fd(1:2, basisobj=b1.2)
# plot the result:  a step function equal to 1 on [0, 0.5), then 2 on [0.5, 1]
op <- par(mfrow=c(2,1))
plot(b1.2, main='bases')
plot(fd1.2, main='fit')
par(op)
##
## Simple oversmoothing
##
# 3 step basis functions: order 1 = degree 0 = step functions
b1.3 <- create.bspline.basis(nbasis=3, norder=1)
#  smooth the data with smoothing
fd1.3.5 <- Data2fd(y12, basisobj=b1.3, lambda=0.5)
#  plot the fit along with the points
plot(0:1, c(0, 2), type='n')
points(0:1, y12)
lines(fd1.3.5)
# Fit = penalized least squares with penalty
#       = lambda * integral(0:1 of basis^2),
#         which shrinks the fit towards 0.
# X1.3 = matrix(c(1,0, 0,0, 0,1), 2)
# XtX = crossprod(X1.3) = diag(c(1, 0, 1))
# penmat = diag(3)/3
#        = 3x3 matrix of integral(over arg=0:1 of basis[i]*basis[j])
# Xt.y = crossprod(X1.3, y12) = c(1, 0, 2)
# XtX + lambda*penmat = diag(c(7, 1, 7)/6)
# so coef(fd1.3.5) = solve(XtX + lambda*penmat, Xt.y)
#                  = c(6/7, 0, 12/7)
##
## linear spline fit
##
# 3 bases, order 2 = degree 1
b2.3 <- create.bspline.basis(norder=2, breaks=c(0, .5, 1))
# interpolate the values 0, 2, 1
fd2.3 <- Data2fd(c(0,2,1), basisobj=b2.3, lambda=0)
#  display the coefficients
round(fd2.3$coefs, 4)
# plot the results
op <- par(mfrow=c(2,1))
plot(b2.3, main='bases')
plot(fd2.3, main='fit')
par(op)
# apply some smoothing
fd2.3. <- Data2fd(c(0,2,1), basisobj=b2.3, lambda=1)
op <- par(mfrow=c(2,1))
plot(b2.3, main='bases')
plot(fd2.3., main='fit', ylim=c(0,2))
par(op)
all.equal(
 unclass(fd2.3)[-1], 
 unclass(fd2.3.)[-1])
##** CONCLUSION:  
##** The only differences between fd2.3 and fd2.3.
##** are the coefficients, as we would expect.  

##
## quadratic spline fit
##
# 4 bases, order 3 = degree 2 = continuous, bounded, locally quadratic
b3.4 <- create.bspline.basis(norder=3, breaks=c(0, .5, 1))
# fit values c(0,4,2,3) without smoothing (lambda=0)
fd3.4 <- Data2fd(c(0,4,2,3), basisobj=b3.4, lambda=0)
round(fd3.4$coefs, 4)
op <- par(mfrow=c(2,1))
plot(b3.4)
plot(fd3.4)
points(c(0,1/3,2/3,1), c(0,4,2,3))
par(op)
#  try smoothing
fd3.4. <- Data2fd(c(0,4,2,3), basisobj=b3.4, lambda=1)
round(fd3.4.$coef, 4)
op <- par(mfrow=c(2,1))
plot(b3.4)
plot(fd3.4., ylim=c(0,4))
points(seq(0,1,len=4), c(0,4,2,3))
par(op)
##
##  Two simple Fourier examples
##
gaitbasis3 <- create.fourier.basis(nbasis=5)
gaitfd3    <- Data2fd(gait, basisobj=gaitbasis3)
# plotfit.fd(gait, seq(0,1,len=20), gaitfd3)
#    set up the fourier basis
daybasis <- create.fourier.basis(c(0, 365), nbasis=65)
#  Make a temperature fd object
#  Daily average temperatures for the Canadian weather stations are in
#  the 365 by 35 matrix CanadianWeather$dailyAv[,,"Temperature.C"]
#    See analyses of the weather data.
tempfd <- Data2fd(CanadianWeather$dailyAv[,,"Temperature.C"],
                  day.5, daybasis)
#  (y is given first here; Data2fd detects the reversed order,
#   swaps the two arguments, and issues a warning)
#  plot the temperature curves
par(mfrow=c(1,1))
plot(tempfd)
##
## argvals of class Date and POSIXct
##
#  These classes of time can generate very large numbers when converted to 
#  numeric vectors.  For basis systems such as polynomials or splines,
#  severe rounding error issues can arise if the time interval for the 
#  data is very large.  To offset this, it is best to normalize the
#  numeric version of the data before analyzing them.
#  Date class time unit is one day, divide by 365.25.
invasion1 <- as.Date('1775-09-04')
invasion2 <- as.Date('1812-07-12')
earlyUS.Canada <- as.numeric(c(invasion1, invasion2))/365.25
BspInvasion <- create.bspline.basis(earlyUS.Canada)
earlyYears  <- seq(invasion1, invasion2, length.out=7)
earlyQuad   <- (as.numeric(earlyYears-invasion1)/365.25)^2
earlyYears  <- as.numeric(earlyYears)/365.25
fitQuad <- Data2fd(earlyYears, earlyQuad, BspInvasion)
# POSIXct: time unit is one second, divide by 365.25*24*60*60
rescale     <- 365.25*24*60*60
AmRev.ct    <- as.POSIXct1970(c('1776-07-04', '1789-04-30'))
BspRev.ct   <- create.bspline.basis(as.numeric(AmRev.ct)/rescale)
AmRevYrs.ct <- seq(AmRev.ct[1], AmRev.ct[2], length.out=14)
AmRevLin.ct <- as.numeric(AmRevYrs.ct-AmRev.ct[1])
AmRevYrs.ct <- as.numeric(AmRevYrs.ct)/rescale
AmRevLin.ct <- as.numeric(AmRevLin.ct)/rescale
fitLin.ct   <- Data2fd(AmRevYrs.ct, AmRevLin.ct, BspRev.ct)
