crs (version 0.15-33)

# crs: Categorical Regression Splines

## Description

`crs` computes a regression spline estimate of a one (1) dimensional dependent variable on an `r`-dimensional vector of continuous and categorical (`factor`/`ordered`) predictors (Ma and Racine (2013), Ma, Racine and Yang (2015)).

## Usage

```crs(…)
# S3 method for default
crs(xz,
y,
degree = NULL,
segments = NULL,
include = NULL,
kernel = TRUE,
lambda = NULL,
complexity = c("degree-knots","degree","knots"),
knots = c("quantiles","uniform","auto"),
basis = c("auto","additive","tensor","glp"),
deriv = 0,
data.return = FALSE,
prune = FALSE,
model.return = FALSE,
tau = NULL,
weights = NULL,
…)# S3 method for formula
crs(formula,
data = list(),
degree = NULL,
segments = NULL,
include = NULL,
degree.max = 10,
segments.max = 10,
degree.min = 0,
segments.min = 1,
cv.df.min = 1,
cv = c("nomad","exhaustive","none"),
cv.threshold = 1000,
cv.func = c("cv.ls","cv.gcv","cv.aic"),
kernel = TRUE,
lambda = NULL,
lambda.discrete = FALSE,
lambda.discrete.num = 100,
complexity = c("degree-knots","degree","knots"),
knots = c("quantiles","uniform","auto"),
basis = c("auto","additive","tensor","glp"),
deriv = 0,
data.return = FALSE,
prune = FALSE,
model.return = FALSE,
restarts = 0,
random.seed = 42,
max.bb.eval = 10000,
initial.mesh.size.real = "r1.0e-01",
initial.mesh.size.integer = "1",
min.mesh.size.real = paste(sqrt(.Machine\$double.eps)),
min.mesh.size.integer = 1,
min.poll.size.real = 1,
min.poll.size.integer = 1,
opts=list(),
nmulti = 5,
tau = NULL,
weights = NULL,
singular.ok = FALSE,
…)```

## Arguments

xz

numeric (`x`) and or nominal/ordinal (`factor`/`ordered`) predictors (`z`)

y

a numeric vector of responses.

degree

integer/vector specifying the polynomial degree of the B-spline basis for each dimension of the continuous `x` (default `degree=3`, i.e. cubic spline)

segments

integer/vector specifying the number of segments of the B-spline basis for each dimension of the continuous `x` (i.e. number of knots minus one) (default `segments=1`, i.e. Bezier curve)

include

integer/vector specifying whether each of the nominal/ordinal (`factor`/`ordered`) predictors in `x` are included or omitted from the resulting estimate

lambda

a vector of bandwidths for each dimension of the categorical `z`

lambda.discrete

if `lambda.discrete=TRUE`, the bandwidth will be discretized into `lambda.discrete.num+1` points and `lambda` will be chosen from these points

lambda.discrete.num

a positive integer indicating the number of discrete values that lambda can assume - this parameter will only be used when `lambda.discrete=TRUE`

formula

a symbolic description of the model to be fit

data

an optional data frame containing the variables in the model

cv

a character string (default `cv="nomad"`) indicating whether to use nonsmooth mesh adaptive direct search, exhaustive search, or no search (i.e. use user supplied values for `degree`, `segments`, and `lambda`)

cv.threshold

an integer (default `cv.threshold=1000`) for simple problems with no categorical predictors (i.e. `kernel=FALSE` otherwise `optim`/`snomadr` search is necessary) such that, if the number of combinations of `degree`/`segments` is less than the threshold and `cv="nomad"`, instead use exhaustive search (`cv="exhaustive"`)

cv.func

a character string (default `cv.func="cv.ls"`) indicating which method to use to select smoothing parameters. `cv.gcv` specifies generalized cross-validation (Craven and Wahba (1979)), `cv.aic` specifies expected Kullback-Leibler cross-validation (Hurvich, Simonoff, and Tsai (1998)), and `cv.ls` specifies least-squares cross-validation

kernel

a logical value (default `kernel=TRUE`) indicating whether to use kernel smoothing or not

degree.max

the maximum degree of the B-spline basis for each of the continuous predictors (default `degree.max=10`)

segments.max

the maximum segments of the B-spline basis for each of the continuous predictors (default `segments.max=10`)

degree.min

the minimum degree of the B-spline basis for each of the continuous predictors (default `degree.min=0`)

segments.min

the minimum segments of the B-spline basis for each of the continuous predictors (default `segments.min=1`)

cv.df.min

the minimum degrees of freedom to allow when conducting NOMAD-based cross-validation (default `cv.df.min=1`)

complexity

a character string (default `complexity="degree-knots"`) indicating whether model ‘complexity’ is determined by the degree of the spline or by the number of segments (i.e. number of knots minus one). This option allows the user to use cross-validation to select either the spline degree (number of knots held fixed) or the number of knots (spline degree held fixed) or both the spline degree and number of knots

For the continuous predictors the regression spline model employs either the additive or tensor product B-spline basis matrix for a multivariate polynomial spline via the B-spline routines in the GNU Scientific Library (https://www.gnu.org/software/gsl/) and the `tensor.prod.model.matrix` function

knots

a character string (default `knots="quantiles"`) specifying where knots are to be placed. ‘quantiles’ specifies knots placed at equally spaced quantiles (equal number of observations lie in each segment) and ‘uniform’ specifies knots placed at equally spaced intervals. If `knots="auto"`, the knot type will be automatically determined by cross-validation

basis

a character string (default `basis="auto"`) indicating whether the additive or tensor product B-spline basis matrix for a multivariate polynomial spline or generalized B-spline polynomial basis should be used. Note this can be automatically determined by cross-validation if `cv="nomad"` or `cv="exhaustive"` and `basis="auto"`, and is an ‘all or none’ proposition (i.e. interaction terms for all predictors or for no predictors given the nature of ‘tensor products’). Note also that if there is only one predictor this defaults to `basis="additive"` to avoid unnecessary computation as the spline bases are equivalent in this case

deriv

an integer `l` (default `deriv=0`) specifying whether to compute the univariate `l`th partial derivative for each continuous predictor (and difference in levels for each categorical predictor) or not and if so what order. Note that if `deriv` is higher than the spline degree of the associated continuous predictor then the derivative will be zero and a warning issued to this effect

data.return

a logical value indicating whether to return `x,z,y` or not (default `data.return=FALSE`)

prune

a logical value (default `prune=FALSE`) specifying whether the (final) model is to be ‘pruned’ using a stepwise cross-validation criterion based upon a modified version of `stepAIC` (see below for details)

model.return

a logical value indicating whether to return the list of `lm` models or not when `kernel=TRUE` (default `model.return=FALSE`)

restarts

integer specifying the number of times to restart the process of finding extrema of the cross-validation function (for the bandwidths only) from different (random) initial points

random.seed

when it is not missing and not equal to 0, the initial points will be generated using this seed when using `frscvNOMAD` or `krscvNOMAD` and `nmulti > 0`

max.bb.eval

argument passed to the NOMAD solver (see `snomadr` for further details)

initial.mesh.size.real

argument passed to the NOMAD solver (see `snomadr` for further details)

initial.mesh.size.integer

argument passed to the NOMAD solver (see `snomadr` for further details)

min.mesh.size.real

argument passed to the NOMAD solver (see `snomadr` for further details)

min.mesh.size.integer

arguments passed to the NOMAD solver (see `snomadr` for further details)

min.poll.size.real

arguments passed to the NOMAD solver (see `snomadr` for further details)

min.poll.size.integer

arguments passed to the NOMAD solver (see `snomadr` for further details)

opts

list of optional arguments to be passed to `snomadr`

nmulti

integer number of times to restart the process of finding extrema of the cross-validation function from different (random) initial points (default `nmulti=5`)

tau

if non-null a number in (0,1) denoting the quantile for which a quantile regression spline is to be estimated rather than estimating the conditional mean (default `tau=NULL`). Criterion function set by `cv.func=` are modified accordingly to admit quantile regression.

weights

an optional vector of weights to be used in the fitting process. Should be ‘NULL’ or a numeric vector. If non-NULL, weighted least squares is used with weights ‘weights’ (that is, minimizing ‘sum(w*e^2)’); otherwise ordinary least squares is used.

singular.ok

a logical value (default `singular.ok=FALSE`) that, when `FALSE`, discards singular bases during cross-validation (a check for ill-conditioned bases is performed).

optional arguments

## Value

`crs` returns a `crs` object. The generic functions `fitted` and `residuals` extract (or generate) estimated values and residuals. Furthermore, the functions `summary`, `predict`, and `plot` (options `mean=FALSE`, `deriv=i` where \(i\) is an integer, `ci=FALSE`, `persp.rgl=FALSE`, `plot.behavior=c("plot","plot-data","data")`, `xtrim=0.0`,`xq=0.5`) support objects of this type. The returned object has the following components:

fitted.values

estimates of the regression function (conditional mean) at the sample points or evaluation points

lwr,upr

lower/upper bound for a 95% confidence interval for the `fitted.values` (conditional mean) obtained from `predict.lm` via the argument `interval="confidence"`

residuals

residuals computed at the sample points or evaluation points

degree

integer/vector specifying the degree of the B-spline basis for each dimension of the continuous `x`

segments

integer/vector specifying the number of segments of the B-spline basis for each dimension of the continuous `x`

include

integer/vector specifying whether each of the nominal/ordinal (`factor`/`ordered`) predictors `z` are included or omitted from the resulting estimate if `kernel=FALSE` (see below)

kernel

a logical value indicating whether kernel smoothing was used (`kernel=TRUE`) or not

lambda

vector of bandwidths used if `kernel=TRUE`

call

a symbolic description of the model

r.squared

coefficient of determination (Doksum and Samarov (1995))

model.lm

an object of ‘`class`’ ‘`lm`’ if `kernel=FALSE` or a list of objects of ‘`class`’ ‘`lm`’ if `kernel=TRUE` (accessed by `model.lm[]`, `model.lm[]`,…,. By way of example, if `foo` is a `crs` object and `kernel=FALSE`, then `foo\$model.lm` is an object of ‘`class`’ ‘`lm`’, while objects of ‘`class`’ ‘`lm`’ return the `model.frame` in `model.lm\$model` which can be accessed via `foo\$model.lm\$model` where `foo` is the `crs` object (the model frame `foo\$model.lm\$model` contains the B-spline bases underlying the estimate which might be of interest). Again by way of example, when `kernel=TRUE` then `foo\$model.lm[]\$model` contains the model frame for the first unique combination of categorical predictors, `foo\$model.lm[]\$model` the second and so forth (the weights will potentially differ for each model depending on the value(s) of `lambda`)

deriv.mat

a matrix of derivatives (or differences in levels for the categorical `z`) whose order is determined by `deriv=` in the `crs` call

deriv.mat.lwr

a matrix of 95% coverage lower bounds for `deriv.mat`

deriv.mat.upr

a matrix of 95% coverage upper bounds for `deriv.mat`

hatvalues

the `hatvalues` for the estimated model

P.hat

the kernel probability estimates corresponding to the categorical predictors in the estimated model

## Usage Issues

Note that when `kernel=FALSE` `summary` supports the option `sigtest=TRUE` that conducts an F-test for significance for each predictor.

## Details

Typical usages are (see below for a list of options and also the examples at the end of this help file)

```    ## Estimate the model and let the basis type be determined by
## cross-validation (i.e. cross-validation will determine whether to
## use the additive, generalized, or tensor product basis)model <- crs(y~x1+x2)## Estimate the model for a specified degree/segment/bandwidth
## combination and do not run cross-validation (will use the
## additive basis by default)model <- crs(y~x1+factor(x2),cv="none",degree=3,segments=1,lambda=.1)## Plot the mean and (asymptotic) error boundsplot(model,mean=TRUE,ci=TRUE)## Plot the first partial derivative and (asymptotic) error boundsplot(model,deriv=1,ci=TRUE)
```

`crs` computes a regression spline estimate of a one (1) dimensional dependent variable on an `r`-dimensional vector of continuous and categorical (`factor`/`ordered`) predictors.

The regression spline model employs the tensor product B-spline basis matrix for a multivariate polynomial spline via the B-spline routines in the GNU Scientific Library (https://www.gnu.org/software/gsl/) and the `tensor.prod.model.matrix` function.

When `basis="additive"` the model becomes additive in nature (i.e. no interaction/tensor terms thus semiparametric not fully nonparametric).

When `basis="tensor"` the model uses the multivariate tensor product basis.

When `kernel=FALSE` the model uses indicator basis functions for the nominal/ordinal (`factor`/`ordered`) predictors rather than kernel weighting.

When `kernel=TRUE` the product kernel function for the discrete predictors is of the ‘Li-Racine’ type (see Li and Racine (2007) for details).

When `cv="nomad"`, numerical search is undertaken using Nonsmooth Optimization by Mesh Adaptive Direct Search (Abramson, Audet, Couture, Dennis, Jr., and Le Digabel (2011)).

When `kernel=TRUE` and `cv="exhaustive"`, numerical search is undertaken using `optim` and the box-constrained `L-BFGS-B` method (see `optim` for details). The user may restart the algorithm as many times as desired via the `restarts` argument (default `restarts=0`). The approach ascends from `degree=0` (or `segments=0`) through `degree.max` and for each value of `degree` (or `segments`) searches for the optimal bandwidths. After the most complex model has been searched then the optimal `degree`/`segments`/`lambda` combination is selected. If any element of the optimal `degree` (or `segments`) vector coincides with `degree.max` (or `segments.max`) a warning is produced and the user ought to restart their search with a larger value of `degree.max` (or `segments.max`).

Note that the default `plot` method for a `crs` object provides some diagnostic measures, in particular, a) residuals versus fitted values (useful for checking the assumption that `E(u|x)=0`), b) a normal quantile-quantile plot which allows residuals to be assessed for normality (`qqnorm`), c) a scale-location plot that is useful for checking the assumption that the errors are iid and, in particular, that the variance is homogeneous, and d) ‘Cook's distance’ which computes the single-case influence function. See below for other arguments for the plot function for a `crs` object.

Note that setting `prune=TRUE` produces a final ‘pruning’ of the model via a stepwise cross-validation criterion achieved by modifying `stepAIC` and replacing `extractAIC` with `extractCV` throughout the function. This option may be enabled to remove potentially superfluous bases thereby improving the finite-sample efficiency of the resulting model. Note that if the cross-validation score for the pruned model is no better than that for the original model then the original model is returned with a warning to this effect. Note also that this option can only be used when `kernel=FALSE`.

## References

Abramson, M.A. and C. Audet and G. Couture and J.E. Dennis Jr. and and S. Le Digabel (2011), “The NOMAD project”. Software available at https://www.gerad.ca/nomad.

Craven, P. and G. Wahba (1979), “Smoothing Noisy Data With Spline Functions,” Numerische Mathematik, 13, 377-403.

Doksum, K. and A. Samarov (1995), “Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression,” The Annals of Statistics, 23 1443-1473.

Hurvich, C.M. and J.S. Simonoff and C.L. Tsai (1998), “Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion,” Journal of the Royal Statistical Society B, 60, 271-293.

Le Digabel, S. (2011), “Algorithm 909: NOMAD: Nonlinear Optimization With The MADS Algorithm”. ACM Transactions on Mathematical Software, 37(4):44:1-44:15.

Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.

Ma, S. and J.S. Racine and L. Yang (2015), “Spline Regression in the Presence of Categorical Predictors,” Journal of Applied Econometrics, Volume 30, 705-717.

Ma, S. and J.S. Racine (2013), “Additive Regression Splines with Irrelevant Categorical and Continuous Regressors,” Statistica Sinica, Volume 23, 515-541.

Racine, J.S. (2011), “Cross-Validated Quantile Regression Splines,” manuscript.

## See Also

`smooth.spline`, `loess`, `npreg`

## Examples

```# NOT RUN {
set.seed(42)
## Example - simulated data
n <- 1000
num.eval <- 50
x1 <- runif(n)
x2 <- runif(n)
z <- rbinom(n,1,.5)
dgp <- cos(2*pi*x1)+sin(2*pi*x2)+z
z <- factor(z)
y <- dgp + rnorm(n,sd=.5)

## Estimate a model with specified degree, segments, and bandwidth
model <- crs(y~x1+x2+z,degree=c(5,5),
segments=c(1,1),
lambda=0.1,
cv="none",
kernel=TRUE)
summary(model)

## Perspective plot
x1.seq <- seq(min(x1),max(x1),length=num.eval)
x2.seq <- seq(min(x2),max(x2),length=num.eval)
x.grid <- expand.grid(x1.seq,x2.seq)
newdata <- data.frame(x1=x.grid[,1],x2=x.grid[,2],
z=factor(rep(0,num.eval**2),levels=c(0,1)))
z0 <- matrix(predict(model,newdata=newdata),num.eval,num.eval)
newdata <- data.frame(x1=x.grid[,1],x2=x.grid[,2],
z=factor(rep(1,num.eval),levels=c(0,1)))
z1 <- matrix(predict(model,newdata=newdata),num.eval,num.eval)
zlim=c(min(z0,z1),max(z0,z1))
persp(x=x1.seq,y=x2.seq,z=z0,
xlab="x1",ylab="x2",zlab="y",zlim=zlim,
ticktype="detailed",
border="red",
theta=45,phi=45)
par(new=TRUE)
persp(x=x1.seq,y=x2.seq,z=z1,
xlab="x1",ylab="x2",zlab="y",zlim=zlim,
theta=45,phi=45,
ticktype="detailed",
border="blue")

## Partial regression surface plot
plot(model,mean=TRUE,ci=TRUE)
# }
# NOT RUN {
## A plot example where we extract the partial surfaces, confidence
## intervals etc. automatically generated by plot(mean=TRUE,...) but do
## not plot, rather save for separate use.
pdat <- plot(model,mean=TRUE,ci=TRUE,plot.behavior="data")

## Column 1 is the (evaluation) predictor ([,1]), 2-4 ([,-1]) the mean,
## lwr, and upr (note the returned value is a 'list' hence pdat[] is
## data for the first predictor, pdat[] the second etc). Note that
## matplot() can plot this nicely.
matplot(pdat[][,1],pdat[][,-1],
xlab=names(pdat[]),ylab=names(pdat[]),
lty=c(1,2,2),col=c(1,2,2),type="l")
# }
```