# areg

##### Additive Regression with Optimal Transformations on Both Sides using Canonical Variates

Expands continuous variables into restricted cubic spline bases and
categorical variables into dummy variables and fits a multivariate
equation using canonical variates. This finds optimum transformations
that maximize \(R^2\). Optionally, the bootstrap is used to estimate
the covariance matrix of both left- and right-hand-side transformation
parameters, and to estimate the bias in the \(R^2\) due to overfitting
and compute the bootstrap optimism-corrected \(R^2\).
Cross-validation can also be used to get an unbiased estimate of
\(R^2\) but this is not as precise as the bootstrap estimate. The
bootstrap and cross-validation may also used to get estimates of mean
and median absolute error in predicted values on the original `y`

scale. These two estimates are perhaps the best ones for gauging the
accuracy of a flexible model, because it is difficult to compare
\(R^2\) under different y-transformations, and because \(R^2\)
allows for an out-of-sample recalibration (i.e., it only measures
relative errors).

Note that uncertainty about the proper transformation of `y`

causes
an enormous amount of model uncertainty. When the transformation for
`y`

is estimated from the data a high variance in predicted values
on the original `y`

scale may result, especially if the true
transformation is linear. Comparing bootstrap or cross-validated mean
absolute errors with and without restricted the `y`

transform to be
linear (`ytype='l'`

) may help the analyst choose the proper model
complexity.

- Keywords
- multivariate, models, regression, smooth

##### Usage

```
areg(x, y, xtype = NULL, ytype = NULL, nk = 4,
B = 0, na.rm = TRUE, tolerance = NULL, crossval = NULL)
```# S3 method for areg
print(x, digits=4, …)

# S3 method for areg
plot(x, whichx = 1:ncol(x$x), …)

# S3 method for areg
predict(object, x, type=c('lp','fitted','x'),
what=c('all','sample'), …)

##### Arguments

- x
A single predictor or a matrix of predictors. Categorical predictors are required to be coded as integers (as

`factor`

does internally). For`predict`

,`x`

is a data matrix with the same integer codes that were originally used for categorical variables.- y
a

`factor`

, categorical, character, or numeric response variable- xtype
a vector of one-letter character codes specifying how each predictor is to be modeled, in order of columns of

`x`

. The codes are`"s"`

for smooth function (using restricted cubic splines),`"l"`

for no transformation (linear), or`"c"`

for categorical (to cause expansion into dummy variables). Default is`"s"`

if`nk > 0`

and`"l"`

if`nk=0`

.- ytype
same coding as for

`xtype`

. Default is`"s"`

for a numeric variable with more than two unique values,`"l"`

for a binary numeric variable, and`"c"`

for a factor, categorical, or character variable.- nk
number of knots, 0 for linear, or 3 or more. Default is 4 which will fit 3 parameters to continuous variables (one linear term and two nonlinear terms)

- B
number of bootstrap resamples used to estimate covariance matrices of transformation parameters. Default is no bootstrapping.

- na.rm
set to

`FALSE`

if you are sure that observations with`NA`

s have already been removed- tolerance
singularity tolerance. List source code for

`lm.fit.qr.bare`

for details.- crossval
set to a positive integer k to compute k-fold cross-validated R-squared (square of first canonical correlation) and mean and median absolute error of predictions on the original scale

- digits
number of digits to use in formatting for printing

- object
an object created by

`areg`

- whichx
integer or character vector specifying which predictors are to have their transformations plotted (default is all). The

`y`

transformation is always plotted.- type
tells

`predict`

whether to obtain predicted untransformed`y`

(`type='lp'`

, the default) or predicted`y`

on the original scale (`type='fitted'`

), or the design matrix for the right-hand side (`type='x'`

).- what
When the

`y`

-transform is non-monotonic you may specify`what='sample'`

to`predict`

to obtain a random sample of`y`

values on the original scale instead of a matrix of all`y`

-inverses. See`inverseFunction`

.- …
arguments passed to the plot function.

##### Details

`areg`

is a competitor of `ace`

in the `acepack`

package. Transformations from `ace`

are seldom smooth enough and
are often overfitted. With `areg`

the complexity can be controlled
with the `nk`

parameter, and predicted values are easy to obtain
because parametric functions are fitted.

If one side of the equation has a categorical variable with more than two categories and the other side has a continuous variable not assumed to act linearly, larger sample sizes are needed to reliably estimate transformations, as it is difficult to optimally score categorical variables to maximize \(R^2\) against a simultaneously optimally transformed continuous variable.

##### Value

a list of class `"areg"`

containing many objects

##### References

Breiman and Friedman, Journal of the American Statistical Association (September, 1985).

##### See Also

##### Examples

```
# NOT RUN {
set.seed(1)
ns <- c(30,300,3000)
for(n in ns) {
y <- sample(1:5, n, TRUE)
x <- abs(y-3) + runif(n)
par(mfrow=c(3,4))
for(k in c(0,3:5)) {
z <- areg(x, y, ytype='c', nk=k)
plot(x, z$tx)
title(paste('R2=',format(z$rsquared)))
tapply(z$ty, y, range)
a <- tapply(x,y,mean)
b <- tapply(z$ty,y,mean)
plot(a,b)
abline(lsfit(a,b))
# Should get same result to within linear transformation if reverse x and y
w <- areg(y, x, xtype='c', nk=k)
plot(z$ty, w$tx)
title(paste('R2=',format(w$rsquared)))
abline(lsfit(z$ty, w$tx))
}
}
par(mfrow=c(2,2))
# Example where one category in y differs from others but only in variance of x
n <- 50
y <- sample(1:5,n,TRUE)
x <- rnorm(n)
x[y==1] <- rnorm(sum(y==1), 0, 5)
z <- areg(x,y,xtype='l',ytype='c')
z
plot(z)
z <- areg(x,y,ytype='c')
z
plot(z)
# }
# NOT RUN {
# Examine overfitting when true transformations are linear
par(mfrow=c(4,3))
for(n in c(200,2000)) {
x <- rnorm(n); y <- rnorm(n) + x
for(nk in c(0,3,5)) {
z <- areg(x, y, nk=nk, crossval=10, B=100)
print(z)
plot(z)
title(paste('n=',n))
}
}
par(mfrow=c(1,1))
# Underfitting when true transformation is quadratic but overfitting
# when y is allowed to be transformed
set.seed(49)
n <- 200
x <- rnorm(n); y <- rnorm(n) + .5*x^2
#areg(x, y, nk=0, crossval=10, B=100)
#areg(x, y, nk=4, ytype='l', crossval=10, B=100)
z <- areg(x, y, nk=4) #, crossval=10, B=100)
z
# Plot x vs. predicted value on original scale. Since y-transform is
# not monotonic, there are multiple y-inverses
xx <- seq(-3.5,3.5,length=1000)
yhat <- predict(z, xx, type='fitted')
plot(x, y, xlim=c(-3.5,3.5))
for(j in 1:ncol(yhat)) lines(xx, yhat[,j], col=j)
# Plot a random sample of possible y inverses
yhats <- predict(z, xx, type='fitted', what='sample')
points(xx, yhats, pch=2)
# }
# NOT RUN {
# True transformation of x1 is quadratic, y is linear
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); y <- rnorm(n) + x1^2
z <- areg(cbind(x1,x2),y,xtype=c('s','l'),nk=3)
par(mfrow=c(2,2))
plot(z)
# y transformation is inverse quadratic but areg gets the same answer by
# making x1 quadratic
n <- 5000
x1 <- rnorm(n); x2 <- rnorm(n); y <- (x1 + rnorm(n))^2
z <- areg(cbind(x1,x2),y,nk=5)
par(mfrow=c(2,2))
plot(z)
# Overfit 20 predictors when no true relationships exist
n <- 1000
x <- matrix(runif(n*20),n,20)
y <- rnorm(n)
z <- areg(x, y, nk=5) # add crossval=4 to expose the problem
# Test predict function
n <- 50
x <- rnorm(n)
y <- rnorm(n) + x
g <- sample(1:3, n, TRUE)
z <- areg(cbind(x,g),y,xtype=c('s','c'))
range(predict(z, cbind(x,g)) - z$linear.predictors)
# }
```

*Documentation reproduced from package Hmisc, version 4.3-1, License: GPL (>= 2)*