# transcan

##### Transformations/Imputations using Canonical Variates

`transcan`

is a nonlinear additive transformation and imputation
function, and there are several functions for using and operating on
its results. `transcan`

automatically transforms continuous and
categorical variables to have maximum correlation with the best linear
combination of the other variables. There is also an option to use a
substitute criterion - maximum correlation with the first principal
component of the other variables. Continuous variables are expanded
as restricted cubic splines and categorical variables are expanded as
contrasts (e.g., dummy variables). By default, the first canonical
variate is used to find optimum linear combinations of component
columns. This function is similar to `ace`

except that
transformations for continuous variables are fitted using restricted
cubic splines, monotonicity restrictions are not allowed, and
`NA`

s are allowed. When a variable has any `NA`

s,
transformed scores for that variable are imputed using least squares
multiple regression incorporating optimum transformations, or
`NA`

s are optionally set to constants. Shrinkage can be used to
safeguard against overfitting when imputing. Optionally, imputed
values on the original scale are also computed and returned. For this
purpose, recursive partitioning or multinomial logistic models can
optionally be used to impute categorical variables, using what is
predicted to be the most probable category.

By default, `transcan`

imputes `NA`

s with “best
guess” expected values of transformed variables, back transformed to
the original scale. Values thus imputed are most like conditional
medians assuming the transformations make variables' distributions
symmetric (imputed values are similar to conditionl modes for
categorical variables). By instead specifying `n.impute`

,
`transcan`

does approximate multiple imputation from the
distribution of each variable conditional on all other variables.
This is done by sampling `n.impute`

residuals from the
transformed variable, with replacement (a la bootstrapping), or by
default, using Rubin's approximate Bayesian bootstrap, where a sample
of size `n` with replacement is selected from the residuals on
`n` non-missing values of the target variable, and then a sample
of size `m` with replacement is chosen from this sample, where
`m` is the number of missing values needing imputation for the
current multiple imputation repetition. Neither of these bootstrap
procedures assume normality or even symmetry of residuals. For
sometimes-missing categorical variables, optimal scores are computed
by adding the “best guess” predicted mean score to random
residuals off this score. Then categories having scores closest to
these predicted scores are taken as the random multiple imputations
(`impcat = "rpart"`

is not currently allowed
with `n.impute`

). The literature recommends using ```
n.impute
= 5
```

or greater. `transcan`

provides only an approximation to
multiple imputation, especially since it “freezes” the
imputation model before drawing the multiple imputations rather than
using different estimates of regression coefficients for each
imputation. For multiple imputation, the `aregImpute`

function
provides a much better approximation to the full Bayesian approach
while still not requiring linearity assumptions.

When you specify `n.impute`

to `transcan`

you can use
`fit.mult.impute`

to re-fit any model `n.impute`

times based
on `n.impute`

completed datasets (if there are any sometimes
missing variables not specified to `transcan`

, some observations
will still be dropped from these fits). After fitting `n.impute`

models, `fit.mult.impute`

will return the fit object from the
last imputation, with `coefficients`

replaced by the average of
the `n.impute`

coefficient vectors and with a component
`var`

equal to the imputation-corrected variance-covariance
matrix. `fit.mult.impute`

can also use the object created by the
`mice`

function in the mice library to draw the
multiple imputations, as well as objects created by
`aregImpute`

. The following components of fit objects are
also replaced with averages over the `n.impute`

model fits:
`linear.predictors`

, `fitted.values`

, `stats`

,
`means`

, `icoef`

, `scale`

, `center`

,
`y.imputed`

.

The `summary`

method for `transcan`

prints the function
call, \(R^2\) achieved in transforming each variable, and for each
variable the coefficients of all other transformed variables that are
used to estimate the transformation of the initial variable. If
`imputed=TRUE`

was used in the call to transcan, also uses the
`describe`

function to print a summary of imputed values. If
`long = TRUE`

, also prints all imputed values with observation
identifiers. There is also a simple function `print.transcan`

which merely prints the transformation matrix and the function call.
It has an optional argument `long`

, which if set to `TRUE`

causes detailed parameters to be printed. Instead of plotting while
`transcan`

is running, you can plot the final transformations
after the fact using `plot.transcan`

or `ggplot.transcan`

,
if the option `trantab = TRUE`

was specified to `transcan`

.
If in addition the option
`imputed = TRUE`

was specified to `transcan`

,
`plot`

and `ggplot`

will show the location of imputed values
(including multiples) along the axes. For `ggplot`

, imputed
values are shown as red plus signs.

`impute`

method for `transcan`

does imputations for a
selected original data variable, on the original scale (if
`imputed=TRUE`

was given to `transcan`

). If you do not
specify a variable to `impute`

, it will do imputations for all
variables given to `transcan`

which had at least one missing
value. This assumes that the original variables are accessible (i.e.,
they have been attached) and that you want the imputed variables to
have the same names are the original variables. If `n.impute`

was
specified to `transcan`

you must tell `impute`

which
`imputation`

to use. Results are stored in `.GlobalEnv`

when `list.out`

is not specified (it is recommended to use
`list.out=TRUE`

).

The `predict`

method for `transcan`

computes
predicted variables and imputed values from a matrix of new data.
This matrix should have the same column variables as the original
matrix used with `transcan`

, and in the same order (unless a
formula was used with `transcan`

).

The `Function`

function is a generic function
generator. `Function.transcan`

creates R functions to transform
variables using transformations created by `transcan`

. These
functions are useful for getting predicted values with predictors set
to values on the original scale.

The `vcov`

methods are defined here so that
imputation-corrected variance-covariance matrices are readily
extracted from `fit.mult.impute`

objects, and so that
`fit.mult.impute`

can easily compute traditional covariance
matrices for individual completed datasets.

The subscript method for `transcan`

preserves attributes.

The `invertTabulated`

function does either inverse linear
interpolation or uses sampling to sample qualifying x-values having
y-values near the desired values. The latter is used to get inverse
values having a reasonable distribution (e.g., no floor or ceiling
effects) when the transformation has a flat or nearly flat segment,
resulting in a many-to-one transformation in that region. Sampling
weights are a combination of the frequency of occurrence of x-values
that are within `tolInverse`

times the range of `y`

and the
squared distance between the associated y-values and the target
y-value (`aty`

).

- Keywords
- multivariate, models, methods, regression, smooth

##### Usage

```
transcan(x, method=c("canonical","pc"),
categorical=NULL, asis=NULL, nk, imputed=FALSE, n.impute,
boot.method=c('approximate bayesian', 'simple'),
trantab=FALSE, transformed=FALSE,
impcat=c("score", "multinom", "rpart"),
mincut=40,
inverse=c('linearInterp','sample'), tolInverse=.05,
pr=TRUE, pl=TRUE, allpl=FALSE, show.na=TRUE,
imputed.actual=c('none','datadensity','hist','qq','ecdf'),
iter.max=50, eps=.1, curtail=TRUE,
imp.con=FALSE, shrink=FALSE, init.cat="mode",
nres=if(boot.method=='simple')200 else 400,
data, subset, na.action, treeinfo=FALSE,
rhsImp=c('mean','random'), details.impcat='', …)
```# S3 method for transcan
summary(object, long=FALSE, digits=6, …)

# S3 method for transcan
print(x, long=FALSE, …)

# S3 method for transcan
plot(x, …)

# S3 method for transcan
ggplot(data, mapping, scale=FALSE, …, environment)

# S3 method for transcan
impute(x, var, imputation, name, pos.in, data,
list.out=FALSE, pr=TRUE, check=TRUE, …)

fit.mult.impute(formula, fitter, xtrans, data, n.impute, fit.reps=FALSE,
dtrans, derived, vcovOpts=NULL, pr=TRUE, subset, …)

# S3 method for transcan
predict(object, newdata, iter.max=50, eps=0.01, curtail=TRUE,
type=c("transformed","original"),
inverse, tolInverse, check=FALSE, …)

Function(object, …)

# S3 method for transcan
Function(object, prefix=".", suffix="", pos=-1, …)

invertTabulated(x, y, freq=rep(1,length(x)),
aty, name='value',
inverse=c('linearInterp','sample'),
tolInverse=0.05, rule=2)

# S3 method for default
vcov(object, regcoef.only=FALSE, …)

# S3 method for fit.mult.impute
vcov(object, regcoef.only=TRUE,
intercepts='mid', …)

##### Arguments

- x
a matrix containing continuous variable values and codes for categorical variables. The matrix must have column names (

`dimnames`

). If row names are present, they are used in forming the`names`

attribute of imputed values if`imputed = TRUE`

.`x`

may also be a formula, in which case the model matrix is created automatically, using data in the calling frame. Advantages of using a formula are that`categorical`

variables can be determined automatically by a variable being a`factor`

variable, and variables with two unique levels are modeled`asis`

. Variables with 3 unique values are considered to be`categorical`

if a formula is specified. For a formula you may also specify that a variable is to remain untransformed by enclosing its name with the identify function, e.g.`I(x3)`

. The user may add other variable names to the`asis`

and`categorical`

vectors. For`invertTabulated`

,`x`

is a vector or a list with three components: the x vector, the corresponding vector of transformed values, and the corresponding vector of frequencies of the pair of original and transformed variables. For`print`

,`plot`

,`ggplot`

,`impute`

, and`predict`

,`x`

is an object created by`transcan`

.- formula
any R model formula

- fitter
any R,

`rms`

, modeling function (not in quotes) that computes a vector of`coefficients`

and for which`vcov`

will return a variance-covariance matrix. E.g.,`fitter = lm`

,`glm`

,`ols`

. At present models involving non-regression parameters (e.g., scale parameters in parametric survival models) are not handled fully.- xtrans
an object created by

`transcan`

,`aregImpute`

, or`mice`

- method
use

`method="canonical"`

or any abbreviation thereof, to use canonical variates (the default).`method="pc"`

transforms a variable instead so as to maximize the correlation with the first principal component of the other variables.- categorical
a character vector of names of variables in

`x`

which are categorical, for which the ordering of re-scored values is not necessarily preserved. If`categorical`

is omitted, it is assumed that all variables are continuous (or binary). Set`categorical="*"`

to treat all variables as categorical.- asis
a character vector of names of variables that are not to be transformed. For these variables, the guts of

`lm.fit`

`method="qr"`

is used to impute missing values. You may want to treat binary variables`asis`

(this is automatic if using a formula). If`imputed = TRUE`

, you may want to use`"categorical"`for binary variables if you want to force imputed values to be one of the original data values. Set`asis="*"`

to treat all variables`asis`

.- nk
number of knots to use in expanding each continuous variable (not listed in

`asis`

) in a restricted cubic spline function. Default is 3 (yielding 2 parameters for a variable) if \(\var{n} < 30\), 4 if \(30 <= \var{n} < 100\), and 5 if \(\var{n} \ge 100\) (4 parameters).- imputed
Set to

`TRUE`

to return a list containing imputed values on the original scale. If the transformation for a variable is non-monotonic, imputed values are not unique.`transcan`

uses the`approx`

function, which returns the highest value of the variable with the transformed score equalling the imputed score.`imputed=TRUE`

also causes original-scale imputed values to be shown as tick marks on the top margin of each graph when`show.na=TRUE`

(for the final iteration only). For categorical predictors, these imputed values are passed through the`jitter`

function so that their frequencies can be visualized. When`n.impute`

is used, each`NA`

will have`n.impute`

tick marks.- n.impute
number of multiple imputations. If omitted, single predicted expected value imputation is used.

`n.impute=5`

is frequently recommended.- boot.method
default is to use the approximate Bayesian bootstrap (sample with replacement from sample with replacement of the vector of residuals). You can also specify

`boot.method="simple"`

to use the usual bootstrap one-stage sampling with replacement.- trantab
Set to

`TRUE`

to add an attribute`trantab`

to the returned matrix. This contains a vector of lists each with components`x`

and`y`

containing the unique values and corresponding transformed values for the columns of`x`

. This is set up to be used easily with the`approx`

function. You must specify`trantab=TRUE`

if you want to later use the`predict.transcan`

function with`type = "original"`

.- transformed
set to

`TRUE`

to cause`transcan`

to return an object`transformed`

containing the matrix of transformed variables- impcat
This argument tells how to impute categorical variables on the original scale. The default is

`impcat="score"`

to impute the category whose canonical variate score is closest to the predicted score. Use`impcat="rpart"`

to impute categorical variables using the values of all other transformed predictors in conjunction with the`rpart`

function. A better but somewhat slower approach is to use`impcat="multinom"`

to fit a multinomial logistic model to the categorical variable, at the last iteraction of the`transcan`

algorithm. This uses the`multinom`

function in the nnet library of the MASS package (which is assumed to have been installed by the user) to fit a polytomous logistic model to the current working transformations of all the other variables (using conditional mean imputation for missing predictors). Multiple imputations are made by drawing multinomial values from the vector of predicted probabilities of category membership for the missing categorical values.- mincut
If

`imputed=TRUE`

, there are categorical variables, and`impcat = "rpart"`

,`mincut`

specifies the lowest node size that will be allowed to be split. The default is 40.- inverse
By default, imputed values are back-solved on the original scale using inverse linear interpolation on the fitted tabulated transformed values. This will cause distorted distributions of imputed values (e.g., floor and ceiling effects) when the estimated transformation has a flat or nearly flat section. To instead use the

`invertTabulated`

function (see above) with the`"sample"`

option, specify`inverse="sample"`

.- tolInverse
the multiplyer of the range of transformed values, weighted by

`freq`

and by the distance measure, for determining the set of x values having y values within a tolerance of the value of`aty`

in`invertTabulated`

. For`predict.transcan`

,`inverse`

and`tolInverse`

are obtained from options that were specified to`transcan`

by default. Otherwise, if not specified by the user, these default to the defaults used to`invertTabulated`

.- pr
For

`transcan`

, set to`FALSE`

to suppress printing \(R^2\) and shrinkage factors. Set`impute.transcan=FALSE`

to suppress messages concerning the number of`NA`

values imputed. Set`fit.mult.impute=FALSE`

to suppress printing variance inflation factors accounting for imputation, rate of missing information, and degrees of freedom.- pl
Set to

`FALSE`

to suppress plotting the final transformations with distribution of scores for imputed values (if`show.na=TRUE`

).- allpl
Set to

`TRUE`

to plot transformations for intermediate iterations.- show.na
Set to

`FALSE`

to suppress the distribution of scores assigned to missing values (as tick marks on the right margin of each graph). See also`imputed`

.- imputed.actual
The default is

`"none"`to suppress plotting of actual vs. imputed values for all variables having any`NA`

values. Other choices are`"datadensity"`to use`datadensity`

to make a single plot,`"hist"`to make a series of back-to-back histograms,`"qq"`to make a series of q-q plots, or`"ecdf"`to make a series of empirical cdfs. For`imputed.actual="datadensity"`

for example you get a rug plot of the non-missing values for the variable with beneath it a rug plot of the imputed values. When`imputed.actual`

is not`"none"`,`imputed`

is automatically set to`TRUE`

.- iter.max
maximum number of iterations to perform for

`transcan`

or`predict`

. For`predict`

, only one iteration is used if there are no`NA`

values in the data or if`imp.con`

was used.- eps
convergence criterion for

`transcan`

and`predict`

.`eps`

is the maximum change in transformed values from one iteration to the next. If for a given iteration all new transformations of variables differ by less than`eps`

(with or without negating the transformation to allow for “flipping”) from the transformations in the previous iteration, one more iteration is done for`transcan`

. During this last iteration, individual transformations are not updated but coefficients of transformations are. This improves stability of coefficients of canonical variates on the right-hand-side.`eps`

is ignored when`rhsImp="random"`

.- curtail
for

`transcan`

, causes imputed values on the transformed scale to be truncated so that their ranges are within the ranges of non-imputed transformed values. For`predict`

,`curtail`

defaults to`TRUE`

to truncate predicted transformed values to their ranges in the original fit (`xt`

).- imp.con
for

`transcan`

, set to`TRUE`

to impute`NA`

values on the original scales with constants (medians or most frequent category codes). Set to a vector of constants to instead always use these constants for imputation. These imputed values are ignored when fitting the current working transformation for asingle variable.- shrink
default is

`FALSE`

to use ordinary least squares or canonical variate estimates. For the purposes of imputing`NA`

s, you may want to set`shrink=TRUE`

to avoid overfitting when developing a prediction equation to predict each variables from all the others (see details below).- init.cat
method for initializing scorings of categorical variables. Default is

`"mode"`to use a dummy variable set to 1 if the value is the most frequent value (this is the default). Use`"random"`to use a random 0-1 variable. Set to`"asis"`to use the original integer codes asstarting scores.- nres
number of residuals to store if

`n.impute`

is specified. If the dataset has fewer than`nres`

observations, all residuals are saved. Otherwise a random sample of the residuals of length`nres`

without replacement is saved. The default for`nres`

is higher if`boot.method="approximate bayesian"`

.- data
Data frame used to fill the formula. For

`ggplot`

is the result of`transcan`

with`trantab=TRUE`

.- subset
an integer or logical vector specifying the subset of observations to fit

- na.action
These may be used if

`x`

is a formula. The default`na.action`

is`na.retain`

(defined by`transcan`

) which keeps all observations with any`NA`

values. For`impute.transcan`

,`data`

is a data frame to use as the source of variables to be imputed, rather than using`pos.in`

. For`fit.mult.impute`

,`data`

is mandatory and is a data frame containing the data to be used in fitting the model but before imputations are applied. Variables omitted from`data`

are assumed to be available from frame1 and do not need to be imputed.- treeinfo
Set to

`TRUE`

to get additional information printed when`impcat="rpart"`

, such as the predicted probabilities of category membership.- rhsImp
Set to

`"random"`to use random draw imputation when a sometimes missing variable is moved to be a predictor of other sometimes missing variables. Default is`rhsImp="mean"`

, which uses conditional mean imputation on the transformed scale. Residuals used are residuals from the transformed scale. When`"random"`is used,`transcan`

runs 5 iterations and ignores`eps`

.- details.impcat
set to a character scalar that is the name of a category variable to include in the resulting

`transcan`

object an element`details.impcat`

containing details of how the categorical variable was multiply imputed.- …
arguments passed to

`scat1d`

or to the`fitter`

function (for`fit.mult.impute`

). For`ggplot.transcan`

, these arguments are passed to`facet_wrap`

, e.g.`ncol=2`

.- long
for

`summary`

, set to`TRUE`

to print all imputed values. For`print`

, set to`TRUE`

to print details of transformations/imputations.- digits
number of significant digits for printing values by

`summary`

- scale
for

`ggplot.transcan`

set`scale=TRUE`

to scale transformed values to [0,1] before plotting.- mapping,environment
not used; needed because of rules about generics

- var
For

`impute`

, is a variable that was originally a column in`x`

, for which imputated values are to be filled in.`imputed=TRUE`

must have been used in`transcan`

. Omit`var`

to impute all variables, creating new variables in position`pos`

(see`assign`

).- imputation
specifies which of the multiple imputations to use for filling in

`NA`

values- name
name of variable to impute, for

`impute`

function. Default is character string version of the second argument (`var`

) in the call to`impute`

. For`invertTabulated`

, is the name of variable being transformed (used only for warning messages).- pos.in
location as defined by

`assign`

to find variables that need to be imputed, when all variables are to be imputed automatically by`impute.transcan`

(i.e., when no input variable name is specified). Default is position that contains the first variable to be imputed.- list.out
If

`var`

is not specified, you can set`list.out=TRUE`

to have`impute.transcan`

return a list containing variables with needed values imputed. This list will contain a single imputation. Variables not needing imputation are copied to the list as-is. You can use this list for analysis just like a data frame.- check
set to

`FALSE`

to suppress certain warning messages- newdata
a new data matrix for which to compute transformed variables. Categorical variables must use the same integer codes as were used in the call to

`transcan`

. If a formula was originally specified to`transcan`

(instead of a data matrix),`newdata`

is optional and if given must be a data frame; a model frame is generated automatically from the previous formula. The`na.action`

is handled automatically, and the levels for factor variables must be the same and in the same order as were used in the original variables specified in the formula given to`transcan`

.- fit.reps
set to

`TRUE`

to save all fit objects from the fit for each imputation in`fit.mult.impute`

. Then the object returned will have a component`fits`

which is a list whose`i`th element is the`i`th fit object.- dtrans
provides an approach to creating derived variables from a single filled-in dataset. The function specified as

`dtrans`

can even reshape the imputed dataset. An example of such usage is fitting time-dependent covariates in a Cox model that are created by “start,stop” intervals. Imputations may be done on a one record per subject data frame that is converted by`dtrans`

to multiple records per subject. The imputation can enforce consistency of certain variables across records so that for example a missing value of`sex`will not be imputed as`male`for one of the subject's records and`female`as another. An example of how`dtrans`

might be specified is`dtrans=function(w) {w$age <- w$years + w$months/12; w}`

where`months`

might havebeen imputed but`years`

was never missing. An outline for using `dtrans` to impute missing baseline variables in a longitudinal analysis appears in Details below.- derived
an expression containing R expressions for computing derived variables that are used in the model formula. This is useful when multiple imputations are done for component variables but the actual model uses combinations of these (e.g., ratios or other derivations). For a single derived variable you can specified for example

`derived=expression(ratio <- weight/height)`

. For multiple derived variables use the form`derived=expression({ratio <- weight/height; product <- weight*height})`

or put the expression on separate input lines. To monitor the multiply-imputed derived variables you can add to the`expression`

a command such as`print(describe(ratio))`

. See the example below. Note that`derived`

is not yet implemented.- vcovOpts
a list of named additional arguments to pass to the

`vcov`

method for`fitter`

. Useful for`orm`

models for retaining all intercepts (`vcovOpts=list(intercepts='all')`

) instead of just the middle one.- type
By default, the matrix of transformed variables is returned, with imputed values on the transformed scale. If you had specified

`trantab=TRUE`

to`transcan`

, specifying`type="original"`

does the table look-ups with linear interpolation to return the input matrix`x`

but with imputed values on the original scale inserted for`NA`

values. For categorical variables, the method used here is to select the category code having a corresponding scaled value closest to the predicted transformed value. This corresponds to the default`impcat`

. Note: imputed values thus returned when`type="original"`

are single expected value imputations even in`n.impute`

is given.- object
an object created by

`transcan`

, or an object to be converted to R function code, typically a model fit object of some sort- prefix, suffix
When creating separate R functions for each variable in

`x`

, the name of the new function will be`prefix`

placed in front of the variable name, and`suffix`

placed in back of the name. The default is to use names of the form`.varname`, where`varname`is the variable name.- pos
position as in

`assign`

at which to store new functions (for`Function`

). Default is`pos=-1`

.- y
a vector corresponding to

`x`

for`invertTabulated`

, if its first argument`x`

is not a list- freq
a vector of frequencies corresponding to cross-classified

`x`

and`y`

if`x`

is not a list. Default is a vector of ones.- aty
vector of transformed values at which inverses are desired

- rule
see

`approx`

.`transcan`

assumes`rule`

is always 2.- regcoef.only
set to

`TRUE`

to make`vcov.default`

delete positions in the covariance matrix for any non-regression coefficients (e.g., log scale parameter from`psm`

or`survreg`

)- intercepts
this is primarily for

`orm`

objects. Set to`"none"`

to discard all intercepts from the covariance matrix, or to`"all"`

or`"mid"`

to keep all elements generated by`orm`

(`orm`

only outputs the covariance matrix for the intercept corresponding to the median). You can also set`intercepts`

to a vector of subscripts for selecting particular intercepts in a multi-intercept model.

##### Details

The starting approximation to the transformation for each variable is
taken to be the original coding of the variable. The initial
approximation for each missing value is taken to be the median of the
non-missing values for the variable (for continuous ones) or the most
frequent category (for categorical ones). Instead, if `imp.con`

is a vector, its values are used for imputing `NA`

values. When
using each variable as a dependent variable, `NA`

values on that
variable cause all observations to be temporarily deleted. Once a new
working transformation is found for the variable, along with a model
to predict that transformation from all the other variables, that
latter model is used to impute `NA`

values in the selected
dependent variable if `imp.con`

is not specified.

When that variable is used to predict a new dependent variable, the
current working imputed values are inserted. Transformations are
updated after each variable becomes a dependent variable, so the order
of variables on `x`

could conceivably make a difference in the
final estimates. For obtaining out-of-sample
predictions/transformations, `predict`

uses the same
iterative procedure as `transcan`

for imputation, with the same
starting values for fill-ins as were used by `transcan`

. It also
(by default) uses a conservative approach of curtailing transformed
variables to be within the range of the original ones. Even when
`method = "pc"`

is specified, canonical variables are used for
imputing missing values.

Note that fitted transformations, when evaluated at imputed variable
values (on the original scale), will not precisely match the
transformed imputed values returned in `xt`

. This is because
`transcan`

uses an approximate method based on linear
interpolation to back-solve for imputed values on the original scale.

Shrinkage uses the method of
Van Houwelingen and Le Cessie (1990) (similar to
Copas, 1983). The shrinkage factor is
$$\frac{1-\frac{(1-\var{R2})(\var{n}-1)}{\var{n}-\var{k}-1}}{\var{R2}}$$
where `R2` is the apparent \(R^2\)d for predicting the
variable, `n` is the number of non-missing values, and `k` is
the effective number of degrees of freedom (aside from intercepts). A
heuristic estimate is used for `k`:

, where
`A` - 1 + sum(max(0,`Bi` - 1))/`m` + `m``A` is the number of d.f. required to represent the variable being
predicted, the `Bi` are the number of columns required to
represent all the other variables, and `m` is the number of all
other variables. Division by `m` is done because the
transformations for the other variables are fixed at their current
transformations the last time they were being predicted. The
\(+ \var{m}\) term comes from the number of coefficients estimated
on the right hand side, whether by least squares or canonical
variates. If a shrinkage factor is negative, it is set to 0. The
shrinkage factor is the ratio of the adjusted \(R^2\)d to
the ordinary \(R^2\)d. The adjusted \(R^2\)d is
$$1-\frac{(1-\var{R2})(\var{n}-1)}{\var{n}-\var{k}-1}$$
which is also set to zero if it is negative. If `shrink=FALSE`

and the adjusted \(R^2\)s are much smaller than the
ordinary \(R^2\)s, you may want to run `transcan`

with `shrink=TRUE`

.

Canonical variates are scaled to have variance of 1.0, by multiplying
canonical coefficients from `cancor`

by
\(\sqrt{\var{n}-1}\).

When specifying a non-rms library fitting function to
`fit.mult.impute`

(e.g., `lm`

, `glm`

),
running the result of `fit.mult.impute`

through that fit's
`summary`

method will not use the imputation-adjusted
variances. You may obtain the new variances using `fit$var`

or
`vcov(fit)`

.

When you specify a rms function to `fit.mult.impute`

(e.g.
`lrm`

, `ols`

, `cph`

,
`psm`

, `bj`

, `Rq`

,
`Gls`

, `Glm`

), automatically computed
transformation parameters (e.g., knot locations for
`rcs`

) that are estimated for the first imputation are
used for all other imputations. This ensures that knot locations will
not vary, which would change the meaning of the regression
coefficients.

Warning: even though `fit.mult.impute`

takes imputation into
account when estimating variances of regression coefficient, it does
not take into account the variation that results from estimation of
the shapes and regression coefficients of the customized imputation
equations. Specifying `shrink=TRUE`

solves a small part of this
problem. To fully account for all sources of variation you should
consider putting the `transcan`

invocation inside a bootstrap or
loop, if execution time allows. Better still, use
`aregImpute`

or a package such as as mice that uses
real Bayesian posterior realizations to multiply impute missing values
correctly.

It is strongly recommended that you use the Hmisc `naclus`

function to determine is there is a good basis for imputation.
`naclus`

will tell you, for example, if systolic blood
pressure is missing whenever diastolic blood pressure is missing. If
the only variable that is well correlated with diastolic bp is
systolic bp, there is no basis for imputing diastolic bp in this case.

At present, `predict`

does not work with multiple imputation.

When calling `fit.mult.impute`

with `glm`

as the
`fitter`

argument, if you need to pass a `family`

argument
to `glm`

do it by quoting the family, e.g.,
`family="binomial"`

.

`fit.mult.impute`

will not work with proportional odds models
when regression imputation was used (as opposed to predictive mean
matching). That's because regression imputation will create values of
the response variable that did not exist in the dataset, altering the
intercept terms in the model.

You should be able to use a variable in the formula given to
`fit.mult.impute`

as a numeric variable in the regression model
even though it was a factor variable in the invocation of
`transcan`

. Use for example ```
fit.mult.impute(y ~ codes(x),
lrm, trans)
```

(thanks to Trevor Thompson
trevor@hp5.eushc.org).

Here is an outline of the steps necessary to impute baseline variables
using the `dtrans`

argument, when the analysis to be repeated by
`fit.mult.impute`

is a longitudinal analysis (using
e.g. `Gls`

).

Create a one row per subject data frame containing baseline variables plus follow-up variables that are assigned to windows. For example, you may have dozens of repeated measurements over years but you capture the measurements at the times measured closest to 1, 2, and 3 years after study entry

Make sure the dataset contains the subject ID

This dataset becomes the one passed to

`aregImpute`

as`data=`

. You will be imputing missing baseline variables from follow-up measurements defined at fixed times.Have another dataset with all the non-missing follow-up values on it, one record per measurement time per subject. This dataset should not have the baseline variables on it, and the follow-up measurements should not be named the same as the baseline variable(s); the subject ID must also appear

Add the dtrans argument to

`fit.mult.impute`

to define a function with one argument representing the one record per subject dataset with missing values filled it from the current imputation. This function merges the above 2 datasets; the returned value of this function is the merged data frame.This merged-on-the-fly dataset is the one handed by

`fit.mult.impute`

to your fitting function, so variable names in the formula given to`fit.mult.impute`

must matched the names created by the merge

##### Value

For `transcan`

, a list of class `transcan` with elements

(with the function call)

(number of iterations done)

containing the \(R^2\)s and adjusted \(R^2\)s achieved in predicting each variable from all the others

the values supplied for `categorical`

the values supplied for `asis`

the within-variable coefficients used to compute the first canonical variate

the (possibly shrunk) across-variables coefficients of the first canonical variate that predicts each variable in-turn.

the parameters of the transformation (knots for splines, contrast matrix for categorical variables)

the initial estimates for missing values (`NA`

if variable
never missing)

the matrix of ranges of the transformed variables (min and max in first and secondrow)

a vector of scales used to determine convergence for a transformation.

the formula (if `x`

was a formula)

predict returns a matrix with the same number of columns or variables as were in x. fit.mult.impute returns a fit object that is a modification of the fit object created by fitting the completed dataset for the final imputation. The var matrix in the fit object has the imputation-corrected variance-covariance matrix. coefficients is the average (over imputations) of the coefficient vectors, variance.inflation.impute is a vector containing the ratios of the diagonals of the between-imputation variance matrix to the diagonals of the average apparent (within-imputation) variance matrix. missingInfo is Rubin's rate of missing information and dfmi is Rubin's degrees of freedom for a t-statistic for testing a single parameter. The last two objects are vectors corresponding to the diagonal of the variance matrix. The class "fit.mult.impute" is prepended to the other classes produced by the fitting function.

fit.mult.impute stores intercepts attributes in the coefficient matrix and in var for orm fits.

##### Side Effects

prints, plots, and `impute.transcan`

creates new variables.

##### References

Kuhfeld, Warren F: The PRINQUAL Procedure. SAS/STAT User's Guide, Fourth Edition, Volume 2, pp. 1265--1323, 1990.

Van Houwelingen JC, Le Cessie S: Predictive value of statistical models. Statistics in Medicine 8:1303--1325, 1990.

Copas JB: Regression, prediction and shrinkage. JRSS B 45:311--354, 1983.

He X, Shen L: Linear regression after spline transformation. Biometrika 84:474--481, 1997.

Little RJA, Rubin DB: Statistical Analysis with Missing Data. New York: Wiley, 1987.

Rubin DJ, Schenker N: Multiple imputation in health-care databases: An overview and some applications. Stat in Med 10:585--598, 1991.

Faris PD, Ghali WA, et al:Multiple imputation versus data enhancement for dealing with missing data in observational health care outcome analyses. J Clin Epidem 55:184--191, 2002.

##### See Also

`aregImpute`

, `impute`

, `naclus`

,
`naplot`

, `ace`

,
`avas`

, `cancor`

,
`prcomp`

, `rcspline.eval`

,
`lsfit`

, `approx`

, `datadensity`

,
`mice`

, `ggplot`

##### Examples

```
# NOT RUN {
x <- cbind(age, disease, blood.pressure, pH)
#cbind will convert factor object `disease' to integer
par(mfrow=c(2,2))
x.trans <- transcan(x, categorical="disease", asis="pH",
transformed=TRUE, imputed=TRUE)
summary(x.trans) #Summary distribution of imputed values, and R-squares
f <- lm(y ~ x.trans$transformed) #use transformed values in a regression
#Now replace NAs in original variables with imputed values, if not
#using transformations
age <- impute(x.trans, age)
disease <- impute(x.trans, disease)
blood.pressure <- impute(x.trans, blood.pressure)
pH <- impute(x.trans, pH)
#Do impute(x.trans) to impute all variables, storing new variables under
#the old names
summary(pH) #uses summary.impute to tell about imputations
#and summary.default to tell about pH overall
# Get transformed and imputed values on some new data frame xnew
newx.trans <- predict(x.trans, xnew)
w <- predict(x.trans, xnew, type="original")
age <- w[,"age"] #inserts imputed values
blood.pressure <- w[,"blood.pressure"]
Function(x.trans) #creates .age, .disease, .blood.pressure, .pH()
#Repeat first fit using a formula
x.trans <- transcan(~ age + disease + blood.pressure + I(pH),
imputed=TRUE)
age <- impute(x.trans, age)
predict(x.trans, expand.grid(age=50, disease="pneumonia",
blood.pressure=60:260, pH=7.4))
z <- transcan(~ age + factor(disease.code), # disease.code categorical
transformed=TRUE, trantab=TRUE, imputed=TRUE, pl=FALSE)
ggplot(z, scale=TRUE)
plot(z$transformed)
# }
# NOT RUN {
# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
set.seed(1)
x1 <- factor(sample(c('a','b','c'),100,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100)
y <- x2 + 1*(x1=='c') + rnorm(100)
x1[1:20] <- NA
x2[18:23] <- NA
d <- data.frame(x1,x2,y)
n <- naclus(d)
plot(n); naplot(n) # Show patterns of NAs
f <- transcan(~y + x1 + x2, n.impute=10, shrink=FALSE, data=d)
options(digits=3)
summary(f)
f <- transcan(~y + x1 + x2, n.impute=10, shrink=TRUE, data=d)
summary(f)
h <- fit.mult.impute(y ~ x1 + x2, lm, f, data=d)
# Add ,fit.reps=TRUE to save all fit objects in h, then do something like:
# for(i in 1:length(h$fits)) print(summary(h$fits[[i]]))
diag(vcov(h))
h.complete <- lm(y ~ x1 + x2, na.action=na.omit)
h.complete
diag(vcov(h.complete))
# Note: had the rms ols function been used in place of lm, any
# function run on h (anova, summary, etc.) would have automatically
# used imputation-corrected variances and covariances
# Example demonstrating how using the multinomial logistic model
# to impute a categorical variable results in a frequency
# distribution of imputed values that matches the distribution
# of non-missing values of the categorical variable
# }
# NOT RUN {
set.seed(11)
x1 <- factor(sample(letters[1:4], 1000,TRUE))
x1[1:200] <- NA
table(x1)/sum(table(x1))
x2 <- runif(1000)
z <- transcan(~ x1 + I(x2), n.impute=20, impcat='multinom')
table(z$imputed$x1)/sum(table(z$imputed$x1))
# Here is how to create a completed dataset
d <- data.frame(x1, x2)
z <- transcan(~x1 + I(x2), n.impute=5, data=d)
imputed <- impute(z, imputation=1, data=d,
list.out=TRUE, pr=FALSE, check=FALSE)
sapply(imputed, function(x)sum(is.imputed(x)))
sapply(imputed, function(x)sum(is.na(x)))
# }
# NOT RUN {
# Example where multiple imputations are for basic variables and
# modeling is done on variables derived from these
set.seed(137)
n <- 400
x1 <- runif(n)
x2 <- runif(n)
y <- x1*x2 + x1/(1+x2) + rnorm(n)/3
x1[1:5] <- NA
d <- data.frame(x1,x2,y)
w <- transcan(~ x1 + x2 + y, n.impute=5, data=d)
# Add ,show.imputed.actual for graphical diagnostics
# }
# NOT RUN {
g <- fit.mult.impute(y ~ product + ratio, ols, w,
data=data.frame(x1,x2,y),
derived=expression({
product <- x1*x2
ratio <- x1/(1+x2)
print(cbind(x1,x2,x1*x2,product)[1:6,])}))
# }
# NOT RUN {
# Here's a method for creating a permanent data frame containing
# one set of imputed values for each variable specified to transcan
# that had at least one NA, and also containing all the variables
# in an original data frame. The following is based on the fact
# that the default output location for impute.transcan is
# given by the global environment
# }
# NOT RUN {
xt <- transcan(~. , data=mine,
imputed=TRUE, shrink=TRUE, n.impute=10, trantab=TRUE)
attach(mine, use.names=FALSE)
impute(xt, imputation=1) # use first imputation
# omit imputation= if using single imputation
detach(1, 'mine2')
# }
# NOT RUN {
# Example of using invertTabulated outside transcan
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(1,2,3,4,5,5,5,5,9,10)
freq <- c(1,1,1,1,1,2,3,4,1,1)
# x=5,6,7,8 with prob. .1 .2 .3 .4 when y=5
# Within a tolerance of .05*(10-1) all y's match exactly
# so the distance measure does not play a role
set.seed(1) # so can reproduce
for(inverse in c('linearInterp','sample'))
print(table(invertTabulated(x, y, freq, rep(5,1000), inverse=inverse)))
# Test inverse='sample' when the estimated transformation is
# flat on the right. First show default imputations
set.seed(3)
x <- rnorm(1000)
y <- pmin(x, 0)
x[1:500] <- NA
for(inverse in c('linearInterp','sample')) {
par(mfrow=c(2,2))
w <- transcan(~ x + y, imputed.actual='hist',
inverse=inverse, curtail=FALSE,
data=data.frame(x,y))
if(inverse=='sample') next
# cat('Click mouse on graph to proceed\n')
# locator(1)
}
# }
```

*Documentation reproduced from package Hmisc, version 4.3-0, License: GPL (>= 2)*