stdize: Standardize data

Description

stdize standardizes variables by centring and scaling.

stdizeFit modifies a model call or existing model to use standardized variables.

Usage

## S3 method for class 'default':
stdize(x, center = TRUE, scale = TRUE, ...)
## S3 method for class 'logical':
stdize(x, binary = c("center", "scale", "binary", "half", "omit"),
    center = TRUE, scale = FALSE, ...)
## also for two-level factors
## S3 method for class 'data.frame':
stdize(x, binary = c("center", "scale", "binary", "half", "omit"),
    center = TRUE, scale = TRUE, omit.cols = NULL, source = NULL,
    prefix = TRUE, append = FALSE, ...)
## S3 method for class 'formula':
stdize(x, data = NULL, response = FALSE,
    binary = c("center", "scale", "binary", "half", "omit"),
    center = TRUE, scale = TRUE, omit.cols = NULL, prefix = TRUE,
    append = FALSE, ...)
    
stdizeFit(object, data, which = c("formula", "subset", "offset", "weights"),
    evaluate = TRUE, quote = NA)

Arguments

x

a numeric or logical vector, factor, numeric matrix, data.frame or a formula.

center, scale

either a logical value or vector, or a numeric vector of length equal to the number of columns of x (see Details). scale can also be a function to use for scaling.

binary

specifies how binary variables (logical or two-level factors) are scaled. Default is to "center" by subtracting the mean assuming levels are equal to 0 and 1; use "scale" to both centre and scale by

source

a reference data.frame, being a result of previous stdize, from which scale and center values are taken. Column names are matched. This can be used for scaling new data usi

omit.cols

column names or numeric indices of columns that should be left unaltered.

prefix

either a logical value specifying whether the names of transformed columns should be prefixed, or a two-element character vector giving the prefixes. The prefixes default to z. for scaled and c. fo

append

logical, if TRUE, modified columns are appended to the original data frame.

response

logical, stating whether the response should be standardized. By default, only variables on the right-hand side of the formula are standardized.

data

an object coercible to data.frame, containing the variables in formula. Passed to, and used by model.frame. For stdizeFit, a

...

for the formula method, additional arguments passed to model.frame. For other methods it is silently ignored.

object

a fitted model object or an expression being a call to the modelling function.

which

a character vector naming the arguments which should be modified. This should be all arguments which are evaluated in the data environment. Can also be TRUE to modify the expression as a whole. The data<

evaluate

if TRUE, the modified call is evaluated and the fitted model object is returned.

quote

if TRUE, avoids evaluating object. Equivalent to stdizeFit(quote(expr), ...). Defaults to NA, in which case object is quoted if it is a call to a non-primitive function.
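
The idea behind the source argument, reusing stored centre and scale values on new data, can be sketched with base R's scale and the attributes it stores. This is only an illustration under stated assumptions: it does not call stdize, and the data and the names train, new and z.new are invented for the example.

```r
# Base-R sketch only; stdize(newdata, source = ...) is not called here.
train <- matrix(c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50), ncol = 2,
    dimnames = list(NULL, c("a", "b")))
z.train <- scale(train)

# centre and scale values recorded by the first standardization
ctr <- attr(z.train, "scaled:center")
scl <- attr(z.train, "scaled:scale")

# apply the same values to new observations, matching columns by name
new <- matrix(c(3, 6, 30, 60), ncol = 2,
    dimnames = list(NULL, c("a", "b")))
z.new <- scale(new, center = ctr[colnames(new)], scale = scl[colnames(new)])
z.new
```

Because the new rows are transformed with the original means and standard deviations, they end up on the same scale as the data used to fit the model, which is what predictions from a model fitted to standardized data require.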

Value

stdize returns a vector or object of the same dimensions as x, where the values are centred and/or scaled. Transformation is carried out column-wise in data.frames and matrices.
If center or scale are logical scalars or vectors of length equal to the number of columns of x, centring is done by subtracting the mean (if center corresponding to the column is TRUE), and scaling is done by dividing the (centred) value by the standard deviation (if the corresponding scale is TRUE). If center or scale are numeric vectors with length equal to the number of columns of x (or numeric scalars for vector methods), then these values are used instead. Any NAs in the numeric vector result in no centring or scaling of the corresponding column.
Binary variables, logical or factors with two levels, are converted to numeric variables and transformed according to the argument binary, unless center or scale are explicitly given.
The returned value is compatible with that of scale in that the numeric centring and scaling values used are stored in attributes "scaled:center" and "scaled:scale" (these can be NA if no centring or scaling has been done).
stdizeFit returns a modified, unevaluated call where the variable names are replaced to point to the transformed variables, or, if evaluate is TRUE, a fitted model object.
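
As a rough illustration of the centring and scaling rules above, the following base-R sketch reproduces the arithmetic for a numeric matrix. It relies only on base R's scale, with which the returned value is said to be compatible, and does not call stdize itself. The object names (m, manual, ref) are made up for this example.

```r
# Base-R sketch only; this does not call stdize().
set.seed(1)
m <- matrix(runif(15, 0, 10), ncol = 3)

# centre each column by its mean, then divide by its standard deviation
ctr <- colMeans(m)
scl <- apply(m, 2, sd)
manual <- sweep(sweep(m, 2, ctr, "-"), 2, scl, "/")

# scale() performs the same computation when both arguments are TRUE
# and records the values it used as attributes
ref <- scale(m, center = TRUE, scale = TRUE)
all.equal(manual, ref, check.attributes = FALSE)
attr(ref, "scaled:center")
attr(ref, "scaled:scale")

# numeric vectors can be supplied in place of TRUE/FALSE
scale(m, center = c(1, 2, 3), scale = c(2, 2, 2))
```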

Details

stdize resembles scale, but uses special rules for factors, similarly to standardize in package arm.

stdize differs from standardize in that it is used on data rather than on the fitted model object. The scaled data should afterwards be passed to the modelling function, instead of the original data.

Unlike standardize, it applies special binary scaling only to two-level factors and logical variables, rather than to any variable with two unique values.
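
The default "center" treatment of binary variables can be pictured with a small base-R sketch. The helper center_binary is hypothetical and not part of any package; it only mimics the stated rule (levels taken as 0 and 1, mean subtracted) and simply returns any other input unchanged.

```r
# Hypothetical helper written for illustration; not the package's code.
center_binary <- function(v) {
    if (is.logical(v)) {
        num <- as.numeric(v)              # FALSE -> 0, TRUE -> 1
    } else if (is.factor(v) && nlevels(v) == 2L) {
        num <- as.numeric(v) - 1          # first level -> 0, second -> 1
    } else {
        return(v)                         # not treated as binary in this sketch
    }
    num - mean(num, na.rm = TRUE)
}

center_binary(c(TRUE, FALSE, FALSE, TRUE, TRUE))
center_binary(factor(c("a", "b", "b", "b")))

# a numeric variable with two unique values is neither logical nor a
# two-level factor, so the binary rule is not applied to it
center_binary(c(0, 1, 1, 0))
```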

Variables of only one unique value are unchanged.

By default, stdize scales by dividing by the standard deviation rather than by twice the SD, as standardize does. Scaling by SD is also used on uncentred values, unlike scale, which uses the root-mean-square in that case.
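
The divisors mentioned here can be compared numerically in base R. The sketch below does not call stdize or standardize; it only computes the quantities named in the text for an arbitrary made-up vector x.

```r
# Base-R sketch only; stdize() and arm::standardize() are not called here.
x <- c(2, 4, 4, 7, 9)
n <- length(x)

# divisor used by scale() when center = FALSE: root-mean-square with n - 1
rms <- sqrt(sum(x^2) / (n - 1))
attr(scale(x, center = FALSE), "scaled:scale")
rms

# divisors described for stdize (one SD) and for standardize (two SDs)
sd(x)
2 * sd(x)
```

For uncentred data with a non-zero mean the root-mean-square is larger than the SD, so the two approaches give differently scaled values.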

References

Gelman, A. (2008) Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine 27, 2865-2873.

Examples

# compare "stdize" and "scale"
nmat <- matrix(runif(15, 0, 10), ncol = 3)

stdize(nmat)
scale(nmat)

rootmeansq <- function(v) {
    v <- v[!is.na(v)]
    sqrt(sum(v^2) / max(1, length(v) - 1L))
}

scale(nmat, center = FALSE)
stdize(nmat, center = FALSE, scale = rootmeansq)

if(require(lme4)) {
# define scale function as twice the SD to reproduce "arm::standardize"
twosd <- function(v) 2 * sd(v, na.rm = TRUE)

# standardize data (scaled variables are prefixed with "z.")
z.CO2 <- stdize(uptake ~ conc + Plant, data = CO2, omit.cols = "Plant", scale = twosd)
summary(z.CO2)


fmz <- stdizeFit(lmer(uptake ~ conc + I(conc^2) + (1 | Plant)), data = z.CO2)
# produces:
# lmer(uptake ~ z.conc + I(z.conc^2) + (1 | Plant), data = z.CO2)


## standardize using scale and center from "z.CO2", keeping the original data:
z.CO2a <- stdize(CO2, source = z.CO2, append = TRUE)
# Here, the "subset" expression uses untransformed variable, so we modify only
# "formula" argument, keeping "subset" as-is. For that reason we needed the
# untransformed variables in "data".
stdizeFit(lmer(uptake ~ conc + I(conc^2) + (1 | Plant),
    subset = conc > 100),
    data = z.CO2a, which = "formula", evaluate = FALSE)


# create new data as a sequence along "conc"
newdata <-  data.frame(conc = seq(min(CO2$conc), max(CO2$conc), length = 10))

# scale new data using scale and center of the original scaled data: 
z.newdata <- stdize(newdata, source = z.CO2)

if(require(graphics)) {
# plot predictions against "conc" on real scale:
plot(newdata$conc, predict(fmz, z.newdata, re.form = NA))
}

# compare with "arm::standardize"
library(arm)
fms <- standardize(lmer(uptake ~ conc + I(conc^2) + (1 | Plant), data = CO2))
plot(newdata$conc, predict(fms, z.newdata, re.form = NA))
}
