std: Standardize and center variables

Description

std() computes a z-transformation of the input, i.e. the values are centered and then scaled (by their standard deviation, by default). center() only centers the input. std_if() and center_if() are scoped variants of std() and center() that transform only those variables for which predicate returns TRUE.
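As a rough illustration of the two transformations, the following base R sketch shows the arithmetic on a toy vector. It is not the package's implementation; it assumes centering on the arithmetic mean and ignores missing values.

```r
# Illustration only: base R arithmetic behind centering and z-transformation.
# Assumes centering on the mean; the vector x is made-up toy data.
x <- c(2, 4, 4, 6, 9)

x_centered <- x - mean(x)    # centering: subtract the mean
x_z <- x_centered / sd(x)    # z-transformation: centered values divided by the SD

mean(x_centered)  # approximately 0
sd(x_z)           # approximately 1
```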

Usage

std(
  x,
  ...,
  robust = c("sd", "2sd", "gmd", "mad"),
  include.fac = FALSE,
  append = TRUE,
  suffix = "_z"
)
std_if(
  x,
  predicate,
  robust = c("sd", "2sd", "gmd", "mad"),
  include.fac = FALSE,
  append = TRUE,
  suffix = "_z"
)
center(x, ..., include.fac = FALSE, append = TRUE, suffix = "_c")
center_if(x, predicate, include.fac = FALSE, append = TRUE, suffix = "_c")

Arguments

x

A vector or data frame.

...

Optional, unquoted names of variables that should be selected for further processing. Required if x is a data frame (rather than a vector) and only selected variables from x should be processed. You may also use functions like : or tidyselect's select helpers. See 'Examples' or the package vignette.

robust

Character vector, indicating the method applied when standardizing variables with std(). By default, standardization is achieved by dividing the centered variables by their standard deviation (robust = "sd"). However, for skewed distributions, the median absolute deviation (MAD, robust = "mad") or Gini's mean difference (robust = "gmd") might be more robust measures of dispersion. For the latter option, sjstats needs to be installed. robust = "2sd" divides the centered variables by two standard deviations, following a suggestion by Gelman (2008), so the rescaled input is comparable to binary variables.
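To make the four scaling options concrete, here is a base R sketch that divides a mean-centered toy vector by each dispersion measure. It is an assumption-laden illustration, not the package's code: the helper gmd() is hypothetical, stats::mad() is used with its default scaling constant, and the text above does not state whether the package centers on the mean or the median for the robust options.

```r
# Illustration only: the four dispersion measures applied to mean-centered data.
x <- c(1, 2, 2, 3, 4, 7, 15)   # right-skewed toy data
xc <- x - mean(x)

# Hypothetical helper: Gini's mean difference, the mean absolute
# difference over all pairs of distinct observations.
gmd <- function(v) {
  n <- length(v)
  sum(abs(outer(v, v, "-"))) / (n * (n - 1))
}

scaled <- list(
  sd    = xc / sd(x),
  `2sd` = xc / (2 * sd(x)),
  mad   = xc / mad(x),    # stats::mad() with its default constant
  gmd   = xc / gmd(x)
)

# Resulting spread of each rescaled variable
sapply(scaled, sd)
```

On skewed data like this, the single large value inflates the standard deviation more than it inflates the MAD, which is why the robust options can give a different picture of spread.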

include.fac

Logical, if TRUE, factors will be converted to numeric vectors and also standardized or centered.
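The description above does not say which numeric representation a factor receives before it is transformed. The base R sketch below only shows the two common possibilities, so that results with include.fac = TRUE can be checked against them; it makes no claim about which one the package uses.

```r
# Illustration only: two ways a factor can become numeric in base R.
f <- factor(c("10", "20", "20", "40"))

as.numeric(f)                  # internal level codes: 1 2 2 3
as.numeric(as.character(f))    # level labels read as numbers: 10 20 20 40
```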

append

Logical, if TRUE (the default) and x is a data frame, x including the new variables as additional columns is returned; if FALSE, only the new variables are returned.

suffix

String value, will be appended to variable (column) names of x, if x is a data frame. If x is not a data frame, this argument will be ignored. The default value to suffix column names in a data frame depends on the function call:

recoded variables (rec()) will be suffixed with "_r"
recoded variables (recode_to()) will be suffixed with "_r0"
dichotomized variables (dicho()) will be suffixed with "_d"
grouped variables (split_var()) will be suffixed with "_g"
grouped variables (group_var()) will be suffixed with "_gr"
standardized variables (std()) will be suffixed with "_z"
centered variables (center()) will be suffixed with "_c"
de-meaned variables (de_mean()) will be suffixed with "_dm"
grouped-meaned variables (de_mean()) will be suffixed with "_gm"

If suffix = "" and append = TRUE, existing variables that have been recoded/transformed will be overwritten.

predicate

A predicate function to be applied to the columns. The variables for which predicate returns TRUE are selected.
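The selection step can be pictured with base R alone: the predicate is called once per column and must return a single TRUE or FALSE. The data frame df and the predicate p below are made up for illustration, and the sketch does not reproduce the package's internals.

```r
# Illustration only: applying a predicate column by column.
df <- data.frame(
  id    = 1:12,
  group = rep(c(1, 2), 6),
  score = c(3.1, 4.7, 2.2, 5.9, 6.3, 1.8, 7.4, 3.3, 4.1, 5.5, 2.9, 6.8)
)

# Predicate: TRUE for columns with more than 10 unique values
p <- function(v) length(unique(v)) > 10

keep <- vapply(df, p, logical(1))
names(df)[keep]   # "id" "score"

# Only the selected columns would then be transformed
z <- lapply(df[keep], function(v) (v - mean(v)) / sd(v))
```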

Value

If x is a vector, returns a vector with the standardized or centered values. If x is a data frame and append = TRUE, x is returned with the transformed variables added as new columns; if append = FALSE, only the transformed variables are returned. If append = TRUE and suffix = "", the transformed variables replace (overwrite) the existing variables.
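The three return shapes for data frames can be mimicked in base R as follows. This is a sketch of the described behaviour on a made-up data frame, not the package's code; in particular, placing the new columns after the original ones is an assumption.

```r
# Illustration only: mimicking append and suffix with base R.
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))
z_raw <- as.data.frame(lapply(df, function(v) (v - mean(v)) / sd(v)))

# Like suffix = "_z": new names are the old names plus the suffix
z <- z_raw
names(z) <- paste0(names(df), "_z")

appended <- cbind(df, z)   # like append = TRUE: a, b, a_z, b_z
only_new <- z              # like append = FALSE: a_z, b_z

# Like append = TRUE with suffix = "": same names, so columns are overwritten
overwritten <- df
overwritten[names(z_raw)] <- z_raw
```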

Details

std() and center() also work on grouped data frames (see group_by). In this case, standardization or centering is applied separately within each group of x, so each value is transformed relative to its own group. See 'Examples'.
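The difference between overall and group-wise standardization can be checked with base R's ave() on the built-in mtcars data. The helper z() is hypothetical and only stands in for the default z-transformation.

```r
# Illustration only: overall vs. group-wise standardization in base R.
z <- function(v) (v - mean(v)) / sd(v)   # hypothetical stand-in helper

disp_overall <- z(mtcars$disp)                         # one mean and SD for all rows
disp_by_cyl  <- ave(mtcars$disp, mtcars$cyl, FUN = z)  # mean and SD per cyl group

# Within each cyl group, the group-wise version has mean ~0 and SD ~1
tapply(disp_by_cyl, mtcars$cyl, mean)
tapply(disp_by_cyl, mtcars$cyl, sd)
```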

For more complicated models with many predictors, Gelman and Hill (2007) suggest leaving binary inputs as they are and standardizing only the continuous predictors by dividing by two standard deviations. This makes the coefficients roughly comparable.
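The reasoning behind the two-standard-deviation rule can be seen numerically: a 0/1 variable with equal proportions has a standard deviation of about 0.5, and a continuous variable divided by two standard deviations has a standard deviation of exactly 0.5. The sketch below uses simulated data and base R only.

```r
# Illustration only: why dividing by 2 SD puts inputs on the scale of a binary variable.
set.seed(1)

binary <- rep(c(0, 1), each = 50)        # untouched binary input, 50/50 split
cont   <- rnorm(100, mean = 50, sd = 10) # simulated continuous predictor

cont_2sd <- (cont - mean(cont)) / (2 * sd(cont))

sd(binary)     # close to 0.5
sd(cont_2sd)   # 0.5 (up to floating point)
```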

References

Gelman A (2008) Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine 27: 2865-2873. http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf

Gelman A, Hill J (2007) Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press: 55-57

Examples

# NOT RUN {
library(sjmisc) # provides std(), center() and the efc sample data
data(efc)
std(efc$c160age) %>% head()
std(efc, e17age, c160age, append = FALSE) %>% head()

center(efc$c160age) %>% head()
center(efc, e17age, c160age, append = FALSE) %>% head()

# NOTE!
std(efc$e17age) # returns a vector
std(efc, e17age) # returns a data frame

# with quasi-quotation
x <- "e17age"
center(efc, !!x, append = FALSE) %>% head()

# works with mutate()
library(dplyr)
efc %>%
  select(e17age, neg_c_7) %>%
  mutate(age_std = std(e17age), burden = center(neg_c_7)) %>%
  head()

# works also with grouped data frames
mtcars %>% std(disp)

# compare new column "disp_z" w/ output above
mtcars %>%
  group_by(cyl) %>%
  std(disp)

data(iris)
# also standardize factors
std(iris, include.fac = TRUE, append = FALSE)
# don't standardize factors
std(iris, include.fac = FALSE, append = FALSE)

# standardize only variables with more than 10 unique values
p <- function(x) dplyr::n_distinct(x) > 10
std_if(efc, predicate = p, append = FALSE)

# }
