lm_betaselect: Betas-Select in a Regression Model

Description

Can fit a linear regression models with selected variables standardized; handle product terms correctly and skip categorical predictors in standardization.

Usage

lm_betaselect(
  ...,
  to_standardize = NULL,
  not_to_standardize = NULL,
  skip_response = FALSE,
  do_boot = TRUE,
  bootstrap = 100L,
  iseed = NULL,
  parallel = FALSE,
  ncpus = parallel::detectCores(logical = FALSE) - 1,
  progress = TRUE,
  load_balancing = FALSE,
  model_call = c("lm", "glm")
)
glm_betaselect(
  ...,
  to_standardize = NULL,
  not_to_standardize = NULL,
  skip_response = FALSE,
  do_boot = TRUE,
  bootstrap = 100L,
  iseed = NULL,
  parallel = FALSE,
  ncpus = parallel::detectCores(logical = FALSE) - 1,
  progress = TRUE,
  load_balancing = FALSE
)
# S3 method for lm_betaselect
print(
  x,
  digits = max(3L, getOption("digits") - 3L),
  type = c("beta", "standardized", "raw", "unstandardized"),
  ...
)
# S3 method for glm_betaselect
print(
  x,
  digits = max(3L, getOption("digits") - 3L),
  type = c("beta", "standardized", "raw", "unstandardized"),
  ...
)
raw_output(x)

Value

The function lm_betaselect()

returns an object of the class lm_betaselect, The function glm_betaselect()

returns an object of the class glm_betaselect. They are similar in structure to the output of stats::lm() and stats::glm(), with additional information stored.

The function raw_output() returns an object of the class lm or glm, which are the results of fitting the model to the data by stats::lm()

or stats::glm() without standardization.

Arguments

...: For lm_betaselect(). these arguments will be passed directly to lm(). For glm_betaselect(), these arguments will be passed to glm(). For the print-method of lm_betaselect or glm_betaselect objects, this will be passed to other methods.
to_standardize: A string vector, which should be the names of the variables to be standardized. Default is NULL, indicating all variables are to be standardized.
not_to_standardize: A string vector, which should be the names of the variables that should not be standardized. This argument is useful when most variables, except for a few, are to be standardized. This argument cannot be ued with to_standardize at the same time. Default is NULL, and only to_standardize is used.
skip_response: Logical. If TRUE, will not standardize the response (outcome) variable even if it appears in to_standardize or to_standardize is not specified. Used for models such as logistic regression models in which there are some restrictions on the response variables (e.g., only 0 or 1 for logistic regression).
do_boot: Whether bootstrapping will be conducted. Default is TRUE.
bootstrap: If do_boot is TRUE, this argument is the number of bootstrap samples to draw. Default is 100. Should be set to 5000 or even 10000 for stable results.
iseed: If do_boot is TRUE and this argument is not NULL, it will be used by set.seed() to set the seed for the random number generator. Default is NULL.
parallel: If do_boot is TRUE and this argument is TRUE, parallel processing will be used to do bootstrapping. Default is FALSE because bootstrapping for models fitted by stats::lm() or stats::glm() is rarely slow. Actually, if both parallel and progress are set to TRUE, the speed may even be slower than serial processing.
ncpus: If do_boot is TRUE and parallel is also TRUE, this argument is the number of processes to be used in parallel processing. Default is parallel::detectCores(logical = FALSE) - 1
progress: Logical. If TRUE, progress bars will be displayed for long process. Default is TRUE.
load_balancing: Logical. If parallel is TRUE, this determines whether load balancing will be used. Default is FALSE because the gain in speed is usually minor.
model_call: The model function to be called. If "lm", the default, the model will be fitted by stats::lm(). If "glm", the model will be fitted by stats::glm(). Users should call the corresponding function directly rather than setting this argument manually.
x: An lm_betaselect or glm_betaselect object.
digits: The number of significant digits to be printed for the coefficients.
type: The coefficients to be printed. For "beta" or "standardized", the coefficients after selected variables standardized will be printed. For "raw" or "unstandardized", the coefficients before standardization was done will be printed.

Author

Shu Fai Cheung https://orcid.org/0000-0002-9871-9448

Details

The functions lm_betaselect() and glm_betaselect() let users select which variables to be standardized when computing the standardized solution. They have the following features:

They automatically skip categorical predictors (i.e., factor or string variables).
They do not standardize a product term, which is incorrect. Instead, they compute the product term with its component variables standardized, if requested.
They standardize the selected variables before fitting a model. Therefore, If a model has the term log(x) and x is one of the selected variables, the model used the logarithm of the standardized x in the model, instead of standardized log(x) which is difficult to interpret.
They can be used to generate nonparametric bootstrap confidence intervals for the standardized solution. Bootstrap confidence interval is better than the default confidence interval ignoring the standardization because it takes into account the sampling variance of the standard deviations. Preliminary support for bootstrap confidence has been found for forming confidence intervals for coefficients involving standardized variables in linear regression (Jones & Waller, 2013).

Problems With Common Approaches

In some regression programs, users have limited control on which variables to standardize when requesting the so-called "betas". The solution may be uninterpretable or misleading in these conditions:

Dummy variables are standardized and their coefficients cannot be interpreted as the difference between two groups on the outcome variables.
Product terms (interaction terms) are standardized and they cannot be interpreted as the changes in the effects of focal variables when the moderators change (Cheung, Cheung, Lau, Hui, & Vong, 2022).
Variables with meaningful units can be more difficult to interpret when they are standardized (e.g., age).

How The Function Work

They standardize the original variables before they are used in the model. Therefore, strictly speaking, they do not standardize the predictors in model, but standardize the input variable (Gelman et al., 2021).

The requested model is then fitted to the dataset with selected variables standardized. For the ease of follow-up analysis, both the results with selected variables standardized and the results without standardization are stored. If required, the results without standardization can be retrieved by raw_output().

Methods

The output of lm_betaselect() is an lm_betaselect-class object, and the output of glm_betaselect() is a glm_betaselect-class object. They have the following methods:

A coef-method for extracting the coefficients of the model. (See coef.lm_betaselect() and coef.glm_betaselect() for details.)
A vcov-method for extracting the variance-covariance matrix of the estimates of the coefficients. If bootstrapping is requested, it can return the matrix based on the bootstrapping estimates. (See vcov.lm_betaselect() and vcov.glm_betaselect() for details.)
A confint-method for forming the confidence intervals of the estimates of the coefficients. If bootstrapping is requested, it can return the bootstrap confidence intervals. (See confint.lm_betaselect() and confint.glm_betaselect() for details.)
A summary-method for printing the summary of the results, with additional information such as the number of bootstrap samples and which variables have been standardized. (See summary.lm_betaselect() and summary.glm_betaselect() for details.)
An anova-method for printing the ANOVA table. Can also be used to compare two or more outputs of lm_betaselect() or glm_betaselect() (See anova.glm_betaselect() and anova.glm_betaselect() for details.)
A predict-method for computing predicted values. It can be used to compute the predicted values given a set of new unstandardized data. The data will be standardized before computing the predicted values in the models with standardization. (See predict.lm_betaselect() and predict.glm_betaselect() for details.)
The default update-method for updating a call also works for an lm_betaselect object or a glm_betaselect() object. It can update the model in the same way it updates a model fitted by stats::lm() or stats::glm(), and also update the arguments of lm_betaselect() or glm_betaselect() such as the variables to be standardized. (See stats::update() for details.)

Most other methods for the output of stats::lm() and stats::glm() should also work on an lm_betaselect-class object or a glm_betaselect-class object, respectively. Some of them will give the same results regardless of the variables standardized. Examples are rstandard() and cooks.distance(). For some others, they should be used with cautions if they make use of the variance-covariance matrix of the estimates.

To use the methods for lm objects or glm objects on the results without standardization, simply use raw_output(). For example, to get the fitted values without standardization, call fitted(raw_output(x)), where x is the output of lm_betaselect() or glm_betaselect().

The function raw_output() simply extracts the regression output by stats::lm() or stats::glm() on the variables without standardization.

References

Cheung, S. F., Cheung, S.-H., Lau, E. Y. Y., Hui, C. H., & Vong, W. N. (2022) Improving an old way to measure moderation effect in standardized units. Health Psychology, 41(7), 502-505. tools:::Rd_expr_doi("10.1037/hea0001188")

Craig, C. C. (1936). On the frequency function of xy. The Annals of Mathematical Statistics, 7(1), 1--15. tools:::Rd_expr_doi("10.1214/aoms/1177732541")

Gelman, A., Hill, J., & Vehtari, A. (2021). Regression and other stories. Cambridge University Press. tools:::Rd_expr_doi("10.1017/9781139161879")

Jones, J. A., & Waller, N. G. (2013). Computing confidence intervals for standardized regression coefficients. Psychological Methods, 18(4), 435--453. tools:::Rd_expr_doi("10.1037/a0033269")

Examples

Run this code


data(data_test_mod_cat)

# Standardize only iv

lm_beta_x <- lm_betaselect(dv ~ iv*mod + cov1 + cat1,
                           data = data_test_mod_cat,
                           to_standardize = "iv")
lm_beta_x
summary(lm_beta_x)

# Manually standardize iv and call lm()

data_test_mod_cat$iv_z <- scale(data_test_mod_cat[, "iv"])[, 1]

lm_beta_x_manual <- lm(dv ~ iv_z*mod + cov1 + cat1,
                       data = data_test_mod_cat)

coef(lm_beta_x)
coef(lm_beta_x_manual)

# Standardize all numeric variables

lm_beta_all <- lm_betaselect(dv ~ iv*mod + cov1 + cat1,
                             data = data_test_mod_cat)
# Note that cat1 is not standardized
summary(lm_beta_all)


data(data_test_mod_cat)

data_test_mod_cat$p <- scale(data_test_mod_cat$dv)[, 1]
data_test_mod_cat$p <- ifelse(data_test_mod_cat$p > 0,
                              yes = 1,
                              no = 0)
# Standardize only iv
logistic_beta_x <- glm_betaselect(p ~ iv*mod + cov1 + cat1,
                                  family = binomial,
                                  data = data_test_mod_cat,
                                  to_standardize = "iv")
summary(logistic_beta_x)

logistic_beta_x
summary(logistic_beta_x)

# Manually standardize iv and call glm()

data_test_mod_cat$iv_z <- scale(data_test_mod_cat[, "iv"])[, 1]

logistic_beta_x_manual <- glm(p ~ iv_z*mod + cov1 + cat1,
                              family = binomial,
                              data = data_test_mod_cat)

coef(logistic_beta_x)
coef(logistic_beta_x_manual)

# Standardize all numeric predictors

logistic_beta_allx <- glm_betaselect(p ~ iv*mod + cov1 + cat1,
                                     family = binomial,
                                     data = data_test_mod_cat,
                                     to_standardize = c("iv", "mod", "cov1"))
# Note that cat1 is not standardized
summary(logistic_beta_allx)


summary(raw_output(lm_beta_x))

Run the code above in your browser using DataLab