logis_fe: Main function for fitting the fixed effect logistic model

Description

Fit a fixed effect logistic model via the serial blockwise inversion Newton (SerBIN) or the block ascent Newton (BAN) algorithm.

Usage

logis_fe(
  formula = NULL,
  data = NULL,
  Y.char = NULL,
  Z.char = NULL,
  ProvID.char = NULL,
  Y = NULL,
  Z = NULL,
  ProvID = NULL,
  method = "SerBIN",
  max.iter = 1000,
  tol = 1e-05,
  bound = 10,
  cutoff = 10,
  backtrack = TRUE,
  stop = "or",
  threads = 1,
  message = TRUE
)

Value

A list of objects with S3 class "logis_fe":

coefficient: a list containing the estimated coefficients: beta, the fixed effects for each predictor, and gamma, the effect for each provider.
variance: a list containing the variance estimates: beta, the variance-covariance matrix of the predictor coefficients, and gamma, the variance of the provider effects.
linear_pred: the linear predictor of each individual.
fitted: the predicted probability of each observation having a response of 1.
observation: the original response of each individual.
Loglkd: the log-likelihood.
AIC: Akaike information criterion.
BIC: Bayesian information criterion.
AUC: area under the ROC curve.
char_list: a list of the character vectors representing the column names for the response variable, covariates, and provider identifier. For categorical variables, the names reflect the dummy variables created for each category.
data_include: the data used to fit the model, sorted by the provider identifier. For categorical covariates, this includes the dummy variables created for all categories except the reference level. Additionally, it contains three extra columns: included, indicating whether the provider is included based on the cutoff argument; all.events, indicating if all observations in the provider are 1; no.events, indicating if all observations in the provider are 0.

Arguments

formula

a two-sided formula object describing the model to be fitted, with the response variable on the left of a ~ operator and covariates on the right, separated by + operators. The fixed effect of the provider identifier is specified using id().

data

a data frame containing the variables named in the formula, or the columns specified by Y.char, Z.char, and ProvID.char.

Y.char

a character string specifying the column name of the response variable in the data.

Z.char

a character vector specifying the column names of the covariates in the data.

ProvID.char

a character string specifying the column name of the provider identifier in the data.

Y

a numeric vector representing the response variable.

Z

a matrix or data frame representing the covariates, which can include both numeric and categorical variables.

ProvID

a numeric vector representing the provider identifier.

method

a string specifying the algorithm to be used. The default value is "SerBIN".

"SerBIN" uses the serial blockwise inversion Newton algorithm to fit the model (see Wu et al. (2022)).
"BAN" uses the block ascent Newton algorithm to fit the model (see He et al. (2013)).

max.iter

maximum number of iterations allowed if the stopping criterion specified by stop is not satisfied. The default value is 1,000.

tol

tolerance used for stopping the algorithm. See details in stop below. The default value is 1e-5.

bound

a positive number to avoid inflation of provider effects. The default value is 10.

cutoff

an integer specifying the minimum number of observations required for a provider. Providers with fewer observations than the cutoff are labeled "included = 0" and excluded from model fitting. The default is 10.
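
The cutoff rule, together with the included, all.events, and no.events columns described under data_include, can be illustrated in base R. This is only a minimal sketch on made-up data, not the package's internal code: the data frame toy and the helper flag_providers are hypothetical, and the 0/1 integer coding of the flags is an assumption.

```r
# Hypothetical toy data: three providers with 12, 3, and 10 observations
toy <- data.frame(
  ProvID = rep(c("A", "B", "C"), times = c(12, 3, 10)),
  Y      = c(rep(c(0, 1), 6), 1, 0, 1, rep(1, 10))
)

# Hypothetical helper mimicking the documented flags (not pprof internals)
flag_providers <- function(data, Y.char, ProvID.char, cutoff = 10) {
  id <- data[[ProvID.char]]
  y  <- data[[Y.char]]
  n_prov   <- ave(y, id, FUN = length)  # provider size, repeated on each row
  sum_prov <- ave(y, id, FUN = sum)     # number of events in the provider
  data$included   <- as.integer(n_prov >= cutoff)    # fewer than cutoff -> 0
  data$all.events <- as.integer(sum_prov == n_prov)  # every response is 1
  data$no.events  <- as.integer(sum_prov == 0)       # every response is 0
  data[order(id), ]                                  # sort by provider
}

flagged <- flag_providers(toy, "Y", "ProvID", cutoff = 10)
unique(flagged[, c("ProvID", "included", "all.events", "no.events")])
```

In this toy data, provider B falls below the cutoff and would be dropped, while provider C meets the cutoff but has only events.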

backtrack

a Boolean indicating whether backtracking line search is implemented. The default is TRUE.

stop

a character string specifying the stopping rule to determine convergence.

"beta" stops the algorithm when the infinity norm of the difference between the current and previous beta coefficients is less than tol.
"relch" stops the algorithm when \((loglik(m)-loglik(m-1))/(loglik(m))\) (the difference between the log-likelihoods of the current and previous iterations, divided by the log-likelihood of the current iteration) is less than tol.
"ratch" stops the algorithm when \((loglik(m)-loglik(m-1))/(loglik(m)-loglik(0))\) (the difference between the log-likelihoods of the current and previous iterations, divided by the difference between the log-likelihoods of the current and initial iterations) is less than tol.
"all" stops the algorithm when all the stopping rules ("beta", "relch", "ratch") are met.
"or" stops the algorithm when any one of the rules ("beta", "relch", "ratch") is met.

The default value is "or". If max.iter is reached, the algorithm terminates regardless of the stopping rule.
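
The stopping rules can be written out as a small R function. This is a sketch under stated assumptions rather than the package's code: the helper check_stop and its argument names are hypothetical, and taking absolute values of the two log-likelihood ratios is an assumption not spelled out in the formulas.

```r
# Hypothetical helper: evaluate the documented stopping rules at one iteration.
# beta_new / beta_old: coefficient vectors at iterations m and m - 1
# ll_new / ll_old / ll_init: log-likelihoods at iterations m, m - 1, and 0
check_stop <- function(beta_new, beta_old, ll_new, ll_old, ll_init,
                       tol = 1e-5, stop = "or") {
  crit <- c(
    beta  = max(abs(beta_new - beta_old)),               # infinity norm
    relch = abs((ll_new - ll_old) / ll_new),             # assumed abs()
    ratch = abs((ll_new - ll_old) / (ll_new - ll_init))  # assumed abs()
  )
  met <- crit < tol
  switch(stop,
    all = all(met),
    or  = any(met),
    met[[stop]]  # a single rule: "beta", "relch", or "ratch"
  )
}

# Beta still moves by 1e-3, but the log-likelihood has almost stabilized
args <- list(beta_new = c(0.501, -1.2), beta_old = c(0.5, -1.2),
             ll_new = -120.0001, ll_old = -120.0002, ll_init = -150)
stop_or  <- do.call(check_stop, c(args, stop = "or"))   # any rule met
stop_all <- do.call(check_stop, c(args, stop = "all"))  # every rule met
```

Here stop_or is TRUE because the two log-likelihood rules are satisfied, while stop_all is FALSE because the beta rule is not.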

threads

a positive integer specifying the number of threads to be used. The default value is 1.

message

a Boolean indicating whether to print the progress of the fitting process. The default is TRUE.

Details

The function accepts three different input formats: a formula and dataset, where the formula is of the form response ~ covariates + id(provider), with provider representing the provider identifier; a dataset along with the column names of the response, covariates, and provider identifier; or the binary outcome vector \(\boldsymbol{Y}\), the covariate matrix or data frame \(\mathbf{Z}\), and the provider identifier vector.

The default algorithm is based on Serial blockwise inversion Newton (SerBIN) proposed by Wu et al. (2022), but users can also choose to use the block ascent Newton (BAN) algorithm proposed by He et al. (2013) to fit the model. Both methodologies build upon the Newton-Raphson method, yet SerBIN simultaneously updates both the provider effect and covariate coefficient. This concurrent update necessitates the inversion of the whole information matrix at each iteration. In contrast, BAN adopts a two-layer updating approach, where the covariate coefficient is sequentially fixed to update the provider effect, followed by fixing the provider effect to update the covariate coefficient.
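
The two-layer updating idea behind BAN can be sketched in base R on simulated data. This is an illustrative simplification, not the implementation in the package or in He et al. (2013): the function ban_sketch is hypothetical, it assumes provider identifiers coded 1, ..., m, it stops on the change in beta only, and it omits the bound and backtracking safeguards.

```r
set.seed(1)
# Simulated toy data (hypothetical): 5 providers, 200 observations each
m <- 5; n_i <- 200
id <- rep(seq_len(m), each = n_i)
Z  <- matrix(rnorm(m * n_i * 2), ncol = 2)
gamma_true <- seq(-1, 1, length.out = m)
beta_true  <- c(0.5, -0.5)
Y  <- rbinom(m * n_i, 1, plogis(gamma_true[id] + as.vector(Z %*% beta_true)))

ban_sketch <- function(Y, Z, id, max.iter = 1000, tol = 1e-5) {
  gamma <- rep(0, max(id))  # provider effects
  beta  <- rep(0, ncol(Z))  # covariate coefficients
  for (iter in seq_len(max.iter)) {
    # Layer 1: hold beta fixed, take one Newton step for each provider effect
    p <- as.vector(plogis(gamma[id] + Z %*% beta))
    score_g <- tapply(Y - p, id, sum)
    info_g  <- tapply(p * (1 - p), id, sum)
    gamma   <- gamma + as.vector(score_g / info_g)
    # Layer 2: hold gamma fixed, take one Newton step for beta
    p <- as.vector(plogis(gamma[id] + Z %*% beta))
    score_b  <- crossprod(Z, Y - p)
    info_b   <- crossprod(Z, Z * (p * (1 - p)))
    beta_new <- beta + as.vector(solve(info_b, score_b))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  list(beta = beta, gamma = gamma, iter = iter)
}

fit <- ban_sketch(Y, Z, id)
# Cross-check against glm() with one dummy variable per provider
ref <- glm(Y ~ 0 + factor(id) + Z, family = binomial)
cbind(sketch = c(fit$gamma, fit$beta), glm = unname(coef(ref)))
```

Because the provider effects enter the information matrix as a diagonal block, each provider update in layer 1 is a scalar division, which is what keeps this kind of scheme cheap when the number of providers is large.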

We suggest using the default "SerBIN" option, as it typically converges much faster for most datasets. However, in rare cases where the SerBIN algorithm fails with an error because the matrix of second-order derivatives is not invertible, users can consider the "BAN" option as an alternative. For further details, please consult the original articles.

If issues arise during model fitting, consider using the data_check function to perform a data quality check, which can help identify missing values, low variation in covariates, high pairwise correlation, and multicollinearity. For datasets with missing values, logis_fe automatically removes observations (rows) with any missing values before fitting the model.
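
The effect of row-wise removal of missing values can be previewed in base R before fitting. This is a minimal sketch on a hypothetical data frame toy; it shows complete-case filtering with complete.cases and is not the package's own preprocessing code.

```r
# Hypothetical toy data with missing entries in the response and a covariate
toy <- data.frame(
  Y      = c(1, 0, NA, 1, 0),
  z1     = c(0.2, NA, 1.1, -0.4, 0.7),
  ProvID = c(1, 1, 2, 2, 2)
)

# Keep only rows with no missing value in any model variable
keep <- complete.cases(toy[, c("Y", "z1", "ProvID")])
toy_complete <- toy[keep, ]

n_dropped <- nrow(toy) - nrow(toy_complete)  # rows 2 and 3 are dropped
table(toy_complete$ProvID)                   # remaining provider sizes
```

Checking the remaining provider sizes matters because dropping rows can push a provider below the cutoff.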

References

He, K, Kalbfleisch, J, Li, Y, et al. (2013) Evaluating hospital readmission rates in dialysis providers; adjusting for hospital effects. Lifetime Data Analysis, 19: 490-512.

Wu, W, Yang, Y, Kang, J, He, K. (2022) Improving large-scale estimation and inference for profiling health care providers. Statistics in Medicine, 41(15): 2840-2853.

Examples

data(ExampleDataBinary)
outcome <- ExampleDataBinary$Y
covar <- ExampleDataBinary$Z
ProvID <- ExampleDataBinary$ProvID
data <- data.frame(outcome, ProvID, covar)
covar.char <- colnames(covar)
outcome.char <- colnames(data)[1]
ProvID.char <- colnames(data)[2]
formula <- as.formula(paste("outcome ~", paste(covar.char, collapse = " + "), "+ id(ProvID)"))

# Fit the fixed effect logistic model using the three input formats
fit_fe1 <- logis_fe(Y = outcome, Z = covar, ProvID = ProvID)
fit_fe2 <- logis_fe(data = data, Y.char = outcome.char,
                    Z.char = covar.char, ProvID.char = ProvID.char)
fit_fe3 <- logis_fe(formula, data)