logis_firth: Main function for fitting the fixed effect logistic model using Firth correction

Description

Fixed effects (FE) logistic models suffer from separation when all outcomes in a cluster (provider) are identical, which leads to infinite estimates and unreliable inference. Firth’s corrected logistic regression (FLR) overcomes this limitation and outperforms both FE and random effects (RE) models in terms of bias and root mean squared error (RMSE).
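The separation problem described above can be reproduced with base R alone. The following is a minimal sketch, not part of the package: the toy data frame and the names toy, prov, gamma_hat, and se_hat are made up for illustration, and an ordinary glm() fit stands in for an uncorrected FE model.

```r
# Toy data: every outcome for provider "A" is 1 (separation),
# while provider "B" has a mix of 0s and 1s.
toy <- data.frame(
  prov = rep(c("A", "B"), each = 10),
  y    = c(rep(1, 10), rep(c(0, 1), 5))
)

# Uncorrected maximum likelihood fit with one effect per provider.
# glm() warns about fitted probabilities of 0 or 1; the warning is
# suppressed here because the point is to inspect the estimates.
fit <- suppressWarnings(glm(y ~ 0 + prov, family = binomial, data = toy))

gamma_hat <- coef(fit)              # provider effects on the logit scale
se_hat    <- sqrt(diag(vcov(fit)))  # their standard errors

print(gamma_hat)
print(se_hat)
```

The estimate for provider "A" drifts toward infinity until the iteration limit or tolerance stops it, and its standard error is enormous, whereas provider "B" is estimated without difficulty. This is the situation the Firth correction is meant to address.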

Usage

logis_firth(
  formula = NULL,
  data = NULL,
  Y.char = NULL,
  Z.char = NULL,
  ProvID.char = NULL,
  Y = NULL,
  Z = NULL,
  ProvID = NULL,
  max.iter = 1000,
  tol = 1e-05,
  bound = 10,
  cutoff = 10,
  threads = 1,
  message = TRUE
)

Value

A list of objects with S3 class "logis_fe":

coefficient: a list containing the estimated coefficients: beta, the fixed effects for each predictor, and gamma, the effect for each provider.
variance: a list containing the variance estimates: beta, the variance-covariance matrix of the predictor coefficients, and gamma, the variance of the provider effects.
linear_pred: the linear predictor of each individual.
fitted: the predicted probability of each observation having a response of 1.
observation: the original response of each individual.
Loglkd: the log-likelihood.
AIC: Akaike information criterion.
BIC: Bayesian information criterion.
AUC: area under the ROC curve.
char_list: a list of the character vectors representing the column names for the response variable, covariates, and provider identifier. For categorical variables, the names reflect the dummy variables created for each category.
data_include: the data used to fit the model, sorted by the provider identifier. For categorical covariates, this includes the dummy variables created for all categories except the reference level. Additionally, it contains three extra columns: included, indicating whether the provider is included based on the cutoff argument; all.events, indicating if all observations in the provider are 1; no.events, indicating if all observations in the provider are 0.
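The three indicator columns in data_include can be illustrated with base R. This is a sketch under assumptions: the toy data, the 0/1 coding of the indicators, and the use of ave() are illustrative choices and not necessarily how the package computes them.

```r
# Toy data: provider "A" is below the cutoff, "B" has only events,
# and "C" has a mix of outcomes.
toy <- data.frame(
  ProvID = rep(c("A", "B", "C"), times = c(3, 12, 12)),
  Y      = c(1, 0, 1, rep(1, 12), rep(c(0, 1), 6))
)
cutoff <- 10

# Provider size and event count, repeated on every row of that provider
n_prov  <- ave(toy$Y, toy$ProvID, FUN = length)
n_event <- ave(toy$Y, toy$ProvID, FUN = sum)

toy$included   <- as.integer(n_prov >= cutoff)    # enough observations
toy$all.events <- as.integer(n_event == n_prov)   # every outcome is 1
toy$no.events  <- as.integer(n_event == 0)        # every outcome is 0

# One row per provider for a compact view
flags <- unique(toy[, c("ProvID", "included", "all.events", "no.events")])
print(flags)
```

Providers flagged by all.events or no.events are exactly the ones that cause separation in an uncorrected FE fit.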

Arguments

formula: a two-sided formula object describing the model to be fitted, with the response variable on the left of a ~ operator and covariates on the right, separated by + operators. The fixed effect of the provider identifier is specified using id().
data: a data frame containing the variables named in the formula, or the columns specified by Y.char, Z.char, and ProvID.char.
Y.char: a character string specifying the column name of the response variable in the data.
Z.char: a character vector specifying the column names of the covariates in the data.
ProvID.char: a character string specifying the column name of the provider identifier in the data.
Y: a numeric vector representing the response variable.
Z: a matrix or data frame representing the covariates, which can include both numeric and categorical variables.
ProvID: a numeric vector representing the provider identifier.
max.iter: maximum number of iterations allowed if the stopping criterion is not satisfied. The default value is 1,000.
tol: tolerance used for stopping the algorithm. The default value is 1e-5.
bound: a positive number to avoid inflation of provider effects. The default value is 10.
cutoff: an integer specifying the minimum number of observations required for providers. Providers with fewer observations than the cutoff will be labeled with included = 0 and excluded from model fitting. The default is 10.
threads: a positive integer specifying the number of threads to be used. The default value is 1.
message: a Boolean indicating whether to print the progress of the fitting process. The default is TRUE.
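To make the role of id() in the formula argument concrete, the sketch below shows how a term wrapped in id() can be located with base R's terms() and its specials mechanism. The variable names z1 and z2 are hypothetical, and this is only one possible way to parse such a formula, not necessarily what logis_firth does internally.

```r
# A formula in the documented form: response ~ covariates + id(provider)
f <- outcome ~ z1 + z2 + id(ProvID)

# Mark id() as a special so terms() records where it appears
tt <- terms(f, specials = "id")

# Position of the id() term among the formula variables
# (the response counts as position 1)
id_pos <- attr(tt, "specials")$id

# attr(tt, "variables") is the call list(outcome, z1, z2, id(ProvID));
# element 1 is `list`, so the variable at position k is element k + 1
id_var <- all.vars(attr(tt, "variables")[[id_pos + 1]])

response <- all.vars(f)[1]
covars   <- setdiff(all.vars(f)[-1], id_var)

print(list(response = response, covariates = covars, provider = id_var))
```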

Details

The function accepts three different input formats: a formula and dataset, where the formula is of the form response ~ covariates + id(provider), with provider representing the provider identifier; a dataset along with the column names of the response, covariates, and provider identifier; or the binary outcome vector \(\boldsymbol{Y}\), the covariate matrix or data frame \(\mathbf{Z}\), and the provider identifier vector.

This function utilizes OpenMP for parallel processing. For macOS, to enable multi-threading, users may need to install the OpenMP library (e.g., brew install libomp) or use a supported compiler such as GCC. If OpenMP is not detected during installation, the function will transparently fall back to single-threaded execution.
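A value for the threads argument can be chosen from the number of cores reported by the machine. The sketch below uses parallel::detectCores() from base R; leaving one core free is an arbitrary convention rather than a package recommendation, and the commented call assumes the objects from the Examples section exist.

```r
# detectCores() may return NA on some platforms, so fall back to 1
n_cores <- parallel::detectCores()
threads <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

print(threads)

# Hypothetical usage, assuming `formula` and `data` are defined as in
# the Examples section:
# fit <- logis_firth(formula, data, threads = threads)
```

If the package was built without OpenMP, a value greater than 1 should have no effect, since execution falls back to a single thread.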

References

Firth, D. (1993) Bias reduction of maximum likelihood estimates. Biometrika, 80(1): 27-38.

Examples


data(ExampleDataBinary)
outcome <- ExampleDataBinary$Y
covar <- ExampleDataBinary$Z
ProvID <- ExampleDataBinary$ProvID
data <- data.frame(outcome, ProvID, covar)
covar.char <- colnames(covar)
outcome.char <- colnames(data)[1]
ProvID.char <- colnames(data)[2]
formula <- as.formula(paste("outcome ~", paste(covar.char, collapse = " + "), "+ id(ProvID)"))

# Fit the Firth-corrected fixed effect logistic model using three input formats
fit_fe1 <- logis_firth(Y = outcome, Z = covar, ProvID = ProvID)
fit_fe2 <- logis_firth(data = data, Y.char = outcome.char,
                       Z.char = covar.char, ProvID.char = ProvID.char)
fit_fe3 <- logis_firth(formula, data)
