cat_lmm_initialization: Initialization for Catalytic Linear Mixed Model (LMM)

Description

This function prepares and initializes a catalytic linear mixed model by processing the input data, extracting the required variables, generating synthetic datasets, and fitting a model. Only a single random-effect variance is considered.

Usage

cat_lmm_initialization(
  formula,
  data,
  x_cols,
  y_col,
  z_cols,
  group_col = NULL,
  syn_size = NULL,
  resample_by_group = FALSE,
  resample_only = FALSE,
  na_replace = mean
)

Value

A list containing the values of all the input arguments and the following components:

Function Information:
- function_name: A character string representing the name of the function, "cat_lmm_initialization".
- simple_model: A model fitted with lme4::lmer or stats::lm, used to generate the synthetic response from the original data.

Observation Data Information:
- obs_size: An integer representing the number of observations in the original dataset.
- obs_data: The original data used for fitting the model, returned as a data frame.
- obs_x: A data frame containing the standardized predictor variables from the original dataset.
- obs_y: A numeric vector of the standardized response variable from the original dataset.
- obs_z: A data frame containing the standardized random effect variables from the original dataset.
- obs_group: A numeric vector representing the grouping variable for the original observations.

Synthetic Data Information:
- syn_size: An integer representing the number of synthetic observations generated.
- syn_data: A data frame containing the synthetic dataset, combining synthetic predictor and response variables.
- syn_x: A data frame containing the synthetic predictor variables.
- syn_y: A numeric vector of the synthetic response variable values.
- syn_z: A data frame containing the synthetic random effect variables.
- syn_group: A numeric vector representing the grouping variable for the synthetic observations.
- syn_x_resample_inform: A data frame containing information about the resampling process for synthetic predictors:
  - Coordinate: Preserves the original data values as reference coordinates during processing.
  - Deskewing: Adjusts the data distribution to reduce skewness and enhance symmetry.
  - Smoothing: Reduces noise in the data to stabilize the dataset and prevent overfitting.
  - Flattening: Creates a more uniform distribution by modifying low-frequency categories in categorical variables.
  - Symmetrizing: Balances the data around its mean to improve statistical properties for model fitting.
- syn_z_resample_inform: A data frame containing information about the resampling process for synthetic random effects. The resampling methods are the same as those listed under syn_x_resample_inform.

Whole Data Information:
- size: An integer representing the total size of the combined original and synthetic datasets.
- data: A combined data frame of the original and synthetic datasets.
- x: A combined data frame of the original and synthetic predictor variables.
- y: A combined numeric vector of the original and synthetic response variables.
- z: A combined data frame of the original and synthetic random effect variables.
- group: A combined numeric vector representing the grouping variable for both original and synthetic datasets.
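
The relationship between the observed, synthetic, and combined components can be sketched in base R. This is an illustration only, not the package's implementation: it assumes each predictor column is resampled independently from its observed values, uses stats::lm as the simple model, and skips the standardization step. The names whole_x, whole_y, and whole_size are hypothetical stand-ins for the x, y, and size components.

```r
# Illustrative sketch only: NOT the internals of cat_lmm_initialization.
set.seed(1)

obs_data  <- mtcars
x_cols    <- "wt"
y_col     <- "mpg"
group_col <- "cyl"
syn_size  <- 100

# Observed components
obs_x    <- obs_data[, x_cols, drop = FALSE]
obs_y    <- obs_data[[y_col]]
obs_size <- nrow(obs_data)

# Assumption: draw each predictor column independently, with replacement
syn_x <- as.data.frame(
  lapply(obs_x, function(col) sample(col, syn_size, replace = TRUE))
)
syn_group <- sample(obs_data[[group_col]], syn_size, replace = TRUE)

# A simple model fitted on the observed data generates the synthetic response
simple_model <- stats::lm(mpg ~ wt, data = obs_data)
syn_y <- as.numeric(predict(simple_model, newdata = syn_x))

# Combined ("whole") components: observed rows followed by synthetic rows
whole_x    <- rbind(obs_x, syn_x)
whole_y    <- c(obs_y, syn_y)
whole_size <- obs_size + syn_size
```

In this sketch the combined size is always the sum of the observed and synthetic sizes, which matches the description of size above.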

Arguments

formula: A formula specifying the model. It should include the response and predictor variables.
data: A data frame containing the data for modeling.
x_cols: A character vector of column names for fixed effects (predictors).
y_col: A character string for the name of the response variable.
z_cols: A character vector of column names for random effects.
group_col: A character string for the grouping variable (optional). If not given (NULL), it is extracted from the formula.
syn_size: An integer specifying the size of the synthetic dataset to be generated. If NULL (the default), it is set to length(x_cols) * 4.
resample_by_group: A logical indicating whether to resample by group, default is FALSE.
resample_only: A logical indicating whether to perform resampling only, default is FALSE.
na_replace: A function to replace NA values in the data, default is mean.
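
How the NULL defaults and na_replace might be resolved can be sketched in base R. The helpers extract_group and fill_na are hypothetical names introduced for illustration and are not part of the package; the regular expression assumes a single grouping term of the form (1 | group), and fill_na assumes that na_replace accepts an na.rm argument, as mean does.

```r
# Hypothetical helpers for illustration; not part of the package API.

# Pull the grouping variable out of an lme4-style formula such as
# mpg ~ wt + (1 | cyl). Assumes a single "(... | group)" term.
extract_group <- function(formula) {
  txt <- paste(deparse(formula), collapse = " ")
  m <- regmatches(
    txt,
    regexec("\\|\\s*([A-Za-z.][A-Za-z0-9._]*)\\s*\\)", txt)
  )[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Replace NA values in each numeric column with na_replace of that column.
# Assumes na_replace accepts na.rm, as mean and median do.
fill_na <- function(df, na_replace = mean) {
  df[] <- lapply(df, function(col) {
    if (is.numeric(col) && anyNA(col)) {
      col[is.na(col)] <- na_replace(col, na.rm = TRUE)
    }
    col
  })
  df
}

# Resolving the documented defaults
x_cols    <- c("wt")
syn_size  <- NULL
group_col <- NULL
formula   <- mpg ~ wt + (1 | cyl)

if (is.null(syn_size))  syn_size  <- length(x_cols) * 4
if (is.null(group_col)) group_col <- extract_group(formula)

toy    <- data.frame(a = c(1, NA, 3), b = c("u", "v", "w"))
filled <- fill_na(toy, na_replace = mean)
```

Passing a different summary function, for example median, changes only the value used to fill the gaps; non-numeric columns are left untouched in this sketch.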

Examples

data(mtcars)
cat_init <- cat_lmm_initialization(
  formula = mpg ~ wt + (1 | cyl), # formula for simple model
  data = mtcars,
  x_cols = c("wt"), # Fixed effects
  y_col = "mpg", # Response variable
  z_cols = c("disp", "hp", "drat", "qsec", "vs", "am", "gear", "carb"), # Random effects
  group_col = "cyl", # Grouping column
  syn_size = 100, # Synthetic data size
  resample_by_group = FALSE, # Whether to resample by group
  resample_only = FALSE, # Whether to perform resampling only
  na_replace = mean # NA replacement method
)
cat_init