Reg.SPLR: Fit Spline Regression Models with Knot Selection

Description

The Reg.SPLR function fits natural spline regression models to time-series OTU data for each subgroup defined by design dummy variables. It uses gam to model the relationship between OTU abundance and time, incorporating a z-score based outlier filtering step (optional) and selecting the optimal number of knots via the Generalized Cross-Validation (GCV) criterion when not provided by the user.

Usage

Reg.SPLR(
  Data_for_Reg,
  pre_processed_data,
  z_score = NA,
  unique_values = 5,
  Knots = NULL,
  max_Knots = 5,
  seed = 123
)

Value

An object of class "MicrobTiSDA_spline_regression" containing:

fitted_model: A nested list of the fitted spline regression models for each design dummy variable and OTU.
knots_info: Contains the corresponding knots information for each model.

Arguments

Data_for_Reg: The data frame output from the Design.
pre_processed_data: The transformed data output from the Data.trans function. A pre-processed OTU data frame with sample IDs as row names and OTU IDs as column names.
z_score: A numeric value specifying the z-score threshold for outlier filtering; if NA, no outlier filtering is performed (default: NA).
unique_values: An integer specifying the minimum number of unique values required in an OTU for spline fitting (default: 5).
Knots: An optional numeric vector specifying the knots to use in the spline regression. If NULL, the optimal number of knots is determined by minimizing the GCV criterion (default: NULL).
max_Knots: An integer indicating the maximum number of knots to consider when selecting the optimal spline model (default: 5).
seed: Random seed.

Details

Reg_SPLR utilize Generalised Additive Model (GAM,see gam) to fit natural splines. Here, let $x_{ni}(t)$ denote the transformed abundance of microbial feature $i$ at time point $t$. To explain the evolution of microbial abundance along the time ($t$), the following generalized additive model is considered $$x_{ni}(t) = \beta_{n0} + \sum_{k=1}^K \beta_{nk} \cdot N_{nk}(t) + \epsilon_{ni}$$ where $\beta_{n0}$ is the intercept term, $\beta_{nk}$ denotes the regression coefficients for the k-th natural spline basis function $N_{nk}(t)$, and $\epsilon_{ni}$ is the error term. The parameter $K$ refers to the number of basis functions, which is equal to the number of knots plus one. (i.e., $K = number of knots + 1$).

If a z-score threshold is specified via z_score, observations with absolute z-scores exceeding this threshold are removed prior to model fitting. When the OTU has more unique values than specified by unique_values, the function searches over a range of knot numbers (from 1 up to max_Knots) to select the optimal model based on the GCV criterion; if Knots is provided, it is used directly in constructing the natural spline basis. The fitted models and their corresponding knots information are stored in a nested list structure and returned.

Examples

Run this code

# \donttest{
# Example metadata with grouping variables
metadata <- data.frame(
  TimePoint = c(1, 2, 3, 4),
  Sample = c('S1', 'S2', 'S3', 'S4'),
  GroupA = c('A', 'A', 'B', 'B'),
  GroupB = c('X', 'Y', 'X', 'Y')
)

# Example pre-processed data (e.g., transformed abundance data)
Pre_processed_Data <- data.frame(
  Feature1 = rnorm(4),
  Feature2 = rnorm(4)
)

# Create design matrix using grouping variables
design_data <- Design(metadata, Group_var = c('GroupA', 'GroupB'), Pre_processed_Data,
                      Sample_Time = 'TimePoint', Sample_ID = 'Sample')

result <- Reg.SPLR(design_data,
                  Pre_processed_Data,
                  z_score = 2,
                  unique_values = 5,
                  Knots = NULL,
                  max_Knots = 5)
# Access the fitted model for a particular design dummy variable and OTU:
fitted_model_example <- result$fitted_model[['Group1']][['OTU_name']]
# }

Run the code above in your browser using DataLab