
SparseLearner (version 1.0-2)

Bagging.lasso: A Bagging Prediction Model Using the LASSO Selection Algorithm.

Description

This function performs bagging prediction for linear and logistic regression models using the LASSO selection algorithm.

Usage

Bagging.lasso(x, y, family = c("gaussian", "binomial"), M = 100,
  subspace.size = 10, predictor.subset = round((9/10) * ncol(x)),
  boot.scale = 1, kfold = 10, predictor.importance = TRUE,
  trimmed = FALSE, weighted = TRUE, verbose = TRUE, seed = 123)

Arguments

x
input matrix. The dimension of the matrix is nobs x nvars; each row is a vector of observations of the variables.
y
response variable. For family="gaussian", y is a vector of quantitative responses. For family="binomial", y should be a factor with two levels, '0' and '1', where '1' denotes the target class.
family
response type (see above).
M
the number of base-level models (LASSO linear or logistic regression models) to obtain a final prediction. Note that it also corresponds to the number of bootstrap samples to draw. Defaults to 100.
subspace.size
the number of random subspaces to construct an ensemble prediction model. Defaults to 10.
predictor.subset
the number of predictors randomly selected from the training set to reduce the original p-dimensional feature space. Defaults to round((9/10)*ncol(x)), where ncol(x) is the number of predictors (columns) of the input matrix x.
boot.scale
the scale of the sample size in each bootstrap re-sampling, relative to the original sample size. Defaults to 1.0, i.e., the same size as the original training sample.
kfold
the number of cross-validation folds; the default is 10. Although kfold can be as large as the sample size (leave-one-out CV), this is not recommended for large datasets. The smallest allowable value is kfold=3.
predictor.importance
logical. Should the importance of each predictor in the bagging LASSO model be evaluated? Defaults to TRUE. A permutation-based variable importance measure estimated by the out-of-bag error rate is adapted for the bagging model.
trimmed
logical. Should a trimmed bagging strategy be performed? Defaults to FALSE. Traditional bagging draws bootstrap samples from the training sample, applies the base-level model to each bootstrap sample, and then averages over all obtained prediction rules. The idea of trimmed bagging is to exclude the bootstrapped prediction rules that yield the highest error rates and to aggregate over the remaining ones (a minimal sketch of this strategy follows the argument list below).
weighted
logical. Should a weighted rank aggregation procedure be performed? Defaults to TRUE. This procedure uses a Monte Carlo cross-entropy algorithm that combines the ranks of a set of base-level models under consideration via a weighted aggregation that optimizes a distance criterion to determine the best-performing base-level model.
verbose
logical. Should information on the iterative process of the bagging model be displayed? Defaults to TRUE.
seed
the seed for random sampling, with the default value 123.
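
The trimmed bagging strategy controlled by the trimmed argument can be illustrated with a minimal sketch, shown below. This is illustrative code, not SparseLearner's internal implementation; the function name trimmed.bagging and the trim.frac parameter are hypothetical. Each bootstrap rule is a cross-validated LASSO fit; rules are ranked by out-of-bag error, and the worst-performing fraction is discarded before aggregation.

# Minimal sketch of trimmed bagging, assuming a numeric 0/1 response.
# All names here are illustrative, not part of SparseLearner's API.
library(glmnet)
trimmed.bagging <- function(x, y, M = 20, trim.frac = 0.2) {
  n <- nrow(x)
  fits <- vector("list", M)
  oob.err <- numeric(M)
  for (m in seq_len(M)) {
    idx <- sample(n, n, replace = TRUE)        # bootstrap sample
    oob <- setdiff(seq_len(n), idx)            # out-of-bag observations
    fit <- cv.glmnet(x[idx, ], y[idx], family = "binomial")
    p <- predict(fit, x[oob, , drop = FALSE], s = "lambda.min",
                 type = "response")
    oob.err[m] <- mean((p > 0.5) != y[oob])    # OOB misclassification rate
    fits[[m]] <- fit
  }
  # Keep only the (1 - trim.frac) fraction of rules with the lowest OOB error.
  keep <- order(oob.err)[seq_len(ceiling((1 - trim.frac) * M))]
  fits[keep]
}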

Value

family
the response type.
M
the number of base-level models to obtain a bagging prediction.
predictor.subset
the subset of randomly selected predictors from the training set to reduce the original p-dimensional feature space.
subspace.size
the number of random subspaces to construct an ensemble prediction model.
validation.metric
the model validation measures.
boot.scale
the scale of sample size in each bootstrap re-sampling, relative to the original sample size.
distance
the distance function used in the weighted aggregation to measure the similarity between any two rankings of base-level models.
models.fitted
the base-level LASSO regression models fitted by the Bagging.lasso function.
models.trimmed
the trimmed base-level models fitted by the Bagging.lasso function if the trimmed bagging strategy is performed.
y.true
the true values of the response vector y.
conv.scores
the score matrix generated in the Monte Carlo cross-entropy algorithm according to the validation measures defined.
importance
the importance scores of variables identified by the Bagging.lasso model.

Details

The bagging LASSO model implemented by Bagging.lasso generates an ensemble prediction based on L1-regularized linear or logistic regression models. The Bagging.lasso function uses a Monte Carlo cross-entropy algorithm to combine the ranks of a set of base-level LASSO regression models under consideration via a weighted aggregation, in order to determine the best base-level model. Within Bagging.lasso, the glmnet algorithm is used to fit LASSO model paths for linear and logistic regression via coordinate descent. A random subspace method is employed to improve predictive performance. In addition, a trimmed bagging strategy can be specified to exclude the bootstrapped prediction rules that yield the highest error rates and to aggregate over the remaining rules.
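
As a conceptual illustration of the core ensemble loop described above (bootstrap resampling, a random subspace of predictors, and glmnet LASSO fits whose predictions are averaged), consider the following sketch. It is not the package's implementation; the name bagging.lasso.sketch is hypothetical, and the rank-aggregation, trimming, and validation steps are omitted.

# Conceptual sketch of the bagging LASSO ensemble (illustrative only).
library(glmnet)
bagging.lasso.sketch <- function(x, y, newx, M = 100,
                                 predictor.subset = round(0.9 * ncol(x)),
                                 family = "gaussian") {
  n <- nrow(x)
  preds <- matrix(NA_real_, nrow(newx), M)
  for (m in seq_len(M)) {
    idx  <- sample(n, n, replace = TRUE)           # bootstrap sample
    vars <- sample(ncol(x), predictor.subset)      # random predictor subspace
    fit  <- cv.glmnet(x[idx, vars, drop = FALSE], y[idx], family = family)
    preds[, m] <- predict(fit, newx[, vars, drop = FALSE],
                          s = "lambda.min", type = "response")
  }
  rowMeans(preds)                                  # average over the ensemble
}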

References

[1] Guo, P., Zeng, F., Hu, X., Zhang, D., Zhu, S., Deng, Y., & Hao, Y. (2015). Improved Variable Selection Algorithm Using a LASSO-Type Penalty, with an Application to Assessing Hepatitis B Infection Relevant Factors in Community Residents. PLoS ONE, 10(7): e0134151.

[2] Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1): 267-288.

[3] Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.

Examples

# Example 1: Bagging LASSO linear regression model.
library(mlbench)
set.seed(123)
mydata <- mlbench.threenorm(100, d=10)
x <- mydata$x
y <- mydata$classes
mydata <- as.data.frame(cbind(x, y))
colnames(mydata) <- c(paste("A", 1:10, sep=""), "y")
mydata$y <- ifelse(mydata$y==1, 0, 1)
# Split into training and testing data.
S1 <- as.vector(which(mydata$y==0))
S2 <- as.vector(which(mydata$y==1))
S3 <- sample(S1, ceiling(length(S1)*0.8), replace=FALSE)
S4 <- sample(S2, ceiling(length(S2)*0.8), replace=FALSE)
TrainInd <- c(S3, S4)
TestInd <- setdiff(1:length(mydata$y), TrainInd)
TrainXY <- mydata[TrainInd, ]
TestXY <- mydata[TestInd, ]
# Fit a bagging LASSO linear regression model. The parameter M is set
# to a small value in this example to reduce the running time; in
# practice the default value is recommended. The continuous variable
# A10 (column 10) serves as the quantitative response, with A1-A9 as
# predictors (the binary label y in column 11 is excluded).
Bagging.fit <- Bagging.lasso(x=TrainXY[, -c(10, 11)], y=TrainXY[, 10],
family=c("gaussian"), M=2, predictor.subset=round((9/10)*ncol(x)),
predictor.importance=TRUE, trimmed=FALSE, weighted=TRUE, seed=123)
# Print a 'bagging' object fitted by the Bagging.fit function.
Print.bagging(Bagging.fit)
# Make predictions from a bagging LASSO linear regression model.
pred <- Predict.bagging(Bagging.fit, newx=TestXY[, -c(10, 11)], y=NULL, trimmed=FALSE)
pred
# Generate the plot of variable importance.
Plot.importance(Bagging.fit)
# Example 2: Bagging LASSO logistic regression model.
library(mlbench)
set.seed(123)
mydata <- mlbench.threenorm(100, d=10)
x <- mydata$x
y <- mydata$classes
mydata <- as.data.frame(cbind(x, y))
colnames(mydata) <- c(paste("A", 1:10, sep=""), "y")
mydata$y <- ifelse(mydata$y==1, 0, 1)
# Split into training and testing data.
S1 <- as.vector(which(mydata$y==0))
S2 <- as.vector(which(mydata$y==1))
S3 <- sample(S1, ceiling(length(S1)*0.8), replace=FALSE)
S4 <- sample(S2, ceiling(length(S2)*0.8), replace=FALSE)
TrainInd <- c(S3, S4)
TestInd <- setdiff(1:length(mydata$y), TrainInd)
TrainXY <- mydata[TrainInd, ]
TestXY <- mydata[TestInd, ]
# Fit a bagging LASSO logistic regression model. The parameter M is set
# to a small value in this example to reduce the running time; in
# practice the default value is recommended.
Bagging.fit <- Bagging.lasso(x=TrainXY[, -11], y=TrainXY[, 11],
family=c("binomial"), M=2, predictor.subset=round((9/10)*ncol(x)),
predictor.importance=TRUE, trimmed=FALSE, weighted=TRUE, seed=123)
# Print a 'bagging' object fitted by the Bagging.fit function.
Print.bagging(Bagging.fit)
# Make predictions from a bagging LASSO logistic regression model.
pred <- Predict.bagging(Bagging.fit, newx=TestXY[, -11], y=NULL, trimmed=FALSE)
pred
# Generate the plot of variable importance.
Plot.importance(Bagging.fit)
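# The 'importance' component listed under Value can also be inspected
# directly; str() is used because its exact structure is not specified
# in this documentation.
str(Bagging.fit$importance)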
