
SparseLearner (version 1.0-2)

Bagging.lasso: A Bagging Prediction Model Using the LASSO Selection Algorithm.

Description

This function performs bagging prediction for linear and logistic regression models using the LASSO selection algorithm.

Usage

Bagging.lasso(x, y, family = c("gaussian", "binomial"), M = 100,
  subspace.size = 10, predictor.subset = round((9/10) * ncol(x)),
  boot.scale = 1, kfold = 10, predictor.importance = TRUE,
  trimmed = FALSE, weighted = TRUE, verbose = TRUE, seed = 123)

Arguments

x
input matrix. The dimension of the matrix is nobs x nvars; each row is a vector of observations of the variables.
y
response variable. For family="gaussian", y is a vector of quantitative responses. For family="binomial", y should be a factor with two levels, '0' and '1', where '1' denotes the target class.
family
response type (see above).
M
the number of base-level models (LASSO linear or logistic regression models) to obtain a final prediction. Note that it also corresponds to the number of bootstrap samples to draw. Defaults to 100.
subspace.size
the number of random subspaces to construct an ensemble prediction model. Defaults to 10.
predictor.subset
the number of predictors randomly selected from the training set to reduce the original p-dimensional feature space. Defaults to round((9/10)*ncol(x)), where ncol(x) is the number of predictors (columns) of the input matrix x.
boot.scale
the scale of the sample size in each bootstrap re-sampling, relative to the original sample size. Defaults to 1.0, i.e., the same size as the original training sample.
kfold
the number of cross-validation folds; the default is 10. Although kfold can be as large as the sample size (leave-one-out CV), this is not recommended for large datasets. The smallest allowable value is kfold=3.
predictor.importance
logical. Should the importance of each predictor in the bagging LASSO model be evaluated? Defaults to TRUE. A permutation-based variable importance measure estimated by the out-of-bag error rate is adapted for the bagging model.
trimmed
logical. Should a trimmed bagging strategy be performed? Defaults to FALSE. Traditional bagging draws bootstrap samples from the training sample, applies the base-level model to each bootstrap sample, and then averages over all obtained prediction rules. The idea of trimmed bagging is to exclude the bootstrapped prediction rules that yield the highest error rates and to aggregate over the remaining ones (a minimal sketch of this strategy follows the argument list below).
weighted
logical. Should a weighted rank aggregation procedure be performed? Defaults to TRUE. This procedure uses a Monte Carlo cross-entropy algorithm that combines the ranks of a set of base-level models under consideration via a weighted aggregation that optimizes a distance criterion to determine the best-performing base-level model.
verbose
logical. Should information on the iterative process of the bagging model be displayed? Defaults to TRUE.
seed
the seed for random sampling, with the default value 123.
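
The trimmed bagging strategy controlled by the trimmed argument can be illustrated with a minimal sketch, shown below. This is illustrative code, not SparseLearner's internal implementation; the function name trimmed.bagging and the trim.frac parameter are hypothetical. Each bootstrap rule is a cross-validated LASSO fit; rules are ranked by out-of-bag error, and the worst-performing fraction is discarded before aggregation.

# Minimal sketch of trimmed bagging, assuming a numeric 0/1 response.
# All names here are illustrative, not part of SparseLearner's API.
library(glmnet)
trimmed.bagging <- function(x, y, M = 20, trim.frac = 0.2) {
  n <- nrow(x)
  fits <- vector("list", M)
  oob.err <- numeric(M)
  for (m in seq_len(M)) {
    idx <- sample(n, n, replace = TRUE)        # bootstrap sample
    oob <- setdiff(seq_len(n), idx)            # out-of-bag observations
    fit <- cv.glmnet(x[idx, ], y[idx], family = "binomial")
    p <- predict(fit, x[oob, , drop = FALSE], s = "lambda.min",
                 type = "response")
    oob.err[m] <- mean((p > 0.5) != y[oob])    # OOB misclassification rate
    fits[[m]] <- fit
  }
  # Keep only the (1 - trim.frac) fraction of rules with the lowest OOB error.
  keep <- order(oob.err)[seq_len(ceiling((1 - trim.frac) * M))]
  fits[keep]
}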

Value

family
the response type.
M
the number of base-level models to obtain a bagging prediction.
predictor.subset
the subset of randomly selected predictors from the training set to reduce the original p-dimensional feature space.
subspace.size
the number of random subspaces to construct an ensemble prediction model.
validation.metric
the model validation measures.
boot.scale
the scale of sample size in each bootstrap re-sampling, relative to the original sample size.
distance
the distance function used in the weighted aggregation to measure the similarity between any two rankings of base-level models.
models.fitted
the base-level LASSO regression models fitted by the Bagging.lasso function.
models.trimmed
the trimmed base-level models fitted by the Bagging.lasso function if the trimmed bagging strategy is performed.
y.true
the true values of the response vector y.
conv.scores
the score matrix generated in the Monte Carlo cross-entropy algorithm according to the validation measures defined.
importance
the importance scores of variables identified by the Bagging.lasso model.

Details

The bagging LASSO model implemented by Bagging.lasso generates an ensemble prediction based on L1-regularized linear or logistic regression models. The Bagging.lasso function uses a Monte Carlo cross-entropy algorithm to combine the ranks of a set of base-level LASSO regression models under consideration via a weighted aggregation, in order to determine the best base-level model. Within Bagging.lasso, the glmnet algorithm is used to fit LASSO model paths for linear and logistic regression via coordinate descent. A random subspace method is employed to improve predictive performance. In addition, a trimmed bagging strategy can be specified to exclude the bootstrapped prediction rules that yield the highest error rates and to aggregate over the remaining rules.
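
As a conceptual illustration of the core ensemble loop described above (bootstrap resampling, a random subspace of predictors, and glmnet LASSO fits whose predictions are averaged), consider the following sketch. It is not the package's implementation; the name bagging.lasso.sketch is hypothetical, and the rank-aggregation, trimming, and validation steps are omitted.

# Conceptual sketch of the bagging LASSO ensemble (illustrative only).
library(glmnet)
bagging.lasso.sketch <- function(x, y, newx, M = 100,
                                 predictor.subset = round(0.9 * ncol(x)),
                                 family = "gaussian") {
  n <- nrow(x)
  preds <- matrix(NA_real_, nrow(newx), M)
  for (m in seq_len(M)) {
    idx  <- sample(n, n, replace = TRUE)           # bootstrap sample
    vars <- sample(ncol(x), predictor.subset)      # random predictor subspace
    fit  <- cv.glmnet(x[idx, vars, drop = FALSE], y[idx], family = family)
    preds[, m] <- predict(fit, newx[, vars, drop = FALSE],
                          s = "lambda.min", type = "response")
  }
  rowMeans(preds)                                  # average over the ensemble
}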

References

[1] Guo, P., Zeng, F., Hu, X., Zhang, D., Zhu, S., Deng, Y., & Hao, Y. (2015). Improved Variable Selection Algorithm Using a LASSO-Type Penalty, with an Application to Assessing Hepatitis B Infection Relevant Factors in Community Residents. PLoS ONE, 10(7): e0134151.

[2] Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1): 267-288.

[3] Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.

Examples

# Example 1: Bagging LASSO linear regression model.
library(mlbench)
set.seed(123)
mydata <- mlbench.threenorm(100, d=10)
x <- mydata$x
y <- mydata$classes
mydata <- as.data.frame(cbind(x, y))
colnames(mydata) <- c(paste("A", 1:10, sep=""), "y")
mydata$y <- ifelse(mydata$y==1, 0, 1)
# Split into training and testing data.
S1 <- as.vector(which(mydata$y==0))
S2 <- as.vector(which(mydata$y==1))
S3 <- sample(S1, ceiling(length(S1)*0.8), replace=FALSE)
S4 <- sample(S2, ceiling(length(S2)*0.8), replace=FALSE)
TrainInd <- c(S3, S4)
TestInd <- setdiff(1:length(mydata$y), TrainInd)
TrainXY <- mydata[TrainInd, ]
TestXY <- mydata[TestInd, ]
# Fit a bagging LASSO linear regression model. The parameter M is set
# to a small value in this example to reduce the running time; in
# practice the default value is recommended. The continuous variable
# A10 (column 10) serves as the quantitative response, with A1-A9 as
# predictors (the binary label y in column 11 is excluded).
Bagging.fit <- Bagging.lasso(x=TrainXY[, -c(10, 11)], y=TrainXY[, 10],
family=c("gaussian"), M=2, predictor.subset=round((9/10)*ncol(x)),
predictor.importance=TRUE, trimmed=FALSE, weighted=TRUE, seed=123)
# Print a 'bagging' object fitted by the Bagging.fit function.
Print.bagging(Bagging.fit)
# Make predictions from a bagging LASSO linear regression model.
pred <- Predict.bagging(Bagging.fit, newx=TestXY[, -c(10, 11)], y=NULL, trimmed=FALSE)
pred
# Generate the plot of variable importance.
Plot.importance(Bagging.fit)
# Example 2: Bagging LASSO logistic regression model.
library(mlbench)
set.seed(123)
mydata <- mlbench.threenorm(100, d=10)
x <- mydata$x
y <- mydata$classes
mydata <- as.data.frame(cbind(x, y))
colnames(mydata) <- c(paste("A", 1:10, sep=""), "y")
mydata$y <- ifelse(mydata$y==1, 0, 1)
# Split into training and testing data.
S1 <- as.vector(which(mydata$y==0))
S2 <- as.vector(which(mydata$y==1))
S3 <- sample(S1, ceiling(length(S1)*0.8), replace=FALSE)
S4 <- sample(S2, ceiling(length(S2)*0.8), replace=FALSE)
TrainInd <- c(S3, S4)
TestInd <- setdiff(1:length(mydata$y), TrainInd)
TrainXY <- mydata[TrainInd, ]
TestXY <- mydata[TestInd, ]
# Fit a bagging LASSO logistic regression model. The parameter M is set
# to a small value in this example to reduce the running time; in
# practice the default value is recommended.
Bagging.fit <- Bagging.lasso(x=TrainXY[, -11], y=TrainXY[, 11],
family=c("binomial"), M=2, predictor.subset=round((9/10)*ncol(x)),
predictor.importance=TRUE, trimmed=FALSE, weighted=TRUE, seed=123)
# Print a 'bagging' object fitted by the Bagging.fit function.
Print.bagging(Bagging.fit)
# Make predictions from a bagging LASSO logistic regression model.
pred <- Predict.bagging(Bagging.fit, newx=TestXY[, -11], y=NULL, trimmed=FALSE)
pred
# Generate the plot of variable importance.
Plot.importance(Bagging.fit)
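# The 'importance' component listed under Value can also be inspected
# directly; str() is used because its exact structure is not specified
# in this documentation.
str(Bagging.fit$importance)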
