CBDA: Main Compressive Big Data Analytics - CBDA function

Description

The CBDA function takes all the input specifications needed to run a set of M subsamples drawn from the Big Data [Xtemp, Ytemp]. We assume that the Big Data is already clean and harmonized. Version 1.0.0 is fully tested ONLY on continuous features Xtemp and a binary outcome Ytemp.

Usage

CBDA(Ytemp, Xtemp, label = "CBDA_package_test", alpha = 0.2, Kcol_min = 5,
  Kcol_max = 15, Nrow_min = 30, Nrow_max = 50, misValperc = 0,
  M = 3000, N_cores = 1, top = 1000, workspace_directory = tempdir(),
  max_covs = 100, min_covs = 5, algorithm_list = c("SL.glm", "SL.xgboost",
  "SL.glmnet", "SL.svm", "SL.randomForest", "SL.bartMachine"))

Arguments

Ytemp

This is the output variable (vector) in the original Big Data

Xtemp

This is the input variable (matrix) in the original Big Data

label

This is the label appended to RData workspaces generated within the CBDA calls

alpha

Fraction of the Big Data to hold out for Validation (e.g., 0.2 holds out 20% of the cases)

Kcol_min

Lower bound for the percentage of features-columns sampling (used for the Feature Sampling Range - FSR)

Kcol_max

Upper bound for the percentage of features-columns sampling (used for the Feature Sampling Range - FSR)

Nrow_min

Lower bound for the percentage of cases-rows sampling (used for the Case Sampling Range - CSR)

Nrow_max

Upper bound for the percentage of cases-rows sampling (used for the Case Sampling Range - CSR)

misValperc

Percentage of missing values to introduce in the Big Data (used only for testing, to mimic real cases)

M

Number of Big Data subsets on which to perform Knockoff Filtering and SuperLearner feature mining

N_cores

Number of Cores to use in the parallel implementation (default is set to 1 core)

top

Top predictions to select out of the M (must be < M, optimal ~0.1*M)

workspace_directory

Directory where the results and workspaces are saved (set by default to tempdir())

max_covs

Top features to display and include in the Validation Step where nested models are tested

min_covs

Minimum number of top features to include in the initial model for the Validation Step (it must be greater than 2)

algorithm_list

List of algorithms/wrappers used by the SuperLearner. The default is c("SL.glm", "SL.xgboost", "SL.glmnet", "SL.svm", "SL.randomForest", "SL.bartMachine")
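The sampling arguments above can be made concrete with a small base R sketch. It shows one plausible way a single subsample could be drawn from the Feature Sampling Range (Kcol_min, Kcol_max), the Case Sampling Range (Nrow_min, Nrow_max) and the hold-out fraction alpha. This is an illustration under stated assumptions, not the CBDA package's internal code: the uniform draw of the percentages, the rounding, and the names validation_rows, training_rows, sub_cols and sub_rows are all hypothetical.

```r
# Illustrative sketch only: NOT the CBDA package's internal implementation.
set.seed(1)

n <- 300; p <- 100                 # toy dimensions (cases, features)
alpha <- 0.2                       # fraction of cases held out for validation
Kcol_min <- 5;  Kcol_max <- 15     # Feature Sampling Range, in percent
Nrow_min <- 30; Nrow_max <- 50     # Case Sampling Range, in percent

# Hold out a fraction alpha of the cases; the rest is available for training
validation_rows <- sample(n, size = round(alpha * n))
training_rows   <- setdiff(seq_len(n), validation_rows)

# Assumption: the percentages for one subsample are drawn uniformly
# from the two ranges
col_perc <- runif(1, Kcol_min, Kcol_max)
row_perc <- runif(1, Nrow_min, Nrow_max)

# Convert percentages to counts and sample without replacement
k_cols   <- max(1, round(p * col_perc / 100))
k_rows   <- max(1, round(length(training_rows) * row_perc / 100))
sub_cols <- sample(p, k_cols)
sub_rows <- sample(training_rows, k_rows)

# One subsample would then be X[sub_rows, sub_cols] with outcome Y[sub_rows]
```

Repeating the draw M times would give the M subsamples on which the ensemble predictor is trained.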

Value

CBDA object with validation results and 3 RData workspaces

Details

This function takes all the input specifications needed to run a set of M subsamples drawn from the Big Data [Xtemp, Ytemp]. We assume that the Big Data is already clean and harmonized. After the necessary data wrangling (i.e., imputation, normalization and rebalancing), an ensemble predictor (i.e., SuperLearner) is applied to each subsample for training/learning. The list of algorithms used by the SuperLearner is supplied by an external file to be placed in the working directory (e.g., CBDA_SL_library.m in our release). The file can contain any SuperLearner wrapper and any wrappers properly defined by the user. The ensemble predictive model is then validated on a fraction alpha of the Big Data. Each subsample generates a predictive model that is ranked based on performance metrics (e.g., Mean Square Error-MSE and Accuracy) during the first validation step.

After all the M subsamples have been generated and each predictive model computed, the CBDA function calls 4 more functions to perform i) CONSOLIDATION and ranking of the results, where the top predictive models are selected (top) and the most frequent features (BEST) are ranked and displayed as well, ii) VALIDATION on the top ranked features (i.e., up to "max_covs" features), where nested ensemble predictive models are generated in a bottom-up fashion, iii) implementation of STOPPING CRITERIA for the best/optimal ensemble predictive model (to avoid overfitting) and iv) a CLEAN UP step for deleting unnecessary workspaces generated by the CBDA protocol.

IMPORTANT - Memory limits to run CBDA: see https://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory-limits.html for various limitations on memory needs while running R under different OS. As far as CBDA is concerned, a CBDA object can be up to 200-300 MB. The space needed to save all the workspaces, however, may be as large as 1-5 GB, depending on the number of subsamples. We are working on a new CBDA implementation that reduces the storage constraints.
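The CONSOLIDATION and nested VALIDATION steps can be pictured with a short base R sketch. It ranks M hypothetical models by MSE, keeps the top ones, counts how often each feature appears among them, and builds bottom-up nested feature sets from min_covs up to max_covs. The simulated mse and features objects, and the names top_idx, freq, best and nested_sets, are assumptions made for illustration only; the package's own consolidation functions may work differently.

```r
# Illustrative sketch only: NOT the CBDA package's internal implementation.
set.seed(2)

M <- 12; top <- 4; p <- 20         # toy settings
min_covs <- 3; max_covs <- 8

# Hypothetical per-subsample results: one MSE per model and the
# feature indices that each subsample happened to use
mse      <- runif(M)
features <- replicate(M, sample(p, 5), simplify = FALSE)

# CONSOLIDATION: keep the 'top' models with the lowest MSE
top_idx <- order(mse)[seq_len(top)]

# Rank features by how often they occur among the top models
freq <- sort(table(unlist(features[top_idx])), decreasing = TRUE)
best <- as.integer(names(freq))[seq_len(min(max_covs, length(freq)))]

# VALIDATION: nested feature sets, built bottom-up from the
# min_covs most frequent features to all of 'best'
nested_sets <- lapply(seq(min_covs, length(best)),
                      function(k) best[seq_len(k)])
```

Each element of nested_sets would then be used to fit and validate one ensemble model, and a stopping criterion would pick the smallest set beyond which performance no longer improves.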

References

See https://github.com/SOCR/CBDA/releases for details on the CBDA protocol and the manuscript "Controlled Feature Selection and Compressive Big Data Analytics: Applications to Big Biomedical and Health Studies" [under review] authored by Simeone Marino, Jiachen Xu, Yi Zhao, Nina Zhou, Yiwang Zhou, Ivo D. Dinov from the University of Michigan.

Examples

# NOT RUN {
# Installation
# Please download the Windows binary and/or source CBDA_1.0.0 files from
# the CBDA Github repository https://github.com/SOCR/CBDA/releases
# }
# NOT RUN {
# Installation from the Windows binary (recommended for Windows systems)
install.packages("/filepath/CBDA_1.0.0_binary_Windows.zip", repos = NULL, type = "win.binary")

# Installation from the source (recommended for Macs and Linux systems)
install.packages("/filepath/CBDA_1.0.0_source_.tar.gz", repos = NULL, type = "source")

# Initialization
# This function call installs (if needed) and attaches all the necessary packages to run
# the CBDA package v1.0.0. It should be run before any production run or test.
# The output shows a table where for each package a TRUE or FALSE is displayed.
# Thus the necessary steps can be pursued in case some package has a FALSE.
CBDA_initialization()

# Set the specs for the synthetic dataset to be tested
n = 300          # number of observations
p = 100          # number of variables

# Generate a nxp matrix of IID variables (e.g., ~N(0,1))
X1 = matrix(rnorm(n*p), nrow=n, ncol=p)

# Setting the nonzero variables - signal variables
# (note: with p = 100, only indices 1 and 100 fall within 1:p and carry signal)
nonzero=c(1,100,200,300,400,500,600,700,800,900)

# Set the signal amplitude (for noise level = 1)
amplitude = 10

# Allocate the nonzero coefficients in the correct places
beta = amplitude * (1:p %in% nonzero)

# Generate a linear model with additive noise (e.g., white noise ~N(0,1))
ztemp <- function() X1 %*% beta + rnorm(n)
z = ztemp()

# Pass it through an inv-logit function to
# generate the Bernoulli response variable Ytemp
pr = 1/(1+exp(-z))
Ytemp = rbinom(n,1,pr)
X2 <- cbind(Ytemp,X1)

dataset_file ="Binomial_dataset_3.txt"

# Save the synthetic dataset
a <- tempdir()
write.table(X2, file = file.path(a, dataset_file), sep=",")

# The file is now stored in the directory a
a
list.files(a)

# Load the Synthetic dataset
Data = read.csv(file.path(a, dataset_file), header = TRUE)
Ytemp <- Data[,1] # set the outcome
original_names_Data <- names(Data)
cols_to_eliminate=1
Xtemp <- Data[-cols_to_eliminate] # set the matrix X of features/covariates
original_names_Xtemp <- names(Xtemp)

# Add more wrappers/algorithms to the SuperLearner ensemble predictor
# It can be commented out if only the default set of algorithms is used,
# e.g., algorithm_list = c("SL.glm","SL.xgboost","SL.glmnet","SL.svm",
#                          "SL.randomForest","SL.bartMachine")
# This defines a "new" wrapper, based on the default SL.glmnet
 SL.glmnet.0.75 <- function(..., alpha = 0.75,family="binomial"){
                 SL.glmnet(..., alpha = alpha, family = family)}

 test_example <- c("SL.glmnet","SL.glmnet.0.75")

# Call the Main CBDA function
# Multicore functionality NOT enabled
CBDA_object <- CBDA(Ytemp , Xtemp , M = 12 , Nrow_min = 50, Nrow_max = 70,
              top = 10, max_covs = 8 , min_covs = 3,algorithm_list = test_example ,
              workspace_directory = a)

# Multicore functionality enabled
test_example <- c("SL.xgboost","SL.svm")
CBDA_test <- CBDA(Ytemp , Xtemp , M = 40 , Nrow_min = 50, Nrow_max = 70,
               N_cores = 2 , top = 30, max_covs = 20 ,
                min_covs = 5 , algorithm_list = test_example ,
              workspace_directory = a)
                
# }