sbh: Cross-Validated Survival Bump Hunting

Description

Main end-user function for fitting a cross-validated Survival Bump Hunting (SBH) model. Returns a cross-validated PRSP object, as generated by our Patient Recursive Survival Peeling or PRSP algorithm, containing cross-validated estimates of end-points statistics of interest.

Usage

sbh(dataset, 
      B = 10, K = 5, A = 1000, 
      vs = TRUE, cpv = FALSE, decimals = 2,
      cvtype = c("combined", "averaged", "none", NULL), 
      cvcriterion = c("lrt", "cer", "lhr", NULL),
      arg = "beta=0.05,alpha=0.05,minn=5,L=NULL,peelcriterion=\"lr\"",
      probval = NULL, timeval = NULL, 
      parallel = FALSE, conf = NULL, seed = NULL)

Arguments

dataset

data.frame or numeric matrix of input dataset containing the observed survival and status indicator variables in the first two columns, respectively, and all the covariates thereafter. If a data.frame is provided, it will be coerced to a numeric matrix. Discrete (or nominal) covariates should be made (or re-arranged into) ordinal variables.

Positive integer scalar of the number of replications of the cross-validation procedure. Defaults to 10.

Integer giving the number of folds (partitions) into which the observations should be randomly split for the cross-validation procedure. Setting K also specifies the type of cross-validation to be done:

K = 1 carries no cross-validation out.
K in {2,...,$n-1$} carries out eqnK-fold cross-validation.
K = $n$ carries out leave-one-out cross-validation.

Positive integer scalar of the number of permutations for the computation of cross-validated p-values. Defaults to 1000.

logical scalar. Flag for optional variable (covariate) pre-selection. Defaults to TRUE.

cpv

logical scalar. Flag for computation of permutation p-values. Defaults to FALSE.

decimals

integer scalar. Number of user-specified significant decimals to output results. Defaults to 2.

cvtype

Character vector describing the cross-validation technique in {"combined", "averaged", "none", NULL}. If NULL, automatically reset to "none".

cvcriterion

character vector describing the optimization criterion in {"lrt", "lhr", "cer", NULL}. If NULL, automatically reset to "none".

arg

Character vector describing the PRSP parameters:

alpha = fraction to peel off at each step. Defaults to 0.05.
beta = minimum support size resulting from the peeling sequence. Defaults to 0.05.
minn = minimum number of observation that we want to be able to detect in a box. Defaults to 5.
L = fixed peeling length. Defaults to NULL.
peelcriterion in {"hr" for Log-Hazard Ratio (LHR), "lr" for Log-Rank Test (LRT), "ch" for Cumulative Hazard Summary (CHS)}. Defaults to "lr".

Note that the parameters in arg come as a string of charaters between double quotes, where all parameter evaluations are separated by comas (see example).

probval

Numeric scalar of the survival probability at which we want to get the endpoint box survival time. Defaults to NULL.

timeval

Numeric scalar of the survival time at which we want to get the endpoint box survival probability. Defaults to NULL.

parallel

Logical. Is parallel computing to be performed? Optional. Defaults to FALSE.

conf

List of parameters for cluster configuration. Inputs for R package parallel function makeCluster (R package parallel) for cluster setup. Optional, defaults to NULL. See details for usage.

seed

Positive integer scalar of the user seed to reproduce the results.

Value

Object of class PRSP (Patient Recursive Survival Peeling) List containing the following 19 fields:

numeric matrix of original dataset.

times

numeric vector of observed failure / survival times.

status

numeric vector of observed event indicator in {1,0}.

positive integer of the number of replications used in the cross-validation procedure.

positive integer of the number of folds used in the cross-validation procedure.

positive integer of the number of permutations used for the computation of permutation p-values.

logical scalar of returned flag of optional variable pre-selection.

cpv

logical scalar of returned flag of optional computation of permutation p-values.

decimals

integer of the number of user-specified significant decimals.

cvtype

character vector of the cross-validation technique used.

cvcriterion

character vector of optimization criterion used.

arg

character vector of the parameters used.

probval

Numeric scalar of survival probability used.

timeval

Numeric scalar of survival time used.

cvfit

List with 10 fields of cross-validated estimates:

cv.maxsteps: numeric scalar of maximal number of peeling steps over the replicates.
cv.nsteps: numeric scalar of optimal number of peeling steps according to the optimization criterion.
cv.trace: numeric vector of the modal trace values of covariate usage for all peeling steps.
cv.boxind: logical matrix in TRUE, FALSE of individual observation box membership indicator (columns) for all peeling steps (rows).
cv.rules: data.frame of decision rules on the covariates (columns) for all peeling steps (rows).
cv.signnumeric vector in {-1,+1} of directions of peeling for all pre-selected covariates.
cv.selectednumeric vector of pre-selected covariates in reference to original index.
cv.usednumeric vector of covariates used for peeling in reference to original index.
cv.stats: numeric matrix of box endpoint quantities of interest (columns) for all peeling steps (rows).
cv.pval: numeric vector of log-rank permutation p-values of sepraration of survival distributions.

cvprofiles

List of ($B$) of numeric vectors, one for each replicate, of the cross-validated statistics used in the optimization criterion (set by user) as a function of the number of peeling steps.

cvmeanprofiles

List of numeric vectors of the cross-validated mean statistics over the replicates. used in the optimization criterion (one set by user) as a function of the number of peeling steps.

plot

logical scalar of the returned flag for plotting or not the results of the fitted SBH model.

config

List with 7 fields of parameters used for configuring the parallelization including parallel and conf.

seed

User seed(s) used: integer of a single value, if parallelization is used integer vector of values, one for each replication, if parallelization is not used.

Details

At this point, the main function sbh performs the search of the first box of the recursive coverage (outer) loop of our Patient Recursive Survival Peeling (PRSP) algorithm. It relies on an optional variable pre-selection procedure that is run before the PRSP algorithm. At this point, this is done by Elastic-Net (EN) penalization of the partial likelihood, where both mixing (alpha) and overal shrinkage (lambda) parameters are simultaneously estimated by cross-validation using the glmnet::cv.glmnet function of the R package glmnet.

The returned S3-class PRSP object contains cross-validated estimates of all the decision-rules of pre-selected covariates and all other statistical quantities of interest at each iteration of the peeling sequence (inner loop of the PRSP algorithm). This enables the graphical display of results of profiling curves for model tuning, peeling trajectories, covariate traces and survival distributions (see plotting functions for more details).

The function offers a number of options for the number of cross-validation replicates to be perfomed: $B$; the type of cross-validation desired: $K$-fold (replicated)-averaged or-combined, as well as the peeling and optimization critera chosen for model tuning and a few more parameters for the PRSP algorithm.

In case replicated cross-validations are performed, a "summary" of the outputs is done over the $B$ replicates, which requires some explanation:

Even thought the PRSP algorithm uses only one covariate at a time at each peeling step, the reported matrix of "Replicated CV" box decision rules may show several covariates being used in a given step, simply because these decision rules are averaged over the $B$ replicates (see equation #21 in Dazard et al. 2015). This is also reflected in the reported "Replicated CV" importance and usage plots of covariate traces.

Likewise, the output matrix of "Replicated CV" box membership indicator does not necessarily match exactly the output vector of "Replicated CV" box support (and corresponding box sample size) for all peeling steps. The reason is that the reported "Replicated CV" box membership indicators are computed (at each peeling step) as the point-wise majority vote over the $B$ replicates (see equation #22 in Dazard et al. 2015), whereas the "Replicated CV" box support vector (and corresponding box sample size) is averaged (at each peeling step) over the $B$ replicates.

The function takes advantage of the R package parallel, which allows users to create a cluster of workstations on a local and/or remote machine(s), enabling scaling-up with the number of CPU cores specified and efficient parallel execution.

If the computation of permutation p-values is desired, then running with the parallelization option is strongly advised as it may take a while. In the case of large ($p > n$) or very large ($p >> n$) datasets, it is also required to use the parallelization option.

To run a parallel session (and parallel RNG) of the PRIMsrc procedures (parallel=TRUE), argument conf is to be specified (i.e. non NULL). It must list the specifications of the folowing parameters for cluster configuration: "names", "cpus", "type", "homo", "verbose", "outfile". These match the arguments described in function makeCluster of the R package parallel. All fields are required to properly configure the cluster, except for "names" and "cpus", which are the values used alternatively in the case of a cluster of type "SOCK" (socket), or in the case of a cluster of type other than "SOCK" (socket), respectively. See examples below.

"names": names : character vector specifying the host names on which to run the job. Could default to a unique local machine, in which case, one may use the unique host name "localhost". Each host name can potentially be repeated to the number of CPU cores available on the corresponding machine.
"cpus": spec : integer scalar specifying the total number of CPU cores to be used across the network of available nodes, counting the workernodes and masternode.
"type": type : character vector specifying the cluster type ("SOCK", "PVM", "MPI").
"homo": homogeneous : logical scalar to be set to FALSE for inhomogeneous clusters.
"verbose": verbose : logical scalar to be set to FALSE for quiet mode.
"outfile": outfile : character vector of the output log file name for the workernodes.

Note that argument B is internally reset to conf$cpus*ceiling(B/conf$cpus) in case the parallelization is used (i.e. conf is non NULL), where conf$cpus denotes the total number of CPUs to be used (see above). The argument A is similarly reset.

The actual creation of the cluster, its initialization, and closing are all done internally. In addition, when random number generation is needed, the creation of separate streams of parallel RNG per node is done internally by distributing the stream states to the nodes (For more details see function makeCluster (R package parallel) and/or http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

The use of a seed allows to reproduce the results within the same type of session: the same seed will reproduce the same results within a non-parallel session or within a parallel session, but it will not necessarily give the exact same results (up to sampling variability) between a non-parallelized and parallelized session due to the difference of management of the seed between the two (see parallel RNG and value of retuned seed below).

References

Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2015). "Cross-validation and Peeling Strategies for Survival Bump Hunting using Recursive Peeling Methods." Statistical Analysis and Data Mining (in press).
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2014). "Cross-Validation of Survival Bump Hunting by Recursive Peeling Methods." In JSM Proceedings, Survival Methods for Risk Estimation/Prediction Section. Boston, MA, USA. American Statistical Association IMS - JSM, p. 3366-3380.
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2015). "R package PRIMsrc: Bump Hunting by Patient Rule Induction Method for Survival, Regression and Classification." In JSM Proceedings, Statistical Programmers and Analysts Section. Seattle, WA, USA. American Statistical Association IMS - JSM, (in press).
Dazard J-E. and J.S. Rao (2010). "Local Sparse Bump Hunting." J. Comp Graph. Statistics, 19(4):900-92.

Examples

Run this code

#===================================================
# Loading the library and its dependencies
#===================================================
library("PRIMsrc")

#===================================================
# Package news
# Package citation
#===================================================
PRIMsrc.news()
citation("PRIMsrc")
    
#===================================================
# Demo with a synthetic dataset
# Use help for descriptions
#===================================================
data("Synthetic.1", package="PRIMsrc")
?Synthetic.1

#===================================================
# Simulated dataset #1 (n=250, p=3)
# Non Replicated Combined Cross-Validation (RCCV)
# Peeling criterion = LRT
# Optimization criterion = LRT
# Without parallelization
# Without computation of permutation p-values
#===================================================
CVCOMB.synt1 <- sbh(dataset = Synthetic.1, 
                    cvtype = "combined", cvcriterion = "lrt",
                    B = 1, K = 5, 
                    vs = TRUE, cpv = FALSE, 
                    decimals = 2, probval = 0.5, 
                    arg = "beta=0.05,
                           alpha=0.05,
                           minn=5,
                           L=NULL,
                           peelcriterion=\"lr\"",
                    parallel = FALSE, conf = NULL, seed = 123)

## Not run: ------------------------------------
#     #===================================================
#     # Examples of parallel backend parametrization 
#     #===================================================
#     # Example #1 - 1-Quad (4-core double threaded) PC 
#     # Running WINDOWS
#     # With SOCKET communication
#     #===================================================
#     if (.Platform$OS.type == "windows") {
#         cpus <- detectCores()
#         conf <- list("names" = rep("localhost", cpus),
#                      "cpus" = cpus,
#                      "type" = "SOCK",
#                      "homo" = TRUE,
#                      "verbose" = TRUE,
#                      "outfile" = "")
#     }
#     #===================================================
#     # Example #2 - 1 master node + 3 worker nodes cluster
#     # All nodes equipped with identical setups and multicores
#     # Running LINUX
#     # With SOCKET communication
#     #===================================================
#     if (.Platform$OS.type == "unix") {
#         masterhost <- Sys.getenv("HOSTNAME")
#         slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
#         nodes <- length(slavehosts) + 1
#         cpus <- 8
#         conf <- list("names" = c(rep(masterhost, cpus),
#                                  rep(slavehosts, cpus)),
#                      "cpus" = nodes * cpus,
#                      "type" = "SOCK",
#                      "homo" = TRUE,
#                      "verbose" = TRUE,
#                      "outfile" = "")
#     }
#     #===================================================
#     # Example #3 - Multinode multicore per node cluster
#     # Running LINUX 
#     # with MPI communication
#     # Here, a file named ".nodes" (e.g. in the home directory)
#     # contains the list of nodes of the cluster
#     #===================================================
#     if (.Platform$OS.type == "unix") {
#         hosts <- scan(file=paste(Sys.getenv("HOME"), "/.nodes", sep=""), 
#                       what="", 
#                       sep="\n")
#         hostnames <- unique(hosts)
#         nodes <- length(hostnames)
#         cpus <-  length(hosts)/length(hostnames)
#         conf <- list("cpus" = nodes * cpus,
#                      "type" = "MPI",
#                      "homo" = TRUE,
#                      "verbose" = TRUE,
#                      "outfile" = "")
#     }
#     #===================================================
#     # Simulated dataset #1 (n=250, p=3)
#     # Replicated Combined Cross-Validation (RCCV)
#     # Peeling criterion = LRT
#     # Optimization criterion = LRT
#     # With parallelization
#     # With computation of permutation p-values
#     #===================================================
#     CVCOMBREP.synt1 <- sbh(dataset = Synthetic.1, 
#                            cvtype = "combined", cvcriterion = "lrt",
#                            B = 10, K = 5, A = 1024, 
#                            vs = TRUE, cpv = TRUE, 
#                            decimals = 2, probval = 0.5, 
#                            arg = "beta=0.05,
#                                   alpha=0.05,
#                                   minn=5,
#                                   L=NULL,
#                                   peelcriterion=\"lr\"",
#                            parallel = TRUE, conf = conf, seed = 123)
## ---------------------------------------------

Run the code above in your browser using DataLab