sbh: Cross-Validated Survival Bump Hunting

Description

Main end-user function for fitting a cross-validated Survival Bump Hunting (SBH) model. It returns a cross-validated PRSP object, as generated by our Patient Recursive Survival Peeling or PRSP algorithm.

Usage

sbh(dataset, 
      B = 10, K = 5, A = 1000, 
      vs = TRUE, cpv = FALSE,
      cvtype=c("combined", "averaged", "none"), 
      cvcriterion=c("lrt", "cer", "lhr"),
      arg = "beta=0.05,alpha=0.1,minn=10,L=NULL,peelcriterion="lr"",
      probval = NULL, timeval = NULL, 
      parallel = FALSE, conf = NULL, seed = NULL)

Arguments

dataset

data.frame or numeric matrix of input dataset containing the observed survival and status indicator variables in the first two columns, respectively, and all the covariates thereafter. If a data.frame

Positive integer scalar of the number of replications of the cross-validation procedure. Defaults to 10.

Positive integer scalar of the number of folds for the cross-validation procedure. Defaults to 5.

Positive integer scalar of the number of permutations for the computation of cross-validated p-values. Defaults to 1000.

logical scalar. Flag for optional variable (covariate) pre-selection. Defaults to TRUE.

cpv

logical scalar. Flag for computation of permutation p-values. Defaults to FALSE.

cvtype

Character vector describing the cross-validation technique in {"combined", "averaged", "none"}. If NULL, defaults to "combined".

cvcriterion

character vector describing the cross-validation optimization criterion in {"lrt", "cer", "lhr"}. If NULL, defaults to "lrt". Automatically reset to "none" if cvtype is set to "none".

arg

Character vector describing the PRSP parameters:

alpha

= fraction to peel off at each step. Defaults to 0.1. beta = minimum support size resulting from the peel

Value

Object of class PRSP (Patient Recursive Survival Peeling) List containing the following 21 fields:
xnumeric matrix from original dataset with pre-selected covariates only.
timesnumeric vector of observed failure / survival times.
statusnumeric vector of observed event indicator in {1,0}.
Bpositive integer of the number of replications used in the cross-validation procedure.
Kpositive integer of the number of folds used in the cross-validation procedure.
Apositive integer of the number of permutations used for the computation of permutation p-values.
vslogical scalar of returned flag of optional variable pre-selection.
cpvlogical scalar of returned flag of optional computation of permutation p-values.
cvtypecharacter vector of the cross-validation technique used.
cvcriterioncharacter vector of cross-validation optimization criterion used.
varsignnumeric vector in {-1,+1} of directions of peeling for all pre-selected covariates.
selectednumeric vector giving the pre-selected covariates.
usednumeric vector giving the covariates used for peeling.
argcharacter vector of the parameters used:
probvalNumeric scalar of survival probability used.
timevalNumeric scalar of survival time used.
cvfit
List with 7 fields of cross-validated estimates:
- cv.maxsteps
: numeric scalar of maximal ceiled-mean of number of peeling steps over the replicates. cv.nsteps: numeric scalar of optimal number of peeling steps according to the optimization criterion. cv.trace: list of numeric matrix and numeric vector of respectively the traces of covariate usage for all replications (columns) for all steps (rows) and the modal trace values of covariate usage at each step. cv.boxind: logical matrix in {TRUE, FALSE} of individual observation box membership indicator (columns) for all peeling steps (rows). cv.rules: data.frame of decision rules on the covariates (columns) for all peeling steps (rows). cv.stats: numeric matrix of box endpoint quantities of interest (columns) for all peeling steps (rows). cv.pval: numeric vector of log-rank permutation p-values of sepraration of survival distributions.

item

probval
timeval
parallel
conf
seed
cvprofiles
cvmeanprofiles
plot
config
seed

code

vector

pkg

parallel
parallel

eqn

$B$

Details

At this point, the main function sbh performs the search of the first box of the recursive coverage (outer) loop of our Patient Recursive Survival Peeling (PRSP) algorithm. Also, the main function relies on an optional variable pre-selection procedure that is run before the PRSP algorithm. At this point, this is done by Elastic-Net (EN) penalization of the partial likelihood, where both mixing (alpha) and overal shrinkage (lambda) parameters are simultaneously estimated by cross-validation using the glmnet::cv.glmnet function of the R package glmnet. The PRSP object contains cross-validated estimates of all the decision-rules of pre-selected covariates and all other statistical quantities of interest at each iteration of the peeling sequence (inner loop of the PRSP algorithm). It enables the display of results graphically of/for model tuning/selection, all peeling trajectories, covariate traces and survival distributions (see plotting functions for more details).

The function offers a number of options for the type of cross-validation desired: $K$-fold (replicated)-averaged or-combined, as well as peeling and optimization critera for model fitting, tuning and selectio and a few more parameters for the PRSP algorithm.

The function takes advantage of the R package parallel, which allows users to create a cluster of workstations on a local and/or remote machine(s), enabling scaling-up with the number of CPU cores specified and efficient parallel execution. Discrete (or nominal) covariates should be made (or re-arranged into) ordinal variables. If the computation of cross-validated p-value is desired, then running with the parallelization option is generally advised as it may take a while. In the case of large ($p > n$) or very large ($p >> n$) datasets, it is required to use the parallelization option preferably on a hyperperformance cluster of workstations. To run a parallel session (and parallel RNG) of the PRIMsrc procedures (parallel=TRUE), argument conf is to be specified (i.e. non NULL). It must list the specifications of the folowing parameters for cluster configuration: "names", "cpus", "type", "homo", "verbose", "outfile". These match the arguments described in function makeCluster of the R package parallel. All fields are required to properly configure the cluster, except for "names" and "cpus", which are the values used alternatively in the case of a cluster of type "SOCK" (socket), or in the case of a cluster of type other than "SOCK" (socket), respectively. See examples below.

"names":names:charactervectorspecifying the host names on which to run the job. Could default to a unique local machine, in which case, one may use the unique host name "localhost". Each host name can potentially be repeated to the number of CPU cores available on the corresponding machine.
"cpus":spec:integerscalar specifying the total number of CPU cores to be used across the network of available nodes, counting the workernodes and masternode.
"type":type:charactervectorspecifying the cluster type ("SOCK", "PVM", "MPI").
"homo":homogeneous:logicalscalar to be set toFALSEfor inhomogeneous clusters.
"verbose":verbose:logicalscalar to be set toFALSEfor quiet mode.
"outfile":outfile:charactervectorof the output log file name for the workernodes.

Note that argument B is internally reset to conf$cpus*ceiling(B/conf$cpus) in case the parallelization is used (i.e. conf is non NULL), where conf$cpus denotes the total number of CPUs to be used (see above). The actual creation of the cluster, its initialization, and closing are all done internally. In addition, when random number generation is needed, the creation of separate streams of parallel RNG per node is done internally by distributing the stream states to the nodes (For more details see function makeCluster (R package parallel) and/or http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html. The use of a seed allows to reproduce the results within the same type of session: the same seed will reproduce the same results within a non-parallel session or within a parallel session, but it will not necessarily give the exact same results (up to sampling variability) between a non-parallelized and parallelized session due to the difference of management of the seed between the two (see parallel RNG and value of retuned seed below).

References

Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2015). "Cross-validation and Peeling Strategies for Survival Bump Hunting using Recursive Peeling Methods." (Submitted).
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2014). "Cross-Validation of Survival Bump Hunting by Recursive Peeling Methods." In JSM Proceedings, Survival Methods for Risk Estimation/Prediction Section. Boston, MA, USA. American Statistical Association IMS - JSM, p. 3366-3380.
Dazard J-E. and J. S. Rao (2010). "Local Sparse Bump Hunting." J. Comp Graph. Statistics, 19(4):900-92.

Examples

Run this code

#===================================================
# Loading the library and its dependencies
#===================================================
library("PRIMsrc")

#===================================================
    # PRIMsrc package news
    #===================================================
    PRIMsrc.news()
    
    #===================================================
    # PRIMsrc package citation
    #===================================================
    citation("PRIMsrc")
    
    #===================================================
    # Demo with a synthetic dataset
    # Use help for descriptions
    #===================================================
    data("Synthetic.1", "Synthetic.5", "Real.1", "Real.2", package="PRIMsrc")
    ?Synthetic.1
    ?Synthetic.5
    ?Real.1
    ?Real.2

#===================================================
# Simulated dataset #1 (n=250, p=3)
# Replicated Combined Cross-Validation (RCCV)
# Peeling criterion = LRT
# Optimization criterion = LRT
# Without parallelization
# Without computation of permutation p-values
#===================================================
CVCOMBREP.synt1 <- sbh(dataset = Synthetic.1, 
                       cvtype = "combined", cvcriterion = "lrt",
                       B = 1, K = 5, 
                       vs = TRUE, cpv = FALSE, probval = 0.5, 
                       arg = "beta=0.05,
                              alpha=0.1,
                              minn=10,
                              L=NULL,
                              peelcriterion="lr"",
                       parallel = FALSE, conf = NULL, seed = 123)

# selected covariates:
selected <- CVCOMBREP.synt1$selected
selected
# covariates used for peeling:
used <- CVCOMBREP.synt1$used
used
# some output results:
CVCOMBREP.synt1$cvfit$cv.maxsteps
CVCOMBREP.synt1$cvfit$cv.nsteps
CVCOMBREP.synt1$cvfit$cv.trace
CVCOMBREP.synt1$cvfit$cv.rules$frame[,used]
round(CVCOMBREP.synt1$cvfit$cv.stats$mean,2)

#===================================================
    # Examples of parallel backend parametrization 
    #===================================================
    # Example #1 - 1-Quad (4-core double threaded) PC 
    # Running WINDOWS
    # With SOCKET communication
    #===================================================
    if (.Platform$OS.type == "windows") {
        cpus <- detectCores()
        conf <- list("names" = rep("localhost", cpus),
                     "cpus" = cpus,
                     "type" = "SOCK",
                     "homo" = TRUE,
                     "verbose" = TRUE,
                     "outfile" = "")
    }
    #===================================================
    # Example #2 - 1 master node + 3 worker nodes cluster
    # All nodes equipped with identical setups and multicores
    # Running LINUX
    # With SOCKET communication
    #===================================================
    if (.Platform$OS.type == "unix") {
        masterhost <- Sys.getenv("HOSTNAME")
        slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
        nodes <- length(slavehosts) + 1
        cpus <- 8
        conf <- list("names" = c(rep(masterhost, cpus),
                                 rep(slavehosts, cpus)),
                     "cpus" = nodes * cpus,
                     "type" = "SOCK",
                     "homo" = TRUE,
                     "verbose" = TRUE,
                     "outfile" = "")
    }
    #===================================================
    # Example #3 - Multinode multicore per node cluster
    # Running LINUX 
    # with MPI communication
    # Here, a file named ".nodes" (e.g. in the home directory)
    # contains the list of nodes of the cluster
    #===================================================
    if (.Platform$OS.type == "unix") {
        hosts <- scan(file=paste(Sys.getenv("HOME"), "/.nodes", sep=""), 
                      what="", 
                      sep="\n")
        hostnames <- unique(hosts)
        nodes <- length(hostnames)
        cpus <-  length(hosts)/length(hostnames)
        conf <- list("cpus" = nodes * cpus,
                     "type" = "MPI",
                     "homo" = TRUE,
                     "verbose" = TRUE,
                     "outfile" = "")
    }
    
    #===================================================
    # Simulated dataset #1 (n=250, p=3)
    # Replicated Combined Cross-Validation (RCCV)
    # Peeling criterion = LRT
    # Optimization criterion = LRT
    # With parallelization
    # With computation of permutation p-values
    #===================================================
    CVCOMBREP.synt1 <- sbh(dataset = Synthetic.1, 
                           cvtype = "combined", cvcriterion = "lrt",
                           B = 10, K = 5, A = 1024, 
                           vs = TRUE, cpv = TRUE, probval = 0.5, 
                           arg = "beta=0.05,
                                  alpha=0.1,
                                  minn=10,
                                  L=NULL,
                                  peelcriterion=\"lr\"",
                           parallel = TRUE, conf = conf, seed = 123)
                           
    # selected covariates:
    selected <- CVCOMBREP.synt1$selected
    selected
    # covariates used for peeling:
    used <- CVCOMBREP.synt1$used
    used
    # some output results:
    CVCOMBREP.synt1$cvfit$cv.maxsteps
    CVCOMBREP.synt1$cvfit$cv.nsteps
    CVCOMBREP.synt1$cvfit$cv.trace
    CVCOMBREP.synt1$cvfit$cv.pval
    CVCOMBREP.synt1$cvfit$cv.rules$frame[,used]
    round(CVCOMBREP.synt1$cvfit$cv.stats$mean,2)

Run the code above in your browser using DataLab