signatureFinder: Main function to find the signature.

Description

This function implements the algorithm that finds the signature using a search strategy supervised by survival time data.

Usage

signatureFinder(seedGene, logFilePrefix = "", coeffMissingAllowed = 0.75, 
subsetToUse = 1:ncol(geData), cpuCluster = NULL, stopCpuCluster = TRUE)

Arguments

seedGene

Integer index of the column (gene) of geData from which the search starts. Optionally, a list of genes (indexes pointing to the columns of geData) can be provided.

logFilePrefix

String containing the prefix of the log file generated by the algorithm. No longer necessary in this version of the package.

coeffMissingAllowed

This parameter controls the number of missing values tolerated by the pam classification procedure (see details).

subsetToUse

If necessary, the construction of the signature can be restricted to a subset of genes. In this case a list of the columns of geData has to be provided.

cpuCluster

If a parallel search is required, this argument has to be set to the output of the NCPUS() function.

stopCpuCluster

Flag controlling whether the channel to the CPU cluster has to be closed.

Value

The function returns a list with the following slots:

signatureName: a string identifying the signature. By default it is set to (colnames(geData)[seedGene])[1].
startingSignature: a list of strings set to colnames(geData)[seedGene].
coeffMissingAllowed: same as input.
startingClassification: (factor) classification of the samples computed by using the gene expression levels of the startingSignature.
startingTValue: test-value of the log-rank test computed on the startingSignature.
startingPValue: p-value corresponding to the startingTValue.
signatureIDs: indexes pointing to the columns of geData providing the sequence of gene expression levels that maximizes the distance between the two survival curves.
signature: labels corresponding to signatureIDs: colnames(geData)[signatureIDs].
tValue: test-value of the log-rank test computed on the signature.
pValue: p-value corresponding to the tValue.
classification: (factor) classification of the samples computed by using the gene expression levels of the signature.

Details

Two variables have to be set up in the global environment: geData and stData. geData is a matrix whose columns are the gene expressions and whose rows are the samples (see geNSCLC for an example). It is recommended that the column names are set. stData is a variable of the "Surv" class from the package "survival" (see stNSCLC for an example).
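As an illustration of the expected shapes, the following minimal sketch builds toy versions of the two variables. The values, sample names and gene names are made up for the example and are not part of the package; only Surv() from the survival package is used.

```r
library(survival)  # provides Surv()

set.seed(1)
nSamples <- 6

# Toy expression matrix: rows are samples, columns are genes (names are hypothetical)
geData <- matrix(rnorm(nSamples * 3), nrow = nSamples,
                 dimnames = list(paste0("sample", seq_len(nSamples)),
                                 c("geneA", "geneB", "geneC")))

# Toy survival data: follow-up time and event indicator (1 = event, 0 = censored)
stData <- Surv(time = c(5, 12, 7, 20, 3, 15), event = c(1, 0, 1, 1, 0, 1))

# Basic consistency checks: one survival record per sample, named columns
stopifnot(is.matrix(geData),
          inherits(stData, "Surv"),
          nrow(geData) == nrow(as.matrix(stData)),
          !is.null(colnames(geData)))
```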

Starting from the seed gene (a list of seeds is allowed), the next gene added is the one that maximizes the distance between the two survival curves. The list of genes grows until no remaining gene improves the distance between the survival curves.
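The forward search described above can be sketched as follows. This is a simplified illustration, not the package's implementation: it assumes a two-cluster pam() classification scored with the log-rank statistic from survdiff(), it ignores missing values, the admission controls and the parallel back end, and the helper names logRankOfGenes and greedySignature are hypothetical. The data are random toy values.

```r
library(survival)  # Surv(), survdiff()
library(cluster)   # pam()

# Log-rank statistic of the 2-cluster pam classification built on the given columns
logRankOfGenes <- function(geData, stData, genes) {
  cls <- pam(geData[, genes, drop = FALSE], k = 2)$clustering
  stat <- survdiff(stData ~ cls)$chisq
  if (is.finite(stat)) stat else -Inf
}

# Greedy forward search: add the best gene until no gene improves the statistic
greedySignature <- function(geData, stData, seedGene) {
  signature <- seedGene
  best <- logRankOfGenes(geData, stData, signature)
  repeat {
    candidates <- setdiff(seq_len(ncol(geData)), signature)
    if (length(candidates) == 0) break
    scores <- vapply(candidates,
                     function(g) logRankOfGenes(geData, stData, c(signature, g)),
                     numeric(1))
    if (max(scores) <= best) break
    signature <- c(signature, candidates[which.max(scores)])
    best <- max(scores)
  }
  list(signatureIDs = signature, tValue = best)
}

# Toy data (random, for illustration only)
set.seed(42)
toyGe <- matrix(rnorm(40 * 5), nrow = 40)
toySt <- Surv(rexp(40), rbinom(40, 1, 0.7))
res <- greedySignature(toyGe, toySt, seedGene = 1)
res
```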

A gene (candidateGene) can be added to the running signature if it satisfies two controls: given the classification computed on the gene expressions of candidateGene + runningSignature, 1) no cluster can have a size lower than floor(0.1 * nrow(geData)), and 2) the survival curves cannot cross. When more than one candidate gene is proposed, the search stops if the number of candidates is greater than 0.01 * ncol(geData); otherwise a subset of the candidates is selected using a backward strategy.
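The two numeric thresholds in this paragraph can be written down as a small sketch. The helper names passesSizeControl and tooManyCandidates are hypothetical and only restate the rules above; the check that the survival curves do not cross is not covered here.

```r
# TRUE when every cluster has at least floor(0.1 * nSamples) members
passesSizeControl <- function(classification, nSamples) {
  minSize <- floor(0.1 * nSamples)
  all(table(classification) >= minSize)
}

# TRUE when the number of proposed candidates exceeds 1% of the genes,
# which is the condition under which the search stops
tooManyCandidates <- function(nCandidates, nGenes) {
  nCandidates > 0.01 * nGenes
}

# With 50 samples the minimum cluster size is floor(0.1 * 50) = 5
balanced   <- c(rep(1, 30), rep(2, 20))
unbalanced <- c(rep(1, 46), rep(2, 4))
passesSizeControl(balanced, 50)    # TRUE
passesSizeControl(unbalanced, 50)  # FALSE: one cluster has only 4 samples

tooManyCandidates(3, 200)  # TRUE:  3 > 0.01 * 200
tooManyCandidates(2, 500)  # FALSE: 2 <= 0.01 * 500
```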

The parameter coeffMissingAllowed controls an empirical rule intended to prevent the pam() function from crashing. The number of joint missing values allowed in a sample described by p gene expression levels is given by floor(p^coeffMissingAllowed).
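To give a feeling for the rule, the short sketch below evaluates floor(p^coeffMissingAllowed) for a few signature sizes with the default value 0.75. The helper name allowedMissing is hypothetical.

```r
# Number of joint missing values tolerated in a sample described by p genes
allowedMissing <- function(p, coeffMissingAllowed = 0.75) {
  floor(p^coeffMissingAllowed)
}

p <- c(1, 2, 5, 10, 20)
allowed <- allowedMissing(p)
allowed
# [1] 1 1 3 5 9
```

With the default coefficient, a 10-gene signature therefore tolerates samples with up to 5 jointly missing expression levels.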

Examples


# Find the signature starting from the gene SELP for the Non Small Cell Lung Cancer data.
# Assumes the package providing signatureFinder(), geNSCLC and stNSCLC is already loaded.
library(parallel)  # provides makeCluster()

# set the working data
data(geNSCLC)
geData <- geNSCLC
data(stNSCLC)
stData <- stNSCLC

# set the dimension of the cpu's cluster
aMakeCluster <- makeCluster(2)

# set the starting gene to SELP
geneSeed <- which(colnames(geData) == "SELP")

# run the search (stopCpuCluster is left at its default, TRUE)
ans <- signatureFinder(geneSeed, cpuCluster = aMakeCluster)
ans