sigCheck: Create a `SigCheckObject` and establish baseline performance.

Description

Main constructor for a SigCheckObject. Also establishes baseline survival analysis and/or classification performance.

Usage

sigCheck(expressionSet, classes, survival, signature,  annotation, validationSamples,  scoreMethod="PCA1", threshold=median, classifierMethod=svmI, modeVal, survivalLabel, timeLabel, plotTrainingKM=TRUE, plotValidationKM=TRUE, impute=TRUE)

Arguments

expressionSet

An ExpressionSet object containing the data to be checked, including an expression matrix, feature labels, and samples.

expressionSet can also be an existing SigCheckObject, in which case everything will be inherited from the passed object except the values for any specified parameters,

classes

Specifies which label is to be used to determine the prognostic categories (must be one of varLabels(expressionSet)). There should be only two unique values in expressionSet$classes.

survival

Specifies which label is to be used to determine survival times. (must be one of varLabels(expressionSet)). This may be missing if only classification is being checked.

signature

A vector of feature labels specifying which features comprise the signature to be checked. These feature labels should match values as specified in the annotation parameter. Alternatively, this can be a integer vector of feature indexes.

annotation

Character string specifying which fvarLabels field should be used as the annotation. If missing, the row names of the expressionSet are used as the feature names.

validationSamples

Optional specification, as a vector of sample indices, of what samples in the expressionSet should be considered validation samples. If present, the main checks will be performed using only these samples. If a the scoreMethod parameter is equal to "classifier", the remaining samples will be used as a training set to construct a classifier that will be used to separate the training samples. If a classifier is used, and validationSamples is not specified, a leave-one-out (LOO) validation method will be used, where a separate classifier will be trained to classify each sample using the all the remaining samples.

scoreMethod

specification of how the samples should be split into groups for survival analysis. If a character sting, one of the following values:

"PCA1": default scoring method for separating validation samples into groups by taking the value of the first principal component of the expression values in the signature for each sample.
"High": score used for separating validation samples into groups for each sample is the mean value over all the expression values in the signature for each sample.
"classifer": score used for separating validation samples into groups is determined by a classifier specified in the classifierMethod parameter. If the survival parameter is specified, the classifier method must return a real-valued score for each predicted sample.

scoreMethod can also be a user-defined function that computes a score. The function should take a single parameter, an ExpressionSet, and return a vector of score, one for each row.

if the survival parameter is missing, scoreMethod value must be "classifier".

threshold

specifies the threshold used for separating the validation samples into classed based on the score derived using scoreMethod. Can be either a function, (with default median) or a number between zero and one indicating a percentile. Validation samples will be divided into a group whose percentile scores are less than this value, and another group with percentile scores greater than or equal to this value. threshold may also be a vector of two percentiles, in which case samples will be divided into High, Low, and Mid groups. The survival p-value will be computed using only the high and low groups, with the mid group samples excluded.

classifierMethod

if the scoreMethod is equal to "classifer", this specifies what classifier to use. It is a MLInterfaces learnerSchema object indicating the machine learning method to use for classification. Default is svmI for linear Support Vector Machine classification. See MLearn for available methods.

modeVal

specifies which of the two category values (one of the values implied by the classes parameter) should be considered as the default value when computing the performance of a "mode" classifier. Is missing, the actual mode (most commonly occurring) value of the training set will be used.

survivalLabel

String to use in the Y-axis of any Kaplan-Meier plots generated, this indicates what aspect of survival is being predicted, such as time to recurrence or death.

timeLabel

String to use in the X-axis of an Kaplan-Meier plots generated, this indicates the units of time, such as days or months to outcome event.

plotTrainingKM

if the survival and validationSamples parameters are provided, a Kaplan-Meier plot can be plotted automatically for the training set samples if this is TRUE. A value of FALSE will suppress the plot being automatically generated.

plotValidationKM

if the survival parameter is provided, a Kaplan-Meier plot can be plotted automatically for the validation set samples if this is TRUE. A value of FALSE will suppress the plot being automatically generated. Note that is the validationSamples parameter is missing, the resulting plot will be over all samples.

impute

if TRUE, missing data values in the expressionSet will be imputed. If FALSE, any features with any missing values will be removed from the dataset.

Value

If the baseline analysis can be completed, a SigCheckObject is returned.

Details

This function constructs a new SigCheckObject and carried out a baseline analysis, which will vary depending on which parameters are specified.

If the survival parameter is specified, a survival analysis is carried out. If the validationSamples parameter is specified, this will be done separately on the validation samples and the remaining (training/discovery) samples. The main result is a p-value indicating the confidence that the samples are separable into groups with distinct survival outcomes. This value is obtained using the survdiff function in the survival package (and applying pchisq to the $chisq component of the result). The samples are separated into groups using the scoreMethod and threshold parameters (and possibly the classifierMethod parameter).

If the survival parameter is not specified, then the scoreMethod parameter must be equal to "classifier", and a pure classification analysis is completed (as was done in SigCheck 1.0). If the validationSamples parameter is specified, the remaining samples are used as a training set to construct a classifier that is used to classify the validation samples. If validationSamples is not specified, leave-one-out cross-validation is used whereby a separate classifier is trained to predict each sample using all of the others.

Examples

Run this code

library(breastCancerNKI)
data(nki)
nki <- nki[,!is.na(nki$e.dmfs)]
data(knownSignatures)

## survival analysis 
check <- sigCheck(nki, classes="e.dmfs", survival="t.dmfs",
                  signature=knownSignatures$cancer$VANTVEER,
                  annotation="HUGO.gene.symbol") 
check@survivalPval
check <- sigCheck(check, classes="e.dmfs", survival="t.dmfs",
                  signature=knownSignatures$cancer$VANTVEER,
                  annotation="HUGO.gene.symbol",
                  scoreMethod="High", threshold=.33) 
check@survivalPval

## survival analysis with separate training and validation using SVM
check <- sigCheck(nki, classes="e.dmfs", survival="t.dmfs",
                  signature=knownSignatures$cancer$VANTVEER,
                  annotation="HUGO.gene.symbol",
                  validationSamples=150:319,
                  scoreMethod="classifier") 
check

Run the code above in your browser using DataLab