nearestCentroidPredictor( # Input training and test data
x, y, xtest,
# Feature weights and selection criteria
featureSignificance = NULL,
assocFnc = "cor", assocOptions = "use = 'p'",
assocCut.hi = NULL, assocCut.lo = NULL,
nFeatures.hi = 10, nFeatures.lo = 10,
weighFeaturesByAssociation = 0,
scaleFeatureMean = TRUE, scaleFeatureVar = TRUE,
# Predictor options
centroidMethod = c("mean", "eigensample"),
simFnc = "cor", simOptions = "use = 'p'",
useQuantile = NULL,
sampleWeights = NULL,
weighSimByPrediction = 0,
# Data pre-processing
nRemovePCs = 0, removePCsAfterScaling = TRUE,
# Sample network options
useSampleNetwork = FALSE,
adjFnc = "cor", adjOptions = "use = 'p'",
networkPower = 2,
clusteringFnc = ".hclust.cutreeDynamic",
clusteringOptions = list(minClusterSize = 4, deepSplit = 2, verbose = 0),
# What should be returned
CVfold = 0, returnFactor = FALSE,
# General options
randomSeed = 12345,
verbose = 2, indent = 0)
x: Training features (predictive variables). Rows correspond to samples,
columns to features.

xtest: Optional test set data. If test set data are not given, only the
prediction on training data will be returned.

assocFnc: Character string specifying the association function. The function
should behave roughly as cor in that it takes two arguments
(a matrix and a vector) plus options
and returns the vector of associations between the columns of the matrix and
the vector.

assocCut.hi: Association (or featureSignificance) threshold: features with
association higher than assocCut.hi
will be included. If not given, the threshold method will not be
used; instead, a fixed number of features (see nFeatures.hi and nFeatures.lo)
will be selected.

assocCut.lo: Association (or featureSignificance) threshold: features with
association lower than assocCut.lo
will be included. If not given, defaults to -assocCut.hi.

nFeatures.hi: Number of features with the highest association (or
featureSignificance) to include in the
predictor. Only used if assocCut.hi is NULL.

nFeatures.lo: Number of features with the lowest association (or
featureSignificance) to include in
the predictor. Only used if assocCut.hi is NULL.

centroidMethod: One of "mean"
and "eigensample", specifies how the centroid should be calculated.
"mean"
takes the mean across all samples (or all samples within a sample module, if sample networks
are used), whereas "eigensample" uses the first principal component of the
selected features as the centroid.

simFnc: Character string specifying the similarity function. The function
should behave roughly as cor
in that it takes two arguments (x and y) plus options.

adjFnc: Character string specifying the adjacency function for the sample
network. The function should behave roughly as cor
in that it takes one argument x
plus possibly
options; an exception is adjFnc="dist".

randomSeed: Seed for the random number generator. If NULL,
the seed will not be set. See
set.seed.

The returned list includes the vector of feature significances calculated by
assocFnc, or a copy of the
input featureSignificance
if the latter is non-NULL; the representative profiles of each class, returned
only if useQuantile
is NULL; the feature validation weights, which are a unit vector (used and
returned as such) if weighSimByPrediction is 0; and the cross-validation
prediction on the training data, present only if CVfold is
non-zero.

The sample clustering function is specified by the argument clusteringFnc. The
clustering function must accept a
dissimilarity structure as the first argument and can accept an arbitrary number of other arguments/options
that can be passed
using the argument clusteringOptions
. The return value must be a vector of cluster labels, with
0 meaning unassigned. Two internal functions, .hclust.cutreeDynamic
and .pam.NCP,
that implement hierarchical clustering and PAM, respectively, are provided. The function
.hclust.cutreeDynamic
calls hclust
to build a hierarchical clustering dendrogram, then
identifies modules in the dendrogram using cutreeDynamic
. The function
.hclust.cutreeDynamic
accepts all options accepted by functions hclust
and
cutreeDynamic
. Since both functions accept an argument named "method", the method
option
affects hclust,
while the "method" argument of cutreeDynamic
can be supplied via the argument
cutreeMethod
.
The function .pam.NCP
calls pam
(the standard Partitioning Around Medoids) and
passes all arguments to that function. The reason one cannot call pam
directly is
that it returns more than just the cluster labels; .pam.NCP
discards the extra output and returns
only the cluster labels.
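The required interface can be illustrated with a short base-R sketch. The name exampleClusteringFnc and its k option are hypothetical (not part of the package), and for simplicity this sketch assigns every sample, so no label is 0:

```r
# Hypothetical user-supplied clustering function matching the interface above:
# first argument is a dissimilarity structure, extra options are allowed, and
# the return value is a vector of cluster labels (0 would mean unassigned).
exampleClusteringFnc <- function(diss, k = 2, method = "average", ...) {
  tree <- hclust(as.dist(diss), method = method)
  as.numeric(cutree(tree, k = k))  # this simple sketch leaves no sample unassigned
}

# Toy dissimilarity: two well-separated groups of points on a line
pts <- c(0, 0.1, 0.2, 10, 10.1, 10.2)
clusterLabels <- exampleClusteringFnc(as.matrix(dist(pts)), k = 2)
```

A function of this shape could then be passed via clusteringFnc, with its extra options (here k and method) supplied through clusteringOptions.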
The representative profile of each class is formed either as the mean of the training samples in the
class or as the class's first principal component ("eigensample"), as controlled by the argument
centroidMethod.

When the number of features is large and only a small fraction is likely to be associated with the outcome,
feature selection can be used to restrict the features that actually enter the centroid. Feature selection
can be based either on their association with the outcome
calculated from the training data using assocFnc
, or on user-supplied feature significance (e.g.,
derived from literature, argument
featureSignificance
). In either case, features can be selected by high and low association thresholds
or by taking a fixed number of highest- and lowest-associated features.
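A minimal base-R sketch of the two selection modes follows; the simulated data, the cutoff 0.5, and the feature counts are illustrative choices, not package defaults or internals:

```r
# Simulated training data: 20 features, of which only two are associated with y
set.seed(1)
n <- 50; p <- 20
y <- rnorm(n)
x <- matrix(rnorm(n * p), n, p)
x[, 1] <-  y + rnorm(n, sd = 0.2)  # strongly positively associated feature
x[, 2] <- -y + rnorm(n, sd = 0.2)  # strongly negatively associated feature

# Association of each feature with the outcome (analogous to assocFnc/assocOptions)
assoc <- as.vector(cor(x, y, use = "p"))

# Threshold mode: keep features above assocCut.hi or below assocCut.lo
assocCut.hi <- 0.5
assocCut.lo <- -assocCut.hi
selThreshold <- which(assoc > assocCut.hi | assoc < assocCut.lo)

# Fixed-number mode: keep the nFeatures.hi highest and nFeatures.lo lowest associations
nFeatures.hi <- 3; nFeatures.lo <- 3
ord <- order(assoc)
selFixed <- c(head(ord, nFeatures.lo), tail(ord, nFeatures.hi))
```

Both modes pick up the two informative features; the fixed-number mode always returns the requested count, while the threshold mode may return none.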
As an alternative to centroids, the predictor can also assign test samples based on a given quantile of the
distances from the training samples in each class (argument useQuantile
). This may be advantageous if
the samples in each class form irregular clusters. Note that setting useQuantile=0
(i.e., using
minimum distance in each class) essentially gives a nearest neighbor predictor: each test sample will be
assigned to the class of its nearest training neighbor.
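The quantile-based assignment can be sketched in base R; quantileAssign is a hypothetical helper using Euclidean distance rather than the package's simFnc machinery, and with q = 0 it reduces to the nearest-neighbor rule just described:

```r
# Assign each test sample to the class whose training-sample distances attain
# the smallest q-th quantile; q = 0 uses the minimum distance in each class,
# i.e., a 1-nearest-neighbor rule.
quantileAssign <- function(xtest, x, y, q = 0) {
  classes <- unique(y)
  apply(xtest, 1, function(s) {
    d <- sapply(classes, function(cl) {
      dists <- sqrt(colSums((t(x[y == cl, , drop = FALSE]) - s)^2))
      quantile(dists, q)
    })
    classes[which.min(d)]
  })
}

x <- rbind(c(0, 0), c(0, 1), c(5, 5), c(5, 6))
y <- c(1, 1, 2, 2)
xtest <- rbind(c(0, 0.4), c(5, 5.5))
pred <- quantileAssign(xtest, x, y, q = 0)
```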
If features exhibit non-trivial correlations among themselves (such as, for example, in gene expression
data), one can attempt to down-weigh features that do not exhibit the same correlation in the test set.
This is done by using essentially the same predictor to predict _features_ from all other features in the
test data (using the training data to train the feature predictor). Because test features are known, the
prediction accuracy can be evaluated. If a feature is predicted badly (meaning the error in the test set is
much larger than the error in the cross-validation prediction in training data),
it may mean that its quality in the
training or test data is low (for example, due to excessive noise or outliers).
Such features can be downweighed using the argument weighSimByPrediction
. The extra weight factor is
min(1, ((root mean square cross-validation prediction error in the training data)/(root mean square
prediction error in the test set))^weighSimByPrediction); that is, the factor never exceeds 1 and
decreases for features whose test-set error is much larger than their cross-validation error.
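This factor can be sketched in base R. Note that for badly predicted features (test error much larger than cross-validation error) to receive weights below 1 while the factor stays capped at 1, the ratio must be taken as cross-validation error over test error; predictionWeight is an illustrative helper, not the package implementation:

```r
# Down-weighting factor: (CV error / test error)^power, capped at 1
predictionWeight <- function(rmseCV, rmseTest, power = 1) {
  pmin((rmseCV / rmseTest)^power, 1)
}

rmseCV   <- c(1.0, 1.0, 2.0)
rmseTest <- c(1.0, 4.0, 1.0)  # feature 2 is predicted much worse in the test set
w <- predictionWeight(rmseCV, rmseTest, power = 1)
```

Feature 2 receives weight 0.25, while features whose test error does not exceed their cross-validation error keep weight 1.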
Unless the features' mean and variance can be ascribed clear meaning, the (training) features should be scaled to mean 0 and variance 1 before the centroids are formed.
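The scaling step can be sketched as follows; this sketch assumes (as is common practice, though not stated explicitly above) that the means and standard deviations estimated on the training features are also applied to the test features:

```r
# Toy train/test feature matrices (2 features)
xTrain <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)
xTest  <- matrix(c(2.5, 25), ncol = 2)

# Estimate scaling parameters on the training data only
mu    <- colMeans(xTrain)
sigma <- apply(xTrain, 2, sd)

# Apply the same transformation to both sets
xTrainScaled <- scale(xTrain, center = mu, scale = sigma)
xTestScaled  <- scale(xTest,  center = mu, scale = sigma)
```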
The function implements a basic option for removal of spurious effects in the training and test data by removing a fixed number of leading principal components from the features. This sometimes leads to better prediction accuracy but should be used with caution.
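Removing leading principal components can be sketched with base R's svd(); removeLeadingPCs is an illustrative helper, not the package implementation:

```r
# Remove the nRemove leading principal components from feature matrix x:
# center, decompose, reconstruct from the remaining components, restore means.
removeLeadingPCs <- function(x, nRemove) {
  if (nRemove == 0) return(x)
  s <- svd(scale(x, scale = FALSE))
  keep <- (nRemove + 1):length(s$d)
  reconstructed <- s$u[, keep, drop = FALSE] %*%
    diag(s$d[keep], nrow = length(keep)) %*%
    t(s$v[, keep, drop = FALSE])
  sweep(reconstructed, 2, colMeans(x), "+")
}

set.seed(7)
x <- matrix(rnorm(60), 10, 6)
x1 <- removeLeadingPCs(x, 1)
```

After the removal, the data carry no variation along the discarded leading direction.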
If samples within each class are heterogeneous, a single centroid may not represent each class well. This
function can deal with within-class heterogeneity by clustering samples (separately in each class), then
using one representative (mean, eigensample) or quantile per cluster in each class to assign test
samples. Various similarity measures, specified by adjFnc
, can be used to construct the sample network
adjacency. Similarly, the user can specify a clustering function using clusteringFnc
. The
requirements on the clustering function are described in a separate section below.
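The within-class clustering idea can be sketched in base R, with hclust/cutree standing in for the configurable adjFnc/clusteringFnc machinery; the data and cluster count are illustrative:

```r
# One class whose samples form two well-separated sub-groups
set.seed(3)
classSamples <- rbind(
  matrix(rnorm(20, mean = 0), 10, 2),  # first sub-group
  matrix(rnorm(20, mean = 8), 10, 2)   # second sub-group
)

# Cluster the samples of this class, then form one centroid per cluster;
# a test sample would be compared against both centroids, not a single one.
clusterLabels <- cutree(hclust(dist(classSamples), method = "average"), k = 2)
centroids <- t(sapply(split(seq_len(nrow(classSamples)), clusterLabels),
                      function(i) colMeans(classSamples[i, , drop = FALSE])))
```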
See also: votingLinearPredictor.