nearestCentroidPredictor( # Input training and test data
x, y, xtest,
# Feature weights and selection criteria
featureSignificance = NULL,
assocFnc = "cor", assocOptions = "use = 'p'",
assocCut.hi = NULL, assocCut.lo = NULL,
nFeatures.hi = 10, nFeatures.lo = 10,
weighFeaturesByAssociation = 0,
scaleFeatureMean = TRUE, scaleFeatureVar = TRUE,
# Predictor options
centroidMethod = c("mean", "eigensample"),
simFnc = "cor", simOptions = "use = 'p'",
useQuantile = NULL,
sampleWeights = NULL,
weighSimByPrediction = 0,
# Data pre-processing
nRemovePCs = 0, removePCsAfterScaling = TRUE,
# Sample network options
useSampleNetwork = FALSE,
adjFnc = "cor", adjOptions = "use = 'p'",
networkPower = 2,
clusteringFnc = ".hclust.cutreeDynamic",
clusteringOptions = list(minClusterSize = 4, deepSplit = 2, verbose = 0),
# What should be returned
CVfold = 0, returnFactor = FALSE,
# General options
randomSeed = 12345,
verbose = 2, indent = 0)
x: Training features (predictive variables). Rows correspond to samples,
columns to features.

xtest: Optional test set data. If test set data are not given, only the
prediction on training data will be returned.

assocFnc: Character string specifying the association function. The function
should behave roughly as cor in that it takes two arguments
(a matrix and a vector) plus options
and returns the vector of associations between the columns of the matrix and
the vector.

assocCut.hi: Association (or featureSignificance) threshold: features with
association higher than assocCut.hi
will be included. If not given, the threshold method will not be
used; instead, a fixed number of features (see nFeatures.hi and nFeatures.lo)
will be selected.

assocCut.lo: Association (or featureSignificance) threshold: features with
association lower than assocCut.lo
will be included. If not given, defaults to -assocCut.hi.

nFeatures.hi: Number of features with the highest association (or
featureSignificance) to include in the
predictor. Only used if assocCut.hi is NULL.

nFeatures.lo: Number of features with the lowest association (or
featureSignificance) to include in
the predictor. Only used if assocCut.hi is NULL.

centroidMethod: One of "mean"
and "eigensample", specifies how the centroid should be calculated.
"mean"
takes the mean across all samples (or all samples within a sample module, if sample networks
are used), whereas "eigensample" uses the first principal component of the
selected features as the centroid.

simFnc: Character string specifying the similarity function. The function
should behave roughly as cor
in that it takes two arguments (x and y) plus options.

adjFnc: Character string specifying the adjacency function for the sample
network. The function should behave roughly as cor
in that it takes one argument x
plus possibly
options; an exception is adjFnc="dist".

randomSeed: Seed for the random number generator. If NULL,
the seed will not be set. See
set.seed.

The returned list includes the vector of feature significances calculated by
assocFnc, or a copy of the
input featureSignificance
if the latter is non-NULL; the representative profiles of each class, returned
only if useQuantile
is NULL; the feature validation weights, which are a unit vector (used and
returned as such) if weighSimByPrediction is 0; and the cross-validation
prediction on the training data, present only if CVfold is
non-zero.

The sample clustering function is specified by the argument clusteringFnc. The
clustering function must accept a
dissimilarity structure as the first argument and can accept an arbitrary number of other arguments/options
that can be passed
using the argument clusteringOptions
. The return value must be a vector of cluster labels, with
0 meaning unassigned. Two internal functions, .hclust.cutreeDynamic
and .pam.NCP,
that implement hierarchical clustering and PAM, respectively, are provided. The function
.hclust.cutreeDynamic
calls hclust
to build a hierarchical clustering dendrogram, then
identifies modules in the dendrogram using cutreeDynamic
. The function
.hclust.cutreeDynamic
accepts all options accepted by functions hclust
and
cutreeDynamic
. Since both functions accept an argument named "method", the method
option
affects hclust,
while the "method" argument of cutreeDynamic
can be supplied via the argument
cutreeMethod
.
The function .pam.NCP
calls pam
(the standard Partitioning Around Medoids) and
passes all arguments to that function. The reason one cannot call pam
directly is
that it returns more than just the cluster labels; .pam.NCP
discards the extra output and returns
only the cluster labels.
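The required interface can be illustrated with a short base-R sketch. The name exampleClusteringFnc and its k option are hypothetical (not part of the package), and for simplicity this sketch assigns every sample, so no label is 0:

```r
# Hypothetical user-supplied clustering function matching the interface above:
# first argument is a dissimilarity structure, extra options are allowed, and
# the return value is a vector of cluster labels (0 would mean unassigned).
exampleClusteringFnc <- function(diss, k = 2, method = "average", ...) {
  tree <- hclust(as.dist(diss), method = method)
  as.numeric(cutree(tree, k = k))  # this simple sketch leaves no sample unassigned
}

# Toy dissimilarity: two well-separated groups of points on a line
pts <- c(0, 0.1, 0.2, 10, 10.1, 10.2)
clusterLabels <- exampleClusteringFnc(as.matrix(dist(pts)), k = 2)
```

A function of this shape could then be passed via clusteringFnc, with its extra options (here k and method) supplied through clusteringOptions.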
The representative profile of each class is formed either as the mean of the training samples in the
class or as the class's first principal component ("eigensample"), as controlled by the argument
centroidMethod.

When the number of features is large and only a small fraction is likely to be associated with the outcome,
feature selection can be used to restrict the features that actually enter the centroid. Feature selection
can be based either on their association with the outcome
calculated from the training data using assocFnc
, or on user-supplied feature significance (e.g.,
derived from literature, argument
featureSignificance
). In either case, features can be selected by high and low association thresholds
or by taking a fixed number of highest- and lowest-associated features.
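A minimal base-R sketch of the two selection modes follows; the simulated data, the cutoff 0.5, and the feature counts are illustrative choices, not package defaults or internals:

```r
# Simulated training data: 20 features, of which only two are associated with y
set.seed(1)
n <- 50; p <- 20
y <- rnorm(n)
x <- matrix(rnorm(n * p), n, p)
x[, 1] <-  y + rnorm(n, sd = 0.2)  # strongly positively associated feature
x[, 2] <- -y + rnorm(n, sd = 0.2)  # strongly negatively associated feature

# Association of each feature with the outcome (analogous to assocFnc/assocOptions)
assoc <- as.vector(cor(x, y, use = "p"))

# Threshold mode: keep features above assocCut.hi or below assocCut.lo
assocCut.hi <- 0.5
assocCut.lo <- -assocCut.hi
selThreshold <- which(assoc > assocCut.hi | assoc < assocCut.lo)

# Fixed-number mode: keep the nFeatures.hi highest and nFeatures.lo lowest associations
nFeatures.hi <- 3; nFeatures.lo <- 3
ord <- order(assoc)
selFixed <- c(head(ord, nFeatures.lo), tail(ord, nFeatures.hi))
```

Both modes pick up the two informative features; the fixed-number mode always returns the requested count, while the threshold mode may return none.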
As an alternative to centroids, the predictor can also assign test samples based on a given quantile of the
distances from the training samples in each class (argument useQuantile
). This may be advantageous if
the samples in each class form irregular clusters. Note that setting useQuantile=0
(i.e., using
minimum distance in each class) essentially gives a nearest neighbor predictor: each test sample will be
assigned to the class of its nearest training neighbor.
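The quantile-based assignment can be sketched in base R; quantileAssign is a hypothetical helper using Euclidean distance rather than the package's simFnc machinery, and with q = 0 it reduces to the nearest-neighbor rule just described:

```r
# Assign each test sample to the class whose training-sample distances attain
# the smallest q-th quantile; q = 0 uses the minimum distance in each class,
# i.e., a 1-nearest-neighbor rule.
quantileAssign <- function(xtest, x, y, q = 0) {
  classes <- unique(y)
  apply(xtest, 1, function(s) {
    d <- sapply(classes, function(cl) {
      dists <- sqrt(colSums((t(x[y == cl, , drop = FALSE]) - s)^2))
      quantile(dists, q)
    })
    classes[which.min(d)]
  })
}

x <- rbind(c(0, 0), c(0, 1), c(5, 5), c(5, 6))
y <- c(1, 1, 2, 2)
xtest <- rbind(c(0, 0.4), c(5, 5.5))
pred <- quantileAssign(xtest, x, y, q = 0)
```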
If features exhibit non-trivial correlations among themselves (such as, for example, in gene expression
data), one can attempt to down-weigh features that do not exhibit the same correlation in the test set.
This is done by using essentially the same predictor to predict _features_ from all other features in the
test data (using the training data to train the feature predictor). Because test features are known, the
prediction accuracy can be evaluated. If a feature is predicted badly (meaning the error in the test set is
much larger than the error in the cross-validation prediction in training data),
it may mean that its quality in the
training or test data is low (for example, due to excessive noise or outliers).
Such features can be downweighed using the argument weighSimByPrediction
. The extra weight factor is
min(1, ((root mean square cross-validation prediction error in the training data)/(root mean square
prediction error in the test set))^weighSimByPrediction); that is, the factor never exceeds 1 and
decreases for features whose test-set error is much larger than their cross-validation error.
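This factor can be sketched in base R. Note that for badly predicted features (test error much larger than cross-validation error) to receive weights below 1 while the factor stays capped at 1, the ratio must be taken as cross-validation error over test error; predictionWeight is an illustrative helper, not the package implementation:

```r
# Down-weighting factor: (CV error / test error)^power, capped at 1
predictionWeight <- function(rmseCV, rmseTest, power = 1) {
  pmin((rmseCV / rmseTest)^power, 1)
}

rmseCV   <- c(1.0, 1.0, 2.0)
rmseTest <- c(1.0, 4.0, 1.0)  # feature 2 is predicted much worse in the test set
w <- predictionWeight(rmseCV, rmseTest, power = 1)
```

Feature 2 receives weight 0.25, while features whose test error does not exceed their cross-validation error keep weight 1.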
Unless the features' mean and variance can be ascribed clear meaning, the (training) features should be scaled to mean 0 and variance 1 before the centroids are formed.
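The scaling step can be sketched as follows; this sketch assumes (as is common practice, though not stated explicitly above) that the means and standard deviations estimated on the training features are also applied to the test features:

```r
# Toy train/test feature matrices (2 features)
xTrain <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)
xTest  <- matrix(c(2.5, 25), ncol = 2)

# Estimate scaling parameters on the training data only
mu    <- colMeans(xTrain)
sigma <- apply(xTrain, 2, sd)

# Apply the same transformation to both sets
xTrainScaled <- scale(xTrain, center = mu, scale = sigma)
xTestScaled  <- scale(xTest,  center = mu, scale = sigma)
```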
The function implements a basic option for removal of spurious effects in the training and test data by removing a fixed number of leading principal components from the features. This sometimes leads to better prediction accuracy but should be used with caution.
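Removing leading principal components can be sketched with base R's svd(); removeLeadingPCs is an illustrative helper, not the package implementation:

```r
# Remove the nRemove leading principal components from feature matrix x:
# center, decompose, reconstruct from the remaining components, restore means.
removeLeadingPCs <- function(x, nRemove) {
  if (nRemove == 0) return(x)
  s <- svd(scale(x, scale = FALSE))
  keep <- (nRemove + 1):length(s$d)
  reconstructed <- s$u[, keep, drop = FALSE] %*%
    diag(s$d[keep], nrow = length(keep)) %*%
    t(s$v[, keep, drop = FALSE])
  sweep(reconstructed, 2, colMeans(x), "+")
}

set.seed(7)
x <- matrix(rnorm(60), 10, 6)
x1 <- removeLeadingPCs(x, 1)
```

After the removal, the data carry no variation along the discarded leading direction.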
If samples within each class are heterogeneous, a single centroid may not represent each class well. This
function can deal with within-class heterogeneity by clustering samples (separately in each class), then
using one representative (mean, eigensample) or quantile per cluster in each class to assign test
samples. Various similarity measures, specified by adjFnc
, can be used to construct the sample network
adjacency. Similarly, the user can specify a clustering function using clusteringFnc
. The
requirements on the clustering function are described in a separate section below.
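The within-class clustering idea can be sketched in base R, with hclust/cutree standing in for the configurable adjFnc/clusteringFnc machinery; the data and cluster count are illustrative:

```r
# One class whose samples form two well-separated sub-groups
set.seed(3)
classSamples <- rbind(
  matrix(rnorm(20, mean = 0), 10, 2),  # first sub-group
  matrix(rnorm(20, mean = 8), 10, 2)   # second sub-group
)

# Cluster the samples of this class, then form one centroid per cluster;
# a test sample would be compared against both centroids, not a single one.
clusterLabels <- cutree(hclust(dist(classSamples), method = "average"), k = 2)
centroids <- t(sapply(split(seq_len(nrow(classSamples)), clusterLabels),
                      function(i) colMeans(classSamples[i, , drop = FALSE])))
```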
See also: votingLinearPredictor.