extract_sample_similarity: Internal function to extract the sample distance table.

Description

Computes and extracts the sample distance table for samples analysed using a familiarEnsemble object to form a familiarData object. This table can be used to cluster samples, and is exported directly by extract_feature_expression.

Usage

extract_sample_similarity(
  object,
  data,
  cl = NULL,
  is_pre_processed = FALSE,
  sample_limit = waiver(),
  sample_cluster_method = waiver(),
  sample_linkage_method = waiver(),
  sample_similarity_metric = waiver(),
  verbose = FALSE,
  message_indent = 0L,
  ...
)

Value

A data.table containing pairwise distances between samples. The table holds only the upper triangle of the complete distance matrix (i.e. the sparse unitriangular representation). Diagonal elements are always 0.0, and the lower triangle mirrors the upper triangle.
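
The upper-triangular layout can be illustrated with a small base R sketch. It builds a full distance matrix for toy data and keeps one row per unordered sample pair. The column names (sample_1, sample_2, value), the toy data and the use of a plain data.frame are assumptions for illustration; the columns actually returned by this function may differ.

```r
# Toy data: 4 samples, 2 numeric features (illustrative values only).
x <- matrix(
  c(0, 0,
    3, 4,
    6, 8,
    0, 1),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("s", 1:4), c("f1", "f2"))
)

# Full symmetric distance matrix with a zero diagonal.
d <- as.matrix(stats::dist(x, method = "euclidean"))

# Keep only the upper triangle: one row per unordered sample pair.
# Column names below are hypothetical.
idx <- which(upper.tri(d), arr.ind = TRUE)
pair_table <- data.frame(
  sample_1 = rownames(d)[idx[, "row"]],
  sample_2 = colnames(d)[idx[, "col"]],
  value = d[idx]
)
```

Four samples yield six pairs. The full matrix can be recovered by mirroring the values across the diagonal and filling the diagonal with zeros.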

Arguments

object

A familiarEnsemble object, which is an ensemble of one or more familiarModel objects.

data

A dataObject object, data.table or data.frame that constitutes the data that are assessed.

cl

Cluster created using the parallel package. This cluster is then used to speed up computation through parallelisation.
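
A minimal sketch of creating such a cluster with the parallel package is shown below. The worker count of 2 and the trivial task are assumptions for illustration. In practice the cluster object would be passed through the cl argument instead of being used directly.

```r
# Create a small local cluster with the parallel package.
cl <- parallel::makeCluster(2L)

# Exercise the cluster with a trivial task to show that it works.
squares <- parallel::parLapply(cl, 1:4, function(i) i^2)

# Release the worker processes when done.
parallel::stopCluster(cl)
```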

is_pre_processed

Flag that indicates whether the data was already pre-processed externally, e.g. normalised and clustered. Only used if the data argument is a data.table or data.frame.

sample_limit

(optional) Set the upper limit of the number of samples that are used during evaluation steps. Cannot be less than 20.

This setting can be specified per data element by providing a parameter value in a named list with data elements, e.g. list("sample_similarity"=100, "ice_data"=1000).

This parameter can be set for the following data elements: sample_similarity and ice_data.

sample_cluster_method

The method used to perform clustering based on distance between samples. These are the same methods as for the cluster_method configuration parameter: hclust, agnes, diana and pam.

The value none cannot be used when extracting data for feature expressions.

If not provided explicitly, this parameter is read from settings used at creation of the underlying familiarModel objects.

sample_linkage_method

The method used for agglomerative clustering in hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward.

If not provided explicitly, this parameter is read from settings used at creation of the underlying familiarModel objects.
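
To show how a cluster method and a linkage method combine, the sketch below applies stats::hclust with average linkage to a toy distance object. It calls base R directly and is not taken from the package. How names such as weighted and ward map onto the method names accepted by hclust is not stated here, so only average is used. The toy data and the choice of three clusters are assumptions.

```r
# Toy data: three well-separated groups of samples (illustrative values only).
x <- rbind(
  c(0, 0), c(0, 1), c(1, 0),
  c(10, 10), c(10, 11),
  c(20, 0), c(21, 0)
)
rownames(x) <- paste0("s", seq_len(nrow(x)))

# Pairwise distances between samples.
d <- stats::dist(x, method = "euclidean")

# Agglomerative clustering with average linkage.
h <- stats::hclust(d, method = "average")

# Cut the dendrogram into three clusters of samples.
clusters <- stats::cutree(h, k = 3)
```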

sample_similarity_metric

Metric to determine pairwise similarity between samples. Similarity is computed in the same manner as for clustering, but sample_similarity_metric has different options that are better suited to computing distance between samples instead of between features: gower, euclidean.

The underlying feature data are scaled to the \([0, 1]\) range (for numerical features) using the feature values across the samples. The required normalisation parameters can optionally be computed from feature data with the outer 5% (on both sides) of feature values trimmed or winsorised. To do so, append _trim (trimming) or _winsor (winsorising) to the metric name. This reduces the effect of outliers somewhat.

If not provided explicitly, this parameter is read from settings used at creation of the underlying familiarModel objects.
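
The scaling and distance computation described above can be sketched in base R for purely numeric features. This is an illustration under stated assumptions, not the package's implementation: the helper scale_unit is hypothetical, values outside the winsorised reference range are clamped to \([0, 1]\), and the Gower distance is taken as the mean absolute difference of the scaled features.

```r
# Hypothetical helper: scale a numeric feature to [0, 1]. With winsorise = TRUE,
# the minimum and maximum come from values winsorised at the outer 5%.
scale_unit <- function(x, winsorise = FALSE) {
  ref <- x
  if (winsorise) {
    q <- stats::quantile(x, probs = c(0.05, 0.95), names = FALSE)
    ref <- pmin(pmax(x, q[1]), q[2])
  }
  lo <- min(ref)
  hi <- max(ref)
  if (hi == lo) return(rep(0, length(x)))
  # Assumption: values outside the reference range are clamped to [0, 1].
  pmin(pmax((x - lo) / (hi - lo), 0), 1)
}

# Toy data: 5 samples, 2 features; f1 contains an outlier.
x <- cbind(f1 = c(1, 2, 3, 4, 100), f2 = c(10, 20, 30, 40, 50))
x_scaled <- apply(x, 2, scale_unit, winsorise = TRUE)

# Gower distance for numeric features: mean absolute difference
# of the scaled features.
gower <- as.matrix(stats::dist(x_scaled, method = "manhattan")) / ncol(x_scaled)

# Euclidean distance on the same scaled features.
euclid <- as.matrix(stats::dist(x_scaled, method = "euclidean"))
```

Winsorising keeps the outlier in f1 from compressing the remaining values towards 0 after scaling.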

verbose

Flag to indicate whether feedback should be provided on the computation and extraction of various data elements.

message_indent

Number of indentation steps for messages shown during computation and extraction of various data elements.

...

Unused arguments.