cosine.similarity: Set and vector similarity measures.

Description

Functions for computing similarity between two vectors or sets. See "Details" for exact formulas.

- Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.

- Tversky index is an asymmetric similarity measure on sets that compares a variant to a prototype.

- Overlap coefficient is a similarity measure related to the Jaccard index that measures the overlap between two sets. It is defined as the size of the intersection divided by the size of the smaller of the two sets.

- Jaccard index is a statistic used for comparing the similarity and diversity of sample sets.

- Morisita's overlap index is a statistical measure of dispersion of individuals in a population. It is used to compare overlap among samples (Morisita 1959). This formula is based on the assumption that increasing the size of the samples will increase the diversity because it will include different habitats (i.e. different faunas).

- Horn's overlap index is an overlap measure based on Shannon's entropy.

Use the repOverlap function for computing similarities of clonesets.

Usage

cosine.similarity(.alpha, .beta, .do.norm = NA, .laplace = 0)
tversky.index(x, y, .a = 0.5, .b = 0.5)
overlap.coef(.alpha, .beta)
jaccard.index(.alpha, .beta, .intersection.number = NA)
morisitas.index(.alpha, .beta, .do.unique = T)
horn.index(.alpha, .beta, .do.unique = T)

Arguments

.alpha, .beta, x, y

Input data. For cosine.similarity: vectors of numeric values. For tversky.index and overlap.coef: vectors of any values (for example, characters). For morisitas.index and horn.index: a matrix or data.frame with 2 columns. For jaccard.index: either two sets or the numbers of elements in two sets.

.do.norm

One of three values: NA, T or F. If NA, check whether the input is a distribution (sum(.data) == 1) and, if it is not, normalise it using the given Laplace correction value. If T, always normalise and apply the Laplace correction. If F, do neither normalisation nor Laplace correction.

.laplace

Value for Laplace correction.

.a, .b

Alpha and beta parameters for the Tversky index. By the formula in "Details", the default values .a = .b = 0.5 give the Sørensen-Dice coefficient; .a = .b = 1 gives the Jaccard index.

.do.unique

If T, call unique on the first column of each given data.frame or matrix.

.intersection.number

Number of intersected elements between two sets. See "Details" for more information.

Value

Value of similarity between the given sets or vectors.

Details

For morisitas.index and horn.index, the input data are matrices or data.frames with two columns: the first column holds the elements (species or individuals) and the second holds the number of each element (species or individuals) in the population.

Formulas:

Cosine similarity: cos(a, b) = (a . b) / (||a|| * ||b||), where a . b is the dot product of a and b
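The cosine formula can be sketched in base R as follows. The helper name cosine_sketch is hypothetical and is not the package's cosine.similarity; in particular it skips the .do.norm and .laplace handling described under "Arguments".

```r
# Hypothetical helper illustrating the formula only; not the package's code.
cosine_sketch <- function(a, b) {
  stopifnot(is.numeric(a), is.numeric(b), length(a) == length(b))
  # Dot product divided by the product of the Euclidean norms
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cosine_sketch(c(1, 0), c(0, 1))        # orthogonal vectors give 0
cosine_sketch(c(1, 2, 3), c(2, 4, 6))  # parallel vectors give 1 (up to rounding)
```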

Tversky index: S(X, Y) = |X and Y| / (|X and Y| + a*|X - Y| + b*|Y - X|)
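A minimal base R sketch of the Tversky formula above, to show how the two weights act on the set differences. The name tversky_sketch is hypothetical, and deduplicating the inputs with unique is an assumption of this sketch rather than documented package behaviour.

```r
# Hypothetical helper illustrating the formula only; not the package's code.
tversky_sketch <- function(x, y, a = 0.5, b = 0.5) {
  x <- unique(x)  # assumption: treat inputs as sets
  y <- unique(y)
  common <- length(intersect(x, y))  # |X and Y|
  only.x <- length(setdiff(x, y))    # |X - Y|
  only.y <- length(setdiff(y, x))    # |Y - X|
  common / (common + a * only.x + b * only.y)
}

x <- c("A", "B", "C", "D")
y <- c("C", "D", "E")
tversky_sketch(x, y)        # a = b = 0.5: 2 / (2 + 1 + 0.5) = 4/7
tversky_sketch(x, y, 1, 1)  # a = b = 1:   2 / (2 + 2 + 1)   = 0.4
```

With a = b = 1 the denominator becomes |X U Y|, so the value coincides with the Jaccard index formula; unequal weights make the measure asymmetric in X and Y.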

Overlap coefficient: overlap(X, Y) = |X and Y| / min(|X|, |Y|)

Jaccard index: J(A, B) = |A and B| / |A U B|

For the Jaccard index, the user can provide |A and B| in .intersection.number; otherwise it is computed with the base::intersect function. In the latter case .alpha and .beta are expected to be vectors of elements. If .intersection.number is provided, .alpha and .beta are expected to be the numbers of elements in the two sets.
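The overlap coefficient and the two calling modes of the Jaccard index can be sketched in base R as below. The names overlap_sketch, jaccard_sketch and n.common are hypothetical, and the use of unique is an assumption of the sketch; the package functions may treat duplicates differently.

```r
# Hypothetical helpers illustrating the formulas only; not the package's code.
overlap_sketch <- function(x, y) {
  x <- unique(x)
  y <- unique(y)
  length(intersect(x, y)) / min(length(x), length(y))
}

jaccard_sketch <- function(a, b, n.common = NA) {
  if (is.na(n.common)) {
    # Mode 1: a and b are vectors of elements
    a <- unique(a)
    b <- unique(b)
    n.common <- length(intersect(a, b))
    size.a <- length(a)
    size.b <- length(b)
  } else {
    # Mode 2: a and b are already set sizes
    size.a <- a
    size.b <- b
  }
  # |A U B| = |A| + |B| - |A and B|
  n.common / (size.a + size.b - n.common)
}

jaccard_sketch(1:10, 2:20)  # 9 shared elements out of 20 distinct: 0.45
jaccard_sketch(10, 19, 9)   # same value from sizes and intersection size
overlap_sketch(1:10, 2:20)  # 9 / min(10, 19) = 0.9
```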

The formula for Morisita's overlap index is more involved; see http://en.wikipedia.org/wiki/Morisita
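For readers who prefer code to the linked page, the commonly cited form of the index is C = 2 * sum(x_i * y_i) / ((D_x + D_y) * X * Y), with D_x = sum(x_i * (x_i - 1)) / (X * (X - 1)) and D_y defined likewise, where x_i and y_i are the counts of element i in each sample and X and Y are the sample totals. The sketch below assumes that form and the two-column input described above; morisita_sketch and its column names are hypothetical, and the package's morisitas.index (including its .do.unique handling) may differ.

```r
# Hypothetical helper illustrating the textbook formula; not the package's code.
morisita_sketch <- function(alpha, beta) {
  alpha <- setNames(as.data.frame(alpha), c("element", "n.a"))
  beta  <- setNames(as.data.frame(beta),  c("element", "n.b"))
  # Align counts by element; elements absent from one sample get a count of 0
  m <- merge(alpha, beta, by = "element", all = TRUE)
  m[is.na(m)] <- 0
  x <- m$n.a
  y <- m$n.b
  X <- sum(x)
  Y <- sum(y)
  d.x <- sum(x * (x - 1)) / (X * (X - 1))
  d.y <- sum(y * (y - 1)) / (Y * (Y - 1))
  2 * sum(x * y) / ((d.x + d.y) * X * Y)
}

s1 <- data.frame(species = c("a", "b", "c"), count = c(10, 5, 1))
s2 <- data.frame(species = c("b", "c", "d"), count = c(4, 6, 2))
s3 <- data.frame(species = c("e", "f"), count = c(3, 4))
morisita_sketch(s1, s2)  # partial overlap
morisita_sketch(s1, s3)  # no shared species gives 0
```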

Examples

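# NOT RUN {
# Two sets given directly: the intersection is computed internally
jaccard.index(1:10, 2:20)

# Set sizes plus a precomputed intersection size.
# nrow() counts the unique (CDR3, V gene) pairs; length() on a data.frame
# would return the number of columns instead.
a <- nrow(unique(immdata[[1]][, c('CDR3.amino.acid.sequence', 'V.gene')]))
b <- nrow(unique(immdata[[2]][, c('CDR3.amino.acid.sequence', 'V.gene')]))
# Next
jaccard.index(a, b, repOverlap(immdata[1:2], .seq = 'aa', .vgene = T))
# is equal to
repOverlap(immdata[1:2], 'jaccard', .seq = 'aa', .vgene = T)
# }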
