tcR (version 2.2.4)

# cosine.similarity: Set and vector similarity measures.

## Description

Functions for computing similarity between two vectors or sets. See "Details" for exact formulas.

- Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.

- Tversky index is an asymmetric similarity measure on sets that compares a variant to a prototype.

- Overlap coefficient is a similarity measure related to the Jaccard index. It measures the overlap between two sets and is defined as the size of the intersection divided by the size of the smaller of the two sets.

- Jaccard index is a statistic used for comparing the similarity and diversity of sample sets.

- Morisita's overlap index is a statistical measure of dispersion of individuals in a population. It is used to compare overlap among samples (Morisita 1959). This formula is based on the assumption that increasing the size of the samples will increase the diversity because it will include different habitats (i.e. different faunas).

- Horn's overlap index is based on Shannon's entropy.

Use the repOverlap function for computing similarities of clonesets.

## Usage

`cosine.similarity(.alpha, .beta, .do.norm = NA, .laplace = 0)`

`tversky.index(x, y, .a = 0.5, .b = 0.5)`

`overlap.coef(.alpha, .beta)`

`jaccard.index(.alpha, .beta, .intersection.number = NA)`

`morisitas.index(.alpha, .beta, .do.unique = T)`

`horn.index(.alpha, .beta, .do.unique = T)`

## Arguments

.alpha, .beta, x, y

Input data. For `cosine.similarity`, numeric vectors; for `tversky.index` and `overlap.coef`, vectors of any values (e.g. characters); for `morisitas.index` and `horn.index`, matrices or data.frames with two columns; for `jaccard.index`, either two sets or the two set sizes (numbers of elements).

.do.norm

One of three values: NA, T or F. If NA, check whether the input is already a distribution (`sum(.data) == 1`) and, if it is not, normalise it using the given Laplace correction value. If T, always apply normalisation and Laplace correction. If F, apply neither normalisation nor Laplace correction.

.laplace

Value for Laplace correction.

.a, .b

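Alpha and beta parameters for the Tversky index. Under the formula given in "Details", the default values (`.a = .b = 0.5`) give the Sørensen-Dice coefficient, while `.a = .b = 1` gives the Jaccard index.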

.do.unique

If T, call `unique` on the first column of each given data.frame or matrix.

.intersection.number

Number of intersected elements between two sets. See "Details" for more information.

## Value

Value of similarity between the given sets or vectors.

## Details

For `morisitas.index`, the input data are matrices or data.frames with two columns: the first column holds the elements (species or individuals), and the second holds the number of times each element occurs in the population.

Formulas:

Cosine similarity: `cos(a, b) = a * b / (||a|| * ||b||)`, where `a * b` denotes the dot product of the two vectors.
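The cosine formula can be checked by hand in base R. The following is a minimal sketch under stated assumptions: `cosine_sketch` is a hypothetical helper, not the tcR implementation, and it omits the normalisation and Laplace correction controlled by `.do.norm` and `.laplace`.

```r
# Hypothetical helper: plain cosine similarity of two numeric vectors.
# It does NOT reproduce the .do.norm / .laplace handling of cosine.similarity.
cosine_sketch <- function(a, b) {
  stopifnot(is.numeric(a), is.numeric(b), length(a) == length(b))
  dot <- sum(a * b)                       # dot product a . b
  dot / (sqrt(sum(a^2)) * sqrt(sum(b^2))) # divide by the product of the norms
}

cosine_sketch(c(1, 0), c(0, 1))        # orthogonal vectors: 0
cosine_sketch(c(1, 2, 3), c(2, 4, 6))  # parallel vectors: 1, up to floating-point error
```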

Tversky index: `S(X, Y) = |X and Y| / (|X and Y| + a*|X - Y| + b*|Y - X|)`
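To make the roles of `a` and `b` concrete, here is a small base-R sketch of the Tversky formula. `tversky_sketch` is a hypothetical name used for illustration and is not claimed to match the internals of `tversky.index`.

```r
# Hypothetical helper: Tversky index computed directly from the formula.
tversky_sketch <- function(x, y, a = 0.5, b = 0.5) {
  x <- unique(x)
  y <- unique(y)
  common <- length(intersect(x, y))   # |X and Y|
  only_x <- length(setdiff(x, y))     # |X - Y|
  only_y <- length(setdiff(y, x))     # |Y - X|
  common / (common + a * only_x + b * only_y)
}

x <- c("A", "B", "C", "D")
y <- c("C", "D", "E")

tversky_sketch(x, y, 1, 1)  # 2 / (2 + 2 + 1) = 0.4, the same as |X and Y| / |X U Y|
tversky_sketch(x, y)        # 2 / (2 + 1 + 0.5) = 4/7, the same as 2 * |X and Y| / (|X| + |Y|)
tversky_sketch(x, y, 1, 0)  # 2 / (2 + 2) = 0.5
tversky_sketch(y, x, 1, 0)  # 2 / (2 + 1) = 2/3; swapping the arguments changes the result
```

With unequal weights the measure is asymmetric, which is why the argument order (variant versus prototype) matters.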

Overlap coefficient: `overlap(X, Y) = |X and Y| / min(|X|, |Y|)`
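The overlap coefficient can be sketched the same way with base R set operations. `overlap_sketch` is a hypothetical helper for illustration only; it assumes both inputs are non-empty vectors of elements.

```r
# Hypothetical helper: overlap coefficient from the formula above.
overlap_sketch <- function(x, y) {
  x <- unique(x)
  y <- unique(y)
  length(intersect(x, y)) / min(length(x), length(y))
}

overlap_sketch(1:10, 2:20)  # 9 shared elements / min(10, 19) = 0.9
overlap_sketch(1:3, 1:10)   # the smaller set is fully contained in the larger: 1
```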

Jaccard index: `J(A, B) = |A and B| / |A U B|`

For the Jaccard index, the user can supply `|A and B|` via `.intersection.number`; in that case `.alpha` and `.beta` are expected to be the numbers of elements in the two sets. If `.intersection.number` is not provided, `.alpha` and `.beta` are expected to be vectors of elements, and the intersection is computed with the `base::intersect` function.
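The two input modes of the Jaccard index can be illustrated with a short sketch that relies on the identity `|A U B| = |A| + |B| - |A and B|`. `jaccard_sketch` and its argument names are hypothetical and only mirror the behaviour described above; they are not the tcR source.

```r
# Hypothetical helper: Jaccard index from sets, or from set sizes plus a
# precomputed intersection size.
jaccard_sketch <- function(alpha, beta, intersection_number = NA) {
  if (is.na(intersection_number)) {
    # alpha and beta are vectors of elements
    common <- length(intersect(alpha, beta))
    common / (length(unique(alpha)) + length(unique(beta)) - common)
  } else {
    # alpha and beta are the numbers of elements in each set
    intersection_number / (alpha + beta - intersection_number)
  }
}

jaccard_sketch(1:10, 2:20)  # 9 shared / 20 in the union = 0.45
jaccard_sketch(10, 19, 9)   # same value from sizes and intersection size: 0.45
```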

The formula for Morisita's overlap index is more involved; see http://en.wikipedia.org/wiki/Morisita for the full definition.
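For orientation, the textbook form of Morisita's overlap index is `C = 2 * sum(x_i * y_i) / ((D_x + D_y) * X * Y)`, with `D_x = sum(x_i * (x_i - 1)) / (X * (X - 1))` and `D_y` defined analogously, where `x_i` and `y_i` are the counts of element `i` in the two samples and `X`, `Y` are the sample totals. The sketch below implements that textbook form in base R. `morisita_sketch` is a hypothetical helper that assumes two-column data.frames (element, count) with each element listed once; it may differ from `morisitas.index`, for example in how `.do.unique` is handled.

```r
# Hypothetical helper: textbook Morisita overlap index for two samples.
# Assumes data.frames with the element in column 1 and its count in column 2.
morisita_sketch <- function(alpha, beta) {
  names(alpha) <- c("element", "count_a")
  names(beta)  <- c("element", "count_b")
  # Align the two samples on the element; absent elements get a count of 0.
  m <- merge(alpha, beta, by = "element", all = TRUE)
  m$count_a[is.na(m$count_a)] <- 0
  m$count_b[is.na(m$count_b)] <- 0
  x <- m$count_a
  y <- m$count_b
  X <- sum(x)
  Y <- sum(y)
  d_x <- sum(x * (x - 1)) / (X * (X - 1))
  d_y <- sum(y * (y - 1)) / (Y * (Y - 1))
  2 * sum(x * y) / ((d_x + d_y) * X * Y)
}

a <- data.frame(element = c("A", "B", "C"), count = c(5, 3, 2))
b <- data.frame(element = c("B", "C", "D"), count = c(4, 4, 2))
morisita_sketch(a, b)  # 2 * 20 / ((28/90 + 26/90) * 10 * 10) = 2/3
```

Samples that share no elements give 0, because every product `x_i * y_i` is zero.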

## Examples

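```r
# NOT RUN {
jaccard.index(1:10, 2:20)
# Count the unique (CDR3 amino acid sequence, V gene) pairs in each of the
# first two clonesets. nrow() is used because length() of a data.frame
# returns its number of columns, not its number of rows.
a <- nrow(unique(immdata[[1]][, c('CDR3.amino.acid.sequence', 'V.gene')]))
b <- nrow(unique(immdata[[2]][, c('CDR3.amino.acid.sequence', 'V.gene')]))
# Next
jaccard.index(a, b, repOverlap(immdata[1:2], .seq = 'aa', .vgene = T))
# is equal to
repOverlap(immdata[1:2], 'jaccard', .seq = 'aa', .vgene = T)
# }
```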
