ham returns the pairwise distances between the rows (observations) of a single matrix if mat_1 equals mat_2. Otherwise, ham returns the distances between the rows of the two matrices mat_1 and mat_2.
The Hamming distance can still be computed in the presence of missing data by applying the following formula. Assume that A and B are two matrices such that ncol(A) = ncol(B).
The Hamming distance between the \(i^{th}\) row of A and the \(k^{th}\) row of B equals:
$$\mbox{ham}(A_i,B_k) = \frac{\sum_j 1_{\left\{A_{ij} \neq B_{kj}\right\}}}{\sum_j 1}\times\left(\frac{\sum_j 1}{\sum_j 1_{\left\{!\mbox{is.na}(A_{ij}) \& !\mbox{is.na}( B_{kj})\right\}}}\right)$$
where \(i = 1,\dots,\mbox{nrow}(A)\) and \(k = 1,\dots,\mbox{nrow}(B)\), and the right-hand factor of the product is a specific weight applied when NAs are present in \(A_i\) and/or \(B_k\).
This weighting is not implemented in the cdist function, and the Hamming distance cannot be computed using the dist function either.
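The formula above can be read as "number of mismatches divided by the number of columns observed in both rows". The following is a minimal sketch of that reading, not the package's implementation: the helper name ham_sketch is hypothetical, and it assumes that a comparison involving an NA is simply not counted as a mismatch.

```r
# Hypothetical helper illustrating the formula; not the implementation of ham.
ham_sketch <- function(A, B) {
  A <- as.matrix(A)
  B <- as.matrix(B)
  stopifnot(ncol(A) == ncol(B))
  out <- matrix(NA_real_, nrow(A), nrow(B))
  for (i in seq_len(nrow(A))) {
    for (k in seq_len(nrow(B))) {
      # Columns observed in both row i of A and row k of B
      ok <- !is.na(A[i, ]) & !is.na(B[k, ])
      if (any(ok)) {
        # (mismatches / ncol) * (ncol / observed) = mismatches / observed
        out[i, k] <- sum(A[i, ok] != B[k, ok]) / sum(ok)
      }
      # Otherwise the distance stays NA: no column is observed in both rows
    }
  }
  out
}

A <- rbind(c(1, 0, 1), c(0, NA, 1))
B <- rbind(c(1, 1, 1), c(NA, 0, 0))
D <- ham_sketch(A, B)
D
```

With these toy matrices, row 1 of A and row 1 of B differ in one of three fully observed columns, while row 2 of A and row 2 of B share only one observed column, which differs.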
The Hamming distance cannot be calculated in only two situations:
If a row of A or B has only missing values (i.e., a missing value in every column of A or B respectively).
If the union of the indexes of the missing values in row i of A and the indexes of the missing values in row k of B covers all the columns considered.
Example: Assuming that \(\mbox{ncol}(A) = \mbox{ncol}(B) = 3\), if \(A_i = (1,\mbox{NA},0)\) and \(B_k = (\mbox{NA},1,\mbox{NA})\), then for each column the information is missing either in row i of A or in row k of B, which gives: \(\mbox{ham}(A_i,B_k) = \mbox{NA}\).
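The example above can be checked by hand in a few lines. This is only an illustrative sketch under the same reading of the formula (mismatches divided by the number of columns observed in both rows); the variable names are hypothetical.

```r
A_i <- c(1, NA, 0)
B_k <- c(NA, 1, NA)

# Columns where both rows are observed: there are none here
ok <- !is.na(A_i) & !is.na(B_k)
n_obs <- sum(ok)

# The weight in the formula would divide by n_obs = 0, so the distance is set to NA
d <- if (n_obs > 0) sum(A_i[ok] != B_k[ok]) / n_obs else NA_real_
d
```

Without the explicit guard, the division would be 0/0, which R evaluates to NaN rather than NA.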
If one of mat_1 and mat_2 is a vector and the other is a matrix (or data.frame), the length of the vector must be equal to the number of columns of the matrix.
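As an illustration of the dimension requirement, the sketch below compares a vector with each row of a matrix. It assumes, for illustration only, that the vector is treated as a single observation; the names v and M are hypothetical and the data contain no missing values.

```r
v <- c(1, 0, 1)
M <- rbind(c(1, 1, 1), c(0, 0, 1))

# The vector must have one entry per column of the matrix
stopifnot(length(v) == ncol(M))

# View the vector as a one-row matrix (assumed treatment)
v_row <- matrix(v, nrow = 1)

# Proportion of mismatching columns between v and each row of M
# (t(M) has one column per row of M, so v is recycled column by column)
d <- colMeans(t(M) != v)
d
```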