Locality sensitive hashing returns a list of possible matches for
similar documents. How likely is it that a pair of documents will be detected
as a possible match? If h
is the number of minhash signatures,
b
is the number of bands in the LSH function (implying then that the
number of rows r = h / b
), and s
is the actual Jaccard
similarity of the two documents, then the probability p
that the two
documents will be marked as a candidate pair is given by this equation.
$$p = 1 - (1 - s^{r})^{b}$$
According to MMDS,
that equation approximates an S-curve. This implies that there is a threshold
(t
) for s
approximated by this equation.
$$t = \frac{1}{b}^{\frac{1}{r}}$$