To reduce the collision rate, the hash size should be greater than or equal to the smallest power of two that is at least the number of unique values in the input data.frame.
The value computed is a theoretical minimum hash size: it only means that, in the best case, all computed hashes can just fit within a hash of this size.
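As a minimal sketch, the computation described above can be written as follows (`min_hash_size` is a hypothetical helper name, not part of any package):

```python
def min_hash_size(values):
    """Smallest power of two that is at least the number of unique values."""
    n_unique = len(set(values))
    size = 1
    while size < n_unique:
        size *= 2  # double until every unique value could fit
    return size

# 5 unique values ("a", "b", "c", "d", "e") -> next power of two is 8
print(min_hash_size(["a", "b", "c", "a", "d", "e"]))
```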
Intuitively, if the distribution of hashes produced by the algorithm were perfect, then at the computed size each combination of bits in the hash vector would correspond to exactly one unique value of your data.frame.
Because a bit takes its value in {0, 1}, the computed size is 2^x, with x large enough for the hash vector to represent every original value.
In practice, there will be some collisions at the computed size because the distribution of hashes is not perfect. However, the Murmur3 hashing algorithm is known to minimize the number of collisions while remaining very fast.
The only known way to guarantee zero collisions is to build a dictionary of values and, for each new value to hash, check whether that value is already in the dictionary. This is not performant at all.
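The dictionary approach can be sketched like this (a hypothetical illustration, not an API from any package): every distinct value gets the next free integer id, so collisions are impossible, but every value ever seen must be kept in memory and looked up on each call.

```python
def perfect_hash_factory():
    """Return a collision-free hash function backed by a growing dictionary."""
    seen = {}

    def perfect_hash(value):
        if value not in seen:
            seen[value] = len(seen)  # assign the next free slot
        return seen[value]

    return perfect_hash

h = perfect_hash_factory()
# Identical inputs map to identical ids; distinct inputs never collide.
print([h(v) for v in ["a", "b", "a", "c"]])  # [0, 1, 0, 2]
```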
If you increase the computed size (by multiplying it by 2^x, where x is up to you), you will reduce the collision rate.
If you use a value under the computed size, collisions are guaranteed (100% chance), since there are more unique values than possible hashes.
There is a trade-off between the collision rate and the memory used to store hashes. Machine learning algorithms usually deal well with collisions when the rate is reasonable.
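This trade-off can be illustrated with a small simulation. The sketch below estimates the collision rate when hashing 1024 distinct values into buckets of increasing size; MD5 stands in for Murmur3 purely for convenience (both aim for a roughly uniform distribution), and `collision_rate` is a hypothetical helper name:

```python
import hashlib

def collision_rate(values, size):
    """Fraction of values that land in an already-occupied bucket."""
    buckets = {int(hashlib.md5(v.encode()).hexdigest(), 16) % size
               for v in values}
    return 1 - len(buckets) / len(values)

values = [f"value_{i}" for i in range(1024)]
# Doubling the hash size lowers the collision rate, at the cost of memory.
for size in [1024, 2048, 4096, 8192]:
    print(size, round(collision_rate(values, size), 3))
```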