FeatureHashing (version 0.9.1.3)

hash.size: Compute minimum hash size to reduce collision rate

Description

Compute minimum hash size to reduce collision rate

Usage

hash.size(df)

Arguments

df

A data.frame containing the data to hash.

Value

The hash size of feature hashing as a positive integer.

Details

To reduce the collision rate, the hash size should be at least the smallest power of two greater than or equal to the number of unique values in the input data.frame.

The computed value is a theoretical minimum hash size: it only means that, in the best case, every computed hash can be stored without collision at this size.

Intuitively, if the distribution of hashes generated by the algorithm were perfect, then at the computed size each combination of bits in the hash vector would correspond to exactly one unique original value in your data.frame.

Because each bit takes a value in {0,1}, the computed size is 2^x, with x large enough for the hash vector to represent every original value.
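
The computation described above can be sketched in base R as follows. This is an illustrative approximation, not the actual `hash.size` implementation, and it assumes the unique-value count is the sum of per-column unique values:

```r
# Illustrative sketch of the "next power of two" rule described above.
# Assumption: count unique values per column and sum them; the real
# hash.size() may count differently (e.g. per model term).
approx_hash_size <- function(df) {
  n_unique <- sum(vapply(df, function(col) length(unique(col)), integer(1)))
  2^ceiling(log2(n_unique))  # smallest power of two >= n_unique
}

df <- data.frame(a = c("x", "y", "z"), b = c(1, 2, 2))
approx_hash_size(df)  # 3 + 2 = 5 unique values -> next power of two is 8
```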

In practice, there will be some collisions even at the computed size, because the hash distribution is not perfect. However, the Murmur3 hashing algorithm used here is known both to keep the number of collisions low and to be fast.

The only known way to guarantee zero collisions is to build a dictionary of values: for each new value to hash, check whether it already exists in the dictionary. This approach is far too slow in practice.
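
For illustration, such a zero-collision dictionary lookup can be sketched in base R with a hashed environment. The function name `dict_index` is hypothetical, not part of FeatureHashing:

```r
# Dictionary-based indexing with zero collisions (illustrative sketch):
# each previously unseen value gets the next free integer index.
# Guarantees uniqueness, but must store every value ever seen and
# performs a lookup per value, which is why it does not scale.
dict_index <- function(values) {
  dict <- new.env(hash = TRUE)
  next_id <- 0L
  vapply(as.character(values), function(v) {
    if (is.null(dict[[v]])) {
      next_id <<- next_id + 1L
      dict[[v]] <- next_id
    }
    dict[[v]]
  }, integer(1), USE.NAMES = FALSE)
}

dict_index(c("a", "b", "a", "c"))  # 1 2 1 3
```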

If you increase the computed size (multiply it by 2^x, for an x of your choosing), you will reduce the collision rate. If you use a size below the computed one, there is a 100% chance of collisions, since there are more unique values than available slots.
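
The effect of increasing the size can be illustrated with a toy simulation (this uses uniform random bucket assignment, not FeatureHashing or Murmur3): drawing the same number of values into more buckets lowers the fraction of duplicated buckets.

```r
# Toy illustration (not FeatureHashing): assign n values uniformly at
# random to `size` buckets and measure the fraction landing in an
# already-occupied bucket.
collision_rate <- function(n_values, size) {
  buckets <- sample.int(size, n_values, replace = TRUE)
  mean(duplicated(buckets))
}

set.seed(1)
collision_rate(1000, 2^10)  # size close to the number of values: high rate
collision_rate(1000, 2^12)  # 4x larger size: noticeably lower rate
```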

There is a trade-off between the collision rate and the memory used to store hashes. Machine learning algorithms usually cope well with collisions when the rate is reasonable.

Examples

library(FeatureHashing)
data(ipinyou)

# First try with a size of 2^10
mat1 <- hashed.model.matrix(~., ipinyou.train, 2^10, create.mapping = TRUE)

# Extract the mapping
mapping1 <- hash.mapping(mat1)
# Collision rate
mean(duplicated(mapping1))

# Second try: compute the size with hash.size
size <- hash.size(ipinyou.train)
mat2 <- hashed.model.matrix(~., ipinyou.train, size, create.mapping = TRUE)

# Extract the mapping
mapping2 <- hash.mapping(mat2)
# Collision rate
mean(duplicated(mapping2))
