Feature hashing, also known as the hashing trick, transforms a text
variable into a new set of numerical variables. It does so by applying
a hash function to the tokens and using the hash values as feature
indices, which allows for a low-memory representation of the text.
This implementation uses the MurmurHash3 hash function.
The argument num_terms controls the number of indices that the hashing
function will map to. This is the tuning parameter for this
transformation. Since the hashing function can map two different tokens
to the same index, a higher value of num_terms will result in a lower
chance of collision.
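
The effect of num_terms on collisions can be illustrated with a minimal Python sketch. It is not this implementation: the function name hash_features is hypothetical, zlib.crc32 stands in for MurmurHash3 only to keep the example dependency-free, and counting tokens per index is an assumption about how the hashed values are aggregated.

```python
import zlib


def hash_features(tokens, num_terms):
    """Count tokens into num_terms buckets chosen by a hash function."""
    counts = [0] * num_terms
    for token in tokens:
        # Assumption: zlib.crc32 is a stand-in for MurmurHash3.
        # It returns an unsigned 32-bit integer, reduced here to an index.
        index = zlib.crc32(token.encode("utf-8")) % num_terms
        counts[index] += 1
    return counts


tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Five distinct tokens cannot fit into four indices without a collision.
small = hash_features(tokens, num_terms=4)

# With many more indices than distinct tokens, collisions become unlikely.
large = hash_features(tokens, num_terms=1024)
```

In both cases the output length is fixed by num_terms, regardless of how many distinct tokens the text contains, which is where the memory saving comes from.
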
The new components will have names that begin with prefix, then
the name of the variable, followed by the tokens all separated by
"-". The variable names are padded with zeros. For example,
if num_terms < 10, their names will be hash1 - hash9.
If num_terms = 101, their names will be hash001 - hash101.
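
The zero-padding rule in the examples above can be sketched in Python as follows. The helper hashed_names is hypothetical, and it reproduces only the padded index shown in the examples (hash1, hash001), not the variable-name portion of the full naming scheme.

```python
def hashed_names(num_terms, prefix="hash"):
    """Build zero-padded names so that all indices share the same width."""
    # The width is the number of digits in the largest index.
    width = len(str(num_terms))
    return [f"{prefix}{i:0{width}d}" for i in range(1, num_terms + 1)]


names_9 = hashed_names(9)      # single-digit indices, no padding needed
names_101 = hashed_names(101)  # three-digit width, padded with zeros
```

Padding to a common width keeps the names in numeric order when they are sorted alphabetically.
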