Feature hashing, or the hashing trick, is a transformation of a text variable
into a new set of numerical variables. This is done by applying a hashing
function over the tokens and using the hash values as feature indices. This
allows for a low-memory representation of the text. This implementation
uses the MurmurHash3 hash function.

The argument num_terms controls the number of indices that the hashing
function will map to. This is the tuning parameter for this transformation.
Since the hashing function can map two different tokens to the same index,
a higher value of num_terms will result in a lower chance of collision.
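
The effect of num_terms can be illustrated with a minimal Python sketch. It is not this implementation: zlib.crc32 stands in for MurmurHash3 so that the example needs only the standard library, and hash_tokens is a hypothetical helper name. Each token is hashed, reduced modulo num_terms to obtain a feature index, and counted at that index.

```python
import zlib


def hash_tokens(tokens, num_terms):
    """Map tokens to a count vector of length num_terms via hashing.

    Assumption: zlib.crc32 is used as a stand-in for MurmurHash3.
    """
    counts = [0] * num_terms
    for token in tokens:
        # The hash value, reduced modulo num_terms, is the feature index.
        index = zlib.crc32(token.encode("utf-8")) % num_terms
        counts[index] += 1
    return counts


tokens = ["the", "cat", "sat", "on", "the", "mat"]  # 5 distinct tokens

small = hash_tokens(tokens, num_terms=4)
large = hash_tokens(tokens, num_terms=1024)

# Every token is counted exactly once, whatever the number of indices.
print(sum(small), len(small))  # 6 4
print(sum(large), len(large))  # 6 1024

# With 4 indices and 5 distinct tokens, at least two tokens must collide.
occupied_small = sum(1 for c in small if c > 0)
print(occupied_small <= 4)  # True
```

With num_terms = 4, the five distinct tokens cannot all receive their own index, so a collision is guaranteed. A larger num_terms spreads tokens over more indices and makes collisions less likely, at the cost of a wider output.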

The new components will have names that begin with prefix, then
the name of the variable, followed by the tokens all separated by -.
The variable names are padded with zeros. For example, if prefix = "hash",
and if num_terms < 10, their names will be hash1 - hash9.
If num_terms = 101, their names will be hash001 - hash101.
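
The zero-padding rule in the example above can be sketched in Python as follows. This reproduces only the prefix-plus-index names from the example (hash1, hash001, and so on), not the full naming scheme, and component_names is a hypothetical helper, not part of this implementation. The padding width is assumed to be the number of digits in num_terms.

```python
def component_names(prefix, num_terms):
    """Build names prefix1..prefixN, zero-padded to a common width.

    Hypothetical helper: the width is taken as the digit count of num_terms.
    """
    width = len(str(num_terms))
    return [f"{prefix}{i:0{width}d}" for i in range(1, num_terms + 1)]


names_small = component_names("hash", 9)
names_large = component_names("hash", 101)

print(names_small[0], names_small[-1])  # hash1 hash9
print(names_large[0], names_large[-1])  # hash001 hash101
```

Padding to a common width keeps the names in the same order when sorted alphabetically as when sorted by index.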