You miss 100% of the shots you don’t take. Wayne Gretzky
Idea Transcript
…
20
feature hashing?
2
wx: plzsendmetextm…
12
Let's say we want to design a function v = phi(x), which from a d-dimensional vector x = (x(1), x(2), ..., x(d)) outputs a new m-dimensional vector v, with m either greater or smaller than d.
In other words, phi can be used either for reducing dimensionality of x (d > m) or for sparsifying x (m > d).
One way to do so is to use a hash function h to map x(1) to v(h(1)), x(2) to v(h(2)), ..., x(d) to v(h(d)). Hash functions are functions which, from an arbitrary integer input, can output an integer in a specified range.
Good hash functions should have uniform output and obey the avalanche effect: a small perturbation in the input must result in a great change in the output. This ensures that any dimension in x will be mapped to a random dimension in v. Note that this will typically results in collisions (two dimensions in x can be mapped to the same dimension in v) but in practice, this won't affect performance if m is big enough.
In document classification, documents are typically transformed to vectors in the bag-of-word representation. The problem with that is that you don't know the dimensionality of your vectors until you've made an entire pass over your dataset. Fortunately there exist hash functions which take string as input.
We can thus map words to a dimension in v as we go. I think this is why it is called the hashing trick: similarly to the kernel trick, v = phi(x) never needs to be computed explicitly. The ability to transform documents as you go is very useful in online algorithms such as SGD. Vowpal Wabbit uses something like 2^26 for m.
I've also seen hash functions used to automatically generate cross-product features. If you have a hash function which gives you an integer from two integers, i.e. i = h(a,b), you can now map the combined feature x(a) * x(b) to v(i). Cross-product features can be useful to model the interactions between features.
The paper "Feature Hashing for Large Scale Multitask Learning" (Weinberger et al., ICML09) also shows how to use the hashing trick for multi-task learning. For example, in spam filtering, since each user typically receives different kinds of spam, it would be nice if we could use one binary classifier per user.
This is an example of multi-task learning since each task is related. Obviously it would require massive amounts of storage to store the weight vector of each classifier. The authors show that it is possible to use one hash function per user (as well as one global function) to combine the different users into one feature space. This is a recommended reading as it also covers the background on the hashing trick.
To summarize some of the applications: dimensionality reduction sparsification bag-of-words on the fly cross-product features multi-task learning