Agile Java Man: Feature Hashing [PDF]

Apr 1, 2017 - Here's a cool little trick called feature hashing. The idea is that you want to turn some corpus of text i

1 downloads 18 Views 94KB Size

Recommend Stories


[PDF] Rich Man, Poor Man
If you want to go quickly, go alone. If you want to go far, go together. African proverb

hashing i
Never let your sense of morals prevent you from doing what is right. Isaac Asimov

[PDF] Agile Project Management
Everything in the universe is within you. Ask all from yourself. Rumi

[PDF] Agile Project Management
If you want to become full, let yourself be empty. Lao Tzu

PdF Download Agile Testing
Your task is not to seek for love, but merely to seek and find all the barriers within yourself that

[PDF] Agile Testing
The beauty of a living thing is not the atoms that go into it, but the way those atoms are put together.

PDF Agile Project Management
The greatest of richness is the richness of the soul. Prophet Muhammad (Peace be upon him)

PDF Agile Project Management
Everything in the universe is within you. Ask all from yourself. Rumi

[PDF] Agile Software Requirements
At the end of your life, you will never regret not having passed one more test, not winning one more

PdF Agile Software Requirements
The happiest people don't have the best of everything, they just make the best of everything. Anony

Idea Transcript


More Next Blog»

Create Blog Sign In

Agile Java Man Musings on Data Science, Software Architecture, Functional Programming and whatnot.

Saturday, April 1, 2017

Feature Hashing Here's a cool little trick called feature hashing. The idea is that you want to turn some corpus of text into feature vectors without resorting to an external repository for the indices in those vectors. Looking things up in an external repository can be slow and complicates the code base. Instead, you avoid this through hashing the words to an integer. Spark uses the HashingTF class for this.



Blog Archive 2018 (14) t 2017 (41) December (7)

However, the Spark code is very dependent on all your words being able to fit into positive Ints. This is probably OK for most corpus of text but when you start using n-grams ("Groups of words in a split are known as n-grams." [1]), then the numbers start to approach what an Int can hold. But what's more worrying is that hashing leads to the possibility of hash collisions. How many you expect can be calculated from the formula here. Basically, given M possible slots and N words that need fitting into them, the probability of a word going into a particular slot, z, is 1/M. Therefore, the probability of avoiding a slot is: p(miss z) = 1 - 1 = M - 1 = Y M M

November (3) October (2) September (3)

The probability that we miss z for all N words as we add them to the slots is YN . So, obviously, the probability that the slot receives at

August (3)

least one word is 1-YN .

July (4) June (3)

This is a formula for the probability of a slot being filled but what is the expected number of collisions? Here, we note a nice trick outlined here.

May (6) t April (4) Crash Course in Conditional Random Fields Generating a Gaussian distribution Video games to statistical mechanics Feature Hashing March (2) February (2) January (2)

The expectation of a sum is the sum of the expectations. This theorem holds "always." The random variables you are summing need not be independent. Now, we introduce a function that is 1 if slot z contains one or more words and 0 otherwise. Note: we're not interested in the number of words in the slot. We're only interested if the slot contains something. We call this function for slot z, fz .

2016 (38) 2015 (37)

So, the expected total number of slots that hold at least one word is:

2014 (50)

E(X) = E(f0 ) + E(f1 ) + E(f2 ) + ... + E(fM )

2013 (50) 2012 (28)

The expected value of anything is the sum of all possible values multiplied by their probability. So, the expected value of an individual f for any of the M slots is:

2011 (17) 2010 (24)

E(fz ) = 0 * p(fz =0) + 1 * p(fz =1) = p(fz =1)

About Me

(Note that the expected value need not be an integer nor indeed sensible. The expected number when throwing a 6-sided die is 3.5.)

Phillip Henry Follow

31

View my complete profile

Given M slots, the expected number of slots that contain something is: E(X) = M (1-YN ) So the total number of collisions is the difference between this and N as the Quora link says. For a spike, we used the MurmerHash3 hashing algorithm over 820 million n-grams and the number of slots that had more than one word was 70 million (close to the 74 million this equation predicts when M = 2^32 and N = 820 million). So, although it's a nice idea, Spark's current implementation is not for us. We'll look at rolling-our-own using Longs for which we could reasonably expect no collisions (using the above equation with M = 2^64). [1] Real World Machine Learning. Posted by Phillip Henry at 1:52 AM Labels: algorithm hashing, Big Data, Spark, work

No comments:

Post a Comment

Newer Post Subscribe to: Post Comments (Atom)

Home

Older Post

Smile Life

When life gives you a hundred reasons to cry, show life that you have a thousand reasons to smile

Get in touch

© Copyright 2015 - 2024 PDFFOX.COM - All rights reserved.