Feature Hashing - LAM [PDF]


Feature Hashing 2016-06-09

In machine learning, feature hashing, also known as the hashing trick[1] (by analogy to the kernel trick), is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array.

In a typical document classification task, the input to the machine learning algorithm (both during learning and classification) is free text. From this, a bag of words (BOW) representation is constructed: the individual tokens are extracted and counted, and each distinct token in the training set defines a feature (independent variable) of each of the documents in both the training and test sets.

Machine learning algorithms, however, are typically defined in terms of numerical vectors. Therefore, the bags of words for a set of documents are regarded as a term-document matrix where each row is a single document and each column is a single feature/word; the entry (i, j) in such a matrix captures the frequency (or weight) of the j-th term of the vocabulary in document i. Typically, these vectors are extremely sparse.

The common approach is to construct, at learning time or prior to that, a dictionary representation of the vocabulary of the training set, and use that to map words to indices. E.g., the three documents

    John likes to watch movies.
    Mary likes movies too.
    John also likes football.

can be converted, using the dictionary

    Term        Index
    John        1
    likes       2
    to          3
    watch       4
    movies      5
    Mary        6
    too         7
    also        8
    football    9

to the term-document matrix

    1 1 1 1 1 0 0 0 0
    0 1 0 0 1 1 1 0 0
    1 1 0 0 0 0 0 1 1
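The dictionary-based conversion described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: tokens are split on whitespace, punctuation has already been removed, and the helper names `build_vocabulary` and `term_document_matrix` are made up for this example rather than taken from any library.

```python
# Three example documents, with punctuation already removed.
docs = [
    "John likes to watch movies",
    "Mary likes movies too",
    "John also likes football",
]


def build_vocabulary(documents):
    """Assign each distinct token a 1-based index in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in doc.split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def term_document_matrix(documents, vocab):
    """Build one row of token counts per document, one column per vocabulary term."""
    matrix = []
    for doc in documents:
        row = [0] * len(vocab)
        for token in doc.split():
            index = vocab.get(token)
            if index is not None:  # tokens outside the vocabulary are silently dropped
                row[index - 1] += 1
        matrix.append(row)
    return matrix


vocab = build_vocabulary(docs)
matrix = term_document_matrix(docs, vocab)
for row in matrix:
    print(row)
```

Note that the `vocab` dictionary has to be stored alongside the model, and that any token missing from it is ignored at prediction time.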

(Punctuation was removed, as is usual in document classification and clustering.)

The problem with this process is that such dictionaries take up a large amount of storage space and grow in size as the training set grows. Conversely, if the vocabulary is kept fixed and not increased with a growing training set, an adversary may try to invent new words or misspellings that are not in the stored vocabulary so as to circumvent a machine-learned filter.
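For comparison, the hashing trick described at the start of the post avoids the dictionary altogether: each token is hashed and the hash value, reduced modulo a fixed vector length, is used directly as the index. The sketch below is a minimal illustration, not a reference implementation. The bucket count of 16, the use of MD5 from `hashlib` as the hash function, and the names `hash_index` and `hashed_vector` are all assumptions made for this example. A stable hash is used because Python's built-in `hash()` for strings varies between interpreter runs.

```python
import hashlib


def hash_index(token, n_buckets):
    """Map a token to a bucket index using a hash that is stable across runs."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hashed_vector(document, n_buckets=16):
    """Count tokens into a fixed-length vector without storing any vocabulary."""
    vec = [0] * n_buckets
    for token in document.split():
        # Distinct tokens may collide in one bucket; their counts then add up.
        vec[hash_index(token, n_buckets)] += 1
    return vec


vec = hashed_vector("John likes to watch movies")
print(vec)

# A previously unseen or misspelled word still lands in some bucket.
print(hash_index("footbal", 16))
```

The vector length stays fixed no matter how many distinct tokens appear, and unseen words are not dropped. The price is that collisions can merge unrelated tokens into the same feature, which a larger bucket count makes less likely.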

Copyright © by Vincent Lam