Featurization - feature hashing
Now it is time to transform the string representation into a numeric one. We adopt a bag-of-words approach; however, we use a trick called feature hashing. Let's look in more detail at how Spark employs this powerful technique to help us construct and access our tokenized dataset efficiently. We use feature hashing as a time-efficient implementation of a bag-of-words, as explained earlier.
At its core, feature hashing is a fast and space-efficient method for dealing with high-dimensional data, which is typical when working with text, by converting arbitrary features into indices within a vector or matrix. This is best described with an example text. Suppose we have the following two movie reviews:
1. The movie Goodfellas was well worth the money spent. Brilliant acting!
2. Goodfellas is a riveting movie with a great cast and a brilliant plot - a must see for all movie lovers!
For each token in these reviews, we can apply a "hashing trick," whereby we assign each distinct token a number. So, the unique tokens (after lowercasing and text processing) in the preceding two reviews, listed in alphabetical order, would be:
{"acting": 1,...
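The idea of turning tokens into vector indices can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Spark's implementation: the helper names `tokenize` and `hash_features`, the vector size of 16, and the use of `zlib.crc32` as the hash function are all choices made for this example, and the tokenizer is a simplified stand-in for the text processing described earlier. Unlike the dictionary above, the hashed version never stores the vocabulary; the index of a token is computed directly from the token itself.

```python
import re
import zlib


def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplified tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def hash_features(tokens, num_features=16):
    """Build a fixed-length count vector using the hashing trick.

    Each token is hashed and reduced modulo num_features to pick an index,
    so no token-to-index dictionary has to be built or kept in memory.
    crc32 is an assumption for this sketch, chosen because it is deterministic.
    """
    vector = [0] * num_features
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


reviews = [
    "The movie Goodfellas was well worth the money spent. Brilliant acting!",
    "Goodfellas is a riveting movie with a great cast and a brilliant plot - "
    "a must see for all movie lovers!",
]

# One fixed-length vector per review, regardless of vocabulary size.
vectors = [hash_features(tokenize(review)) for review in reviews]
for vector in vectors:
    print(vector)
```

Because the vector length is fixed in advance, two different tokens can land on the same index (a collision). A larger `num_features` makes collisions less likely at the cost of a longer, sparser vector.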