Sentiment Analysis of IMDb movie reviews

Talal Ahmed, Rutgers University

1. INTRODUCTION

Hundreds of newspaper articles, blogs, magazines and product reviews are released on the web every day. For instance, The New York Times maintains an online database of newspapers spanning the 20 years between 1987 and 2007; it contains 1.8 million articles, many of which have been manually annotated for people, places and organizations. An important question is how to make sense of all this abundant information. It can potentially be used to infer the popular view on a particular social issue, political situation or even a new movie. Under the assumption that online sentiment represents the general sentiment of the public, this abundant information can be used to gauge public sentiment, and if it is analyzed over a period of time, changes in public sentiment on a particular topic can be tracked as well.

The particular problem we address in this work is sentiment analysis of IMDb reviews. The objective is to design a procedure that assigns either a positive (1) or a negative (0) sentiment to a given IMDb movie review. Given k IMDb reviews of a particular movie, our algorithm assigns either a 1 or a 0 to each review, and the k sentiment assignments can then be used to estimate the popular sentiment towards that movie.

A lot of work has already been done on sentiment analysis of movie reviews. The most straightforward method is to count the number of times each word from the model vocabulary appears in a review and use these counts to form the feature vector of the review. However, this bag-of-words representation loses the word order, so different reviews with identical word composition have identical vector representations. One of the most prominent alternative models is the Vector Space Model (VSM), in which each word is represented by a vector (known as a word vector) and similarities between words are represented by the distances between their respective word vectors. One of the popular works on unsupervised VSMs is [Maas and Ng 2010], where the context in which a word appears is used to infer the meaning of the word; words that appear in similar contexts are assumed to be similar in meaning. This semantic similarity is captured in the word vector representation, so that the word vectors of semantically similar words are close to each other. However, because no sentiment polarity information is associated with the training samples used to learn the word vectors, the representation does not capture sentiment similarity between words. The work in [Maas and Ng 2010] was therefore extended in [Maas et al. 2011], which uses the label information of documents to learn word vectors that capture semantic as well as sentiment similarities between words. For example, the algorithm in [Maas and Ng 2010] will capture that terrible and awful are similar in meaning, but it will not capture the negative sentiment associated with these words. The word vector representations of the two words learnt with the algorithm

in [Maas et al. 2011], by contrast, capture not only the semantic but also the sentiment similarity. These algorithms were further extended in [Le and Mikolov 2014], which proposed a way of combining unlabeled data with the supervised learning approach of [Maas et al. 2011]. In [Le and Mikolov 2014], each review is assigned a unique paragraph vector. The authors propose an unsupervised learning framework that learns a continuous vector representation for each review, trained to be useful for learning word vectors: when learning the vector for a word in the model vocabulary, the context in which the word appears is used along with the paragraph vector of the review. Each review thus has a unique paragraph vector associated with it, whereas the word vectors are shared across reviews. There are also works that go beyond word-level representations to achieve phrase-level representations [Yessenalina and Cardie 2011]. In this work, however, we focus on using word-level and paragraph-level representations for sentiment analysis of IMDb reviews.

Most works on sentiment analysis of IMDb reviews use the 25k labeled training samples available at http://ai.stanford.edu/~amaas/data/sentiment/index.html for training; the dataset contains 25k labeled training samples and 50k unlabeled training samples in total. We instead analyze the sentiment analysis problem on the IMDb dataset in the setting where labeled training data is not available in abundance. In particular, we propose a largely unsupervised approach that uses the 50k unlabeled training samples for training together with only a few hundred labeled training samples.

2. APPROACHES

2.1 Learning Word Vectors

Each word in a review is represented by a word vector. The objective is to learn the vector representation of a word given the context of the word, where the context is the set of words preceding and succeeding it. The word vectors in the context of a word are either concatenated or averaged to predict the word. Mathematically, given a sequence of words $w_1, w_2, \ldots, w_T$ in a review and a context of size $2k$, the log-likelihood of a word $w_i$ in the review is

$$\log p(w_i \mid w_{i-k}, w_{i-k+1}, \ldots, w_{i+k-1}, w_{i+k}) \qquad (1)$$

The objective of the model is to maximize the average log-likelihood

$$\frac{1}{T} \sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, w_{t-k+1}, \ldots, w_{t+k-1}, w_{t+k}) \qquad (2)$$

After the training process converges, words that appear in similar contexts tend to have similar vector space representations. Thus, words with similar meanings tend to have word vectors that are close to each other in the vector space.
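The paper does not name an implementation for this model; as a minimal sketch, the CBOW objective in Eq. (2) can be trained with gensim's Word2Vec, where window plays the role of the context size k (the value below is a placeholder) and sg=0 selects the CBOW architecture:

    from gensim.models import Word2Vec

    # Each review is a list of lowercase tokens (see Section 4.2).
    reviews = [
        "this movie was a wonderful surprise".split(),
        "the plot was terrible and the acting was awful".split(),
    ]

    # sg=0 selects CBOW: the averaged context vectors predict the centre
    # word, i.e. the log-likelihood objective in Eq. (2).
    model = Word2Vec(
        sentences=reviews,
        vector_size=150,  # word vector dimensionality used in Section 4
        window=5,         # context size k (placeholder value)
        sg=0,
        min_count=1,
        epochs=10,
    )

    # After convergence, similar words should end up with nearby vectors.
    print(model.wv.most_similar("terrible", topn=3))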




Fig. 1. Framework for learning of the word vector model [Le and Mikolov 2014].

2.2 Learning Paragraph Vectors

This model is inspired by the word vector model introduced in the previous section. There, the word vectors in the context of a word are averaged or concatenated to predict the word. In the paragraph vector model, not only the context words but also a paragraph vector contributes to the prediction task: the word vectors in the context are averaged or concatenated with the paragraph vector of the review to predict the word and form the log-likelihood. The paragraph vector can thus be seen as an extra word that contributes to the prediction process, and the paragraph vector model as an extension of the word vector model. Each review is assigned a unique paragraph vector, the word vectors are shared among occurrences of the same word in different reviews, and the paragraph vector is shared among all the words in a review. During training, the context is fixed in length and can be seen as a window sliding over the paragraph: as the window slides over each review, the word vector of each word is updated using the averaged word vectors in its context together with the paragraph vector of the review. The word vectors and paragraph vectors are trained alternately using stochastic gradient descent until convergence. In the end, each review has a paragraph vector and each word in the model vocabulary has a word vector associated with it. One advantage of the paragraph vector model over the word vector model is that it can use unlabeled data to learn paragraph vector representations. Another is that paragraph vectors indirectly capture the word order in a review, so two reviews with the same word composition but different word order will not have the same paragraph vector representation.
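As a sketch of how such a model might be trained in practice (again an assumption, since the paper names no toolkit), gensim's Doc2Vec with dm=1 implements the distributed-memory paragraph vector model of [Le and Mikolov 2014], and infer_vector estimates a paragraph vector for an unseen review while the word vectors stay fixed:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Each review gets a unique tag; the tag indexes its paragraph vector.
    corpus = [
        TaggedDocument(words="an excellent heartfelt film".split(), tags=[0]),
        TaggedDocument(words="a dull and predictable story".split(), tags=[1]),
    ]

    # dm=1 selects the distributed-memory model: the paragraph vector is
    # combined with the context word vectors to predict each word.
    model = Doc2Vec(corpus, vector_size=150, window=5, dm=1,
                    min_count=1, epochs=20)

    train_vec = model.dv[0]  # paragraph vector of the first training review

    # For an unseen review, gradient descent updates only the paragraph
    # vector while the learnt word vectors are held fixed.
    new_vec = model.infer_vector("a surprisingly moving picture".split())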

3. ALGORITHMS

This section explains how the paragraph-level and word-level vector representations are used for unsupervised and supervised learning, respectively.

3.1 Unsupervised Algorithm

Fig. 2. Framework for learning of the paragraph vector model [Le and Mikolov 2014].

3.1.1 Approach A

(1) Use the unlabeled training data to train the paragraph vector model. Each training review is represented by a paragraph vector.
(2) Cluster the training paragraph vectors using the K-means algorithm.
(3) Tag each of the K clusters with a positive or a negative sentiment; a few hundred labeled training samples can be employed for this purpose. The paragraph vectors for the labeled samples are estimated using the learnt model and assigned to the K clusters using the nearest centroid classifier. If the number of positive reviews assigned to a cluster is greater than the number of negative reviews, the cluster is tagged with a positive sentiment; otherwise, it is tagged with a negative sentiment.
(4) Use the nearest centroid classifier with the K tagged clusters for classification (see the sketch below).

3.1.2 Approach B

(1) Use the unlabeled training data to train the paragraph vector model.
(2) Use a few labeled samples to extract sentiment information from the learnt model: the word vectors are fixed and paragraph vectors are estimated for the labeled samples.
(3) Use the labeled paragraph vectors to train a logistic regression model.
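A minimal sketch of Approach A using scikit-learn's KMeans; the paragraph vectors here are random stand-ins for vectors learnt by the model above, and the cluster count K is a placeholder:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    unlabeled_vecs = rng.standard_normal((5000, 150))  # learnt paragraph vectors
    labeled_vecs = rng.standard_normal((300, 150))     # a few hundred labeled ones
    labels = rng.integers(0, 2, size=300)              # 1 = positive, 0 = negative

    # Steps (1)-(2): cluster the training paragraph vectors into K clusters.
    K = 10
    kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(unlabeled_vecs)

    # Step (3): assign each labeled vector to its nearest centroid and tag a
    # cluster positive when it receives more positive than negative reviews.
    assigned = kmeans.predict(labeled_vecs)
    cluster_sentiment = np.zeros(K, dtype=int)
    for c in range(K):
        votes = labels[assigned == c]
        cluster_sentiment[c] = int(votes.sum() > len(votes) / 2)

    # Step (4): a test review inherits the sentiment of its nearest cluster.
    def classify(test_vec):
        return cluster_sentiment[kmeans.predict(test_vec.reshape(1, -1))[0]]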

3.2 Supervised Algorithm

(1) The word vector model is trained using labeled training samples. Each word in the model vocabulary is assigned a word vector.
(2) (Optional) All words except adjectives, verbs and adverbs are deleted from each review.
(3) For each labeled training sample, the word vectors of all the words in the review are averaged to form an average word vector for that review.
(4) A random forest classifier is trained on the average word vectors of the labeled training samples (see the sketch below).
(5) For each test sample, the average word vector is computed using the word vector model and classified with the trained random forest classifier.
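A sketch of steps (3)-(5) on a toy corpus; the tokenization and hyperparameters are placeholders, and gensim/scikit-learn stand in for whatever implementation the paper actually used:

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.ensemble import RandomForestClassifier

    # Toy stand-in for the labeled training reviews and their sentiments.
    train_reviews = ["a truly great film".split(), "boring and awful".split()]
    train_labels = [1, 0]

    w2v = Word2Vec(train_reviews, vector_size=150, window=5, sg=0, min_count=1)

    def average_word_vector(tokens, wv, dim=150):
        # Step (3): average the vectors of all in-vocabulary words.
        vecs = [wv[w] for w in tokens if w in wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    # Step (4): train a random forest on the average word vectors.
    X = np.stack([average_word_vector(r, w2v.wv) for r in train_reviews])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, train_labels)

    # Step (5): featurise and classify a test review the same way.
    test_vec = average_word_vector("great acting".split(), w2v.wv)
    print(clf.predict(test_vec.reshape(1, -1)))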

3.3 Improving supervised learning performance using unsupervised learning

This section combines the paragraph vector model (unsupervised) with the word vector model (supervised) to obtain better classification performance.

(1) The labeled and unlabeled training samples are used to train the paragraph vector model.
(2) The labeled training samples are used to train the word vector model.
(3) For each labeled training review, the paragraph vector and the average word vector are estimated and concatenated to form the feature vector for that review.
(4) The feature vectors of the labeled training reviews are used to train a random forest classifier.
(5) For each test review, the feature vector is formed by concatenating the paragraph vector with the average word vector; the trained random forest classifier is used for classification.

4. EXPERIMENTS

We perform experiments to analyze the performance of the unsupervised and supervised approaches outlined in the previous section. The true classification rate is used as the performance metric, defined as the fraction of the 25k test reviews labeled correctly.

4.1 Dataset

We use the IMDb dataset first used in [Maas et al. 2011] for sentiment analysis. The dataset consists of 100k IMDb reviews, of which 50k are labeled and 50k are unlabeled, divided into 50k unlabeled training reviews, 25k labeled training reviews and 25k test reviews. The label on each labeled sample is either a positive sentiment (1) or a negative sentiment (0), and the training and test subsets each contain roughly equal numbers of positive and negative reviews. The dataset can be accessed at http://ai.stanford.edu/~amaas/data/sentiment/index.html.
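For reference, a minimal loader for the standard directory layout of this release (the aclImdb path below is an assumption about where the archive was unpacked):

    from pathlib import Path

    def load_reviews(directory):
        # Read every review file in one split of the dataset.
        return [p.read_text(encoding="utf-8")
                for p in sorted(Path(directory).glob("*.txt"))]

    pos_train = load_reviews("aclImdb/train/pos")  # 12.5k positive labeled reviews
    neg_train = load_reviews("aclImdb/train/neg")  # 12.5k negative labeled reviews
    unsup = load_reviews("aclImdb/train/unsup")    # 50k unlabeled reviews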

4.2 Preprocessing

Before performing any analysis, we clean up the data to make it easier to process and to get rid of the noisy part of the dataset; removing noise can greatly increase the accuracy of our proposed approaches. The IMDb reviews contain HTML tags, which are removed since they do not contribute towards sentiment analysis. We also remove punctuation marks to ease data processing, even though that removes emoticons as well; there are not many emoticons in IMDb reviews, so we do not lose much information. Finally, we lowercase all text to ensure consistency in the model vocabulary; otherwise, we may end up with multiple versions of the same word in the vocabulary, each with a different word vector representation.
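A sketch of this cleaning pipeline; BeautifulSoup is one common choice for stripping the HTML tags, though the paper does not name a tool:

    import re
    from bs4 import BeautifulSoup

    def clean_review(raw):
        text = BeautifulSoup(raw, "html.parser").get_text()  # strip HTML tags
        text = re.sub(r"[^a-zA-Z]", " ", text)  # keep letters only (drops
                                                # punctuation and emoticons)
        return text.lower().split()             # lowercase and tokenize

    print(clean_review("This movie was <br /><br />GREAT, really great!"))
    # ['this', 'movie', 'was', 'great', 'really', 'great']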

4.3 Unsupervised Approach

To test the unsupervised approach, the paragraph vector model is trained using 75k reviews without labels (the 50k unlabeled reviews together with the 25k training reviews, ignoring their labels); this is one of the advantages of the paragraph vector model: since it can be trained on unlabeled reviews, the whole training set can be used. The dimensionality of the paragraph vectors is set to 150. The 75k paragraph vectors are clustered into K clusters using the K-means algorithm, and each cluster is modeled as representative of either a positive or a negative sentiment. The next step is to determine the polarity of each cluster, which is estimated using only 1k of the 25k labeled samples available. The paragraph vector representation of the 1k labeled reviews is learnt with the unsupervised model; note that the word vectors are kept fixed while the paragraph vectors of the 1k labeled reviews are estimated using the stochastic gradient algorithm. Each of the 1k paragraph vectors is then assigned to one of the K clusters using the nearest centroid classifier, and each cluster is assigned a positive or a negative label depending on whether more positive or more negative labeled training reviews are assigned to it. Once the clusters are tagged, each test review is assigned to one of the K clusters using the nearest centroid classifier: if the test review is assigned to a positive cluster, it is assigned a positive sentiment, and vice versa. The performance of the algorithm is analyzed for different values of K; the results are shown in Fig. 3. The highest true classification rate for any number of clusters is 0.77.

Fig. 3. Performance of K-means clustering as a function of the number of clusters.

In the second approach, the paragraph vector model is again trained using the 75k unlabeled reviews, and again only 1k of the 25k labeled training samples are used to extract sentiment information from the unsupervised paragraph vector model: the word vectors are fixed in the model and paragraph vectors are learnt for the 1k labeled training samples. The 1k paragraph vectors are used to train a logistic regression model, which is then used to classify the test samples. Note that the 75k unlabeled samples are used for training and the 1k labeled samples are only used to find the label association. The resulting true classification rate is 0.827, as reported in the table in Section 4.5.
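A sketch of this second approach (Approach B), again assuming gensim's Doc2Vec and scikit-learn; the toy reviews stand in for the 75k-review training set and the 1k labeled samples:

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.linear_model import LogisticRegression

    # Stand-in for the paragraph vector model trained on the 75k reviews.
    unlabeled = [TaggedDocument("some review text here".split(), [i])
                 for i in range(2)]
    d2v = Doc2Vec(unlabeled, vector_size=150, window=5, dm=1,
                  min_count=1, epochs=20)

    # The labeled samples enter only here: infer_vector fits a paragraph
    # vector for each labeled review while the word vectors stay fixed.
    labeled_reviews = ["a superb film".split(), "a dreadful mess".split()]
    labeled_sentiments = [1, 0]
    X = np.stack([d2v.infer_vector(r) for r in labeled_reviews])

    logreg = LogisticRegression().fit(X, labeled_sentiments)

    # Test reviews are classified from their inferred paragraph vectors.
    x = d2v.infer_vector("an astonishing performance".split())
    print(logreg.predict(x.reshape(1, -1)))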

4.4 Supervised Approach

To test the supervised approach, the word vector model is trained using the 25k labeled training reviews to learn a word vector for each word in the model vocabulary. The dimensionality of the word vectors is set to 150, and the 5000 most frequent words in the 25k reviews make up the model vocabulary. For each training review, the word vectors of all the words in the review are averaged to form an average word vector, giving 25k average word vectors for the 25k training reviews. Since the training samples have labels, the feature vectors of the 25k reviews can be used to train a logistic regression model, and the learnt logistic function is used to label the test reviews. In a second variant, we extracted the adjectives, verbs and adverbs from each review and computed the average word vector of a review only over the word vectors of these words; the corresponding results are also presented in the table. An interesting question is whether we can do better than the supervised approach by leveraging unlabeled data in the supervised learning process. We answer this question by combining the supervised approach with the unsupervised approach to see if we get an improvement.
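For the adjective/verb/adverb variant, a part-of-speech tagger is needed; a sketch assuming NLTK's pos_tag (the paper does not say which tagger was used), keeping Penn Treebank tags that start with JJ, VB or RB:

    import nltk
    # nltk.download("averaged_perceptron_tagger")  # one-off model download

    def content_words(tokens):
        # Keep only adjectives (JJ*), verbs (VB*) and adverbs (RB*).
        tagged = nltk.pos_tag(tokens)
        return [w for w, tag in tagged if tag[:2] in ("JJ", "VB", "RB")]

    print(content_words("the acting was surprisingly wonderful".split()))
    # typically: ['was', 'surprisingly', 'wonderful']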

4.5 Combining the Unsupervised Approach with the Supervised Approach

The paragraph vector model is trained using the 75k unlabeled reviews, and the word vector model is trained using the 25k labeled training reviews, from which an average word vector is estimated for each of the 25k training reviews. For each of the 25k labeled training reviews, the paragraph vector is estimated while keeping the word vectors fixed, and is then concatenated with the average word vector to form the feature vector of the review; the length of each feature vector is therefore 300. The feature vectors of the 25k training reviews are used to train a logistic regression model, which is used to classify the test reviews (a sketch of the feature construction follows the table). The results of all approaches are reported in the table below.

    Algorithm                                          True Classification Rate
    Unsupervised Approach: Paragraph Vector Model      0.827
    Supervised Approach: Word Vector Model (WVM)       0.834
    WVM with Adj, Vrb, Adv                             0.821
    Unsupervised and Supervised Approaches Combined    0.866
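A sketch of the 300-dimensional feature construction, assuming a trained Doc2Vec model d2v and Word2Vec model w2v as in the earlier sketches:

    import numpy as np

    def combined_feature(tokens, d2v, w2v, dim=150):
        # The 150-d paragraph vector concatenated with the 150-d average
        # word vector gives the 300-d feature of the combined approach.
        para = d2v.infer_vector(tokens)
        vecs = [w2v.wv[w] for w in tokens if w in w2v.wv]
        avg = np.mean(vecs, axis=0) if vecs else np.zeros(dim)
        return np.concatenate([para, avg])

    # Training then mirrors Section 4.4 with these longer feature vectors:
    # X = np.stack([combined_feature(r, d2v, w2v) for r in train_reviews])
    # clf = sklearn.linear_model.LogisticRegression().fit(X, train_labels)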

5. CONCLUSION

We address the sentiment analysis problem for IMDb reviews, where each review is labeled with either a positive (1) or a negative (0) sentiment. In particular, we address a largely unsupervised setting in which plenty of unlabeled data is available but little labeled data is available for training. We also try a supervised learning approach for the scenario in which abundant labeled data is available. We then combine the two approaches into an algorithm that leverages the available unlabeled training data to improve the performance of the supervised learning algorithm.

One weakness of the proposed unsupervised algorithm is that it requires a few labeled samples to extract sentiment information from the learnt unsupervised model. However, such a setup makes sense because even if the only available training data is unlabeled, a few samples can be labeled manually and used with the proposed unsupervised approach; the approach can therefore be applied even when the only training data available is unlabeled.

For future work, we can try different clustering approaches with the proposed unsupervised algorithm. Also, for learning word vectors, we can try skip-grams, which predict the context of a word given the word itself as input. This is in contrast to the continuous bag-of-words (CBOW) approach used in this work, which uses the context of a word in a sentence to predict the word.

REFERENCES

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, 142-150.

Andrew L. Maas and Andrew Y. Ng. 2010. A probabilistic model for semantic word vectors. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111-3119.

Ainur Yessenalina and Claire Cardie. 2011. Compositional matrix-space models for sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 172-182.
