Wednesday, October 14, 2015

Text Classification of Tweets about NFL Draft Prospects


Presented by

Quoc Le, Jeff Ori, Chris Sherman, Gene Totten, Ryan Williams


Introduction

In this analysis, we explore predictive models for classifying unstructured text in tweets about 2014 NFL Draft prospects. Classification of such tweets into categories like “prediction,” “wish,” or “feeling,” as well as into sentiments like “positive” or “negative,” can be a powerful tool for NFL Front Offices seeking to incorporate public opinion about players into their draft predictive models.

Text classification (which for us includes sentiment analysis) is generally a difficult problem. Many of the challenges of text classification, including co-reference resolution, negation handling, and word sense disambiguation, are not solved in general Natural Language Processing (Liu, 2012). In addition, “sports tweeting” is a domain unto itself, with its own lexicon and jargon. As such, we cannot directly use models trained on a general lexicon for prediction in this specific domain since “a classifier trained using opinion documents from one domain often performs poorly on test data from another domain” (Liu, 2012) — an observation that we test as part of our analysis.

Despite these challenges, we show in this analysis that respectable text classification performance can be achieved by labeling our own domain-specific training dataset and using supervised learners that operate on unigram features (also known as a “bag of words” approach).

Literature Review

The literature reviewed by the team is summarized in the References list. Based on this review, we favor text classification using machine learning techniques like Support Vector Machines (SVM) or Naïve Bayes (NB), mainly for their simplicity and higher accuracy compared to more heuristic positive/negative word counts (Miller, 2014). Further, we favor training on unigram (single-item) features over n-grams (contiguous sequences of n items), because in the area of “mood, affect, or sentiment classification,” unigram features are recommended to avoid overfitting (Pustejovsky et al., 2012). In summary, the models we consider are supervised learners that predict a binary (or multinomial) classification based on single-word predictors (the presence or absence of a single word). It has been shown that, in some domains, SVM and NB with unigram features can perform sentiment classification on documents with an accuracy of around 79–80% (Pang and Lee, 2002; Miller, 2014). Random Forests (RF) perform almost as well, at around 74% (Miller, 2014).
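To make the unigram representation concrete, the following sketch (using invented tweets, not our actual data) shows how each tweet reduces to a binary vector over the corpus vocabulary, which is the feature set these supervised learners consume:

```python
# Illustrative unigram ("bag of words") featurization.
# The tweets below are invented examples, not from our dataset.
tweets = [
    "manziel will go first round",
    "hope clowney goes first",
]

# Vocabulary: every distinct word observed across the corpus.
vocab = sorted({word for tweet in tweets for word in tweet.split()})

def unigram_features(tweet, vocab):
    """Binary vector marking the presence (1) or absence (0) of each word."""
    words = set(tweet.split())
    return [1 if w in words else 0 for w in vocab]

X = [unigram_features(t, vocab) for t in tweets]
```

Each classifier then learns a mapping from these presence/absence vectors to a sentiment or category label.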

We also reviewed models that go beyond the “bag of words” approach. For example, there is a new “Sentiment Treebank” approach, which uses a Recursive Neural Network (RNN) to perform sentiment classification (Socher, 2013). This cutting-edge approach uses sentiment labels on entire phrases organized into trees, instead of labels on individual word features. This approach improves the accuracy of sentiment classification for sentences from 80% to 85.4%. In our research, we will evaluate this RNN model alongside the SVM, NB, and RF models discussed earlier.

Our literature review also led to insights on how to refine sentiment analysis in the domain of sports forecasting. For example, we discovered that sports betting sentiment analysis services like Senti-bet classify fan messages into the categories “wish,” “prediction,” and “feeling,” which are valuable distinctions when using sentiment to make forecasts (Grimes, 2012). Since the business application of our sentiment analysis is predicting a player’s value in the draft, the same categories are useful to us.   

We reviewed Holdout Test Sets and K-Fold Cross-Validation as approaches for comparing alternative models. Given the discriminating properties of AUC (area under the receiver operating characteristic, or ROC, curve) over straight accuracy (Ling and Huang, 2003), we favor AUC/ROC for comparing sentiment classifiers, as it is appropriate for binary (positive/negative) classification. Due to the complications of using AUC in a multinomial classification setting, we prefer accuracy and confusion matrices for comparing category (wish/prediction/feeling) classifiers.

Methods

We briefly discuss our methods for Data Acquisition, Exploratory Data Analysis (EDA), and Model Comparison. For Data Acquisition, we used the twitteR package in R to retrieve tweets via Twitter searches. In particular, we searched for “Manziel” and “Clowney” in tweets posted a few days before the 2014 NFL Draft and extracted the Twitter “status,” which is the text of the message. Focusing only on these two players yielded thousands of tweets, which provided ample data for us to label.

The Twitter search used in our Data Acquisition helps to address “entity identification,” which is a key challenge in Sentiment Analysis (Liu, 2012). That is, we already know from the search terms that the tweets are “about” Manziel and Clowney, so we have high confidence that the entities are in fact the topics of the tweets we classify.

Following acquisition, we proceed with data preprocessing and annotation. The main preprocessing tasks involve deleting any remaining missed-entity tweets (i.e., tweets where Manziel or Clowney is not the actual entity), excluding retweets, screening out non-English tweets, removing punctuation, removing Twitter handles (but leaving hashtags), and performing word stemming. The annotation consisted of assigning each tweet a category (prediction/wish/feeling) and a sentiment (positive/negative).
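The cleaning steps above can be sketched as follows. This is an illustrative Python sketch (our actual pipeline was built in R), and the suffix-stripping “stemmer” is a toy stand-in for a real Porter stemmer:

```python
import re

def preprocess(tweet):
    """Illustrative cleaning sketch (our actual pipeline was in R)."""
    if tweet.lower().startswith("rt "):
        return None                          # exclude retweets
    tweet = re.sub(r"@\w+", "", tweet)       # remove Twitter handles
    tweet = re.sub(r"[^\w\s#]", "", tweet)   # strip punctuation, keep hashtags
    words = tweet.lower().split()
    # Toy suffix stripping as a stand-in for real Porter stemming.
    stemmed = [re.sub(r"(ing|ed|s)$", "", w) if len(w) > 4 else w
               for w in words]
    return " ".join(stemmed)
```

Language screening and missed-entity filtering would sit upstream of this function and are omitted here.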

Before jumping into training candidate models, we perform EDA to understand word frequencies and the prior distributions of categories and sentiments in the data. We also establish a quick “sentiment baseline” by using the Stanford Treebank model to get an idea of how a more general sentiment classifier performs on our domain-specific tweet data.
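The word-frequency portion of our EDA can be sketched as follows, using invented tweets in place of our cleaned corpus:

```python
from collections import Counter

# Toy corpus standing in for our cleaned tweets (invented examples).
tweets = [
    "clowney first pick",
    "hope clowney goes first",
    "manziel first round pick",
]

# Frequency of each word across the whole corpus.
freq = Counter(word for tweet in tweets for word in tweet.split())
top_words = freq.most_common(3)  # the most frequent unigrams
```

Category and sentiment priors are computed the same way, counting labels instead of words.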

For Model Comparison, we consider four supervised classifiers that operate on unigram features: SVM, RF, NB, and AdaBoost (AB). We also consider a Hybrid (HYB) model that combines these four classifiers. We fit models for both category and sentiment classification. For category classification, we use accuracy and confusion matrices to compare model performance; for sentiment classification, we use AUC/ROC. Finally, we primarily use R for modeling and analysis, but we also use Python for one model (NB).

Results

We start by presenting a few visualizations from our EDA, which are helpful for understanding the dataset:
 
We can see from the graphs above how the 879 classified tweets in our training data break down by Player, Category, and Sentiment. We also see that although we labeled fewer Clowney tweets in the training data, Clowney received many more positive tweets relative to negative tweets across all categories. In the “wish” category in particular, tweeters expressed positive wishes about Clowney 18 times more often than negative wishes. This clearly suggests that Clowney was held in higher esteem in the court of public opinion, and this sentiment turned out to reflect reality, as Clowney was selected #1 overall and Manziel #22 in the actual NFL Draft.

We now turn our attention to the results of our sentiment classification. Here we use a 30% holdout test set to calculate AUC and develop ROC curves for each of the five classifiers:
 
Using ROC curves and AUC as our comparison metric, we can see that the strongest performer for sentiment classification was the Hybrid model (HYB). The Hybrid model was developed by averaging the predicted probabilities from the four other models. Our AUC of 0.748 for the Hybrid model is very respectable, and it is comparable to results from sentiment analysis in other domains such as movie reviews (Pang and Lee, 2002).
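The Hybrid construction and the AUC computation can be sketched as follows, using made-up predicted probabilities from four hypothetical classifiers (not our fitted models). AUC is computed here via its Mann-Whitney interpretation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one:

```python
def hybrid(prob_lists):
    """Average predicted positive-class probabilities across models."""
    return [sum(ps) / len(ps) for ps in zip(*prob_lists)]

def auc(y_true, scores):
    """AUC as the Mann-Whitney U statistic, normalized: ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up probabilities from four hypothetical classifiers on six tweets.
svm = [0.9, 0.4, 0.7, 0.2, 0.8, 0.3]
rf  = [0.8, 0.5, 0.6, 0.3, 0.9, 0.2]
nb  = [0.7, 0.3, 0.8, 0.4, 0.7, 0.4]
ab  = [0.9, 0.6, 0.5, 0.1, 0.8, 0.3]
y   = [1,   0,   1,   0,   1,   0]

hyb = hybrid([svm, rf, nb, ab])
score = auc(y, hyb)
```

Averaging probabilities is the simplest way to ensemble; weighting models by their individual AUCs is a natural refinement.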

Next we examine a baseline measurement that puts our sentiment classifier results into perspective. As mentioned earlier, one new technique in sentiment classification is the Recursive Neural Network (RNN), which operates on a phrase tree rather than unigrams. The Stanford RNN gives us an idea of how a fairly strong classifier trained on a different lexicon (movie reviews) fares on our dataset:



Clearly, the RNN trained with a different lexicon fares poorly on our collection of tweets, suggesting that the sentiment analysis for our particular business application requires training on a domain specific corpus.

 We have established that our sentiment classifiers are fairly good performers. How do the same models perform on category classification? Since there are three categories (wish/prediction/feeling), we cannot use ROC/AUC, so we focus on accuracy and confusion matrices to compare and understand performance. The results are presented below.
 
The confusion matrices above show percent accuracy overall and within each category. Based on overall accuracy, the RF model is the best category classifier at 58.2%. RF is also the most balanced classifier based on per-category accuracy. For example, it classifies Feeling correctly 63.9% of the time, Prediction 64.5% of the time, and Wish 31.7% of the time. The other classifiers do much better with Feeling, but are weaker at classifying Prediction and Wish.
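The confusion-matrix bookkeeping behind these comparisons can be sketched as follows, with invented labels in place of our actual holdout results:

```python
def confusion(actual, predicted, labels):
    """Confusion matrix as a nested dict: matrix[truth][prediction] = count."""
    matrix = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def per_class_accuracy(matrix):
    """Fraction of each true class predicted correctly."""
    return {c: row[c] / sum(row.values()) for c, row in matrix.items()}

labels = ["feeling", "prediction", "wish"]
# Invented labels for illustration, not our actual holdout set.
actual    = ["feeling", "feeling", "prediction", "prediction", "wish", "wish"]
predicted = ["feeling", "prediction", "prediction", "prediction", "feeling", "wish"]

m = confusion(actual, predicted, labels)
acc = per_class_accuracy(m)
```

The off-diagonal cells reveal which categories a classifier confuses, which overall accuracy alone hides.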

We might be tempted to conclude that our machine learning models are stronger at sentiment classification than category classification, since the strongest sentiment classifier has an accuracy of 74.5% and the strongest category classifier has an accuracy of 58.2%. This difference is mostly explained by the fact that we have 3 categories and only 2 sentiments: there are simply more ways to be wrong with 3 choices instead of 2. In fact, if we group Wish into Feeling, such that we have 2 categories (Prediction and Feeling), we find that the accuracy of category classification is around 75.1% using Random Forest (see Appendix A).

Finally, we present the most important or influential words underlying our model classifications. Since the RF model was strong in both sentiment and category classification, we choose it as a representative model for showing “important words.” Word importance is determined by the Mean Decrease Gini score, which increases as splits on a word yield larger reductions in classification error.

 
We can see from the results above that some words are intuitively important, such as “like,” which probably signals positive sentiment (rather than negative sentiment offered ironically), and “hope,” which very likely signals a wish. We also see many words that have a neutral sentiment in a general lexicon, such as “pick,” “draft,” or “round,” that carry a positive or negative sentiment in this domain. This provides insight into why a classifier trained on a general lexicon will do poorly on our corpus.
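To illustrate the Mean Decrease Gini score, the following toy calculation (with invented counts and a hypothetical split word “bust”) shows the impurity reduction a single split contributes; Random Forests accumulate such decreases over all splits on a word:

```python
def gini(counts):
    """Gini impurity of a node given class counts: 1 - sum(p_k^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Toy parent node: 10 positive and 10 negative tweets.
parent = gini([10, 10])  # maximally impure for two classes

# Suppose splitting on a hypothetical word like "bust" sends
# 2 positives + 8 negatives left and 8 positives + 2 negatives right.
left, right = [2, 8], [8, 2]
n_left, n_right, n = sum(left), sum(right), 20
child = (n_left / n) * gini(left) + (n_right / n) * gini(right)

decrease = parent - child  # this split's contribution to the word's score
```

A word whose splits repeatedly produce purer children accumulates a high Mean Decrease Gini and therefore ranks as “important.”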

It is also interesting to compare the Random Forest Important Words to the words in Appendix B.  The words in Appendix B were selected by identifying words that are more often found in “Positive” or “Prediction” tweets. In other words, we might think of them as “Intuitive” Important Words. The difference between the Random Forest words and those in Appendix B is fairly striking, and it shows that “importance” in Random Forests is driven by complex machinery that is not always intuitive.

Conclusion

We have shown that in the domain of tweets about NFL Draft prospects, we can achieve fairly accurate classification of sentiment and categories using supervised learners that operate on simple unigram features. The accuracy that we achieve for classification of sentiment (positive vs. negative) and two-category problems (prediction vs. feeling) is generally around 75%, with the Hybrid model performing best for sentiment and the Random Forest performing best for categorization. This performance is comparable to, although slightly lower than, the accuracy achieved by Pang and Lee (80%) and Socher (85%), who trained their models on a movie review corpus. We attribute some of the difference in performance to the very different domains (movie reviews vs. sports tweets). We also work with a much smaller training dataset, although our exploration in Appendix C suggests we may already be seeing diminishing returns from gathering additional data.

The classifiers we developed provide an automated way to generate valuable sentiment analytics about players in the NFL Draft. For example, metrics like Positive to Negative Sentiment Ratios, which can further be broken down into categories like Prediction, Wish, and Feeling, can be powerful predictor variables in models that rank NFL Draft prospects or attempt to predict draft position. The usefulness of automated text classification is clear: an NFL Front Office can now quickly and accurately measure player sentiment over a very large number of tweets, without the need to manually classify each one.

References
 
Pustejovsky, J., & Stubbs, A. (2012). Natural Language Annotation for Machine Learning (Kindle Location 1198). O'Reilly Media.


Liu, B. (2012). Sentiment Analysis and Opinion Mining (p. 31). Morgan & Claypool Publishers.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., et al. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Retrieved from http://stanford.io/1jUlM1C

Nasukawa, T., & Yi, J. (2003). Capturing Favorability Using Natural Language Processing. Retrieved from http://tredocs.com/tw_files2/urls_41/40/d39217/7zdocs/7.pdf 

Hogenboom, A., Bal, D., Frasincar, F., Bal, M., De Jong, F., & Kaymak, U. (2013, March 18). Exploiting Emoticons in Sentiment Analysis. Retrieved from http://eprints.eemcs.utwente.nl/23268/01/sac13senticon.pdf

Grimes, S. (2012, January 26). Sentiment Tool Scans Twitter To Set Super Bowl Odds. Retrieved from http://www.informationweek.com/software/informationmanagement/sentimenttoolscanstwittertosetsuperbowlodds/d/did/1102493?print=yes

Pang, B., & Lee, L. (2004). A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. Retrieved from http://aclweb.org/anthology/P/P04/P041035.pdf?CFID=325440138&CFTOKEN=34562827

Luce, L. (2012, January 2). Twitter sentiment analysis using Python and NLTK. Retrieved from http://bit.ly/KDMbUW

Miller, T. W. (2014). Modeling techniques in predictive analytics business problems and solutions with R. Upper Saddle River, N.J.: Pearson Education.

Pang, B., & Lee, L. (2002). Thumbs up? Sentiment Classification using Machine Learning Techniques. Retrieved from http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction (2nd ed.). New York: Springer.

Ling, C., & Huang, J. (2003). AUC: a Statistically Consistent and more Discriminating Measure than Accuracy. Retrieved from http://cling.csd.uwo.ca/papers/ijcai03.pdf


Appendix A
 
Results for a Random Forest category classifier that uses 2 labels (Prediction vs Feeling).

 
Appendix B
 
We examined every word in our corpus to understand how frequently each one appears in tweets labeled as “positive” or “prediction.” In the graphs below, we identify the top ten words measured by Positive to Negative Ratio as well as Prediction to Feeling Ratio. Intuitively, words with higher ratios may be stronger predictors of a positive or prediction classification. As it turns out, the model interactions between words in a tweet are complex, so these “intuitive” words do not always appear as “important” words for classifiers like Random Forest. 
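The ratio computation behind these graphs can be sketched as follows, using invented word counts; the add-one smoothing here is our assumption for this sketch, to avoid dividing by zero for words never seen in negative tweets:

```python
# Invented per-word counts of appearances in positive vs. negative tweets.
counts = {
    "beast": {"pos": 9, "neg": 0},
    "stud":  {"pos": 6, "neg": 1},
    "bust":  {"pos": 1, "neg": 7},
}

def pos_neg_ratio(counts, smooth=1):
    """Positive-to-negative ratio per word; add-one smoothing (an
    assumption for this sketch) avoids dividing by zero."""
    return {w: (c["pos"] + smooth) / (c["neg"] + smooth)
            for w, c in counts.items()}

ranked = sorted(pos_neg_ratio(counts).items(), key=lambda kv: -kv[1])
```

The Prediction to Feeling ratio is computed identically, swapping in per-category counts.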

 
Appendix C

We have a small dataset relative to other studies, many of which have tens of thousands of word features. To understand the return on gathering and labeling additional data, we studied the increase in AUC resulting from including incrementally more labeled tweets in our training dataset. From the results below, we conclude that we are already seeing diminishing returns in AUC from adding more features (via larger samples). If we do require more labeled data, however, we might consider Semi-Supervised Learning (SSL) using Expectation Maximization (EM), which could help us increase our dataset in an automated way (Pustejovsky et al., 2012).

 
Appendix D
 
We also attempted to create features based on emoticons. We looked for emoticons in our tweet dataset, but unfortunately did not find many; the “:/” matches were mainly due to matches with “http://”. According to our Literature Review, emoticons can be powerful predictors of sentiment. An abbreviated set of output is shown below; see Output_emoticon.txt for a full listing.

 
Appendix E
 
Here is a quick visualization of player sentiment by geography: