Presented by
Quoc Le, Jeff Ori,
Chris Sherman, Gene Totten, Ryan Williams
Introduction
In this analysis, we
explore predictive models for classifying unstructured text in tweets about
2014 NFL Draft prospects. Classification of such tweets into categories like “prediction,” “wish,” or “feeling,” as well as into sentiments like “positive” or “negative,” can
be a powerful tool for NFL Front Offices seeking to incorporate public opinion about
players into their draft predictive models.
Text classification
(which for us includes sentiment analysis) is generally a difficult problem.
Many of the challenges of text classification, including co-reference
resolution, negation handling, and word sense disambiguation, remain unsolved in
general Natural Language Processing (Liu, 2012). In addition, “sports tweeting” is a domain unto itself,
with its own lexicon and jargon. As such, we cannot directly use models trained
on a general lexicon for prediction in this specific domain since “a classifier trained using opinion documents from one domain often
performs poorly on test data from another domain”
(Liu, 2012) — an observation that we test as part of
our analysis.
Despite these
challenges, we show in this analysis that respectable text classification
performance can be achieved by labeling our own domain-specific training
dataset and using supervised learners that operate on unigram features (also
known as a “bag of words” approach).
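As a minimal illustration of the unigram representation used throughout this analysis — assuming a simple binary presence/absence encoding, with hypothetical tweets for input — the feature extraction can be sketched as:

```python
def unigram_features(tweets):
    """Build binary unigram (presence/absence) feature vectors."""
    vocab = sorted({w for t in tweets for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for t in tweets:
        vec = [0] * len(vocab)
        for w in set(t.lower().split()):  # presence, not count
            vec[index[w]] = 1
        vectors.append(vec)
    return vocab, vectors

# Hypothetical tweets for illustration.
tweets = ["Clowney goes first overall", "hope Manziel goes first"]
vocab, X = unigram_features(tweets)
```

Each tweet becomes a fixed-length vector over the shared vocabulary, which is the input format the SVM, NB, and RF models consume.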
Literature Review
The literature
reviewed by the team is summarized in the References List. Based on this
review, we favor text classification using machine learning techniques like
Support Vector Machines (SVM) or Naïve Bayes
(NB), mainly for their simplicity and higher accuracy compared to more
heuristic positive/negative word counts (Miller, 2014). Further, we favor
training on unigram (one-item) features over n-grams (a contiguous
sequence of n items), because in the area of “mood, affect, or sentiment
classification,” unigram features are recommended to
avoid overfitting (Pustejovsky et al., 2012). In summary, the models we consider
are supervised learners that predict a binary (or multi-nomial) classification
based on single word predictors (presence or absence of a single word). It has
been shown that, in some domains, SVM and NB with unigram features can perform
sentiment classification on documents with an accuracy of around 79.80% (Pang
and Lee, 2002; Miller, 2014). Random Forests (RF) perform almost as well
at around 74% (Miller, 2014).
We also reviewed
models that go beyond the “bag of words” approach. For example, there is a new “Sentiment
Treebank” approach, which uses a Recursive Neural
Network (RNN) to perform sentiment classification (Socher, 2013). This cutting-edge
approach uses sentiment labels on entire phrases organized into trees, instead
of labels on individual word features. This approach improves the accuracy of
sentiment classification for sentences from 80% to 85.4%. In our research, we
will evaluate this RNN model alongside the SVM, NB, and RF models discussed
earlier.
Our literature review
also led to insights on how to refine sentiment analysis in the domain of
sports forecasting. For example, we discovered that sports betting sentiment
analysis services like Senti-bet classify fan messages into the categories “wish,” “prediction,” and “feeling,”
which are valuable distinctions when using sentiment to make forecasts (Grimes,
2012). Since the business application of our sentiment analysis is predicting a
player’s value in the draft, the same
categories are useful to us.
We reviewed Holdout Test Sets and K-Fold Cross Validation as approaches for comparing alternative models. Given the discriminating properties of AUC/ROC (“area under curve of receiver operating characteristic”) over straight accuracy (Ling, 2003), we favor AUC/ROC for comparing sentiment classifiers, as this is appropriate for binary (positive/negative) classification. Due to the complications of using AUC in a multinomial classification setting, we prefer accuracy and confusion matrices for comparing category (wish/prediction/feeling) classifiers.
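To make the AUC/ROC comparison concrete, here is a minimal sketch (with hypothetical scores and labels) of the rank-based AUC computation — the probability that a randomly chosen positive tweet is scored above a randomly chosen negative one:

```python
def auc(labels, scores):
    """Rank-based AUC: P(positive scored above negative), ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities for six held-out tweets.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
```

Unlike straight accuracy, this measure uses the full ranking induced by the predicted probabilities rather than a single 0.5 cutoff, which is the discriminating property noted by Ling (2003).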
Methods
We briefly discuss
our methods for Data Acquisition, Exploratory Data Analysis (EDA), and Model Comparison.
For Data Acquisition, we used the twitteR package in R to retrieve tweets via
Twitter searches. In particular, we searched for “Manziel” and “Clowney” in
tweets posted a few days before the 2014 NFL Draft and extracted the Twitter “status,” which is the text of the
message. Focusing only on these two players yielded thousands of tweets, which
provided ample data for us to label.
The Twitter search
used in our Data Acquisition helps to address “entity
identification,” which is a key challenge in Sentiment
Analysis (Liu, 2012). That is, we already know from the search terms that the tweets
are “about”
Manziel and Clowney, so we have high confidence that the entities are in fact
the topics of the tweets we classify.
Following
acquisition, we proceed with data preprocessing and annotation. The main preprocessing
task involves deleting any remaining missed-entity tweets (i.e., where Manziel
or Clowney is not the actual entity), excluding retweets, screening out non-English
tweets, removing punctuation, removing Twitter handles (but leaving hashtags),
and performing word stemming. The annotation consisted of classifying each
tweet to a category (prediction/wish/feeling) and a sentiment
(positive/negative).
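The preprocessing steps above can be sketched as follows. This is a simplified illustration: the retweet check, regular expressions, and the naive suffix-stripping stemmer are stand-ins for the actual tooling, which the report does not name.

```python
import re

SUFFIXES = ("ing", "ed", "s")  # naive stand-in for a real stemmer

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(tweet):
    """Drop retweets; strip @handles (keep #hashtags) and punctuation; stem."""
    if tweet.startswith("RT "):
        return None                              # exclude retweets
    tweet = re.sub(r"@\w+", "", tweet)           # remove Twitter handles
    tweet = re.sub(r"[^\w#\s]", "", tweet)       # keep word chars and '#'
    return " ".join(stem(w) for w in tweet.lower().split())
```

Language screening and missed-entity deletion are omitted here since they required manual judgment.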
Before jumping into
training candidate models, we perform EDA to understand word frequencies and prior
distributions of categories and sentiments in the data. We also perform a quick
“sentiment baseline” by using the Stanford
Treebank model to get an idea of how a more general sentiment classifier performs
on our domain-specific tweet data.
For Model Comparison,
we consider four supervised classifiers that operate on unigram features: SVM,
RF, NB, and AdaBoost (AB). We also consider a Hybrid (HYB) model that combines
these four classifiers. We fit models both for classification into categories
and for sentiment. For category classification, we use accuracy and confusion
matrices to compare model performance. For sentiment classification, we use
AUC/ROC to compare performance. Finally, we primarily use R for modeling and analysis,
but we also use Python for one model (NB).
Results
We start by
presenting a few visualizations from our EDA, which are helpful for
understanding the dataset:
We can see from the
graphs above how the 879 classified tweets in our training data break down by
Player, Category, and Sentiment. We also see that although we labeled fewer
Clowney tweets in the training data, Clowney received many more positive tweets
relative to negative tweets across all categories. In the “wish” category in particular, tweeters
expressed positive wishes about Clowney 18 times more often than negative
wishes. This clearly suggests that Clowney was held in higher esteem in the
court of public opinion, and this sentiment turned out to reflect reality as
Clowney was selected #1 and Manziel #22 in the actual NFL Draft.
We now turn our attention to the results of our sentiment classification. Here we use a 30% holdout test set to calculate AUC and develop ROC Curves for each of the 5 classifiers:
Using ROC curves and AUC as our comparison metric, we can see that the strongest performer for sentiment classification was the Hybrid model (HYB). The Hybrid model was developed by averaging the predicted probabilities for each of the 4 other models. Our AUC of 0.748 for the Hybrid Model is very respectable, and it is comparable to results from sentiment analysis in other domains such as movie reviews (Pang and Lee, 2002).
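The Hybrid model's averaging step is simple; here is a minimal sketch, with hypothetical predicted probabilities standing in for the actual model outputs:

```python
def hybrid_probs(model_probs):
    """Average predicted positive-class probabilities across models."""
    n_models = len(model_probs)
    return [sum(p) / n_models for p in zip(*model_probs)]

# Hypothetical positive-class probabilities from SVM, RF, NB, and AB
# for three held-out tweets.
svm = [0.9, 0.2, 0.6]
rf  = [0.8, 0.3, 0.5]
nb  = [0.7, 0.1, 0.7]
ab  = [0.6, 0.4, 0.4]
hyb = hybrid_probs([svm, rf, nb, ab])
```

Averaging tends to smooth out the idiosyncratic errors of the individual classifiers, which is consistent with HYB edging out its components on AUC.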
Clearly, the RNN trained on a different lexicon fares poorly on our collection of tweets, suggesting that sentiment analysis for our particular business application requires training on a domain-specific corpus.
We have established that our sentiment classifiers are fairly good performers. How do the same models perform on category classification? Since there are three categories (wish/prediction/feeling), we cannot use ROC/AUC, so we focus on accuracy and confusion matrices to compare and understand performance. The results are presented below.
The confusion matrices above show the overall percent accuracy for each category. Based on overall accuracy, the RF model is the best category classifier at 58.2%. RF is also the most balanced classifier based on accuracy in each category. For example, it classifies Feeling correctly 63.9% of the time, Prediction 64.5% of the time, and Wish 31.7% of the time. The other classifiers do much better with Feeling, but are weaker at classifying Prediction and Wish.
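Per-class accuracies like those above can be read off the diagonal of a row-normalized confusion matrix; a minimal sketch with hypothetical labels:

```python
CATS = ["feeling", "prediction", "wish"]

def confusion_matrix(actual, predicted):
    """counts[a][p] = number of tweets with true class a predicted as p."""
    counts = {a: {p: 0 for p in CATS} for a in CATS}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    return counts

def per_class_accuracy(counts):
    """Diagonal of the row-normalized matrix: accuracy within each class."""
    return {a: counts[a][a] / sum(counts[a].values()) for a in CATS}

# Hypothetical labels for four held-out tweets.
actual    = ["feeling", "feeling", "prediction", "wish"]
predicted = ["feeling", "wish", "prediction", "feeling"]
acc = per_class_accuracy(confusion_matrix(actual, predicted))
```

Comparing these per-class figures, rather than overall accuracy alone, is what reveals the balance advantage of RF noted above.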
We might be tempted to conclude that our machine learning models are stronger at sentiment classification than category classification, since the strongest sentiment classifier has an accuracy of 74.5% and the strongest category classifier has an accuracy of 58.2%. This difference is mostly explained by the fact that we have 3 categories and only 2 sentiments. In other words, there are more ways to be wrong when choosing among 3 classes rather than 2. In fact, if we group Wish into Feeling, such that we have 2 categories (Prediction and Feeling), we find that the accuracy of category classification is around 75.1% using Random Forest (see Appendix A).
Finally, we present the most important or influential words underlying our model classifications. Since the RF model was strong in both sentiment and category classification, we choose it as a representative model for showing “important words.” Word importance is measured by Mean Decrease Gini, which captures how much splits on a word reduce node impurity, averaged across the trees in the forest.
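The impurity reduction behind this score can be sketched as follows. This is a simplified, single-split illustration with hypothetical data; the actual Mean Decrease Gini averages such reductions over every split on the word across the whole forest.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(labels, has_word):
    """Impurity reduction from splitting tweets on presence of one word."""
    left  = [y for y, h in zip(labels, has_word) if h]
    right = [y for y, h in zip(labels, has_word) if not h]
    n = len(labels)
    return gini(labels) - (len(left) / n) * gini(left) \
                        - (len(right) / n) * gini(right)

# Hypothetical: "hope" appears mostly in positive tweets.
labels   = [1, 1, 1, 0, 0, 0]
has_hope = [True, True, True, True, False, False]
```

A word whose presence cleanly separates the classes yields a large decrease, which is why it surfaces as “important.”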
We can see from the results above that some words are intuitively important, such as “like,” which is probably used with positive sentiment (rather than offered with irony), and “hope,” which very likely signals a wish. We also see many words that are neutral in a general lexicon, such as “pick,” “draft,” or “round,” that carry a positive or negative sentiment in this domain. This provides insight into why a classifier trained on a general lexicon does poorly on our corpus.
It is also interesting to compare the Random Forest Important Words to the words in Appendix B. The words in Appendix B were selected by identifying words that are more often found in “Positive” or “Prediction” tweets. In other words, we might think of them as “Intuitive” Important Words. The difference between the Random Forest words and those in Appendix B is fairly striking, and it shows that “importance” in Random Forests is driven by complex machinery that is not always intuitive.
Conclusion
We have shown that in the domain of tweets about NFL Draft prospects, we can achieve fairly accurate classification on sentiment and categories using supervised learners that operate on simple unigram features. The accuracy that we achieve for classification of sentiment (positive vs. negative) and 2-category problems (prediction vs. feeling) is generally around 75%, with the Hybrid model performing best for sentiment and the Random Forest performing best for categorization. This performance is comparable to, although slightly lower than, the accuracy achieved by Pang and Lee (80%) and Socher (85%), who trained their models on a movie review corpus. We attribute some of the difference in performance to very different domains (movie reviews vs. sports tweets). We also work with a much smaller training dataset, although our exploration in Appendix C suggests we may already be seeing diminishing returns from gathering additional data.
The classifiers we developed provide an automated way to generate valuable sentiment analytics about players in the NFL Draft. For example, metrics like Positive-to-Negative Sentiment Ratios, which can further be broken down into categories like Prediction, Wish, and Feeling, can be powerful predictor variables in models that rank NFL Draft prospects or attempt to predict draft position. The usefulness of automated text classification is clear: an NFL Front Office can now quickly and accurately measure player sentiment over a very large number of tweets, without the need to manually classify each one.
References
Pustejovsky, J., & Stubbs, A. (2012). Natural Language Annotation for Machine Learning (Kindle Location 1198). O'Reilly Media.
Liu, B. (2012). Sentiment Analysis and Opinion Mining (p. 31). Morgan & Claypool Publishers.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., et al. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Retrieved from http://stanford.io/1jUlM1C
Nasukawa, T., & Yi, J. (2003). Capturing Favorability Using Natural Language Processing. Retrieved from http://tredocs.com/tw_files2/urls_41/40/d39217/7zdocs/7.pdf
Hogenboom, A., Bal, D., Frasincar, F., Bal, M., De Jong, F., & Kaymak, U. (2013, March 18). Exploiting Emoticons in Sentiment Analysis. Retrieved from http://eprints.eemcs.utwente.nl/23268/01/sac13senticon.pdf
Grimes, S. (2012, January 26). Sentiment Tool Scans Twitter To Set Super Bowl Odds. Retrieved from http://www.informationweek.com/software/informationmanagement/sentimenttoolscanstwittertosetsuperbowlodds/d/did/1102493?print=yes
Pang, B., & Lee, L. (2004). A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. Retrieved from http://aclweb.org/anthology/P/P04/P041035.pdf?CFID=325440138&CFTOKEN=34562827
Luce, L. (2012, January 2). Twitter sentiment analysis using Python and NLTK. Retrieved from http://bit.ly/KDMbUW
Miller, T. W. (2014). Modeling techniques in predictive analytics business problems and solutions with R. Upper Saddle River, N.J.: Pearson Education.
Pang, B., & Lee, L. (2002). Thumbs up? Sentiment Classification using Machine Learning Techniques. Retrieved from http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf
Hastie, T., & Tibshirani, R. (2009). The elements of statistical learning data mining, inference, and prediction (2nd ed.). New York: Springer.
Ling, C., & Huang, J. (2003). AUC: a Statistically Consistent and more Discriminating Measure than Accuracy. Retrieved from http://cling.csd.uwo.ca/papers/ijcai03.pdf
Appendix A
Results for a Random Forest category classifier that uses 2 labels (Prediction vs Feeling).
Appendix B
We examined every word in our corpus to understand how frequently each one appears in tweets labeled as “positive” or “prediction.” In the graphs below, we identify the top ten words measured by Positive to Negative Ratio as well as Prediction to Feeling Ratio. Intuitively, words with higher ratios may be stronger predictors of a positive or prediction classification. As it turns out, the model interactions between words in a tweet are complex, so these “intuitive” words do not always appear as “important” words for classifiers like Random Forest.
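Ratio computations like those behind these graphs can be sketched as follows (with hypothetical labeled tweets; the add-one smoothing is an assumption to avoid division by zero, not necessarily what was used):

```python
from collections import Counter

def pos_neg_ratios(tweets, labels, smoothing=1):
    """Per-word ratio of (smoothed) positive to negative tweet counts."""
    pos, neg = Counter(), Counter()
    for text, label in zip(tweets, labels):
        words = set(text.lower().split())  # count each word once per tweet
        (pos if label == "positive" else neg).update(words)
    vocab = set(pos) | set(neg)
    return {w: (pos[w] + smoothing) / (neg[w] + smoothing) for w in vocab}

# Hypothetical labeled tweets.
tweets = ["great pick", "great player", "bad pick"]
labels = ["positive", "positive", "negative"]
ratios = pos_neg_ratios(tweets, labels)
```

The Prediction-to-Feeling ratio is computed the same way, substituting category labels for sentiment labels.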
Appendix C
We have a small dataset relative to other studies, many of which have tens of thousands of word features. To understand the return on gathering and labeling additional data, we studied the increase in AUC resulting from incrementally adding more labeled tweets to our training dataset. From the results below, we conclude that we are already seeing diminishing returns in AUC from adding more features (via larger samples). If we do require more labeled data, however, we might consider Semi-Supervised Learning (SSL)
using Expectation Maximization (EM), which could help us grow our dataset in an automated way (Pustejovsky et al., 2012).
Appendix D
We also attempted to create features based on emoticons. We looked for emoticons in our tweet dataset but unfortunately found very few; the “:/” emoticon matches were mainly false positives from the “http://” in links. According to our Literature Review, emoticons can be powerful predictors of sentiment. An abbreviated set of output is shown below; see Output_emoticon.txt for a full listing.
Appendix E
Here is a quick visualization of player sentiment by geography: