NLTK includes pre-trained models in addition to its text corpus. One of them is the VADER Sentiment Lexicon model, which is aimed at sentiment analysis on social media. Let's see how it works.
If you’ve ever been asked to rate your experience with customer support on a scale from 0-10, you may have contributed to a Net Promoter Score (NPS). With this approach to customer experience, you are generally looking for promoters, those who rate their experience 9-10, because they are advocates for your brand and will keep buying, consuming, and telling others about their experience.
Within the context of NPS, detractors are those who rate their experience with a score from 0-6. They are unhappy and often spread their displeasure through negative word-of-mouth. These customers are typically a priority for outreach. A score of 7-8 is considered passive: these customers are satisfied but neutral.
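The score ranges above can be written down as a small helper. This is only a sketch: the function name nps_category and the sample ratings are made up for illustration and are not part of any NPS library.

```python
def nps_category(score: int) -> str:
    """Map a 0-10 survey rating to the NPS group described above."""
    if not 0 <= score <= 10:
        raise ValueError("NPS ratings are expected to be between 0 and 10")
    if score >= 9:
        return "promoter"   # 9-10: advocates for the brand
    if score >= 7:
        return "passive"    # 7-8: satisfied but neutral
    return "detractor"      # 0-6: unhappy, a priority for outreach


# Made-up ratings, purely for illustration
ratings = [10, 9, 8, 7, 6, 3, 0]
print([nps_category(r) for r in ratings])
```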
Sentiment analysis can give you NPS-like insights without requiring your audience to take a survey. It can help you find promoters and detractors simply by evaluating what people are saying about you on social media or in public forums.
In Finding Data for Natural Language Processing, we talked about textual datasets for NLP and techniques for creating a custom dataset by collecting posts and comments from Reddit discussions.
In this article, we'll look at techniques you can use to start doing the actual NLP analysis. We'll be building on the data collected in the previous article.
VADER Sentiment Analyzer
Developed in 2014, VADER (Valence Aware Dictionary and sEntiment Reasoner) is a pre-trained model that uses rule-based values tuned to sentiments from social media. It evaluates the text of a message and gives you an assessment of not just positive and negative, but the intensity of that emotion as well.
It uses a dictionary of terms that it can evaluate. From the GitHub repository this includes examples like:
- Negations - a modifier that reverses the meaning of a phrase ("not great").
- Contractions - negations, but more complex ("wasn’t great").
- Punctuation - increased intensity ("It’s great!!!").
- Slang - variations of slang words such as "kinda", "sux", or "hella".
It's even able to understand acronyms ("lol") and emoji (❤).
The positive, neutral, and negative scores are ratios: each one is the proportion of the text that falls into that category. Language is not black and white, so it is rare to see a completely positive or a completely negative score. Since this model has been pre-trained for social media, it should be very applicable to comments made by users on Reddit.
Let’s first look at an example from a comment retrieved previously from Reddit.
comments[116].body
import nltk
nltk.download('vader_lexicon')  # one-time download of the VADER lexicon
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores(comments[116].body)
The output of this analysis is:
{'neg': 0.0, 'neu': 0.436, 'pos': 0.564, 'compound': 0.3802}
On Reddit, a post like "This is cool!" is high praise.
We’ve downloaded (nltk.download('vader_lexicon')) and imported (from nltk.sentiment.vader import SentimentIntensityAnalyzer) the VADER sentiment analyzer and used it to score a particular comment from the collection of comments (analyzer.polarity_scores(comments[116].body)).
The results of polarity_scores give us numerical values for the use of negative, neutral, and positive word choice. The compound value reflects the overall sentiment, ranging from -1 (very negative) to +1 (very positive).
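To make the structure of that result concrete, here is a minimal sketch that inspects the dictionary reported earlier for the sample comment. It does not call NLTK; the values are copied from the output shown in this article, and the variable names are only illustrative.

```python
# Scores copied from the output shown earlier for comments[116]
scores = {'neg': 0.0, 'neu': 0.436, 'pos': 0.564, 'compound': 0.3802}

# The three category scores are proportions of the text, so for this
# comment they add up to 1 (rounded to absorb floating-point noise).
proportion_total = round(scores['neg'] + scores['neu'] + scores['pos'], 3)

# Which of the three categories carries the largest share of the text
dominant = max(('neg', 'neu', 'pos'), key=scores.get)

print(proportion_total)
print(dominant)
print(scores['compound'])  # overall sentiment, between -1 and +1
```

Because the reported values are rounded, the total for other comments may differ very slightly from 1.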
You can find more about the NLTK sentiment usage from the pydoc page: https://www.nltk.org/api/nltk.sentiment.html.
Sentiment for all Comments on a Reddit Post
Let’s look at the sentiment overall for this post instead of just a single comment. There were 119 comments to analyze and we’ll put them into buckets to keep count.
len(comments)
# Count comments by sentiment, treating compound scores within ±0.05 as neutral
result = {'pos': 0, 'neg': 0, 'neu': 0}
for comment in comments:
    score = analyzer.polarity_scores(comment.body)
    if score['compound'] > 0.05:
        result['pos'] += 1
    elif score['compound'] < -0.05:
        result['neg'] += 1
    else:
        result['neu'] += 1
print(result)
The output is:
{'pos': 65, 'neg': 25, 'neu': 29}
For this post, the comments were generally positive: 65 of the 119 comments (about 55%) scored as positive.
If you start analyzing your own posts using a model like this, you may want to tune the threshold up or down. For example, only counting compound scores above +0.5 or below -0.5, instead of ±0.05, would highlight the more extreme opinions.
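One way to experiment with the threshold is to make it a parameter. The sketch below works on plain compound values so it runs without NLTK; the helper names label_sentiment and bucket_counts are hypothetical, and every score in the sample list except the first is invented for illustration.

```python
def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Label a compound score, treating values within ±threshold as neutral."""
    if compound > threshold:
        return 'pos'
    if compound < -threshold:
        return 'neg'
    return 'neu'


def bucket_counts(compounds, threshold: float = 0.05) -> dict:
    """Count how many compound scores fall into each bucket."""
    counts = {'pos': 0, 'neg': 0, 'neu': 0}
    for compound in compounds:
        counts[label_sentiment(compound, threshold)] += 1
    return counts


# Illustrative compound scores (only 0.3802 comes from the article)
sample = [0.3802, -0.6, 0.1, 0.0, -0.2, 0.75]

print(bucket_counts(sample))                 # default ±0.05 band
print(bucket_counts(sample, threshold=0.5))  # stricter ±0.5 band
```

With real data, the list of compound values would come from analyzer.polarity_scores(comment.body)['compound'] for each comment.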
What can you do with this information? If you were trying to prioritize how to engage with your community, you might look at the positive comments and recognize their authors as your supporters. If you were trying to win back detractors, you might focus on the negative scores, looking for constructive feedback in those comments that could improve your offering, or for specific customer concerns you could address through personal outreach.
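A simple way to act on this is to rank comments by their compound score and review the extremes first. The following is a sketch under assumptions: the helper names and the (text, compound) pairs are invented for illustration, and in practice the pairs would be built from the comments collection and the analyzer used above.

```python
def most_negative(scored_comments, limit=3):
    """Return the lowest-scoring (text, compound) pairs: outreach candidates."""
    return sorted(scored_comments, key=lambda item: item[1])[:limit]


def most_positive(scored_comments, limit=3):
    """Return the highest-scoring (text, compound) pairs: likely supporters."""
    return sorted(scored_comments, key=lambda item: item[1], reverse=True)[:limit]


# Hypothetical (text, compound) pairs. With real data these could be built as:
# [(c.body, analyzer.polarity_scores(c.body)['compound']) for c in comments]
scored = [
    ("comment A", 0.38),
    ("comment B", -0.72),
    ("comment C", 0.0),
    ("comment D", 0.85),
    ("comment E", -0.15),
]

print(most_negative(scored, limit=2))
print(most_positive(scored, limit=2))
```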
Next Steps
As you've seen, we can take text from a variety of sources and do a quick analysis to understand positive and negative sentiment. This is useful feedback for understanding whether a product, service, or piece of content is well-liked. It can also help prioritize community engagement.
As a next step, we can consider the Pros and Cons of NLTK Sentiment Analysis with VADER.
We can also take this analysis project further by using machine learning approaches to understand language and trying to improve upon our results in NLTK and Machine Learning.
Jayson manages Developer Relations for Dolby Laboratories, helping developers deliver spectacular experiences with media.
Jayson likes learning and teaching about new technologies with a wide range of applications and industries. He's built solutions with companies including DreamWorks Animation (Kung Fu Panda, How to Train Your Dragon, etc.), General Electric (Predix Industrial IoT), The MathWorks (MATLAB), Rackspace (Cloud), and HERE Technologies (Maps, Automotive).