Pros and Cons of NLTK Sentiment Analysis with VADER

29 May 2020 | CPOL | 3 min read
This article is the fourth in the Sentiment Analysis series that uses Python and the open-source Natural Language Toolkit.
We’ll recap how NLTK and Python can be used to quickly get a sentiment analysis of posts from Reddit using VADER, and the trade-offs of this approach.

The goal of this series on Sentiment Analysis is to use Python and the open-source Natural Language Toolkit (NLTK) to build a library that scans replies to Reddit posts and detects if posters are using negative, hostile or otherwise unfriendly language.

Listening to feedback is critical to the success of projects, products, and communities. However, as the size of your audience increases, it becomes increasingly difficult to understand what your users are saying. For this, sentiment analysis can help.

In Using Pre-trained VADER Models for NLTK Sentiment Analysis, we examined the role sentiment analysis plays in identifying the positive and negative feelings others may have for your brand or activities. Analyzing unstructured text is a common enough activity in natural language processing (NLP) that there are mainstream tools that can make it easier to get started.

Python’s Natural Language Toolkit (NLTK) is one of these tools. In the previous article, we learned how to retrieve data from Reddit, which hosts many popular online communities. We then used VADER analysis to derive a sentiment score from that Reddit data. The sentiment score helps us understand whether the comments represent positive or negative views.

In this and additional articles, we’re going to try to improve upon our approach to analyzing the sentiment of our communities. We’ll start by reviewing the pros and cons of the VADER model we've used so far.

The Lexical Approach to Sentiment Analysis

The VADER Sentiment Analyzer uses a lexical approach. That means it relies on a vocabulary (a lexicon) in which each word has been assigned a predetermined positive or negative score. Those scores are not learned from your data; they come from ratings that human reviewers gave to each entry.

For example, here’s a comment from the Reddit data:

Python
import praw

# Connect to reddit to query a specific posting
reddit = praw.Reddit(client_id='your-id',
                     client_secret='your-secret',
                     user_agent='your-agent')
post = "https://www.reddit.com/r/learnpython/comments/fwhcas/whats_the_difference_between_and_is_not"
submission = reddit.submission(url=post)

# Get the comments from the post, expanding every 'more' placeholder
submission.comments.replace_more(limit=None)
comments = submission.comments.list()
print(comments[116].body)

The output is:

This is cool!

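VADER looks up each term ("This", "is", and "cool") in its lexicon, where rated words carry an emotional intensity ranging from -4 (most negative) to +4 (most positive). Neutral words such as "This" and "is" typically have no entry and contribute nothing to the score. Here’s the lexicon entry for the token "cool":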

Python
cool    1.3 0.64031 [1, 1, 2, 1, 1, 1, 2, 2, 2, 0]
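
The entry appears to follow the layout of the VADER lexicon file: the token, the mean rating, the standard deviation, and the individual ratings from ten human reviewers. The short sketch below checks that reading against the line quoted above. The helper name parse_lexicon_line is hypothetical, and splitting on whitespace is an assumption made for this example.

```python
import ast
import statistics


def parse_lexicon_line(line):
    """Split one lexicon line into (token, mean, std_dev, ratings)."""
    # Assumption: fields are separated by whitespace and the ratings
    # list is the final bracketed field.
    token, mean, std_dev, ratings = line.split(None, 3)
    return token, float(mean), float(std_dev), ast.literal_eval(ratings)


token, mean, std_dev, ratings = parse_lexicon_line(
    "cool    1.3 0.64031 [1, 1, 2, 1, 1, 1, 2, 2, 2, 0]"
)

# Recompute the summary statistics from the raw ratings so they can be
# compared with the stored mean and standard deviation.
print(token, statistics.mean(ratings), round(statistics.pstdev(ratings), 5))
```

If the layout assumption holds, the recomputed mean and population standard deviation should match the two numbers stored in the entry.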

Additional rules cover syntax elements like punctuation. The exclamation point, for example, increases the overall intensity of a phrase or sentence. Other terms change how nearby words are scored: "not" reverses the polarity of the word it negates, and "but" shifts the emphasis toward the clause that follows it.
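
To make the interplay of lexicon scores and rules concrete, here is a deliberately simplified scorer. It is a sketch of the general lexical technique, not NLTK's implementation: the function toy_compound, the one-word lexicon, the negation list, and the numeric constants are all assumptions chosen for illustration.

```python
import math
import re

# Hypothetical mini-lexicon; only "cool" comes from the entry quoted above.
LEXICON = {"cool": 1.3}
NEGATIONS = {"not", "never", "no"}

# Illustrative constants, assumed for this sketch.
NEGATION_SCALAR = -0.74    # flips and dampens a negated word
EXCLAMATION_BOOST = 0.292  # added per "!" (capped at four)


def toy_compound(text):
    """Score text with a lexicon lookup plus two simple rules."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, tok in enumerate(tokens):
        valence = LEXICON.get(tok, 0.0)  # unknown words contribute nothing
        # Rule 1: a negation within the three preceding tokens flips the sign.
        if valence and NEGATIONS & set(tokens[max(0, i - 3):i]):
            valence *= NEGATION_SCALAR
        total += valence
    # Rule 2: exclamation points push the total further from zero.
    boost = min(text.count("!"), 4) * EXCLAMATION_BOOST
    if total > 0:
        total += boost
    elif total < 0:
        total -= boost
    # Squash the unbounded sum into the open range (-1, 1).
    return total / math.sqrt(total * total + 15)


for sample in ("This is cool", "This is cool!", "This is not cool"):
    print(f"{sample!r}: {toy_compound(sample):+.4f}")
```

Running it should show the exclamation point raising the score of "This is cool" and the negation turning it negative. The real analyzer applies many more rules (capitalization, degree modifiers, "but" clauses), so use this only to build intuition.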

There are some distinct advantages to this approach:

  • For many applications, such as evaluating public opinion, performing a competitive analysis, or enhancing customer experience, this approach is easy to understand.
  • The lexical approach is quick to implement, requiring just readily available libraries and a few lines of code.
  • It's easy to capture a dataset for analysis.
  • It's efficient at analyzing large datasets.

There are also some disadvantages to this approach:

  • Misspellings and grammatical mistakes may cause the analysis to overlook important words or usage.
  • Sarcasm and irony may be misinterpreted.
  • Analysis is language-specific.
  • Discriminating jargon, nomenclature, memes, or turns of phrase may not be recognized.
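
Several of these weaknesses come down to lexicon coverage: a token that has no entry is simply treated as neutral. The sketch below shows the effect with a hypothetical three-word lexicon (the scores are illustrative, not taken from VADER) and a hypothetical helper, lexicon_hits.

```python
import re

# Hypothetical stand-in for a sentiment lexicon; the values are illustrative.
LEXICON = {"cool": 1.3, "great": 3.0, "awful": -2.0}


def lexicon_hits(text, lexicon=LEXICON):
    """Return the tokens in text that have a lexicon entry, with their scores."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {tok: lexicon[tok] for tok in tokens if tok in lexicon}


# The misspelled variants carry the same intent but match nothing,
# so a purely lexical scorer would see them as neutral.
for comment in ("That was great", "That was grate", "That was graet"):
    print(comment, "->", lexicon_hits(comment))
```

Counting how many tokens in your own corpus produce no hits is a quick way to gauge whether a stock lexicon fits your community's vocabulary.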

For certain use cases that seek a higher level of accuracy, it may be worth evaluating alternatives.

More importantly, certain domain-specific contexts may need a different approach. For example, a target corpus that includes specialized terms, language, or knowledge, like that of a programming community, differs substantially from the social media posts that VADER was originally built for. Source code, for example, carries no sentiment apart from the occasional aggressive variable name, yet a lexical analyzer may still score its tokens as if they were ordinary prose.
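
One possible mitigation, which is not part of this series' code, is to strip code from a comment before scoring it. The sketch below assumes comments use Reddit-style Markdown (fenced blocks, lines indented by four spaces or a tab, and backtick spans); the helper strip_code is hypothetical, and it will also drop any prose that happens to be indented that deeply.

```python
import re

FENCE = "`" * 3  # Markdown code fence, built here to avoid a literal fence


def strip_code(markdown_text):
    """Remove fenced blocks, indented code lines, and inline code spans."""
    # Fenced blocks: everything between a pair of fences.
    fenced = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
    text = fenced.sub(" ", markdown_text)
    # Indented blocks: lines starting with four spaces or a tab.
    text = re.sub(r"(?m)^(?: {4}|\t).*$", "", text)
    # Inline spans: text between single backticks.
    text = re.sub(r"`[^`\n]*`", " ", text)
    # Collapse the leftover whitespace.
    return re.sub(r"\s+", " ", text).strip()


comment = (
    "This is cool!\n\n"
    "    def kill_process(pid):\n"
    "        abort(pid)\n\n"
    "Use `is not` here."
)
print(strip_code(comment))
```

Only the surrounding prose is left for the sentiment analyzer, so identifiers such as kill_process no longer influence the score.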

There are some machine learning classification approaches that may help with this.

Next Steps

In this article, we quickly looked at some pros and cons of using a lexical approach to sentiment analysis.

As a next step, NLTK and Machine Learning for Sentiment Analysis covers creating the training, test, and evaluation datasets for the NLTK Naive Bayes classifier.

If you need to catch up with previous steps of the VADER analysis, see Using Pre-trained VADER Models for NLTK Sentiment Analysis.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)