View Filtered Tweets in a Word Cloud

Marc Clifton

Feb 17, 2015

CPOL

Using a couple of open source packages, I glue together a tweet stream and display the word hits in a word cloud using a force directed graph.

Introduction

I recently had a conversation with Srini (aka Mike) Vasan (CEO at Quantum Ventura) on the subject of semantic analysis that led to this fun little project. The idea is to take a Twitter stream (using the wonderful open source Tweetinvi C# API) and marry the output to a word cloud, which I've actually implemented as a Force Directed Graph using, as a starting point, the code that Bradley Smith blogged about in 2010. I looked at a few word cloud generators, but none were suited for real-time updating. A force directed graph, however, is a perfect way of creating a living, dynamic view of tweets as they happen in real time.

A (silent) video shows this best, so I've posted one here: https://www.youtube.com/watch?v=vEH_1h0jrZY

Wait until the 10 second mark for fun stuff to start happening.

The salient points of this applet are:

Word hit counts are shown in different font sizes (8pt plus the hit count, capped at 36pt, so 9pt for 1 hit up to 36pt for 28 or more hits)
Word hit counts are also reflected in the colorization, from blue (representing 1 hit) to red (representing 24 or more hits)
A maximum of 100 words are shown
To accommodate new words once the maximum is reached, the stalest existing word with 1 hit is removed
To prevent saturation of >1 hit count words, over time, counts on all words are slowly decremented

Source Code

The source code is on GitHub, here: https://github.com/cliftonm/TwitterWordCloud-WinForm

Accessing Twitter with the Tweetinvi API

This is very simple. You'll need to first obtain a consumer key and consumer secret from Twitter here: https://apps.twitter.com/. Then, get an access token and access token secret from here: https://api.twitter.com/oauth/request_token

Once you've done that, the credentials are set up in the API with the call:

// Setup your credentials
TwitterCredentials.SetCredentials("Access_Token", "Access_Token_Secret", "Consumer_Key", "Consumer_Secret");

To get this working in the app, you'll need to place these keys into a file called "twitterauth.txt" in the bin\Debug (or bin\Release) folder. The format should be:

[Access Token]
[Access Token Secret]
[Consumer Key]
[Consumer Secret]

For example (made up numbers):

bl6NVMpfD
bxrhfA8v
svdaQ86mNTE
lvGwXzG3MJnN

The values you get from Twitter will be much longer.

I read these four lines and initialize the credentials with:

protected void TwitterAuth()
{
  string[] keys = File.ReadAllLines("twitterauth.txt");
  TwitterCredentials.SetCredentials(keys[0], keys[1], keys[2], keys[3]);
}
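
Since a missing or short file would otherwise surface as an index-out-of-range error, here is a minimal sketch of a more defensive reader. The TwitterKeyFile class and its Read method are hypothetical names, not part of the project or of Tweetinvi; the sketch only assumes the four-line format described above.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of the original project.
public static class TwitterKeyFile
{
  // Returns the four keys in file order: access token, access token secret,
  // consumer key, consumer secret. Blank lines and surrounding spaces are ignored.
  public static string[] Read(string path)
  {
    string[] keys = File.ReadAllLines(path)
      .Select(line => line.Trim())
      .Where(line => line.Length > 0)
      .ToArray();

    if (keys.Length != 4)
    {
      throw new InvalidDataException(
        "Expected 4 non-empty lines in " + path + " but found " + keys.Length + ".");
    }

    return keys;
  }
}
```

The returned array can be passed to SetCredentials exactly as in TwitterAuth above.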

Starting / Stopping a Filtered Stream

The code gracefully handles starting and stopping a stream. By graceful, I mean that, if a stream exists, we shut down the current stream, wait for the event that indicates it has stopped, and then start a new stream.

/// <summary>
/// If a stream hasn't been started, just start it.
/// If a stream has been started, shut it down, and when it's stopped, start the new stream.
/// </summary>
protected void RestartStream(string keyword)
{
  if (stream != null)
  {
    Clear();
    stream.StreamStopped += (sender, args) => StartStream(keyword);
    stream.StopStream();
  }
  else
  {
    StartStream(keyword);
  }
}

/// <summary>
/// Start a stream, filtering on the keyword and only English language tweets.
/// </summary>
protected void StartStream(string keyword)
{
  stream = Stream.CreateFilteredStream();
  stream.AddTrack(keyword);
  stream.MatchingTweetReceived += (sender, args) =>
  {
    if (args.Tweet.Language == Language.English)
    {
      UpdateFdg(args.Tweet.Text);
    }
  };

  stream.StartStreamMatchingAllConditionsAsync();
}

/// <summary>
/// User wants to stop the stream.
/// </summary>
protected void OnStop(object sender, EventArgs e)
{
  if (stream != null)
  {
    stream.StreamStopped += (s, args) => stream = null;
    stream.StopStream();
  }
}

/// <summary>
/// Clear the word cloud.
/// </summary>
protected void Clear()
{
  wordNodeMap.ForEach(kvp => kvp.Value.Diagram = null);
  wordNodeMap.Clear();
}
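
The stop-then-restart handshake can be tried out without a Twitter connection. The sketch below uses a hypothetical FakeStream class as a stand-in (it is not the Tweetinvi type, and it raises its stopped event synchronously, which a real stream may not do). It also unsubscribes the handler once it has fired, which is a design choice of this sketch rather than something the code above does.

```csharp
using System;

// Hypothetical stand-in for a stream; NOT the Tweetinvi type.
public class FakeStream
{
  public string Keyword { get; private set; }
  public bool Running { get; private set; }
  public event EventHandler Stopped;

  public FakeStream(string keyword) { Keyword = keyword; }

  public void Start() { Running = true; }

  public void Stop()
  {
    Running = false;
    EventHandler handler = Stopped;
    if (handler != null) handler(this, EventArgs.Empty);
  }
}

public class StreamSwitcher
{
  public FakeStream Current { get; private set; }

  public void Restart(string keyword)
  {
    FakeStream old = Current;

    if (old == null)
    {
      Start(keyword);
      return;
    }

    // One-shot handler: it removes itself before starting the new stream,
    // so a later Stop() on the old object cannot start yet another stream.
    EventHandler onStopped = null;
    onStopped = (sender, args) =>
    {
      old.Stopped -= onStopped;
      Start(keyword);
    };

    old.Stopped += onStopped;
    old.Stop();
  }

  private void Start(string keyword)
  {
    Current = new FakeStream(keyword);
    Current.Start();
  }
}
```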

Parsing the Tweet

There are lots of parts of speech that need to be filtered out of a tweet. At the moment, the dictionary of words to exclude is hardcoded:

protected List<string> skipWords = new List<string>(new string[] { "a", "an", "and", "the", "it", ... etc ...

You get the idea.

We also want to remove punctuation (a somewhat brute force approach):

protected List<string> punctuation = new List<string>(new string[] { ".", ",", ";", "?", "!" });

public static class Extensions
{
  // TODO: This is probably painfully slow.
  public static string StripPunctuation(this string s)
  {
    var sb = new StringBuilder();

    foreach (char c in s)
    {
      if (!char.IsPunctuation(c))
      {
        sb.Append(c);
      }
    }

    return sb.ToString();
  }
}
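
To see what this filter actually removes, here is a small sketch that expresses the same test with LINQ. PunctuationDemo and Strip are hypothetical names used only for this illustration.

```csharp
using System;
using System.Linq;

// Hypothetical demo class; it applies the same char.IsPunctuation test as StripPunctuation.
public static class PunctuationDemo
{
  public static string Strip(string s)
  {
    // Keep every character that .NET does not classify as punctuation.
    return new string(s.Where(c => !char.IsPunctuation(c)).ToArray());
  }
}
```

Strip("Hello, world!") returns "Hello world" and Strip("don't") returns "dont". Note that char.IsPunctuation also treats '#' and '@' as punctuation, so Strip("#obama") returns "obama", while a symbol such as '$' is kept. Any test that relies on a leading '#' therefore has to run before the punctuation is stripped.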

We then filter out specific components of the tweet (numbers, hashtags, and links) along with the words in our dictionary:

/// <summary>
/// Return true if the word should be eliminated.
/// The word should be in lowercase!
/// </summary>
protected bool EliminateWord(string word)
{
  bool ret = false;
  int n;

  if (int.TryParse(word, out n))
  {
    ret = true;
  }
  else if (word.StartsWith("#"))
  {
    ret = true;
  }
  else if (word.StartsWith("http"))
  {
    ret = true;
  }
  else
  {
    ret = skipWords.Contains(word);
  }

  return ret;
}
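
List<string>.Contains scans the whole list on every lookup. As a sketch of one alternative, the same checks can be run against a HashSet<string> with a case-insensitive comparer. WordFilter is a hypothetical class, and its word list is only the illustrative subset shown above, not the full list from the project.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class; the skip list is an illustrative subset only.
public class WordFilter
{
  private readonly HashSet<string> skipWords =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "an", "and", "the", "it" };

  // Returns true for numbers, hashtags, links, and skip words.
  public bool Eliminate(string word)
  {
    int n;

    return int.TryParse(word, out n)
      || word.StartsWith("#", StringComparison.Ordinal)
      || word.StartsWith("http", StringComparison.OrdinalIgnoreCase)
      || skipWords.Contains(word);
  }
}
```

Because the comparer ignores case, this version does not depend on the caller lowercasing the word first.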

Avoiding Saturation and Accommodating new Tweets

As mentioned earlier, once we reach the limit of 100 words, we remove stale words to make room for new words:

/// <summary>
/// Remove the stalest 1 hit count word from the list -- this is the word that has not been updated the longest.
/// We do this only when the word count exceeds MaxWords.
/// </summary>
protected void RemoveAStaleWord()
{
  if (wordNodeMap.Count > MaxWords)
  {
    // TODO: Might be more efficient to maintain a sorted list to begin with!
    DateTime now = DateTime.Now;
    KeyValuePair<string, TextNode> tnode = wordNodeMap.Where(w => w.Value.Count==1).
          OrderByDescending(w => (now - w.Value.UpdatedOn).TotalMilliseconds).First();
    // Do not call RemoveNode, as this results in a stack overflow because the property setter has this side effect.
    tnode.Value.Diagram = null; // THIS REMOVES THE NODE FROM THE DIAGRAM. 
    wordNodeMap.Remove(tnode.Key);
    wordTweetMap.Remove(tnode.Key);
  }
}

The above algorithm applies only to words with a hit count of one. If we don't do this, high volume streams like "Obama" result in words never gaining any traction because of the huge volume of tweets coming in. By eliminating only the oldest "rabble", we get a nice word cloud of the concerns around President Obama.
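
The removal rule can be exercised in isolation. The following is a minimal sketch on a plain dictionary; WordEntry and StaleWords are hypothetical stand-ins for TextNode and the form's fields. Unlike the method above, it returns null rather than throwing when every word has more than one hit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the parts of TextNode that matter here.
public class WordEntry
{
  public int Count;
  public DateTime UpdatedOn;
}

public static class StaleWords
{
  // Removes the least recently updated single-hit word, if there is one.
  // Returns the removed key, or null when no word has exactly one hit.
  public static string RemoveStalest(Dictionary<string, WordEntry> words)
  {
    string key = words
      .Where(w => w.Value.Count == 1)
      .OrderBy(w => w.Value.UpdatedOn)
      .Select(w => w.Key)
      .FirstOrDefault();

    if (key != null)
    {
      words.Remove(key);
    }

    return key;
  }
}
```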

Saturation is avoided by reducing word counts over time, based on the number of tweets (iterations) modulo some saturation value, currently set to 20. In other words, every 20 tweets, all word counts greater than 1 are decremented:

/// <summary>
/// Prevent saturation by decrementing all word counts every 20 tweets.
/// </summary>
protected void ReduceCounts()
{
  // Every 20 iterations (the default for SaturationCount), decrement the word count on all non-1 count words.
  // This allows us to eventually replace old words no longer coming up in new tweets.
  if (iteration % SaturationCount == 0)
  {
    iteration = 0;
    wordNodeMap.Where(wc => wc.Value.Count > 1).Select(wc => wc.Key).ForEach(w=>wordNodeMap[w].DecrementCount());
  }
}
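
To get a feel for how the decay interacts with new hits, here is a small self-contained sketch that keeps plain integer counts. DecayingCounts is a hypothetical class; it follows the same order as the application (bump the iteration, decay if due, then count the words) but leaves out the graph nodes entirely.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical class that mirrors the decay rule on plain integer counts.
public class DecayingCounts
{
  private readonly Dictionary<string, int> counts = new Dictionary<string, int>();
  private readonly int saturationCount;
  private int iteration;

  public DecayingCounts(int saturationCount)
  {
    this.saturationCount = saturationCount;
  }

  public void OnTweet(IEnumerable<string> words)
  {
    // Every saturationCount tweets, decrement all counts above 1.
    if (++iteration % saturationCount == 0)
    {
      iteration = 0;

      // Copy the keys first so the dictionary is not modified while enumerating it.
      foreach (string key in counts.Keys.ToList())
      {
        if (counts[key] > 1) counts[key]--;
      }
    }

    foreach (string w in words)
    {
      int c;
      counts.TryGetValue(w, out c);   // c stays 0 for a new word
      counts[w] = c + 1;
    }
  }

  public int CountOf(string word)
  {
    int c;
    return counts.TryGetValue(word, out c) ? c : 0;
  }
}
```

With a saturation count of 20, a word has to show up in more than one tweet out of every 20 for its count to keep growing.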

Queueing the Tweet

The tweets are received asynchronously, so we put a lock around adding them to the queue:

protected void UpdateFdg(string text)
{
  lock (this)
  {
    tweetQueue.Enqueue(text);
  }
}
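
Locking on this works, but any other code that locks on the form would contend for the same lock. One alternative, sketched here under the assumption that .NET 4 or later is available, is ConcurrentQueue<T> from System.Collections.Concurrent, which needs no explicit lock. TweetBuffer is a hypothetical wrapper, not part of the project.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical wrapper around a lock-free queue.
public class TweetBuffer
{
  private readonly ConcurrentQueue<string> queue = new ConcurrentQueue<string>();

  // Safe to call from the stream's callback thread.
  public void Add(string text)
  {
    queue.Enqueue(text);
  }

  // Intended for the UI thread: removes and returns at most max tweets, oldest first.
  public List<string> Take(int max)
  {
    var batch = new List<string>();
    string tweet;

    while (batch.Count < max && queue.TryDequeue(out tweet))
    {
      batch.Add(tweet);
    }

    return batch;
  }
}
```

TryDequeue returns false on an empty queue instead of throwing, so the count check and the dequeue cannot get out of step with each other.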

De-Queueing the Tweet

The entire process of updating the FDG is done in the application's main thread, specifically in the OnPaint method, which is called 20 times a second by invalidating the owner-draw panel:

timer = new System.Windows.Forms.Timer();
timer.Interval = 1000 / 20; // 20 times a second, in milliseconds.
timer.Tick += (sender, args) => pnlCloud.Invalidate(true);

In the Paint event handler, we dequeue the tweet, update the nodes in the graph, execute a single iteration cycle of the FDG and draw the results:

pnlCloud.Paint += (sender, args) =>
{
  Graphics gr = args.Graphics;
  gr.SmoothingMode = System.Drawing.Drawing2D.SmoothingMode.HighQuality;

  ++paintIteration;

  if (!overrun)
  {
    overrun = true;
    int maxTweets = 20;

    // We assume here that we can parse the data faster than the incoming stream hands it to us.
    // But we put in a safety check to handle only 20 tweets.
    while (tweetQueue.Count > 0 && (maxTweets-- > 0))
    {
      string tweet;

      lock (this)
      {
        tweet = tweetQueue.Dequeue();
      }

      SynchronousUpdate(tweet);
    }

    // gr.Clear(Color.White);
    diagram.Iterate(Diagram.DEFAULT_DAMPING, Diagram.DEFAULT_SPRING_LENGTH, Diagram.DEFAULT_MAX_ITERATIONS);
    diagram.Draw(gr, Rectangle.FromLTRB(12, 24, pnlCloud.Width - 12, pnlCloud.Height - 36));
    overrun = false;
  }
  else
  {
    gr.DrawString("overrun", font, brushBlack, new Point(3, 3));
  }
};

I've never seen the application emit an overrun message, so I assume that everything processes fast enough that that's not an issue. Also, the processing of the incoming tweets, filtering the words, etc., could all be done in separate threads, but for simplicity, and because a lot of the existing code that I used for the FDG would need refactoring to be more thread-friendly, I decided to keep it simple and perform all the processing synchronously.

Updating the Counters and Tweet Buffers

The workhorse function is really SynchronousUpdate. Here, we remove any punctuation, eliminate words we don't care about, replace stale words with any new words in the tweet, and update the word hit counters. We also record up to MaxTweets tweets for each word so that, on mouse-over, you can see the tweet text (as I'll show next). Here's the method:

protected void SynchronousUpdate(string tweet)
{
  string[] words = tweet.Split(' ');

  ++iteration;
  ReduceCounts();

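  foreach (string w in words)
  {
    // Check the raw token first: StripPunctuation also removes '#',
    // so the hashtag test in EliminateWord would never match afterwards.
    if (EliminateWord(w.ToLower()))
    {
      continue;
    }

    string word = w.StripPunctuation();
    string lcword = word.ToLower();
    TextNode node;

    // Skip tokens that consisted of nothing but punctuation.
    if (lcword.Length > 0 && !EliminateWord(lcword))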
    {
      if (!wordNodeMap.TryGetValue(lcword, out node))
      {
        ++totalWordCount;
        PointF p = rootNode.Location;
        RemoveAStaleWord();
        TextNode n = new TextNode(word, p);
        rootNode.AddChild(n);
        wordNodeMap[lcword] = n;
        wordTweetMap[lcword] = new Queue<string>(new string[] { tweet });
      }
      else
      {
        wordNodeMap[lcword].IncrementCount();
        Queue<string> tweets = wordTweetMap[lcword];

        // Throw away the oldest tweet once MaxTweets tweets are already associated with this word.
        if (tweets.Count >= MaxTweets)
        {
          tweets.Dequeue();
        }

        tweets.Enqueue(tweet);
      }
    }
  }
}
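
The per-word tweet buffer can be looked at separately from the graph. Below is a minimal sketch, with TweetIndex as a hypothetical class, that splits on any whitespace (tweets can contain line breaks and repeated spaces) and keeps only the most recent tweets for each word. It deliberately leaves out punctuation stripping and word elimination.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class: a bounded history of tweets per lowercased word.
public class TweetIndex
{
  private readonly Dictionary<string, Queue<string>> wordTweets =
    new Dictionary<string, Queue<string>>();
  private readonly int maxTweets;

  public TweetIndex(int maxTweets)
  {
    this.maxTweets = maxTweets;
  }

  public void Add(string tweet)
  {
    // A null separator splits on any whitespace; RemoveEmptyEntries drops
    // the empty tokens that repeated spaces would otherwise produce.
    string[] words = tweet.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

    foreach (string w in words)
    {
      string key = w.ToLowerInvariant();
      Queue<string> tweets;

      if (!wordTweets.TryGetValue(key, out tweets))
      {
        tweets = new Queue<string>();
        wordTweets[key] = tweets;
      }

      // Drop the oldest tweet before adding, so the queue never exceeds maxTweets.
      if (tweets.Count >= maxTweets)
      {
        tweets.Dequeue();
      }

      tweets.Enqueue(tweet);
    }
  }

  // Returns the stored tweets for a word, oldest first.
  public string[] TweetsFor(string word)
  {
    Queue<string> tweets;
    return wordTweets.TryGetValue(word.ToLowerInvariant(), out tweets)
      ? tweets.ToArray()
      : new string[0];
  }
}
```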

Mouse-Overs

Mouse-overs are handled by two events:

pnlCloud.MouseMove += OnMouseMove;
pnlCloud.MouseLeave += (sender, args) =>
{
  if (tweetForm != null)
  {
    tweetForm.Close();
    tweetForm=null;
    mouseWord=String.Empty;
  }
};

When the mouse leaves the owner-draw panel, we close the form displaying the tweets and reset everything to a "not showing tweets" state.

When the user moves the mouse over the owner-draw panel, we check whether the mouse coordinates are inside the rectangle displaying a word. There's some logic to update the existing tweet form or create a new one if one isn't displayed:

/// <summary>
/// Display tweets for the word the user is hovering over.
/// If a tweet popup is currently displayed, move popup window until the mouse is over a different word.
/// </summary>
protected void OnMouseMove(object sender, MouseEventArgs args)
{
  var hits = wordNodeMap.Where(w => w.Value.Region.Contains(args.Location));
  Point windowPos = PointToScreen(args.Location);
  windowPos.Offset(50, 70);

  if (hits.Count() > 0)
  {
    string word = hits.First().Key;
    TextNode node = hits.First().Value;

    if (mouseWord == word)
    {
      tweetForm.Location = windowPos;
    }
    else
    {
      if (tweetForm == null)
      {
        tweetForm = new TweetForm();
        tweetForm.Location = windowPos;
        tweetForm.Show();
        tweetForm.TopMost = true;
      }

      // We have a new word.
      tweetForm.tbTweets.Clear();
      ShowTweets(word);
      mouseWord = word;
    }
  }
  else
  {
    // Just move the window.
    if (tweetForm != null)
    {
      tweetForm.Location = windowPos;
      tweetForm.TopMost = true;
    }
  }
}
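
The hit test itself is just a rectangle lookup. As a sketch, it can be pulled into a helper that evaluates the query once and returns null when nothing is under the mouse. HitTest and WordAt are hypothetical names, and the dictionary of rectangles stands in for the Region property of each node.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

// Hypothetical helper; the dictionary stands in for the nodes' Region rectangles.
public static class HitTest
{
  // Returns the first word whose rectangle contains the point, or null if none does.
  public static string WordAt(Dictionary<string, Rectangle> regions, Point location)
  {
    return regions
      .Where(r => r.Value.Contains(location))
      .Select(r => r.Key)
      .FirstOrDefault();
  }
}
```

Rectangle.Contains includes the left and top edges but excludes the right and bottom edges, so a point exactly on the right edge does not count as a hit.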

The result is a popup window that moves with the mouse as the user moves around the owner-draw panel.

The Force Directed Graph

If you look at Bradley Smith's original FDG code, you'll notice that I've changed a few things. For one, I'm not drawing the force lines, only the nodes:

foreach (Node node in nodes)
{
  PointF destination = ScalePoint(node.Location, scale);

  Size nodeSize = node.Size;
  RectangleF nodeBounds = new RectangleF(center.X + destination.X - (nodeSize.Width / 2), center.Y + destination.Y - (nodeSize.Height / 2), nodeSize.Width, nodeSize.Height);
  node.DrawNode(graphics, nodeBounds);
}

The original code was also simply drawing spots, so I extended the SpotNode class to be able to draw text as well:

public class TextNode : SpotNode
{
  protected int count;

  public int Count 
  {
    get { return count; }
  }

  public Rectangle Region { get; set; }

  public DateTime CreatedOn { get; set; }
  public DateTime UpdatedOn { get; set; }

  public static Dictionary<int, Font> fontSizeMap = new Dictionary<int, Font>();

  protected string text;

  public TextNode(string text, PointF location)
    : base()
  {
    this.text = text;
    Location = location;
    count = 1;
    CreatedOn = DateTime.Now;
    UpdatedOn = CreatedOn;
  }

  /// <summary>
  /// Update the UpdatedOn timestamp when incrementing the count.
  /// </summary>
  public void IncrementCount()
  {
    ++count;
    UpdatedOn = DateTime.Now;
  }

  /// <summary>
  /// Do NOT update the UpdatedOn timestamp when decrementing the count.
  /// Also, do not allow the count to go 0 or negative.
  /// </summary>
  public void DecrementCount()
  {
    if (count > 1)
    {
      --count;
    }
  }

  public override void DrawNode(Graphics gr, RectangleF bounds)
  {
    // base.DrawNode(gr, bounds);

    Font font;
    int fontSize = Math.Min(8 + Count, 36);

    if (!fontSizeMap.TryGetValue(fontSize, out font))
    {
      font = new Font(FontFamily.GenericSansSerif, fontSize);
      fontSizeMap[fontSize] = font;
    }

    // Create a color based on count, from 1 to a max of 24
    // Count (or count) is the true count. Here we limit the count to be between 1 and 24.
    int count2 = Math.Min(count, 24);

    if (count2 >= twitterWordCloud.AppForm.CountThreshold)
    {
      int blue = 255 * (24 - count2) / 24;
      int red = 255 - blue;
      Brush brush = new SolidBrush(Color.FromArgb(red, 0, blue));

      SizeF strSize = gr.MeasureString(text, font);
      PointF textCenter = PointF.Subtract(bounds.Location, new Size((int)strSize.Width / 2 - 5, (int)strSize.Height / 2 - 5));
      Region = Rectangle.FromLTRB((int)textCenter.X, (int)textCenter.Y, (int)(textCenter.X + strSize.Width), (int)(textCenter.Y + strSize.Height));

      gr.DrawString(text, font, brush, textCenter);

      brush.Dispose();
    }
  }
}

This class also colorizes the text, and each node, being a unique word, keeps track of its hit count and created/updated dates.
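
The size and color arithmetic in DrawNode is easy to check by hand once it is separated from the drawing calls. The sketch below copies the two formulas into a hypothetical WordStyle class; it assumes the count is at least 1, as it always is for a TextNode.

```csharp
using System;

// Hypothetical class holding the same formulas that DrawNode uses.
public static class WordStyle
{
  // 8pt plus the hit count, capped at 36pt.
  public static int FontSize(int count)
  {
    return Math.Min(8 + count, 36);
  }

  // Blue fades out and red fades in as the count approaches 24.
  public static void RedBlue(int count, out int red, out int blue)
  {
    int capped = Math.Min(count, 24);
    blue = 255 * (24 - capped) / 24;   // integer division, as in DrawNode
    red = 255 - blue;
  }
}
```

A count of 1 gives a 9pt font with red 11 and blue 244, a count of 12 gives 20pt with red 128 and blue 127, and a count of 24 gives 32pt with red 255 and blue 0. The font keeps growing until a count of 28, where it reaches the 36pt cap.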

I also removed the asynchronous behavior of the FDG that Bradley had originally implemented. Also removed was the detection for when to stop iterating -- the graph iterates forever, which is evident in the constant wiggle of the central spot. Various tweaks were also made to better support adding / removing nodes.
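
For readers who have not met force directed graphs before, here is a generic sketch of what a single synchronous iteration can look like. It is not Bradley Smith's implementation and not the code in this project: the Body and ForceLayout names, the inverse-square repulsion, the single spring to the root, and all of the constants are assumptions chosen only to illustrate the idea.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type for this illustration only.
public class Body
{
  public double X, Y, Vx, Vy;
}

public static class ForceLayout
{
  // Assumed constants, picked for the demo rather than taken from any library.
  public const double Repulsion = 10000.0;
  public const double Stiffness = 0.05;
  public const double SpringLength = 100.0;
  public const double Damping = 0.5;

  // One iteration: each body is pushed away from every other body
  // and pulled toward (or pushed away from) the root by a spring.
  public static void Step(Body root, List<Body> bodies)
  {
    foreach (Body b in bodies)
    {
      double fx = 0, fy = 0;

      foreach (Body other in bodies)
      {
        if (ReferenceEquals(b, other)) continue;

        double dx = b.X - other.X, dy = b.Y - other.Y;
        double d = Math.Max(Math.Sqrt(dx * dx + dy * dy), 1.0);  // avoid dividing by zero
        double f = Repulsion / (d * d);
        fx += f * dx / d;
        fy += f * dy / d;
      }

      double rx = root.X - b.X, ry = root.Y - b.Y;
      double rd = Math.Max(Math.Sqrt(rx * rx + ry * ry), 1.0);
      double pull = Stiffness * (rd - SpringLength);  // negative when closer than the rest length
      fx += pull * rx / rd;
      fy += pull * ry / rd;

      b.Vx = (b.Vx + fx) * Damping;
      b.Vy = (b.Vy + fy) * Damping;
    }

    // Move the bodies only after all forces have been computed from the old positions.
    foreach (Body b in bodies)
    {
      b.X += b.Vx;
      b.Y += b.Vy;
    }
  }
}
```

Calling Step once per paint, with no convergence test, gives the "iterates forever" behavior described above. One limitation of this sketch is that two bodies at exactly the same position exert no force on each other, so it assumes the bodies start at distinct points.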

Conclusion

This was a fun little project to throw together using some great existing work and just writing a bit of logic around the Twitter and FDG pieces to glue them together. I found some disappointing things:

The vast majority of the "newsy" tweets are re-tweeted, often disproportionately skewing the hit counts.
"Original" tweets are for the most part rather boring, being simply paraphrases of other tweets.
People really only tweet about mainstream things. You won't find people tweeting about global warming or alternative currencies.

I also found some interesting things:

You can discover interesting linkages between subjects. For example, when watching the feed on "Obama", I saw "BigBird" and discovered that Michelle Obama was meeting with Sesame Street's BigBird. A good thing for the First Lady to be doing!
I "read" about the oil tanker train wreck in West Virginia first through this program as a result of filtering on "oil" and saw keywords like "train" and "derailment" having significant hit counts.
It definitely looks possible to perform sentiment analysis on tweets -- there are many single hit count words that convey sentiment: "angry", "happy", "scared", "disappointed", and so forth.
Media hoopla doesn't necessarily translate into tweets. For example, the tweet volume on Apple getting into the electric car market was non-existent, leading me to the conclusion that most people have a "who cares?" attitude to that news event.

There are also some other fun things that could be done, such as plotting the tweets on a map, filtering tweets by geolocation, extending the filtering mechanism for "and" vs. "or" filter tracks, and so forth. This applet really just scratches the surface of what can be done.