Twitter API has changed a bit so this post had to be updated. Check the updated post here
Topic modelling means detecting “abstract” topics from a collection of text documents. The most common text book technique to do that is using Latent Dirichlet Allocation. Simply put, LDA is a statistical algorithm which takes documents as input and produces a list of topics. One catch is that you have to tell it how many topics you want. There’s much more to it but since this is not a tutorial post, I will stop here. (If you are interested in how it works, read the references given on the wiki page.)
I was playing around with tweets and topics yesterday. Unfortunately, I couldn’t find any javascript based LDA implementation. So I wrote one. Or to be more accurate, I converted an existing simple one-class implementation to javascript. To check how it works on real data, I need a tool with some documents. So I wrote that too.
Here’s twopicate, the output of about half a weekend of intermittent coding. You enter a search term, tell it how many topics you want, and press the button. It pulls tweets about that term from twitter and extracts topics for them. Each topic is represented as a word cloud (visible on the right). The larger a word, the more weight it has in the topic. The source tweets are on the left. Each tweet has a bar which shows the percentage distribution of topics for that tweet. You can try it yourself by clicking below.
try twopicate
Since it’s a javascript only solution, it runs in your browser and is consequently a bit slow. You might have to wait a minute after pressing the button.
Update: Twitter API is no more. The current implementation is using a random news feed.