There are plenty of web apps that report your Twitter stats, but at times it’s better to do things yourself. I haven’t done any fun coding for ages, so last night I finally got around to writing a small program to gather Twitter word statistics. The fun part was doing everything with Unix tools. Here’s a small script that displays the 10 most used words in the tweets of any Twitter ID. I have only tested it under Cygwin, so this is probably the best place to say “USE AT YOUR OWN RISK”.

Here’s how it works.

  1. downloads all status information into a directory
  2. extracts the status message lines
  3. does some regex magic and filters out stop words like the, a, an etc. (I haven’t seen this done anywhere before, but the join command comes in handy for processing stop words)
  4. displays the top 10 most frequent words (and emoticons)
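Steps 3 and 4 can be sketched roughly as follows. This is not the actual tword.sh, just a minimal illustration of the join trick, under a few assumptions: the function name `top_words` is made up, the stop word file holds one lowercase word per line, and the punctuation stripping is a guess rather than the original regex. The key point is that `join -v 1` prints only the lines of its first input that have no match in the second, which is exactly a stop word filter once both inputs are sorted the same way.

```shell
#!/bin/sh
# top_words STOPFILE: hypothetical helper, not the original tword.sh.
# Reads tweet text on stdin and prints the 10 most frequent words
# that are not listed in STOPFILE (one lowercase stop word per line).
top_words() {
    stopfile=$1
    sorted_stop=$(mktemp) || return 1

    # join needs both inputs sorted with the same collation, so pin LC_ALL=C.
    LC_ALL=C sort -u "$stopfile" > "$sorted_stop"

    # Lowercase (so ":D" becomes ":d"), drop some punctuation (a guess),
    # put one word per line, then keep only words absent from the stop list.
    tr '[:upper:]' '[:lower:]' |
        tr -d '.,!?"' |
        tr -s '[:space:]' '\n' |
        grep -v '^$' |
        LC_ALL=C sort |
        LC_ALL=C join -v 1 - "$sorted_stop" |
        uniq -c |
        LC_ALL=C sort -k1,1nr -k2 |
        head -n 10 |
        awk '{ print $1, $2 }'

    rm -f "$sorted_stop"
}
```

Assuming the tweet text has already been extracted into a file, it could be used as `top_words stopwords.txt < tweets.txt` (both file names are placeholders).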

Twitter limits the number of messages you can download to 3200. Also, the Twitter ID’s timeline has to be public for this script to work. All you need to do is download the script file and the stop word list, keep them in the same directory, and run the script with the Twitter ID on the command line. You’ll get a list of words with the frequency at the start of each line. For example,

$ ./tword.sh barackobama
161 watch
119 live
92 http://mybarackobamacom/livestream
81 health
63 reform
55 today
52 rally
48 #hc09
47 &
38 vote

The script takes time to complete, so be patient. As you may have noticed, there are still HTML entities in the output (note the & entry above). You can remove them by piping the text through any html2text program. There’s a small Perl script in the zipfile which does this processing; to use it, you need to install HTML::Entities through CPAN and pipe the text through it inside the script. With that in place, the output brings in a new word, “change”.

$ ./tword.sh barackobama
161 watch
119 live
92 http://mybarackobamacom/livestream
83 health
68 change
63 reform
55 today
55 rally
48 #hc09
39 vote
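If installing a CPAN module is too much trouble, the HTML cleanup can be approximated with plain sed. The sketch below is an assumption-laden stand-in for the Perl script in the zipfile, not a copy of it: the function name `decode_basic_entities` is made up, and it only handles a handful of common entities, whereas HTML::Entities covers far more.

```shell
#!/bin/sh
# decode_basic_entities: hypothetical filter, stdin to stdout.
# Decodes only a few common HTML entities; &amp; is handled last
# so that text like "&amp;lt;" is not decoded twice.
decode_basic_entities() {
    sed -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e "s/&#39;/'/g" \
        -e 's/&amp;/\&/g'
}
```

It would sit in the pipeline before the words are split and counted, so that an entity is decoded instead of being tallied as a word of its own.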

My list toppers are good, :D, time, day, twitter, read, hope, back, :p and make. I wonder if this makes me a happy person :)