Making a copy of WEKA Instances

This ‘thing’ took about 30 minutes to figure out. According to the WEKA documentation, if you add a new Instance to an existing Instances object, string values are not transferred! If you are copying a dataset with a string attribute, you need to transfer the string manually. The code segment below copies the i-th instance from source to dest, where the first attribute (at index 0) is a string attribute....

April 12, 2010

Google and Urdu Stemming

Is Google (finally) stemming Urdu? The last time I checked, they were doing something like a transliteration-based search, but in the screenshot below, you can see that searching for the phrase ان پڑھ چٹا shows some stemming is being used. Does anyone know anything? Oh, and while I’m on this topic, I would also like to know why it is called چٹا ان پڑھ?

March 5, 2010

Google as a Question Answering System

A Question Answering (QA) system is an Information Retrieval system which gives the answer to a question posed in natural language. For example, if you ask it Who wrote Hamlet?, it should answer Shakespeare. A few years ago (don’t ask me how many), search engines did not focus on natural language queries. Recently [sic], Google has started incorporating some NLP (Natural Language Processing) in their results. You can try it out by typing the same question in the search box yourself (or clicking here)....

February 6, 2010

Visualizing Citation Networks

For techies: I’ve been working on citation networks lately. You can visualize such a network as a graph. In this graph, the nodes represent publications (papers, articles, etc.) and the edges represent citations between them. The graph above was produced using GraphViz. The data is from the ACL Anthology Network, which contains publications from the publicly available ACL Anthology. For non-techies: Oooooo! Pretty picture!
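As a rough illustration of the idea, here is a minimal Python sketch that turns a list of citations into GraphViz DOT source. The function name `citations_to_dot` and the paper IDs are made up for this example; this is not the script used for the graph in the post, and it does not read the ACL Anthology Network data.

```python
def citations_to_dot(citations):
    """Return GraphViz DOT source for a list of (citing, cited) pairs."""
    lines = ["digraph citations {"]
    # One node per publication that appears anywhere in the citation list.
    nodes = sorted({paper for pair in citations for paper in pair})
    for paper in nodes:
        lines.append(f'    "{paper}";')
    # One directed edge per citation, pointing from the citing paper
    # to the paper it cites.
    for citing, cited in citations:
        lines.append(f'    "{citing}" -> "{cited}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical paper IDs, for illustration only.
sample_citations = [("P3", "P1"), ("P3", "P2"), ("P2", "P1")]
dot_source = citations_to_dot(sample_citations)
print(dot_source)
```

If the output is saved to a file such as citations.dot, GraphViz can render it with `dot -Tpng citations.dot -o citations.png`.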

February 4, 2010

A Typical Day of Research (and why I hate Depth First Search)

 

January 29, 2010

Online English to Urdu Translator

While all the online English to Urdu translators that I have seen don’t really work that well (read: suck), if we make use of the overlapping vocabulary and grammar of Hindi and Urdu along with Google’s translation API, things come out pretty decent (as mentioned in my previous post). Here’s a small, 15-minute first-cut script which just uses English to Hindi translation and then transliterates from Hindi to Urdu....

January 23, 2010

How do you transliterate that?

I am thinking of using Google’s English to Hindi translation and hooking it to a Hindi to Urdu transliterator to get an approximate English to Urdu translation. The Hindi to English transliteration provided by Google has some errors which might not be there if we convert directly to Urdu. For example, on translating the sentence It can be used in Urdu too, we get the Hindi translation यह उर्दू में इस्तेमाल किया जा सकता है...

January 21, 2010

old fog

کھڑکی سے جھانکتی ہے کسے بار بار دُھند

January 19, 2010

Custom Resolution in Remote Desktop

I have a 1920x1080 desktop at work, but when I use remote desktop to connect to home, it automatically resizes to my compact 1024x768 desktop. Most programs don’t seem to have a problem, but I was working on Weka KnowledgeFlow and one of my flows, originally designed at the higher resolution, never showed a horizontal scroll bar. It might just be a Java thing. In short, I had to look for a way to connect remotely at a higher resolution than that of the local machine....

January 5, 2010

What do you tweet about?: A shell script for getting the most frequent words for Twitter

There are a lot of web apps around which report your Twitter stats. But at times, it’s better to do things yourself. I haven’t done any fun coding for ages now, so last night I finally got around to making a small program to gather Twitter word statistics. The fun part was doing everything with Unix tools. Here’s a small script file which displays the 10 most used words in the tweets for any Twitter ID....
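For comparison, the counting step can be sketched in a few lines of Python. This is not the shell script from the post: it skips fetching tweets entirely and works on a hard-coded list of made-up sample tweets, and the function name `top_words` is an assumption for this example.

```python
import re
from collections import Counter


def top_words(tweets, n=10):
    """Return the n most frequent words across tweets as (word, count) pairs."""
    words = []
    for tweet in tweets:
        # Lowercase the text and keep runs of letters and apostrophes.
        # URLs, hashtags and mentions are split into plain words by this rule.
        words.extend(re.findall(r"[a-z']+", tweet.lower()))
    return Counter(words).most_common(n)


# Made-up sample tweets, for illustration only.
sample_tweets = [
    "Working on citation networks today",
    "Citation networks make pretty pictures",
    "Pretty sure Weka ate my string attributes today",
]

for word, count in top_words(sample_tweets, n=4):
    print(word, count)
```

A real version would probably also drop common stop words (the, a, is, and so on) before counting, since they tend to crowd out the interesting ones.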

December 18, 2009