Google as a Question Answering System

2010 February 6
by awais

A Question Answering (QA) system is an Information Retrieval system which gives the answer to a question posed in natural language. For example, if you ask it Who wrote Hamlet?, it should answer Shakespeare. A few years ago (don’t ask me how many), search engines did not focus on natural language queries. Recently [sic], Google has started incorporating some NLP (Natural Language Processing) in its results. You can try it out by typing the same question in the search box yourself (or clicking here).

image

During my M.Phil. course, one of the tasks was to build a basic QA system and extend it however we liked. We used the TREC 8 dataset for evaluations. While building the system, I evaluated how current search engines (read Google) performed on this task. For this, I just queried the exact question and used the summaries of the top five results as answers. Evaluating at that time (2008), I got a Mean Reciprocal Rank (MRR) score of 0.212 over 198 questions. 156 questions had no answers found in top 5 responses.
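
For readers who have not met MRR before: each question scores 1/rank of the first returned summary that contains a correct answer, or 0 if none of the top five does, and the scores are averaged over all questions. Here is a minimal Python sketch of that calculation. The toy questions, the function names and the plain substring matching are all assumptions made for illustration; this is not the evaluation code used for the numbers above.

```python
from typing import List, Sequence, Tuple


def reciprocal_rank(answers: Sequence[str], gold: Sequence[str], top_n: int = 5) -> float:
    """Return 1/rank of the first top-n candidate containing a gold answer, else 0."""
    for rank, candidate in enumerate(answers[:top_n], start=1):
        text = candidate.lower()
        # Assumption: a candidate counts as correct if it contains a gold string.
        if any(g.lower() in text for g in gold):
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Average the reciprocal ranks over all questions."""
    scores = [reciprocal_rank(answers, gold) for answers, gold in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data, made up for illustration (not taken from TREC 8).
runs = [
    (["Hamlet is a tragedy by William Shakespeare.", "Unrelated page."], ["Shakespeare"]),
    (["No idea.", "Paris is the capital of France."], ["Paris"]),
    (["Nothing relevant.", "Still nothing."], ["Everest"]),
]

mrr = mean_reciprocal_rank(runs)
unanswered = sum(1 for answers, gold in runs if reciprocal_rank(answers, gold) == 0.0)
print(f"MRR = {mrr:.3f}, unanswered = {unanswered}")
```

The three toy questions score 1, 1/2 and 0, so the sketch reports an MRR of 0.5 with one unanswered question.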

This term, I am demonstrating for the same task. Demonstrators are usually PhD students who provide help and guidance to junior students. For pure geek fun and lack of better things to do while taking a break, I decided to quickly jot down a JavaScript (read jQuery) based QA system. This time, the resulting MRR score over 198 questions was 0.384, while only 79 questions had no answers found in the top 5 responses.

The results clearly show that over the last two years, Google has significantly improved at answering natural language queries. In fact (IIRC), my baseline system back in 2008 (based on RMRS-based matching of sentences from the top 100 documents returned by an IR system) could only achieve an MRR score of approximately 0.290, so the current results are much better than that baseline. I hope this decade sees some more developments/improvements in QA systems and I can ask a system What do you get if you multiply six by nine?

I’ve always said there was something fundamentally wrong with the universe. ~Arthur Dent

Visualizing Citation Networks

2010 February 4

aclnet

For techies: I’ve been working on citation networks lately. You can visualize such a network as a graph. In this graph, the nodes represent publications (papers, articles, etc.) and the edges represent citations between them. The graph above was produced using GraphViz. The data is from the ACL Anthology Network, which contains publications from the publicly available ACL Anthology.
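
If you want to try something similar, a citation list can be turned into GraphViz's DOT format with a few lines of Python. This is only a sketch under assumptions: the paper ids below are hypothetical placeholders (not real ACL Anthology ids), and this is not the script that produced the picture above.

```python
from typing import Iterable, List, Tuple


def citations_to_dot(citations: Iterable[Tuple[str, str]]) -> str:
    """Build DOT text with one node per paper and one edge per citation."""
    citations = list(citations)
    papers = sorted({paper for edge in citations for paper in edge})
    lines: List[str] = ["digraph citations {"]
    for paper in papers:
        lines.append(f'    "{paper}";')
    for citing, cited in citations:
        # Edge direction: the citing paper points at the paper it cites.
        lines.append(f'    "{citing}" -> "{cited}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical paper ids, for illustration only.
citations = [("P2", "P1"), ("P3", "P1"), ("P3", "P2")]
dot_text = citations_to_dot(citations)
print(dot_text)
```

Saving the printed text as citations.dot and running `dot -Tpng citations.dot -o citations.png` should render it as an image.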

For non-techies: Oooooo! pretty picture!

A Typical Day of Research (and why I hate Depth First Search)

2010 January 29
by awais

 

image

Online English to Urdu Translator

2010 January 24

While all the online English to Urdu translators that I have seen don’t really work that well (read suck), if we make use of the overlapping vocabulary and grammar of Hindi and Urdu along with Google’s translation API, things come out pretty decent (as mentioned in my previous post). Here’s a small 15 min first cut script which just uses English to Hindi translation and then transliterates from Hindi to Urdu. Feel free to use the code and do ping me if you improve something. This works as a Hindi to Urdu transliterator as well.



(Thanks to عزت مآب جناب آغا علی رضا قزلباش رحمتہ اللہ علیہ who graciously sent me his term report on Hindi to Urdu transliteration, from where I’ve copied (and modified) the character mapping.)

How do you transliterate that?

2010 January 21
by awais

I am thinking of using Google’s English to Hindi translation and hooking it to a Hindi to Urdu transliterator to get an approximate English to Urdu translation. The Hindi to English transliteration provided by Google has some errors which might not be there if we convert directly to Urdu. For example, on translating the sentence

It can be used in Urdu too,

we get the Hindi translation

यह उर्दू में इस्तेमाल किया जा सकता है

and the Roman transliteration of the Hindi translation

 yaha urdū mēṁ istēmāla kiyā jā sakatā hai.

If you notice the first word, it should have been transliterated to “yeh”. Instead, we get a phonetic transliteration which is made up of two letters, ya and ha. Transliteration from Hindi to Urdu directly would have avoided that error. There’s a nice paper titled “Hindi to Urdu Conversion: Beyond Simple Transliteration” which lists problems faced in simple character-to-character transliteration from Hindi to Urdu. Whenever I get some time, I’ll try to cook up some JavaScript code quickly. Until then, the idea is open. Any takers?
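
To see where “yaha” comes from, here is a tiny Python sketch of letter-by-letter romanization. It is an illustration under assumptions: the character tables cover only the handful of letters needed for three words from the example, the function names are made up, and the word-level exception list is a hypothetical fix, not how Google or the paper above handles it.

```python
# Tiny, deliberately incomplete tables (just enough for the examples below).
CONSONANTS = {"य": "y", "ह": "h", "र": "r", "द": "d", "म": "m"}
INDEPENDENT_VOWELS = {"उ": "u"}
MATRAS = {"\u0942": "ū", "\u0947": "ē"}  # dependent vowel signs UU and E
VIRAMA = "\u094d"    # suppresses the inherent vowel of a consonant
ANUSVARA = "\u0902"  # nasalisation mark

# Hypothetical word-level exceptions where letter-by-letter output is wrong.
EXCEPTIONS = {"यह": "yeh"}


def romanize(word: str) -> str:
    """Naive romanization: every consonant gets an inherent 'a' unless a
    vowel sign or a virama follows it."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt not in MATRAS and nxt != VIRAMA:
                out.append("a")
        elif ch in MATRAS:
            out.append(MATRAS[ch])
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        elif ch == ANUSVARA:
            out.append("ṁ")
        elif ch != VIRAMA:
            out.append(ch)  # pass through anything the tables do not cover
    return "".join(out)


def transliterate(word: str) -> str:
    """Check the exception list first, then fall back to the naive rule."""
    return EXCEPTIONS.get(word, romanize(word))


for word in ["यह", "उर्दू", "में"]:
    print(word, romanize(word), transliterate(word))
```

The naive rule is fine for the second and third words but turns the first one into ya + ha, which is exactly the error described above; only the word-level lookup gets “yeh”.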

old fog

2010 January 19
by awais

old fog

کھڑکی سے جھانکتی ہے کسے بار بار دُھند (Whom does the fog keep peeking at through the window, again and again?)

Custom Resolution in Remote Desktop

2010 January 5
by awais

horimon

I have a 1920×1080 desktop at work, but when I use remote desktop to connect to home, it automatically resizes to my compact 1024×768 desktop. Most programs don’t seem to have a problem, but I was working on Weka KnowledgeFlow and one of my flows, originally designed on the higher resolution, never showed a horizontal scroll. It might just be a Java thing. In short, I had to look for a way to connect remotely at a higher resolution than that of the local machine. Luckily, you can specify a custom resolution for the RDC using a command line switch (more here). The command line below gave me enough space to fix the flow. I hope this helps someone out there.

mstsc /w:1280 /h:1024

The picture above is my office machine when I was trying a horizontal flip. It works when you have many consoles open but the bottom part gets for browsing/coding, it’s not that great.

What do you tweet about? : A shell script for getting most frequent words for twitter

2009 December 19
by awais

There are a lot of web apps around which report your Twitter stats. But at times, it’s better to do things yourself. I haven’t done any fun coding for ages now, so last night I finally got around to making a small program to gather Twitter word statistics. The fun part was to do everything using Unix tools. Here’s a small script file which displays the 10 most used words in the tweets for any Twitter id. I have only tested it under Cygwin, so this is probably the best place to say “USE AT YOUR OWN RISK”.

Here’s how it works.

  1. downloads all status information in a directory
  2. extracts the status message lines
  3. does some regex magic and filters stop words like the, a, an etc. (I haven’t seen this done anywhere earlier, but the join command comes in handy for processing stop words)
  4. displays the top 10 most frequent words (and emoticons)
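
If you would rather not read shell, steps 3 and 4 look roughly like this in Python. This is a sketch, not the script itself: the script uses Unix tools (join and friends), while this version swaps in a Counter, and the stop word set, the sample tweets and the function name are made up for illustration. Downloading the statuses (steps 1 and 2) is left out.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# A made-up, tiny stop word set; the real list is much longer.
STOP_WORDS = {"the", "a", "an", "to", "on", "is", "for"}


def top_words(tweets: Iterable[str], stop_words: Set[str] = STOP_WORDS,
              n: int = 10) -> List[Tuple[str, int]]:
    """Count lowercased, whitespace-separated tokens, skipping stop words."""
    counts: Counter = Counter()
    for tweet in tweets:
        for token in tweet.lower().split():
            # Strip trailing punctuation only, so hashtags and URLs survive.
            word = token.rstrip(".,!?")
            if word and word not in stop_words:
                counts[word] += 1
    return counts.most_common(n)


# Made-up sample tweets, for illustration only.
tweets = [
    "Watch the rally live today!",
    "Live health reform rally",
    "Watch live on the web",
]

for word, freq in top_words(tweets, n=3):
    print(freq, word)
```

The output uses the same layout as the script: the frequency first, then the word.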

Twitter limits the number of messages that you can download (3200). Also, the Twitter id timeline has to be public for this script to work. All you need to do is download the script file and stop word list, keep them in the same directory, and run the script with the Twitter id on the command line; you’ll get the list of words with the frequency at the start of each line. For example,

$ ./tword.sh barackobama
161 watch
119 live
92 http://mybarackobamacom/livestream
81 health
63 reform
55 today
52 rally
48 #hc09
47 &
38 vote

The script takes time to complete, so be patient. As you may have noticed, there are still HTML entities inside. You can remove them by piping in any html2text program. There’s a small Perl script in the zipfile which does this processing; to use it, you will need to install HTML::Entities through CPAN and pipe it into the script. The output now brings in a new word, “change”.

$ ./tword.sh barackobama
161 watch
119 live
92 http://mybarackobamacom/livestream
83 health
68 change
63 reform
55 today
55 rally
48 #hc09
39 vote

My list toppers are good, :D, time, day, twitter, read, hope, back, :p and make. I wonder if this makes me a happy person :)

certainty

2009 November 24
by awais

image

JabRef and Google Scholar

2009 November 14
by awais

I can’t seem to find any way to import the bib entries provided by Google Scholar into JabRef directly. You can enable the Import into BibTeX link from the preferences, but it streams the bib file as text/plain, which opens up in the browser. You can save it and import it, but that wastes a lot of clicks. The easiest option is to copy-paste all the text into a new JabRef entry (Ctrl+N). The default settings leave the double curly braces in the title (to preserve case), which can be removed by enabling the Remove double braces… checkbox in the File tab of Options/Preferences. This works for JabRef 2.5.