Category Archives: Language

Urdu Sentiment Lexicon

With the increasing number of “opinion-dispensing apps” on the web that let Urdu users write in Unicode, there is (or soon will be) a need to get some meaningful statistics out of the ever-present sentiment of the masses (or at least the web-savvy subset). This calls for resources which enable automatic processing of sentiment, one of which is a sentiment lexicon for Urdu. (For people uninitiated in computational linguistics, a lexicon is just a list of words.) Since I couldn’t find any sentiment lexicon available for Urdu on the tubes, I decided to put in some effort and create a new one.

Click here to check out the Urdu Sentiment Lexicon

image

The Urdu Sentiment Lexicon is a list of 2,607 positive and 4,728 negative sentiment/opinion words for Urdu. It is based on a similar list for English available here. The English words have been translated to Urdu automatically using a dictionary lookup. All resulting Urdu synonyms have been included as well. The lexicon has also been manually inspected (but very quickly) and any irrelevant words have been deleted.

To test things out, I’ve also developed a simple JavaScript application which changes the color of the sentiment words according to their polarity. It also calculates the background color of the whole text using the total polarity score of the text (+1 for each positive word, –1 for each negative word). A screenshot is given above. Like all sentiment lexica, this won’t be perfect. But I am hoping this will give my fellow researchers who work on Urdu sentiment analysis a starting point and save them some time.
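The scoring idea is simple enough to sketch in a few lines. The following is a minimal Java illustration, not the actual JavaScript application: the class and method names are made up for this example, and the two-word lexicon (اچھا, "good", and برا, "bad", written as Unicode escapes) is only a stand-in for the real word lists. It assumes words are separated by whitespace and does not strip punctuation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PolarityScore {

    // Adds +1 for every token found in the positive list and -1 for every
    // token found in the negative list; all other tokens are ignored.
    static int score(String text, Set<String> positive, Set<String> negative) {
        int total = 0;
        // Assumption: tokens are separated by whitespace only.
        for (String token : text.trim().split("\\s+")) {
            if (positive.contains(token)) {
                total += 1;
            } else if (negative.contains(token)) {
                total -= 1;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in lexicon entries (hypothetical, not taken from the real lists).
        String good = "\u0627\u0686\u06BE\u0627"; // "achha" (good)
        String bad = "\u0628\u0631\u0627";        // "bura" (bad)
        Set<String> positive = new HashSet<>(Arrays.asList(good));
        Set<String> negative = new HashSet<>(Arrays.asList(bad));

        // Two positive words and one negative word: 1 + 1 - 1
        System.out.println(score(good + " " + good + " " + bad, positive, negative)); // prints 1
    }
}
```

A total above zero would map to a "positive" background color and a total below zero to a "negative" one; how the score is turned into an actual color is left open here.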

 


Twingual: A twitter client for bilingual tweeple

In my last post, I highlighted some problems that I face daily while using Twitter in Urdu as well as in English. A few days ago, I decided to experiment with the Twitter API and write my own client to fix some of these problems. You can see the result at www.twingual.com. It is a JavaScript-only Twitter client which supports neat Nastaleeq Urdu fonts as well as transliteration.

It’s a work in progress and does not implement all Twitter features. If you like it and want to see something you need every day implemented, feel free to send a tweet.

Meanwhile, tweet away!

Nastaleeq Urdu Typesetting: When will they get it right?

Last night, I read about the new Nastaleeq font available in Windows 8 and I just had to check it out. After leaving my machine up all night to install the consumer preview, I finally had time to examine the new “Urdu Typesetting” font a while ago. Although Microsoft explicitly states it to be a ‘document’ font, it never hurts to check out how it behaves in a web UI setting. Here’s a screenshot of how the Twitter Urdu page would look with the font. I had to do some CSS overriding to get that right (body.ur for the curious).

image

Urdu Typesetting (pictured above)

While it does not look that bad, what bugs me is the fact that the English characters in the font are nowhere near good enough. The extra kerning for Urdu is probably to blame, but as it turns out, I haven’t been able to find a single Nastaleeq font which can render English as well as Urdu characters legibly enough for use in web pages. The nearest you can get is Alvi Nastaleeq v1.0.0 (screenshot below).

image

Alvi Nastaleeq v1.0.0 (pictured above)

But even this font doesn’t quite give a look elegant enough for professional web pages. Until someone is ambitious enough to tackle this problem, we probably won’t see any usable Urdu+English interfaces. Any solution for us bilinguals will have to handle bidirectional (bidi) text as well. Meanwhile, an alternative is to either detect the language and add spans with different fonts, or simply let go of your desire to see Urdu Nastaleeq and switch to Helvetica.
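The "detect language and add spans" workaround could look roughly like the sketch below. It is only an illustration under a few assumptions: script detection stands in for language detection (every Arabic-script run is treated as Urdu), the CSS class name urdu-text is hypothetical, and the input is assumed to be already HTML-escaped. Spaces, digits and punctuation belong to no particular script, so they stay in whichever run they follow.

```java
import java.lang.Character.UnicodeScript;

public class ScriptSpans {

    // Hypothetical class name; a stylesheet would attach the Nastaleeq font to it.
    static final String OPEN = "<span class=\"urdu-text\">";
    static final String CLOSE = "</span>";

    // Wraps every run of Arabic-script characters in a span so that Urdu and
    // English text can be given different fonts.
    static String wrapUrdu(String text) {
        StringBuilder out = new StringBuilder();
        boolean inUrdu = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.ARABIC) {
                if (!inUrdu) {
                    out.append(OPEN);
                    inUrdu = true;
                }
            } else if (script != UnicodeScript.COMMON && script != UnicodeScript.INHERITED) {
                // A character from another script (e.g. Latin) ends the Urdu run.
                if (inUrdu) {
                    out.append(CLOSE);
                    inUrdu = false;
                }
            }
            out.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (inUrdu) {
            out.append(CLOSE);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Hello" followed by the word "Urdu" written in Urdu script.
        System.out.println(wrapUrdu("Hello \u0627\u0631\u062F\u0648"));
    }
}
```

This only takes care of font selection. Bidirectional layout of the mixed text is still left to the browser.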

image

Helvetica (pictured above)

The Floor Code

Computer Lab Floor

If you walk into my department, one of the first things you may notice is that some of the tiles on the floor are black and there’s no particular pattern to them. These tiles actually encode a message. The curious amongst us are supposed to decode this, but despite having spent 3 years in the department, I never found the time until last Friday. The decoding should be pretty simple if you want to try your skills. The last 6 letters of the first word can be read off this picture. If you are too lazy, just click here for the explanation. (Anyone who has taken an Introduction to Computer Science course should at least try for ONE minute before clicking.)

Geek Art!

(Oh! and clicking on the image opens up a high res version.)

How to change font on the BBC Urdu website

Update: BBC has now embedded a new font (BBCNasim) on their website which is quite good. In fact I don’t use this plugin any more myself. It’s not Nastaleeq but it is good enough.

Let’s face it. The font on the BBC Urdu website is not that good. When a friend complained about it on our alumni list, I thought of writing a small Greasemonkey script to take care of the problem. The results are pretty good, as visible in the image below. The left part is the site after installing the Urdu Naskh Asia Type font provided by BBC and before installing the script (and I maintain, Aijaz, it is not good). The right part is after installing the script. Click on the image and you’ll get an un-scaled version.

To install the script, click the link below and follow the installation instructions given there. Currently, it works only on Chrome and Firefox.

BBC Urdu Font Changer @ userscripts.org

and the world is a bit better now…

change

DependenSee: A Dependency Parse Visualisation/Visualization Tool

 

There aren’t many tools which allow you to visualise sentences parsed with dependency grammars. Here’s a small tool which generates a PNG of the dependency graph of a given sentence using the Stanford Parser. You can generate the image for Einey’s quote below by following these steps.

out

  1. Click here to download DependenSee.2.0.5.jar.
  2. Download the latest version of the Stanford Parser.  I am using version 2.0.5 (For older versions, drop me an email)
  3. Extract stanford-parser.jar and stanford-parser-2.0.5-models.jar in the same folder as DependenSee.jar.
  4. On the command prompt, run
    java -cp DependenSee.jar;stanford-parser.jar;stanford-parser-2.0.5-models.jar com.chaoticity.dependensee.Main "Example isn't another way to teach, it is the only way to teach." out.png
    (If you are on *nix, replace the semicolon by a colon and make sure you have Arial installed. If you have an already parsed dependency output file, replace the sentence by -t input.txt .)
  5. Open out.png and admire :)

I have added Part-of-Speech tags and very basic edge overlap management and might add more eye candy later (curved/coloured edges?). You can link the library in your code as well. An example is given below. Comments and queries are welcome. You can also find the source on GitHub.


import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.parser.lexparser.*;
import edu.stanford.nlp.process.*;
import edu.stanford.nlp.trees.*;
import com.chaoticity.dependensee.*;
import java.io.StringReader;
import java.util.Collection;
import java.util.List;

class Test {
    public static void main(String[] args) throws Exception {
        String text = "A quick brown fox jumped over the lazy dog.";

        // Load the English PCFG model bundled in the Stanford Parser models jar.
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
        lp.setOptionFlags(new String[]{"-maxLength", "500", "-retainTmpSubcategories"});

        // Tokenize the sentence and parse it into a phrase-structure tree.
        TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
        List<CoreLabel> wordList = tokenizerFactory.getTokenizer(new StringReader(text)).tokenize();
        Tree tree = lp.apply(wordList);

        // Convert the tree to typed dependencies and draw them with DependenSee.
        GrammaticalStructure gs = gsf.newGrammaticalStructure(tree);
        Collection<TypedDependency> tdl = gs.typedDependenciesCCprocessed(true);
        Main.writeImage(tree, tdl, "image.png", 3);
    }
}

Google and Urdu Stemming

 

Is Google (finally) stemming Urdu? The last time I checked, they were doing something like a transliteration-based search, but in the screenshot below, you can see that searching for the phrase ان پڑھ چٹا shows some stemming is being used. Does anyone know anything? Oh, and while I’m on this topic, I would also like to know why it is called چٹا ان پڑھ?

image

Google as a Question Answering System

A Question Answering (QA) system is an Information Retrieval system which gives the answer to a question posed in natural language. For example, if you ask it Who wrote Hamlet?, it should answer Shakespeare. A few years ago (don’t ask me how many), search engines did not focus on natural-language queries. Recently [sic], Google has started incorporating some NLP (Natural Language Processing) in their results. You can try it out by typing the same question in the search box yourself (or clicking here).

image

During my M.Phil. course, one of the tasks was to build a basic QA system and extend it however we liked. We used the TREC 8 dataset for evaluations. While building the system, I evaluated how current search engines (read Google) performed on this task. For this, I just queried the exact question and used the summaries of the top five results as answers. Evaluating at that time (2008), I got a Mean Reciprocal Rank (MRR) score of 0.212 over 198 questions. 156 questions had no answers found in top 5 responses.
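For readers who haven’t met the metric: MRR is the average, over all questions, of 1/rank of the first correct answer, with a question counting as 0 when no correct answer shows up in the returned list (here, the top five). A minimal sketch of the computation follows; the class name and the example ranks are made up for illustration and are not my evaluation data.

```java
public class MeanReciprocalRank {

    // ranks[i] is the 1-based rank of the first correct answer for question i,
    // or 0 when no correct answer was found in the returned list.
    static double mrr(int[] ranks) {
        if (ranks.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (int rank : ranks) {
            if (rank > 0) {
                sum += 1.0 / rank; // unanswered questions contribute nothing
            }
        }
        return sum / ranks.length;
    }

    public static void main(String[] args) {
        // Four questions: answered at rank 1, unanswered, answered at rank 2, unanswered.
        int[] ranks = {1, 0, 2, 0};
        System.out.println(mrr(ranks)); // (1 + 0 + 0.5 + 0) / 4, prints 0.375
    }
}
```

Because unanswered questions still count in the denominator, a large number of misses drags the score down even when the answers that are found sit near the top.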

This term, I am demonstrating for the same task. Demonstrators are usually PhD students who provide help and guidance to junior students. For pure geek fun and lack of better things to do while taking a break, I decided to quickly jot down a JavaScript (read jQuery) based QA system. This time, the resulting MRR score over 198 questions was 0.384, while only 79 questions had no answers found in the top 5 responses.

The results show clearly that during the last two years, Google has significantly improved on answering NLP queries. In fact (IIRC), my baseline system back in 2008 (based on RMRS based matching of sentences from the top 100 documents returned by an IR system) could only achieve an MRR score of approximately 0.290, showing that the current results are much better than that baseline. I hope this decade sees some more developments/improvements in QA systems and I can ask a system What do you get if you multiply six by nine?

I’ve always said there was something fundamentally wrong with the universe. ~Arthur Dent

Online English to Urdu Translator

While all the online English to Urdu translators that I have seen don’t really work that well (read suck), if we make use of the overlapping vocabulary and grammar of Hindi and Urdu along with Google’s translation API, things come out pretty decent (as mentioned in my previous post). Here’s a small 15 min first cut script which just uses English to Hindi translation and then transliterates from Hindi to Urdu. Feel free to use the code and do ping me if you improve something. This works as a Hindi to Urdu transliterator as well.




(Thanks to عزت مآب جناب آغا علی رضا قزلباش رحمتہ اللہ علیہ who graciously sent me his term report on Hindi to Urdu transliteration, from where I’ve copied (and modified) the character mapping.)
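To give a feel for what a character mapping does, here is a deliberately tiny Java sketch of the transliteration step. It is not the script above and not the mapping from the term report: the class name and the seven-entry table are assumptions made for illustration, covering just enough letters for two example words, किताब (kitab, "book") and नाम (naam, "name"). Characters are written as Unicode escapes.

```java
import java.util.HashMap;
import java.util.Map;

public class HindiToUrduSketch {

    // Tiny illustrative table from Devanagari characters to Urdu strings.
    static final Map<Character, String> MAP = new HashMap<>();
    static {
        MAP.put('\u0915', "\u06A9"); // ka -> kaf
        MAP.put('\u0924', "\u062A"); // ta -> te
        MAP.put('\u092C', "\u0628"); // ba -> be
        MAP.put('\u0928', "\u0646"); // na -> noon
        MAP.put('\u092E', "\u0645"); // ma -> meem
        MAP.put('\u093E', "\u0627"); // vowel sign aa -> alif
        MAP.put('\u093F', "");       // vowel sign i: short vowel, usually left unwritten in Urdu
    }

    // Replaces each character that has a table entry; anything else passes through unchanged.
    static String transliterate(String hindi) {
        StringBuilder out = new StringBuilder();
        for (char c : hindi.toCharArray()) {
            out.append(MAP.getOrDefault(c, String.valueOf(c)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "kitab" in Devanagari comes out as kaf, te, alif, be.
        System.out.println(transliterate("\u0915\u093F\u0924\u093E\u092C"));
        // "naam" in Devanagari comes out as noon, alif, meem.
        System.out.println(transliterate("\u0928\u093E\u092E"));
    }
}
```

A pure one-character-at-a-time table like this ignores context, so a real transliterator needs many more entries plus rules for the cases where the two scripts don’t line up.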

How do you transliterate that?

I am thinking of using Google’s English to Hindi translation and hooking it to a Hindi to Urdu transliterator to get an approximate English to Urdu translation. The Hindi to Roman transliteration provided by Google has some errors which might not be there if we convert directly to Urdu. For example, on translating the sentence

It can be used in Urdu too,

we get the Hindi translation

यह उर्दू में इस्तेमाल किया जा सकता है

and the Roman transliteration of the Hindi translation

 yaha urdū mēṁ istēmāla kiyā jā sakatā hai.

If you look at the first word, it should have been transliterated as “yeh”. Instead, we get a phonetic transliteration made up of the two letters ya and ha. Transliterating from Hindi to Urdu directly would have avoided that error. There’s a nice paper titled “Hindi to Urdu Conversion: Beyond Simple Transliteration” which lists the problems faced in simple character-to-character transliteration from Hindi to Urdu. Whenever I get some time, I’ll try to cook up some JavaScript code quickly. Until then, the idea is open. Any takers?