Generalists vs. Specialists: Evaluating Large Language Models for Urdu

Let’s say you’re feeling a bit under the weather. You’ve got a scratchy throat, a nose that resembles a leaky tap, and a cough that rivals a 2-stroke rickshaw without a silencer. You start by calling up your GP (general practitioner) who, with their broad training in common ailments, reassures you that it’s probably just a mild cold – nothing that a little rest, some tea, and binge-watching a few seasons of your favourite series can’t fix. ...

September 26, 2024

UQA - Corpus for Urdu Question Answering

I think it was around 1999 when I first heard that Urdu is a low-resource language. Twenty-five years later, Urdu is still considered low-resource despite having over 70 million native speakers. This is because the large, manually curated linguistic resources required for training are still not available for Urdu. One way around this barrier is to translate existing corpora from a resource-rich language (cough English cough) to Urdu. This seems like a chicken-and-egg problem, though, since good automatic translation systems themselves require training resources. However, recent multimodal multitask models have reached a level where we can use their translations to set a baseline for more complex tasks like question answering. And we have done exactly that! ...

May 24, 2024

Of Mir and Meerkats

Breakfast Doodle

March 1, 2017

Urdu Sentiment Lexicon

With the increasing number of “opinion-dispensing apps” on the web which enable Urdu users to write in Unicode, there is (or will soon be) a need for getting some meaningful statistics out of the ever-present sentiment of the masses (or at least the web-savvy subset). This calls for resources which enable automatic processing of sentiment, one of which is a sentiment lexicon for Urdu. (For people uninitiated in computational linguistics, a lexicon is just a list of words.) Since I couldn’t find any sentiment lexicon available for Urdu on the tubes, I decided to put in some effort and create a new one. ...
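To make the idea of a sentiment lexicon concrete, here is a minimal sketch of how such a word list could be used. The four entries and the `sentiment_score` helper are hypothetical illustrations, not taken from the lexicon described in this post, and the whitespace tokenization is a deliberate simplification.

```python
# Hypothetical mini-lexicon: word -> polarity (+1 positive, -1 negative).
# These entries are illustrative only; they are not from the actual lexicon.
LEXICON = {
    "اچھا": 1,      # good
    "خوبصورت": 1,   # beautiful
    "برا": -1,      # bad
    "خراب": -1,     # spoiled / bad
}


def sentiment_score(text: str) -> int:
    """Sum the polarity of every whitespace-separated token found in the lexicon."""
    return sum(LEXICON.get(token, 0) for token in text.split())


print(sentiment_score("یہ کھانا اچھا ہے"))  # "this food is good"
print(sentiment_score("موسم خراب ہے"))      # "the weather is bad"
```

A positive total suggests positive sentiment and a negative total the opposite. A real system would also need to handle inflected forms, negation, and punctuation attached to words, none of which this sketch attempts.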

June 14, 2012

Nastaleeq Urdu Typesetting: When will they get it right?

Last night, I read about the new Nastaleeq font available in Windows 8 and I just had to check it out. After leaving my machine up all night to install the consumer preview, I finally had time to try out the new “Urdu Typeset” a while ago. Although Microsoft explicitly states it to be a ‘document’ font, it never hurts to check how it behaves in a web UI setting. Here’s a screenshot of how the Twitter Urdu page would look with the font. I had to do some CSS overriding to get that right (body.ur for the curious). ...

April 14, 2012

Google and Urdu Stemming

Is Google (finally) stemming Urdu? The last time I checked, they were doing something like a transliteration-based search, but in the screenshot below you can see that searching for the phrase ان پڑھ چٹا shows some stemming is being used. Does anyone know anything? Oh, and while I’m on this topic, I would also like to know why it is called چٹا ان پڑھ ?

March 5, 2010

Online English to Urdu Translator

While all the online English to Urdu translators that I have seen don’t really work that well (read: suck), if we make use of the overlapping vocabulary and grammar of Hindi and Urdu along with Google’s translation API, things come out pretty decent (as mentioned in my previous post). Here’s a small 15-minute first-cut script which just uses English to Hindi translation and then transliterates from Hindi to Urdu. Feel free to use the code and do ping me if you improve something. This works as a Hindi to Urdu transliterator as well. ...
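The two-step idea (translate English to Hindi, then transliterate Hindi to Urdu) can be sketched roughly as follows. This is not the script from the post: the character table is a tiny hypothetical sample, and `translate_en_to_hi` is an assumed placeholder for whatever translation service is plugged in, so no real API call is shown.

```python
from typing import Callable

# Toy Devanagari -> Urdu (Perso-Arabic) character table. Illustrative only;
# a usable transliterator needs a far larger table plus context rules.
DEVANAGARI_TO_URDU = {
    "क": "ک", "म": "م", "न": "ن", "ल": "ل", "द": "د", "र": "ر", "ब": "ب",
    "ा": "ا",  # long "aa" vowel sign
    "ि": "",   # short "i" vowel sign, normally left unwritten in Urdu
}


def transliterate_hindi_to_urdu(text: str) -> str:
    """Map each character through the table; unknown characters pass through unchanged."""
    return "".join(DEVANAGARI_TO_URDU.get(ch, ch) for ch in text)


def english_to_urdu(text: str, translate_en_to_hi: Callable[[str], str]) -> str:
    """Translate English to Hindi with the supplied function, then transliterate to Urdu."""
    return transliterate_hindi_to_urdu(translate_en_to_hi(text))


# Stand-in translator backed by a one-word dictionary (hypothetical).
fake_translator = {"name": "नाम"}.get
print(english_to_urdu("name", lambda s: fake_translator(s, s)))
```

Passing the translator in as a function keeps the transliteration step usable on its own, which matches the remark that the same code doubles as a Hindi to Urdu transliterator.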

January 23, 2010

آلووں کو پکنے دو (Let the Potatoes Cook)

Let the potatoes cook. Let the potatoes have some taste of the stove's slow, gentle flames. Let the potatoes cook. Keep water away from the fiercely hot oil, do not bring it near at all, or else a few hot, sputtering splashes will come flying at you and burn you well. That is why I am cautioning you, holding you back from this burn. Let me be able to stop you. Let the potatoes cook ...

October 12, 2009