My Language - My AI

Remember watching Arrival? A linguist tasked with communicating with aliens learns that their language lets her perceive time non-linearly, reshaping her understanding of reality. The plot brought the Sapir-Whorf hypothesis back into the mainstream: the idea that language actively shapes the way we think and determines what we can think about. In the context of LLMs and Chain of Thought (CoT) reasoning, the hypothesis becomes particularly relevant, since the language of thought quite literally determines the quality of computational output....

February 21, 2025

UQA - Corpus for Urdu Question Answering

I think it was around 1999 when I first heard that Urdu is a low-resource language. 25 years later, Urdu is still considered low-resource despite having over 70 million native speakers. This is because the large, manually curated linguistic resources required for training are still not available for Urdu. One way around this barrier is to translate existing corpora from a resource-rich language (cough English cough) to Urdu. This seems like a chicken-and-egg problem, though, since good automatic translation systems themselves require training resources....

May 24, 2024

Of Mir and Meerkats

Breakfast Doodle

March 1, 2017

Urdu Sentiment Lexicon

With the increasing number of “opinion-dispensing apps” on the web that let Urdu users write in Unicode, there is (or will soon be) a need to get some meaningful statistics out of the ever-present sentiment of the masses (or at least the web-savvy subset). This calls for resources that enable automatic processing of sentiment, one of which is a sentiment lexicon for Urdu. (For people uninitiated in computational linguistics, a lexicon is just a list of words.)...

June 14, 2012

Nastaleeq Urdu Typesetting: When will they get it right?

Last night, I read about the new Nastaleeq font available in Windows 8 and I just had to check it out. After leaving my machine up all night to install the consumer preview, I finally had time to examine the new “Urdu Typeset” a while ago. Although Microsoft explicitly states it to be a ‘document’ font, it never hurts to check how it behaves in a web UI setting....

April 14, 2012

Google and Urdu Stemming

Is Google (finally) stemming Urdu? The last time I checked, they were doing something like a transliteration-based search, but in the screenshot below, you can see that searching for the phrase ان پڑھ چٹا shows some stemming is being used. Does anyone know anything? Oh, and while I’m on this topic, I would also like to know why it is called چٹا ان پڑھ ?

March 5, 2010

Online English to Urdu Translator

While the online English to Urdu translators that I have seen don’t really work that well (read: suck), if we make use of the overlapping vocabulary and grammar of Hindi and Urdu along with Google’s translation API, things come out pretty decent (as mentioned in my previous post). Here’s a small 15-minute first-cut script which just uses English to Hindi translation and then transliterates from Hindi to Urdu....

January 23, 2010

Let the Potatoes Cook

Let the potatoes cook. Let the potatoes get a little taste of the stove's low, gentle flames. Let the potatoes cook. Keep water away from the fiercely hot oil, do not bring it anywhere near, or else many hot, sputtering splashes will fly up at you and burn you well. That is why I keep cautioning you, holding you back from this burn. Let me be able to stop you. Let the potatoes cook...

October 12, 2009