
Of Mir and Meerkats
Breakfast Doodle
I’ve just pushed a JavaScript version of LDA to my GitHub account. It’s based on my no-longer-functioning earlier work. For testing, I use a subset of the SMS Spam Corpus available here (and thus take no responsibility for the inappropriateness of the text within :) ). Each topic is represented as a word cloud; the larger a word, the more weight it has in the topic. The source sentences are displayed again with a bar which shows the percentage distribution of topics for that sentence....
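To make the two displays concrete, here is a minimal Python sketch of how the per-sentence topic bar and the word-cloud sizes could be computed. It is not the code from the repository: it assumes each token of a sentence already carries a topic index (as a Gibbs-sampling LDA would produce), and the function names and pixel range are made up for illustration.

```python
from collections import Counter


def topic_percentages(assignments, num_topics):
    """Percentage of a sentence's tokens assigned to each topic.

    `assignments` is a list of topic indices, one per token (assumed input).
    """
    total = len(assignments)
    if total == 0:
        return [0.0] * num_topics
    counts = Counter(assignments)
    return [100.0 * counts.get(k, 0) / total for k in range(num_topics)]


def font_sizes(word_weights, min_px=10, max_px=40):
    """Scale word weights linearly so the heaviest word gets `max_px`.

    `min_px` and `max_px` are hypothetical display parameters.
    """
    top = max(word_weights.values())
    return {
        word: min_px + (max_px - min_px) * weight / top
        for word, weight in word_weights.items()
    }
```

For a four-token sentence with assignments `[0, 0, 1, 2]` and three topics, `topic_percentages` splits the bar into half for topic 0 and a quarter each for topics 1 and 2.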
With the increasing number of “opinion-dispensing apps” which enable Urdu users to write in Unicode out there on the web, there is (or will soon be) a need for getting some meaningful statistics out of the ever-present sentiment of the masses (or at least the web-savvy subset). This calls for resources which enable automatic processing of sentiment, one of which is a sentiment lexicon for Urdu. (For people uninitiated in computational linguistics, a lexicon is just a list of words)....
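As a rough illustration of what a sentiment lexicon is used for, the Python sketch below scores a sentence by adding up the polarity of each word it finds in the list. The four entries and their +1/-1 polarities are hypothetical examples chosen for this sketch, not taken from any actual Urdu lexicon, and splitting on whitespace is a simplifying assumption.

```python
# Hypothetical toy lexicon: word -> polarity (+1 positive, -1 negative).
LEXICON = {
    "اچھا": 1,   # good
    "خوش": 1,    # happy
    "برا": -1,   # bad
    "خراب": -1,  # spoiled / bad
}


def sentiment_score(text, lexicon=LEXICON):
    """Sum the polarity of every whitespace-separated token found in the lexicon."""
    return sum(lexicon.get(token, 0) for token in text.split())
```

A positive total suggests positive sentiment, a negative total the opposite, and words missing from the lexicon contribute nothing.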
In my last post, I highlighted some problems that I face daily while using Twitter in Urdu as well as in English. A few days ago, I decided to experiment with the Twitter API and write my own client to fix some of these problems. You can see the result at www.twingual.com. It is a JavaScript-only Twitter client which supports neat Nastaleeq Urdu fonts as well as transliteration. It’s a work in progress and does not implement all Twitter features....
Last night, I read about the new Nastaleeq font available in Windows 8 and I just had to check it out. After leaving my machine on all night to install the consumer preview, I finally had time to try out the new “Urdu Typeset” a while ago. Although Microsoft explicitly states it to be a ‘document’ font, it never hurts to check how it behaves in a web UI setting....
If you walk into my department, one of the first things you may notice is that some of the tiles on the floor are black and there’s no particular pattern to them. These tiles actually encode a message. The curious amongst us are supposed to decode this but, despite having spent 3 years in the department, I could never find the time until last Friday. The decoding should be pretty simple if you want to try your skills....
There aren’t many tools which allow you to visualise sentences parsed with dependency grammars. Here’s a small tool which generates a PNG of the dependency graph of a given sentence using the Stanford Parser. How to run: The dependency graph shown in the image above for Einey’s quote can be generated by following these steps. Click here to download <dependensee-3.7.0.jar>. Download the latest version of the Stanford Parser. I am using version 3....
Is Google (finally) stemming Urdu? The last time I checked, they were doing something like a transliteration-based search, but in the screenshot below, you can see that searching for the phrase ان پڑھ چٹا shows some stemming is being used. Does anyone know anything? Oh, and while I’m on this topic, I would also like to know why it is called چٹا ان پڑھ ?
A Question Answering (QA) system is an Information Retrieval system which gives the answer to a question posed in natural language. For example, if you ask it Who wrote Hamlet?, it should answer Shakespeare. A few years ago (don’t ask me how many), search engines did not focus on natural language queries. Recently [sic], Google has started incorporating some NLP (Natural Language Processing) in their results. You can try it out by typing the same question in the search box yourself (or clicking here)....
While all the online English to Urdu translators that I have seen don’t really work that well (read suck), if we make use of the overlapping vocabulary and grammar of Hindi and Urdu along with Google’s translation API, things come out pretty decent (as mentioned in my previous post). Here’s a small 15 min first cut script which just uses English to Hindi translation and then transliterates from Hindi to Urdu....
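The two-step idea (translate English to Hindi, then transliterate Devanagari into Urdu script) can be sketched in Python as below. This is not the script from the post: the translation step is left as a caller-supplied function because no real API call is shown here, and the character table is a deliberately tiny, hypothetical mapping that only covers a handful of letters. A real transliterator also has to handle inherent vowels, conjuncts and many more characters.

```python
# Toy Devanagari -> Urdu character table (assumption: covers only a few letters).
DEVANAGARI_TO_URDU = {
    "क": "ک",
    "न": "ن",
    "म": "م",
    "ा": "ا",  # the long "aa" vowel sign
}


def transliterate(hindi_text, table=DEVANAGARI_TO_URDU):
    """Map each character through the table; unknown characters pass through unchanged."""
    return "".join(table.get(ch, ch) for ch in hindi_text)


def english_to_urdu(text, translate_en_to_hi):
    """Translate English to Hindi with the supplied callable, then transliterate.

    `translate_en_to_hi` is a hypothetical stand-in for whatever translation
    service is used; it takes an English string and returns Hindi text.
    """
    return transliterate(translate_en_to_hi(text))
```

With this table, the Hindi word नाम ("name") comes out as نام, which is also how the word is written in Urdu.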