Of Mir and Meerkats
Breakfast Doodle
I’ve just pushed a JavaScript version of LDA to my GitHub account. It’s based on my no-longer-functioning earlier work. For testing, I use a subset of the SMS Spam Corpus available here (and thus take no responsibility for the inappropriateness of the text within :) ). Each topic is represented as a word cloud; the larger a word, the more weight it has in the topic. The source sentences are displayed again with a bar which shows the percentage distribution of topics for that sentence. Hovering over each area in the bar shows you the words in the topic. You can of course replace the text with any other, change the number of topics using the slider, and press the ‘Analyse’ button to see it work. ...
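As a rough illustration of the per-sentence bar, here is a minimal Python sketch of how a percentage distribution of topics could be computed from per-word topic assignments. The function name `topic_percentages` and the input format are assumptions made for this example; the JavaScript code in the repository may compute the bar differently.

```python
from collections import Counter


def topic_percentages(assignments, num_topics):
    """Return the share (in percent) of words assigned to each topic.

    `assignments` is a hypothetical list holding one topic index per word
    of a sentence, as a sampler might produce after LDA inference.
    """
    if not assignments:
        # An empty sentence has no topic mass to distribute.
        return [0.0] * num_topics
    counts = Counter(assignments)
    total = len(assignments)
    return [100.0 * counts.get(k, 0) / total for k in range(num_topics)]


# A four-word sentence whose words were assigned to topics 0, 0, 1 and 2.
shares = topic_percentages([0, 0, 1, 2], num_topics=3)
print(shares)
```

Each entry in the returned list would correspond to the width of one coloured area in the bar.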
With the increasing number of “opinion-dispensing apps” out there on the web which enable Urdu users to write in Unicode, there is (or will soon be) a need for getting some meaningful statistics out of the ever-present sentiment of the masses (or at least the web-savvy subset). This calls for resources which enable automatic processing of sentiment, one of which is a sentiment lexicon for Urdu. (For people uninitiated in computational linguistics, a lexicon is just a list of words.) Since I couldn’t find any sentiment lexicon available for Urdu on the tubes, I decided to put in some effort and create a new one. ...
In my last post, I highlighted some problems that I face daily while using Twitter in Urdu as well as in English. A few days ago, I decided to experiment with the Twitter API and write my own client to fix some of these problems. You can see the result at www.twingual.com. It is a JavaScript-only Twitter client which supports neat Nastaleeq Urdu fonts as well as transliteration. It’s a work in progress and does not implement all Twitter features. If you like it and want to see something you need every day implemented, feel free to send a tweet. ...
Last night, I read about the new Nastaleeq font available in Windows 8 and I just had to check it out. After leaving my machine up all night to install the consumer preview, I finally had time to check out the new “Urdu Typeset” a while ago. Although Microsoft explicitly states it to be a ‘document’ font, it never hurts to check out how it behaves in a web UI setting. Here’s a screenshot of how the Twitter Urdu page would look with the font. I had to do some CSS overriding to get that right (body.ur for the curious). ...
If you walk into my department, one of the first things you may notice is that some of the tiles on the floor are black and there’s no particular pattern to them. These tiles actually encode a message. The curious amongst us are supposed to decode it, but despite having spent 3 years in the department, I never found the time until last Friday. The decoding should be pretty simple if you want to try your skills. The last 6 letters of the first word can be read off this picture. If you are too lazy, just click here for the explanation. (Anyone who has taken an Introduction to Computer Science course should at least try for ONE minute before clicking) ...
There aren’t many tools which allow you to visualise sentences parsed with dependency grammars. Here’s a small tool which generates a PNG of the dependency graph of a given sentence using the Stanford Parser. How to run: The dependency graph shown in the image above for Einey’s quote can be generated by following these steps. Click here to download <dependensee-3.7.0.jar>. Download the latest version of the Stanford Parser. I am using version 3.7.0. Place the jar file in the Stanford Parser folder. On the command prompt, run java -cp dependensee-3.7.0.jar;stanford-parser.jar;stanford-parser-3.6.0-models.jar;slf4j-api.jar com.chaoticity.dependensee.Main "Example isn't another way to teach, it is the only way to teach." out.png ...
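The command above separates classpath entries with `;`, which is the Windows convention; on Linux and macOS the separator is `:`. The following Python sketch builds the same command with the separator for the current platform. The helper name `dependensee_command` is made up for this example, and the jar file names are copied from the command above, so adjust them to match the files actually present in your Stanford Parser folder.

```python
import os
import subprocess


def dependensee_command(sentence, out_png, jars, sep=os.pathsep):
    """Build the java command line, joining jars with the classpath separator."""
    classpath = sep.join(jars)
    return ["java", "-cp", classpath,
            "com.chaoticity.dependensee.Main", sentence, out_png]


# Jar names as they appear in the command above (assumed to be in the
# current working directory).
JARS = [
    "dependensee-3.7.0.jar",
    "stanford-parser.jar",
    "stanford-parser-3.6.0-models.jar",
    "slf4j-api.jar",
]

cmd = dependensee_command(
    "Example isn't another way to teach, it is the only way to teach.",
    "out.png",
    JARS,
)
print(" ".join(cmd))
# Uncomment to actually run the tool (requires Java and the jars above):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list also avoids shell-quoting problems with the apostrophe in the sentence.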
Is Google (finally) stemming Urdu? The last time I checked, they were doing something like a transliteration-based search, but in the screenshot below, you can see that searching for the phrase ان پڑھ چٹا shows some stemming is being used. Does anyone know anything? Oh, and while I’m on this topic, I would also like to know why it is called چٹا ان پڑھ ?
A Question Answering (QA) system is an Information Retrieval system which gives the answer to a question posed in natural language. For example, if you ask it Who wrote Hamlet?, it should answer Shakespeare. A few years ago (don’t ask me how many), search engines did not focus on natural-language queries. Recently [sic], Google has started incorporating some NLP (Natural Language Processing) in their results. You can try it out by typing the same question in the search box yourself (or clicking here). ...
While all the online English to Urdu translators that I have seen don’t really work that well (read suck), if we make use of the overlapping vocabulary and grammar of Hindi and Urdu along with Google’s translation API, things come out pretty decent (as mentioned in my previous post). Here’s a small 15-minute first-cut script which just uses English to Hindi translation and then transliterates from Hindi to Urdu. Feel free to use the code and do ping me if you improve something. This works as a Hindi to Urdu transliterator as well. ...
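The two-step idea (translate English to Hindi, then transliterate Hindi to Urdu) can be sketched in Python as below. This is a toy illustration, not the script from the post: the mapping table covers only a handful of Devanagari letters, and `translate_to_hindi` is a hypothetical callable standing in for whatever translation service you use. Real Hindi to Urdu transliteration needs many more rules (short vowels, conjuncts, aspirates and so on).

```python
# Tiny, deliberately incomplete Devanagari -> Urdu table (illustration only).
HINDI_TO_URDU = {
    "क": "ک", "त": "ت", "द": "د", "न": "ن", "प": "پ",
    "ब": "ب", "म": "م", "र": "ر", "ल": "ل", "स": "س",
    "ा": "ا",  # the long 'aa' vowel sign maps to alif
}


def transliterate(hindi_text):
    """Map each character through the table, leaving unknown ones unchanged."""
    return "".join(HINDI_TO_URDU.get(ch, ch) for ch in hindi_text)


def english_to_urdu(text, translate_to_hindi):
    """Translate to Hindi with the supplied callable, then transliterate.

    `translate_to_hindi` is a placeholder for a real translation call.
    """
    return transliterate(translate_to_hindi(text))


# Using transliterate() directly gives a Hindi to Urdu transliterator.
print(transliterate("नाम"))

# A stub translator stands in for the translation API in this example.
print(english_to_urdu("name", lambda s: "नाम"))
```

Keeping the translation step as a parameter means the transliteration part can be tested on its own, without network access.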