Last night, I took a look at the federal budget for 2012-2013. Apparently we will be spending about 25% on “Servicing of Domestic Debt”.
With the increasing number of “opinion-dispensing apps” on the web which let Urdu users write in Unicode, there is (or will soon be) a need for getting some meaningful statistics out of the ever-present sentiment of the masses (or at least the web-savvy subset). This calls for resources which enable automatic processing of sentiment, one of which is a sentiment lexicon for Urdu. (For people uninitiated in computational linguistics, a lexicon is just a list of words.) Since I couldn’t find any sentiment lexicon available for Urdu on the tubes, I decided to put in some effort and create a new one.
The Urdu Sentiment Lexicon is a list of 2,607 positive and 4,728 negative sentiment/opinion words for Urdu. It is based on a similar list for English available here. The English words have been translated to Urdu automatically using a dictionary lookup. All resulting Urdu synonyms have been included as well. The lexicon has also been manually inspected (but very quickly) and any irrelevant words have been deleted.
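To give an idea of how such a lexicon might be used, here is a minimal Python sketch of lexicon-based scoring: count the positive and negative words in a piece of text and compare. The tiny word sets, the whitespace tokenizer and the function names are all assumptions made for illustration; they are not part of the released lexicon.

```python
# Hypothetical miniature lexicon for illustration only; the real lists
# contain thousands of entries.
POSITIVE = {"اچھا", "خوب", "بہترین"}
NEGATIVE = {"برا", "خراب", "بدترین"}


def sentiment_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Return (#positive words - #negative words) for whitespace tokens."""
    tokens = text.split()
    pos = sum(1 for token in tokens if token in positive)
    neg = sum(1 for token in tokens if token in negative)
    return pos - neg


def label(text):
    """Map the raw score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Real Urdu text would need proper tokenization and some handling of inflected forms and negation, which this sketch ignores.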
Topic modelling means detecting “abstract” topics from a collection of text documents. The most common textbook technique for this is Latent Dirichlet Allocation (LDA). Simply put, LDA is a statistical algorithm which takes documents as input and produces a list of topics. One catch is that you have to tell it how many topics you want. There’s much more to it but since this is not a tutorial post, I will stop here. (If you are interested in how it works, read the references given on the wiki page.)
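For readers who want something concrete anyway, here is a rough Python sketch of LDA fitted with collapsed Gibbs sampling, one of the standard ways to estimate the model. It is not necessarily what any particular tool uses; the function names, the hyperparameters alpha and beta, and the iteration count are assumptions chosen for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of token lists.

    Returns (topics, doc_dist): topics[k] maps word -> probability and
    doc_dist[d][k] is the share of topic k in document d.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z = [rng.randrange(num_topics) for _ in doc]
        assignments.append(z)
        for w, k in zip(doc, z):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    topics = [
        {w: (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
         for w in vocab}
        for t in range(num_topics)
    ]
    doc_dist = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return topics, doc_dist


def top_words(topic, n=5):
    """The n highest-weighted words of a topic, e.g. for a word cloud."""
    return sorted(topic, key=topic.get, reverse=True)[:n]
```

Calling lda_gibbs([t.split() for t in texts], num_topics=3) on a list of strings returns word weights per topic and a per-document topic distribution. Results vary with the seed and the number of iterations.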
Here’s twopicate, the output of about half a weekend of intermittent coding. You enter a search term, tell it how many topics you want, and press the button. It pulls tweets about that term from twitter and extracts topics for them. Each topic is represented as a word cloud (visible on the right). The larger a word, the more weight it has in the topic. The source tweets are on the left. Each tweet has a bar which shows the percentage distribution of topics for that tweet. You can try it yourself by clicking below.
Oh, and you can use the source.
It’s a work in progress and does not implement all twitter features. If you like it and want to see something you need everyday implemented, feel free to send a tweet.
Meanwhile, tweet away!
Last night, I read about the new Nastaleeq font available in Windows 8 and I just had to check it out. After leaving my machine up all night to install the consumer preview, I finally had time to examine the new “Urdu Typeset” a while ago. Although Microsoft explicitly states it to be a ‘document’ font, it never hurts to check out how it behaves in a web UI setting. Here’s a screenshot of how the Twitter Urdu page would look with the font. I had to do some CSS overriding to get that right (body.ur for the curious).
While it does not look that bad, what bugs me is the fact that the English characters in the font are nowhere near good enough. The extra kerning for Urdu is probably to blame, but as it turns out, I haven’t been able to find a single Nastaleeq font which can render English as well as Urdu characters legibly enough for use in web pages. The nearest you can get is Alvi Nastaleeq v1.0.0 (screenshot below).
Alvi Nastaleeq v1.0.0
But even this font doesn’t quite give a look elegant enough for professional web pages. Until someone is ambitious enough to tackle this problem, we probably won’t see any usable Urdu+English interfaces. Any solution for us bilinguals will have to handle bidirectional (bidi) text as well. Meanwhile, an alternative is to either detect the language and add spans with different fonts, or simply let go of your desire to see Urdu Nastaleeq and switch to Helvetica.
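The "detect language and add spans" workaround could look roughly like the Python sketch below. It treats any run of Arabic-script characters as Urdu and wraps it in a span that a stylesheet can give its own font. The class name "urdu", the chosen Unicode ranges and the function name are assumptions made for illustration.

```python
import html
import re

# Arabic-script blocks commonly needed for Urdu: Arabic, Arabic Supplement,
# and Arabic Presentation Forms A and B.
ARABIC = "\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFC"

# One or more Arabic-script words, separated by single or multiple spaces.
URDU_RUN = re.compile(f"[{ARABIC}]+(?: +[{ARABIC}]+)*")


def wrap_urdu(text, css_class="urdu"):
    """Escape text for HTML and wrap each Arabic-script run in a span."""
    parts = []
    pos = 0
    for match in URDU_RUN.finditer(text):
        parts.append(html.escape(text[pos:match.start()]))
        parts.append(
            f'<span class="{css_class}" lang="ur" dir="rtl">'
            f"{html.escape(match.group())}</span>"
        )
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)
```

A rule such as .urdu { font-family: "Alvi Nastaleeq"; } would then apply the Nastaleeq font only to the wrapped runs and leave the English text in the page's default font. Script detection is only a proxy for language here: Arabic or Persian text would be wrapped as well.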
Update: BBC has now embedded a new font (BBCNasim) on their website which is quite good. In fact I don’t use this plugin any more myself. It’s not Nastaleeq but it is good enough.
Let’s face it. The font on the BBC Urdu website is not that good. When a friend complained about it on our alumni list, I thought of writing a small greasemonkey script to take care of the problem. The results are pretty good, as visible in the image below. The left part is the site after installing the Urdu Naskh Asia type font provided by BBC and before installing the script (and I maintain, Aijaz, it is not good). The right part is after installing the script. Click on the image and you’ll get an un-scaled version.
To install the script, click the link below and follow the installation instructions given there. Currently, it works only on Chrome and Firefox.
and the world is a bit better now…
I was trying to trace the source of the quote “Any sufficiently advanced financial instrument is indistinguishable from fraud.” If you do a quoted Google search on a custom date range, an interesting problem can be seen.
The results contain pages originally published in 2005 but re-indexed recently. While re-indexing, the current tweets of the author were visible to the crawler and got indexed along with the original article. This makes it seem like the quoted text was first mentioned in 2005, whereas it is actually only a recent meme.
One way to avoid this might be to identify dynamic widgets like Twitter/news/weather feeds and exclude them from the index. The HTTP header (pasted below) lists the last-updated date, which probably means that Google is either getting the date from the first time it indexed the post or from the URL itself. Whatever the case is, it’s an interesting problem to distinguish between the ‘original’ content and other dynamically added elements on a page.
HTTP/1.1 200 OK
Server: nginx
Date: Tue, 23 Nov 2010 17:50:20 GMT
Content-Type: text/html; charset=UTF-8
Vary: Cookie, Accept-Encoding
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
X-Pingback:
P.S. An interesting way to advertise. You have read this header anyway, so you might want to apply for the job.
P.P.S. On second thought, it’s not much of a ‘challenge’ per se. It’s just an interesting problem.
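The widget-filtering idea from this post could be sketched in Python along the following lines: walk the HTML and drop any element whose id or class hints at a dynamic feed, keeping only the remaining text for indexing. The hint words and the class and function names are assumptions made for illustration; nothing here reflects how a real crawler works.

```python
from html.parser import HTMLParser

# Hypothetical markers of dynamic content; a real list would be much longer.
WIDGET_HINTS = ("widget", "twitter", "weather", "feed")

# Elements that never have a closing tag, so they must not affect nesting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StaticTextExtractor(HTMLParser):
    """Collect page text while skipping elements that look like widgets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a widget element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1
            return
        marker = " ".join(
            value or "" for name, value in attrs if name in ("id", "class")
        ).lower()
        if any(hint in marker for hint in WIDGET_HINTS):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def static_text(page):
    """Return the text of page with widget-like elements removed."""
    parser = StaticTextExtractor()
    parser.feed(page)
    parser.close()
    return " ".join(parser.chunks)
```

The depth counting assumes well-formed markup with explicit closing tags. Pages that omit them, or widgets injected by scripts without a telltale id or class, would need a smarter approach.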
A Question Answering (QA) system is an Information Retrieval system which gives the answer to a question posed in natural language. For example, if you ask it Who wrote Hamlet?, it should answer Shakespeare. A few years ago (don’t ask me how many), search engines did not focus on natural-language queries. Recently [sic], Google has started incorporating some NLP (Natural Language Processing) in their results. You can try it out by typing the same question in the search box yourself (or clicking here).
During my M.Phil. course, one of the tasks was to build a basic QA system and extend it however we liked. We used the TREC 8 dataset for evaluations. While building the system, I evaluated how current search engines (read Google) performed on this task. For this, I just queried the exact question and used the summaries of the top five results as answers. Evaluating at that time (2008), I got a Mean Reciprocal Rank (MRR) score of 0.212 over 198 questions. For 156 questions, no answer was found in the top five responses.
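For anyone unfamiliar with the metric: MRR is the average, over all questions, of 1/rank of the first correct answer, with questions that have no correct answer in the list contributing zero. A minimal Python sketch follows; the function name and the use of None for "no answer found" are assumptions made for illustration.

```python
def mean_reciprocal_rank(ranks):
    """Compute MRR from the 1-based rank of the first correct answer per question.

    Use None for questions where no correct answer appeared in the
    returned list; such questions contribute 0 to the sum.
    """
    if not ranks:
        raise ValueError("need at least one question")
    total = sum(1.0 / rank for rank in ranks if rank is not None)
    return total / len(ranks)


# Four hypothetical questions: answered at rank 1, rank 2, not at all, rank 5.
example = mean_reciprocal_rank([1, 2, None, 5])  # (1 + 0.5 + 0 + 0.2) / 4
```

Because unanswered questions count as zero, a large share of misses pulls the score down quickly even when the answered questions rank well.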
The results show clearly that during the last two years, Google has significantly improved on answering NLP queries. In fact (IIRC), my baseline system back in 2008 (based on RMRS-based matching of sentences from the top 100 documents returned by an IR system) could only achieve an MRR score of approximately 0.290, showing that the current results are much better than that baseline. I hope this decade sees some more developments/improvements in QA systems and I can ask a system What do you get if you multiply six by nine?
I’ve always said there was something fundamentally wrong with the universe. ~Arthur Dent
I am playing around with a customized Twitter client, temporarily named ‘outwit’. I’ll try to add features as I need them but it’s strictly an experiment for the time being. Let’s see if things go smoothly from here.
Here is a short intro on how to make sure that major search engines (Google, Yahoo, Microsoft) can be directed to see different URLs with the same content as a single ‘canonical’ URL. For example, the following links point to the same page but have different URLs.
The solution is to select a single point as your representative URL and include this line in the HTML code.
Although a standard 301 redirect should work too, this would be a bit easier for the non-techie designers and SEO enthusiasts to implement.
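For reference, the mechanism these search engines support for this is a link element with rel="canonical" placed in the page's head. If pages are generated from templates, a small Python helper could emit it, as in the sketch below. The function name and the example.com URL are hypothetical and used only for illustration.

```python
import html


def canonical_link_tag(url):
    """Build the <link rel="canonical"> element for the chosen URL."""
    # Escape &, <, > and quotes so the URL is safe inside the attribute.
    return f'<link rel="canonical" href="{html.escape(url, quote=True)}" />'


# Hypothetical representative URL for a page reachable under several addresses.
tag = canonical_link_tag("http://www.example.com/product.php?item=1&lang=en")
```

Every variant of the page would carry the same tag in its head section, pointing at the one URL you want indexed.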