Category Archives: Openware

Because high-schoolers need computers…


For under $1 million, every high school student in Punjab can have access to a computer.

  • Number of high schools = 5600 (source)
  • Price per computer = 16500 PKR (source)
  • Total = 92,400,000 PKR = 976,206 USD
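The arithmetic behind the total can be checked with a few lines of Python. This is only a sketch using the figures quoted above; the variable names are made up for illustration, and the exchange rate is not stated in the post, so it is back-computed from the two totals.

```python
# Figures quoted in the list above; the total works out to one computer per school.
NUM_HIGH_SCHOOLS = 5600
PRICE_PER_COMPUTER_PKR = 16500
TOTAL_USD_FROM_POST = 976206

total_pkr = NUM_HIGH_SCHOOLS * PRICE_PER_COMPUTER_PKR

# The post does not state the exchange rate; this is the rate implied by its two totals.
implied_pkr_per_usd = total_pkr / TOTAL_USD_FROM_POST

print(f"Total: {total_pkr:,} PKR")
print(f"Implied exchange rate: {implied_pkr_per_usd:.2f} PKR per USD")
```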

Imagine a whole generation growing up on Khan Academy lectures and the Gutenberg library. Imagine these kids using Wikipedia to get both sides of an argument and playing around with Wolfram|Alpha. Imagine them falling in love with physics by appreciating the mysteries of light and getting high on chemistry by designing molecules. Imagine them learning how to pronounce the word “measure” properly and hearing Faiz reciting poetry as it was meant to be recited.

 

Imagine them educated and not just literate. 

Where does the money go?

Last night, I took a look at the federal budget for 2012-2013. Apparently we will be spending about 25% of it on “Servicing of Domestic Debt”.

Take a more detailed look here


Urdu Sentiment Lexicon

With the increasing number of “opinion-dispensing apps” on the web that let Urdu users write in Unicode, there is (or soon will be) a need to get some meaningful statistics out of the ever-present sentiment of the masses (or at least the web-savvy subset). This calls for resources which enable automatic processing of sentiment, one of which is a sentiment lexicon for Urdu. (For people uninitiated in computational linguistics, a lexicon is just a list of words.) Since I couldn’t find any sentiment lexicon for Urdu on the tubes, I decided to put in some effort and create a new one.

Click here to check out the Urdu Sentiment Lexicon

image

The Urdu Sentiment Lexicon is a list of 2,607 positive and 4,728 negative sentiment/opinion words for Urdu. It is based on a similar list for English available here. The English words have been translated to Urdu automatically using a dictionary lookup. All resulting Urdu synonyms have been included as well. The lexicon has also been manually inspected (but very quickly) and any irrelevant words have been deleted.

To test things out, I’ve also developed a simple JavaScript application which changes the color of the sentiment words according to their polarity. It also calculates the background color of the whole text using the total polarity score of the text (+1 for each positive word, –1 for each negative word). A screenshot is given above. Like all sentiment lexica, this won’t be perfect. But I am hoping this will give my fellow researchers who work on Urdu sentiment analysis a starting point and save them some time.
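The scoring idea is simple enough to sketch in a few lines. The Python below is not the JavaScript application itself, just a minimal illustration of the +1/–1 scheme. The four-word lexicon, the function names and the color names are all made up for the example; the real lexicon has thousands of entries and the app's actual color mapping may differ.

```python
# Hypothetical mini-lexicon for illustration; not taken from the real lexicon files.
POSITIVE = {"اچھا", "خوش"}
NEGATIVE = {"برا", "غم"}

def polarity_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Add +1 for every positive word and -1 for every negative word."""
    score = 0
    for token in text.split():
        # Strip common Urdu and Latin punctuation stuck to the word.
        word = token.strip("۔،؟!.,?\"'")
        if word in positive:
            score += 1
        elif word in negative:
            score -= 1
    return score

def background_color(score):
    """Map the total score to a color name (assumed mapping, for illustration)."""
    if score > 0:
        return "lightgreen"
    if score < 0:
        return "lightcoral"
    return "white"

sample = " ".join(sorted(POSITIVE)) + " " + next(iter(NEGATIVE))
print(polarity_score(sample), background_color(polarity_score(sample)))
```

Words that are not in the lexicon simply contribute nothing, so a long neutral text with a single negative word still ends up with a negative total.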

 


LDA based topic modelling in javascript

Topic modelling means detecting “abstract” topics in a collection of text documents. The most common textbook technique for this is Latent Dirichlet Allocation (LDA). Simply put, LDA is a statistical algorithm which takes documents as input and produces a list of topics. One catch is that you have to tell it how many topics you want. There’s much more to it, but since this is not a tutorial post, I will stop here. (If you are interested in how it works, read the references given on the wiki page.)
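For the curious, here is a minimal sketch of LDA trained with collapsed Gibbs sampling, written in Python. It is not any particular existing implementation. The function name lda_gibbs, the default hyperparameters alpha and beta, and the toy documents are assumptions made for illustration. It takes tokenised documents and the number of topics, and returns a topic distribution per document plus the heaviest words per topic.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0, num_words=3):
    """Collapsed Gibbs sampling for LDA over documents given as lists of words."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]         # words in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(num_topics)]    # times word w is assigned to topic k
    topic_total = [0] * num_topics                       # total words assigned to topic k
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][n]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Resample the topic in proportion to the conditional probability.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed per-document topic distributions and the top words of each topic.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [
        [vocab[i] for i in sorted(range(V), key=lambda i: -row[i])[:num_words]]
        for row in topic_word
    ]
    return theta, top_words

toy_docs = [["cat", "dog", "cat"], ["stock", "market", "stock"], ["dog", "cat"], ["market", "stock"]]
theta, top_words = lda_gibbs(toy_docs, num_topics=2)
print(theta)
print(top_words)
```

Note that the number of topics is an input here too, which is exactly the catch mentioned above.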

I was playing around with tweets and topics yesterday. Unfortunately, I couldn’t find any JavaScript-based LDA implementation. So I wrote one. Or to be more accurate, I converted an existing simple one-class implementation to JavaScript. To check how it works on real data, I needed a tool with some documents. So I wrote that too.

Here’s twopicate, the output of about half a weekend of intermittent coding. You enter a search term, tell it how many topics you want, and press the button. It pulls tweets about that term from Twitter and extracts topics from them. Each topic is represented as a word cloud (visible on the right). The larger a word, the more weight it has in the topic. The source tweets are on the left. Each tweet has a bar which shows the percentage distribution of topics for that tweet. You can try it yourself by clicking below.

try twopicate

 

Since it’s a JavaScript-only solution, it runs in your browser and is consequently a bit slow. You might have to wait a minute after pressing the button.


Oh, and you can use the source.

Twingual: A twitter client for bilingual tweeple

In my last post, I highlighted some problems that I face daily while using Twitter in Urdu as well as in English. A few days ago, I decided to experiment with the Twitter API and write my own client to fix some of these problems. You can see the result at www.twingual.com. It is a JavaScript-only Twitter client which supports neat Nastaleeq Urdu fonts as well as transliteration.

It’s a work in progress and does not implement all Twitter features. If you like it and want to see something you need every day implemented, feel free to send a tweet.

Meanwhile, tweet away!

Making a copy of WEKA Instances

This ‘thing’ took about 30 minutes to figure out. According to the WEKA documentation, if you add a new Instance to an existing Instances object, String values are not transferred! If you are copying a dataset with a string attribute, you need to transfer the string manually. The code segment below copies the i-th instance from source to dest, where the first attribute (at index 0) is a string attribute.


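// Adding the instance copies it into dest, but the string value is not carried over.
dest.add(source.instance(i));
// Set the string attribute (index 0) again on the copy that was just added.
dest.instance(dest.numInstances() - 1)
    .setValue(0, source.instance(i).stringValue(0));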

This should come in handy for text classification using WEKA (and hopefully save you some time).

Online English to Urdu Translator

While all the online English to Urdu translators that I have seen don’t really work that well (read: suck), if we make use of the overlapping vocabulary and grammar of Hindi and Urdu along with Google’s translation API, things come out pretty decent (as mentioned in my previous post). Here’s a small 15-minute first-cut script which just uses English to Hindi translation and then transliterates from Hindi to Urdu. Feel free to use the code, and do ping me if you improve something. This works as a Hindi to Urdu transliterator as well.




(Thanks to عزت مآب جناب آغا علی رضا قزلباش رحمتہ اللہ علیہ who graciously sent me his term report on Hindi to Urdu transliteration, from where I’ve copied (and modified) the character mapping.)
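The transliteration step boils down to a character-by-character lookup. The Python sketch below shows the idea with a tiny, hand-picked subset of Devanagari to Urdu mappings. It is not the mapping used in the script (that one comes from the term report mentioned above), and the names DEVANAGARI_TO_URDU and transliterate are hypothetical. Real Hindi to Urdu transliteration also has to deal with short vowels, conjuncts and context, which this ignores.

```python
# Illustrative subset only; NOT the character mapping used in the original script.
DEVANAGARI_TO_URDU = {
    "क": "ک",
    "म": "م",
    "न": "ن",
    "र": "ر",
    "ल": "ل",
    "स": "س",
    "ब": "ب",
    "द": "د",
    "ा": "ا",  # long "aa" vowel sign
    "ी": "ی",  # long "ii" vowel sign
}

def transliterate(hindi_text, mapping=DEVANAGARI_TO_URDU):
    """Replace each mapped Devanagari character; leave everything else untouched."""
    return "".join(mapping.get(ch, ch) for ch in hindi_text)

print(transliterate("नाम"))
```

Characters without an entry (spaces, digits, punctuation, unmapped letters) pass through unchanged, which makes gaps in the table easy to spot in the output.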

What do you tweet about? : A shell script for getting most frequent words for twitter

There are a lot of web apps around which report your Twitter stats. But at times, it’s better to do things yourself. I haven’t done any fun coding for ages now, so last night I finally got around to making a small program to gather Twitter word statistics. The fun part was to do everything using Unix tools. Here’s a small script file which displays the 10 most used words in the tweets for any Twitter id. I have only tested it under Cygwin, so this is probably the best place to say “USE AT YOUR OWN RISK”.

Here’s how it works.

  1. downloads all status information in a directory
  2. extracts the status message lines
  3. does some regex magic and filters stop words like the, a, an etc. (I haven’t seen this done anywhere before, but the join command comes in handy for processing stop words)
  4. displays the top 10 most frequent words (and emoticons)
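Steps 2 to 4 can be sketched in Python as well. This is not the shell script, only an illustration of the same idea: it assumes the status messages have already been downloaded into a list, and the function name top_words and the sample messages are made up for the example.

```python
from collections import Counter

def top_words(statuses, stop_words, n=10):
    """Count words across status messages, skipping stop words."""
    counts = Counter()
    for status in statuses:
        for token in status.lower().split():
            # Trim sentence punctuation but keep ':' and ')' so emoticons survive.
            word = token.strip(".,!?\"'")
            if word and word not in stop_words:
                counts[word] += 1
    return counts.most_common(n)

sample_statuses = ["Watch the rally live", "watch live today!"]
for word, freq in top_words(sample_statuses, stop_words={"the", "a", "an"}):
    print(freq, word)
```

As in the script, the output puts the frequency at the start of each line.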

Twitter limits the number of messages that you can download (3200). Also, the timeline of the Twitter id has to be public for this script to work. All you need to do is download the script file and stop word list, keep them in the same directory, and run the script with the Twitter id on the command line. You’ll get the list of words with the frequency at the start of each line. For example,

$ ./tword.sh barackobama
161 watch
119 live
92 http://mybarackobamacom/livestream
81 health
63 reform
55 today
52 rally
48 #hc09
47 &
38 vote

The script takes time to complete, so be patient. As you may have noticed, there are still HTML tags inside. You can remove them by piping the output through any html2text program. There’s a small Perl script in the zipfile which does this processing. You will, however, need to pipe this in the script yourself after installing HTML::Entities through CPAN. The output now brings in a new word, “change”.

$ ./tword.sh barackobama
161 watch
119 live
92 http://mybarackobamacom/livestream
83 health
68 change
63 reform
55 today
55 rally
48 #hc09
39 vote

My list toppers are good, :D , time, day, twitter, read, hope, back, :p and make. I wonder if this makes me a happy person :)

JabRef and Google Scholar

I can’t seem to find any way to import the bib entries provided by Google Scholar into JabRef directly. You can enable the Import into BibTeX link from the preferences, but it streams the bib file as text/plain, which opens up in the browser. You can save it and import it, but that wastes a lot of clicks. The easiest option is to copy-paste all the text into a new JabRef entry (Ctrl+N). The default settings leave the double curly braces in the title (to preserve case), which can be removed by enabling the Remove double braces… checkbox in the File tab of Options/Preferences. This works for JabRef 2.5.
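To see what removing the double braces does to an entry, here is a naive Python sketch. It is not how JabRef does it internally; the function name remove_double_braces and the sample entry are made up, and the regular expression will not cope with braces nested inside the title.

```python
import re

def remove_double_braces(bibtex_text):
    """Turn `{{Some Title}}` into `{Some Title}` (naive: no nested braces)."""
    return re.sub(r"\{\{(.*?)\}\}", r"{\1}", bibtex_text)

entry = "title = {{Urdu Sentiment Lexicon}},"
print(remove_double_braces(entry))
```

Without the outer pair of braces, the bibliography style is free to change the case of the title again, which is the trade-off of ticking that checkbox.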

Are you interested in using computers in the classrooms?

 

A friend of mine is carrying out research in classroom-based e-assessment in developing countries such as Pakistan. The aim of the research is to assist primary school teachers with computer software that:

  • Is aligned with the particular subject curriculum they follow in their schools.
  • Provides pupils with challenges and interactive short quizzes and tests to take after completing a topic taught by the teacher in the classroom.
  • Provides students with immediate and diagnostic feedback on their performance on each challenge or test they attempt.
  • Helps teachers in identifying the individual pupils needing help in certain conceptual areas of the curriculum, in managing the overall classroom portfolio, and in assuring better teaching and learning within the socio-cultural context of their educational system.

You can help in two ways!

1. The project is still in the stage where available literature on formative e-assessment is being critically reviewed and any bright ideas relevant to the topic are welcome.

2. One very important aspect of the research is the kind of tools and technologies that should be used to develop the software product. This takes into consideration the fact that expensive tools, technologies and infrastructure do not help much in sustaining any change in the educational systems of developing countries. Therefore, Moodle (www.moodle.org) and other such open source learning management systems are being considered for the initial version of this project.

We need someone who can evaluate both Moodle and OpenMark, tell whether the functionality of diagnostic assessment used in OpenMark can also be integrated in Moodle, and how. Monetary remuneration is available for this activity.

The Open University (www.open.ac.uk) in the UK is one of the pioneers in establishing distance learning (also e-learning) programs for higher education. They are currently using OpenMark (https://openmark.dev.java.net/), their own open source Computer Assisted Assessment (CAA) system, as well as Moodle to develop formative assessment tests for their students enrolled in distance learning programs (http://labspace.open.ac.uk/course/view.php?id=3484&topic=all). You will find some documentation on this at http://labspace.open.ac.uk/course/view.php?id=3484&topic=all, and at http://labspace.open.ac.uk/mod/resource/view.php?id=381989&direct=1. After that we would need help in taking on the development of our own software from there.

If you know anyone who is interested, please leave a comment or drop an email to awais {at} chaoticity.com