I think it was around 1999 when I first heard that Urdu is a low-resource language. Twenty-five years later, Urdu is still considered low-resource despite having over 70 million native speakers. This is because the large, manually curated linguistic resources required for training are still not available for Urdu.
One way around this barrier is to translate existing corpora from a resource-rich language (cough English cough) to Urdu. This seems like a chicken-and-egg problem, though, since good automatic translation systems themselves require training resources. However, recent multimodal multitask models have reached a level where we can use their translations to set a baseline for more complex tasks like question answering. And we have done exactly that!
But before that, a little background on what Question Answering is about. Typically, it is the task of giving a machine some text “to comprehend” and asking it to answer a question about it. While domain-restricted question answering systems have been around since the 1960s, they were mostly confined to academic contexts until recently. When large(?) language models became good enough to predict grammatical text and were made available as publicly accessible chatbots, the term LLM (and even AI) became synonymous with Question Answering. The “purity” of the question answering task remains the same, however. Since LLMs tend to hallucinate, one should always question (!) how true the response to a general question is¹. For example, ask ChatGPT how old Beyoncé was when she won a school talent show. With voice input, it detects the question as being in Hindi and answers “nine”, but the answer changes to “seven” when we type the question in Urdu. So which one is right?
To answer that, we need access to the ground truth. LLMs have been trained on data that includes Urdu and Hindi, and given how text is represented in their basic architecture, they are able to generate very good responses when asked a question. But how good is good? We don’t know until we evaluate. And this is where question answering corpora come in. These datasets provide a gold standard to test how well a machine can perform when we give it some text and ask a question about it. One such dataset is the widely used Stanford Question Answering Dataset (SQuAD). However, it is in English only, which leaves a gap: there is no objective evaluation benchmark for Urdu. Can we reuse SQuAD by translating it to Urdu?
Enter UQA: Corpus for Urdu Question Answering, our recent paper presented at LREC-COLING 2024. We describe the process of selecting an appropriate English-to-Urdu translation model and using it to generate a new Urdu QA dataset from SQuAD2.0. Our corpus shows promising results on fine-tuned large(ish?) language models (F₁ ≈ 0.86), which not only sets a new benchmark but also opens the door for evaluating new LLMs as question-answering systems in Urdu. The data and model are open and “we invite researchers, linguists, and tech enthusiasts to explore this new corpus and contribute to the ongoing effort to enhance Urdu’s digital presence.”²
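For readers wondering what an F₁ score means for question answering, here is a minimal sketch of a SQuAD-style token-overlap F₁ and exact match. This is an assumption about the general flavour of the metric, not the exact evaluation script from the paper: the function names are made up for illustration, tokenization is plain whitespace splitting (the English SQuAD script also strips articles and punctuation, which does not carry over to Urdu directly), and the example strings are toy data, not taken from the corpus.

```python
from collections import Counter


def tokens(text: str) -> list[str]:
    # Assumption: plain whitespace tokenization, with no Urdu-specific
    # normalization. A real evaluation would likely normalize more carefully.
    return text.split()


def exact_match(prediction: str, gold: str) -> bool:
    # True only when the token sequences are identical.
    return tokens(prediction) == tokens(gold)


def token_f1(prediction: str, gold: str) -> float:
    pred, ref = tokens(prediction), tokens(gold)
    if not pred or not ref:
        # SQuAD2.0 includes unanswerable questions, so an empty prediction
        # against an empty gold answer counts as fully correct.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both the prediction and the gold answer.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy example (hypothetical strings): the prediction contains the gold
    # answer plus one extra token.
    print(token_f1("لاہور پاکستان", "لاہور"))
    print(exact_match("لاہور پاکستان", "لاہور"))
```

In this toy case the prediction is penalized on precision (one of its two tokens is extra) but not on recall, which is why F₁ is a gentler measure than exact match for answers that are almost right.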
¹ A closely linked topic is Retrieval Augmented Generation. I will write something about it some day, but today is not that day.
² This line was generated by an LLM.