Let’s say you’re feeling a bit under the weather. You’ve got a scratchy throat, a nose that resembles a leaky tap, and a cough that rivals a 2-stroke rickshaw without a silencer. You start by calling up your GP (general practitioner), who, with their broad training in common ailments, reassures you that it’s probably just a mild cold – nothing that a little rest, some tea, and binge-watching a few seasons of your favourite series can’t fix.

Now, let’s say someone else complains of chest pain, shortness of breath, and palpitations. They will be referred to a cardiologist, who has spent years specialising in heart disease and brings a depth of expertise that a GP cannot match. Specialist AI models embody the same principle: they are fine-tuned on specific tasks and datasets, delivering superior performance where precision and domain knowledge are critical – just like a cardiologist’s ability to diagnose and treat complex heart conditions with unparalleled accuracy.

Generalist models, with their vast training sets and huge parameter counts, might seem like the ultimate tool, but their sheer size can be both a blessing and a curse. These models are like a large buffet – plenty of options, but not many good ones[^1]. They excel in versatility but can fall short in the specialised skills needed for specific, low-resource tasks.

Specialist models, on the other hand, may not have the size or general knowledge, but they’re like a chef who’s perfected a single dish – reliable, efficient, and consistently hitting the mark[^2]. They shine in low-resource settings because they are tuned specifically to maximise performance with limited data, delivering higher accuracy and relevance where generalist models may falter. So, are the performance improvements from fine-tuning worth it?


This question was tackled by Samee Arif and Abdul Hameed Azeemi from C-SALT, working under Agha Ali Raza, in our recent research (accepted at EMNLP 2024), where we take Urdu NLP as an example of a data-deficient domain. We compare general-purpose pretrained models (GPT-4-Turbo and Llama-3-8b) with fine-tuned special-purpose models (XLM, mT5, and Llama-3-8b). In a comprehensive evaluation across multiple classification and generation tasks (named-entity recognition, sentiment analysis, summarisation, translation, etc.), we covered prompting techniques such as few-shot prompting and chain-of-thought reasoning, as well as the fine-tuning of LLMs. The results were as predicted: specialist models quantitatively outperformed generalist models on 12 out of 13 tasks.
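To make the “specialist” route concrete, here is a minimal sketch of fine-tuning a small multilingual model on an Urdu generation task with the Hugging Face Trainer API. The model size, dataset file, and hyperparameters are illustrative placeholders, not the exact setup from the paper:

```python
# A minimal sketch of the specialist route: fine-tuning mT5 on an Urdu
# summarisation dataset. All names and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical JSONL file with "text" (Urdu article) and "summary" fields.
dataset = load_dataset("json", data_files="urdu_summaries.jsonl")["train"]

def preprocess(batch):
    # Tokenise the articles as inputs and the summaries as labels.
    inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128,
                       truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="mt5-urdu-summarisation",
        per_device_train_batch_size=8,
        learning_rate=5e-5,
        num_train_epochs=3,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

The appeal of this route is exactly the one the buffet analogy hints at: a few thousand in-domain examples can be enough to push a small model past a much larger off-the-shelf one on the target task.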


However, don’t cancel your subscriptions just yet. While specialist models aced the quantitative evaluations, generalist models, such as Claude-3.5, showed better performance in the human evaluation of the generation tasks, highlighting the importance of qualitative evaluation in accurately assessing model performance. We also performed an LLM-based evaluation of the models’ outputs on the generation tasks. The low agreement between the LLM rankings and the human rankings may indicate that LLMs struggle with low-resource language understanding.
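Agreement between an LLM judge and human judges can be quantified with a rank correlation such as Kendall’s tau. The sketch below uses made-up rankings purely to show the mechanics, not the paper’s actual data:

```python
# A minimal sketch of measuring LLM-judge vs human-judge agreement with
# Kendall's tau. The rankings below are made-up placeholders.
from scipy.stats import kendalltau

# Rank assigned to each of five model outputs (1 = best), per evaluator.
human_ranks = [1, 2, 3, 4, 5]
llm_ranks = [2, 1, 5, 3, 4]

tau, p_value = kendalltau(human_ranks, llm_ranks)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.2f})")
# tau near 1: the LLM judge mirrors the humans; tau near 0 or negative:
# the LLM judge ranks the outputs very differently from the humans.
```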

So, what’s the final verdict? In some cases, a small fine-tuned language model can perform significantly better than a large off-the-shelf language model. Just as we would use a Swiss Army knife for opening a package but a sterilised scalpel for surgery, choosing an AI model is about leveraging the right tool for the task at hand to achieve the best results.


[^1]: Lahoris, think Salt’n Pepper Village

[^2]: Lahoris, think Stuffed Chicken Breast with Pineapple Sauce at Salt’n Pepper, Gulberg