We need AI to ‘broaden’ in depth, scope, accuracy and validity and, at the same time, we need AI to ‘narrow’ towards specific use cases: aligning itself with specialised implementations that use more strictly defined sets of human language and engineering itself into custom-built, company-specific proprietary use cases. But there’s a problem with that broadening and narrowing of artificial intelligence. The core problem in Large Language Model (LLM) specialisation is a lack of high-quality data. That is the joint proposition put forward by Dr. Jignesh Patel, co-founder of DataChat and professor at Carnegie Mellon University, alongside his colleague Deepan Das, technical lead at DataChat, a conversational AI platform company.
Niche LLMs, not easy
The pair suggest that while multitrillion-dollar tech companies compete to create (or buy) the biggest LLMs, smaller players are developing specialised LLMs for expensive, niche problems. This is great news, but we need to remember that a huge number of human work hours go into rote review and adjustment processes for these models before they can be applied to markets spanning healthcare, finance, insurance, HR and more.
Automating these processes doesn’t require the power of GPT-4 or Gemini but does require a specialised LLM.
“Developers who use finetuning and Retrieval-Augmented Generation (RAG) to train an LLM won’t get anywhere worthwhile without correct, relevant training data,” urged Patel and Das, speaking to press and analysts this month. “Unlike generalist LLMs, which ingest broad information from the web, niche LLMs require data that probably isn’t open source, plentiful, or even published. There isn’t exactly a rich corpus of texts about medical billing adjustments, for instance, and organisations in that space have their own rules and ways of doing it anyhow.”
So then, how does an engineer obtain the data to train (for example) a billion-parameter LLM to be highly competent at a process that exists only in one organisation, all without breaking the bank?
AI distillation & self-improvement
“Typically, an LLM training dataset comprises an input prompt and target output response. Those might take the form of a question (input) and correct answer (output). They are examples of what the LLM should know. Humans create the most reliable examples, but there’s a limit to how many examples a person or team can concoct. Plus, training an LLM purely on human-generated data is slow and expensive because of the many labour hours it entails. Two categories of techniques, distillation and self-improvement, can address speed and cost issues, but they have drawbacks,” explained Patel and Das.
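By way of a rough sketch of that example format, the snippet below writes a handful of (input prompt, target output) pairs to a JSONL file, the plain-text layout most finetuning pipelines accept. The medical-billing questions and answers here are invented purely for illustration, not drawn from any real ruleset.

```python
import json

# Toy (input, output) pairs in the spirit described above; in practice these
# would be written or verified by domain experts.
examples = [
    {
        "input": "A claim was denied because the service was bundled into another paid service. Should the patient be billed?",
        "output": "No. Bundled-service denials are typically written off by the provider rather than billed to the patient.",
    },
    {
        "input": "The date of service on a submitted claim was entered incorrectly. What is the next step?",
        "output": "Submit a corrected claim that replaces the original, carrying the accurate date of service.",
    },
]

# One JSON object per line is the usual layout for finetuning datasets.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```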
The two DataChat technologists paused to explain the difference between distillation and self-improvement.
Small LLM, baby
Distillation is when engineers summarise information into a shorter form for a small-parameter LLM. In one version of distillation, researchers found that they could feed questions to a powerful LLM, get correct responses and then feed those examples to the ‘baby’ LLM in training. The terms of service for top LLMs now forbid this practice, and using a lower-quality but open source LLM arguably carries risks of its own.
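As a minimal sketch of that first distillation variant, the loop below sends curated questions to a ‘teacher’ model and saves its answers as supervised examples for the smaller ‘student’ model. The call_teacher_llm function is a placeholder for whichever API or local model a team is permitted to use, not a real library call, and the terms-of-service caveat above still applies.

```python
import json

def call_teacher_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to the teacher model and return its reply."""
    raise NotImplementedError("wire this up to a model whose terms of service allow distillation")

def build_distillation_set(questions, out_path="distilled.jsonl"):
    """Ask the teacher each question and store (question, answer) pairs for the student."""
    with open(out_path, "w", encoding="utf-8") as f:
        for question in questions:
            answer = call_teacher_llm(question)
            f.write(json.dumps({"input": question, "output": answer}) + "\n")

# Usage sketch: the questions would normally be domain-specific and human-curated.
# build_distillation_set(["What does a bundled-service denial mean for billing?"])
```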
“It’s too much like having a high school student with mediocre mathematics skills teach algebra to 10-year-olds,” insisted Patel and Das. “Another distillation process is to summarise high-quality information – a legal document or college textbook, for example – and then feed it to the baby LLM. Developers can use a powerful LLM to create the summaries. That is only feasible, however, if there’s a corpus of text to summarise.”
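A comparable sketch of the summarisation route they mention might look like the following: chunk a long, high-quality document, have a stronger model condense each chunk and use the summaries as training text for the smaller model. The summarise_with_llm function is again a placeholder rather than a real API.

```python
def summarise_with_llm(text: str) -> str:
    """Placeholder: return a concise summary of `text` from a capable model."""
    raise NotImplementedError("connect this to the summarising model of your choice")

def chunk(text: str, size: int = 2000):
    """Naive fixed-size chunking; real pipelines usually split on sections or headings."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def distil_document(document: str):
    """Summarise each chunk so the output can feed the smaller model's training set."""
    return [summarise_with_llm(piece) for piece in chunk(document)]
```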
The other technique, self-improvement, was popular in early 2023 (a long time ago in generative AI years) but fell out of favour. The pair explain that self-improvement runs an LLM in a loop that enables it to learn from itself, but this tends to amplify biases in the model and limit the diversity of the training data.
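In outline, and purely as an illustration of the idea rather than any particular implementation, a self-improvement loop might look like the sketch below: the model writes new examples from its current pool, grades them itself and folds the accepted ones back in for another round of finetuning. The generate_example, judge_example and finetune functions are placeholders; the comments flag where the bias amplification creeps in.

```python
def generate_example(model, seed):
    raise NotImplementedError("placeholder: ask `model` to write a new example similar to `seed`")

def judge_example(model, example) -> bool:
    raise NotImplementedError("placeholder: ask `model` whether `example` looks correct")

def finetune(model, pool):
    raise NotImplementedError("placeholder: finetune `model` on `pool` and return the updated model")

def self_improve(model, seed_examples, rounds: int = 3):
    pool = list(seed_examples)
    for _ in range(rounds):
        candidates = [generate_example(model, seed) for seed in pool]
        # The same model produces and grades the data, so its blind spots
        # survive every round and the pool drifts towards what it already knows.
        pool += [c for c in candidates if judge_example(model, c)]
        model = finetune(model, pool)
    return model, pool
```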
The Jeopardy Method
A workaround here is what we could call the Jeopardy Method [as in the popular US TV game show], i.e. give the LLM answers and then ask it to generate the correct questions. This form of self-improvement is less prone to bias and benefits from some human input.
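A minimal sketch of that idea, with ask_llm_for_question standing in as a placeholder for the actual model call, might run as follows: humans supply trusted answers and the model works backwards to the questions, so the factual content of every pair stays human-made.

```python
def ask_llm_for_question(answer: str) -> str:
    raise NotImplementedError("placeholder: prompt the model for a question that `answer` would answer")

def jeopardy_pairs(human_answers):
    """Turn human-vetted answers into (question, answer) training pairs."""
    pairs = []
    for answer in human_answers:
        question = ask_llm_for_question(answer)
        pairs.append({"input": question, "output": answer})  # the answer itself stays human-made
    return pairs
```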
“The most promising approaches to LLM specialization combine everything we have discussed so far, with a twist,” said Patel and Das. “We start with human-generated synthetic data (aka the gold standard) and we get maximum mileage out of it using some data annotation techniques. For example, we could collect multiple target output responses for a single input and rank them based on preferences (a concept known as Reinforcement Learning from Human Feedback, or RLHF). This teaches the LLM to learn certain response characteristics over others. The process can even be automated using larger LLMs (called Reinforcement Learning from AI Feedback, in that case). Now that we have a more mature LLM, fed on human-made data, we ask it to come up with more examples like the ones provided while confirming that those examples are correct. Once there’s an even bigger base of that high-quality data, then distillation and self-improvement can scale up the magic.”
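A rough sketch of the preference-ranking step described in that quote appears below: several candidate responses are sampled per prompt, ranked (by humans for RLHF, or by a larger model for RLAIF) and turned into chosen/rejected pairs. The sample_response and rank functions are placeholders for the actual model and judging step.

```python
from itertools import combinations

def sample_response(model, prompt: str) -> str:
    raise NotImplementedError("placeholder: draw one candidate response from the model")

def rank(prompt: str, responses):
    raise NotImplementedError("placeholder: return responses ordered best-first (human or AI judge)")

def build_preference_data(model, prompts, n_samples: int = 4):
    """Collect ranked responses and emit (chosen, rejected) preference pairs."""
    data = []
    for prompt in prompts:
        responses = [sample_response(model, prompt) for _ in range(n_samples)]
        ordered = rank(prompt, responses)
        # Every (better, worse) pairing becomes one preference example.
        for chosen, rejected in combinations(ordered, 2):
            data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return data
```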
Institutional convolutions
The LLM training debate is far from over; if anything, it is just getting started. Much of the argument in this space will gravitate around the need for high-quality data, a truth that will continue to play out as AI matures towards becoming more of a utility function, a process that may take the rest of the current decade if not longer. Even so, the human factor may ultimately prove to be the most important consideration. In some domains, industries and organisations, knowledge is institutional and the rules governing decisions are understood by only a few people. This is a killer point, say Patel and Das, because it means that the highest quality data for LLM specialisation is still human-made.
The rest of AI reasoning, inference and decision-making simply builds upon those human choices. In the end, AI is all about human brain power, for now at least.