As organizations deploy generative AI at an ever-increasing rate, its shortcomings are becoming all too apparent. Training an LLM on proprietary data is costly and time-consuming, data privacy is difficult to ensure, and outputs are all too often factually incorrect or otherwise undesirable. Retrieval-Augmented Generation (RAG) can solve many of these problems. What does it entail?
Retraining an AI model on new data can take weeks. RAG offers an alternative: with this method, new information is placed in an external store from which an LLM can draw. We’ve compared it to consulting a book rather than having to rely solely on one’s own memory. It means AI models can take in new information and cite their sources at the same time. RAG enables developers to make AI models cite relevant documents, which also helps improve the final output. Footnotes make the information in AI answers directly traceable. In addition, organizations can use RAG to make LLMs excel in specific knowledge domains, which reduces the likelihood of inaccuracies cropping up.
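The mechanism can be illustrated with a short sketch. The retrieval and prompt-building functions below are hypothetical stand-ins (no particular library is assumed); the point is that retrieved passages are injected into the prompt as numbered sources, so the model’s answer can cite them like footnotes.

```python
# Minimal illustration of the RAG pattern: retrieve passages, inject them into
# the prompt as numbered sources, and ask the model to cite them.
# `retrieve` is a placeholder for a real vector-search step.

def retrieve(question: str, documents: list[str], top_k: int = 3) -> list[str]:
    # Placeholder: a real system would rank documents by embedding similarity.
    return documents[:top_k]

def build_prompt(question: str, passages: list[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the sources below and cite them "
        "as footnotes like [1].\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )

docs = ["RAG was introduced in a 2020 paper.", "RAG grounds LLM output in external data."]
prompt = build_prompt("When was RAG introduced?", retrieve("When was RAG introduced?", docs))
print(prompt)  # this prompt is then sent to any LLM; the citations keep the answer traceable
```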
Tip: How do you roll out GenAI in enterprise environments?
The 2020 paper that coined the term RAG described it as a “general-purpose fine-tuning recipe.” RAG can connect almost any LLM to virtually any external resource. Organizations can therefore benefit from the most advanced models while still applying their proprietary data. Moreover, all of this can be implemented with just five lines of code.
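That “five lines” claim roughly matches the quick-start of popular RAG frameworks. As an illustration, a LlamaIndex-style setup looks something like the sketch below; exact import paths vary between library versions, the data directory name is an assumption, and the default configuration expects an OpenAI API key.

```python
# Roughly the canonical five-line RAG setup from LlamaIndex's quick-start.
# Import paths differ between library versions, and the defaults assume an
# OpenAI API key is configured in the environment.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()   # load proprietary files
index = VectorStoreIndex.from_documents(documents)      # embed and index them
query_engine = index.as_query_engine()                  # retrieval + generation
print(query_engine.query("What does our internal documentation say about this topic?"))
```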
Better than fine-tuning
Fine-tuning already allowed organizations to tweak AI models as desired. However, the outputs of the modified model remain opaque, partly because the initial training data is usually inaccessible. Another problem with fine-tuning is that it requires additional AI training, a time-consuming and pricey exercise. And ultimately, despite all this effort, the probability of hallucinations remains roughly comparable to that of the initial model. By contrast, RAG is quick to implement, traceable to specific data sources and less likely to lead to hallucinations.
Applications can be implemented anywhere
Nvidia claims that implementing RAG is not only relatively easy, but also inexpensive. It uses Meta’s Llama 2 model to demonstrate a possible workflow within its AI Enterprise offering. Vector databases play an important role here, significantly speeding up the actual implementation of AI by linking relevant data together. These databases also make fine-tuning an AI model easier, although, as stated before, that is neither the most time-efficient option nor practical on a day-to-day basis. Adjustment through RAG is a lot faster: external documents can be added and adjusted continuously, something Nvidia describes as a task that needs to be performed regularly.
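What the vector database contributes can be shown with a toy example: documents are turned into vectors once, and a query is answered by finding the nearest stored vector. The embed function here is a deliberately crude stand-in for a real embedding model and a dedicated vector store.

```python
# Toy illustration of the vector-database step: documents and queries are
# embedded as vectors, and retrieval is a nearest-neighbour search.
# `embed` is a stand-in; a real pipeline uses an embedding model and a vector store.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical embedding: hash words into a small fixed-size vector.
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

corpus = [
    "Latest oncology treatment guidelines, updated this quarter.",
    "Historical stock prices for the past decade.",
    "Company travel expense policy.",
]
matrix = np.stack([embed(doc) for doc in corpus])        # index built once
query_vec = embed("current cancer treatment guidance")   # embedded at query time
best = corpus[int(np.argmax(matrix @ query_vec))]        # cosine similarity (vectors are normalised)
print(best)
```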
Also read: In AI development, never lose your RAG
In practical terms, this means that organizations can deploy AI in dynamic ways. Examples from Nvidia include AI healthcare assistants with access to an up-to-date medical index, so that the latest information can be included. Similarly, financial analysts may be able to better utilize AI thanks to the inclusion of up-to-date market data. However, a foundation should already be laid by choosing a proficient AI model with high-quality training data. Nvidia compares this model to a judge assisted by RAG court clerks that retrieve specific expertise from a law library.
Challenges
Unfortunately, that comparison suggests a level of authority that LLMs have not yet earned. The large-scale implementation of GenAI presents many opportunities, but just as many pitfalls. RAG alleviates many of the issues at hand, but it certainly doesn’t make AI infallible. For example, documents deemed relevant through RAG may not match the user’s actual intent. An otherwise desirable output may also be disrupted by the additional RAG information. The extra complexity that RAG introduces poses a problem for developers: it becomes more challenging to determine what to tinker with to achieve better outputs, and what to leave alone.
Because external sources are still in use, RAG does not necessarily make securing data privacy any easier. The use of third-party databases presents privacy issues that are difficult to resolve. Several AI tools require sending out sensitive data, while running models on-premises is often simply not feasible (or not possible at all, as with the cloud-only GPT-4). In this area, RAG has no solution to offer.
Other challenges mainly relate to the lightning-fast development of GenAI in general. For example, RAG must keep pace with the rapidly growing context windows that new AI models enable. Google’s Gemini 1.5, for instance, is being tested with a context window of 1 million tokens, equivalent to a text of more than 700,000 words. For now, it is also unclear how RAG deals with the scale of an AI model: is the technique better suited to the largest models or to small ones? Another development is multimodality: video, images and audio present challenges that the currently primarily text-based RAG does not yet have answers to.
Tip: Gemini 1.5 is much more than a new foundation model
Continuous improvement
Although implementing RAG requires only five lines of programming code, getting the practical deployment right is, as with other AI applications, a major challenge. The RAG architecture requires expertise to operate efficiently, securely and consistently. Additional retrieval steps also add latency, and keeping that latency low is crucial for a successful AI deployment.
Regardless, the evolution of RAG shows that much is possible in a short period of time. Several breakthroughs have been made since the introduction of RAG in 2020, but innovation has accelerated over the past 12 months. In particular, the efficiency of RAG has improved significantly in the past year. This is being achieved in a variety of ways: filtering out irrelevant documents, adding steps to verify document eligibility and speeding up information retrieval. In October, removing some of the less relevant tokens even resulted in a 62 percent runtime reduction, while performance barely declined (2 percent). It shows that RAG is an ever-evolving technique with unexplored avenues for improvement.
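One of those efficiency measures, filtering out irrelevant documents before they reach the model, can be sketched as a simple relevance threshold applied to the retrieval results. The scores and threshold below are placeholder values, not the method from the cited work.

```python
# Illustrative sketch of pre-generation filtering: retrieved passages whose
# relevance score falls below a threshold are dropped, so the LLM sees fewer,
# more relevant tokens. Scores and threshold are placeholder values.
def filter_passages(scored_passages: list[tuple[str, float]],
                    threshold: float = 0.35,
                    max_passages: int = 3) -> list[str]:
    kept = [(p, s) for p, s in scored_passages if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in kept[:max_passages]]

retrieved = [("Relevant clause on data retention", 0.82),
             ("Loosely related blog post", 0.31),
             ("Another strong match on retention periods", 0.74)]
print(filter_passages(retrieved))  # the weak match never reaches the prompt
```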
That RAG is attractive to organizations is evidenced by its widespread adoption. Microsoft offers the ability to integrate RAG through Azure AI Studio, while implementations are also available for OpenAI’s ChatGPT and IBM Watsonx.ai. Frameworks and libraries for RAG are already offered by Deepset and Google. In other words, those who want to get started with this technology have a variety of providers to choose from. AWS also allows foundation models to be connected to data sources for RAG “with just a few clicks.” This means that all major cloud players can already enable AI refinement via this method.
Also read: Nvidia lets users build chatbot that runs locally on PC