AI is smart, but it can be smarter in a cleaner and more controlled way through the use of RAG. Retrieval Augmented Generation (RAG) has quickly become something of a super vitamin shot for generative AI models, connecting them to external (often externally ratified and validated) datasets and context sources that do not exist within the Large Language Model (LLM) that any given AI is built upon. But how do the realities of implementing RAG play out in real-world AI use case environments, and what major factors are shaping this technology?
LLMs alone are not enough
We know that Large Language Models (LLMs) are trained on huge amounts of data to understand words and context and respond appropriately. However, using an LLM on its own will not be enough to satisfy what many IT teams (and indeed business users) will want to achieve with the generative AI-powered applications, services, components or software-based AI agents being built today.
“The agent provides more effective responses than just using an LLM by adding in additional context from a company’s own proprietary data,” explained Couldwell. “For example, if we wanted to include real-time data in responses, use information specific to a given customer in a reply, or leverage company Intellectual Property (IP) in order to provide more relevant contextual content in responses to end user queries.”
But, he warns, data of this type is difficult to train into an LLM, and any business looking at retraining its own LLM has an expensive, if not impossible, task ahead if it is using a core service from an LLM provider. Equally, the company will most likely not want to train sensitive data into the LLM in case it is used inappropriately. Instead, it should look to add that data as query-time context so an LLM can use it to provide more relevant responses.
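The idea of query-time context can be sketched in a few lines: rather than retraining the model, the application assembles proprietary records into the prompt at the moment a question is asked. The function name and the customer record below are purely illustrative.

```python
# Minimal sketch: injecting proprietary data as query-time context
# instead of training it into the LLM. All data here is hypothetical.

def build_prompt(question: str, context_records: list[str]) -> str:
    """Assemble an LLM prompt that carries company data as context."""
    context = "\n".join(f"- {record}" for record in context_records)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )

prompt = build_prompt(
    "What is the customer's current plan?",
    ["Customer 1042 is on the 'Pro' plan, renewing next year."],
)
print(prompt)
```

The assembled string would then be sent to the LLM provider's API in place of the bare question.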
Explicit & implicit AI information
“As already noted here, RAG describes how we can prepare data and then provide it to the LLM so it can be added in context. This additional context falls into two broad categories,” clarified Couldwell.
He defines the difference between explicit and implicit information in the AI world as follows:
Explicit information is data that we know is relevant to a particular query. For example, a healthcare organisation might need an AI agent to provide information about a patient’s condition and what care path should be followed. Alternatively, a financial firm might want to provide more accurate information around a customer’s account data as part of its conversation with them. These are explicit information scenarios.
Implicit information, by contrast, is more general information that is relevant to the query being asked. For example, a company may have a ‘run book’ (or some form of best practice business rules manual that dictates how procedures are carried out according to an agreed policy) on how to address certain situations, or other users may have asked similar questions previously that can be used to better answer a new query. This information is implied, but it has a general (albeit still quite specific) contextual connection to the use case at hand.
“To make this more general information available, IT teams need to first prepare the data so it can be leveraged by the AI agent,” stated DataStax’s Couldwell. “This involves ‘chunking up’ the data and mapping the semantic meaning of the elements in the data into numerical representations. These representations are termed vectors. These vectors can be stored and searched so that a customer’s request can be answered in context. RAG carries out a vector search, finds the best matches for that customer’s request… it then passes that data across to the LLM so it can be assembled into a response that is more accurate and in context. Taking the company run book example, we would chunk up the run book and create vectors from each of the chunks in order to enable search for the most semantically relevant sections, which the agent can use when delivering answers.”
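The vector-search step described above can be illustrated with a toy example: embed each chunk, embed the query, and return the most similar chunk. Real systems use learned embedding models and a vector database; the bag-of-words embedding below is only a stand-in to show the flow, and the run book snippets are invented.

```python
# Toy illustration of vector search over run book chunks.
# A real pipeline would use a trained embedding model, not word counts.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_chunk(query: str, chunks: list[str]) -> str:
    """Vector search: return the chunk most similar to the query."""
    q = embed(query)
    return max(chunks, key=lambda c: cosine(q, embed(c)))

chunks = [
    "Restart the billing service if invoices stall.",
    "Escalate security incidents to the on-call lead.",
]
print(best_chunk("How do I handle a security incident?", chunks))
```

The winning chunk, not the whole run book, is what gets handed to the LLM as context.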
NOTE: Data chunking is a formalised process in vector databases – as defined by Webopedia, “Chunking is the process of breaking down large amounts of data into smaller, more manageable pieces. It’s an important concept in programming because it allows programmers to work with larger sets of information without becoming overwhelmed by the volume. This allows for faster retrieval and analysis when needed while reducing the amount of memory required to store it all at once.”
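A simple way to picture the chunking step is a fixed-size window with overlap, so that a sentence split across a boundary still appears whole in at least one chunk. Production pipelines often split on sentences or document structure instead; the sizes below are arbitrary.

```python
# Sketch of fixed-size chunking with overlap, as described in the note above.
def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

run_book = ("If the queue backs up, restart the worker pool "
            "and notify the duty engineer.")
for piece in chunk_text(run_book):
    print(piece)
```

Each resulting chunk would then be embedded as a vector and stored for search.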
Fixing relevancy issues
However, even with all of this said, having data as vectors is not enough on its own. There are additional steps to improve how an AI agent uses vector data.
“Start by defining how an AI agent should act and what its role is. For instance, AI agents like ChatGPT will typically respond like a chatbot. If this is not suitable, tell the agent how to behave as part of the initial prompt, such as acting like a lawyer or a customer service operative for a specific service provider. This can affect the choice of words that the LLM will use,” said Couldwell.
We may still have to provide guidance to the AI agent on which context data to use, so that it is more likely to draw on the most useful data. For instance, including examples of relevant situations within the prompt can improve accuracy. We should also not assume that the agent will know the right format to use, so specify the format that the response should be in and any key points that should be included.
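The guidance above — a role, worked examples and an output format — is typically combined into a single system prompt. A minimal sketch, with entirely illustrative role and example text:

```python
# Sketch of combining role, few-shot examples and a format spec
# into one system prompt, as the steps above describe.
def system_prompt(role: str, examples: list[tuple[str, str]], fmt: str) -> str:
    """Build a system prompt from a role, example Q&A pairs and a format."""
    lines = [f"You are {role}."]
    for question, answer in examples:
        lines.append(f"Example question: {question}")
        lines.append(f"Example answer: {answer}")
    lines.append(f"Respond in this format: {fmt}")
    return "\n".join(lines)

print(system_prompt(
    "a customer service operative for a broadband provider",
    [("My connection drops at night.",
      "Sorry to hear that. Let's run a line test first.")],
    "one short paragraph, then a numbered list of next steps",
))
```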
“Lastly,” notes Couldwell, “the IT team may have to instruct the agent to use Chain of Thought [a prompting technique that enables complex reasoning through intermediate reasoning steps] as part of the approach to generating responses. This breaks down requests into smaller problems, which the agent can then respond to in sequence. This helps the agent provide more accurate responses, as it can look at each element in turn and then combine those steps into a complete answer.”
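In practice, a chain-of-thought instruction can be as simple as asking the model to decompose the request before answering. A minimal, hypothetical wrapper:

```python
# Sketch of a chain-of-thought instruction: ask the model to break
# the request into sub-problems and solve them in sequence.
def cot_prompt(question: str) -> str:
    """Wrap a question in a chain-of-thought instruction."""
    return (
        f"Question: {question}\n"
        "Break the question into smaller sub-problems, solve each one "
        "in order, then combine the steps into a final answer.\n"
        "Let's think step by step."
    )

print(cot_prompt("Which of our three data centres should host the new service?"))
```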
Going beyond RAG
RAG can improve responses, but it is not a magic wand that will improve AI agents automatically. It will need testing and checking so that it provides responses that are accurate and useful. It can also be difficult to understand why the LLM provides the results that it does, even with RAG in place.
“Further developments here include using RAG Fusion, where we ask semantically similar questions and compare the answers to see which ones are better. Forward-Looking Active Retrieval, or FLARE, adds custom information into the prompt so that an LLM generates questions around key phrases that it thinks would provide better answers,” concluded DataStax’s Couldwell.
In essence, RAG combines data, search and LLM functionality so it can improve responses to questions. This area is developing rapidly as developers want to build better experiences for users.