Google researchers claim they’ve cracked the code to give large language models (LLMs) the literary equivalent of an endless appetite. Their latest paper unveils ‘Infini-attention,’ a technique that supposedly lets models munch on input of infinite length without getting indigestion from memory overload.
In the competitive AI landscape, the ability to process more information means a leg up on rivals. Companies are making their LLMs more effective with internal documents and data to outsmart the competition. But until now, there's been a nagging memory issue, often the fundamental stumbling block preventing the next step up in the quality and accuracy of AI outputs.
Transformers in large language models, which excel at understanding and generating human-like text, exhibit 'quadratic complexity' in memory usage and computation time. In AI training, memory requirements and processing time escalate quadratically rather than linearly as the size of the input data increases.
Doubling input size means quadrupling computation time
As VentureBeat puts it, doubling the input size from 1,000 to 2,000 tokens quadruples the memory and computation time. This follows from the nature of the self-attention mechanism within Transformers, the once-revolutionary component that lets LLMs focus on different parts of the input sequence while processing it. The very mechanism that allows LLMs to capture long-range dependencies and contextual information is also what drives the quadratic growth in resource demands.
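To make that scaling concrete, here is a minimal sketch of vanilla single-head self-attention in plain NumPy (our own illustration, not Google's code). The n × n score matrix is the culprit: doubling the number of tokens quadruples both the memory it occupies and the work needed to fill it.

```python
import numpy as np

def naive_self_attention(x, w_q, w_k, w_v):
    """Vanilla single-head self-attention over a sequence of n tokens.

    The `scores` matrix below has shape (n, n), so doubling n quadruples
    the memory it occupies and the compute needed to fill it.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # each (n, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (n, n)  <- quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                           # (n, d)

d = 64
rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
for n in (1_000, 2_000):
    x = rng.standard_normal((n, d))
    naive_self_attention(x, w_q, w_k, w_v)
    print(n, "tokens -> attention matrix holds", n * n, "scores")
```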
The result is that your typical LLM is like a college student cramming for an exam. Their ‘context window’ is akin to the number of books and articles they can flip through simultaneously. They start sweating bullets when they go beyond that limit, forgetting what they studied first.
Enter Infini-attention, the supposed game-changer. The experts at Google added a 'compressive memory' module to the classic attention mechanism in LLMs. In other words, when the text gets too long, the model stuffs older information into a mental attic to make room for the new.
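In rough terms, the paper describes that attic as a fixed-size associative memory: older segments are folded into a single matrix with a linear-attention-style update, and later queries read from it. The sketch below illustrates that idea; the exact names, shapes and the ELU+1 nonlinearity are illustrative choices on our part, not Google's released code.

```python
import numpy as np

def elu_plus_one(x):
    # Nonlinearity used for the linear-attention-style memory (illustrative choice).
    return np.where(x > 0, x + 1.0, np.exp(x))

class CompressiveMemory:
    """Fixed-size associative memory: a (d_k, d_v) matrix plus a normalizer.

    Its size does not grow with the number of tokens processed, which is
    the whole point: old segments are compressed into it instead of being
    kept around for full attention.
    """
    def __init__(self, d_k, d_v):
        self.M = np.zeros((d_k, d_v))   # associative bindings of past keys/values
        self.z = np.zeros((d_k, 1))     # running normalization term

    def update(self, k, v):
        sk = elu_plus_one(k)            # (n, d_k)
        self.M += sk.T @ v              # fold this segment's keys/values in
        self.z += sk.sum(axis=0, keepdims=True).T

    def retrieve(self, q):
        sq = elu_plus_one(q)            # (n, d_k)
        return (sq @ self.M) / (sq @ self.z + 1e-6)   # (n, d_v)
```

In the paper, what is retrieved from this memory is then blended with ordinary local attention over the current segment via a learned gate, so the model sees recent tokens in full detail plus a compressed summary of everything before them.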
Finding a number in a haystack of text
Google says its model outperforms other long-context models while using 114 (!) times less memory. The researchers ran several tests to see how fast and capable the model was. One involved burying a random number in a haystack of text up to a million tokens long, the so-called passkey retrieval task, sketched below.
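The passkey test itself is easy to picture. The snippet below is our own illustration of how such a haystack can be built, not the paper's benchmark code: filler text with a random number planted at an arbitrary depth, which the model is then asked to repeat back.

```python
import random

def build_passkey_prompt(n_filler_lines=50_000, seed=42):
    """Hide a random passkey inside a long wall of filler text."""
    rng = random.Random(seed)
    passkey = rng.randint(10_000, 99_999)
    filler = "The grass is green. The sky is blue. The sun is bright."
    lines = [filler] * n_filler_lines
    # Plant the passkey at a random depth in the haystack.
    lines.insert(rng.randrange(len(lines)),
                 f"The pass key is {passkey}. Remember it.")
    prompt = "\n".join(lines) + "\nWhat is the pass key?"
    return prompt, passkey

prompt, passkey = build_passkey_prompt()
print(f"prompt length: {len(prompt):,} characters, hidden passkey: {passkey}")
```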
Another tasked their model with summarizing texts half a million tokens in size. According to the research paper, the tests were conducted on LLMs with 1 billion and 8 billion parameters, respectively. That’s impressive, but unfortunately, Google didn’t share its models or code. So, what can we deduce without a peek behind the curtain?
The reported findings share a kinship with Google’s own Gemini performance, known for its prowess in handling texts of millions of tokens—a digital marathon runner among language models. Anthropic’s Claude 3 boasts a capacity of 200,000 tokens, while OpenAI’s GPT-4 stretches to a context window of 128,000 tokens. Mistral AI has a context window of 32,000 tokens.
Unearthing the most relevant bits, Sherlock Holmes style
The allure of LLMs with infinite context is akin to having a hyper-powered search engine. Picture dumping all your documents into the model's lap and letting it play detective, Sherlock Holmes style, to unearth the most relevant bits for each query. No more fine-tuning or retrieval-augmented generation (RAG) acrobatics: just sit back and watch the LLM do the heavy lifting.
An efficient memory system is crucial for getting LLMs to understand lengthy texts and adapt to new information. The method explored by researchers Munkhdalai, Faruqui and Gopal at Google integrates a compressive memory module into the attention layer of LLMs, allowing them to process incredibly long texts without overwhelming memory or computational resources.
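A plausible way to read "without overwhelming memory or computational resources" is that the text is consumed segment by segment: full attention only ever runs within one bounded segment, while everything older lives in the fixed-size memory. The loop below is a schematic illustration under that assumption, not the authors' implementation.

```python
import numpy as np

def process_long_input(tokens_embedded, segment_len=2048):
    """Consume an arbitrarily long embedded sequence in fixed-size segments.

    Peak working memory depends on segment_len, not on the total length,
    because each segment is compressed into the running (d, d) summary
    before the next one is loaded.
    """
    d = tokens_embedded.shape[-1]
    memory = np.zeros((d, d))            # fixed-size summary of everything seen
    for start in range(0, len(tokens_embedded), segment_len):
        segment = tokens_embedded[start:start + segment_len]   # (<=segment_len, d)
        # ... local attention over `segment` would go here ...
        memory += segment.T @ segment    # fold the segment into the summary
    return memory

stream = np.random.default_rng(1).standard_normal((50_000, 64))
summary = process_long_input(stream)
print("processed", len(stream), "tokens; summary shape stays", summary.shape)
```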
This approach might be the next step in AI training. However, it remains to be seen whether other companies can reproduce the results or find a similar way around the massive engineering effort required to train LLMs. If they can, it might become possible to fine-tune LLM pipelines to keep costs down and performance up.