Researchers at ETH Zurich in Switzerland have developed a technique that can make neural network inference more than 300 times faster, greatly reducing the computing power the inference process requires.
In the study, the Swiss researchers developed a technique that reduces the computing power required for inference with the transformer model "BERT" by as much as 99 percent.
Transformers are the neural network architecture underlying modern AI models. They consist of several layers, and their feedforward layers account for a large share of an LLM's parameters. These layers require a lot of computing power because, for every input, they compute the product of all neurons with all input dimensions.
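To make the cost concrete, here is a minimal NumPy sketch of a standard dense feedforward layer, in which every hidden neuron is evaluated for every input. The dimensions and variable names are illustrative, not the actual BERT sizes:

```python
import numpy as np

def dense_feedforward(x, W1, b1, W2, b2):
    # Dense matrix multiplication: every hidden neuron is
    # computed for every input vector, regardless of whether
    # its activation ends up contributing anything.
    h = np.maximum(0, x @ W1 + b1)  # ReLU over ALL hidden units
    return h @ W2 + b2

# Illustrative (toy) dimensions, not BERT's real ones.
rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32
x = rng.standard_normal(d_model)
W1 = rng.standard_normal((d_model, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.standard_normal((d_hidden, d_model))
b2 = np.zeros(d_model)
y = dense_feedforward(x, W1, b1, W2, b2)
```

The cost of this layer grows with the full hidden width, which is exactly what the researchers' approach avoids.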
The research shows that not all neurons in the feedforward layers need to be active during inference for every input. The researchers therefore propose "fast feedforward" (FFF) layers to replace the traditional feedforward layers.
By letting an FFF identify the appropriate neurons for each computation, the technique reduces the computational load and thus the overall computing power required. Ultimately, this leads to faster and more efficient LLMs.
FFFs achieve this with a mathematical operation called Conditional Matrix Multiplication (CMM), which replaces the Dense Matrix Multiplication (DMM) used by traditional feedforward networks.
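The paper's exact CMM formulation is not reproduced here; the following is an illustrative sketch only, assuming a balanced binary tree of decision neurons that routes each input to a single leaf, so only a logarithmic number of neurons is evaluated per input instead of the full hidden width. All names and dimensions are hypothetical:

```python
import numpy as np

def fast_feedforward(x, node_w, leaf_w1, leaf_w2, depth):
    """Illustrative conditional computation: descend a balanced
    binary tree of decision neurons. Only `depth` node neurons
    and one leaf's weights are evaluated per input, instead of
    a dense multiplication over the whole hidden layer."""
    node = 0
    for _ in range(depth):
        # One decision neuron per internal node routes the
        # input left or right based on the sign of its output.
        go_right = (x @ node_w[node]) > 0
        node = 2 * node + 1 + int(go_right)
    leaf = node - (2 ** depth - 1)  # index among the 2**depth leaves
    # Only this single leaf's small feedforward block is computed.
    h = np.maximum(0, x @ leaf_w1[leaf])
    return h @ leaf_w2[leaf]

# Toy usage with hypothetical dimensions.
rng = np.random.default_rng(0)
depth, d_model, d_hidden = 3, 8, 16
node_w = rng.standard_normal((2**depth - 1, d_model))
leaf_w1 = rng.standard_normal((2**depth, d_model, d_hidden))
leaf_w2 = rng.standard_normal((2**depth, d_hidden, d_model))
x = rng.standard_normal(d_model)
y = fast_feedforward(x, node_w, leaf_w1, leaf_w2, depth)
```

With a tree of depth d over 2^d leaves, each input touches only d decision neurons plus one leaf block, which is how conditional execution can cut the work per input so sharply.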
Up to 341 times faster
Experiments with BERT models show that the technique can significantly speed up the processing of large AI models: in tests, inference was up to 341 times faster.
The technique can also be applied to LLMs like GPT-3, according to the Zurich researchers. This opens up new possibilities for faster and more efficient natural language processing.