Users of AI hub Hugging Face can now more easily run their LLMs on AWS Inferentia2 AI accelerators, either through Amazon SageMaker or on dedicated EC2 instances. According to both parties, this increases efficiency and lowers operational costs for AI developers.
Thanks to the recently announced collaboration, AI developers who build on the many LLMs hosted on Hugging Face can now also run those models on AWS’ Inferentia2 AI accelerators in the production phase.
According to Hugging Face and AWS, this primarily offers AI developers greater efficiency and cost savings. The AWS Inferentia2 processors are said to be particularly well suited to the many inference operations LLMs perform in their production phase.
Additionally, AWS hopes that more AI developers will use its cloud environment to develop such models. The promise, at least, is that this benefits all parties involved.
Building on previous collaboration
Hugging Face users can access these dedicated AI accelerators in two ways. The first builds on the collaboration between the two parties announced earlier this year around the cloud-based machine-learning platform Amazon SageMaker. Through this tool, an LLM like Meta’s Llama 3 can run on AWS Inferentia2 accelerators for inference tasks. This functionality has now been extended to more than 100,000 public LLMs, spanning 14 LLM architectures.
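For illustration, deploying such a model from SageMaker onto an Inferentia2 instance could look roughly like the sketch below. It assumes the Hugging Face TGI container for AWS Neuron available through the SageMaker Python SDK; the model ID, environment values and instance type are illustrative, not official recommendations.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# IAM role with SageMaker permissions (assumes a SageMaker execution context)
role = sagemaker.get_execution_role()

# Retrieve the Hugging Face TGI container built for AWS Neuron (Inferentia2)
image_uri = get_huggingface_llm_image_uri("huggingface-neuronx")

# Illustrative configuration for Llama 3 8B; values are assumptions
model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B",
        "HF_NUM_CORES": "2",           # Neuron cores to shard the model across
        "HF_AUTO_CAST_TYPE": "fp16",
        "MAX_BATCH_SIZE": "4",
        "MAX_INPUT_LENGTH": "3686",
        "MAX_TOTAL_TOKENS": "4096",
    },
)

# Deploy to an Inferentia2-backed SageMaker endpoint
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
)

print(predictor.predict({"inputs": "What does AWS Inferentia2 accelerate?"}))
```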
The second option, tailored to Llama 3, is deploying LLMs via Hugging Face’s own Inference Endpoints solution. There, end users can deploy their LLMs on dedicated AWS Inferentia2-based EC2 instances.
With this option, the Hugging Face Inference Endpoints solution uses Text Generation Inference (TGI) for Neuron to run Llama 3 on the AWS Inferentia2 accelerator. This technology is specifically designed to serve LLMs in production workloads at scale, with support for continuous batching and streaming. Usage is billed per second of capacity consumed, and endpoints scale up and down with demand.
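Programmatically, spinning up such an endpoint might look like the sketch below, using the `create_inference_endpoint` helper from the `huggingface_hub` library. The endpoint name, region and instance identifiers are assumptions; the exact Inferentia2 options are listed in the Inference Endpoints catalog.

```python
from huggingface_hub import create_inference_endpoint

# Instance identifiers below are illustrative assumptions; consult the
# Inference Endpoints catalog for the exact Inferentia2 options.
endpoint = create_inference_endpoint(
    "llama-3-8b-neuron",                   # endpoint name (hypothetical)
    repository="meta-llama/Meta-Llama-3-8B",
    framework="pytorch",
    task="text-generation",
    accelerator="neuron",                  # AWS Inferentia2 / Neuron
    vendor="aws",
    region="us-east-1",
    instance_type="inf2",                  # assumed identifier
    instance_size="x1",                    # assumed identifier
    type="protected",
)

# Block until the endpoint is live, then send a test prompt
endpoint.wait()
print(endpoint.client.text_generation("What does AWS Inferentia2 accelerate?"))
```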
Two flavours
For the AWS EC2-based option with Llama 3, users again get two choices for deploying their LLMs: one cheaper, one more expensive. The first is an Inf2 small instance with two Neuron cores and 32 GB of memory. According to Hugging Face, this is well suited to Llama 3 8B and costs $0.75 per hour.
The second option is an Inf2 xlarge instance with 24 cores and 384 GB of memory. This option is suited to the much larger Llama 3 70B and costs $12 per hour of use.
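As a rough, illustrative calculation: an endpoint running around the clock for a 30-day month totals 720 hours, which comes to about $540 per month for the small instance (0.75 × 720) and about $8,640 for the xlarge one (12 × 720). Since billing is per second of capacity used and endpoints scale down automatically, bursty workloads should come in well below those figures.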
Also read: Meta unveils powerful open-source model Llama 3 and chatbot Meta AI