
Cerebras partnership breathes new life into AWS Trainium


When it comes to AI workloads, inference is by far the most prevalent. There are many ways to serve AI models day to day, but the most efficient method has long been elusive. AWS and Cerebras are now collaborating in a way that redefines the nature of these workloads. What can ‘disaggregated’ inference deliver?

The distinction between AI training and inference is straightforward. While LLMs rely on training to become functional systems, inference is how an LLM is actually deployed: every output, in whatever form, is the result of inference. But the breakdown of AI workloads goes beyond this dichotomy. In Transformer models, inference itself consists of two phases: prefill and decode. AWS and Cerebras are now separating these two components as well.

Prefill and decode

AI training requires massive computing power and is often the primary reason “AI factories” are established. Inference is less demanding and can run via the public cloud at a manageable cost. But AWS has discovered that Trainium, originally intended for heavy training workloads, excels at prefill. Cerebras, the maker of massive “wafer-scale” AI chips, appears to excel at decode.

Prefill involves processing the input, whether it’s a message from an end user to a chatbot, an image, or an API call via MCP from another application. Here, raw computing power is the limiting factor. AWS Trainium, described elsewhere as a “disaster,” seems far removed from the performance level demanded by major AI labs. Anthropic, said to be the “only meaningful Trainium customer,” employs a multi-cloud strategy: in addition to AWS, it relies on computing power from Google Cloud and, by extension, the TPUs on that platform.

AWS Trainium therefore needs a new raison d’être, and the way out appears to be AI inference. This is something of a downgrade in ambition: inference is a less demanding workload, and it is often what a former training chip ends up running once it no longer offers the state-of-the-art performance it once did.

Cerebras, however, offers something else: bandwidth. The latest CS-3, equipped with 900,000 cores, reportedly reaches a maximum throughput of 21 petabytes per second. A single “chip” is in fact an entire wafer, which would normally be cut up into many separate processors. Such petabyte-level speeds are only possible because connectivity within a chip is always many times faster than connectivity between chips, as between a GPU and its separate memory modules.

And that is precisely what AI inference requires in its second and final step. Decode, the phase following prefill, revolves around generating tokens and thus the output: the end result, such as a chatbot’s response to a question, an AI-generated image, and so on.
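The asymmetry between the two phases can be illustrated with a toy sketch. This is not the AWS/Cerebras stack, just a minimal, assumption-laden illustration: prefill processes the entire prompt in one parallel pass (compute-bound), after which decode generates tokens one at a time, each step re-reading the cached state (bandwidth-bound).

```python
# Toy illustration of the two inference phases (not the actual AWS/Cerebras
# stack). The token names and cache format are placeholders.

def prefill(prompt_tokens):
    # One large batched pass over all prompt tokens at once.
    # In a real Transformer this builds the KV cache for every position.
    return [("k", "v", tok) for tok in prompt_tokens]

def decode(kv_cache, max_new_tokens):
    output = []
    for _ in range(max_new_tokens):
        # Every step must read the entire cache to attend over the
        # context -- this is why decode is bandwidth-hungry.
        context_size = len(kv_cache)
        next_token = f"tok{context_size}"  # stand-in for the sampling step
        kv_cache.append(("k", "v", next_token))
        output.append(next_token)
    return output

cache = prefill(["The", "answer", "is"])
print(decode(cache, 3))  # → ['tok3', 'tok4', 'tok5']
```

The key point the sketch makes: prefill is one big, parallelizable operation, while decode is an inherently sequential loop whose cost per step scales with the size of the cache it must stream through.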

The bottom line is a new idea

The magic word in the announced collaboration between AWS and Cerebras is “disaggregation”: the splitting of prefill and decode across different hardware. With this combination now available in production for the first time, we can safely say that a new era for AI inference is dawning.

The technology itself didn’t appear out of thin air: research on splitting prefill and decode was published in September. That work spanned different GPU vendors, but the exact branding of the AI chips matters little here.

Another technical term for this phenomenon is heterogeneous parallelism: running different types of chips on the same workload, performing their calculations simultaneously. We suspect a more memorable term will emerge as other hyperscalers adopt the same methodology.
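Conceptually, such a setup amounts to a router that dispatches each phase to the pool best suited for it and hands the KV cache across. The following is a minimal sketch under stated assumptions: the pool names, the handoff format, and the `route` function are all illustrative, not the actual AWS/Cerebras interface.

```python
# Hypothetical sketch of disaggregated, heterogeneous serving: two worker
# pools of different chip types, prefill on one, decode on the other.

class Pool:
    def __init__(self, name, strength):
        self.name = name          # e.g. "trainium" or "cerebras" (assumed labels)
        self.strength = strength  # "compute" or "bandwidth"

def route(request, prefill_pool, decode_pool):
    # Step 1: prompt processing goes to the compute-optimized pool,
    # producing the KV cache.
    kv_cache = {"source": prefill_pool.name, "prompt": request["prompt"]}
    # Step 2: the cache is handed off over the network; token generation
    # then runs on the bandwidth-optimized pool. The transfer cost is what
    # the split must amortize over the generated tokens.
    return {"cache_from": kv_cache["source"], "decoded_on": decode_pool.name}

result = route({"prompt": "hello"},
               Pool("trainium", "compute"),
               Pool("cerebras", "bandwidth"))
print(result)  # → {'cache_from': 'trainium', 'decoded_on': 'cerebras'}
```

The design question such a router has to answer is whether the tokens generated per request are numerous enough to pay back the cost of shipping the cache between pools.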

From Loss to Profit

The announcement will have to prove itself. AWS states that Anthropic and OpenAI remain committed to Trainium, which will also have to do with the billions of dollars AWS is investing in both parties.

Now, however, AWS appears to have a new plan. Trainium 4 is expected to launch in 2027, with the goal once again being to become the go-to choice for AI training among AI labs. But eventually—whether shortly after release or later—Trainium 4 will likely follow in the footsteps of Trainium 3 and be utilized in a similar partnership with Cerebras chips.

Even more AI processors could follow this paradigm. This goes beyond benchmark wins: it leverages, for the long term, the AI capacity currently being built out. In this regard, AWS and Cerebras have positioned themselves for the future.

Read also: Nvidia is working on a chip for AI inference with Groq technology