Google now supports training generative AI models on up to 65,000 Kubernetes nodes, which the company says is ten times the capacity of competing services.
OpenAI’s GPT-4 reportedly contains 1.8 trillion parameters, while the largest model whose size is publicly documented is Llama 3.1, at 405 billion parameters. Training such large language models (LLMs) takes enormous amounts of time, computing power, and money, and is often simply not feasible on public cloud instances. Google Cloud, however, appears to have made significant headway in this area.
Previously, Google Kubernetes Engine (GKE) supported clusters of up to 15,000 nodes, which was sufficient for today’s LLMs. In anticipation of tomorrow’s models, Google Cloud now supports 65,000 interconnected nodes. These nodes do not use the GPUs typical of most generative AI workloads, but Google’s own TPU v5e chips, four per node. A single cluster can therefore contain more than 250,000 accelerators (65,000 nodes × 4 chips = 260,000).
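To make that count concrete, here is a minimal sketch of how a JAX job sees its chips, assuming it runs on TPU v5e hosts where each node exposes four local chips; the reported counts depend on the actual environment, and the final line simply reproduces the arithmetic above.

```python
# Minimal sketch: how a JAX job sees TPU chips, assuming TPU v5e hosts
# (each node/host exposes 4 chips). On other backends the counts differ.
import jax

local = jax.local_device_count()  # chips attached to this node (4 on v5e)
total = jax.device_count()        # chips across the entire job

print(f"{local} chips on this node, {total} chips in the job")

# GKE's new ceiling, per the figures above:
print(65_000 * 4)  # = 260,000 accelerators in a single cluster
```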
Multislice
How is it possible to coordinate so many TPUs effectively? Linking hardware at this scale usually demands complex networking. However, the TPU v5e, introduced in 2023, supports “Multislice” technology, which allows near-linear scaling and makes efficient use of 65,000 nodes possible. Achieving this took more than the new TPU hardware alone: Google also overhauled the entire GKE infrastructure and replaced the open-source etcd with its own distributed database, Spanner, as the cluster’s state store.
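Multislice is exposed to frameworks such as JAX. The sketch below is a hypothetical two-slice configuration, not Google’s production setup: it uses jax.experimental.mesh_utils.create_hybrid_device_mesh to build a device mesh that keeps heavy communication on the fast in-slice interconnect (ICI) and routes only lighter traffic over the data-center network (DCN) between slices.

```python
# A minimal sketch of sharding work across TPU slices with JAX Multislice.
# Assumes a multislice environment of two TPU v5e slices with 4 chips each;
# the mesh shapes are illustrative only.
import jax
import numpy as np
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Within a slice, chips talk over ICI; across slices, over DCN.
# create_hybrid_device_mesh arranges devices so the "data" axis spans
# both, while heavy per-step communication stays on ICI.
ici_mesh = (4, 1)  # chips per slice along (data, model) axes -- illustrative
dcn_mesh = (2, 1)  # number of slices along the same axes -- illustrative
devices = mesh_utils.create_hybrid_device_mesh(ici_mesh, dcn_mesh)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard a batch along "data" so each slice processes its own chunk.
batch = np.arange(16.0).reshape(8, 2)
sharded = jax.device_put(batch, NamedSharding(mesh, P("data", None)))

# The jit-compiled computation then runs in parallel across all slices.
print(jax.jit(lambda x: (x ** 2).sum())(sharded))
```

Splitting the mesh this way matters because DCN bandwidth is far lower than ICI bandwidth, which is what makes the near-linear scaling claim plausible: the cross-slice dimension carries only gradient-style reductions rather than per-layer activations.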
The practical implications of this advancement remain to be seen. While models continue to grow, the capability gains that LLMs have so far shown from adding parameters may eventually hit a ceiling. For example, GPT-3, with 175 billion parameters, was dramatically more capable than GPT-2, which had only 1.5 billion.
Importantly, Google’s new cluster is not solely for training giant models. The company believes researchers also need this level of cloud infrastructure. “Centralizing computing power within the smallest number of clusters gives customers the flexibility to quickly adapt to changes in demand from inference, research, and training workloads,” write Drew Bradstock and Maciek Różacki of Google Cloud’s GKE team.
Also read: Google launches GKE Enterprise for easier Kubernetes management