GKE Scales to 65,000 Nodes for Trillion-Parameter AI Models

2024-11-13

Google Cloud

Google Cloud announced that Google Kubernetes Engine (GKE) now supports clusters of up to 65,000 nodes, enough to handle massive, trillion-parameter AI models. As generative AI evolves, the computing power needed to train these models keeps growing. According to Google Cloud, GKE now offers more than 10 times the scale of the other two largest public cloud providers, allowing customers to reduce model training time or scale models to multiple trillions of parameters. The expansion also makes it possible to run five jobs in a single cluster, each matching the scale of Google Cloud's previous world record for the largest LLM training job. Customers such as Anthropic, an AI safety and research company, have welcomed these developments.

On the technical side, GKE is transitioning from etcd, the open-source distributed key-value store, to a new, more robust key-value store based on Spanner, Google’s distributed database. Google Cloud expects this change to bring new levels of reliability for GKE users and to improve the latency of cluster operations. In addition, a major overhaul of the GKE infrastructure that manages the Kubernetes control plane means GKE now scales significantly faster. Google Cloud also maintains its commitment to open source, ensuring that all optimizations and improvements needed for this scale are part of core open-source Kubernetes.
