Google Announces Its Next-Gen Cloud TPU v5p AI Accelerator Chips & AI Hypercomputer

Google has announced the company’s “most powerful”, scalable, and flexible AI accelerator, dubbed the Cloud TPU v5p, along with a new AI Hypercomputer model.

Google Plans on Taking The Reins of The AI Bandwagon Through Its Brand New Cloud TPU v5p Chip & AI Hypercomputer Solutions

With AI markets progressing rapidly, companies are moving towards in-house solutions to supply the computing power that ongoing developments demand. Firms like Microsoft, with its Maia 100 AI Accelerator, and Amazon, with its Trainium2, aim to outdo each other with performance-optimized hardware for AI workloads, and Google has now joined the list.

Google has unveiled several exciting elements, such as its new Gemini model for the AI industry, but our coverage will focus on the hardware side of things. The Cloud TPU v5p is Google’s most capable and cost-effective TPU (Tensor Processing Unit) to date. Each TPU v5p pod consists of a whopping 8,960 chips, interconnected using Google's highest-bandwidth inter-chip connection at 4,800 Gbps per chip, which allows for fast transfer speeds between chips. Google doesn’t look to be holding back, as the generational gains below show.
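To put the interconnect figure in more familiar units, here is a minimal Python sketch that converts the quoted 4,800 Gbps per chip into gigabytes per second and, purely as an illustration, multiplies it across the 8,960 chips in a pod. The variable and function names are our own, and the pod-wide total is a naive sum that ignores topology and real-world overheads, so treat it as a rough upper bound rather than a figure published by Google.

```python
# Figures quoted in the article.
CHIPS_PER_POD = 8_960
ICI_GBPS_PER_CHIP = 4_800  # inter-chip interconnect, gigabits per second


def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


# Per-chip interconnect bandwidth in GB/s.
per_chip_gb_s = gbps_to_gigabytes_per_s(ICI_GBPS_PER_CHIP)

# Naive pod-wide sum in petabits per second (1 Pbps = 1,000,000 Gbps).
# Assumption: simple multiplication; not an official pod-level number.
naive_pod_total_pbps = CHIPS_PER_POD * ICI_GBPS_PER_CHIP / 1_000_000

print(f"Per chip: {per_chip_gb_s:.0f} GB/s")
print(f"Naive pod-wide sum: {naive_pod_total_pbps:.3f} Pbps")
```

In other words, 4,800 Gbps works out to 600 GB/s of interconnect bandwidth per chip.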

Image Source: Google Cloud

Compared to the TPU v4, the newly released v5p offers two times the FLOPS (floating-point operations per second) and three times more high-bandwidth memory (HBM), both of which are significant gains for artificial intelligence workloads.

Moreover, when it comes to model training, the TPU v5p shows a 2.8 times generational jump in LLM training speeds. Google has also left room to scale compute further, since the TPU v5p is “4X more scalable than TPU v4 in terms of total available FLOPs per pod”.


Summing things up for the Google Cloud TPU v5p AI chip:

  • 2X More FLOPS Versus TPU v4 (459 TFLOPS BF16 / 918 TOPS INT8)
  • 3X More Memory Capacity Versus TPU v4 (95 GB HBM)
  • 2.8X Faster LLM Training
  • 1.9X Faster Embedding-Dense Model Training
  • 2.25X More Bandwidth Versus TPU v4 (2765 GB/s vs 1228 GB/s)
  • 2X Interchip Interconnect Bandwidth Versus TPU v4 (4800 Gbps vs 2400 Gbps)
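
The multipliers in the list above can be sanity-checked from the raw figures. The short Python sketch below recomputes the memory-bandwidth and interconnect ratios from the numbers quoted in the list, and also estimates peak pod-level compute by multiplying the per-chip figures by the 8,960 chips per pod. The helper name `ratio` and the pod-level totals are our own illustration; the totals assume every chip runs at its quoted peak, which real workloads will not sustain.

```python
# Per-chip figures quoted in the summary list.
V5P_HBM_BW_GB_S = 2765   # TPU v5p memory bandwidth, GB/s
V4_HBM_BW_GB_S = 1228    # TPU v4 memory bandwidth, GB/s
V5P_ICI_GBPS = 4800      # TPU v5p interchip interconnect, Gbps
V4_ICI_GBPS = 2400       # TPU v4 interchip interconnect, Gbps
V5P_BF16_TFLOPS = 459    # TPU v5p peak BF16 throughput per chip
V5P_INT8_TOPS = 918      # TPU v5p peak INT8 throughput per chip
CHIPS_PER_POD = 8_960    # chips in one TPU v5p pod


def ratio(new: float, old: float) -> float:
    """Return how many times larger `new` is than `old`."""
    return new / old


hbm_bw_ratio = ratio(V5P_HBM_BW_GB_S, V4_HBM_BW_GB_S)
ici_ratio = ratio(V5P_ICI_GBPS, V4_ICI_GBPS)

# Theoretical pod peaks, converted from tera (1e12) to exa (1e18).
# Assumption: simple multiplication of per-chip peak figures.
pod_bf16_eflops = CHIPS_PER_POD * V5P_BF16_TFLOPS / 1_000_000
pod_int8_eops = CHIPS_PER_POD * V5P_INT8_TOPS / 1_000_000

print(f"Memory bandwidth ratio: {hbm_bw_ratio:.2f}x")
print(f"Interconnect ratio:     {ici_ratio:.2f}x")
print(f"Pod peak BF16: {pod_bf16_eflops:.2f} EFLOPS")
print(f"Pod peak INT8: {pod_int8_eops:.2f} exa-ops/s")
```

The ratios come out to roughly 2.25x and exactly 2x, matching the list, while the per-chip peaks multiply out to a little over 4 exaFLOPS of theoretical BF16 compute per pod.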

Google has recognized that success depends on having the best hardware and software resources working together, which is why the firm has introduced the AI Hypercomputer, a “set” of elements designed to work in cooperation to enable modern AI workloads. Google has integrated performance-optimized compute, optimal storage, and liquid cooling so that these capabilities can be leveraged together as a single system.

Image Source: Google Cloud

On the software side, Google has stepped things up with the use of open software to tune its AI workloads to ensure the best performance with its hardware. Here is a rundown of the newly added software resources in AI Hypercomputer:

  • Extensive support for popular ML frameworks such as JAX, TensorFlow, and PyTorch is available right out of the box. Both JAX and PyTorch are powered by the OpenXLA compiler for building sophisticated LLMs. XLA serves as a foundational backbone, enabling the creation of complex multi-layered models (Llama 2 training and inference on Cloud TPUs with PyTorch/XLA). It optimizes distributed architectures across a wide range of hardware platforms, ensuring easy-to-use and efficient model development for diverse AI use cases (AssemblyAI leverages JAX/XLA and Cloud TPUs for large-scale AI speech).
  • Open and unique Multislice Training and Multihost Inferencing software, respectively, make scaling, training, and serving workloads smooth and easy. Developers can scale to tens of thousands of chips to support demanding AI workloads.
  • Deep integration with Google Kubernetes Engine (GKE) and Google Compute Engine, to deliver efficient resource management, consistent ops environments, autoscaling, node-pool auto-provisioning, auto-checkpointing, auto-resumption, and timely failure recovery.

Google’s approach to artificial intelligence is evident in its new set of hardware and software elements, which aim to break the barriers that are limiting the industry. It will be interesting to see how the new Cloud TPU v5p processing units, coupled with the AI Hypercomputer, aid ongoing developments. One thing is certain: they are going to ramp up the competition.

News Source: Google Cloud
