techminis

A naukri.com initiative


Cloudblog · 4w read

Image Credit: Cloudblog

Using RDMA over Converged Ethernet networking for AI on Google Cloud

  • Google Cloud is making the RDMA over Converged Ethernet (RoCE) v2 protocol available to support high-performance workloads such as AI/ML training and scientific computing. Traditional workloads move data between source and destination through the kernel TCP/IP stack, whereas AI workloads require low latency, high bandwidth, and lossless communication. RDMA lets systems exchange data directly, bypassing the OS, networking stack, and CPU for faster processing. Google Cloud now supports RoCE v2 on the A3 Ultra and A4 Compute Engine machine types, with benefits including lower latency, higher bandwidth, lossless communication, and scalability for large cluster deployments.
  • RDMA (Remote Direct Memory Access) enables systems to exchange data without involving the OS, networking stack, or CPU, resulting in faster processing times. With RoCE v2, Google Cloud brings these RDMA capabilities to its Ethernet fabric, using features such as priority-based flow control (PFC), explicit congestion notification (ECN), and the standard RoCE v2 encapsulation over UDP port 4791. This allows for faster training and inference, making Google Cloud's offering a key differentiator for demanding applications.
  • To take advantage of these capabilities, users need to create a reservation, choose a deployment strategy, and create their deployment. The detailed configuration steps and more information can be found in the provided documentation.
  • Google Cloud's support for RDMA over Converged Ethernet (RoCE) v2 protocol enables high-performance networking for AI workloads, leading to faster training and inference and improved application speed.
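The bullets above mention ECN and the RoCE v2 encapsulation over UDP port 4791. As a rough illustration of that wire format (a sketch, not Google Cloud's implementation), the snippet below packs a minimal IPv4 + UDP header pair: RoCE v2 payloads are carried over UDP destination port 4791, and ECN capability is signalled in the low two bits of the IPv4 TOS byte. The addresses are zeroed placeholders and the checksums are left unset.

```python
import struct

ROCE_V2_UDP_PORT = 4791  # IANA-assigned destination port for RoCE v2


def build_ipv4_udp_header(src_port: int, payload_len: int) -> bytes:
    """Build a minimal IPv4 + UDP header pair for a RoCE v2-style packet.

    The ECN field (low 2 bits of the IPv4 TOS byte) is set to ECT(0),
    marking the sender as ECN-capable, which is what allows switches to
    signal congestion without dropping packets.
    """
    version_ihl = (4 << 4) | 5        # IPv4, 5 x 32-bit header words
    tos = 0b10                        # DSCP 0, ECN = ECT(0)
    total_len = 20 + 8 + payload_len  # IP header + UDP header + payload
    ip_header = struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl, tos, total_len,
        0, 0,                         # identification, flags/frag offset
        64, 17, 0,                    # TTL, protocol = UDP, checksum unset
        bytes(4), bytes(4),           # placeholder src/dst IPv4 addresses
    )
    udp_header = struct.pack(
        "!HHHH",
        src_port, ROCE_V2_UDP_PORT,   # src port, RoCE v2 dst port 4791
        8 + payload_len, 0,           # UDP length, checksum unset
    )
    return ip_header + udp_header
```

Parsing the result back out shows the two fields the bullets call out: offset 22 holds destination port 4791, and byte 1 carries the ECN marking.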

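The steps listed above (create a reservation, choose a deployment strategy, create the deployment) can be sketched with the gcloud CLI. This is a hypothetical outline only: the reservation name, zone, VM count, and the a3-ultragpu-8g machine-type string are illustrative assumptions rather than values from the article, and the deployment-strategy choice is made per the linked documentation.

```shell
# Step 1: reserve capacity for an RDMA-capable machine type.
# All names and values below are illustrative assumptions.
gcloud compute reservations create my-roce-reservation \
    --zone=us-central1-a \
    --vm-count=2 \
    --machine-type=a3-ultragpu-8g

# Step 2: choose a deployment strategy as described in the documentation.

# Step 3: create instances that consume the reservation.
gcloud compute instances create my-roce-vm-1 \
    --zone=us-central1-a \
    --machine-type=a3-ultragpu-8g \
    --reservation-affinity=specific \
    --reservation=my-roce-reservation
```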
Read Full Article
