Google Cloud has launched GPU support for its Cloud Run serverless platform, enabling developers to accelerate inference for models deployed as Cloud Run services.
The tutorial explains the steps to deploy the Llama 3.1 large language model (LLM), with 8 billion parameters, on a GPU-backed Cloud Run service.
The guide demonstrates initializing the environment, deploying Hugging Face's Text Generation Inference (TGI) server with the model, and performing inference using cURL or the OpenAI Python library.
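As a rough illustration of the deployment step, a GPU-enabled Cloud Run service running TGI might be created with a gcloud invocation along these lines. This is a minimal sketch, not the tutorial's exact command: the image path, service name, region, resource sizes, and model ID are assumptions, GPU support may require the beta track, and flags can vary by gcloud version.

```shell
# Sketch: deploy a TGI container to Cloud Run with one NVIDIA L4 GPU.
# Assumes the TGI image has been mirrored into your own Artifact Registry,
# since Cloud Run pulls from Artifact Registry rather than ghcr.io directly.
gcloud beta run deploy llama31-tgi \
  --image=us-docker.pkg.dev/<your-project>/tgi/text-generation-inference:latest \
  --region=us-central1 \
  --cpu=8 \
  --memory=32Gi \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --no-cpu-throttling \
  --max-instances=1 \
  --set-env-vars=MODEL_ID=meta-llama/Llama-3.1-8B-Instruct,HF_TOKEN=<your-hf-token>
```

TGI's launcher reads the MODEL_ID environment variable to pick the Hugging Face model, and a gated model such as Llama 3.1 also needs a valid Hugging Face access token; Cloud Run's automatically injected PORT variable tells the server which port to listen on.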
Overall, the tutorial shows how to leverage GPU acceleration for serverless inference on Google Cloud Run.
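As a concrete client-side illustration, TGI exposes an OpenAI-compatible chat completions API, so the deployed service can be queried with the OpenAI Python library. The sketch below assumes an unauthenticated service; the base URL is a placeholder, since Cloud Run assigns the real one at deploy time, and an authenticated service would additionally need a Google identity token.

```python
# Sketch: call the deployed TGI endpoint via the OpenAI Python client.
from openai import OpenAI

client = OpenAI(
    base_url="https://llama31-tgi-<hash>-uc.a.run.app/v1",  # hypothetical Cloud Run URL
    api_key="-",  # TGI itself does not validate the key; Cloud Run handles auth separately
)

response = client.chat.completions.create(
    model="tgi",  # TGI serves a single model, so the name is just a placeholder
    messages=[{"role": "user", "content": "What is serverless inference?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The same endpoint can be exercised with a plain cURL POST to /v1/chat/completions, which is the other inference path the tutorial describes.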