Google Cloud's Vertex AI offers easy access to open models that require powerful infrastructure and deployment capabilities.
The announcement covers deploying and serving the Llama 3.1 405B FP16 LLM on GKE.
Deploying and serving such large models has been made easier with Kubernetes YAML.
The Llama 3.1 405B FP16 LLM requires more than 750 GB of GPU memory, while each A3 virtual machine is equipped with 8 NVIDIA H100 GPUs (80 GB each, 640 GB in total), so the model does not fit on a single VM.
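The memory arithmetic behind this requirement can be sketched roughly as follows. This is a back-of-the-envelope estimate, not a sizing tool: it counts weights only (no KV cache or activations), and the 80 GB per H100 figure is an assumption that is not stated in the text above.

```python
import math

# Figures taken from the text above.
REQUIRED_GPU_MEMORY_GB = 750   # "more than 750 GB" lower bound
GPUS_PER_A3_VM = 8

# Assumption (not from the text): each H100 exposes 80 GB of memory.
H100_MEMORY_GB = 80

# FP16 stores each parameter in 2 bytes.
params = 405e9
weights_gib = params * 2 / 2**30   # weights only, in GiB

memory_per_vm_gb = GPUS_PER_A3_VM * H100_MEMORY_GB
min_nodes = math.ceil(REQUIRED_GPU_MEMORY_GB / memory_per_vm_gb)

print(f"FP16 weights: about {weights_gib:.0f} GiB")
print(f"GPU memory per A3 VM: {memory_per_vm_gb} GB")
print(f"Minimum A3 VMs needed: {min_nodes}")
```

Under these assumptions a single A3 VM falls short, which is why the model has to be sharded across at least two nodes.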
The LeaderWorkerSet (LWS) is a deployment API specifically developed to address the workload requirements of multi-host inference, facilitating the sharding and execution of the model across multiple devices on multiple nodes.
vLLM is a popular open-source model server that supports multi-node, multi-GPU inference by combining tensor parallelism and pipeline parallelism.
For pipeline parallelism across nodes, vLLM manages the distributed runtime with Ray.
LWS supports dual templates, one designated for the leader and the other for the workers.
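To make the dual-template idea concrete, here is a minimal sketch that builds the skeleton of a LeaderWorkerSet manifest as a Python dictionary. The resource name, container names, and image references are hypothetical placeholders, and a real manifest would also need commands, ports, and volumes that are omitted here.

```python
import json


def build_lws_manifest(name, leader_image, worker_image,
                       group_size=2, gpus_per_node=8):
    """Return a skeletal LeaderWorkerSet manifest (illustrative only)."""

    def pod_template(role, image):
        # One container per pod; container name and image are placeholders.
        return {
            "spec": {
                "containers": [{
                    "name": f"vllm-{role}",
                    "image": image,
                    "resources": {
                        "limits": {"nvidia.com/gpu": str(gpus_per_node)},
                    },
                }]
            }
        }

    return {
        "apiVersion": "leaderworkerset.x-k8s.io/v1",
        "kind": "LeaderWorkerSet",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "leaderWorkerTemplate": {
                # Pods per group: one leader plus (size - 1) workers.
                "size": group_size,
                "leaderTemplate": pod_template("leader", leader_image),
                "workerTemplate": pod_template("worker", worker_image),
            },
        },
    }


# Hypothetical image names, used only for illustration.
manifest = build_lws_manifest(
    "vllm",
    "example.com/vllm-leader:latest",
    "example.com/vllm-worker:latest",
)
print(json.dumps(manifest, indent=2))
```

Keeping the leader and worker templates separate lets the leader pod run extra responsibilities (for example, starting the serving endpoint) while workers only join the group.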
The tensor parallel size is set to 8 (one shard per GPU within a node), while the pipeline parallel size is set to 2 (one stage per node), so that the entire Llama 3.1 405B FP16 model fits.
Multi-host deployment and serving is the only viable solution for LLMs like the FP16 Llama 3.1 405B model, which do not fit on a single host.
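The relationship between these two settings and the hardware can be sketched as below. The helper names are hypothetical; the two command-line flags are vLLM's tensor and pipeline parallelism options, and the rest of the server command (model identifier, ports, and so on) is deliberately left out.

```python
def parallel_layout(tensor_parallel_size, pipeline_parallel_size, gpus_per_node):
    """Return (total_gpus, nodes) needed for a given parallelism setting."""
    total_gpus = tensor_parallel_size * pipeline_parallel_size
    nodes, remainder = divmod(total_gpus, gpus_per_node)
    if remainder:
        raise ValueError("total GPU count must be a multiple of gpus_per_node")
    return total_gpus, nodes


def vllm_parallel_args(tensor_parallel_size, pipeline_parallel_size):
    """Build only the parallelism-related arguments for the vLLM server."""
    return [
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--pipeline-parallel-size", str(pipeline_parallel_size),
    ]


total_gpus, nodes = parallel_layout(8, 2, gpus_per_node=8)
args = vllm_parallel_args(8, 2)
print(f"{total_gpus} GPUs across {nodes} nodes")
print(" ".join(args))
```

With a tensor parallel size of 8 and a pipeline parallel size of 2, the model is spread over 16 GPUs, which corresponds to two 8-GPU nodes.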