Large language models (LLMs) and generative AI have become crucial for applications, often consumed as a service via APIs.
Inference-as-a-Service removes a common operational bottleneck by letting applications call ML models through an API with very little infrastructure to manage.
Cloud Run, Google Cloud’s serverless container platform, is well suited to LLM-powered applications because it runs your containers without requiring you to manage the underlying infrastructure.
Using Vertex AI together with Cloud Run with GPUs, developers can host open LLMs themselves and also draw on Model Garden, which offers a broad selection of ML models.
By enabling the Gemini API in Vertex AI, applications deployed as containers on Cloud Run can run inference against Vertex AI seamlessly.
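As a rough illustration, a containerized Python service on Cloud Run might call Gemini through the Vertex AI SDK along these lines; the project ID, region, and model name below are placeholders, not values from the article:

```python
import os

import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholder project/region values; in practice these come from env vars
# you set on the Cloud Run service.
vertexai.init(
    project=os.environ.get("GOOGLE_CLOUD_PROJECT", "my-project"),
    location=os.environ.get("GOOGLE_CLOUD_REGION", "us-central1"),
)

# Any Gemini model available to your project in that region.
model = GenerativeModel("gemini-1.5-flash")


def answer(prompt: str) -> str:
    """Send a prompt to Gemini on Vertex AI and return the generated text."""
    response = model.generate_content(prompt)
    return response.text
```

Because the container runs on Cloud Run with a service account, the SDK picks up credentials automatically and no API keys need to be baked into the image.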
Cloud Run with GPUs adds further flexibility, letting you host LLMs directly on a serverless architecture and balance cost against performance.
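As a minimal sketch of the self-hosted path, assuming the GPU-backed Cloud Run service runs an OpenAI-compatible inference server such as vLLM (an assumption, not something the article specifies), a client could call it like this; the service URL and model name are hypothetical:

```python
import requests

# Hypothetical URL of a GPU-backed Cloud Run service running an
# OpenAI-compatible server; replace with your own service URL.
SERVICE_URL = "https://my-llm-service-xyz-uc.a.run.app"


def generate(prompt: str) -> str:
    """Call the self-hosted open model via its OpenAI-compatible chat endpoint."""
    resp = requests.post(
        f"{SERVICE_URL}/v1/chat/completions",
        json={
            "model": "gemma-2-9b-it",  # whichever open model the service hosts
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```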
LLM responses can be tailored to a specific domain with Retrieval-Augmented Generation (RAG), which uses AlloyDB to store and retrieve the contextual data that customizes each answer.
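A sketch of the retrieval step follows, under a few assumptions not stated in the article: the AlloyDB instance (PostgreSQL-compatible) has the pgvector extension enabled, documents live in a hypothetical `documents` table with `content` and `embedding` columns, and queries are embedded with a Vertex AI text embedding model:

```python
import os

import psycopg2
from vertexai.language_models import TextEmbeddingModel

# Hypothetical AlloyDB connection details; vertexai.init() is assumed to have
# been called already (see the earlier Gemini snippet).
conn = psycopg2.connect(
    host=os.environ.get("ALLOYDB_HOST", "10.0.0.5"),
    dbname="ragdb",
    user="app",
    password=os.environ.get("ALLOYDB_PASSWORD", ""),
)

embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-004")


def retrieve_context(question: str, k: int = 5) -> list[str]:
    """Embed the question and fetch the k most similar chunks from AlloyDB."""
    vector = embedding_model.get_embeddings([question])[0].values
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content
            FROM documents                     -- hypothetical table (content TEXT, embedding VECTOR)
            ORDER BY embedding <=> %s::vector  -- pgvector cosine-distance operator
            LIMIT %s
            """,
            (str(vector), k),
        )
        return [row[0] for row in cur.fetchall()]
```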
In this architecture, Inference-as-a-Service coordinates the interactions between Cloud Run, Vertex AI, and AlloyDB that make up the RAG data flow.
An example chatbot architecture demonstrates how Cloud Run can host a chatbot that runs inference against LLMs in Vertex AI and stores embeddings in AlloyDB.
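Tying the pieces together, a chatbot turn might look roughly like the sketch below. It reuses the hypothetical `documents` table, `embedding_model`, `retrieve_context`, and Gemini `answer` helpers from the earlier snippets, so it is an illustration of the data flow rather than the article's actual implementation:

```python
def index_chunk(text: str) -> None:
    """Embed a document chunk and store it alongside its text in AlloyDB."""
    vector = embedding_model.get_embeddings([text])[0].values
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
            (text, str(vector)),
        )
    conn.commit()


def chat(question: str) -> str:
    """One chatbot turn: pull domain context from AlloyDB, then ask Gemini."""
    context = "\n".join(retrieve_context(question))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return answer(prompt)  # Gemini helper sketched earlier
```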
Get started with building generative AI Python applications on Cloud Run by following the provided codelab.