The LLM Twin free course teaches you how to build an LLM system replica that incorporates your style and personality.
Lesson 10 focuses on introducing prompt monitoring into the LLM Twin production system.
An essential task in prompt monitoring is to track the latency of generating an answer, the total number of input and output tokens, and metrics that capture the relevance and precision of the retrieved context.
When working with RAG systems, it is critical to log their full traces to reveal the entire process from when a user sends a query to when the final response is returned.
To monitor a simple LLM call, we must annotate the function with the @opik.track(name="...") Python decorator.
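Here is a minimal sketch of what that looks like. The @opik.track(name="...") decorator comes from the Opik SDK, as mentioned above; the function body is a placeholder standing in for the real LLM call:

```python
import opik

# Opik logs the decorated function's inputs, outputs, and latency.
# The name argument controls how the call appears in the dashboard.
@opik.track(name="generate-answer")
def generate_answer(query: str) -> str:
    # Placeholder for the real LLM call (e.g., an OpenAI or SageMaker client).
    return f"Echoing the query for illustration: {query}"

generate_answer("What is an LLM Twin?")
```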
To monitor more complex flows, we can aggregate all the monitoring keys into a single trace using the Opik monitoring SDK.
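A sketch of how this can work: calling one tracked function from another nests the inner call as a span under the same trace, and shared keys can be attached to the enclosing trace. This assumes the opik_context.update_current_trace helper from the Opik SDK; the retrieval step is a stand-in for the real vector DB query:

```python
import opik
from opik import opik_context

@opik.track(name="retrieve-context")
def retrieve_context(query: str) -> list[str]:
    # Placeholder retrieval step; in the real pipeline this queries the vector DB.
    return ["chunk-1", "chunk-2"]

@opik.track(name="rag-pipeline")
def rag_pipeline(query: str) -> str:
    context = retrieve_context(query)  # logged as a nested span of the same trace
    # Attach shared monitoring keys (tags, metadata) to the enclosing trace.
    opik_context.update_current_trace(
        tags=["rag", "production"],
        metadata={"retrieved_chunks": len(context)},
    )
    return f"Answer based on {len(context)} chunks."
```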
The next step is to evaluate the samples collected from production against metrics such as hallucination, toxicity, and response time.
The production data is collected in real time from all the requests made by the clients.
The evaluation pipeline can run either in offline batch mode or in real time, evaluating each sample independently.
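As a sketch of scoring a single production sample, Opik ships LLM-as-a-judge metric classes such as Hallucination and Moderation; the score signatures below follow the Opik SDK at the time of writing and may differ across versions, and the sample data is illustrative:

```python
from opik.evaluation.metrics import Hallucination, Moderation

# LLM-as-a-judge metrics; they require a configured judge model behind the scenes.
hallucination = Hallucination()
moderation = Moderation()

sample = {
    "input": "What is an LLM Twin?",
    "output": "An LLM Twin is an AI replica of your writing style.",
    "context": ["The LLM Twin mimics a person's style and personality."],
}

hallucination_score = hallucination.score(
    input=sample["input"], output=sample["output"], context=sample["context"]
)
moderation_score = moderation.score(output=sample["output"])
print(hallucination_score.value, moderation_score.value)
```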
Finally, we should hook the evaluation pipeline to an alerting system that notifies us when the application has moderation, hallucination, or other business issues.
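A minimal, fully hypothetical sketch of such a hook: the threshold value and the send_alert helper are illustrative, not part of the Opik SDK, and would be wired to whatever paging or messaging integration the team uses:

```python
# Hypothetical alerting hook: names and thresholds are illustrative.
HALLUCINATION_THRESHOLD = 0.5

def send_alert(message: str) -> None:
    # Replace with a PagerDuty, Slack webhook, or email integration.
    print(f"[ALERT] {message}")

def check_sample(hallucination_score: float) -> None:
    # Fire an alert whenever a scored production sample crosses the threshold.
    if hallucination_score > HALLUCINATION_THRESHOLD:
        send_alert(
            f"Hallucination score {hallucination_score:.2f} exceeded "
            f"threshold {HALLUCINATION_THRESHOLD}"
        )
```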