The conversational AI agent is built as a distributed service composed of a transcription service, a text-to-speech engine, an LLM server, and a gRPC server-and-client architecture that ties all the services together.
This project deliberately omits a web interface, on the assumption that natural conversational interaction is the future of interactivity.
The automatic speech recognition client uses the RealtimeSTT open-source library, which integrates a Faster-Whisper model configured with English as the primary language.
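As a minimal sketch of such a client, assuming the RealtimeSTT library's `AudioToTextRecorder` API (the model name and configuration shown here are illustrative assumptions, not the project's exact settings):

```python
# Sketch of an ASR client; configuration values below are assumptions.
def make_recorder_config(model: str = "base.en", language: str = "en") -> dict:
    """Build keyword arguments for AudioToTextRecorder (hypothetical defaults)."""
    return {"model": model, "language": language}

if __name__ == "__main__":
    # Requires `pip install RealtimeSTT` and a working microphone.
    from RealtimeSTT import AudioToTextRecorder

    recorder = AudioToTextRecorder(**make_recorder_config())
    print("Speak now...")
    while True:
        # text() blocks until a complete utterance has been transcribed.
        print("You said:", recorder.text())
```

The transcribed text would then be forwarded over gRPC to the main application.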
The main application combines a gRPC server with the LLM client calls to the Ollama server and with the text-to-speech functionality.
An acknowledgment handshake was implemented to work around the problem of the user's speech being captured while the agent chatbot is still replying (i.e., talking over the agent).
The latency of the above pipeline ranges from 1 to 10 seconds, which does not meet real-time requirements.
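To see where that end-to-end latency goes, each pipeline stage can be timed separately. The sketch below uses a small context manager around placeholder stage bodies (the `time.sleep` calls stand in for the real ASR, LLM, and TTS services, whose names here are assumptions):

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(name: str, timings: dict):
    """Record the wall-clock duration of one pipeline stage into `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


# Placeholder stage bodies; replace the sleeps with the real service calls.
timings: dict = {}
with stage_timer("asr", timings):
    time.sleep(0.01)  # stand-in for speech transcription
with stage_timer("llm", timings):
    time.sleep(0.02)  # stand-in for the Ollama round trip
with stage_timer("tts", timings):
    time.sleep(0.01)  # stand-in for speech synthesis

total = sum(timings.values())
```

A per-stage breakdown like this makes it easy to tell whether the LLM round trip, the transcription, or the synthesis dominates the 1-10 second total.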
As of early 2025, only a limited subset of current open-source models includes multi-modal capabilities.
This article showed how to build a complete conversational AI agent that runs fully locally, with no dependence on cloud services.
The main idea was to evaluate the feasibility of using current open-source models to implement fully local conversational interfaces that can actuate IoT devices.
The conversational chatbot was tested on an NVIDIA Jetson AGX Orin, resulting in almost real-time conversations.