The University of Manchester's Beehive Lab has released GPULlama3.java, the first Java-native implementation of Llama3 with automatic GPU acceleration.
GPULlama3.java leverages TornadoVM to enable GPU-accelerated large language model inference in Java without requiring developers to write CUDA or native code.
TornadoVM, at the core of GPULlama3.java, extends OpenJDK and GraalVM to automatically accelerate Java programs on GPUs, FPGAs, and multi-core CPUs.
TornadoVM works by extending the Graal JIT compiler with specialized backends that translate Java bytecode into GPU-compatible code at runtime for the methods marked for acceleration.
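In practice, developers express offloadable work through TornadoVM's task-graph API. The following vector-addition sketch is modeled on TornadoVM's documented programming model; it is illustrative rather than code from GPULlama3.java, and exact types and signatures vary across TornadoVM releases:

```java
import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;

public class VectorAdd {

    // Plain Java; @Parallel marks the loop as data-parallel so TornadoVM
    // can map its iterations onto GPU threads when it compiles the method.
    public static void add(float[] a, float[] b, float[] c) {
        for (@Parallel int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = new float[1024];
        float[] b = new float[1024];
        float[] c = new float[1024];
        java.util.Arrays.fill(a, 1.0f);
        java.util.Arrays.fill(b, 2.0f);

        // Declare the data movement and the task to offload...
        TaskGraph taskGraph = new TaskGraph("s0")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, a, b)
                .task("t0", VectorAdd::add, a, b, c)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, c);

        // ...then execute; the bytecode is translated at runtime for the
        // selected backend (OpenCL, PTX, or SPIR-V).
        ImmutableTaskGraph snapshot = taskGraph.snapshot();
        TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(snapshot);
        executionPlan.execute();
    }
}
```

Because the kernel remains ordinary Java, the same method can still run unmodified on the JVM when no accelerator is present.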
The project runs on NVIDIA GPUs, Intel GPUs, and Apple Silicon through TornadoVM's OpenCL, PTX, and SPIR-V backends.
GPULlama3.java leverages modern Java features such as the Vector API and the Foreign Function & Memory API, and it supports the GGUF format and quantized models for deployment.
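To illustrate the CPU side, the matrix-vector multiplications at the heart of inference can be written as explicit SIMD code with the incubating Vector API. The dot-product sketch below is illustrative rather than taken from the project, and requires launching the JVM with --add-modules jdk.incubator.vector:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class DotProduct {
    // Lane count chosen to match the host CPU's widest supported vectors.
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    public static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        int upperBound = SPECIES.loopBound(a.length);
        // Process SPECIES.length() floats per iteration using SIMD lanes.
        for (; i < upperBound; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        // Scalar tail for the leftover elements.
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```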
The project builds upon Mukel's original Llama3.java, integrating GPU acceleration capabilities through TornadoVM.
GPULlama3.java is part of the expanding Java ecosystem for AI/ML, allowing developers to build LLM-powered applications without leaving the Java platform.
TornadoVM aims to make heterogeneous computing accessible to Java developers and has been evolving since 2013 with new backend support and optimizations.
GPULlama3.java is currently in beta, with work focused on performance optimization and benchmark collection, particularly around Apple Silicon support.
The project marks a significant advance in bringing GPU-accelerated LLM inference to Java, showcasing the potential of Java-based AI applications in enterprise settings.
Developers interested in exploring GPU-accelerated LLM inference in Java can access the open-source project on GitHub with comprehensive documentation and examples.