A new technical paper titled “Scaling On-Device GPU Inference for Large Generative Models” was published by researchers at Google and Meta Platforms.
The paper introduces ML Drift, an optimized inference framework that enables on-device execution of generative AI workloads with significantly more parameters than existing on-device models.
ML Drift addresses the engineering challenges of developing against multiple GPU APIs, ensuring compatibility across mobile and desktop/laptop platforms and allowing complex models to be deployed on resource-constrained devices.
The researchers' GPU-accelerated ML/AI inference engine delivers an order-of-magnitude performance improvement over existing open-source GPU inference engines.