Analog in-memory computing (AIMC) is a promising compute paradigm that aims to deliver faster, more power-efficient neural network inference than traditional von Neumann-based architectures.
Challenges like noisy computations and strict input/output quantization constraints hinder the performance of off-the-shelf Large Language Models (LLMs) when deployed on AIMC-based hardware.
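To make these constraints concrete, the sketch below simulates a single analog matrix-vector multiply with additive weight noise and fixed-range 8-bit input/output quantization. The noise model, bit widths, and ranges are illustrative assumptions, not a specification of any particular AIMC device.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, n_bits=8, max_val=1.0):
    """Uniform symmetric quantization to n_bits over the range [-max_val, max_val]."""
    levels = 2 ** (n_bits - 1) - 1
    x = np.clip(x, -max_val, max_val)
    return np.round(x / max_val * levels) / levels * max_val

def analog_matvec(W, x, weight_noise_std=0.02):
    """Simulate one analog matrix-vector multiply with weight noise and 8-bit I/O."""
    x_q = quantize(x)                                         # input converter: 8-bit, fixed range
    W_noisy = W + rng.normal(0.0, weight_noise_std, W.shape)  # programming/read noise on the weights
    y = W_noisy @ x_q                                         # analog accumulation
    out_range = np.abs(W).sum(axis=1).max()                   # crude bound on the output magnitude
    return quantize(y, max_val=out_range)                     # output converter: 8-bit

W = rng.standard_normal((16, 64)) * 0.1
x = rng.standard_normal(64) * 0.5
print("deviation from ideal:", np.linalg.norm(analog_matvec(W, x) - W @ x))
```

An off-the-shelf model sees only the ideal product `W @ x` during training, so the combined effect of the noise and the converters is what degrades its accuracy at deployment time.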
A new method has been introduced to adapt LLMs for execution on noisy, low-precision analog hardware, allowing advanced models to retain performance comparable to that of models quantized to 4-bit weights and 8-bit activations, despite the hardware's noise and quantization constraints.
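The published method is not reproduced here, but a common ingredient of this kind of adaptation is hardware-aware fine-tuning, in which the forward pass injects weight noise and fake-quantizes activations while a straight-through estimator keeps gradients flowing to the full-precision weights. The PyTorch sketch below illustrates that generic recipe under assumed noise levels and bit widths; it is not the specific technique of the work summarized above.

```python
import torch
import torch.nn as nn

def ste_quantize(x, n_bits=8):
    """Fake-quantize x to n_bits; straight-through estimator for the backward pass."""
    levels = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / levels
    x_q = torch.round(x / scale) * scale
    return x + (x_q - x).detach()          # forward: quantized values, backward: identity

class NoisyQuantLinear(nn.Linear):
    """Linear layer whose forward pass mimics a noisy analog tile with 8-bit I/O."""
    def __init__(self, in_features, out_features, noise_std=0.02):
        super().__init__(in_features, out_features)
        self.noise_std = noise_std

    def forward(self, x):
        x_q = ste_quantize(x)                               # 8-bit input
        w = self.weight
        if self.training:                                   # inject noise during training only
            w = w + torch.randn_like(w) * self.noise_std * w.abs().max()
        y = nn.functional.linear(x_q, w, self.bias)
        return ste_quantize(y)                              # 8-bit output

# Usage: replace the nn.Linear layers of a pretrained model with this module
# (copying their weights), then fine-tune so the model learns to tolerate the
# noise and quantization it will face on the target hardware.
layer = NoisyQuantLinear(64, 16)
out = layer(torch.randn(4, 64))
print(out.shape)   # torch.Size([4, 16])
```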
Models developed through this approach can also be quantized for inference on low-precision digital hardware, and they show better scaling behavior than models trained with 4-bit weight and 8-bit static input quantization.