Google DeepMind, in collaboration with KAIST AI, proposes a method called Relaxed Recursive Transformers (RRTs) that reduces the cost, compute, and memory required to run an LLM.
RRTs allow LLMs to be converted into models that behave like small language models (SLMs) yet outperform many of the standard SLMs available today.
Layer tying, an RRT technique, passes the input through a small block of shared layers multiple times (recursively), cutting memory requirements and significantly reducing compute.
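A minimal sketch of the layer-tying idea, written in PyTorch with hypothetical module and parameter names (not DeepMind's implementation): a single block of shared layers is applied several times in a loop, so the parameters stored in memory are only a fraction of the effective depth the input sees.

```python
# Layer tying sketch: one shared block of layers is reused recursively, so a
# model that "looks" shared_layers * num_loops layers deep only stores
# shared_layers unique layers.
import torch
import torch.nn as nn

class RecursiveTransformer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, shared_layers=3, num_loops=4):
        super().__init__()
        # Only `shared_layers` unique transformer layers are kept in memory.
        self.shared_block = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(shared_layers)
        ])
        self.num_loops = num_loops  # effective depth = shared_layers * num_loops

    def forward(self, x):
        # The same tied weights process the hidden states on every loop.
        for _ in range(self.num_loops):
            for layer in self.shared_block:
                x = layer(x)
        return x

model = RecursiveTransformer()
hidden = model(torch.randn(2, 16, 512))  # batch of 2, sequence length 16
```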
RRTs then "relax" the tying with low-rank adaptation: each recursion depth adds a small low-rank update to the shared weights, so every pass over the tied layers can behave slightly differently when processing the input.
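The relaxation can be pictured as a per-depth LoRA update on a shared projection. The sketch below is simplified to a single linear layer; the class name, rank, and initialisation are illustrative assumptions rather than the paper's exact configuration.

```python
# Depth-wise LoRA sketch: the full-rank weight is shared across recursion
# depths, and each depth owns only a tiny pair of low-rank factors.
import torch
import torch.nn as nn

class RelaxedSharedLinear(nn.Module):
    def __init__(self, d_model=512, rank=8, num_depths=4):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)  # shared full-rank weight
        # One pair of low-rank factors per recursion depth (the "relaxation").
        self.lora_down = nn.Parameter(torch.randn(num_depths, d_model, rank) * 0.01)
        self.lora_up = nn.Parameter(torch.zeros(num_depths, rank, d_model))

    def forward(self, x, depth):
        # Shared projection plus a depth-specific low-rank correction (zero at
        # initialisation), so each pass over the tied weights can drift slightly.
        return self.base(x) + (x @ self.lora_down[depth]) @ self.lora_up[depth]

layer = RelaxedSharedLinear()
out = layer(torch.randn(2, 16, 512), depth=0)  # same base weights, depth-specific tweak
```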
The resulting recursive models deliver substantial accuracy improvements over standard SLMs of comparable size.
Because the low-rank matrices are small and the core weights are shared, inference throughput increases, which translates into substantial energy savings.
An RRT uptrained on just 60 billion tokens achieved performance parity with full-size models trained on 3 trillion tokens.
RRTs may contribute to meaningful energy savings by making smaller models more capable without significantly increasing their footprint.
Quantisation and Layer Skip are other approaches that have been explored to scale down LLMs without compromising performance; RRTs differ in that they rely on parameter sharing, and their looped structure also enables real-time verification of draft tokens during generation, as sketched below.
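As an illustration of draft-then-verify generation with a recursive model, here is a simplified greedy sketch. The `model(ids, num_loops=...)` interface, the `num_loops` attribute, and the shape of the returned logits are assumptions made for exposition; this is not the paper's actual batching or verification algorithm.

```python
# Greedy draft-then-verify sketch (hypothetical model interface): a shallow
# early-exit pass drafts tokens cheaply, and one full-depth pass over the same
# shared weights verifies them.
import torch

def self_speculative_generate(model, prompt_ids, max_new=32, draft_len=4):
    # Assumes `model(ids, num_loops=k)` returns logits of shape (batch, seq, vocab)
    # and that `model.num_loops` is the full recursion depth.
    ids = list(prompt_ids)
    target_len = len(prompt_ids) + max_new
    while len(ids) < target_len:
        # 1) Draft a few tokens with a cheap single-loop (early-exit) pass.
        draft, ctx = [], list(ids)
        for _ in range(draft_len):
            logits = model(torch.tensor([ctx]), num_loops=1)[0]
            tok = int(torch.argmax(logits[-1]))
            draft.append(tok)
            ctx.append(tok)
        # 2) Verify all drafted positions with one full-depth pass.
        base = len(ids)
        full_logits = model(torch.tensor([ids + draft]), num_loops=model.num_loops)[0]
        for i, tok in enumerate(draft):
            verified = int(torch.argmax(full_logits[base - 1 + i]))
            if verified != tok:
                ids.append(verified)  # take the verifier's token and re-draft
                break
            ids.append(tok)           # accepted: draft and verifier agree
    return ids[:target_len]
```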
Further research is needed to determine the uptraining cost associated with scaling to larger models before RRTs are deployed in real-world applications.