Recent advances in building highly efficient large language models (LLMs) with only three weight values, using the BitNet b1.58 architecture, raise a question: would it be possible to produce small models directly, without first training large models that cost millions of dollars? Training these smaller models directly would not be easy, however, because gradient descent cannot operate effectively on their discrete, ternary weights.
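For concreteness, here is a minimal sketch of the absmean quantization that BitNet b1.58 uses to map full-precision weights onto the three values {-1, 0, +1}; the function name and the small epsilon are illustrative choices, not taken from the paper:

```python
import numpy as np

def absmean_ternary(W: np.ndarray) -> np.ndarray:
    # Scale by the mean absolute weight, then round every entry
    # to the nearest value in {-1, 0, +1}.
    gamma = np.abs(W).mean() + 1e-8  # epsilon guards against an all-zero W
    return np.clip(np.round(W / gamma), -1.0, 1.0)

W = np.random.randn(4, 4)
print(absmean_ternary(W))  # every entry is -1.0, 0.0, or 1.0
```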
Training LLMs requires massive GPU clusters, and even the smaller models derived from these LLMs demand significant computational resources for distillation and quantization, thereby widening the gap between training and inference networks.
Gradient-free methods such as evolutionary algorithms and random search may seem less efficient than gradient descent, but they are advantageous wherever derivatives cannot be computed, as with 1.58-bit neural networks, whose discrete weights rule out direct gradient-based training.
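As a toy illustration of the idea (all names here are hypothetical, and loss_fn stands in for any black-box evaluation of the network), a (1+1) evolution strategy that searches directly over ternary weights without computing a single derivative:

```python
import numpy as np

rng = np.random.default_rng(0)

def mutate(W: np.ndarray, rate: float = 0.01) -> np.ndarray:
    # Reassign a small random fraction of weights to a value in {-1, 0, +1}.
    mask = rng.random(W.shape) < rate
    candidate = W.copy()
    candidate[mask] = rng.integers(-1, 2, size=int(mask.sum()))
    return candidate

def one_plus_one_es(loss_fn, W: np.ndarray, steps: int = 10_000):
    # (1+1) evolution strategy: keep a mutant only if its loss improves.
    best = loss_fn(W)
    for _ in range(steps):
        cand = mutate(W)
        score = loss_fn(cand)
        if score < best:
            W, best = cand, score
    return W, best

# Sanity check on a synthetic objective: recover a hidden ternary matrix.
target = rng.integers(-1, 2, size=(8, 8)).astype(float)
W, loss = one_plus_one_es(lambda w: np.abs(w - target).sum(), np.zeros((8, 8)))
```

Whether such a search scales to billions of parameters is exactly the open question; the point is only that it never needs a gradient.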
The use of gradient-free training could also justify repurposing transistors to build ASICs designed specifically to run 1.58-bit networks, making evaluation faster, more scalable, and more energy-efficient.
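To see why such hardware could be so simple, consider a sketch (in software, purely for illustration) of the inner loop an ASIC would implement: with weights restricted to {-1, 0, +1}, a matrix-vector product needs no multipliers at all, only adders:

```python
def ternary_matvec(W, x):
    # Each product term reduces to an addition, a subtraction, or a skip,
    # since the weight can only be +1, -1, or 0.
    out = [0.0] * len(W)
    for i, row in enumerate(W):
        for w, xj in zip(row, x):
            if w == 1:
                out[i] += xj
            elif w == -1:
                out[i] -= xj
    return out

print(ternary_matvec([[1, 0, -1], [0, 1, 1]], [2.0, 3.0, 5.0]))  # [-3.0, 8.0]
```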
Decentralizing gradient-free training would also democratize AI: any device capable of running the network could participate in the training process. A decentralized system similar to Bitcoin could be used to 'mine' neural networks, where an ASIC would evaluate the network very quickly, and participants who succeed in finding effective parameters would earn a reward.
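Purely as speculation, and with every name hypothetical, each participant's 'mining' loop might look like this:

```python
def mine(sample_candidate, evaluate, best_score, submit):
    # Speculative sketch: search for parameters that beat the network's
    # current published score on a shared benchmark, then submit the
    # winner for verification and a reward. All four callables are
    # placeholders for pieces the protocol would have to define.
    while True:
        params = sample_candidate()   # e.g. mutate the current best weights
        score = evaluate(params)      # fast 1.58-bit forward passes: the ASIC's job
        if score > best_score():      # better parameters found
            submit(params, score)     # broadcast, analogous to finding a block
```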
Though it is unclear whether gradient-free methods can be made effective at this scale, the potential rewards are significant, and the field deserves deeper research. It could lead to new and better LLMs, or even gradient-free fine-tuning, and the prospect of decentralized training is compelling.
Further discussion on this topic is ongoing in the Reddit thread linked in the article, and more opinions and comments are welcome.