Forward-mode automatic differentiation (FmAD) and zero-order (ZO) optimization have been proposed as memory-efficient alternatives to backpropagation (BP) for gradient computation.
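To make the contrast concrete, here is a minimal sketch (not from the study) of how the three approaches obtain a gradient signal for the same scalar loss, written in JAX; the loss, shapes, and variable names are illustrative assumptions. BP returns the exact gradient in one reverse pass, while FmAD and ZO each build a noisy single-sample estimate from one random direction.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Simple quadratic loss on a linear model; stands in for any scalar objective.
    return jnp.mean((x @ w - y) ** 2)

w = jax.random.normal(jax.random.PRNGKey(0), (8,))
x = jax.random.normal(jax.random.PRNGKey(1), (32, 8))
y = jax.random.normal(jax.random.PRNGKey(2), (32,))

# Backpropagation (reverse mode): exact gradient in a single backward pass.
g_bp = jax.grad(loss)(w, x, y)

# FmAD: a Jacobian-vector product gives the directional derivative along a
# random tangent v; d * v is an unbiased single-sample gradient estimate.
v = jax.random.normal(jax.random.PRNGKey(3), w.shape)
_, d = jax.jvp(lambda w_: loss(w_, x, y), (w,), (v,))
g_fmad = d * v

# ZO: two forward evaluations approximate the same directional derivative by
# finite differences, using no derivative computation at all.
eps = 1e-3
d_zo = (loss(w + eps * v, x, y) - loss(w - eps * v, x, y)) / (2 * eps)
g_zo = d_zo * v
```

Because FmAD and ZO recover only one directional derivative per evaluation, many perturbations (or many steps) are needed to match the information BP extracts in a single pass, which is the source of the convergence and computation gap discussed below.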
A new study presents a systematic comparison of BP, FmAD, and ZO methods, combining theoretical analysis with empirical evaluation.
Theoretical analysis suggests that FmAD and ZO reduce memory usage, but at the cost of lower accuracy, slower convergence, and greater total computation compared to BP with checkpointing.
Empirical experiments on large models show that BP with checkpointing outperforms FmAD and ZO variants, indicating that it remains the most effective strategy for training in memory-constrained settings.
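For reference, activation checkpointing itself is straightforward to apply; the sketch below, a hypothetical stack of layers in JAX (not the study's code), shows the idea. Wrapping each block with `jax.checkpoint` discards its activations after the forward pass and recomputes them during the backward pass, so BP's memory footprint shrinks at the price of extra forward compute.

```python
import jax
import jax.numpy as jnp

def block(h, w):
    # One illustrative layer; its activations would normally be stored for BP.
    return jnp.tanh(h @ w)

def model_loss(params, x):
    h = x
    for w in params:
        # Recompute this block's activations in the backward pass instead of storing them.
        h = jax.checkpoint(block)(h, w)
    return jnp.mean(h ** 2)

params = [jax.random.normal(jax.random.PRNGKey(i), (16, 16)) for i in range(4)]
x = jax.random.normal(jax.random.PRNGKey(99), (8, 16))

# Exact gradients via BP, with per-block activation memory traded for recomputation.
grads = jax.grad(model_loss)(params, x)
```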