Automatic Differentiation (AD) is a technique for computing exact derivatives of computer programs. Reverse mode AD scales with the number of function outputs rather than inputs, which is why it is commonly used in machine learning, where a scalar loss depends on many parameters.
AD treats a computation as a nested sequence of function compositions and then calculates the derivative of the outputs w.r.t. the inputs by repeatedly applying the chain rule. There are two modes of AD: forward mode and reverse mode.
Linear chain graphs, where a single input flows through a sequence of function compositions to a single output, are the simplest case to differentiate: the chain rule reduces to multiplying local derivatives along the chain. Reverse mode AD is a generalization of the backpropagation technique used to train neural networks.
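To make the linear chain case concrete, here is a minimal sketch in plain Python. The composed functions f(x) = x², g(a) = sin(a), and h(b) = 3b are illustrative choices, not taken from the article: the forward pass stores intermediate values, and the reverse pass multiplies local derivatives from the output back to the input.

```python
import math

# Linear chain: y = h(g(f(x))) with a single input and a single output.
def forward_and_backward(x):
    # Forward pass: compute and remember the intermediate values.
    a = x * x          # f(x) = x^2,     df/dx = 2x
    b = math.sin(a)    # g(a) = sin(a),  dg/da = cos(a)
    y = 3.0 * b        # h(b) = 3b,      dh/db = 3

    # Reverse pass: start from dy/dy = 1 and multiply local
    # derivatives back along the chain (the chain rule).
    dy_db = 3.0
    dy_da = dy_db * math.cos(a)
    dy_dx = dy_da * 2.0 * x
    return y, dy_dx

y, dy_dx = forward_and_backward(1.5)
print(y, dy_dx)   # analytically, dy/dx = 6x * cos(x^2)
```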
General DAGs are more complex, with nodes that fan out and fan in, but they can still be differentiated with the multivariate chain rule, which sums contributions over every path between two nodes. Expensive full Jacobians are not required in reverse mode AD: each operation only needs to supply a function that takes a row vector and returns the vector-Jacobian product (VJP).
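As an illustration of working with VJPs instead of full Jacobians, the sketch below uses JAX's `jax.vjp`; the function `f` and the input values are made up for the example.

```python
import jax
import jax.numpy as jnp

# f maps R^3 -> R^2; its full Jacobian would be a 2x3 matrix.
def f(x):
    return jnp.array([x[0] * x[1], x[1] * jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])

# jax.vjp returns the primal output and a function that maps a row
# vector (cotangent) of shape (2,) to the vector-Jacobian product of
# shape (3,) -- the Jacobian itself is never materialized.
y, vjp_fn = jax.vjp(f, x)
(row_times_J,) = vjp_fn(jnp.array([1.0, 0.0]))  # first row of the Jacobian
print(y)
print(row_times_J)
```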
Reverse mode AD and VJPs are valuable because the full Jacobians of real programs can be enormous and often sparse; computing vector-Jacobian products directly avoids ever materializing them, which saves both time and memory.
The Var class uses operator overloading and custom functions to construct the computation graph in the background. The grad method then runs reverse mode AD over that graph.
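A minimal sketch of what such a class could look like is shown below. It is not the article's exact implementation: here each node stores an `adjoint` attribute and a list of (parent, local derivative) pairs, and the `grad` method propagates adjoints through the graph in reverse topological order.

```python
import math

class Var:
    """Minimal scalar autodiff node; a sketch, not the article's exact class."""
    def __init__(self, value, parents=()):
        self.value = value
        self.adjoint = 0.0
        self.parents = parents  # list of (parent Var, local derivative)

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def sin(self):
        return Var(math.sin(self.value), [(self, math.cos(self.value))])

    def grad(self):
        # Reverse mode AD: topologically sort the recorded graph, then
        # push adjoints from the output back toward the inputs.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for parent, _ in v.parents:
                    visit(parent)
                order.append(v)
        visit(self)
        self.adjoint = 1.0
        for v in reversed(order):
            for parent, local in v.parents:
                parent.adjoint += v.adjoint * local

x = Var(2.0)
y = Var(3.0)
z = x * y + x.sin()
z.grad()
print(x.adjoint, y.adjoint)  # dz/dx = y + cos(x), dz/dy = x
```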
Reverse mode AD is commonly used in machine learning because a typical model produces a scalar loss from many parameters, and a single reverse pass yields the gradient with respect to all of them.
Reverse mode AD and VJPs are the core building blocks for implementing machine learning algorithms and neural networks.
Professional AD implementations like Autograd and JAX offer better ergonomics and performance than a hand-rolled Var class, making them the right choice for large-scale projects.
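For comparison, here is roughly what the same workflow looks like in JAX; the toy least-squares loss below is only an illustration. `jax.grad` builds the reverse mode gradient function, and `jax.jit` compiles it.

```python
import jax
import jax.numpy as jnp

# A tiny least-squares loss over parameters w (illustrative example).
def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

# Gradient w.r.t. the first argument (w), compiled with XLA.
grad_loss = jax.jit(jax.grad(loss))

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 3))
w = jnp.zeros(3)
y = jnp.ones(8)
print(grad_loss(w, x, y))  # shape (3,): one reverse pass for all parameters
```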