Automatic differentiation
July 27, 2016 — November 15, 2023
Getting your computer to tell you the gradient of a function, without resorting to finite difference approximation or coding an analytic derivative by hand. We usually mean this in the sense of automatic forward or reverse mode differentiation, which is not, as such, a symbolic technique, but symbolic differentiation gets an incidental look-in, and these ideas do of course relate.
Infinitesimal/Taylor series formulations, the related dual number formulations, and even fancier hyperdual formulations. Reverse-mode, a.k.a. Backpropagation, versus forward-mode etc. Computational complexity of all the above.
There are many ways to do automatic differentiation, and I won’t attempt to comprehensively introduce the various approaches. This is a well-ploughed field; there is much good material out there already, with fancy diagrams and the like. Symbolic, numeric, dual/forward, backwards mode… Notably, you don’t have to choose between them: you can, for example, use forward differentiation to calculate an expedient step in the middle of a backward-mode pass.
You might want to do this for ODE quadrature, for sensitivity analysis, for optimisation (batch or SGD, especially in neural networks), for matrix factorisations, for variational approximation, etc. This is not news these days, but it took a stunningly long time to become common after its inception in the… 1970s? See, e.g. Justin Domke, who claimed automatic differentiation to be the most criminally underused tool in the machine learning toolbox. (That escalated quickly.) See also a timely update by Tim Vieira.
There is a beautiful explanation of reverse-mode basics by Sanjeev Arora and Tengyu Ma. See also Mike Innes’ hands-on introduction, or his terse, opinionated introductory paper, Innes (2018), or Jingnan Shi’s excellent Automatic Differentiation: Forward and Reverse. There is a well-established terminology for sensitivity analysis discussing adjoints, e.g. Steven Johnson’s class notes, and his references (Johnson 2012; Errico 1997; Cao et al. 2003).
1 Terminology zoo
Too many words mean the same thing, or broad terms get used in quirky ways; some disambiguation is needed. The sketch after the list maps a few of these onto jax names.
- pushforward/pullback
- vjp/vhp
- sensitivity
- adjoint
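As promised, a rough Rosetta stone, sketched in jax (other frameworks use other words): the pushforward is the Jacobian-vector product, computed in forward mode; the pullback is the vector-Jacobian product, computed in reverse mode.

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sin(x) ** 2
x, v = jnp.arange(3.0), jnp.ones(3)

# pushforward / jvp: forward mode, maps input tangents to output tangents
y, tangents_out = jax.jvp(f, (x,), (v,))

# pullback / vjp: reverse mode, maps output cotangents back to input cotangents
y, pullback = jax.vjp(f, x)
cotangents_in = pullback(v)

print(tangents_out, cotangents_in)
```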
2 Who invented backpropagation?
There is an adorable cottage industry in arguing about who first applied reverse-mode autodiff to networks. See, e.g. Schmidhuber’s blog post, Griewank (2012) and Schmidhuber (2015), a reddit thread and so on.
3 Computational complexity
🏗
4 Forward- versus reverse-mode
🏗
TaylorSeries.jl is an implementation of high-order automatic differentiation, as presented in the book by W. Tucker (Tucker 2011). The general idea is the following.
The Taylor series expansion of an analytical function \(f(t)\) with one independent variable \(t\) around \(t_0\) can be written as
\[ f(t) = f_0 + f_1 (t-t_0) + f_2 (t-t_0)^2 + \cdots + f_k (t-t_0)^k + \cdots, \] where \(f_0=f(t_0)\), and the Taylor coefficients \(f_k = f_k(t_0)\) are the \(k\)th normalized derivatives at \(t_0\):
\[ f_k = \frac{1}{k!} \frac{{\rm d}^k f} {{\rm d} t^k}(t_0). \]
Thus, computing the high-order derivatives of \(f(t)\) is equivalent to computing its Taylor expansion. … Arithmetic operations involving Taylor series can be expressed as operations on the coefficients.
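A minimal sketch of that last point in plain Python (not TaylorSeries.jl itself; the class name `Jet` is my own): truncated Taylor coefficients propagate through arithmetic by simple convolution rules, which is essentially all that high-order forward mode is.

```python
# A "jet" stores the normalized coefficients [f_0, f_1, ..., f_K] of f around t_0.
class Jet:
    def __init__(self, coeffs):
        self.c = list(coeffs)

    def __add__(self, other):
        return Jet(a + b for a, b in zip(self.c, other.c))

    def __mul__(self, other):
        # Cauchy product: (fg)_k = sum_{i=0}^{k} f_i g_{k-i}
        K = len(self.c)
        return Jet(sum(self.c[i] * other.c[k - i] for i in range(k + 1))
                   for k in range(K))

# The independent variable t around t_0 = 2.0, truncated at order 3:
t = Jet([2.0, 1.0, 0.0, 0.0])
p = t * t + t      # p(t) = t^2 + t
print(p.c)         # [6.0, 5.0, 1.0, 0.0] -> p(2)=6, p'(2)=5, p''(2)/2!=1
```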
5 Symbolic differentiation
If you have already calculated the symbolic derivative, you can of course use this as a kind of automatic derivative. It might even be faster.
Calculating symbolic derivatives can itself be automated. Symbolic math packages such as Sympy, MAPLE and Mathematica can all do actual symbolic differentiation, which is different again, but sometimes leads to the same thing. I haven’t tried Sympy or MAPLE, but Mathematica’s support for matrix calculus is weak, and since I usually need matrix derivatives, this particular task has not been automated for me.
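For plain scalar expressions, though, the workflow is simple enough. A toy sketch using Sympy’s `diff` and `lambdify` (a minimal example, nothing matrix-valued):

```python
import sympy as sp

x = sp.symbols('x')
expr = sp.sin(x) ** 2
d_expr = sp.diff(expr, x)                # symbolic derivative: 2*sin(x)*cos(x)
d_fn = sp.lambdify(x, d_expr, 'numpy')   # compile it to a fast numeric function
print(d_expr, d_fn(1.0))
```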
6 In implicit targets
Long story. For use in, e.g. Implicit NN.
A beautiful explanation can be found in Blondel et al. (2021).
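The core trick, roughly: if the quantity of interest \(x^\star(\theta)\) is defined implicitly as the solution of \(F(x^\star(\theta), \theta) = 0\) (a fixed point, a root, an optimality condition…), then differentiating that identity gives
\[ \partial_x F \,\frac{\partial x^\star}{\partial \theta} + \partial_\theta F = 0 \quad\Longrightarrow\quad \frac{\partial x^\star}{\partial \theta} = -\left(\partial_x F\right)^{-1} \partial_\theta F, \]
so we do not differentiate through the solver iterations at all; we solve one extra linear system at the solution.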
To do: investigate Benoît Pasquier’s (Pasquier and Primeau 2019) F-1 Method.
This package implements the F-1 algorithm […] It allows for efficient quasi-auto-differentiation of an objective function defined implicitly by the solution of a steady-state problem.
7 In ODEs
See learning ODEs, and differentiable PDE solvers.
8 Method of adjoints
9 Hessians in neural nets
We are getting better at estimating second-order derivatives in yet more adverse circumstances. For example, see the pytorch Hessian tools.
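A minimal sketch of a Hessian-vector product that never materializes the Hessian, via `torch.autograd.functional.hvp` (one of several equivalent routes, e.g. double backprop):

```python
import torch

def loss(w):
    return torch.sum(torch.sin(w) ** 2)

w = torch.randn(5)
v = torch.randn(5)

# Returns (loss(w), H(w) @ v); costs roughly two gradient evaluations,
# without forming the full 5x5 Hessian.
value, hvp = torch.autograd.functional.hvp(loss, w, v)
print(hvp)
```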
10 As message-passing
- Thomas Minka, From automatic differentiation to message passing / Slides
11 Software
In decreasing order of relevance to me personally.
11.1 jax
jax (python) is a successor to classic python autograd.
JAX is Autograd and XLA, brought together for high-performance machine learning research.
I use it a lot; see jax.
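The core of it, for flavour (a minimal sketch; anything real lives on the jax page):

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(jnp.sin(x) ** 2)
grad_f = jax.grad(f)              # reverse-mode gradient, itself a jittable function
print(grad_f(jnp.arange(3.0)))    # elementwise 2 sin(x) cos(x) = sin(2x)
```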
11.2 Pytorch
See pytorch.
Another neural-net style thing like tensorflow, but with dynamic graph construction as in autograd.
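A minimal flavour of the define-by-run style: the graph is recorded as ordinary Python executes, and `.backward()` walks it in reverse.

```python
import torch

x = torch.linspace(0.0, 1.0, 4, requires_grad=True)
y = torch.sum(torch.sin(x) ** 2)   # graph recorded on the fly
y.backward()                        # reverse-mode sweep
print(x.grad)                       # 2 sin(x) cos(x)
```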
11.3 Julia
Julia has an embarrassment of different methods of autodiff (homoiconicity and introspection make this comparatively easy), and the comparative selling points of each are not always clear.
Anyway, there is enough going on there that it needs its own page. See Julia Autodiff.
11.4 Tensorflow
Not a fan, but it certainly does work. See Tensorflow. FYI there is an interesting discussion of its workings in the tensorflow Jacobians feature-request ticket.
11.5 Aesara
Aesara is a Python library that allows you to define, optimize, and efficiently evaluate mathematical expressions involving multi-dimensional arrays. It can use GPUs and perform efficient symbolic differentiation.
This is a fork of the original Theano library that is being maintained by the PyMC team.
- A hackable, pure-Python codebase
- Extensible graph framework suitable for rapid development of custom symbolic optimizations
- Implements an extensible graph transpilation framework that currently provides compilation to C and JAX JITed Python functions
- Built on top of one of the most widely-used Python tensor libraries: Theano
Aesara combines aspects of a computer algebra system (CAS) with aspects of an optimizing compiler. It can also generate customized C code for many mathematical operations. This combination of CAS with optimizing compilation is particularly useful for tasks in which complicated mathematical expressions are evaluated repeatedly and evaluation speed is critical. For situations where many different expressions are each evaluated once Aesara can minimize the amount of compilation/analysis overhead, but still provide symbolic features such as automatic differentiation.
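A minimal sketch, assuming the familiar Theano-style API carries over to Aesara (symbolic graph in, compiled gradient function out):

```python
import aesara
import aesara.tensor as at

x = at.dscalar("x")
y = at.sin(x) ** 2
dy = aesara.grad(y, x)         # symbolic differentiation on the graph
f = aesara.function([x], dy)   # compiled callable
print(f(1.0))                  # 2*sin(1)*cos(1)
```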
11.6 taichi
Taichi is a physics-simulation-and-graphics oriented library with clever compilation to various backends, embedded in python:
As a data-oriented programming language, Taichi decouples computation from data organization. For example, you can freely switch between arrays of structures (AOS) and structures of arrays (SOA), or between multi-level pointer arrays and simple dense arrays. Taichi has native support for sparse data structures, and the Taichi compiler effectively simplifies data structure accesses. This allows users to compose data organization components into complex hierarchical and sparse structures. The Taichi compiler optimizes data access.
We have developed 10 different differentiable physical simulators using Taichi, for deep learning and robotics tasks. Thanks to the built-in reverse-mode automatic differentiation system, most of these differentiable simulators are developed within only 2 hours. Accurate gradients from these differentiable simulators make controller optimization orders of magnitude faster than reinforcement learning.
11.7 Classic python autograd
I wouldn’t use this any longer. A better-supported near drop-in replacement is jax, which is much faster and better documented.
Autograd can automatically differentiate native Python and Numpy code. It can handle a large subset of Python’s features, including loops, ifs, recursion and closures, and it can even take derivatives of derivatives of derivatives. It uses reverse-mode differentiation (a.k.a. backpropagation), which means it can efficiently take gradients of scalar-valued functions with respect to array-valued arguments. The main intended application is gradient-based optimization.
AFAICT deprecated in favour of jax.
autograd-forward mixes in forward-mode differentiation to calculate Jacobian-vector products and Hessian-vector products for scalar-valued loss functions, which is useful for classic optimization.
11.8 Micrograd
Andrej Karpathy’s teaching library micrograd is a scalar autograd engine of roughly 100 lines (plus a roughly 50-line neural-net library on top) from which you can learn cool things.
11.9 Enzyme
Generic compiler-level AD targeting many languages
Applying differentiable programming techniques and machine learning algorithms to foreign programs requires developers to either rewrite their code in a machine learning framework, or otherwise provide derivatives of the foreign code. This paper presents Enzyme, a high-performance automatic differentiation (AD) compiler plugin for the LLVM compiler framework capable of synthesizing gradients of statically analyzable programs expressed in the LLVM intermediate representation (IR). Enzyme synthesizes gradients for programs written in any language whose compiler targets LLVM IR including C, C++, Fortran, Julia, Rust, Swift, MLIR, etc., thereby providing native AD capabilities in these languages. Unlike traditional source-to-source and operator-overloading tools, Enzyme performs AD on optimized IR. …Packaging Enzyme for PyTorch and TensorFlow provides convenient access to gradients of foreign code with state-of-the art performance, enabling foreign code to be directly incorporated into existing machine learning workflows. (Moses and Churavy 2020)
Basically the long story short is that Enzyme has a couple of interesting contributions:
- Low-level Automatic Differentiation (AD) IS possible and can be high performance
- By working at LLVM we get cross-language and cross-platform AD
- Working at the LLVM level actually can give more speedups (since it’s able to be performed after optimization)
- We made a plugin for PyTorch/TF that uses Enzyme to import foreign code into those frameworks with ease!
Sounds great but I suspect that in practice there is still a lot of work required to make this go.
NB I tried to find the pytorch and tensorflow bindings but failed. Perhaps discontinued? Julia bindings, Rust bindings and JAX bindings seem real though.
11.10 Theano
Mentioned for historical accuracy.
Theano (python) supported autodiff as a basic feature and had a massive user base, although it is now discontinued in favour of other options. See Aesara for a direct successor, and jax/pytorch/tensorflow for some more widely used alternatives.
11.11 Casadi
A classic is CasADi (Python, C++, MATLAB) (Andersson et al. 2019):
a symbolic framework for numeric optimization implementing automatic differentiation in forward and reverse modes on sparse matrix-valued computational graphs. It supports self-contained C-code generation and interfaces state-of-the-art codes such as SUNDIALS, IPOPT etc. It can be used from C++, Python or Matlab
[…] CasADi is an open-source tool, written in self-contained C++ code, depending only on the C++ Standard Library.
Documentation is sparse; you should probably read the source or the published papers to understand how well this will fit your needs and, e.g., which arithmetic operations it supports.
It might be worth it for features such as graceful support for 100-fold nonlinear composition. It also includes ODE sensitivity analysis (differentiating through ODE solvers), which predates lots of fancypants ‘neural ODEs’. The price you pay is a weird DSL that you must learn, and, unlike many of its trendy peers, it has no GPU support.
11.12 KeOps
File under least squares, autodiff, GPs, pytorch.
The KeOps library lets you compute reductions of large arrays whose entries are given by a mathematical formula or a neural network. It combines efficient C++ routines with an automatic differentiation engine and can be used with Python (NumPy, PyTorch), Matlab, and R.
It is perfectly suited to the computation of kernel matrix-vector products, K-nearest neighbors queries, N-body interactions, point cloud convolutions, and the associated gradients. Crucially, it performs well even when the corresponding kernel or distance matrices do not fit into the RAM or GPU memory. Compared with a PyTorch GPU baseline, KeOps provides a x10-x100 speed-up on a wide range of geometric applications, from kernel methods to geometric deep learning.
11.13 ADOL
Another classic. ADOL-C is a popular C++ differentiation library with python bindings. It looks clunky from Python but tenable from C++.
11.14 ad
ad is based on uncertainties (and therefore Python).
11.15 ceres solver
ceres-solver (C++), the Google least-squares solver, seems to have some good tricks, mostly focused on least-squares losses.
11.16 audi
autodiff, which is usually referred to as audi for the sake of clarity, offers light automatic differentiation for MATLAB. I think MATLAB now has a whole deep learning toolkit built in, which surely supports something natively in this domain.
11.17 algopy
AlgoPy allows you to differentiate functions implemented as computer programs by using Algorithmic Differentiation (AD) techniques in the forward and reverse mode. The forward mode propagates univariate Taylor polynomials of arbitrary order. Hence it is also possible to use AlgoPy to evaluate higher-order derivative tensors.
A speciality of AlgoPy is the possibility to differentiate functions that contain matrix operations such as +, -, *, /, dot, solve, qr, eigh, cholesky.
Looks sophisticated, and indeed supports differentiation elegantly; but not so actively maintained, and the source code is hard to find.