Adaptive design of experiments
I am not going to call it ‘Bayesian optimization’, but that is what everyone else does
April 11, 2017 — January 26, 2025
Closely connected to AutoML, for which surrogate optimisation is a popular tool, and likewise to Bayesian model calibration.
Unless they are working on improving BO algorithms themselves, or working in a large (100+) number of dimensions, I generally recommend people use off-the-shelf Ax and not sweat the fine details. It has a good API and is powerful. Documentation is thin but improving, and the project is active. It can be deployed on real labs, on virtual experiments, and on various weird clusters. By default it usually just works.
1 Problem statement
Depending on your allegiances of hipness you might credit the original statement of the problem to Chernoff (1959) or Močkus (1975). Let’s go with the friendly modern version of Gilles Louppe and Manoj Kumar:
We are interested in solving
\[x^* = \arg \min_x f(x)\]
under the constraints that
- \(f\) is a black box for which no closed form is known (nor its gradients);
- \(f\) is expensive to evaluate;
- evaluations of \(y=f(x)\) may be noisy.
Sometimes we might even have access to gradients, in which case we will additionally say that, rather than observing \(\nabla f\) and \(\nabla^2 f\), we observe some random variables \(G(x), H(x)\) with \(\mathbb{E}[G(x)]=\nabla f(x)\) and \(\mathbb{E}[H(x)]=\nabla^2 f(x)\), as in stochastic optimization.
This resembles the typical framing of reinforcement learning problems where there is a similar explore/exploit trade-off, although I do not know the precise disciplinary boundaries that may transect these areas.
The typical setup here is: we use a surrogate model of the loss surface and optimise that, in the hope that it is computationally cheaper than evaluating the actual loss surface everywhere. An artfully chosen surrogate model can suggest where to sample next, estimate unseen loss values, and possibly even give uncertainty estimates.
When the surrogate model is in particular a Bayesian posterior over the parameter values that we wish to learn, a common name is “Bayesian optimisation”. Gaussian process regression is an obvious method to approximate the loss surface in many cases. This is not crazy. Some of the early work on GP regression (Krige 1951) already seems optimisation-adjacent.
However, GP regressions are not the only possible surrogate models, nor even the only possible Bayesian ones, and there is nothing intrinsically Bayesian about estimating an unknown function. So there are a few ways we can relax the model from the default. Setting that quibble aside, see Apoorv Agnihotri and Nipun Batra’s Exploring Bayesian Optimization for a well-illustrated journey into this field.
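To make that loop concrete, here is a minimal sketch of surrogate optimisation with a GP surrogate and the expected-improvement acquisition, assuming a toy one-dimensional objective; the kernel, budget, and candidate grid are all invented for illustration.

```python
# Surrogate-optimisation loop: fit a GP to the observations so far, use
# expected improvement to pick the next point, evaluate, repeat.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):
    # stand-in for the expensive black box
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(5, 1))  # initial design
y = f(X).ravel()

grid = np.linspace(-3, 3, 500).reshape(-1, 1)  # candidate points
for _ in range(20):
    gp = GaussianProcessRegressor(Matern(nu=2.5), normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    best = y.min()
    # Expected improvement for minimisation
    z = (best - mu) / np.maximum(sigma, 1e-12)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = grid[np.argmax(ei)].reshape(1, -1)  # maximise the acquisition
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next).ravel())

print(X[np.argmin(y)], y.min())  # best point found
```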
Fashionable use: hyperparameter/model selection, e.g. in regularising complex models, which these days is compactly referred to as AutoML.
We could also use adaptive experiments outside of simulations, e.g. in industrial process control, actual lab experiments, mine-shafts, etc. That is where I originally saw the idea, before I knew the name, in the form of sequential ANOVA design, which is an incredible idea itself, although that is now years old and thus not nearly so hip.
Further info in Roman Garnett’s Bayesian Optimization Book (Garnett 2023).
2 Adaptive stopping only
3 With side information
e.g. SEBO (Chan, Paulson, and Mesbah 2023; S. Liu et al. 2023). TBC.
4 BORE
Bayesian optimization by density ratio estimation (Oliveira, Tiao, and Ramos 2022; Louis C. Tiao et al. 2021).
Bayesian optimization (BO) is among the most effective and widely-used blackbox optimization methods. BO proposes solutions according to an explore-exploit trade-off criterion encoded in an acquisition function, many of which are computed from the posterior predictive of a probabilistic surrogate model. Prevalent among these is the expected improvement (EI). The need to ensure analytical tractability of the predictive often poses limitations that can hinder the efficiency and applicability of BO. In this paper, we cast the computation of EI as a binary classification problem, building on the link between class-probability estimation and density-ratio estimation, and the lesser-known link between density-ratios and EI. By circumventing the tractability constraints, this reformulation provides numerous advantages, not least in terms of expressiveness, versatility, and scalability.
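The mechanism is easy to caricature: label the best γ-fraction of observed points “good”, fit a probabilistic classifier to distinguish them from the rest, and use the predicted class probability as the acquisition function. Here is a toy sketch of that idea, not the reference implementation; the objective, classifier, and random candidate sampling are all invented.

```python
# BORE-flavoured loop: acquisition = classifier probability of being
# in the best gamma-quantile of the observations so far.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def f(x):
    return np.sin(3 * x).ravel() + 0.1 * x.ravel() ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(10, 1))
y = f(X)
gamma = 0.25

for _ in range(30):
    threshold = np.quantile(y, gamma)
    labels = (y <= threshold).astype(int)  # 1 = among the best points so far
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
    candidates = rng.uniform(-3, 3, size=(512, 1))
    scores = clf.predict_proba(candidates)[:, 1]  # acquisition values
    x_next = candidates[np.argmax(scores)].reshape(1, -1)
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next))

print(X[np.argmin(y)], y.min())
```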
5 Lab bandits
Sequential experiment design in the lab.
6 Acquisition functions
More useful terminology: Active learning, acquisition functions. TBC.
For now, see BoTorch custom acquisition for an explanation by example.
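For flavour, here is a minimal custom acquisition in the style of that tutorial: an upper-confidence-bound variant for a single-output model. The class name and beta value are my own inventions; it assumes a fitted BoTorch model and the usual convention that the acquisition is maximised.

```python
import torch
from botorch.acquisition.analytic import AnalyticAcquisitionFunction
from botorch.utils.transforms import t_batch_mode_transform

class SimpleUCB(AnalyticAcquisitionFunction):
    """mean + sqrt(beta) * std, for a single-output model (maximisation)."""

    def __init__(self, model, beta: float = 0.2):
        super().__init__(model=model)
        self.register_buffer("beta", torch.as_tensor(beta))

    @t_batch_mode_transform(expected_q=1)
    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X has shape batch x q=1 x d; return one acquisition value per batch
        posterior = self.model.posterior(X)
        mean = posterior.mean.squeeze(-2).squeeze(-1)
        sigma = posterior.variance.sqrt().squeeze(-2).squeeze(-1)
        return mean + self.beta.sqrt() * sigma
```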
7 Connection to RL
TBD.
8 Wacky
Adaptive design methods that I do not understand because they look not so much black box as out of the box. Quasi-oppositional Differential Evolution (Rahnamayan, Tizhoosh, and Salama 2007) is old and comes from a zany field that cites compass points and Yin-Yang as its inspiration (Mahdavi, Rahnamayan, and Deb 2018). However, it is supposedly powerful and robust (“Dagstuhloid Benchmarking” 2023). What is going on there?
9 Over large discrete sequences
Challenging for many BO methods but very important in, e.g. biological ML. I have seen some interesting ones in this space (González-Duque et al. 2024; Stanton et al. 2024).
Benchmarking HDBO summarises SOTA for life sciences. See poli.
10 Implementations
10.1 BoTorch/Ax
BoTorch is the PyTorch-based Bayesian optimization toolbox used by Ax, which is an experiment designer wrapping it up in a nice API.
Ax is a platform for optimising any kind of experiment, including machine learning experiments, A/B tests, and simulations. Ax can optimize discrete configurations (e.g., variants of an A/B test) using multi-armed bandit optimization, and continuous (e.g., integer or floating point)-valued configurations using Bayesian optimization. This makes it suitable for a wide range of applications.
Ax has been successfully applied to a variety of product, infrastructure, ML, and research applications at Facebook.
I wrote a script to run this on a SLURM cluster: Ax + SLURM via submitit and asyncio.
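For a taste of the API, here is a minimal loop using the Ax Service interface, assuming a recent Ax version; the experiment name, parameter bounds, and toy objective are invented.

```python
from ax.service.ax_client import AxClient, ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
    name="toy_experiment",
    parameters=[
        {"name": "x1", "type": "range", "bounds": [-5.0, 5.0]},
        {"name": "x2", "type": "range", "bounds": [-5.0, 5.0]},
    ],
    objectives={"loss": ObjectiveProperties(minimize=True)},
)

def evaluate(params):
    # stand-in for the expensive experiment
    return (params["x1"] - 1.0) ** 2 + (params["x2"] + 2.0) ** 2

for _ in range(20):
    params, trial_index = ax_client.get_next_trial()
    ax_client.complete_trial(trial_index=trial_index, raw_data={"loss": evaluate(params)})

best_parameters, _ = ax_client.get_best_parameters()
print(best_parameters)
```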
10.2 Poli
- MachineLearningLifeScience/poli-baselines: A collection of objective functions and black box optimization algorithms related to proteins and small molecules
- MachineLearningLifeScience/poli: A library of discrete objectives
This is probably what to look for if the problem is about optimising very long things, like DNA strands or sentences or something.
poli-baselines boasts many algorithms:
| Name | Reference |
|---|---|
| Random Mutations | N/A |
| Random hill-climbing | N/A |
| CMA-ES | pycma |
| (Fixed-length) Genetic Algorithm | pymoo’s implementation |
| Hvarfner’s Vanilla BO | Hvarfner et al. 2024 |
| Bounce | Papenmeier et al. 2023 |
| BAxUS | Papenmeier et al. 2022 |
| Probabilistic Reparametrization | Daulton et al. 2022 |
| SAASBO | Eriksson and Jankowiak 2021 |
| ALEBO | Letham et al. 2020 |
| LaMBO2 | Gruver and Stanton et al. 2020 |
[…] This library interoperates well with the discrete objective functions defined in poli. One such objective function is the ALOHA problem, in which we search the space of 5-letter sequences for the word “ALOHA”. The following is a simple example of how one could use the RandomMutation solver inside poli-baselines to solve this problem:
```python
from poli.objective_repository import AlohaProblemFactory
from poli_baselines.solvers import RandomMutation

# Creating an instance of the problem
problem = AlohaProblemFactory().create()
f, x0 = problem.black_box, problem.x0
y0 = f(x0)

# Creating an instance of the solver
solver = RandomMutation(
    black_box=f,
    x0=x0,
    y0=y0,
)

# Running the optimization for 1000 steps,
# breaking if we find a performance above 5.0.
solver.solve(max_iter=1000, break_at_performance=5.0)

# Checking if we got the solution we were waiting for
print(solver.get_best_solution())  # Should be [["A", "L", "O", "H", "A"]]
```
10.3 Nevergrad
Nevergrad - A gradient-free optimization platform
It looks kinda similar to Ax, but I have not used it so cannot say how it compares.
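The entry point, as far as I can tell from the README, looks like this (toy objective invented):

```python
import nevergrad as ng

def square(x):
    # stand-in objective; x arrives as a numpy array of length 2
    return sum((x - 0.5) ** 2)

optimizer = ng.optimizers.NGOpt(parametrization=2, budget=100)
recommendation = optimizer.minimize(square)
print(recommendation.value)
```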
10.4 skopt
skopt (a.k.a. scikit-optimize)
[…] is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements several methods for sequential model-based optimization.
This is a member of the sklearn club, which is to say it works well, reliably, predictably and universally, and has amazing tooling, but it is not that fast and has few modern fancy fripperies.
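Usage is about as terse as it gets; a sketch with an invented objective and bounds:

```python
from skopt import gp_minimize

def f(x):
    # stand-in for the expensive, noisy black box
    return (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2

res = gp_minimize(f, [(-5.0, 5.0), (-5.0, 5.0)], n_calls=30, random_state=0)
print(res.x, res.fun)
```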
10.5 Dragonfly
…is an open source Python library for scalable Bayesian optimization.
Bayesian optimization is used for optimising black-box functions whose evaluations are usually expensive. Beyond vanilla optimization techniques, Dragonfly provides an array of tools to scale up Bayesian optimization to expensive large scale problems. These include features/functionality that are especially suited for high-dimensional optimization (optimising for a large number of variables), parallel evaluations in synchronous or asynchronous settings (conducting multiple evaluations in parallel), multi-fidelity optimization (using cheap approximations to speed up the optimization process), and multi-objective optimization (optimising multiple functions simultaneously).
Python and Fortran, open-source.
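A sketch based on the quickstart in their README; the objective, bounds, and budget are invented:

```python
from dragonfly import minimise_function

def obj(x):
    # stand-in objective
    return (x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2

domain = [[-5.0, 5.0], [-5.0, 5.0]]  # box bounds for each variable
min_val, min_pt, history = minimise_function(obj, domain, max_capital=50)
print(min_val, min_pt)
```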
10.6 PySOT
Surrogate Optimization Toolbox (pySOT) for global deterministic optimisation problems. pySOT is hosted on GitHub.
The main purpose of the toolbox is for optimisation of computationally expensive black-box objective functions with continuous and/or integer variables. All variables are assumed to have bound constraints in some form where none of the bounds are infinity. The tighter the bounds, the more efficient are the algorithms since it reduces the search region and increases the quality of the constructed surrogate. This toolbox may not be very efficient for problems with computationally cheap function evaluations. Surrogate models are intended to be used when function evaluations take from several minutes to several hours or more.
This has a huge variety of different surrogate options, a long history and promises many cool features such as automatic concurrency. However, I think it has fallen from favour since it has not been updated for several years. Based on (Krityakierne, Akhtar, and Shoemaker 2016; Regis and Shoemaker 2013, 2009, 2007). It is one of the ones that does not emphasise Bayesian interpretation of the optimisation.
10.7 GPyOpt
Gaussian process Optimization using GPy. Performs global optimization with different acquisition functions. Among other functionalities, it is possible to use GPyOpt to optimize physical experiments (sequentially or in batches) and tune the parameters of Machine Learning algorithms. It is able to handle large data sets via sparse Gaussian process models.
By the same lab at Sheffield that brought us GPy.
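A sketch following the documented pattern; bounds and objective are invented. Note GPyOpt hands the objective a 2-D array of candidate points:

```python
import numpy as np
import GPyOpt

def f(x):
    # x is a 2-D array, one candidate per row; return a column of values
    return np.sum((x - 0.3) ** 2, axis=1, keepdims=True)

domain = [{"name": "x", "type": "continuous", "domain": (-1, 1), "dimensionality": 2}]
opt = GPyOpt.methods.BayesianOptimization(f=f, domain=domain)
opt.run_optimization(max_iter=20)
print(opt.x_opt, opt.fx_opt)
```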
10.8 Sigopt
sigopt is a commercial product that presumably does a good job. The fact their website does not give even a hint of the price leads me to suspect they are extremely expensive.
10.9 spearmint
Spearmint is a package to perform Bayesian optimisation according to the algorithms outlined in the paper (Snoek, Larochelle, and Adams 2012).
The code consists of several parts. It is designed to be modular to allow swapping out various ‘driver’ and ‘chooser’ modules. The ‘chooser’ modules are implementations of acquisition functions such as expected improvement, UCB or random. The drivers determine how experiments are distributed and run on the system. As the code is designed to run experiments in parallel (spawning a new experiment as soon as a result comes in), this requires some engineering.
Spearmint2 is similar but more recently updated and fancier; however, it has a restrictive licence prohibiting wide redistribution without the payment of fees. You may or may not wish to trust the implied level of development and support of four Harvard Professors, depending on your application.
Both of the Spearmint options (especially the latter) have opinionated choices of technology stack to do their optimizations. This means they can do more work for you but require more setup than a simple little thing like skopt. Depending on your computing environment, this might be an overall plus or a minus.
10.10 SMAC
SMAC (sequential model-based algorithm configuration) is a versatile tool for optimising algorithm parameters (or the parameters of some other process we can run automatically, or a function we can evaluate, such as a simulation).
SMAC has helped us speed up both local search and tree search algorithms by orders of magnitude on certain instance distributions. Recently, we have also found it to be very effective for the hyperparameter optimization of machine learning algorithms, scaling better to high dimensions and discrete input dimensions than other algorithms. Finally, the predictive models SMAC is based on can also capture and exploit important information about the model domain, such as which input variables are most important.
We hope you find SMAC similarly useful. Ultimately, we hope that it helps algorithm designers focus on tasks that are more scientifically valuable than parameter tuning.
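A sketch along the lines of the SMAC3 quickstart, assuming SMAC3 v2; the configuration space and toy target function are invented:

```python
from ConfigSpace import Configuration, ConfigurationSpace
from smac import HyperparameterOptimizationFacade, Scenario

# One continuous parameter x in [-5, 5]
configspace = ConfigurationSpace({"x": (-5.0, 5.0)})

def target(config: Configuration, seed: int = 0) -> float:
    # stand-in for an expensive algorithm run
    return (config["x"] - 1.0) ** 2

scenario = Scenario(configspace, deterministic=True, n_trials=50)
smac = HyperparameterOptimizationFacade(scenario, target)
incumbent = smac.optimize()
print(incumbent)
```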