Adaptive design of experiments
I am not going to call it ‘Bayesian optimization’, but that is what everyone else does
April 11, 2017 — February 27, 2025
Closely connected to AutoML, where surrogate optimisation is quite popular, and likewise to Bayesian model calibration.
Unless you are improving BO algorithms themselves, or working with a large number (100+) of dimensions, I usually recommend using off-the-shelf Ax and not worrying about the fine details. It has a good API, and it’s powerful. Documentation is improving, and the project is active. It can be deployed in real labs, in virtual experiments, and on various weird clusters. By default, it usually just works.
1 Problem statement
Depending on your allegiance to hipness, you might credit the original statement of the problem to either Chernoff (1959) or Močkus (1975). Let’s go with the friendly modern version from Gilles Louppe and Manoj Kumar:
We are interested in solving
\[x^* = \arg \min_x f(x)\]
under the constraints that
- \(f\) is a black box for which no closed form is known (nor its gradients);
- \(f\) is expensive to evaluate;
- evaluations of \(y=f(x)\) may be noisy.
We might imagine sometimes having access to gradients. In such cases, we will additionally say that, rather than observing \(\nabla f(x), \nabla^2 f(x)\) directly, we observe random variables \(G(x),H(x)\) with \(\mathbb{E}[G(x)]=\nabla f(x)\) and \(\mathbb{E}[H(x)]=\nabla^2 f(x)\), as in stochastic optimization.
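To make the unbiasedness condition concrete, here is a small sketch of such a gradient oracle. The toy objective \(f(x)=(x-1)^2\), the Gaussian noise model, and the names `grad_f` and `noisy_gradient` are all assumptions made for illustration.

```python
import random

def grad_f(x):
    """True gradient of the toy objective f(x) = (x - 1)^2 (hypothetical)."""
    return 2.0 * (x - 1.0)

def noisy_gradient(x, rng, noise_std=1.0):
    """One draw of G(x): the true gradient plus zero-mean Gaussian noise."""
    return grad_f(x) + rng.gauss(0.0, noise_std)

rng = random.Random(0)
x = 3.0
draws = [noisy_gradient(x, rng) for _ in range(10_000)]
# Averaging many draws should land close to the true gradient, grad_f(3.0) = 4.0.
mean_draw = sum(draws) / len(draws)
```

Any single draw can be far from the true gradient; only the expectation is pinned down, which is all the condition asks for.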
This setup resembles reinforcement learning, which faces the same explore/exploit trade-off, though I don’t know the exact disciplinary boundaries.
The typical setup here is: We use a surrogate model of the loss surface and optimise that, aiming for a computationally cheaper alternative than evaluating the whole loss surface. An artfully chosen surrogate model can estimate where to sample next, predict unseen loss values, and possibly even give uncertainty estimates.
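A minimal sketch of that loop, under illustrative assumptions: a one-dimensional toy objective, a zero-mean GP with a fixed RBF kernel as the surrogate, a grid of candidate points, and a lower-confidence-bound rule for choosing where to sample next. The names `toy_objective`, `gp_posterior`, and `surrogate_minimise` are made up for this example, and nothing here is tuned.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    mean = k_qt @ np.linalg.solve(k_tt, y_train)
    # Prior variance is 1 for this kernel; subtract what the data explains.
    var = 1.0 - np.sum(k_qt * np.linalg.solve(k_tt, k_qt.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def toy_objective(x):
    """Hypothetical stand-in for an expensive black box; minimum at x = 0.3."""
    return (x - 0.3) ** 2

def surrogate_minimise(f, n_init=3, n_iter=12, kappa=2.0, seed=0):
    """Alternate between fitting the surrogate and evaluating f where it suggests."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)       # candidate points
    x_obs = rng.uniform(0.0, 1.0, n_init)   # initial random design
    y_obs = f(x_obs)
    for _ in range(n_iter):
        # Standardise targets so the unit-variance GP prior is roughly sensible.
        y_scaled = (y_obs - y_obs.mean()) / (y_obs.std() + 1e-12)
        mean, std = gp_posterior(x_obs, y_scaled, grid)
        # Lower confidence bound: favour low predicted loss or high uncertainty.
        x_next = grid[np.argmin(mean - kappa * std)]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, f(x_next))
    best = np.argmin(y_obs)
    return x_obs[best], y_obs[best]

x_best, y_best = surrogate_minimise(toy_objective)
```

The whole run costs 15 evaluations of the objective; everything else is cheap computation on the surrogate, which is the point of the exercise.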
When the surrogate model is a Bayesian posterior over parameter values we want to learn, it’s often called “Bayesian optimisation.” Gaussian process regression is often used to approximate the loss surface. This isn’t crazy. Early work on GP regression (Krige 1951) was already somewhat optimisation-adjacent.
However, GP regressions aren’t the only possible surrogate models, not even the only possible Bayesian ones, and there’s nothing innately Bayesian about estimating unknown functions. So, there are several ways we can adjust from the default. Setting that issue aside, see Apoorv Agnihotri, Nipun Batra, Exploring Bayesian Optimization for a well-illustrated journey into this field.
Fashionable use: hyperparameter and model selection, e.g. regularising complex models, often called AutoML.
We could also use adaptive experiments outside simulations, such as in industrial process control, real labs, mine shafts, and more. I first noticed this idea in sequential ANOVA design. Even though it’s not nearly so hip now, it’s still an incredible idea years after its inception.
Further info in Roman Garnett’s Bayesian Optimization Book (Garnett 2023).
2 Adaptive stopping only
3 With side information
e.g. SEBO (Chan, Paulson, and Mesbah 2023; S. Liu et al. 2023). To be continued.
4 BORE
Bayesian optimization by density ratio estimation (Oliveira, Tiao, and Ramos 2022; Louis C. Tiao et al. 2021).
Bayesian optimization (BO) is among the most effective and widely-used black-box optimization methods. BO proposes solutions according to an explore-exploit trade-off criterion encoded in an acquisition function, many of which are computed from the posterior predictive of a probabilistic surrogate model. Prevalent among these is the expected improvement (EI). The need to ensure analytical tractability of the predictive often poses limitations that can hinder the efficiency and applicability of BO. In this paper, we cast the computation of EI as a binary classification problem, building on the link between class-probability estimation and density-ratio estimation, and the lesser-known link between density ratios and EI. By circumventing the tractability constraints, this reformulation provides numerous advantages, not least in terms of expressiveness, versatility, and scalability.
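To unpack the link between class-probability estimation and density-ratio estimation mentioned in that abstract, here is a short derivation. The notation (\(\tau\), \(\gamma\), \(\ell\), \(g\), \(\pi\)) is assumed for this sketch and should be checked against the paper. Pick a threshold \(\tau\) such that a fraction \(\gamma = P(y \le \tau)\) of outcomes count as “good”, and define the densities of good and bad inputs and their relative density ratio:

\[\ell(x) = p(x \mid y \le \tau), \qquad g(x) = p(x \mid y > \tau), \qquad r_\gamma(x) = \frac{\ell(x)}{\gamma\,\ell(x) + (1-\gamma)\,g(x)}.\]

Now label each observation with \(z = \mathbb{1}[y \le \tau]\). By Bayes’ rule, the probability that a classifier should assign to the “good” class is

\[\pi(x) = P(z = 1 \mid x) = \frac{\gamma\,\ell(x)}{\gamma\,\ell(x) + (1-\gamma)\,g(x)} = \gamma\, r_\gamma(x).\]

So any probabilistic classifier trained to separate good from bad observations estimates the density ratio up to the constant \(\gamma\), and maximising its predicted probability plays the role that maximising EI plays in the usual formulation.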
5 Lab bandits
Sequential experiment design in the lab.
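As a placeholder illustration of the bandit view, here is a sketch of Beta-Bernoulli Thompson sampling, one common bandit strategy. It assumes each “arm” is an experimental condition that either succeeds or fails; the success rates passed in are hypothetical and would be unknown in a real lab.

```python
import random

def thompson_sampling(success_probs, n_rounds=2000, seed=0):
    """Beta-Bernoulli Thompson sampling over a fixed set of arms.

    success_probs are the hidden per-arm success rates (hypothetical here).
    Returns the number of times each arm was tried.
    """
    rng = random.Random(seed)
    n_arms = len(success_probs)
    wins = [0] * n_arms
    losses = [0] * n_arms
    for _ in range(n_rounds):
        # Draw a plausible success rate for each arm from its Beta posterior
        # (uniform Beta(1, 1) prior), then run the arm that looks best.
        draws = [rng.betavariate(wins[a] + 1, losses[a] + 1) for a in range(n_arms)]
        arm = max(range(n_arms), key=draws.__getitem__)
        # Simulate the experiment outcome for the chosen arm.
        if rng.random() < success_probs[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [w + l for w, l in zip(wins, losses)]

pulls = thompson_sampling([0.2, 0.5, 0.8])
```

Most of the budget should end up on the best arm, with the weaker arms tried only often enough to rule them out.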
6 Acquisition functions
More useful terminology: Active learning, acquisition functions. To be continued.
For now, see BoTorch custom acquisition for an explanation by example.
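In the meantime, here is a sketch of the best-known acquisition function, expected improvement, in its closed form for a Gaussian predictive distribution. It assumes we are minimising and that the surrogate returns a predictive mean and standard deviation at the candidate point; the function name and signature are made up for this example and are not BoTorch’s API.

```python
import math

def expected_improvement(mean, std, best_so_far):
    """Closed-form EI for minimisation under a Gaussian predictive N(mean, std^2).

    EI = E[max(best_so_far - Y, 0)] with Y ~ N(mean, std^2).
    """
    if std <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(best_so_far - mean, 0.0)
    z = (best_so_far - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best_so_far - mean) * cdf + std * pdf

# Two candidates with the same predicted mean: the more uncertain one scores higher.
ei_confident = expected_improvement(mean=0.0, std=1.0, best_so_far=0.0)
ei_uncertain = expected_improvement(mean=0.0, std=2.0, best_so_far=0.0)
```

The first term rewards candidates predicted to beat the incumbent (exploitation) and the second rewards uncertainty (exploration), which is the trade-off an acquisition function is meant to encode.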
7 Connection to RL
To be determined.
8 Wacky
Adaptive design methods I don’t understand because they look not so much black box as out of the box. Quasi-oppositional Differential Evolution (Rahnamayan, Tizhoosh, and Salama 2007) is old and comes from a zany field that cites compass points and Yin-Yang as inspiration (Mahdavi, Rahnamayan, and Deb 2018). Supposedly, it’s powerful and robust (“Dagstuhloid Benchmarking” 2023). What’s going on here?
9 Over large discrete sequences
Challenging for many BO methods but vital in, e.g. biological ML. I’ve seen some interesting ones in this space (González-Duque et al. 2024; Stanton et al. 2024).
Benchmarking HDBO summarises SOTA for life sciences. See poli.
10 Implementations
10.1 BoTorch/Ax
BoTorch is the PyTorch-based Bayesian optimization toolbox used by Ax, which is an experiment designer wrapped up in a nice API.
Ax is a platform for optimising any kind of experiment, including machine learning experiments, A/B tests, and simulations. Ax can optimise discrete configurations (e.g., variants of an A/B test) using multi-armed bandit optimization and continuous (e.g., integer or floating point)-valued configurations using Bayesian optimization. This makes it suitable for many applications.
Ax has been used for various product, infrastructure, ML, and research applications at Facebook.
I wrote a script to run this on a Slurm cluster: Ax + SLURM via submitit and asyncio.
10.2 Nevergrad
Nevergrad - A gradient-free optimization platform
It looks similar to Ax, but I haven’t used it, so I can’t say how it compares.
10.3 Poli
- MachineLearningLifeScience/poli-baselines: A collection of objective functions and black box optimization algorithms related to proteins and small molecules
- MachineLearningLifeScience/poli: A library of discrete objectives
This is probably what you want if the problem involves optimising long sequences, like DNA strands or sentences.
poli-baselines has many algorithms:
Name | Reference |
---|---|
Random Mutations | N/A |
Random hill-climbing | N/A |
CMA-ES | pycma |
(Fixed-length) Genetic Algorithm | pymoo’s implementation |
Hvarfner’s Vanilla BO | Hvarfner et al. 2024 |
Bounce | Papenmeier et al. 2023 |
BAxUS | Papenmeier et al. 2022 |
Probabilistic Reparametrization | Daulton et al. 2022 |
SAASBO | Eriksson and Jankowiak 2021 |
ALEBO | Letham et al. 2020 |
LaMBO2 | Gruver and Stanton et al. 2020 |
[…] This library works well with the discrete objective functions in poli. One example is the ALOHA problem, involving searching 5-letter sequences to spell “ALOHA”. Here’s how to use the RandomMutation solver inside poli-baselines:
```python
from poli.objective_repository import AlohaProblemFactory
from poli_baselines.solvers import RandomMutation

# Create an instance of the problem
problem = AlohaProblemFactory().create()
f, x0 = problem.black_box, problem.x0
y0 = f(x0)

# Create an instance of the solver
solver = RandomMutation(
    black_box=f,
    x0=x0,
    y0=y0,
)

# Run the optimisation for 1000 steps,
# breaking if we find a performance above 5.0.
solver.solve(max_iter=1000, break_at_performance=5.0)

# Check if we got the solution we wanted
print(solver.get_best_solution())  # Should be [["A", "L", "O", "H", "A"]]
```
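For readers without poli installed, the same idea can be sketched with the standard library alone. This is not poli’s implementation: `aloha_score` and `random_mutation_search` are hypothetical stand-ins, the starting word is arbitrary, and the greedy accept-if-not-worse rule is an assumption about what a random-mutation solver might reasonably do.

```python
import random
import string

TARGET = "ALOHA"  # hidden from the solver, which only ever sees scores

def aloha_score(candidate):
    """Black box: number of positions where the candidate matches the target."""
    return sum(c == t for c, t in zip(candidate, TARGET))

def random_mutation_search(black_box, x0, max_iter=20000, break_at=5, seed=0):
    """Greedy random-mutation search over fixed-length uppercase sequences."""
    rng = random.Random(seed)
    best, best_score = list(x0), black_box(x0)
    for _ in range(max_iter):
        if best_score >= break_at:
            break
        candidate = list(best)
        # Mutate a single random position to a random uppercase letter.
        candidate[rng.randrange(len(candidate))] = rng.choice(string.ascii_uppercase)
        score = black_box(candidate)
        if score >= best_score:  # keep ties so the search can drift sideways
            best, best_score = candidate, score
    return "".join(best), best_score

solution, score = random_mutation_search(aloha_score, "MIGHT")
```

With five positions and 26 letters, a single mutation fixes a given wrong position with probability 1/130, so a few hundred evaluations are typically enough here. That budget grows quickly with sequence length, which is why the model-based solvers in the table above exist.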
10.4 skopt
skopt (aka scikit-optimize)
[…] is a simple and efficient library to minimise (very) expensive and noisy black-box functions. It implements several methods for sequential model-based optimisation.
This belongs to the sklearn family, meaning it works reliably and predictably and has amazing tooling, but it’s not fast and lacks recent enhancements.
10.5 Dragonfly
…is an open source Python library for scalable Bayesian optimisation.
Bayesian optimization optimises expensive black-box functions. Dragonfly offers tools to scale up Bayesian optimization for large problems, with features for high-dimensional optimization, parallel evaluations, multi-fidelity optimization, and multi-objective optimisation.
It’s written in Python and Fortran, open-source.
10.6 PySOT
The Surrogate Optimization Toolbox (pySOT) for global deterministic optimisation problems. pySOT is hosted on GitHub.
The main purpose is to optimise expensive black-box objective functions with continuous and/or integer variables, where all variables have bound constraints. The tighter the bounds, the more efficient the algorithms. This toolbox is less efficient for tasks with cheap evaluations.
It has many surrogate options, a long history, and cool features like automatic concurrency, but it hasn’t been updated for years, which is perhaps why it has fallen from favour (Krityakierne, Akhtar, and Shoemaker 2016; Regis and Shoemaker 2013, 2009, 2007). It doesn’t emphasise a Bayesian interpretation of the optimisation.
10.7 GPyOpt
Gaussian process Optimization using GPy. Performs global optimization with different acquisition functions. You can use GPyOpt to optimise physical experiments (sequentially or in batches) and tune ML algorithms. It’s excellent at handling large datasets through sparse Gaussian process models.
Created by the same lab at Sheffield that brought us GPy.
10.8 Sigopt
sigopt is a commercial product that likely delivers impressive results. Given no pricing information on their website, one suspects it’s quite pricey.
10.9 spearmint
Spearmint is a package for Bayesian optimisation based on (Snoek, Larochelle, and Adams 2012).
The code consists of several parts and is modular, allowing for various ‘driver’ and ‘chooser’ modules. The ‘choosers’ are implementations of acquisition functions like expected improvement or random. The drivers manage experiment distribution and execution on the system. Designed for running parallel experiments (launching new experiments as soon as results come in), it requires some engineering know-how.
Spearmint2 is similar but fancier and more recently updated; however, it has a restrictive licence that prohibits wide redistribution without paying fees. You may or may not want to trust the development and support implied by four Harvard professors, depending on your application.
Both of the Spearmint options (especially the latter) have opinionated choices of technology stack for their optimizations. This means they can do more for you but require more setup than something simple like skopt. Depending on your computing environment, this might be an overall plus or minus.
10.10 SMAC
SMAC (sequential model-based algorithm configuration) is a versatile tool for optimising algorithm parameters (or the parameters of some other process we can run automatically or a function we can evaluate, such as a simulation).
SMAC has helped us speed up both local search and tree search algorithms by orders of magnitude on certain instance distributions. Recently, we have also found it to be very effective for the hyperparameter optimization of machine learning algorithms, scaling better to high dimensions and discrete input dimensions than other algorithms. Finally, the predictive models SMAC is based on can also capture and exploit important information about the model domain, such as which input variables are most important.
We hope you find SMAC similarly useful. Ultimately, we hope that it helps algorithm designers focus on tasks that are more scientifically valuable than parameter tuning.