Bayesian Optimization and Active Learning Cookbook

Date
March 28, 2025
Authors

Contents

This document is intended as a reference for implementing various types of Bayesian optimization and active learning.

We begin with an introduction where we maximize a simple 1D function using Bayesian optimization, while explaining some of the theory and practicalities behind BoTorch.

Subsequently, a number of "recipes" are introduced, solving a more realistic engineering design problem based on AeroSandbox. Finally, there is a comparison against genetic algorithms (GA), which are another popular black-box optimization approach.

  1. Vanilla Bayesian Optimization
  2. Batched Bayesian Optimization
  3. Multi-Objective (Batched) Bayesian Optimization
  4. Constrained (Batched) Bayesian Optimization
  5. Pure Active Learning
  6. Bonus: BO vs GA

Note: The code uses BoTorch 0.13.0.

Introduction

Bayesian Optimization

Bayesian optimization (BO) is a method for globally optimizing a black-box function. The most general description is:

  1. Train a surrogate model on data collected from evaluating the black-box function.
  2. Define an acquisition function based on the outputs of the black-box function.
  3. Find points of the input space that maximize the acquisition function.
  4. Evaluate the black-box function at these points, and add the results to the training dataset.
  5. GOTO 1

Traditionally, the surrogate we use is a Gaussian process (GP), because the uncertainty quantification (UQ) allows us to target regions of the input space where the uncertainty is high. However, we can do the same with any surrogate that supports UQ. If the surrogate does not support UQ, then you are restricted in which acquisition functions you can use, and you're not really doing BO.

Active Learning

To give it its most general definition, active learning (AL) is any method where a model is used to decide which samples should be added to the training set ("labeled" using a black-box function). AL scenarios typically fall under one of the following three categories (see Active learning for data streams: a survey (Cacciarelli, 2023)):

  • Pool-based AL: there is a finite pool of unlabelled samples; all samples are evaluated using some criteria and a subset of them are labelled.
  • Online (or stream) AL: there is an infinite stream of unlabelled samples, and for each sample a decision has to be made whether or not to label it.
  • Membership query synthesis: any point in the input space can be labeled — this is typically the case we have at PhysicsX.

With this in mind, it is clear that BO can be seen as a type of AL, where the criteria for deciding which samples to label depends on an objective that we want to maximize. "Pure" AL is the setting where we don't have any objective other than training an accurate model (or in other words, the objective is to minimize the error of the model).

Ingredients

If you want to try the following recipes for yourself, you'll need these. The plotting code and AeroSandbox aircraft definition are not included, to keep the focus on the optimization process.

import numpy as np
import scipy.stats
import torch
from botorch import acquisition, sampling
from botorch.acquisition.objective import ScalarizedPosteriorTransform, ConstrainedMCObjective
from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskGP
from botorch.models.transforms import Normalize
from botorch.optim import optimize_acqf
from botorch.utils.multi_objective.box_decompositions import FastNondominatedPartitioning
from gpytorch.mlls import ExactMarginalLogLikelihood
from tqdm.notebook import trange

device = "cuda:0"
torch.set_default_device(device)

def array(x: torch.Tensor) -> np.ndarray:
    return x.detach().cpu().numpy()

Simple BO Example

Let's try this on a toy 1D example — we'll try and find the global maximum of this function for $x∈[0,1]$. You can see that there are a lot of local optima, so a conventional optimization algorithm will usually converge to a suboptimal point.

Remember that in practice this function can be noisy, nondifferentiable, and expensive to evaluate — for example, a numerical simulation — which is the reason we want to minimize how many times we evaluate the function.

default_a = 15.0 
default_b = 2.0 
default_noise = 0.2

def example_1d_function(
    x: torch.Tensor, a: float = default_a, b: float = default_b
) -> torch.Tensor:
    """Ground truth function.""" 
    return torch.sin(a * x) + b * x**2

def example_noisy_1d_function(
    x: torch.Tensor, a: float = default_a, b: float = default_b, noise: float = default_noise
) -> torch.Tensor:
    """Noisy ground truth observations."""
    return example_1d_function(x, a, b) + noise * torch.randn_like(x)

Gradient-Based Optimization

However, in this case we can perform gradient-based optimization, so let's show how torch.optim.Adam performs:

We've settled on a local optimum, but due to the poor starting point, it's substantially lower than the global optimum.

Bayesian Optimization

Now let's try BO instead. We'll add some observation noise as well to make our lives a bit harder. First, we need to sample a few random points and pass them through our noisy function:

x_init = torch.rand(2, dtype=torch.float64)
y_init = example_noisy_1d_function(x_init)

GP Fitting

Then, we'll fit a GP to these points:

gp = SingleTaskGP(x_init[:, None], y_init[:, None])
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
fit_gpytorch_mll(mll)

gp_posterior = gp.posterior(x_grid)
mean = gp_posterior.mean.detach()[:, 0]
std = gp_posterior.variance.detach()[:, 0] ** 0.5

Acquisition Function

Now, we define an acquisition function that we'll maximize in order to find candidate points. There is a wide range of choices in BoTorch, but we'll start with one of the simplest ones — upper confidence bound (UCB):

$\text{UCB}(x) = \mu(x) + \sqrt{\beta} \sigma(x)$

where $\mu(x)$ is the model's mean prediction at $x$, $\sigma(x)$ is the model's standard deviation at $x$, and $\beta$ is a hyperparameter that controls the balance between exploration and exploitation. High $\beta$ prioritizes areas of high uncertainty (exploration); low $\beta$ prioritizes areas where the objective function is believed to be high. Note that for $\beta=1$ we recover the top of the uncertainty interval in the above plot.

ucb = acquisition.UpperConfidenceBound(gp, beta=4)
# We'll explain this indexing later!
ucb_value = ucb(x_grid[:, None, None])

Acquisition Function Maximization

Unlike the function $f(x)$, $\text{UCB}(x)$ is smooth, differentiable, and fast to evaluate. In this simple 1D case, we can just evaluate it on a grid and find the maximum.

i_max = ucb_value.argmax()
x_candidate = x_grid[i_max]
value_candidate = ucb_value[i_max]

Alternatively, BoTorch provides routines for maximizing the acquisition function using SciPy (or even PyTorch) optimizers. We typically use optimize_acqf, which uses scipy.optimize.minimize to perform gradient ascent from many starting points in parallel. In this section, we'll just directly find the maximum by evaluating the acquisition function on a grid.

Keep in mind that, especially in high dimensions, the acquisition function can be a very complicated function with lots of local maxima; as you'll see next, it can be substantially more complicated than the ground truth function.

This might seem counterintuitive — surely BO just makes optimization even more difficult - but remember the following:

  • The acquisition function is fast to evaluate (and parallelizable).
  • The acquisition function is smooth and noise-free.
  • The acquisition function is analytically differentiable.

If all of the above also apply for your ground truth function (for example, if you are using BO to optimize a surrogate model), you might want to consider a more direct optimization method.

Bayesian Optimization Loop

At this stage, we will evaluate $f(x)$, refit the model with the new dataset, and give it another go.

More Acquisition Functions

BoTorch has a wide range of acquisition functions built in. Some of the most popular are the "expected improvement" (EI) family:

$\text{EI}(x) = \mathbb{E}_f\bigl[\max(f(x) - f^*, 0)\bigr]$

where the expectation is taken over the value of the stochastic function $f$ at the point $x$, and $f^*$ is the best function value observed so far (assuming noiseless observations).

Noisy EI (NEI) is a variant of EI that does not require knowledge of $f^*$, as it approximates a noise-free observation from the model at all observed $x$ values.

Probability of improvement (PI) is a simple acquisition function that does what it says on the tin: $\text{PI}(x) = P(f(x) > f^*)$.

Note that if samples from $f(x)$ are Gaussian, then all of these functions can be expressed in terms of the mean $\mu(x)$ and standard deviation $\sigma(x)$ at $x$.

Monte Carlo Acquisition Functions

All of the acquisition functions we just looked at are "analytic" — they are deterministic functions of the mean and standard deviation of the posterior. Although this is convenient (and efficient when working with GPs with Gaussian posteriors), it is also restrictive. In general, batch acquisition functions (that consider the joint utility of multiple candidates) cannot be expressed analytically, nor can acquisition functions where the posterior distribution is not Gaussian.

BoTorch supports Monte Carlo (MC) acquisition functions, which evaluate the expectations in the acquisition functions by sampling. These are generally preferred over analytic acquisition functions, which are efficient and precise, but offer much less flexibility than MC acquisition functions.

Here, we repeat the experiment for MC EI (qEI, as it's called in BoTorch), for a range of different sample sizes for evaluating the expectation. Note that rather than sampling randomly, a determinstic Sobol sequence is used, which both reduces the variance of the estimation and allows the use of gradient-based acquisition function optimization.

In this case, 64 samples are sufficient for a very good approximation of the analytic acquisition function, but even with fewer samples, the location of the maximum is the same.

Batching

We've referred to "batch acquisition functions" a few times. These are helpful in situations where you have the ability to evaluate your ground truth function several times in parallel (for example, you have HPC capacity to run four CFD simulations at a time). Batch acquisition functions allow for the exploration of several different areas at once. This is known as "q-batching", and acquisition functions that support this have class names starting with q.

It's important to remember that the penultimate dimension of the input to a q-acquisition function is the q-batch, so candidate tensors typically have shape $b \times q \times d$ (where candidates across the b dimension are considered independently). See the BoTorch page on batching for more details, including on batched models.

In BoTorch, there are two ways of doing q-batched optimization.

  • Joint optimization (sequential=False, default): directly optimize the full candidate batch in one shot.
  • Sequential optimization (sequential=True): run a separate optimization for each candidate, each one conditional on the previous unevaluated candidate.

For further reading, see the BoTorch page on optimization and Maximizing acquisition functions for Bayesian optimization (Wilson, 2018). In practice, joint optimization is faster but requires more GPU memory, and sequential optimization is slightly more robust (especially with high q).

For a simple example, we'll set q=2 and visualize the acquisition surface for each pair of candidates. The optimal pair of candidates is shown in red.

  • The value is symmetric across $x_1=x_2$ because the candidate order is irrelevant, which explains the presence of two identical maxima.
  • The value decreases towards $x_1=x_2$.

A Slightly More Realistic Example

At PhysicsX, we often work with physical simulations. Let's find a more familiar example function to try and maximize.

We'll use AeroSandbox to generate and evaluate aircraft designs, using the AeroBuildup method.

This simulation runs in a few seconds and, for the purposes of this example, is neither parallelizable nor differentiable.

To make our lives easier later on, we'll define two functions: one to build the simulation inputs (essentially creating a mesh from CAD parameters) and another to call this function, run the simulation, and optionally perform some postprocessing.

def get_design_and_sim_inputs(sample: np.ndarray):
    """Constructs an aircraft model and boundary conditions given a parameter vector."""
    model_fields = AircraftDesignWithTail.model_fields

    # Separate design and operation variables
    design = sample[:len(model_fields)]
    operational = sample[len(model_fields):]

    params = {k: v for k, v in zip(model_fields, design)}
    design = AircraftDesignWithTail.model_construct(**params)

    # Default operational variables
    sim_inputs = {
        "alpha": operational[0],
        "velocity": operational[1],
        "altitude": operational[2],
    }

    return design, sim_inputs

def simulate(sample: np.ndarray, plot=False):
    """Runs an aerodynamic simulation given a parameter vector."""
    design, sim_inputs = get_design_and_sim_inputs(sample)
    aero = design.run_aero_buildup(**sim_inputs)
    if plot:
        fig = design.airplane.draw(
            backend="plotly", 
            show=False, 
            show_kwargs={"colorbar_title": None},
        )
        return aero, fig
    else:
        return aero

For all the recipes, we'll use BoTorch's default single-task multi-output GP. Of course, if you have some knowledge about your problem, incorporating it into the model will likely enhance performance.

def fit_single_task_gp(
    train_x: torch.Tensor,
    train_y: torch.Tensor,
    train_yvar: torch.Tensor | None = None,
) -> SingleTaskGP:
    """Fit a SingleTaskGP

    Args:
        train_x (Tensor): Train inputs, shape (n, d)
        train_y (Tensor): Train targets, shape (n, m)
        train_yvar (Tensor | None, optional): Observation noise,
            shape (n, m). Defaults to None.

    Returns:
        SingleTaskGP: Fitted model
    """

    # Instantiate a GP model
    model = SingleTaskGP(
        train_x,
        train_y,
        train_yvar,
        input_transform=Normalize(d=train_x.shape[-1]),
    )

    # Train the hyperparameters
    mll = ExactMarginalLogLikelihood(model.likelihood, model)
    fit_gpytorch_mll(mll)

    return model

And we'll set some defaults for the optimization parameters:

NUM_RESTARTS = 8
RAW_SAMPLES = 512
N_ITERATIONS = 50

Recipe 1: Vanilla BO

Mmm, ice cream 🍦

Vanilla BO typically refers to single-objective, unconstrained BO without candidate batching. In this case, we'll simply maximize the generated lift force.

We need some initial data to get started. We'll simulate 5 designs with a Latin hypercube DoE. First, let's define the input space:

# Get bounds of aircraft design space
lb, ub = AircraftDesignWithTail.get_bounds()

# Get bounds of operational variables (alpha (deg), velocity (m/s), altitude (m))
lb.extend([-5, 10, 0])
ub.extend([5, 100, 10000])

# Convert to PyTorch
bounds = torch.tensor([lb, ub])

Use SciPy to generate the Latin hypercube samples:

n_init = 5

sampler = scipy.stats.qmc.LatinHypercube(len(AircraftDesignWithTail.model_fields) + 3, seed=2)
x_init = scipy.stats.qmc.scale(sampler.random(n_init), lb, ub)
aero_init, figs_init = zip(*[simulate(sample, plot=True) for sample in x_init])

Our GP output will just be the lift force:

y_init = np.array([aero["L"] for aero in aero_init])

Now we'll run the optimization. In each loop, we generate a candidate by training a model. For the acquisition function, we'll use qLogNoisyExpectedImprovement.

# Initialize BO data
x_bo = torch.from_numpy(x_init).to(device)
y_bo = torch.from_numpy(y_init).to(device)
figs = list(figs_init)
values = []

# BO loop
for i in trange(N_ITERATIONS):

    print(f"Fitting model {i+1}")
    model = fit_single_task_gp(
        x_bo, 
        y_bo, 
    )

    print(f"Generating candidate {i+1}")
    acqf = acquisition.qLogNoisyExpectedImprovement(
        model,
        x_bo,
        prune_baseline=True,
    )

    candidates, value = optimize_acqf(
        acqf,
        bounds,
        q=1,
        num_restarts=NUM_RESTARTS,
        raw_samples=RAW_SAMPLES,
    )

    print(f"Simulating candidate {i+1}")
    aero, fig = simulate(array(candidates[0]), plot=True)
    y = aero["L"]

    figs.append(fig)

    x_bo = torch.cat([x_bo, candidates])
    y_bo = torch.cat([y_bo, torch.from_numpy(y).to(device)[:, None]])
    values.append(value)

Let's visualize the results. On the left, we show the objective (Lift) and some key inputs. The first five points (from -5 to -1) are the initial DoE, and the BO iterations start from $x=0$. On the right, we show a visualization of the best candidate found (highlighted with a dotted vertical line).

We've found that to maximize the lift force, it's crucial to have a large wing area and fly low and fast at high alpha. The optimizer has even adjusted the tailplane to provide maximum lift as well.

Recipe 2: Batched BO

Put the leftovers in the freezer ❄️

Imagine we can run these simulations in parallel on a cluster, and the cluster has the capacity to run four simulations simultaneously. If we only run one simulation at a time, we're not fully utilizing our resources.

BoTorch supports batched acquisition functions, allowing us to quantify the value of a batch of simulations rather than just evaluating a single simulation at a time. The batch size is controlled by the q parameter.

# Initialize BO data
y_init = np.array([aero["L"] for aero in aero_init])
x_bo = torch.from_numpy(x_init).to(device)
y_bo = torch.from_numpy(y_init).to(device)
figs = list(figs_init)
values = []

# BO loop
for i in trange(N_ITERATIONS):

    print(f"Fitting model {i+1}")
    model = fit_single_task_gp(
        x_bo, 
        y_bo, 
    )

    print(f"Generating candidate batch {i+1}")
    acqf = acquisition.qLogNoisyExpectedImprovement(
        model,
        x_bo,
        prune_baseline=True,
    )

    candidates, value = optimize_acqf(
        acqf,
        bounds,
        q=4,
        num_restarts=NUM_RESTARTS,
        raw_samples=RAW_SAMPLES,
    )

    y = []
    for j, candidate in enumerate(candidates):
        print(f"Simulating candidate {j+1} from batch {i+1}")
        aero, fig = simulate(array(candidate), plot=True)
        y.append(aero["L"])
        figs.append(fig)

    y = torch.from_numpy(np.vstack(y)).to(device)

    x_bo = torch.cat([x_bo, candidates])
    y_bo = torch.cat([y_bo, y])
    values.append(value)

We'll visualize the results overlaid on top of the non-batched optimization (represented by grey crosses). We reach values above 100kN in fewer steps, but each step includes 4 simulations instead of just 1. In this case, we haven't observed much benefit from batching because the objective function isn't highly complex. As we can see, each batch typically has one "exploitation" candidate and three "exploration" candidates. We find that this is common for problems that don't have multiple comparable maxima to explore. In general, batched BO tends to find the optimum in fewer steps, but it involves more simulations compared to non-batched BO.

Recipe 3: Multi-Objective (Batched) BO

Sweet or sour? Both please 🥢

What if we want to maximize lift while minimizing drag?

We just have to change the acquisition function. Previously, we were using expected improvement for a scalar objective. In multi-objective optimization, the goal is to find designs that push the Pareto front out as far as possible. To achieve this, we can maximize the expected improvement in the area contained within the Pareto front.

We can apply the same log transform trick to make the optimization more stable, and we'll use the "noisy EI" variant, which turns out to be more efficient. We also refer to the area under the Pareto front as the hypervolume, as it generalizes to any number of objectives. This makes the objective function a bit of a mouthful: qLogNoisyExpectedHypervolumeImprovement, which is sometimes abbreviated to qLogNEHVI.

Our GP now needs to predict both lift and drag. We won't bother about modeling the covariance between lift and drag; instead, we'll fit a multi-output GP to predict them independently.

We'll also switch from lift and drag to their nondimensional coefficients, which means we no longer need to worry about speed, altitude, or aircraft scale (and we can easily redimensionalize the coefficients later).

y_init = np.array([np.hstack([aero["CL"], aero["CD"]]) for aero in aero_init])

We want to maximize the first output and minimize the second one. Since BoTorch is set up to always maximize by default, we need to define an objective function that flips the sign of drag.

objective = acquisition.multi_objective.WeightedMCMultiOutputObjective(
    weights=torch.tensor([1, -1]))

We also need to define a reference point for computing the area under the Pareto front. This reference point should be the minimum value in each dimension that you're interested in. For example, in this case, if we have a point with very low drag but negative lift, we wouldn't consider it an improvement even if it lies outside the existing Pareto front. We'll just set this point to (0, 0.1). If we used (0, 0), we'd end up focusing only on impossible negative drag values.

We'll also pass it through our objective function to ensure that if we set nonzero values, we correctly account for the flipped sign of drag.

Finally, we'll also remove speed and altitude from the design space using the fixed_features argument to optimize_acqf.

ref_point = objective(torch.tensor([0, 0.1]))

# Initialize BO data
x_bo = torch.from_numpy(x_init).to(device)
y_bo = torch.from_numpy(y_init).to(device)
figs = list(figs_init)
values = []

# BO loop
for i in trange(N_ITERATIONS):

    print(f"Fitting model {i+1}")
    model = fit_single_task_gp(
        x_bo, 
        y_bo, 
    )

    print(f"Generating candidate batch {i+1}")
    acqf = acquisition.multi_objective.qLogNoisyExpectedHypervolumeImprovement(
        model,
        ref_point=ref_point,
        objective=objective,
        X_baseline=x_bo,
        prune_baseline=True,
    )

    candidates, value = optimize_acqf(
        acqf,
        bounds,
        q=4,
        num_restarts=NUM_RESTARTS,
        raw_samples=RAW_SAMPLES,
        # Recommended extra argument for batch multi-objective optimization:
        sequential=True,
        # Set velocity = 50 and altitude = 0
        fixed_features={
            bounds.shape[1] - 2: 50, 
            bounds.shape[1] - 1: 0,
        },
    )

    y = []
    for j, candidate in enumerate(candidates):
        print(f"Simulating candidate {j+1} from batch {i+1}")
        aero, fig = simulate(array(candidate), plot=True)
        y.append(np.hstack([aero["CL"], aero["CD"]]))
        figs.append(fig)

    y = torch.from_numpy(np.vstack(y)).to(device)

    figs.append(fig)

    x_bo = torch.cat([x_bo, candidates])
    y_bo = torch.cat([y_bo, y])
    values.append(value)

Below, we visualize the Pareto front and how it evolves over iterations. The reference point is marked as a black cross, and simulations that lie on the Pareto front at each step are shown with a black circle.

We'll visualize the point with the best $L/D$ on the Pareto front. We've found that achieving this typically requires high apsect ratio wings, but we would have been better off directly optimizing for this single objective rather than exploring the entire Pareto front of $C_L$ and $C_D$.

Recipe 4: Constrained (Batched) BO

Catering to allergies and specific dietary requirements 🥜

Now, having an efficient aircraft is great, but getting it to fly well isn't that simple.

Firstly, using a bit of domain knowledge, what we really want to maximize is aerodynamic efficiency, which is lift over drag. To reflect this, we'll change the GP to predict $L/D$ as the first output.

Now let's think about constraints. One important factor is trim — if the pitching moment about the CoG is nonzero, the aircraft will immediately pitch up or down. We want the pitching moment $C_m$ to be zero. We'll predict this as the second output.

Another important factor is longitudinal static stability. In simple terms, if a gust of wind pushes the aircraft's nose up, we want to know if the nose will return to its original position or if the aircraft will continue pitching and flip nose-over-tail. To ensure that the aircraft has a stable response to disturbances in pitch, the pitching moment derivative with respect to the angle of attack, $\frac{dC_m}{d\alpha}$, must be negative.

y_init = np.array([
    np.hstack([aero["CL"]/aero["CD"], aero["Cm"], aero["Cma"]])
for aero in aero_init])

Since our GP predicts three outputs, we need to define the objective for BoTorch. There are several ways to do this, but the most general approach is to define an objective function that operates on samples from the GP. Here, our function will simply take the first output, $L/D$.

objective = acquisition.GenericMCObjective(lambda samples, X: samples[..., 0])

To pass constraints into BoTorch, we need to define a list of constraint functions. These functions should return negative values when the constraint is satisfied and positive values when the constraint is violated.

constraints = [
    # Cm constraint - allow some tolerance so there's a feasible region
    lambda Z: Z[..., 1]**2 - 0.1**2,
    # dCm/da constraint
    lambda Z: Z[..., 2],
]

# Initialize BO data
x_bo = torch.from_numpy(x_init).to(device)
y_bo = torch.from_numpy(y_init).to(device)
figs = list(figs_init)
values = []

# BO loop
for i in trange(N_ITERATIONS):

    print(f"Fitting model {i+1}")
    model = fit_single_task_gp(
        x_bo, 
        y_bo, 
    )

    feasible_points = torch.all(torch.vstack([c(y_bo) < 0 for c in constraints]), dim=0)
    # If none of them are feasible, just take the best seen so far
    if not feasible_points.any():
        feasible_points[:] = True

    print(f"Generating candidate batch {i+1}")
    acqf = acquisition.qLogNoisyExpectedImprovement(
        model,
        x_bo,
        prune_baseline=True,
        objective=objective,
        constraints=constraints,
    )

    candidates, value = optimize_acqf(
        acqf,
        bounds,
        q=4,
        num_restarts=NUM_RESTARTS,
        raw_samples=RAW_SAMPLES,
        # Set velocity = 50 and altitude = 0
        fixed_features={bounds.shape[1] - 2: 50, bounds.shape[1] - 1: 0},
    )

    y = []
    for j, candidate in enumerate(candidates):
        print(f"Simulating candidate {j+1} from batch {i+1}")
        aero, fig = simulate(array(candidate), plot=True)
        y.append(np.hstack([aero["CL"]/aero["CD"], aero["Cm"], aero["Cma"]]))
        figs.append(fig)

    y = torch.from_numpy(np.vstack(y)).to(device)

    figs.append(fig)

    x_bo = torch.cat([x_bo, candidates])
    y_bo = torch.cat([y_bo, y])
    values.append(value)

To visualize the results, we'll plot the feasible objective values (in the first subplot) as closed circles and infeasible values as open circles. We find the feasible region fairly quickly, and then make progress on optimizing $L/D$. The line shows the cumulative maximum feasible point, i.e., the best candidate found so far. The second and third subplots visualize the constraints, with the feasible region highlighted in green.

The best performing feasible aircraft looks quite sensible too: we have a tailplane with a slightly negative angle of incidence.

Recipe 5: Pure Active Learning

I don't care what's for dinner, as long as you know what you're doing 🧑‍🍳

In all of the above examples, we had clear objectives to guide our search. However, what if all you care about is training a model, and you don't know or care what particular part of the design space you want the model to focus on?

This is simply a matter of changing the acquisition function. There are two main options here — one is straightforward, and the other might require a bit more thought:

  • PosteriorStandardDeviation, or PSTD, just finds the candidate with the highest uncertainty.
  • qNegativeIntegratedPosteriorVariance, or qNIPV, finds candidates that minimize the posterior variance integrated over the whole design space. In practice, this means sampling a set of quasi-random points in the design space and then maximizing the negative mean variance across those points.

To make things a little more interesting, we'll use a multi-output GP. The next question is, which of those outputs do we take the uncertainty of?

y_init = np.array([
    np.hstack([aero["CL"]/aero["CD"], aero["Cm"], aero["Cma"]])
for aero in aero_init])

We can use a PosteriorTransform, which functions similarly to an objective, but it operates on the posterior distribution rather than on samples from the distribution. In this case, we use a weighted sum of the outputs to obtain a scalar standard deviation value. Note that while an objective function must scalarize the outputs, a posterior transform does not necessarily need to do the same.

posterior_transform = ScalarizedPosteriorTransform(
    weights=torch.ones(y_init.shape[1], dtype=torch.double))

Posterior Standard Deviation

# Initialize BO data
x_bo = torch.from_numpy(x_init).to(device)
y_bo = torch.from_numpy(y_init).to(device)
figs = list(figs_init)
values = []

# BO loop
for i in trange(N_ITERATIONS):

    print(f"Fitting model {i+1}")
    model = fit_single_task_gp(
        x_bo, 
        y_bo, 
    )

    print(f"Generating candidate batch {i+1}")
    acqf = acquisition.PosteriorStandardDeviation(
        model,
        posterior_transform=posterior_transform,
    )

    candidates, value = optimize_acqf(
        acqf,
        bounds,
        q=1,
        num_restarts=NUM_RESTARTS,
        raw_samples=RAW_SAMPLES,
        # Set velocity = 50 and altitude = 0
        fixed_features={bounds.shape[1] - 2: 50, bounds.shape[1] - 1: 0},
    )

    y = []
    for j, candidate in enumerate(candidates):
        print(f"Simulating candidate {j+1} from batch {i+1}")
        aero, fig = simulate(array(candidate), plot=True)
        y.append(np.hstack([aero["CL"]/aero["CD"], aero["Cm"], aero["Cma"]]))
        figs.append(fig)

    y = torch.from_numpy(np.vstack(y)).to(device)

    x_bo = torch.cat([x_bo, candidates])
    y_bo = torch.cat([y_bo, y])
    values.append(value)

Below, we plot both the acquisition value and the $L/D$ value found. The acquisition value is the maximum uncertainty of the model at each iteration, which gradually reduces. Since we're not optimizing towards a specific objective, we focus equally on both high and low $L/D$ designs.

Negative Integrated Posterior Variance

Compared to PSTD, the only additional step is to sample points from the input space where we'll focus on reducing uncertainty. To achieve a more accurate integrated uncertainty, we should use a space-filling low discrepancy sequence like Sobol.

If we had a specific region in the design space where we wanted lower uncertainty, we could bias the integration by increasing the sampling density in those areas.

Note that this process is slow due to the extra evaluations required, so we'll run it for a relatively low number of iterations.

# Initialize BO data
x_bo = torch.from_numpy(x_init).to(device)
y_bo = torch.from_numpy(y_init).to(device)
figs = list(figs_init)
values = []

sampler = scipy.stats.qmc.Sobol(len(AircraftDesignWithTail.model_fields) + 3, seed=2)
mc_points = torch.from_numpy(scipy.stats.qmc.scale(sampler.random(128), lb, ub)).to(device)
# Ensure all Monte Carlo points have v=50 and altitude=0
mc_points[:, -2] = 50
mc_points[:, -1] = 0

# BO loop
for i in trange(min(N_ITERATIONS, 20)):

    print(f"Fitting model {i+1}")
    model = fit_single_task_gp(
        x_bo, 
        y_bo, 
    )

    print(f"Generating candidate batch {i+1}")
    acqf = acquisition.qNegIntegratedPosteriorVariance(
        model,
        mc_points=mc_points,
        posterior_transform=posterior_transform,
    )

    candidates, value = optimize_acqf(
        acqf,
        bounds,
        q=1,
        num_restarts=NUM_RESTARTS,
        raw_samples=RAW_SAMPLES,
        # Set velocity = 50 and altitude = 0
        fixed_features={bounds.shape[1] - 2: 50, bounds.shape[1] - 1: 0},
    )

    y = []
    for j, candidate in enumerate(candidates):
        print(f"Simulating candidate {j+1} from batch {i+1}")
        aero, fig = simulate(array(candidate), plot=True)
        y.append(np.hstack([aero["CL"]/aero["CD"], aero["Cm"], aero["Cma"]]))
        figs.append(fig)

    y = torch.from_numpy(np.vstack(y)).to(device)

    x_bo = torch.cat([x_bo, candidates])
    y_bo = torch.cat([y_bo, y])
    values.append(value)

We again plot both the acquisition value and the $L/D$. Now, the acquisition value is increasing, because it's the negative uncertainty, although there are jumps where the model's uncertainty actually increases (as the overfitting reduces).

Bonus Recipe: BO vs GA

Ready, steady, cook! 🍅

All of the recipes above can be described as both "Bayesian optimization" and "active learning" — our training set is built iteratively using model predictions with UQ. However, there are other approaches we can take to optimization.

Another common family of black-box optimization technique is genetic algorithms (GA). Below, we will compare BO against GA to obtain some rough rules of thumb to determine when each is likely to be better.

More concretely, using the problem setting from Recipe 4: Constrained (Batched) BO (maximize $L/D$ with constraints on $C_m$ and $\frac{dC_m}{d\alpha}$), we'll compare the following methods:

  • Bayesian optimization (qLogNoisyExpectedImprovement)
  • Genetic algorithm (NSGA-II, with some generic defaut parameters)

To visualize the results, we will show the evolution of $L/D$ against candidate index (showing the number of simulations that were run) and time into optimization (which also takes into account the amount of time taken to choose the next candidate). We will also illustrate what would happen for different orders of magnitude of simulation time by inserting artificial delays into the results.

Summary

  • BO is by far the most sample-efficient optimization method, regardless of simulation time.
  • For objective functions that take no more than a few seconds to evaluate, GA optimizes the quickest.
  • If the objective function takes around 10 seconds, BO is initially faster than GA, but eventually, GA catches up.
  • If the objective function takes more around one minute, BO remains quicker.

Got questions about the recipes? Join the conversation on the BoTorch discussions page.

About PhysicsX

PhysicsX is a deeptech company with roots in numerical physics and Formula One, dedicated to accelerating hardware innovation at the speed of software. We are building an AI-driven simulation software stack for engineering and manufacturing across advanced industries. By enabling high-fidelity, multi-physics simulation through AI inference across the entire engineering lifecycle, PhysicsX unlocks new levels of optimization and automation in design, manufacturing, and operations — empowering engineers to push the boundaries of possibility.

If you’re ready to move beyond trial and error and embed effective optimization into your engineering process, contact us at info@physicsx.ai. Let’s redefine how engineering problems are solved.