Choosing a black-box optimisation algorithm

The best optimisation algorithm is problem-dependent. In a formal mathematical sense, this is an implication of the “no free lunch” theorem. In the originators’ own words:

It tells us that if any search algorithm performs particularly well on one set of objective functions, it must perform correspondingly poorly on all other objective functions. Wolpert (2020)

It has been my experience that for the specific problem of designing turbine geometry, the set of algorithms that perform badly is much larger than the set of algorithms that perform well!

This post is to collect my anecdotal experience of algorithms for turbomachinery optimisation, for future reference. I also hope that it might be useful for others who are struggling to bridge the gap between general optimisation methods and their applications.

Published papers will tell you which algorithm they settled on, but unless they release code you won’t have all the tiny details required to reproduce their results. Furthermore, there are no publications on algorithms that do not work for particular problems, or the ways of crippling a perfectly good algorithm by injudicious choice of parameters.

The folks at NASA have done a lot of the hard work with the OpenMDAO framework, which allows one to construct computational models and drive them with any of the Scipy optimisers. The documentation is very good, but it is still a big leap to go from finding the minima of a two-dimensional paraboloid to making a nice turbine design.

Problem definition

To make the problem a bit more definite, we have a turbine stage geometry parameterised by an \(\renewcommand{\vec}[1]{\mathbf{#1}}N\)-dimensional vector \(\vec{x}\). We can run a computational fluid dynamics simulation \(f\) to predict the flow field for a given design \(f(\vec{x}\)). Then we post-process to evaluate efficiency \(\eta(f(\vec{x}))\), which we want to maximise, and also other variables such as loading coefficient, \(\psi(f(\vec{x}))\), which we would like to constrain during the optimisation. Unfortunately, evaluating \(f\) takes about seven minutes, so we want to find the optimum using the fewest computations possible.

This type of problem is a “black box” optimisation: we can observe the values of \(f\) at any \(\vec{x}\), but (at least initially) we know nothing about the underlying structure of \(f\). We cannot directly evaluate the derivative with respect to the design vector \(\nabla{f}\). If we want to know \(\nabla{f}\), we must approximate it using samples of \(f\).

A finite-difference approximation of \(\nabla{f}\) requires \(N+1\) evaluations of \(f\), which becomes prohibitive for large \(N\). For our turbine, we have \(N\sim 20\), so we discount algorithms that require derivatives at candidate points for now. Unfortunately, that is most of the SciPy suite.

This post will only cover finding local minima. A global search is too expensive. Moreover, I am now convinced the turbine geometry design space is uni-modal, anyway.

Avoid Nelder–Mead

The Nelder–Mead (1965) method is an adequate gradient-free starting point. Its chief advantages are robustness and simplicity. Roughly, we start with an initial simplex \(\vec{x}_m\) of \(m = 1, \dots, N + 1\) points, and at each iteration move the worst point along a line search through the centroid of the other \(N\) points. We thus have a plausible search direction without knowing the gradient \(\nabla f(\vec{x}_m)\). The simplex can also grow when the search is progressing rapidly, and shrink if no progress is being made.

Nelder–Mead works on function comparisons. We throw away the information that the function values \(f(\vec{x}_m)\) contain, reducing it to a binary better/worse test. We cannot afford to do this if computational expense of of any concern!

Another problem with Nelder–Mead is that it does not account for constraints on the evaluated function, such as our loading coefficient \(\psi(f(\vec{x}))\). A widespread, ugly workaround is to add penalties to the objective function if constraints are violated. This is not as straightforward as it sounds: we have to code the implementation ourselves inside the objective function, and also choose the weighting factors between constraint violation and efficiency. I cannot recommend using the penalties method. There is not even a good way to bound \(\vec{x}\). Clipping the vertices of the simplex can result in a collapsed volume, and this is anyway not implemented in the OpenMDAO integration.

COBYLA is better

Powell (1994) described the Constrained Optimisation By Linear Approximation algorithm. Like Nelder–Mead, we start with an initial simplex. Unlike Nelder–Mead, the search direction is given by modelling the objective as a linear function over the simplex, that is using the values of \(f\) at each point to estimate \(\nabla f\). The algorithm also considers the amount and direction of constraint violations when selecting the next point using a composite merit function; and will take steps to preserve a good simplex shape.

For our turbine application, I have found COBYLA to work very well. It addresses the main deficiencies of the Nelder–Mead method yet still does not require function gradient inputs.

Notes on scaling and step sizes

My experience has been that making good progress in an optimisation is heavily dependent on scaling the variables and step sizes. One cannot just leave everything at the defaults and expect things to work. In a multi-dimensional problem, just one errant variable can bring the entire search to a halt.

As a rule of thumb, it seems that numerical implementations of the search algorithms discussed above work best when the design variables and objective are scaled to be of order one. I think this must be due to hard-coded constants and tolerances in the optimisation routines that are not exposed in the SciPy API. Or, put another way, it is easier for users to scale their problem to order one than to re-tune all constants in the algorithm.

Then there is the question of step sizes. In a global search, we want to explore the design space thoroughly and take large steps initially, of order the entire feasible range of designs. Nelder–Mead worked with large steps like these, automatically shrinking the initial simplex to narrow down the design space.

To approximate gradients with a finite difference, we need smaller steps. The step should be as small as possible while avoiding floating-point round off error to get the most accurate gradient.

COBYLA works with a step somewhere between these two extremes. According to this library documentation, a step of one tenth the maximum change is suitable, although I have found that slightly larger steps make for quicker convergence.

State of the art

I will finish this post by noting more advanced methods to tackle my optimisation problem.

If we could evaluate \(\nabla f\) directly, then many more choices of algorithms would open up. More information about the function should lead to faster convergence. Sequential Least-SQuares Quadratic Programming (SLSQP) is a bit like COBYLA but uses a quadratic approximation to the objective function using values and derivatives at each point. By analogy, COBYLA might be termed “sequential linear programming”.

Directly evaluating the derivative of a numerical flow field computation actually requires the solution of a separate differential equation, an “adjoint equation” for the quantity of interest. Adjoint codes for turbomachinery exist, but they are a pain to implement and require very tight convergence to work well.

COBYLA approximates the objective function as linear to select the next point to evaluate. Neural network, Gaussian process, or other surrogate models are just more complex devices to help select the next point. Retraining and evaluating the surrogate is cheap compared to \(f\), although the choice is less clear when implementation and debugging time is taken into account. I can see the point if we are looking for a global optimum in a non-smooth design space — for example running a genetic algorithm using the cheap surrogate at each iteration. However, if there is only one hill in our design space we don’t need anything more than the gradient to find it.

Conclusions

Here are the opinions I would have found useful before I embarked on this little project. In black-box optimisation problems like computational turbine design:

Nelder–Mead is slow and wasteful and should be avoided if possible;
Out of the SciPy optimisers, COBLYA is the best option if the problem has bound or non-linear constraints, and derivatives are not available;
Careful variable scaling and step size selection is not optional;
We can do better if we have access to gradients.

2022-09-03