2D Gaussian process regression in scikit-learn
Programming something new is always easier if you have a working example of something similar. Recently, I went searching for an example of multi-dimensional Gaussian process regression in scikit-learn, but all I could find in their docs and elsewhere online were one-dimensional problems. This post plugs that gap.
After a brief primer on the theory involved, I will walk through a Python script that fits a Gaussian process to a two-dimensional function. Full code is available from my sourcehut.
Theory
Gaussian process regression works by applying Bayes' theorem to a distribution of functions. First, we establish priors on the functions that might fit our data: a mean, a variance, and a degree of smoothness over a given length scale. Second, given some observed data points with a certain noise level, we apply Bayes' theorem to obtain a new posterior distribution of functions. The posterior mean passes through the data, and the posterior variance is reduced near the data points. Finally, we use the posterior mean to predict unseen samples, and interpret the posterior variance as a measure of uncertainty in those predictions.
The posterior distribution can be evaluated analytically; however, for \(N\) data points this requires the expensive inversion of an \(N\)-by-\(N\) matrix. In practice, the length scale and noise level are unknown, and selecting their values entails an optimisation process with many posterior evaluations. Therefore, Gaussian process regression does not scale well to large data sets.
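For reference (following Rasmussen and Williams, and assuming a zero prior mean), the posterior mean and covariance at test points \(X_*\), given training inputs \(X\), noisy observations \(\mathbf{z}\), kernel \(k\) and noise variance \(\sigma_n^2\), are $$ \bar{\mathbf{z}}_* = K(X_*,X)\left[K(X,X) + \sigma_n^2 I\right]^{-1}\mathbf{z}\ , \qquad \mathrm{cov}(\mathbf{z}_*) = K(X_*,X_*) - K(X_*,X)\left[K(X,X) + \sigma_n^2 I\right]^{-1}K(X,X_*)\ , $$ where \(K(\cdot,\cdot)\) denotes the matrix of kernel evaluations between two sets of points; the \(N\)-by-\(N\) matrix to be inverted is \(K(X,X) + \sigma_n^2 I\).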
Gaussian processes are non-parametric models, so our priors are the only assumptions we have to make about the form of the function. By contrast, linear regression is less flexible: we must stipulate a linear trend, which will not fit a quadratic function well.
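As a minimal illustration of that last point (not part of the original post), fitting scikit-learn's LinearRegression to samples of a pure quadratic explains almost none of the variance:
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.linspace(-3., 3., 50).reshape(-1, 1) # Inputs on a symmetric interval
y = x**2 # A quadratic with no linear trend
print(LinearRegression().fit(x, y).score(x, y)) # R^2 is ~0: the best straight line is just the mean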
A two-dimensional example
Suppose we have a few samples of a function of two variables \(z(x,y)\). We wish to interpolate \(z\) using Gaussian process regression over the available data. For this example, we consider the quadratic surface, $$ z(x,y) = (x-3)^2 + 2xy + (2y+3)^2 - 3\ . $$
Preliminaries
Before we begin, we will need some imports and a deterministic random seed:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import train_test_split
np.random.seed(17) # So you can reproduce these results at home
Let’s start by evaluating \(z\) over a regular grid to represent our ground truth:
nv = 20 # Number of points in each direction
xv = np.linspace(-5.,5.,nv) # x-vector
yv = np.linspace(-5.,5.,nv) # y-vector
x, y = np.meshgrid(xv, yv) # x and y are (nv, nv) matrices
z = (x-3.)**2. + 2.*x*y + (2.*y+3.)**2. - 3.
Now, real data may be corrupted by a degree of numerical or measurement error, which we model here as Gaussian noise added to the ground truth:
noise_level = 5.
z_noisy = z + np.random.normal(size=x.shape) * noise_level
scikit-learn expects each input and output as column vectors, so we must reshape the grid:
X = np.column_stack((x.reshape(-1), y.reshape(-1)))
Z = z_noisy.reshape(-1,1)
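As a quick sanity check (this step is not in the original script), each array should now have one row per grid point:
print(X.shape) # (400, 2): nv*nv samples, two input features (x, y)
print(Z.shape) # (400, 1): one noisy output per sample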
Performing the fit
Before going any further, we need to specify our prior functional distribution. The smoothness criterion is expressed in terms of a kernel, an object that encodes the desired correlation (or covariance) between pairs of points. We shall use a radial basis function, RBF, to ensure fits are infinitely differentiable, with unknown length scales for each of \(x\) and \(y\). Added to the RBF, we use a WhiteKernel with unknown noise level to represent uncertainty in the data and absorb uncorrelated variations at short length scales. For the ensuing fitting process, we must specify bounds and initial guesses for the length scales and the noise level. Passing a tuple of length scale guesses, rather than a scalar, to the RBF constructor is all that is needed to extend the one-dimensional examples in the scikit-learn docs to two dimensions. In code, this all looks like:
guess_l = (2., 1.) # In general, x and y have different scales
bounds_l = ((1e-1,100.),) * 2 # Same bounds for x and y
guess_n = 1. # Amount of noise
bounds_n = (1e-20, 10.) # Bounds for noise
kernel = (  # Kernel objects can simply be summed using +
    RBF(length_scale=guess_l, length_scale_bounds=bounds_l)
    + WhiteKernel(noise_level=guess_n, noise_level_bounds=bounds_n)
)
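As an aside (not part of the original script), kernel objects are callable, so we can inspect the prior covariance they encode between a pair of points before any fitting takes place:
# Prior covariance matrix between the points (0, 0) and (1, 1),
# evaluated with the initial hyperparameter guesses
print(kernel(np.array([[0., 0.], [1., 1.]])))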
Before fitting, we extract 5% of our grid of points to use as training samples, the remainder becoming test data:
X_train, X_test, Y_train, Y_test = train_test_split(X, Z, test_size=0.95)
At last, we can do the fitting. Passing normalize_y=True scales the function values to zero mean and unit variance before fitting. This is equivalent to choosing priors: the mean of \(z\) is equal to the training sample mean, and the variance of \(z\) is equal to the training sample variance.
gpr = GaussianProcessRegressor(kernel, normalize_y=True)
gpr.fit(X_train, Y_train)
We can inspect the fitted kernel parameters:
print(gpr.kernel_)
# RBF(length_scale=[7.86, 3.71]) + WhiteKernel(noise_level=0.00678)
The \(x\) length scale is about twice the \(y\) length scale, as might be expected from the form of \(z\). The noise level is smaller than we prescribed; more training data would give a better estimate of the noise level.
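To put a number on the fit quality (a check not in the original post), we can also evaluate the coefficient of determination of the posterior mean on the held-out test points:
# R^2 of the posterior mean prediction against the noisy test data
print(gpr.score(X_test, Y_test))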
Plotting the results
With our fitted regressor object, we can evaluate the posterior mean and standard deviation over the input data, and convert back to a grid for plotting:
Zfit, Zstd = gpr.predict(X, return_std=True)
zstd = Zstd.reshape(x.shape)
zfit = Zfit.reshape(x.shape)
First, compare the true function with the fitted posterior mean:
# Set up figure
fig, ax = plt.subplots(figsize=(6., 4.))
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.set_xlim((-5,5))
ax.set_ylim((-5,5))
ax.set_aspect('equal')
ax.set_xticks((-5,0,5))
ax.set_yticks((-5,0,5))
ax.grid(False)
# Do the plotting
lev = np.linspace(0.,250.,6)
ax.contour(x, y, z, lev, colors='k') # Truth
ax.plot(*X_train.T, 'o', color="C1") # Training samples
ax.contour(x, y, zfit, lev, colors='C0', linestyles='dashed') # Posterior mean
# Legend
truth_line = mlines.Line2D([], [], color='black', label='True $z(x,y)$')
sample_line = mlines.Line2D([], [], color='C1', marker="o", linestyle="none", label='Train samples')
mean_line = mlines.Line2D([], [], color='C0', linestyle="--", label='Posterior mean')
ax.legend(handles=[truth_line, sample_line, mean_line], bbox_to_anchor=(1.05, 1), loc="upper left")
# Write out
plt.tight_layout()
plt.savefig('gpr_posterior_mean.svg')
The figure shows good agreement between the true contours and those of the posterior mean.
Second, plot the variation in posterior standard deviation.
# Set up figure
fig, ax = plt.subplots(figsize=(6., 4.))
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.set_xlim((-5,5))
ax.set_ylim((-5,5))
ax.set_aspect('equal')
ax.set_xticks((-5,0,5))
ax.set_yticks((-5,0,5))
ax.grid(False)
# Do the plotting
ax.plot(*X_train.T, 'o', color="C1") # Training samples
lev = np.linspace(6.,16.,11)
hc = ax.contourf(x, y, zstd, lev) # Posterior std
for hci in hc.collections:
    hci.set_edgecolor("face") # Avoid thin white lines between filled levels in vector output
# Colorbar
hcb = plt.colorbar(hc)
hcb.ax.grid(False)
hcb.set_label('Posterior standard deviation')
# Write out
plt.tight_layout()
plt.savefig('gpr_posterior_std.svg')
The figure shows that the uncertainty increases with distance from the training samples.
Bonus one-liner
Here is a shell command that extracts all code blocks from a markdown file and writes them out to a new Python script:
sed -n '/```/,/```/p' blog-post.md | grep -v '```' > script.py
I used this command to generate the script associated with this post on sourcehut.
References
- Gaussian Processes for Machine Learning, Rasmussen and Williams (2006). Comprehensive exposition of the theory behind GPs. The mathematical detail is not that important for us, but the Preface and Introduction give a good background.
- UQLab Kriging Manual. Chapter 1, Theory, is a more digestible summary of the mathematics and gives a clear description of the choices we make to use GPs in practice.
- scikit-learn GP docs