Our research team, led by Michalis Michaelides, has penned a great perspective on adding quantified uncertainties to Artificial Neural Network (ANN) models.
In the world of decision-making based on model estimates, understanding and quantifying uncertainty is key. It not only helps us optimize more effectively but also lays the groundwork for principled decision-making.
In our latest piece, we break down some easily accessible routes to incorporating uncertainty quantification into ANNs. Dive into the details below, and explore the broader landscape if you're keen to go deeper into this intriguing topic.
Abstract
This is a perspective on endowing models, especially ANNs, with quantified uncertainties. It is necessary to augment model estimates with credible intervals, confidence bounds, or other forms of uncertainty, in order to inform downstream tasks like optimisation and decision-making. Methods to produce such output abound, some established and straightforward, and some more exotic. We here outline some immediately obvious paths to uncertainty quantification in neural networks, paths well-beaten.
Problem Setting
Artificial neural networks (ANNs) typically approximate some unknown function by minimising a loss over a sampled dataset. For a latent function of interest, $f$, we assume that the relationship $y_i = f(x_i) + \epsilon_i$ holds in observed data, where usually $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N,\ x \in \mathcal{X} \subseteq \mathbb{R}^D,\ y\in\mathcal{Y} \subseteq \mathbb{R}^M$, we wish to recover the latent function $f:\mathcal{X}\to\mathcal{Y}$, which is the classical problem of regressing $y$ on $x$. In ANNs, we employ the ansatz that $f(x) \approx \tilde{f}(x; \theta)$, where $\theta$ are parameters of the network that determine the transformations done on input $x$. The architecture of the network is usually chosen to induce reasonable priors over functions, e.g., convolutions to induce translation-invariant features for images, but the discussion on uncertainty quantification (UQ) is independent of architecture.
Following the above, the possible function approximations are determined by the value of $\theta \in \Theta \subseteq \mathbb{R}^P$, so our task is to identify the best $\theta$, $\theta^\star$. To formalise 'best', we seek a global optimum $\theta^\star$ that minimises the loss on the dataset, typically chosen to be the MSE, or a variant that penalises deviations from the observed values:
$$\begin{align}\label{eq:ANN-loss}\mathcal{L}(\theta; \mathcal{D}) = \frac{1}{N} \sum_{i=1}^N \lVert y_i - \tilde{f}(x_i; \theta) \rVert^2,\end{align}$$
with $\theta^\star = \arg\min_{\theta} \mathcal{L}(\theta; \mathcal{D})$. Note how the assumed aleatoric noise variance, $\sigma^2$, does not appear in the equation. This is indicative of the lack of uncertainty concerns in the broader ANN literature. Worse yet, a point-estimate $\theta^\star$ is woefully lacking any acknowledgement of alternative latent functions that are consistent with observations.
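For concreteness, a minimal sketch of this point-estimate training loop in PyTorch might look as follows; the data, architecture, and hyperparameters are placeholders rather than recommendations.

```python
import torch
import torch.nn as nn

# Placeholder dataset standing in for D = {(x_i, y_i)}, with D = 4 inputs and M = 1 output.
x = torch.randn(512, 4)
y = torch.randn(512, 1)

f_tilde = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(f_tilde.parameters(), lr=1e-3)

for _ in range(1_000):
    opt.zero_grad()
    loss = ((y - f_tilde(x)) ** 2).mean()   # the MSE loss L(theta; D) above
    loss.backward()                          # note: sigma^2 never enters the objective
    opt.step()
# What remains is a single point estimate of theta, with no uncertainty attached.
```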
Most of the ideas below reflect classic texts (MacKay 2003; Bishop 2006; Shalizi 2013; Murphy 2022), with added practical references where applicable. Our goal here is to motivate UQ in ANNs, and expose readers to some immediate methods towards that end; one can turn to comprehensive surveys of the field for deeper discussion, e.g., by Abdar et al. (2021).
The importance of being earnest
Optimising a typical ANN captures only first-order moments of the output — the MSE penalises deviations from observations symmetrically, and makes no attempt to capture higher-order moments of the predicted variable (like variance, skewness, etc.). While asymmetric losses can reflect other assumptions about the noise distribution, a typical ANN will still target a single aleatoric statistic, like the mean, median, or a quantile of the output distribution.
Taking this result as a point-estimate is problematic, especially in situations where downstream decisions have asymmetric utility. Further, such a point-estimate obscures any sources of epistemic uncertainty about the relationship it parametrises. Unfortunately, these uncertainties can be crucial in real-world applications of ANNs, since:
- the global optimum of an ANN parametrisation is hard to find;
- the model may be misspecified (i.e., there is no single parametrisation that approximates the true latent function well);
- the model output will be trustworthy over a limited domain, depending on the dataset and how well its structure allows it to generalise;
- the aleatoric uncertainty, the size of the dataset, and the prior beliefs represented in the ANN architectural choices should all be balanced and brought to bear upon the confidence of our predictions (epistemic + aleatoric uncertainty).
As such, when estimates from the model are used in downstream tasks, the associated uncertainty should guide subsequent decisions based on what values the output is likely to take, or at what input one might expect a particular output value to occur.
In summary, a point-estimate for the output given a new input does not fully express what we can infer from the dataset and our assumptions. A fuller expression would be a probability distribution over possible output values, one which reflects our uncertainty about the underlying relationship in observations and other components involved in the data generating process.
Uncertainty quantification
We review some options below for endowing our model output with associated uncertainties, whether in the form of standard deviations around point-estimates, full probability distributions, or samples from the distribution over output values.
Inference: priors, posteriors, and all that jazz
Ideally, we should like to construct priors over the space of functions that reflect our beliefs about what the function might be. A prior would then be updated with observations to produce a posterior that reflects our new beliefs about that function, given observations. Formally, a prior on functions is combined with a likelihood to produce a posterior. This is the celebrated Bayes’ theorem, and is meant to balance the various sources of uncertainty (aleatoric, epistemic) in order to produce consistent inferences.
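Concretely, over the ANN parameters $\theta$ (the usual stand-in for a prior over functions), Bayes' theorem and the resulting posterior predictive distribution read:

$$\begin{align}
p(\theta \mid \mathcal{D}) &= \frac{p(\mathcal{D} \mid \theta)\ p(\theta)}{p(\mathcal{D})}, \\
p(y^\ast \mid x^\ast, \mathcal{D}) &= \int_\Theta p(y^\ast \mid x^\ast, \theta)\ p(\theta \mid \mathcal{D})\ d\theta,
\end{align}$$

where the prior $p(\theta)$ and the likelihood $p(\mathcal{D} \mid \theta)$ are modelling choices, and the predictive integral is the object that the approximations discussed below attempt to estimate.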
While theoretically appealing, and suitable for our purposes in that it quantifies uncertainty, there are practical considerations that render such an approach difficult to execute. Firstly, reasonable and computationally manageable priors and likelihoods are hard to define in this framework. Even when armed with such objects, performing the inference can become intractable or computationally infeasible, especially for large datasets; this can necessitate approximations that introduce further sources of error and uncertainty. Finally, conveying the uncertainty associated with the output of such a model can be taxing; this latter pain point is shared by all UQ methods.
Despite such difficulties, we persevere. Below we lay out a few approaches for UQ in the ANN domain that have seen general adoption. We focus on the immediately implementable and principled ideas, rather than the more exotic beasts one might find — and they are plentiful.
Implementation: launching grounds for the practitioner
Many of the following approaches are incarnated in a plethora of open-source tooling. The interested practitioner can turn to libraries like (TyXe; BayesDLL; Fortuna), which wrap standard PyTorch models and endow them with uncertainty output. Tools like uncertainty-toolbox then implement calibration and visualisation methods.
As with all tools, there is quite a bit of variance in how these will perform — some stemming from methodological differences and some from practical implementation choices. We will henceforth focus on mapping the conceptual landscape and spotlighting a few key examples, rather than on practical particularities.
Bayesian ANNs
This is the most straightforward adaptation of priors to ANNs. The idea is simply to declare a prior on the ANN parameters, and an appropriate likelihood on the observations (typically chosen to be an iid Gaussian for real-valued output), to best reflect missing relevant information. Then one can perform Bayesian inference using various approximations, e.g., variational inference, MC-dropout, stochastic-gradient MCMC (Welling and Teh 2011), or Laplace's approximation. See (Murphy 2022, sec. 17 Bayesian Neural Networks; Wilson and Izmailov 2022; Kim and Hospedales 2023).
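As one concrete illustration of the stochastic-gradient MCMC route, here is a toy, full-batch sketch of Langevin-dynamics sampling over the parameters of a small PyTorch regressor; the prior, noise variance, step size, and schedule are illustrative assumptions, not recommendations.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 200).unsqueeze(-1)            # toy 1-D inputs
y = torch.sin(x) + 0.1 * torch.randn_like(x)            # noisy observations of sin(x)

model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
noise_var, prior_var, lr = 0.1 ** 2, 1.0, 1e-4          # assumed sigma^2, N(0, 1) prior, step size
samples = []                                             # retained parameter samples

for step in range(5_000):
    # Negative log-posterior (up to a constant): Gaussian likelihood + Gaussian prior.
    nll = ((y - model(x)) ** 2).sum() / (2 * noise_var)
    nlp = sum((p ** 2).sum() for p in model.parameters()) / (2 * prior_var)
    model.zero_grad()
    (nll + nlp).backward()
    with torch.no_grad():
        for p in model.parameters():
            # Langevin update: half a gradient step plus Gaussian noise of matched scale.
            p -= 0.5 * lr * p.grad
            p += math.sqrt(lr) * torch.randn_like(p)
    if step > 4_000 and step % 100 == 0:                 # keep late, thinned samples
        samples.append([p.detach().clone() for p in model.parameters()])
```

Each retained sample parametrises one plausible network; evaluating all of them on a new input yields samples from the (approximate) posterior predictive distribution.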
A problem is that commonly used priors on the network parameters (e.g., normal priors around zero), do not necessarily correspond to reasonable priors over the space of functions as represented by the ANN (Murphy 2022, sec. 17.2 Priors for BNNs). While normal zero-mean priors on the coefficients of a linear or polynomial model will (with some care) correspond to desirable priors that prefer functions that vary more slowly (smoother in the Lipschitz sense), the same is not necessarily the case for the non-linear representation of ANNs. Therefore, naïve priors imposed on ANNs might not reflect our actual priors and usual Occam’s razor considerations.
Regardless, given the computational resources, such an approach might require the least implementation effort; a few libraries exist that can wrap a deterministic ANN model in a framework like PyTorch, and produce a Bayesian version of it (Kim and Hospedales 2023). The dropout method that follows is an instance of this class of models, while also an ensemble method.
Ensemble methods: bootstrapping and variants
An intuitive way to quantify epistemic uncertainty is to employ an ensemble of models, instead of relying on a single one. This is commonly done by bagging: a procedure where multiple instances of a base model are each trained on datasets constructed by randomly sampling the full dataset (aka, bootstrap); a new input is then given to every trained instance, and all the outputs are aggregated (typically by averaging in regression) to produce an overall estimate. However, each output can also be considered a sample from the output distribution which expresses model uncertainty.
This approach can be trivially implemented without changing much in the model, rendering it mostly an engineering exercise. Bootstrapped samples from the dataset are constructed, multiple model instances are trained, and in prediction the input is distributed and the output gathered. This brutal modularity, along with the statistical advantages of lower variance and steady bias in the bagged estimate, are the prime attractive qualities of this method.
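A hedged sketch of the bagging recipe in PyTorch follows; the base architecture, member count, and training budget are arbitrary placeholders.

```python
import torch
import torch.nn as nn

def make_model():
    # Hypothetical base regressor; any architecture can be slotted in here.
    return nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))

def train_bagged_ensemble(x, y, n_members=10, epochs=200):
    """Train each member on its own bootstrap resample of the dataset."""
    members, n = [], x.shape[0]
    for _ in range(n_members):
        idx = torch.randint(0, n, (n,))                # sample n indices with replacement
        model = make_model()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(epochs):
            opt.zero_grad()
            ((model(x[idx]) - y[idx]) ** 2).mean().backward()
            opt.step()
        members.append(model)
    return members

def ensemble_predict(members, x_new):
    with torch.no_grad():
        preds = torch.stack([m(x_new) for m in members])
    # Mean as the bagged point estimate; spread across members as an epistemic proxy.
    return preds.mean(dim=0), preds.std(dim=0)

x, y = torch.randn(256, 8), torch.randn(256, 1)          # placeholder data
members = train_bagged_ensemble(x, y, n_members=5, epochs=50)
mean, std = ensemble_predict(members, torch.randn(16, 8))
```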
Despite the embarrassingly parallel property of this estimator (model instances are independent and can be trained in parallel), training can become quite expensive and in the case of large ANNs, prohibitive. Consider how an ANN might take up the full computational resources available in a particular set-up for significant time, and how long it would take to train a hundred instances. Further, small datasets and large models will introduce deleterious correlations between the model instances, increasing the bias term — and potentially enough to offset any gains from lower variance.
Many methods have been suggested to ameliorate the computational costs involved in true ensembles. These are typically referred to as pseudo-ensembles, and two of the most popular ones are outlined below. See (Murphy 2022, sec. 17.3.9 Deep ensembles) for a deeper discussion.
Snapshot ensembles
To ameliorate the computational costs, practitioners typically resort to an adjacent strategy: inspired by the multiple model instances of bagging, but eschewing the independent training, it uses snapshots at various iterations of the training procedure as different model instances. After convergence to a reasonable loss value, model parameters are stored at regular intervals, sufficiently far apart to reduce auto-correlation as much as possible, and the stored parameters from each checkpoint are then used to generate outputs at prediction.
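A minimal sketch of this checkpointing scheme follows; the burn-in, spacing, and step counts below are invented placeholders to be tuned per problem.

```python
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(512, 8), torch.randn(512, 1)             # placeholder data
burn_in, snapshot_every, total_steps = 2_000, 200, 5_000    # assumed schedule

snapshots = []
for step in range(total_steps):
    opt.zero_grad()
    ((model(x) - y) ** 2).mean().backward()
    opt.step()
    # Once the loss has roughly converged, keep well-spaced parameter snapshots.
    if step >= burn_in and (step - burn_in) % snapshot_every == 0:
        snapshots.append(copy.deepcopy(model.state_dict()))

def snapshot_predict(model, snapshots, x_new):
    preds = []
    with torch.no_grad():
        for state in snapshots:
            model.load_state_dict(state)
            preds.append(model(x_new))
    preds = torch.stack(preds)
    return preds.mean(dim=0), preds.std(dim=0)               # pseudo-ensemble mean and spread
```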
Naturally, while this reduces computational cost, it increases the risk of high correlation between model instances and requires additional domain expertise to choose when to start taking snapshots, how far apart to space them, etc. Regardless, this has been shown to work well, and is common practice.
There is another interpretation of this procedure, which is to assume that the training process is akin to an MCMC on the posterior distribution of the parameters, and that we are drawing samples from that posterior once the chain is sufficiently mixed. However, there is no explicit prior probability factor applied here, except for the implied prior in the ANN initialisation and any choices in the learning algorithm. Regularisation of parameters in the form of a penalty term in the loss would relate to a prior on the parameters, bringing us closer to a Bayesian idea of sampling a posterior — however, such regularisation is not always present.
All considered, checkpoint pseudo-ensembles are a practical means to produce output samples that give some idea of the model uncertainty, and can be further calibrated to become more reliable. However, they are still a far cry from a principled Bayesian treatment of UQ that would involve sampling a posterior distribution, and are limited by auto-correlations induced by the size of the dataset, the learning rate schedule, and the length of the training process.
Monte Carlo dropout
Another way to sample a range of possible output values from the posterior predictive distribution is to use dropout. This is a method that aims to protect the ANN against over-fitting and promote generalisation by randomly shutting down connections in the network during training. Trivially achieved by randomly masking 'neurons' of the network during training (i.e., setting them to zero), the method has gained popularity in practice, and has garnered attention from the probabilistic inference community, which recovered an intimate relationship between MC dropout and Bayesian inference with reasonable priors, under assumptions (Gal and Ghahramani 2016; Murphy 2022, sec. 17.3.1 Monte Carlo dropout). The assumptions do not hold in the general settings where dropout is applied in practice, but they do provide some theoretical justification for its use, even in adjacent settings.
Due to its intuitive interpretation of building resilience by guarding against failures in neuronal activation, and due to its computational efficiency, dropout is a common practical means with which to endow an ANN with sampling output capability, and one to add in the arsenal of any brave souls that wish to quantify uncertainty in the modern ML landscape.
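In code, MC-dropout sampling amounts to keeping the dropout masks active at prediction time and aggregating repeated forward passes; a minimal sketch follows, with an architecture and dropout rate chosen purely for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical regressor with dropout between hidden layers.
model = nn.Sequential(
    nn.Linear(8, 128), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(128, 128), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(128, 1),
)

def mc_dropout_predict(model, x, n_samples=100):
    """Draw predictive samples by keeping dropout stochastic at test time."""
    model.train()   # leaves Dropout layers active; safe here since there is no BatchNorm
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)    # predictive mean and spread

mean, std = mc_dropout_predict(model, torch.randn(16, 8))
```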
Probabilistic output
We now enter the realm of output distributions, where the network is tasked with producing not only samples, but parametrised distributions over output values — that is, we seek a parametrised distribution that approximates the posterior predictive distribution. The methods under this banner are not quite generative, since an input must still be provided to produce an output, but they form a bridge between the point-estimate frequentist setting and the fully generative setting.
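One common instantiation, offered here only as an illustrative sketch, is a network that emits a mean and a log-variance per input and is trained with the Gaussian negative log-likelihood:

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Heteroscedastic regressor: outputs a mean and a log-variance for each input."""
    def __init__(self, d_in=8, d_hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mean = nn.Linear(d_hidden, 1)
        self.log_var = nn.Linear(d_hidden, 1)

    def forward(self, x):
        h = self.body(x)
        return self.mean(h), self.log_var(h)

def gaussian_nll(mean, log_var, y):
    # Negative log-likelihood of y under N(mean, exp(log_var)), up to a constant.
    return (0.5 * (log_var + (y - mean) ** 2 / log_var.exp())).mean()

model = GaussianHead()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(256, 8), torch.randn(256, 1)     # placeholder data
for _ in range(200):
    opt.zero_grad()
    gaussian_nll(*model(x), y).backward()
    opt.step()
```

The predicted variance here captures aleatoric spread conditional on the input; epistemic uncertainty can be layered on top with any of the ensemble or Bayesian devices above.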
Normalising flows, VAEs, and other such methods can be fully generative, seeking to capture the joint by recovering some latent structure in the dataset, e.g. by constructing a latent space on which a probabilistic relation is imposed that approximates the joint (Murphy 2022, sec. IV Generation). We here focus on the nearer task, which is to capture the conditional, and save the summit of fully generative models for future expeditions.
Uncertainty calibration
Methods like the ones above and many more can produce uncertainty estimates that reflect model confidences, but which are not necessarily tied to deviations from the ground truth. Comparing uncertainty estimates may reflect confidence orderings for those estimates, but may need to be scaled or otherwise calibrated to accord with ground truth deviations, in order to provide an accurate quantification of the error expected in a particular prediction. This is more of a problem in frequentist approaches, like ensemble methods, than in Bayesian approaches, although the latter can also suffer from miscalibration, especially if priors are badly chosen (Murphy 2022, sec. 14.2.2 Calibration; Kuleshov, Fenner, and Ermon 2018; Kim and Hospedales 2023; Chung et al. 2021).
Calibration methods typically employ some kind of ordered regression to map model uncertainties to expected errors, while respecting ordering. The simplest such method is a constant scaling, but Isotonic Regression methods take this notion further, by transforming estimates more flexibly. To train such calibration models, one must keep a fraction of the training dataset to be used specifically for this — i.e., predict on this calibration dataset, assess deviations, and fit a calibration model. This reduces the amount of data available during training, potentially sacrificing some point-estimate performance, but improves uncertainty estimates. These objectives need to be balanced and the appropriate amount of data resources devoted to each task. At first glance, it might appear foolish to sacrifice accuracy in the first moment to gain accuracy in the second moment, but upon reflection it becomes apparent that such a trade-off is reasonable for downstream decision making, active learning, or performing optimisation.
To make our goal more precise, we wish the coverage of observed values to accord with the predictive distribution that the model produces. Armed with uncalibrated uncertainties as a proxy for such calibrated distributions, we can apply a monotonic correction to promote calibration, assuming that the ranking of those uncertainties is, by and large, trustworthy. As mentioned, this is typically a monotonic transformation fitted on a calibration set, chosen to be a small subset of the training set and set aside for this purpose (i.e., not used during training).
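As a sketch of such a monotonic correction, one can regress observed absolute errors on the model's raw uncertainties over the calibration set; the data below are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic calibration set: ground truth, predictions, and raw (uncalibrated) stds.
y_true = rng.normal(size=500)
y_pred = y_true + 0.3 * rng.normal(size=500)
sigma_raw = np.abs(rng.normal(size=500)) + 0.1

# Fit a monotonic map from the raw uncertainty to the observed absolute error.
calibrator = IsotonicRegression(increasing=True, out_of_bounds="clip")
calibrator.fit(sigma_raw, np.abs(y_true - y_pred))

# At prediction time, report the calibrated counterpart of each raw uncertainty.
sigma_calibrated = calibrator.predict(sigma_raw)
```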
Metrics and visuals: conveying uncertainty and assessing UQ performance
How can uncertainty estimates be effectively communicated? This is a hard question that has plagued statisticians for a long time. Naturally, people are more used to dealing with point-estimates and often instinctively ignore additional complexity. This bias is reflected in the popular metrics that dominate the literature, like MSE, R², MAPE, and many others that ignore second-order moment estimates. Notable departures are the NLL, a metric popular in probabilistic modelling, and utility losses, if task-relevant utility functions can be defined (Murphy 2022, sec. 14.2 Evaluating predictive models).
Overall, there is a large body of literature on how to deal with this problem, to accompany the large gap in effective solutions. Part of the problem is the statistical and psychological discomfort of the audience — some audiences will shirk uncertainty quantification because they:
- lack interpretation capability;
- lack bravery — do not wish to confront the uncertainty they know is there: “out of sight, out of mind” is apropos;
- lack defined utility functions to make good use of this uncertainty;
- lack bravery — do not want the responsibility of defining utility functions or dealing with the uncertainty;
and potentially a few other reasons. However, people can be made to confront and engage with such probabilistic output in a number of ways. We outline a few we found effective and prominent, below.
Conveying uncertainty
There is a long, ongoing discussion about how best to present statistical content to different audiences (Gelman and Unwin 2013). Given the attitudes to uncertainty examined above, the goal of conveying uncertainty fragments into two distinct ones:
- Accurately conveying relevant quantities in an accessible manner, and providing enough information for interested and informed audiences to interpret and make decisions.
- Capturing the attention of uninterested and uninformed audiences to make them care, before delivering a grand message.
These two goals can be at odds, especially in their incarnations (established vs novel, readability vs aesthetics, balanced vs extremist information load). When at odds, we are called to reconcile them (Gelman and Unwin 2013). The two basic suggestions below are discussed for illustrative purposes; there is a wider variety of ways to achieve the two goals above.
Error bars
In the case of unidimensional real output, the uncertainty can be well represented by error bars around a point-estimate. The point-estimate can be the mean or the median, and the error bars should ideally reflect low and high quantiles of the predictive distribution, rather than symmetric deviations associated with normality assumptions. Further, the quantiles should be chosen according to the use-case (e.g., particle physics requires ±5σ, while in engineering ±3σ, or even ±2σ, may be enough).
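A matplotlib sketch of such asymmetric, quantile-based error bars follows; all numbers are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 25)
median = np.sin(x)                                     # point estimate (median)
lo = median - 0.4 * (1 + 0.5 * np.abs(np.cos(x)))      # made-up 5th-percentile bound
hi = median + 0.2 * (1 + 0.5 * np.abs(np.cos(x)))      # made-up 95th-percentile bound

plt.errorbar(x, median, yerr=[median - lo, hi - median], fmt="o", capsize=3,
             label="median with 5–95% interval")
plt.plot(x, np.sin(x) + 0.1 * np.cos(3 * x), "--", label="ground truth")
plt.legend()
plt.show()
```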
[Figures: model predictions with uncertainty intervals plotted against simulated lift and drag for a test set of geometries.]
We examine a model that predicts lift and drag quantities associated with various geometries. Model predictions with uncertainties are plotted against simulated quantities for a test set of geometries in the figures above. One can see that statistical coverage for drag predictions is more accurate than that for lift, even though lift RMSE is lower. This demonstrates how different aspects of a model should be considered when assessing quality: a model might be more accurate in the mean predictions, but in order to trust it for decision making one might wish to assess the quality of associated uncertainty intervals.
Sampled output
Another method to represent probable outcomes is to sample them and present them all. This is truly an unbiased way to represent the distribution, and allows one to use the samples for further analysis in an intuitive manner (i.e., if there is some decision to be made, the outcome can be judged against each possible sample in the same way, without worrying about how the uncertainty propagates through whatever transformation is to be imposed). Sampling is also the basis of MC methods that construct decision outcomes that depend on uncertain underlying quantities.
Such output is especially useful when output is multi-dimensional and each dimension is correlated. A prime example of this is time-series forecasts, where global properties of the time-series might not appear in time-wise marginal distributions. Consider the classic Mauna Loa CO2 concentration over time. When fitting a temporal model to that data and making a forecast, certain aspects of any given possible trajectory will vanish in time marginal quantities. In the figures below, the amplitude in annual variation remains stable over time for any one particular trajectory, but if we were to aggregate the trajectories and extract time marginals this property would be lost. Compare sampled trajectories (top) to time-marginal trajectories (bottom): the time-marginal plot also accommodates trajectories where the annual oscillation amplitude grows over time, while in the full distribution, such a trajectory is almost forbidden.
[Figures: sampled CO₂ forecast trajectories (top) and time-marginal summaries of the same forecast (bottom).]
A problem with sampling is that more accurate representation requires bigger samples that consume information bandwidth (cognitive and computational). Cognitive, since people will need to construct personal summaries of this distribution by examining the samples, visually or otherwise. Computational, since communicating the distribution requires copying many samples, and since downstream tasks require computation — if one wants to examine a particular decision outcome, the decision must be evaluated against every sample and be transformed into a sample of outcomes for that decision. This outcome sample must then be examined to draw conclusions.
Another problem with sampling is that its behaviour is not immune to dimensionality: generally speaking, the representation quality of this method can decay with the dimensionality of the distribution. To illustrate, a sample size of 10 tells one a lot more about a univariate distribution than about a multivariate distribution.
Decision theory: utility functions
We care about the output of a model because we need to make decisions where the outcome depends on the quantity that the model estimates. The gain from the decision, therefore, depends on (a) the action taken, $a$, and (b) the value of a quantity of interest (QoI) that is being modelled, $y$. The function that maps these two variables to a gain (or, up to a sign flip, a cost) is known as a utility function, $r(a, y)$; below we follow the cost convention and minimise. There is a vast literature on this topic; here we present the main ideas found in (MacKay 2003, sec. 36 Decision Theory) and (Murphy 2022, sec. 34 Decision making under uncertainty).
In a usual problem setting, we seek to find the action that optimises utility, $a_\star$. However, we are immediately faced with the problem of non-determinacy, since we do not have access to the true value of the QoI. Instead, we have the uncertain model estimate, i.e., a random variable $y \mid \mathcal{D}$. To deal with this uncertainty, we choose to optimise a statistically aggregated utility, most commonly the expected (mean) utility:
$$\begin{align}
\label{eq:exp_util} a_\star =& \arg\min_{a} \mathbb{E}_y \left[ r(a, y) | \mathcal{D} \right] \\
=& \arg\min_{a} \int_\mathcal{Y} r(a, y)\ p(y|\mathcal{D})\ dy,
\end{align}$$
where $p(y|\mathcal{D})$ is the posterior predictive distribution offered by the model. Note how the utility under every possible value for the QoI, $y$, is weighted with the probability of that value occurring as estimated by the model, $p(y|\mathcal{D})$, and the weighted sum (mean average) is optimised. This guards against actions that may have dire consequences under unlikely occurrences. In the case of a quadratic function $r(a, y) = (a - y)^2$, the optimisation reduces to optimising the utility against the mean estimate,
$a_\star = \arg\min_{a} r\left(a, \mathbb{E}_y [y | \mathcal{D}] \right)$. Such reduction is not always warranted, however, so the whole predictive distribution and the uncertainty it represents must be balanced against the particular utility function to find optimal actions.
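A small numerical sketch of this expected-utility (here, expected-cost) optimisation against posterior-predictive samples follows; the predictive distribution and the cost function are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
y_samples = rng.normal(loc=2.0, scale=0.5, size=10_000)   # samples of y | D from the model
actions = np.linspace(0.0, 4.0, 401)                       # candidate actions a

def cost(a, y):
    # Asymmetric cost: undershooting the QoI is penalised three times as heavily.
    return np.where(a < y, 3.0 * (y - a), a - y)

# Monte Carlo estimate of E_y[ r(a, y) | D ] for each action, then take the minimiser.
expected_cost = np.array([cost(a, y_samples).mean() for a in actions])
a_star = actions[expected_cost.argmin()]
print(a_star)   # larger than the predictive mean (2.0), because undershooting is costlier
```

With a quadratic cost, the same computation would simply recover the predictive mean, illustrating the reduction mentioned above; with the asymmetric cost, the optimal action sits above the mean.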
Against an unbiased test set, we expect a model with better calibrated uncertainties to produce higher optimised utilities overall, than a model with similar mean estimates but with worse uncertainty estimates. Thus, utility functions suitable to the problem at hand can be used to assess the quality of uncertainty estimation in a model.
Conclusions
All these methods offer advantages and have limitations, and we are called to choose among them. Some are easier to implement (checkpoint pseudo-ensembles), and some possess theoretical virtues while harder to apply (probabilistic output methods). As such, it is prudent to keep a number of these in our armamentarium and make them interoperable with as broad an array of surrogate models as possible. The practitioner can then easily compare between them and choose according to the demands of the situation. While none of the methods are a complete answer to our problem, taken together they can adequately quantify uncertainty in model estimates.
References
- M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, M. Ghavamzadeh, P. Fieguth, X. Cao, A. Khosravi, U. R. Acharya, V. Makarenkov, and S. Nahavandi. A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges. Information Fusion, 76:243–297, Dec. 2021. ISSN 1566-2535. doi: 10.1016/j.inffus.2021.05.008. URL http://arxiv.org/abs/2011.06225. arXiv:2011.06225 [cs].
- C. M. Bishop. Pattern recognition and machine learning. Information science and statistics. Springer, New York, 2006. ISBN 978-0-387-31073-2.
- Y. Chung, I. Char, H. Guo, J. Schneider, and W. Neiswanger. Uncertainty Toolbox: an Open-Source Library for Assessing, Visualizing, and Improving Uncertainty Quantification, Sept. 2021. URL http://arxiv.org/abs/2109.10254. arXiv:2109.10254 [cs, stat].
- G. Detommaso, A. Gasparin, M. Donini, M. Seeger, A. G. Wilson, and C. Archambeau. Fortuna: A Library for Uncertainty Quantification in Deep Learning, Feb. 2023. URL http://arxiv.org/abs/2302.04019. arXiv:2302.04019 [cs, stat].
- Y. Gal and Z. Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of The 33rd International Conference on Machine Learning, pages 1050–1059. PMLR, June 2016. URL https://proceedings.mlr.press/v48/gal16.html. ISSN: 1938-7228.
- A. Gelman and A. Unwin. Infovis and Statistical Graphics: Different Goals, Different Looks. Journal of Computational and Graphical Statistics, 22(1):2–28, Jan. 2013. ISSN 1061-8600, 1537-2715. doi: 10.1080/10618600.2012.761137. URL https://www.tandfonline.com/doi/full/10.1080/10618600.2012.761137.
- M. Kim and T. Hospedales. BayesDLL: Bayesian Deep Learning Library, Sept. 2023. URL http://arxiv.org/abs/2309.12928. arXiv:2309.12928 [cs, stat].
- V. Kuleshov, N. Fenner, and S. Ermon. Accurate Uncertainties for Deep Learning Using Calibrated Regression, June 2018. URL http://arxiv.org/abs/1807.00263. arXiv:1807.00263 [cs, stat].
- D. J. C. MacKay. Information theory, inference, and learning algorithms. Cambridge University Press, Cambridge, UK; New York, 2003. ISBN 978-0-521-64298-9.
- K. P. Murphy. Probabilistic machine learning: an introduction. Adaptive computation and machine learning series. The MIT Press, Cambridge, Massachusetts, 2022. ISBN 978-0-262-04682-4.
- H. Ritter and T. Karaletsos. TyXe: Pyro-based Bayesian neural nets for Pytorch, Oct. 2021. URL http://arxiv.org/abs/2110.00276. arXiv:2110.00276 [cs, stat].
- C. Shalizi. Advanced data analysis from an elementary point of view. Cambridge University Press, 2013. URL http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/. Preprint.
- M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11, pages 681–688, Madison, WI, USA, June 2011. Omnipress. ISBN 978-1-4503-0619-5.
- A. G. Wilson and P. Izmailov. Bayesian Deep Learning and a Probabilistic Perspective of Generalization, Mar. 2022. URL http://arxiv.org/abs/2002.08791. arXiv:2002.08791 [cs, stat].