Non-Intrusive Reduced-Order Modeling Using Uncertainty-Aware Deep Neural Networks and Proper Orthogonal Decomposition: Application to Flood Modeling
Authors
Pierre Jacquier, Azzedine Abdedou, Vincent Delmas, Azzeddine Soulaïmani
Abstract
Deep Learning research is advancing at a fantastic rate, and there is much to gain from transferring this knowledge to older fields like Computational Fluid Dynamics in practical engineering contexts. This work compares state-of-the-art methods that address uncertainty quantification in Deep Neural Networks, pushing forward the reduced-order modeling approach of Proper Orthogonal Decomposition-Neural Networks (POD-NN) with Deep Ensembles and Variational Inference-based Bayesian Neural Networks on two-dimensional problems in space. These are first tested on benchmark problems, and then applied to a real-life application: flooding predictions in the Mille Îles river in the Montreal, Quebec, Canada metropolitan area. Our setup involves a set of input parameters, with a potentially noisy distribution, and accumulates the simulation data resulting from these parameters. The goal is to build a non-intrusive surrogate model that is able to know when it doesn’t know, which is still an open research area in Neural Networks (and in AI in general). With the help of this model, probabilistic flooding maps are generated, aware of the model uncertainty. These insights on the unknown are also utilized for an uncertainty propagation task, allowing for flooded area predictions that are broader and safer than those made with a regular uncertainty-uninformed surrogate model. Our study of the time-dependent and highly nonlinear case of a dam break is also presented. Both the ensembles and the Bayesian approach lead to reliable results for multiple smooth physical solutions, providing the correct warning when going out-of-distribution. However, the former, referred to as POD-EnsNN, proved much easier to implement and showed greater flexibility than the latter in the case of discontinuities, where standard algorithms may oscillate or fail to converge.
Reduced Basis Generation with Proper Orthogonal Decomposition
We start by defining $u$, our $\mathbb{R}^{D}$-valued function of interest. Computing this function is costly, so only a finite number $S$ of solutions, called snapshots, can be realized. In our applications, the spatial mesh of $N_D$ nodes is considered fixed in time; since it is known and defined upfront, we consider the number of outputs $H=N_D\times D$, the total number of degrees of freedom (DOFs) of the mesh. The simulation data, obtained by computing the function $u$ for $S$ parameter sets $\bm{s}^{(i)}$, is stored in a matrix of snapshots $\bm{U} = [u_D(\bm{s}^{(1)})|\ldots|u_D(\bm{s}^{(S)})] \in \mathbb{R}^{H \times S}$. Proper Orthogonal Decomposition (POD) is used to build a Reduced-Order Model (ROM) and produce a low-rank approximation, which is much more efficient to compute and use when rapid multi-query simulations are required. With the method of snapshots, a reduced POD basis can be efficiently extracted in a finite-dimensional context. In our case, we begin with the matrix $\bm{U}$ and use the Singular Value Decomposition (SVD) algorithm to extract $\bm{W} \in \mathbb{R}^{H \times H}$, $\bm{Z} \in \mathbb{R}^{S \times S}$ and the diagonal matrix $\bm{D} = \text{diag}(\xi_1, \xi_2, \ldots, \xi_r)$ of the $r$ positive singular values in descending order, such that
\[ \begin{aligned} \bm{U} = \bm{W} \begin{bmatrix} \bm{D} & 0 \\ 0 & 0 \end{bmatrix} \bm{Z}^\intercal. \end{aligned} \]
To truncate to the first $L$ modes, the following criterion on the singular values is imposed, for a given hyperparameter $\epsilon$:
\[ \dfrac{\sum_{l=L+1}^{r} \xi_l^2}{\sum_{l=1}^{r} \xi_l^2} \leq \epsilon, \]
and then each mode vector $\bm{V}_j \in \mathbb{R}^{H}$ can be found from $\bm{U}$ and the $j$-th column of $\bm{Z}$, $\bm{Z}_j$, with
\[ \bm{V}_j = \dfrac{1}{\xi_j} \bm{U} \bm{Z}_j, \] so that we can finally construct our POD mode matrix
\[ \bm{V} = \left[\bm{V}_1 | \ldots | \bm{V}_j | \ldots | \bm{V}_L\right] \in \mathbb{R}^{H \times L}. \] Projecting to and from the low-rank approximation requires projection coefficients; those corresponding to the matrix of snapshots are obtained as
\[ \bm{v} = \bm{V}^\intercal \bm{U}, \] and then $\bm{U}_\textrm{POD}$, the approximation of $\bm{U}$, can be projected back to the expanded space:
\[ \bm{U}_\textrm{POD} = \bm{V}\bm{V}^\intercal\bm{U} = \bm{V} \bm{v}. \]
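As an illustration, here is a minimal NumPy sketch of the reduced-basis construction described above, assuming a snapshot matrix `U` of shape $(H, S)$ has already been assembled; the function and variable names are illustrative, not those of the actual implementation.

```python
import numpy as np

def pod_basis(U, eps=1e-10):
    """Build the POD mode matrix V from the snapshot matrix U (H x S),
    truncating with the energy criterion on the singular values."""
    # Thin SVD: U = W @ diag(xi) @ Z^T
    W, xi, Zt = np.linalg.svd(U, full_matrices=False)

    # Keep the smallest L such that the discarded relative energy is <= eps
    cumulative_energy = np.cumsum(xi**2) / np.sum(xi**2)
    L = int(np.searchsorted(cumulative_energy, 1.0 - eps)) + 1

    # Each mode V_j = U @ Z_j / xi_j coincides with the j-th left singular vector
    return W[:, :L]                   # shape (H, L)

# Usage: projection coefficients and low-rank reconstruction
# V = pod_basis(U)
# v = V.T @ U
# U_pod = V @ v
```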
POD-EnsNN: Learning Expansion Coefficients Distributions using Deep Ensembles
Deep Neural Networks with built-in variance
In the POD-NN framework, the statistical step of mapping the input parameters to the expansion coefficients is handled by inferring this mapping with a Deep Neural Network, $\hat{u}_{DB}(\bm{s};\bm{w},\bm{b})$. The weights and biases of the network, $\bm{w}$ and $\bm{b}$, respectively, represent the model parameters and are learned during training (offline phase), to be later reused to make predictions (online phase). The network's number of hidden layers is called its depth, $d$, which is counted without the input and output layers. Each layer has a specific number of neurons that constitutes its width, $l^{(j)}$.
The main difference here with an ordinary DNN architecture for regression resides in the dual output, where the final layer size is twice the number of expansion coefficients to project, $l^{(d+1)}=2L$, since it outputs both a mean value $\bm{\mu}^v$ and a raw variance $\bm{\rho}^v$, which will then be constrained for positiveness through a softplus function, finally outputting ${\bm{\sigma^v}}^2$ as
\[ {\bm{\sigma}^v}^2 = \textrm{softplus}(\bm{\rho}^v) := \log(1 + \exp(\bm{\rho}^v)). \]
Since this predicted variance reports the spread, or noise, in the data (the inputs are drawn from a distribution) and would not be reduced even if we were to grow the dataset, it accounts for the aleatoric uncertainty, which is usually separated from the epistemic uncertainty.
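To make this dual-output architecture concrete, here is a minimal TensorFlow sketch (a simplified stand-in for the actual implementation, with illustrative layer sizes): the last layer stacks the raw means and raw variances of the $L$ expansion coefficients, and a softplus constrains the variance to be positive.

```python
import tensorflow as tf

def build_dual_output_nn(n_coefficients, width=128, depth=3):
    """DNN whose final layer has size 2L: raw means and raw variances."""
    model = tf.keras.Sequential()
    for _ in range(depth):
        model.add(tf.keras.layers.Dense(width, activation="relu"))
    model.add(tf.keras.layers.Dense(2 * n_coefficients))
    return model

def mean_and_variance(model, X):
    """Split the raw output into a mean and a softplus-constrained variance."""
    mu, rho = tf.split(model(X), 2, axis=-1)
    return mu, tf.math.softplus(rho)
```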
Ensemble training
Considering an $N$-sized training dataset $\mathcal{D}=\{\bm{X}_i, \bm{v}_i\}$, with $\bm{X}_i$ denoting the normalized non-spatial parameters $\bm{s}$ and $\bm{v}_i$ the corresponding expansion coefficients from a training/validation split of the matrix of snapshots $\bm{U}$, an optimizer performs $N_e$ training epochs to minimize the following Negative Log-Likelihood loss function with respect to the network weights and biases, parametrized by $\bm{\theta}=(\bm{w}, \bm{b})$
\[ \begin{aligned} \mathcal{L}_{\textrm{NLL}}(\mathcal{D},\bm{\theta}):=\dfrac{1}{N} \sum_{i=1}^{N}\left[\dfrac{\log \bm{\sigma}_{\bm{\theta}}^v(\bm{X}_i)^2}{2}+ \dfrac{(\bm{v}_i-\bm{\mu}^v_{\bm{\theta}}(\bm{X}_i))^2}{2 \bm{\sigma}_{\bm{\theta}}^v(\bm{X}_i)^2}\right], \end{aligned} \]
with the normalized inputs $\bm{X}$, $\bm{\mu}^v_{\bm{\theta}}(\bm{X})$ and $\bm{\sigma}_{\bm{\theta}}^v(\bm{X})^2$ as the mean and variance, respectively, retrieved from the $\bm{\theta}$-parametrized network.
In practice, an L2 regularization term is added to this loss, producing
\[ \begin{aligned} \mathcal{L}^\lambda_{\textrm{NLL}}(\mathcal{D}, \bm{\theta}):=\mathcal{L}_\textrm{NLL}(\mathcal{D}, \bm{\theta})+\lambda ||\bm{w}||^2.\end{aligned} \]
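Under the same assumptions as before, this regularized loss can be sketched as follows, reusing the hypothetical `mean_and_variance` helper from the previous snippet; for brevity, the L2 penalty is applied here to all trainable variables.

```python
def nll_loss(model, X, v, lam=1e-4):
    """Regularized Negative Log-Likelihood loss, averaged over the batch."""
    mu, var = mean_and_variance(model, X)
    nll = tf.reduce_mean(0.5 * tf.math.log(var) + 0.5 * tf.square(v - mu) / var)
    l2 = tf.add_n([tf.nn.l2_loss(w) for w in model.trainable_variables])
    return nll + lam * l2
```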
The idea behind Deep Ensembles is to randomly initialize $M$ sets of parameters $\bm{\theta}_{m}=({\bm{w}},{\bm{b}})$, thereby creating $M$ independent neural networks (NNs), each of which is then trained. Overall, the predicted moments in the reduced space $(\bm{\mu}^v_{\bm{\theta}_m},\bm{\sigma}^v_{\bm{\theta}_m})$ of each NN form a probability mixture which, as suggested by the original authors, we can approximate by a single Gaussian distribution, leading to a mean expressed as
\[ \begin{aligned} \bm{\mu}^v_*(\bm{X}) = \dfrac{1}{M} \sum_{m=1}^{M}\bm{\mu}^v_{\bm{\theta}_m}(\bm{X}), \end{aligned} \] and a variance subsequently obtained as
\[ \bm{\sigma}^v_*(\bm{X})^2 = \dfrac{1}{M} \sum_{m=1}^{M} \left[\bm{\sigma}_{\bm{\theta}_m}^v(\bm{X})^2 + \bm{\mu}^v_{\bm{\theta}_m}(\bm{X})^2\right] - \bm{\mu}_*^v(\bm{X})^2. \]
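These two aggregation formulas translate directly into code; a minimal sketch, assuming the per-network predictions have been stacked into arrays of shape $(M, n_\textrm{samples}, L)$:

```python
import numpy as np

def ensemble_moments(mus, variances):
    """Merge the M per-network Gaussian predictions into a single Gaussian."""
    mu_star = mus.mean(axis=0)
    var_star = (variances + mus**2).mean(axis=0) - mu_star**2
    return mu_star, var_star
```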
The model now accounts for epistemic uncertainty through the random initialization and the variability of the training step. This uncertainty is linked to the model itself and could be reduced if we had more data: since it reflects the model's data-fitting capabilities, it snowballs in the absence of such data, where nothing constrains the predictions.
POD-BNN: Bayesian Neural Networks and Variational Inference as an Alternative
For this model, the treatment of epistemic uncertainty is very different. Earlier, even though the NNs were providing us with a mean and a variance, they were still deterministic, and variability was obtained by assembling randomly initialized models. The Bayesian treatment instead assigns distributions to the network's weights, so the output is probabilistic by design. Considering a dataset $\mathcal{D}=\{\bm{X}_i, \bm{v}_i\}$, a likelihood function $p(\mathcal{D}|\bm{w})$ can be built, with $\bm{w}$ denoting both the weights $\bm{w}$ and the biases $\bm{b}$ for simplicity. The goal is then to construct a posterior distribution $p(\bm{w}|\mathcal{D})$ to obtain the following posterior predictive distribution on the target $\bm{v}$ for a new input $\bm{X}$
\[ p(\bm{v}|\bm{X},\mathcal{D})=\int p(\bm{v}|\bm{X},\bm{w})p(\bm{w}|\mathcal{D})\,d\bm{w}, \]
which cannot be computed directly in an NN context: the infinite number of possible weight configurations leaves the posterior $p(\bm{w}|\mathcal{D})$ intractable.
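In practice, once an approximate posterior is available (introduced next), this predictive integral is replaced by a Monte-Carlo average over weight samples. A sketch, where `sample_prediction` is a hypothetical helper returning the network's mean and variance for one draw of the weights:

```python
import numpy as np

def predictive_moments(sample_prediction, X, n_draws=100):
    """Monte-Carlo approximation of the posterior predictive mean and variance."""
    draws = [sample_prediction(X) for _ in range(n_draws)]
    mus = np.stack([mu for mu, _ in draws])
    variances = np.stack([var for _, var in draws])
    mu_star = mus.mean(axis=0)
    var_star = (variances + mus**2).mean(axis=0) - mu_star**2
    return mu_star, var_star
```

The moment-matching step mirrors the ensemble aggregation used for POD-EnsNN.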
Variational inference aims at constructing an approximate posterior distribution $q(\bm{w}|\bm{\theta})$ by minimizing its KL divergence with the true posterior, $\textrm{KL}(q(\bm{w}|\bm{\theta})\,||\,p(\bm{w}|\mathcal{D}))$, with respect to the new parameters $\bm{\theta}$, called latent variables, such that
\[ \begin{aligned} \textrm{KL}(q(\bm{w} | \bm{\theta}) || p(\bm{w} | \mathcal{D})) &= \int q(\bm{w} | \bm{\theta}) \log \dfrac{q(\bm{w} | \bm{\theta})}{p(\bm{w}|\mathcal{D})}\, d\bm{w}=\mathbb{E}_{q(\bm{w} | \bm{\theta})}\log \dfrac{q(\bm{w} | \bm{\theta})}{p(\bm{w}|\mathcal{D})},\end{aligned} \] which can be shown to be written as
\[ \begin{aligned} \textrm{KL}(q(\bm{w} | \bm{\theta}) || p(\bm{w} | \mathcal{D})) &=\textrm{KL}(q(\bm{w}|\bm{\theta})||p(\bm{w})) - \mathbb{E}_{q(\bm{w} | \bm{\theta})} \log p(\mathcal{D}|\bm{w}) + \log p(\mathcal{D})\\ &=:\mathcal{F}(\mathcal{D},\bm{\theta}) + \log p(\mathcal{D}) \end{aligned} \]
The term $\mathcal{F}(\mathcal{D}, \bm{\theta})$ is commonly known as the variational free energy; minimizing it with respect to $\bm{\theta}$ does not involve the last term $\log p(\mathcal{D})$, so it is equivalent to minimizing $\textrm{KL}(q(\bm{w}|\bm{\theta})\,||\,p(\bm{w}|\mathcal{D}))$. If an appropriate choice of $q$ is made, the predictive posterior can become computationally tractable.
By drawing $N_\textrm{mc}$ samples $\bm{w}^{(i)}$ from the distribution $q(\bm{w}|\bm{\theta})$ at the layer level, it is possible to construct a tractable Monte-Carlo approximation of the variational free energy, as
\[ \mathcal{F}(\mathcal{D},\bm{\theta}) \approx \sum_{i=1}^{N_\textrm{mc}} \left[ \log q(\bm{w}^{(i)} | \bm{\theta}) - \log p(\bm{w}^{(i)})\right] - \sum_{m=1}^{N} \log p(\mathcal{D} | \bm{w}_m), \] with $p(\bm{w}^{(i)})$ denoting the prior on the drawn weights $\bm{w}^{(i)}$, which is chosen by the user. The last term is approximated by summing over the $N$ training samples at the output level (for each training input).
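A schematic, pure-Python sketch of this Monte-Carlo estimate is given below; in the actual model this is implemented at the layer level with variational dense layers, and every helper here (`sample_weights`, `log_q`, `log_prior`, `log_likelihood`) is a hypothetical placeholder.

```python
def variational_free_energy(sample_weights, log_q, log_prior, log_likelihood,
                            X, v, n_mc=10):
    """Monte-Carlo estimate of F(D, theta).

    sample_weights()             draws w ~ q(w | theta)
    log_q(w), log_prior(w)       evaluate log q(w | theta) and log p(w)
    log_likelihood(w, X_i, v_i)  evaluates log p(v_i | X_i, w)
    """
    # KL part: sum over the N_mc weight draws
    kl_term = 0.0
    for _ in range(n_mc):
        w = sample_weights()
        kl_term += log_q(w) - log_prior(w)

    # Likelihood part: sum over the N training samples,
    # with weights drawn for each training input
    log_lik = 0.0
    for X_i, v_i in zip(X, v):
        w = sample_weights()
        log_lik += log_likelihood(w, X_i, v_i)

    return kl_term - log_lik
```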
A Few Results
Benchmark: 2D Ackley Function
To test the methods, one of the chosen benchmarks was a stochastic version of the Ackley function. The following results show the predictions in the second and third columns, respectively in and out of the training scope. We see the uncertainties correctly snowballing in the latter.
[Figures: 2D Ackley benchmark predictions with POD-EnsNN and POD-BNN]
Shallow Water equations validation: 1D dam break
The final validation case was a 1D, time-unsteady problem involving the equations of interest in flood modeling: the Shallow Water Equations. In this case, the POD handles 2 DOFs per node, the water depth $h$ and the velocity $u$. For both, the results are displayed for the two methods, with the second and third columns again denoting in- and out-of-distribution input parameters.
[Figures: water depth $h$ and velocity $u$ predictions]
Rendering training data from CuteFlow, our numerical solver
In the following video, we use the visualization software ParaView to render one of the snapshots, representing the training data for the case of a dam break, here on dry soil.
For more details and results, please check out the full paper.
Acknowledgements
This research was enabled in part by funding from the National Sciences and Engineering Research Council of Canada and Hydro-Québec; by bathymetry data from the Communauté métropolitaine de Montréal; and by computational support from Calcul Québec and Compute Canada.
Citation
@misc{jacquier2020nonintrusive,
title={Non-Intrusive Reduced-Order Modeling Using Uncertainty-Aware Deep Neural Networks and Proper Orthogonal Decomposition: Application to Flood Modeling},
author={Pierre Jacquier and Azzedine Abdedou and Vincent Delmas and Azzeddine Soulaimani},
year={2020},
eprint={2005.13506},
archivePrefix={arXiv},
primaryClass={physics.comp-ph}
}