Autoencoder identical to POD

This post builds a Linear Autoencoder (LAE) equivalent to Proper Orthogonal Decomposition (POD) in two ways. The modes extracted by the two algorithms are compared on a wide dataset (MNIST, [28*28, 60000]) and a slim dataset (flow-past-cylinder snapshots). In addition, several regularization methods to improve the orthogonality and optimality of the LAE modes are proposed and tested.

Theories

The relation between POD and an equivalent linear autoencoder was first identified by Milano and Koumoutsakos[1], who used a nonlinear autoencoder extension for reconstruction and prediction of near-wall turbulent flow. This work is regarded as one of the first applications of neural networks in the field of fluid mechanics[2].

POD/PCA and SVD

A more detailed description can be found in the previous post; only the essentials are summarized below.

Consider a data matrix \(\mathbf{X}\in\mathbb{R}^{m\times n}\), \(\mathbf{X}=\left[\begin{array}{llll}x^1 & x^2 & \ldots & x^n\end{array}\right]\), where \(x^i=\left[\begin{array}{llll}x^i_1 & x^i_2 & \ldots & x^i_m\end{array}\right]^T\). A reduced-order representation \(\tilde{\mathbf{X}}\in \mathbb{R}^{m\times n}\) can be established by POD as: \[ \tilde{\mathbf{X}}=\sum_{i=1}^p \boldsymbol{\phi}_i \alpha_i =\boldsymbol{\Phi} \boldsymbol{\alpha},\qquad \tilde{\mathbf{X}}\in\mathbb{R}^{m\times n}, \mathbf{\Phi}\in\mathbb{R}^{m\times p}, \boldsymbol{\alpha}\in\mathbb{R}^{p\times n} \] where the columns of \(\mathbf{\Phi}\) are called modes: they form the optimal basis, in the sense that among all rank-\(p\) approximations the reconstruction \(\tilde{\mathbf{X}}\) spanned by them has the lowest \(L_2\) loss with respect to the original data \(\mathbf{X}\).

These modes can be extracted by the SVD (or an economy/truncated SVD): \[ \mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V^T},\qquad \mathbf{U}\in\mathbb{R}^{m\times m}, \boldsymbol{\Sigma}\in\mathbb{R}^{m\times n}, \mathbf{V}\in\mathbb{R}^{n\times n} \] and the reduced-order representation can be reconstructed as: \[ \tilde{\mathbf{X}} = \underbrace{\tilde{\mathbf{U}}}_{\boldsymbol{\Phi}}\underbrace{\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^\mathbf{T}}_{\boldsymbol{\alpha}},\qquad \tilde{\mathbf{U}}\in\mathbb{R}^{m\times p}, \tilde{\boldsymbol{\Sigma}}\in\mathbb{R}^{p\times p}, \tilde{\mathbf{V}}\in\mathbb{R}^{n\times p} \] where \(\tilde{\mathbf{U}}, \tilde{\mathbf{V}}\) are the first \(p\) columns of \(\mathbf{U},\mathbf{V}\) respectively, and \(\tilde{\boldsymbol{\Sigma}}\) is the leading \(p\times p\) diagonal block of \(\boldsymbol{\Sigma}\). Relating this to POD, we have \(\boldsymbol{\Phi} = \tilde{\mathbf{U}}\) and \(\boldsymbol{\alpha} = \tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^\mathbf{T}\).
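For reference, a minimal sketch of this truncated-SVD step in PyTorch (the data matrix here is a random placeholder; in the case studies below the centred snapshot matrix takes its place):

import torch

X = torch.randn(784, 1000)               # placeholder data matrix, snapshots as columns
p = 10                                    # number of modes to keep
U, S, V = torch.svd(X)                    # X = U Sigma V^T
Phi = U[:, :p]                            # POD modes, Phi = U_tilde
alpha = torch.diag(S[:p]) @ V[:, :p].T    # coefficients, alpha = Sigma_tilde V_tilde^T
X_tilde = Phi @ alpha                     # rank-p reconstruction of X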

Linear autoencoder (LAE)

A one-layer linear neural network (without a nonlinear activation) maps the input data \(\mathbf{X_0}\in\mathbb{R}^{m\times n}\) to the output \(\mathbf{X_1}\in \mathbb{R}^{p\times n}\): \[ \mathbf{X_1}=\boldsymbol{W_1}\mathbf{X_0} + \mathbf{b_1},\qquad \mathbf{X_1}\in \mathbb{R}^{p\times n}, \mathbf{W_1}\in\mathbb{R}^{p\times m}, \mathbf{X_0}\in\mathbb{R}^{m\times n}, \mathbf{b_1}\in\mathbb{R}^{p\times 1}\text{ (broadcast)} \]
2-layer linear autoencoder

As shown in the architecture above, a linear autoencoder stacks two linear layers with a lower-dimensional intermediate output acting as a bottleneck, also called the latent space. The first and second layers are referred to as the encoder and the decoder, respectively.

With a well-trained LAE, the original input \(\mathbf{X_0}\) can be reconstructed from the latent space \(\mathbf{X_1}\); the error between the original data \(\mathbf{X_0}\) and the reconstructed data \(\mathbf{X_2}\) is minimized iteratively via backpropagation and gradient descent. When the mean squared error (MSE) is chosen as the loss function, the optimal latent space extracted by the optimal LAE is identical to the one extracted by POD/PCA.

Equivalent LAE to POD

Straightforward method

To get an autoencoder equivalent to POD, set \(\mathbf{b_1} = \mathbf{b_2}=\mathbf{0}\). A straightforward way is to make the autoencoder take \(\mathbf{X}\) as input and reconstruct it as \(\tilde{\mathbf{X}}\). The mathematical representation is: \[ \begin{aligned} layer1: \qquad \mathbf{X_1}=\boldsymbol{W_1}{\color{purple}\mathbf{X}},&\qquad \mathbf{X_1}\in \mathbb{R}^{p\times n}, \mathbf{W_1}\in\mathbb{R}^{p\times m}, \mathbf{X}\in\mathbb{R}^{m\times n} \\ layer2: \qquad {\color{purple}\tilde{\mathbf{X}}}=\boldsymbol{W_2}\mathbf{X_1},&\qquad \tilde{\mathbf{X}}\in \mathbb{R}^{m\times n}, \mathbf{W_2}\in\mathbb{R}^{m\times p} \end{aligned} \] Associated with the symbols above, the weights of the second layer are used as the modes, i.e. \[ \begin{aligned} \boldsymbol{\Phi} &= \mathbf{W_2} \\ \boldsymbol{\alpha} &= \mathbf{X_1} \end{aligned} \]
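As a quick sketch (assuming a trained two-layer LAE named model, built as in the case study below, and the centred data matrix X with snapshots as columns), the modes and coefficients can be read off as:

with torch.no_grad():
    W1 = model.encoder[0].weight      # [p, m] encoder weight
    W2 = model.decoder[0].weight      # [m, p] decoder weight
    Phi = W2                          # raw (generally non-orthogonal) modes
    alpha = W1 @ X                    # latent coefficients, X1 = W1 X
    X_tilde = Phi @ alpha             # rank-p reconstruction of X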

Transposition method

Alternatively, there is another equivalence: if we make the autoencoder process the transpose of the data, we have a bottleneck mapping from \(\mathbf{X'}\) to \(\tilde{\mathbf{X}}'\): \[ \begin{aligned} layer1: \qquad \mathbf{X_1'}=\boldsymbol{W_1}'{\color{purple}\mathbf{X}'},&\qquad \mathbf{X_1'}\in \mathbb{R}^{p\times m}, \mathbf{W_1'}\in\mathbb{R}^{p\times n}, \mathbf{X'}\in\mathbb{R}^{n\times m} \\ layer2: \qquad {\color{purple}\tilde{\mathbf{X}}'}=\boldsymbol{W_2}'\mathbf{X_1'},&\qquad \tilde{\mathbf{X}}'\in \mathbb{R}^{n\times m}, \mathbf{W_2'}\in\mathbb{R}^{n\times p} \end{aligned} \] In this case, the intermediate latent output is related to the modes (note the transposes needed to match the dimensions defined above): \[ \begin{aligned} \boldsymbol{\Phi} &= \mathbf{X_1'}^T \\ \boldsymbol{\alpha}&= \mathbf{W_2'}^T \end{aligned} \]
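Analogously, a sketch for the transposed case (same assumptions: a trained transposed LAE named model and the centred data matrix X with snapshots as columns):

with torch.no_grad():
    X1_t = model.encoder[0].weight @ X.T   # [p, m] latent of the transposed data
    Phi = X1_t.T                           # [m, p] raw modes (note the transpose)
    alpha = model.decoder[0].weight.T      # [p, n] coefficients, W2'^T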

Improvement

Well, these equivalences only exist on paper, because the gradient-descent process leaves too much freedom:

  • non-orthogonality: the modes extracted by the LAE are not necessarily orthogonal
  • non-optimality: the (L2) energy preserved by the modes is not ranked in descending order

There are several possible ways to obtain a set of modes comparable with the POD modes:

  • Two-stage:
    • [method 1] Run another SVD on the raw modes extracted by the LAE; this works for both methods.
  • End-to-end:
    • Straightforward method
      • [method 2] Orthonormal regularization on the weights
      • [method 3] Orthogonalize the weights at every step (parameterization)
        • [method 4] SVD the weights at every step (custom SVD parameterization)
    • Transposition method
      • [method 5] Add an SVD layer after the 1st-layer output.

All these improvements are compared and tested in the case study section. In the rest of this section, the regularization of [method 2] and the SVD parameterization of [method 4] are discussed in more detail.

L2 regularization

Before introducing the orthonormal regularization used in [method 2], let's recall the popular L2 regularization. For a dataset \(\mathbf{X}\in\mathbb{R}^{m\times n}\), \(\mathbf{X}=\left[\begin{array}{llll}x^1 & x^2 & \ldots & x^n\end{array}\right]\), where \(x^i=\left[\begin{array}{llll}x^i_1 & x^i_2 & \ldots & x^i_m\end{array}\right]^T\), the L2-regularized loss function of the 2-layer linear autoencoder is: \[ \begin{aligned} \mathbf{\hat{\mathcal{L}}_{(W_1,W_2)}} &= \mathcal{L}_{ms} + \mathcal{L}_{L2} \\ \text{where:}\qquad {\mathcal{L}_{ms}}_{\mathbf{(W_1,W_2)}} &= \frac{1}{n}\sum_{i=1}^{n} MSE(\mathbf{\tilde{x}^{i},x^{i}}), \\ {\mathcal{L}_{L2}}_{\mathbf{(W_1,W_2)}} &=\frac{\lambda_{L2}}{2n}\sum_{l=1}^2 \left\|\mathbf{W}_{l}\right\|^2_F, \\ \left\|\mathbf{W}_1\right\|^2_F &=\sum_{i=1}^{p}\sum_{j=1}^{m}\left({\mathbf{W}_1}\right)_{ij}^{2} = \operatorname{trace}(\mathbf{W}^T_1\mathbf{W}_1), \\ \quad \left\|\mathbf{W}_2\right\|^2_F &= \sum_{i=1}^{m}\sum_{j=1}^{p}\left({\mathbf{W}_2}\right)_{ij}^{2} = \operatorname{trace}(\mathbf{W}^T_2\mathbf{W}_2) \end{aligned} \]

In practice, L2 regularization is usually applied not by tailoring the loss function but, more simply and efficiently, by modifying the weight-update step, which is referred to as weight decay.

In the gradient-descent algorithm, weights are updated using the gradient of the loss function w.r.t. the weights: \[\mathbf{W}_l = \mathbf{W}_l-\alpha \frac{\partial{\hat{\mathcal{L}}}}{\partial\mathbf{W}_l}\] With \(\hat{\mathcal{L}} = \mathcal{L}_{ms} + \mathcal{L}_{L2}\), we have: \[\begin{aligned}\mathbf{W}_l &= \mathbf{W}_l-\alpha \left(\frac{\partial\mathcal{L}_{ms}}{\partial\mathbf{W}_l}+\frac{\partial\mathcal{L}_{L2}}{\partial\mathbf{W}_l}\right) \\&= \mathbf{W}_l-\alpha \left(\frac{\partial\mathcal{L}_{ms}}{\partial\mathbf{W}_l}+\frac{\lambda_{L2}}{n}\mathbf{W}_l\right) \\\end{aligned}\] Here \(\lambda_{L2}\) is called the weight decay parameter and \(\alpha\) is the learning rate.
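In PyTorch this is exposed directly through the optimizer's weight_decay argument, so no explicit L2 term has to be added to the loss. A minimal sketch (the lr and weight_decay values are illustrative, and model stands for the LAE defined later):

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, weight_decay=1e-4)
# torch.optim.Adam accepts the same weight_decay argument (the decay term is added to the gradient,
# as in the update rule above)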

Such a simplification is not available for the other regularization methods below.

Ablation tests on the L2 regularization (weight decay) and the learning rate are carried out on the MNIST dataset, with [method 1] used to inspect the results; some interesting findings are shown in the Appendix. It turns out that a small amount of L2 regularization leads to more orthogonal modes and a lower reconstruction error, while too much L2 regularization leads to high bias/underfitting.

Orthonormal regularization

One attractive property of the POD modes is their (semi-)orthonormality, i.e. the modes are mutually independent and the Gram matrix of the mode matrix is the identity: \[ \begin{aligned} \boldsymbol{\Phi}^T\boldsymbol{\Phi}&=\mathbf{I}, \qquad\mathbf{I} \text{ is the identity matrix} \\ \left\langle \phi_i, \phi_j\right\rangle &= \delta_{i j}, \qquad\forall i, j \end{aligned} \] To impose an orthonormality penalty on the weights of the LAE, mimicking the L2 regularization, we can design an orthonormal regularization: \[ {\mathcal{L}_{ortho}}_{\mathbf{(W_1,W_2)}} =\lambda_{ortho}\sum_{l=1}^2\left\|\mathbf{W}^T_l\mathbf{W}_l-\mathbf{I}\right\|^2_F \]
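A direct implementation of this penalty could look like the following sketch (the function name is illustrative; W is a tall weight matrix whose columns should become orthonormal):

import torch

def orthonormal_penalty(W, lambd=0.01):
    gram = W.T @ W                                    # [p, p] Gram matrix of the columns
    eye = torch.eye(gram.shape[0], device=W.device)
    return lambd * torch.sum((gram - eye) ** 2)       # squared Frobenius norm of W^T W - I

For a wide matrix such as the encoder weight, the penalty is applied to \(\mathbf{W}\mathbf{W}^T-\mathbf{I}\) instead, as in the [method 2] code below.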

Orthogonal regularization is a technique first proposed for CNNs.

Brock et al. first argued that maintaining orthogonal weights is desirable when training neural networks and proposed an orthogonal regularization based on the L1 norm[4]; they later upgraded it to a new variation based on the L2 norm in BigGAN[5]. Lubana et al.[6] applied the same method to the field of neural network pruning[7].

SVD parameterization

Parameterization (parametrization) is a method that transforms a parameter of a neural network in an appropriate way before it is used. For example, orthogonal parameterization orthogonalizes the weight matrix of a layer.

Orthogonal parameterization guarantees an arbitrary orthogonal basis, but we want more. So it is possible to define an SVD parameterization regulating the decoder such that: \[ \begin{aligned} layer2:& \qquad \tilde{\mathbf{X}}=\boldsymbol{W_2}\mathbf{X_1}= \mathbf{U}\boldsymbol{\Sigma}\mathbf{V^T}\mathbf{X_1} = \boldsymbol{\hat{W_2}}\mathbf{\hat{X_1}} \\ \text{where:}& \qquad \boldsymbol{\hat{W_2}} = \mathbf{U}, \quad \mathbf{\hat{X_1}}=\boldsymbol{\Sigma}\mathbf{V^T}\mathbf{X_1} \\ &\qquad \mathbf{U},\boldsymbol{\Sigma},\mathbf{V} = SVD(\boldsymbol{W_2}) \end{aligned} \] In this case the mode matrix \(\boldsymbol{\hat{W_2}}\) is automatically singular-value decomposed.

Parameterization only affects the weight, i.e. \(\boldsymbol{\hat{W_2}} = \mathbf{U}\) not \(\mathbf{\hat{X_1}}=\boldsymbol{\Sigma}\mathbf{V^T}\mathbf{X_1}\).

The intermediate output \(\mathbf{\hat{X_1}}\) has to be handled in the forward pass; see the code below.

Case study

MNIST

The MNIST handwritten digits dataset contains 70,000 (60,000 training, 10,000 testing) greyscale fixed-size (28 × 28) images and is a standard dataset for benchmarking computer vision and deep learning algorithms.

First 32 MNIST training-set images; the two-valued (0 or 1) images are re-rendered with a colour mapping over [-0.2, 0.2], consistent with the colour mapping used below

import torch
from torchvision import datasets, transforms, utils
import matplotlib.pyplot as plt
import numpy as np

# GPU if is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(device, " is in use")

# Load MNIST to a PyTorch Tensor
dataset = datasets.MNIST(root="./data",
                         train=True,
                         download=False,
                         transform=transforms.ToTensor())

POD result

In order to run POD, the training set is flattened, scaled to the range [0,1], and then centred by subtracting the mean image, resulting in a data matrix \(\mathbf{X}\in\mathbb{R}^{784\times60000}\)

# Flatten from [60000, 28, 28] to [28*28, 60000], move to gpu
flatten_data = dataset.data.reshape(dataset.data.shape[0], -1).type(torch.FloatTensor)/255  
mean_data = flatten_data.mean(axis=0)
flatten_data = flatten_data - mean_data
flatten_data = flatten_data.T.to(device)

The PyTorch SVD API is used for extracting POD modes, and the first 81 modes are selected to recover the data

# SVD and truncation
U, S, V = torch.svd(flatten_data)
k = 81
Uk = U[:, :k] # pod modes
Sk = torch.diag(S[:k]) # energy
Vk = V[:, :k]

The memory requirements for direct SVD are high on a dataset of this size. For larger data, such a direct approach is not practical.
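One alternative for larger data is a randomized, truncated SVD that only computes the leading modes, e.g. torch.svd_lowrank. A sketch (the values of q and niter are illustrative):

# approximate the leading 81 modes without forming the full SVD
Uk_approx, Sk_approx, Vk_approx = torch.svd_lowrank(flatten_data, q=81, niter=4)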

The energy spectrum and energy division of each mode are shown below:

SVD energy spectrum of MNIST - mean; SVD energy division of MNIST - mean
  • the first 81 modes (in use) captured 89.2% of the total energy
  • the first 87 modes captured 90% of the energy
  • the first 331 modes captured 99% of the energy

Estimated data can be reconstructed from the extracted modes and the associated weights, and the reconstruction error can then be quantified (here via the MSE).

reconstructed = Uk@Sk@Vk.T
print("MSE loss", torch.square(reconstructed - flatten_data).mean().cpu().numpy())
SVD reconstructed Error
MNIST SVD reconstructed with 81 modes MNIST SVD error
  • the MSE error is 0.007265054
  • pixel-to-pixel difference ranges from -0.997 to 0.920

Note that in this case all colour mapping ranges are set to [-0.2, 0.2] for consistency.

The modes can also be extracted easily. Note that these modes are orthonormal.

First 81 POD modes extracted from MNIST - mean

LAE results

The straightforward method is adopted here: with the transposition method, the weight matrices (60000 × 81) would be far too large. The architecture is therefore:

from torchinfo import summary
# ...

k = 81

# Creating a AE class
# 28*28 ==> k ==> 28*28
class AE(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Sequential(torch.nn.Linear(flatten_data.shape[1], k, bias=False))
        self.decoder = torch.nn.Sequential(torch.nn.Linear(k, flatten_data.shape[1], bias=False))

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded
      
# Model initialization, loss function and optimizer
model = AE().to(device)
loss_function = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

# Summary
summary(model, input_size=(60000, 784))
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
AE                                       [60000, 81]               --
├─Sequential: 1-1                        [60000, 81]               --
│    └─Linear: 2-1                       [60000, 81]               63,504
├─Sequential: 1-2                        [60000, 784]              --
│    └─Linear: 2-2                       [60000, 784]              63,504
==========================================================================================
Total params: 127,008
Trainable params: 127,008
Non-trainable params: 0
Total mult-adds (G): 7.62
==========================================================================================
Input size (MB): 188.16
Forward/backward pass size (MB): 415.20
Params size (MB): 0.51
Estimated Total Size (MB): 603.87
==========================================================================================

An iterative mini-batch training is implemented with batch_size = 25 and epochs = 3, as sketched below.
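A minimal sketch of such a training loop. It assumes the flattened, centred images are arranged sample-major, i.e. a [60000, 784] tensor called images on the same device as model (the name images is illustrative, matching the summary input above):

loader = torch.utils.data.DataLoader(images, batch_size=25, shuffle=True)

for epoch in range(3):
    for image in loader:                          # image: [25, 784]
        encoded, reconstructed = model(image)
        loss = loss_function(reconstructed, image)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()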

The data reconstructed by the LAE and the corresponding error are shown below.

LAE Reconstructed Error
MNIST AE reconstructed with 81 hidden size MNIST LAE error
  • MSE loss 0.00777
  • pixel-to-pixel error ranges from -1.01 to 0.938

Note that the optimal error (the POD error) cannot be reached exactly, because of the stochastic nature of mini-batch regression.

The set of modes is obtained by:

modes = np.array(model.decoder[0].weight.data.cpu())
raw modes [method 1] error 0.00777
First 81 LAE modes extracted from MNIST - mean SVD orthonormalized LAE modes, reconstruction loss 0.00777

[method 1]

The raw modes constructed by the LAE are obviously not identical to those of POD. To show that the spaces spanned by the two sets of modes are nevertheless identical/similar, the LAE modes are orthonormalized by SVD, which also gives an energy spectrum (below) and an orthonormalized set of modes (above); a sketch of this step follows.
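A minimal sketch of this orthonormalization step, using the modes array extracted above:

modes_t = torch.tensor(modes)             # [784, 81] raw LAE modes
U_m, S_m, _ = torch.svd(modes_t)          # SVD of the mode matrix
ortho_modes = U_m                         # orthonormal basis spanning the same subspace
energy = S_m                              # singular values used for the spectra below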

SVD energy spectrum of LAE modes; SVD energy division of LAE modes

The singular values of the orthonormalized modes are relatively similar to each other, implying that the raw modes extracted by the LAE are closer to orthogonal than an arbitrary basis, but still far from orthonormal.

The orthonormalized modes are, well, still not quite the same as those extracted by POD. So we do a small-scale convergence check: with a few tricks (bigger batch size, a learning-rate scheduler, etc.) a better reconstruction with MSE loss 0.0072723706 is obtained, and the same process is applied to get an orthonormalized set of modes.

[method 1] error 0.007272370 SVD, error 0.007265054
SVD orthonormalized LAE modes, reconstruction loss 0.0072723706; first 81 POD modes, reconstruction loss 0.007265054, copied here for easier comparison

The better agreement (at least in terms of the first two modes) indicates that the space spanned by the LAE modes converges to the space spanned by the POD modes.

[method 2]

In order to encourage orthogonal weights, first initialize the weights with random orthogonal matrices.

import torch.nn as nn
from scipy.stats import ortho_group
# flatten_data, the LAE `model` and k = 81 are defined as before;
# the sampled orthogonal matrix has dim = 784, the flattened image size
model.encoder[0].weight = nn.Parameter(torch.tensor(ortho_group.rvs(dim=flatten_data.shape[0])[:, :k].T).float())
model.decoder[0].weight = nn.Parameter(torch.tensor(ortho_group.rvs(dim=flatten_data.shape[0])[:, :k]).float())

Then add the orthogonality losses to the MSE loss function:

ortho_lambd = 0.01
# in the iteration loop:
  loss_ortho_de = torch.square(torch.dist(model.decoder[0].weight.T @ model.decoder[0].weight, torch.eye(k).to(device)))
  loss_ortho_en = torch.square(torch.dist(model.encoder[0].weight @ model.encoder[0].weight.T, torch.eye(k).to(device)))
  loss = loss_function(reconstructed, image) + ortho_lambd*(loss_ortho_de+loss_ortho_en)

The resulting modes and the singular-value energy division are:

SVD AE method2 energy spectrum of MNIST - mean; SVD AE method2 energy division of MNIST - mean
[method 2] raw error 0.007268394 [method 1] of [method 2]
AE method 2 first 81 modes extracted from MNIST - mean SVD AE method 2 first 81 modes extracted from MNIST - mean
  • MSE loss 0.007268394
  • pixel-to-pixel error ranges from -0.99917483 to 0.91432786

The modes are nearly orthogonal, yet a careful tuning of \(\lambda_{ortho}\) is needed to balance the MSE error against the orthogonality error.

Neither the raw [method 2] modes nor their SVD-orthogonalized version matches the POD modes; we will come back to this shortly.

[method 3]

Take advantage of the torch.nn.utils.parametrizations.orthogonal utility when defining the LAE:

class AE(torch.nn.Module):

    def __init__(self):
        super().__init__()
+       self.encoder = torch.nn.Sequential(torch.nn.utils.parametrizations.orthogonal(torch.nn.Linear(flatten_data.shape[1], k, bias=False)))
+       self.decoder = torch.nn.Sequential(torch.nn.utils.parametrizations.orthogonal(torch.nn.Linear(k, flatten_data.shape[1], bias=False)))
-       self.encoder = torch.nn.Sequential(torch.nn.Linear(flatten_data.shape[1], k, bias=False))
-       self.decoder = torch.nn.Sequential(torch.nn.Linear(k, flatten_data.shape[1], bias=False))

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded

The singular-value energy division and the resulting modes are:

SVD energy spectrum of method3 MNIST - mean; SVD energy division of method3 MNIST - mean
[method 3] raw error 0.0072978497 [method 1] of [method 3]
AE method 3 first 81 modes extracted from MNIST - mean SVD AE method 3 first 81 modes extracted from MNIST - mean
  • MSE loss 0.0072978497
  • pixel-to-pixel error ranges from -0.9987151 to 0.9268836

The modes are perfectly orthogonal, and the loss is also small.

But the [method 2] and [method 3] modes are totally different from the POD modes, because there is no constraint on the singular values of the modes, i.e. the optimality is not guaranteed.

[method 4] ( best )

Write a custom parameterization class and apply it to the decoder weights:

import torch.nn as nn

class Svd_W_de(nn.Module):
    # parameterization: replace the decoder weight W2 by the U factor of its SVD
    def forward(self, W2):
        U, _, _ = torch.svd(W2)
        return U

torch.nn.utils.parametrize.register_parametrization(model.decoder[0], "weight", Svd_W_de())

Tailor the network as:

class AE(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Sequential(torch.nn.utils.parametrizations.orthogonal(torch.nn.Linear(flatten_data.shape[1], k, bias=False)))
        self.decoder = torch.nn.Sequential(torch.nn.Linear(k, flatten_data.shape[1], bias=False))

    def forward(self, x):
        encoded = self.encoder(x)
+       # multiply the latent by Sigma V^T of the original (unparametrized) decoder weight,
+       # so that the parametrized decoder weight U reconstructs x as U Sigma V^T X1
+       _, S, V = torch.svd(self.decoder[0].parametrizations.weight.original)
+       encoded = encoded @ V @ torch.diag(S)
        decoded = self.decoder(encoded)
        return encoded, decoded

The results are:

SVD AE method4 energy spectrum of MNIST - mean; SVD AE method4 energy division of MNIST - mean
[method 4] raw error 0.007290172 [method 1] of [method 4]
AE method 4 first 81 modes extracted from MNIST - mean SVD AE method 4 first 81 modes extracted from MNIST - mean
  • MSE loss 0.007290172
  • pixel-to-pixel error ranges from -1.005479 to 0.9235858

Great! The modes are guaranteed to be orthogonal and optimal, the error is small, and the model is end-to-end.

Flow past a cylinder

Flow over a 2D cylinder is a fundamental flow that captures the essential features of bluff-body flows. Its periodic nature, known as von Kármán vortex shedding at Re > 47, makes it an attractive initial test bed for decomposition methods. The POD result of DNS snapshots at Re = 100 is shown below; the first 8 modes capture 99.97% of the flow fluctuations in terms of the kinetic energy (KE)[3].

POD analysis of cylinder flow: a) original flowfield under study (vorticity shown), b) first 8 dominant POD modes, and c) amount of KE of unsteadiness captured by the POD modes.

Note that these modes appear in pairs because of the periodic flow field and the real-valued nature of POD modes. Also note that the first two dominant modes are top-down asymmetric, associated with the asymmetry of the flow field.

Compared with dynamic mode decomposition (DMD), the two POD modes of a pair correspond to the real and imaginary parts of a single complex DMD mode. In other words, the 1st, 3rd, 5th and the 2nd, 4th, 6th POD modes are identical to the real parts and imaginary parts of the 1st, 2nd, 3rd DMD modes, respectively (only true for an exactly periodic flow like this case).

For simplicity, here we take an unsteady FVM result at Re = 200 from the wolfdynamics OpenFOAM tutorial and perform the POD and LAE decompositions on snapshots of the velocity field from time 250 to 350. The resulting data matrix is \(\mathbf{X}\in\mathbb{R}^{192080 \times 101}\)

See the high-resolution result at http://www.wolfdynamics.com/wiki/cylinder_vortex_shedding/movvmag.gif (incompressible flow, Reynolds number 200)
# Load the fluid dataset: one flattened 392*490 snapshot per row, i.e. a [351, 192080] matrix
flatten_data = load_img_as_matrix("./data/cylinder")
mean_data = flatten_data[250:,:].mean(axis=0)
flatten_data = flatten_data - mean_data
flatten_data = torch.tensor(flatten_data.T[:,250:])

POD result

Following the process of the last case, the energy spectrum/division of each mode is:

SVD energy spectrum of cylinder - mean; SVD energy division of fluid - mean
  • First 2 modes capture 76.95% of energy
  • First 4 modes capture 86.81% of energy
  • First 6 modes capture 88.61% of energy
  • First 8 modes capture 89.51% of energy

The original field, the reconstruction, and the error are:

fluid original fluid SVD reconstructed with 8 modes error between original and SVD reconstructed with 8 modes
  • MSE loss 7.4397925e-05
  • pixel-to-pixel error ranges from -0.1370289 to 0.12152482

The first 8 dominant POD modes are shown below:

SVD first 8 modes extracted from fluid - mean

Note that the modes look very similar to the reference result at Re = 100 shown at the beginning of this case.

LAE results

Straightforward ( FAILED )

With the straightforward method it is really difficult for the autoencoder to converge, and the model is prone to vanishing gradients because of the large spatial dimension (192080) compared with the small number of snapshots (101).

print("shape of dataset:", flatten_data.shape)
summary(model, input_size=(101, 192080))
print("encoder weight shape: ", model.encoder[0].weight.shape)
print("decoder weight shape: ", model.decoder[0].weight.shape)
shape of dataset: torch.Size([101, 192080])
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
AE                                       [101, 8]                  --
├─Sequential: 1-1                        [101, 8]                  --
│    └─Linear: 2-1                       [101, 8]                  1,536,640
├─Sequential: 1-2                        [101, 192080]             --
│    └─Linear: 2-2                       [101, 192080]             1,536,640
==========================================================================================
Total params: 3,073,280
Trainable params: 3,073,280
Non-trainable params: 0
Total mult-adds (M): 310.40
==========================================================================================
Input size (MB): 77.60
Forward/backward pass size (MB): 155.21
Params size (MB): 12.29
Estimated Total Size (MB): 245.10
==========================================================================================
encoder weight shape:  torch.Size([8, 192080])
decoder weight shape:  torch.Size([192080, 8])
reconstructed error raw modes [method 1]
fluid reconstructed with straightforward AE; fluid error reconstructed with straightforward AE; original straightforward AE modes; [method 1] orthogonalized straightforward AE modes

The fluid data FAILS to be recovered; the repetitive pattern shown in the recovered images is just the mean flow.

Comparing the orthonormalized modes extracted by the straightforward AE with the POD modes, only the first dominant POD mode is barely captured, and with a low amplitude (note the difference in data range), indicating vanishing gradients.

Transposition [method 1]

To take advantage of the better performance of neural networks on data with many samples, and to prevent overfitting and vanishing gradients, transpose the flattened data and let the LAE compress it along the snapshot dimension (101 -> 8 -> 101).

+  flatten_data = flatten_data.T
print("shape of dataset:", flatten_data.shape)
- summary(model, input_size=(101, 192080))
+ summary(model, input_size=(192080, 101))
print("encoder weight shape: ", model.encoder[0].weight.shape)
print("decoder weight shape: ", model.decoder[0].weight.shape)
shape of dataset: torch.Size([192080, 101])
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
AE                                       [192080, 8]               --
├─Sequential: 1-1                        [192080, 8]               --
│    └─Linear: 2-1                       [192080, 8]               808
├─Sequential: 1-2                        [192080, 101]             --
│    └─Linear: 2-2                       [192080, 101]             808
==========================================================================================
Total params: 1,616
Trainable params: 1,616
Non-trainable params: 0
Total mult-adds (M): 310.40
==========================================================================================
Input size (MB): 77.60
Forward/backward pass size (MB): 167.49
Params size (MB): 0.01
Estimated Total Size (MB): 245.10
==========================================================================================
encoder weight shape:  torch.Size([8, 101])
decoder weight shape:  torch.Size([101, 8])

This autoencoder can be trained well in seconds, and the reconstruction and error results are:

fluid reconstructed with transposed AE fluid error reconstructed with transposed AE
  • MSE loss 7.4815114e-05
  • pixel-to-pixel error ranges from -0.13534062 to 0.12123385

The first 8 raw modes and the [method 1] orthonormalized modes extracted by the LAE are:

energy raw modes [method 1] error 7.4815114e-05
AE energy spectrum
AE energy division
original transposed AE modes orthogonalized transposed AE modes

It can be seen that the spaces spanned by the LAE modes and the POD modes are nearly identical.

Note that the data range for the raw modes is different (-0.4 to 0.4 instead of -0.04 to 0.04). A sketch of how these modes are read off the trained model follows.
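For reference, a sketch of how the raw and orthonormalized modes can be extracted from the transposed LAE (assuming the trained model and the [192080, 101] flatten_data used above; each column reshapes to the 392×490 grid):

with torch.no_grad():
    latent = model.encoder(flatten_data.float().to(device))   # [192080, 8] latent output
raw_modes = latent.cpu().numpy()                               # columns are the raw modes
U_m, S_m, _ = torch.svd(latent)                                # [method 1] orthonormalization
ortho_modes = U_m.cpu().numpy()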

[method 5] ( FAILED )

The easiest way is to apply an SVD to the bottleneck features at every step:

class AE(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Sequential(torch.nn.Linear(flatten_data.shape[1], k, bias=False))
        self.decoder = torch.nn.Sequential(torch.nn.Linear(k, flatten_data.shape[1], bias=False))

    def forward(self, x):
        encoded = self.encoder(x)
+       svd_encoded,_,_ = torch.svd(encoded)
        decoded = self.decoder(encoded)
-       return encoded, decoded
+       return svd_encoded, decoded        

This operation has no advantages, only three drawbacks:

  • In full-batch training the result is exactly identical to that of [method 1], but with mini-batch gradient descent the resulting modes are wrong.
  • This operation does not affect the training process.
  • It is computationally heavier than [method 1], as the latter only requires a single SVD at the end.

Another way is to pass \(\mathbf{U}\) (svd_encoded) to the decoder and multiply the weight by \(\boldsymbol{\Sigma}\mathbf{V^T}\), mimicking [method 4]:

class AE(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Sequential(torch.nn.Linear(flatten_data.shape[1], k, bias=False))
        self.decoder = torch.nn.Sequential(torch.nn.Linear(k, flatten_data.shape[1], bias=False))
        self.SVT = torch.zeros((k,k))

    def forward(self, x):
        encoded = self.encoder(x)
        svd_encoded,S,V = torch.svd(encoded)
        decoded = self.decoder(svd_encoded)
        self.SVT = V@torch.diag(S)
        return svd_encoded, decoded


class Svd_W_de(nn.Module):
    def __init__(self, SVT):
        super().__init__()
        self.SVT = SVT
    def forward(self, X):
        return X.to(device)@self.SVT.to(device)


# Model initialization
model = AE()
torch.nn.utils.parametrize.register_parametrization(model.decoder[0], "weight", Svd_W_de(model.SVT))

This operation has no advantages and two fatal drawbacks:

  • It cannot be trained.
  • The resulting modes are wrong with mini-batch gradient descent.

[method 5] is therefore deprecated.

Conclusion

  1. When to use linear autoencoder?

    When the data is big, the latent space can be captured well with a much lower memory requirement.

    When offline training is impractical, for example when new data keeps arriving constantly.

  2. The relationship between LAE and POD modes?

    There is no constraint on the form of the LAE modes, while the POD modes are orthonormal, which makes them suitable for designing a sparse representation.

    The latent space captured by the LAE converges to the optimal space spanned by the POD modes.

  3. Straightforward vs transposition LAE?

    It mainly depends on the aspect ratio of the input data. In order to decrease the number of weights/parameters, it is better to choose the variant that keeps the longer dimension of the data as the sample dimension.

    • The straightforward method is suitable for very wide data (many samples), like MNIST.
    • The transposition method is suitable for very slim data (many grid points, few snapshots), like the flow-velocity snapshots.
      • If a multi-layer nonlinear autoencoder is applied, the modes extracted by the transposed AE remain easy to access.
  4. Conclusion on the improvement methods?

    |  | [method 1] | [method 2] | [method 3] | [method 4] | [method 5] | note |
    | --- | --- | --- | --- | --- | --- | --- |
    | suitable LAE | both | straightforward | straightforward | straightforward | transposition |  |
    | end-to-end | no (two-stage) | yes | yes | yes | yes |  |
    | orthogonality of the modes | perfect | nearly perfect | perfect | perfect | perfect |  |
    | optimality |  |  |  |  |  |  |
    | converges to the POD modes |  |  |  |  |  |  |
    | reconstruction error (straightforward) | 7.272e-3 | 7.268e-3 | 7.298e-3 | 7.290e-3 | NAN | POD error 7.265e-3 |
    | reconstruction error (transposition) | 7.48e-5 | NAN | NAN | NAN | 7.03e-4 | POD error 7.44e-5 |

Appendix

Ablation test of L2 regularization of LAE on MNIST

Different weight decays and learning rates are applied, and the raw modes are inspected with [method 1].

For learning rate \(\alpha = 0.005\), the energy spectrum and division for weight decay \(\lambda\) from \(0.005\) to \(0\) are plotted as animations below:

Energy_spectral_opt Energy_division_opt

and the raw modes and orthogonal modes recovered by [method 1] are shown below:

AE_modes_opt orthogonalized_modes_opt

The MSE error and the range of the mode values are also plotted against the weight decay and the learning rate:

ablation error with weight decay ablation range with weight decay

As we can see, the learning rate does not affect the mean squared error, only the range of the mode values. The weight decay, on the other hand, can be used to control the modes' orthogonality: a suitable weight decay, for example \(8\times10^{-5}\), results in better orthogonality, and the recovered modes (shown below) show a better resemblance to the POD modes.

orthogonalized modes wd 8e-05, lr 0.005 SVD first 81 modes extracted from MNIST - mean

Besides, it is interesting to see that although the biggest weight decay (\(0.005\)) fails to recover most of the modes, the raw modes are very similar to its recovered modes: they look like a shuffled collection of copies of the recovered modes, the difference in magnitude aside.

AE modes wd 0.005, lr 0.005 orthogonalized modes wd 0.005, lr 0.005

References

  1. Milano, M., & Koumoutsakos, P. (2002). Neural network modeling for near wall turbulent flow. Journal of Computational Physics, 182(1), 1-26. ↩︎
  2. Brunton, S. L., Noack, B. R., & Koumoutsakos, P. (2020). Machine learning for fluid mechanics. Annual review of fluid mechanics, 52, 477-508. ↩︎
  3. Taira, K., Hemati, M. S., Brunton, S. L., Sun, Y., Duraisamy, K., Bagheri, S., ... & Yeh, C. A. (2020). Modal analysis of fluid flows: Applications and outlook. AIAA journal, 58(3), 998-1022. ↩︎
  4. Brock, A., Lim, T., Ritchie, J. M., & Weston, N. (2016). Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093. ↩︎
  5. Brock, A., Donahue, J., & Simonyan, K. (2018). Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. ↩︎
  6. Lubana, E. S., Trivedi, P., Hougen, C., Dick, R. P., & Hero, A. O. (2020). OrthoReg: Robust Network Pruning Using Orthonormality Regularization. arXiv preprint arXiv:2009.05014. ↩︎
  7. Blalock, D., Gonzalez Ortiz, J. J., Frankle, J., & Guttag, J. (2020). What is the state of neural network pruning?. Proceedings of machine learning and systems, 2, 129-146. ↩︎
