Autoencoders in PyTorch
When training CNNs, one of the problems is that we need a lot of labeled data. In the case of image classification, we need to separate images into different classes, which is a manual effort.
However, we might want to use raw (unlabeled) data to train CNN feature extractors, which is called self-supervised learning. Instead of labels, we will use the training images as both network input and output. The main idea of an autoencoder is to have an encoder network that converts the input image into some latent space (normally just a vector of some smaller size), followed by a decoder network whose goal is to reconstruct the original image.
Since we are training the autoencoder to capture as much information from the original image as possible for accurate reconstruction, the network tries to find the best embedding of the input images to capture their meaning.

Image from Keras blog
Let's create the simplest autoencoder for MNIST!
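Such an encoder-decoder pair might look like the sketch below. The exact architecture used in the notebook is not shown, so the layer sizes here are assumptions; the only requirement is that the decoder mirrors the encoder back to the 28×28 input shape.

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    """Minimal convolutional autoencoder sketch for 28x28 MNIST images."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),  # 7x7 -> 14x14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),   # 14x14 -> 28x28
            nn.Sigmoid(),  # keep pixel values in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
out = model(torch.randn(8, 1, 28, 28))  # reconstruction has the input shape
```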
Define training parameters and check if the GPU is available:
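A minimal sketch of this step; the concrete hyperparameter values are assumptions (only the epoch count of 30 can be read off the progress bar below):

```python
import torch

# Hypothetical training parameters -- the notebook's actual values are not shown
batch_size = 256
num_epochs = 30   # matches the progress bar output below
lr = 1e-3

# Use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Training on {device}")
```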
The following function will load the MNIST dataset and apply specified transforms to it. It will also split it into train/test datasets.
Now let's load the dataset and define dataloaders for train and test:
100%|██████████| 30/30 [06:49<00:00, 13.65s/it, train loss:=0.104, test loss:=0.104]
Task 1: Try to train the autoencoder with a very small latent vector size, e.g. 2, and plot the dots corresponding to different digits. Hint: use a fully-connected dense layer after the convolutional part to reduce the vector size to the required value.
Task 2: Starting from different digits, obtain their latent space representations, and see what effect adding some noise to the latent space has on the resulting digits.
Denoising
Autoencoders can be effectively used to remove noise from images. To train a denoiser, we start with noise-free images and add artificial noise to them. Then we feed the autoencoder the noisy images as input, with the noise-free images as the target output.
Let's see how this works for MNIST:
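The noisy training inputs can be produced on the fly, as in the sketch below (the noise level 0.3 is an assumption; the notebook's actual setting is not shown):

```python
import torch

def add_noise(images, noise_factor=0.3):
    """Add Gaussian noise and clamp back to the valid [0, 1] pixel range."""
    noisy = images + noise_factor * torch.randn_like(images)
    return noisy.clamp(0.0, 1.0)

# During training the noisy image is the input and the clean image the target:
clean = torch.rand(8, 1, 28, 28)
noisy = add_noise(clean)
# loss = F.mse_loss(model(noisy), clean)   # inside the training loop
```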
100%|██████████| 100/100 [22:29<00:00, 13.49s/it, train loss:=0.134, test loss:=0.133]
Exercise: See how the denoiser trained on MNIST digits works on different images. As an example, you can take the Fashion MNIST dataset, which has the same image size. Note that a denoiser works well only on the same image type it was trained on (i.e. on the same probability distribution of input data).
Super-Resolution
Similarly to the denoiser, we can train autoencoders to increase the resolution of an image. To train a super-resolution network, we start with high-resolution images and automatically downscale them to produce the network inputs. We then feed the autoencoder the small images as inputs and the high-resolution images as targets.
To do that, let's downscale the images to 14x14 at training time.
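The downscaling can be done with `torch.nn.functional.interpolate`, as in this sketch (the interpolation mode is an assumption):

```python
import torch
import torch.nn.functional as F

# Downscale 28x28 MNIST images to 14x14 to produce the network inputs;
# the original high-resolution image is the training target.
hires = torch.rand(8, 1, 28, 28)
lores = F.interpolate(hires, size=(14, 14), mode="bilinear", align_corners=False)
```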
100%|██████████| 30/30 [06:43<00:00, 13.47s/it, train loss:=0.102, test loss:=0.103]
Exercise: Try to train a super-resolution network on CIFAR-10 for 2x and 4x upscaling. Use noise as input to the 4x upscaling model and observe the result.
Variational Auto-Encoders (VAE)
Traditional autoencoders reduce the dimension of the input data, figuring out the important features of the input images. However, the latent vectors often do not make much sense. Taking the MNIST dataset as an example, figuring out which digits correspond to which latent vectors is not an easy task, because close latent vectors do not necessarily correspond to the same digit.
On the other hand, to train generative models it is better to have some understanding of the latent space. This idea leads us to the variational auto-encoder (VAE).
A VAE is an autoencoder that learns to predict the statistical distribution of the latent parameters, the so-called latent distribution. For example, we can assume that latent vectors are normally distributed, z ~ N(z_mean, exp(z_log)), where z_mean and z_log are predicted by the encoder. The decoder then takes a random vector sampled from this distribution and reconstructs the object.
To summarize:
- From the input vector, we predict `z_mean` and `z_log` (instead of predicting the standard deviation itself, we predict its logarithm)
- We sample a vector `sample` (`z_val` in the code) from the distribution N(z_mean, exp(z_log))
- The decoder tries to decode the original image using `sample` as the input vector
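The sampling step above is usually implemented with the reparameterization trick, so that gradients can flow through it. A sketch, using `z_log` as the logarithm of the standard deviation as described above:

```python
import torch

def sample_latent(z_mean, z_log):
    """Reparameterization trick: z = mean + sigma * eps, with eps ~ N(0, I).
    z_log is the logarithm of the standard deviation."""
    eps = torch.randn_like(z_mean)
    return z_mean + torch.exp(z_log) * eps

z_mean = torch.zeros(8, 2)
z_log = torch.zeros(8, 2)          # log(sigma) = 0  ->  sigma = 1
z_val = sample_latent(z_mean, z_log)
```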
Image from this blog post by Isaak Dykeman
Variational auto-encoders use a compound loss function that consists of two parts:
- Reconstruction loss shows how close the reconstructed image is to the target (it can be MSE). It is the same loss function as in normal autoencoders.
- KL loss ensures that the latent variable distribution stays close to a normal distribution. It is based on the notion of Kullback-Leibler divergence, a metric that estimates how similar two statistical distributions are.
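The two parts can be combined as in this sketch. With `z_log = log(sigma)`, the KL divergence from N(mean, sigma^2) to N(0, 1) has the closed form 0.5 * (sigma^2 + mean^2 - 1) - log(sigma), summed over latent dimensions (the `sum` reduction is an assumed choice):

```python
import torch
import torch.nn.functional as F

def vae_loss(reconstructed, original, z_mean, z_log):
    """Compound VAE loss: reconstruction (MSE) + KL divergence to N(0, I)."""
    reconstruction = F.mse_loss(reconstructed, original, reduction="sum")
    # exp(2 * z_log) = sigma^2 when z_log = log(sigma)
    kl = torch.sum(0.5 * (torch.exp(2 * z_log) + z_mean**2 - 1) - z_log)
    return reconstruction + kl

# Sanity check: a perfect reconstruction with a standard-normal latent
# distribution (mean = 0, sigma = 1) gives zero loss.
x = torch.zeros(2, 4)
zero_loss = vae_loss(x, x, torch.zeros(2, 2), torch.zeros(2, 2))
```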
100%|██████████| 30/30 [04:54<00:00, 9.83s/it, train loss:=35.1, test loss:=35.6]
Task: In our sample, we have trained a fully-connected VAE. Now take the CNN from the traditional autoencoder above and create a CNN-based VAE.
Adversarial Auto-Encoders
Adversarial auto-encoders (AAE) are a combination of Generative Adversarial Networks and Variational Auto-Encoders.
The encoder plays the role of the generator, while the discriminator learns to distinguish samples from the real (prior) latent distribution from the encoder's outputs. The encoder output is a latent code, from which the decoder tries to decode the image.
In this approach we have three loss functions: the generator loss and the discriminator loss from GANs, and the reconstruction loss from VAEs.
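One training step with these three losses might be sketched as follows. The `encoder`, `decoder`, and `discriminator` modules here are hypothetical placeholders (their architectures are not shown in the text); the discriminator classifies latent codes, with "real" samples drawn from a standard normal prior:

```python
import torch
import torch.nn.functional as F

def aae_losses(encoder, decoder, discriminator, images):
    z_fake = encoder(images)              # latent codes from the encoder
    z_real = torch.randn_like(z_fake)     # samples from the N(0, I) prior

    # 1. Reconstruction loss, as in a plain autoencoder
    reconst_loss = F.mse_loss(decoder(z_fake), images)

    # 2. Discriminator loss: prior samples are "real", encoder outputs "fake"
    d_real = discriminator(z_real)
    d_fake = discriminator(z_fake.detach())
    disc_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
                 + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

    # 3. Generator (encoder) loss: try to fool the discriminator
    d_gen = discriminator(z_fake)
    enc_loss = F.binary_cross_entropy(d_gen, torch.ones_like(d_gen))
    return reconst_loss, disc_loss, enc_loss

# Tiny demo with linear "networks", just to exercise the function
enc = torch.nn.Linear(4, 2)
dec = torch.nn.Linear(2, 4)
disc = torch.nn.Sequential(torch.nn.Linear(2, 1), torch.nn.Sigmoid())
r_loss, d_loss, e_loss = aae_losses(enc, dec, disc, torch.rand(3, 4))
```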
Image from this blog post by Felipe Ducau
100%|██████████| 30/30 [09:22<00:00, 18.75s/it, train reconst loss:=0.0919, train disc loss:=1.39, train enc loss=0.692, test reconst loss:=0.0945, test disc loss:=1.39, test enc loss=0.692]