Neural Discrete Representation Learning

Introduction

This post covers discrete representation learning, focusing on common methods, tools, and applications in image vector quantization. It is mainly a memo of tools for future reference.

Common Losses in VQ-VAE

  1. VQ loss / VQ objective

Due to the straight-through gradient estimation of the mapping from z_e(x) to z_q(x), the embeddings e_i receive no gradients from the reconstruction loss log p(x|z_q(x)). Therefore, in order to learn the embedding space, one of the simplest dictionary learning algorithms is used: Vector Quantisation (VQ). The VQ objective uses the l2 error to move the embedding vectors e_i towards the encoder outputs z_e(x), as in the second term of the VQ-VAE training objective.
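
A minimal sketch of this term, assuming a hypothetical `quantize` helper where a plain tensor `codebook` stands in for the embedding table e:

```python
import torch
import torch.nn.functional as F

def quantize(z_e, codebook):
    # z_e: encoder outputs, shape (batch, dim); codebook: embedding table, shape (K, dim).
    dists = torch.cdist(z_e, codebook)    # pairwise l2 distances, (batch, K)
    indices = dists.argmin(dim=-1)        # nearest-neighbour code assignment
    z_q = codebook[indices]               # quantized vectors, (batch, dim)

    # VQ loss: moves the selected embeddings e_i towards the encoder outputs.
    # detach() plays the role of the stop-gradient operator sg[.] in the paper.
    vq_loss = F.mse_loss(z_q, z_e.detach())

    # Straight-through estimator: the forward pass uses z_q, the backward pass
    # copies the decoder's gradient from z_q straight to z_e, skipping the argmin.
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, vq_loss, indices
```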

  2. reconstruction loss

Optimizes the decoder and, through the straight-through estimator explained above, the encoder as well (see the combined sketch after the commitment loss below).

  3. commitment loss

Since the volume of the embedding space is dimensionless, it can grow arbitrarily if the embeddings e_i do not train as fast as the encoder parameters. To make sure the encoder commits to an embedding and its output does not grow, a commitment loss is added as the third term of the objective, weighted by a coefficient β (0.25 in the paper).
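
Putting the three terms together, a sketch of the full objective (continuing the hypothetical `quantize` above; `beta` is the commitment weight):

```python
import torch.nn.functional as F

def vqvae_loss(x, encoder, decoder, codebook, beta=0.25):
    z_e = encoder(x)                              # continuous encoder output
    z_q_st, vq_loss, _ = quantize(z_e, codebook)  # straight-through quantization (sketch above)
    x_rec = decoder(z_q_st)                       # gradients reach the encoder through z_q_st

    recon_loss = F.mse_loss(x_rec, x)             # trains both decoder and encoder

    # Commitment loss: mirror image of the VQ loss with the stop-gradient on the
    # embedding side, so the encoder commits to its code and z_e stays bounded.
    commit_loss = F.mse_loss(z_e, z_q_st.detach())

    return recon_loss + vq_loss + beta * commit_loss
```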

  4. GAN loss

An adversarial loss from a patch-based discriminator, as in VQGAN, that pushes reconstructions toward locally realistic images. This keeps perceptually important local structure in the codebook reconstructions and alleviates the need to model low-level statistics with the transformer architecture.
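
A sketch of the adversarial term, assuming a hypothetical patch-based discriminator `disc` (hinge formulation, as used in VQGAN):

```python
import torch.nn.functional as F

def discriminator_loss(disc, x_real, x_rec):
    # disc(.) returns a map of per-patch realness logits.
    logits_real = disc(x_real)
    logits_fake = disc(x_rec.detach())  # do not backprop into the autoencoder here
    return 0.5 * (F.relu(1.0 - logits_real).mean() + F.relu(1.0 + logits_fake).mean())

def generator_adv_loss(disc, x_rec):
    # Pushes each reconstructed patch to look real to the discriminator.
    return -disc(x_rec).mean()
```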

  5. perceptual loss

Replaces the pixel-wise l2 reconstruction error with a distance in the feature space of a pretrained network (LPIPS in VQGAN). Together with the GAN loss, it preserves perceptually important local structure rather than exact pixel values.
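
A sketch using the third-party `lpips` package (assumed installed; inputs are expected to be scaled to [-1, 1]):

```python
import lpips

# LPIPS distance in VGG feature space; 'alex' and 'squeeze' backbones also exist.
percep = lpips.LPIPS(net='vgg')

def perceptual_loss(x, x_rec):
    # Returns one distance per image; average into a scalar training loss.
    return percep(x, x_rec).mean()
```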

  6. entropy loss

$$
\mathbb{E}[H(q(z))] - H(\mathbb{E}[q(z)])
$$
An entropy penalty added during training to encourage codebook utilization: minimizing the first term makes each individual code assignment confident, while maximizing the second term (the entropy of the average assignment distribution) spreads usage over the whole codebook.
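
A sketch of this penalty, computed from per-token logits over the K codebook entries (hypothetical helper):

```python
import torch

def entropy_penalty(logits, eps=1e-8):
    # logits: (num_tokens, K) affinities of each encoder output to the K codes.
    probs = logits.softmax(dim=-1)
    log_probs = (probs + eps).log()

    # E[H(q(z))]: average per-token entropy; minimizing it sharpens each assignment.
    per_sample_entropy = -(probs * log_probs).sum(dim=-1).mean()

    # H(E[q(z)]): entropy of the batch-averaged code distribution; the minus sign
    # means it is maximized, spreading usage across the whole codebook.
    avg_probs = probs.mean(dim=0)
    codebook_entropy = -(avg_probs * (avg_probs + eps).log()).sum()

    return per_sample_entropy - codebook_entropy
```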

Tools

vector-quantize-pytorch

A PyTorch implementation of VQ-VAE and its common variants and improvement tricks, for example (a minimal usage sketch follows the list):

  1. Residual VQ
  2. SoundStream Initialization
  3. Lower codebook dimension
  4. Cosine similarity
  5. Expiring stale codes
  6. Orthogonal regularization loss
  7. Multi-headed VQ
  8. Random Projection Quantizer
  9. Finite Scalar Quantization
  10. Lookup Free Quantization: MAGVIT-v2 tokenizes images with this quantization method; a PyTorch implementation of the MaskGIT-style generation it pairs with can be found in Phenaki - Pytorch
  11. Latent Quantization
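
A minimal usage example, following the library's README:

```python
import torch
from vector_quantize_pytorch import VectorQuantize

vq = VectorQuantize(
    dim=256,
    codebook_size=512,      # number of codebook entries
    decay=0.8,              # EMA decay for codebook updates
    commitment_weight=1.0,  # weight of the commitment loss
)

x = torch.randn(1, 1024, 256)            # (batch, sequence, dim)
quantized, indices, commit_loss = vq(x)  # (1, 1024, 256), (1, 1024), loss tensor
```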

STNN

An implementation of the spatio-temporal Transformer. Vector-quantized image encodings can be fed into a Transformer for further computation; a spatio-temporal Transformer is an architecture that models both temporal and spatial dependencies. Instead of attending over the full 3D token volume directly, it interleaves attention along the temporal and spatial dimensions to capture spatio-temporal relationships; a minimal sketch of this pattern follows.
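
A minimal sketch of the interleaved pattern (a hypothetical `FactorizedSTBlock`, not the actual STNN code; assumes `einops` is available):

```python
import torch
import torch.nn as nn
from einops import rearrange

class FactorizedSTBlock(nn.Module):
    """One interleaved block: attend over space within each frame, then over time."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, space, dim) token grid, e.g. embedded VQ indices.
        b = x.shape[0]

        # Spatial attention: tokens of the same frame attend to each other.
        xs = rearrange(x, 'b t s d -> (b t) s d')
        h = self.norm1(xs)
        xs = xs + self.spatial_attn(h, h, h, need_weights=False)[0]
        x = rearrange(xs, '(b t) s d -> b t s d', b=b)

        # Temporal attention: the same spatial position attends across frames.
        xt = rearrange(x, 'b t s d -> (b s) t d')
        h = self.norm2(xt)
        xt = xt + self.temporal_attn(h, h, h, need_weights=False)[0]
        return rearrange(xt, '(b s) t d -> b t s d', b=b)
```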