Autoencoders and Variational Autoencoders
Abstract
One-sentence Summary
This paper presents a step-by-step derivation of the closed-form Kullback-Leibler divergence between Gaussian distributions, extending the univariate formulation to the multivariate diagonal covariance case to interpret each term and clarify its impact on Variational Autoencoder training dynamics and latent space regularization.
Key Contributions
- Provides a rigorous step-by-step derivation of the closed-form Kullback-Leibler divergence for Gaussian distributions, progressing from the general definition for continuous variables to the univariate case and finally to multivariate distributions under a diagonal covariance assumption.
- Explicitly decomposes the resulting divergence formula into its constituent mathematical terms to clarify the regularization mechanics within Variational Autoencoder architectures.
- Analyzes the functional role of each term to demonstrate how specific components constrain the latent space and directly influence optimization dynamics during model training.
Introduction
Probabilistic modeling underpins modern generative AI, with Variational Autoencoders (VAEs) relying on the Kullback-Leibler (KL) divergence to regularize learned representations toward a standard normal prior. Despite its critical role in shaping latent spaces during training, prior literature typically introduces the KL divergence abstractly and presents its closed-form Gaussian expression without derivation, leaving practitioners without a clear mechanistic understanding of how it influences optimization. The authors address this gap by providing a rigorous step-by-step derivation of the KL divergence for Gaussian distributions, progressing from the univariate case to multivariate settings with diagonal covariance. They then translate this mathematical foundation into practical insights, demonstrating how the regularization term directly governs VAE training dynamics and representation quality.
Dataset
- Dataset composition and sources: The authors do not describe any dataset composition or external data sources in the provided excerpt, which only includes the paper title and institutional affiliations.
- Key details for each subset: No subset breakdowns, sample sizes, or filtering criteria are mentioned.
- How the paper uses the data: The excerpt provides no information regarding training splits, mixture ratios, or model training procedures.
- Processing details: The authors do not outline any cropping strategies, metadata construction, or additional data processing steps.
Method
The authors leverage the Kullback-Leibler (KL) divergence as a central regularization mechanism in Variational Autoencoders (VAEs), where it quantifies the discrepancy between the approximate posterior distribution over latent variables and a predefined prior. This divergence is computed between two multivariate Gaussian distributions: the approximate posterior $q(z \mid x)$, parameterized by a mean $\mu(x)$ and covariance $\Sigma(x)$, and the standard normal prior $p(z) = \mathcal{N}(0, I_k)$. The KL divergence serves as a critical component of the VAE objective function, encouraging the learned latent space to conform to the prior while enabling generative modeling.
As illustrated in the paper's figure, the KL divergence derivation begins with the general definition for continuous distributions, expressed as an integral of the log-ratio of the two probability densities weighted by the first density. For Gaussian distributions, the density functions are substituted into this definition, and the logarithm of each density is expanded into terms involving the mean, covariance, and dimensionality. After simplifying with the properties of the logarithm and integrating over the domain, the KL divergence decomposes into three distinct components plus a constant offset set by the dimensionality. The first term is the log-ratio of the determinants of the prior and posterior covariance matrices, the second is the trace of the product of the inverse prior covariance and the posterior covariance, and the third is the squared Mahalanobis distance between the means, measured with the inverse prior covariance.
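For reference, the definition and the resulting closed form can be written out as below. The labels $q = \mathcal{N}(\mu_1, \Sigma_1)$ for the posterior and $p = \mathcal{N}(\mu_2, \Sigma_2)$ for the prior are chosen here to match the subscripts used in the rest of the derivation, and $k$ denotes the dimensionality; this is the standard textbook form rather than a transcription of the paper's own equation.

```latex
D_{\mathrm{KL}}(q \,\|\, p)
  = \int q(x)\,\log\frac{q(x)}{p(x)}\,dx
  = \frac{1}{2}\left[
      \log\frac{|\Sigma_2|}{|\Sigma_1|}
      + \operatorname{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right)
      + (\mu_1-\mu_2)^\top \Sigma_2^{-1} (\mu_1-\mu_2)
      - k
    \right]
```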
The computation of these terms exploits the linearity of the expectation and trace operators. Write the posterior as $q = \mathcal{N}(\mu_1, \Sigma_1)$ and the prior as $p = \mathcal{N}(\mu_2, \Sigma_2)$. The first term, the log-determinant ratio, does not depend on $x$, so its expectation follows directly from the normalization of the probability distribution. The second term, the expected quadratic form $(x-\mu_2)^\top \Sigma_2^{-1} (x-\mu_2)$ under the posterior, is expanded using the identity $x - \mu_2 = (x - \mu_1) + (\mu_1 - \mu_2)$, and the expectation is evaluated term by term. The cross-term vanishes because the centered variable $x - \mu_1$ has zero mean under the posterior, while the remaining terms yield the trace $\operatorname{tr}(\Sigma_2^{-1}\Sigma_1)$ and the squared Mahalanobis distance $(\mu_1-\mu_2)^\top \Sigma_2^{-1} (\mu_1-\mu_2)$ between the means. The third term, the expected value of the quadratic form $(x-\mu_1)^\top \Sigma_1^{-1} (x-\mu_1)$ under the posterior, reduces to $\tfrac{1}{2}k$ because $\operatorname{tr}(\Sigma_1^{-1}\Sigma_1) = \operatorname{tr}(I_k) = k$, and it enters the result with a negative sign. Combining these components gives a closed-form expression that depends on the trace of the covariance product, the Mahalanobis distance, the log-determinant ratio, and the dimensionality.
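The combined expression can be sketched numerically as follows. This is a minimal illustration in NumPy, not the authors' code: the function name `gaussian_kl` and its argument names are hypothetical, and the covariance matrices are assumed to be symmetric positive definite.

```python
import numpy as np


def gaussian_kl(mu1, cov1, mu2, cov2):
    """KL( N(mu1, cov1) || N(mu2, cov2) ) for full-covariance Gaussians.

    Hypothetical helper for illustration; assumes cov1 and cov2 are
    symmetric positive definite k-by-k matrices.
    """
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)
    cov2 = np.asarray(cov2, dtype=float)
    k = mu1.shape[0]
    diff = mu1 - mu2

    # Trace term tr(cov2^{-1} cov1), via a linear solve instead of an explicit inverse.
    trace_term = np.trace(np.linalg.solve(cov2, cov1))

    # Squared Mahalanobis distance between the means under the second covariance.
    maha_term = diff @ np.linalg.solve(cov2, diff)

    # Log-determinant ratio log|cov2| - log|cov1|, computed stably with slogdet.
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)

    return 0.5 * (logdet2 - logdet1 + trace_term + maha_term - k)
```

As a sanity check, the divergence is zero when both distributions coincide, and shifting a unit-variance univariate Gaussian by one unit from a standard normal gives exactly one half, since only the Mahalanobis term survives.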
In the context of VAEs, the prior is typically the standard normal distribution, which simplifies the general expression significantly. The resulting formula becomes $\tfrac{1}{2}\left(\operatorname{tr}(\Sigma(x)) + \mu(x)^\top \mu(x) - k - \log\lvert\Sigma(x)\rvert\right)$. This closed-form solution is cheap to evaluate and differentiable with respect to $\mu(x)$ and $\Sigma(x)$, making it amenable to gradient-based optimization. Minimizing it pulls the approximate posterior toward the prior, which encourages a well-structured latent space and facilitates smooth sampling and generation of new data points.
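Under the diagonal covariance assumption, the trace and log-determinant reduce to sums over latent dimensions, so the term can be evaluated elementwise. The sketch below assumes the encoder outputs a mean vector and a log-variance vector (`mu` and `log_var`, a common convention that is an assumption here rather than something stated in the source); the function name `vae_kl_diagonal` is likewise hypothetical.

```python
import numpy as np


def vae_kl_diagonal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the last axis.

    Hypothetical helper for illustration. With Sigma = diag(sigma^2):
      tr(Sigma)    = sum(sigma^2) = sum(exp(log_var))
      log|Sigma|   = sum(log_var)
      mu^T mu      = sum(mu^2)
    so the closed form becomes 0.5 * sum(exp(log_var) + mu^2 - 1 - log_var).
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)


# Usage: a batch of two latent codes with k = 2 dimensions each.
mu_batch = np.array([[0.0, 0.0], [1.0, 0.0]])
log_var_batch = np.zeros((2, 2))
kl_per_example = vae_kl_diagonal(mu_batch, log_var_batch)
```

The per-dimension form makes the role of each term visible: `mu**2` penalizes means that drift from the origin, while `exp(log_var) - 1 - log_var` is zero at unit variance and grows when the variance either collapses or inflates.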