Suppose that the true density of a random variable $x$ is $p(x)$. Since $p(x)$ is unknown, we try to come up with an approximation $q(x)$. The KL divergence is then a good measure of the mismatch between the distributions $p$ and $q$.
$$ \begin{align*} \text{KL divergence:}\quad KL(p||q) = \int p(x)\log \dfrac{p(x)}{q(x)}dx \end{align*} $$ From the formula we can see that the KL divergence is a weighted average, with weights $p(x)$, of the error induced by the approximation, $\log p(x) - \log q(x)$.
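As a quick numerical sketch, here is the discrete analogue of this formula (the distributions `p` and `q` below are made up for illustration); it also shows that the KL divergence is not symmetric in its arguments:

```python
import numpy as np

# Made-up discrete "true" distribution p and approximation q
p = np.array([0.8, 0.1, 0.1])
q = np.array([1/3, 1/3, 1/3])

# Discrete KL divergence: sum_x p(x) * log(p(x) / q(x))
kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))

print(kl_pq)  # ~0.46, nonnegative
print(kl_qp)  # ~0.51, not equal to KL(p||q): KL divergence is not symmetric
```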
How do we quantify the amount of information that some data $x$ contains? If the data is largely expected, then it tells us nothing new. But if it is rare, then observing it has some value. In this sense, we can think of the amount of information as a “degree of surprise”, and define
$$ \text{information content of data $x$:}\quad h(x) = -\log p(x) $$ where the logarithm ensures $h(x,y)=h(x)+h(y) \Leftrightarrow p(x,y)=p(x)p(y)$, and the negative sign makes $h(x)\geq 0$.
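For a quick illustration (using base-$2$ logarithms so that information is measured in bits, with made-up probabilities), a highly probable outcome carries little information while a rare one carries much more:
$$ h(x) = -\log_2 0.5 = 1\ \text{bit} \qquad\text{vs.}\qquad h(x') = -\log_2 0.01 \approx 6.64\ \text{bits} $$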
For observed data $\mathbf{x}$, we might posit the existence of unobserved data $\mathbf{z}$ and include it in a model $p(\mathbf{x,z}\mid \theta)$. This is called a latent variable model. The question is: why bother? It turns out that in many cases, learning $\theta$ from the marginal log likelihood $p(\mathbf{x}\mid \theta)$ is hard, whereas learning from the joint likelihood of the complete data set, $p(\mathbf{x,z}\mid \theta)$, is relatively easy. The GMM is one such case.
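To see where the difficulty comes from, note that the marginal likelihood sums (or integrates) over $\mathbf{z}$, so the logarithm sits outside the sum and cannot act directly on the joint, whereas the complete-data log likelihood lets the logarithm act on $p(\mathbf{x,z}\mid \theta)$ directly:
$$ \log p(\mathbf{x}\mid \theta) = \log \sum_{\mathbf{z}} p(\mathbf{x,z}\mid \theta) \qquad \text{vs.} \qquad \log p(\mathbf{x,z}\mid \theta) $$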
## Mixtures of Gaussians (GMM)

### GMM as a joint distribution

Suppose a random vector $\mathbf{x}$ follows a $K$-component Gaussian mixture distribution,
$$ p(\mathbf{x}) = \sum_{k=1}^K \pi_k N(\mathbf{x}\mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) $$ Knowing the distribution means we have complete information about the set of parameters $\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k$ for all $k$. Let us say that the parameters $\pi_k$ are hidden, and instead we have a random variable $\mathbf{z}$ with a $1$-of-$K$ coding, where exactly one of its $K$ elements (say $z_k$) is $1$ while all the others are $0$.
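To make the role of $\mathbf{z}$ concrete, here is a minimal sketch of ancestral sampling from the joint $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{z})\,p(\mathbf{x}\mid \mathbf{z})$, with made-up parameters for a one-dimensional, $K=3$ mixture: first draw a one-hot $\mathbf{z}$ with $p(z_k=1)=\pi_k$, then draw $\mathbf{x}$ from the selected Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters of a 1-D, K=3 Gaussian mixture
pi = np.array([0.5, 0.3, 0.2])      # mixing coefficients, p(z_k = 1) = pi_k
mu = np.array([-2.0, 0.0, 3.0])     # component means
sigma = np.array([0.5, 1.0, 0.8])   # component standard deviations

def sample_joint(n):
    """Ancestral sampling from p(x, z) = p(z) p(x | z)."""
    # Draw component indices k with probability pi_k (equivalent to one-hot z)
    k = rng.choice(len(pi), size=n, p=pi)
    # Given z_k = 1, draw x from N(mu_k, sigma_k^2)
    x = rng.normal(mu[k], sigma[k])
    # Return x together with the one-hot encoding of z
    z = np.eye(len(pi))[k]
    return x, z

x, z = sample_joint(5)
print(x)  # 5 samples from the mixture
print(z)  # corresponding latent one-hot vectors
```

Discarding `z` from these joint samples leaves samples whose marginal distribution is exactly the mixture density above, which is the sense in which $\mathbf{z}$ is latent.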