Probabilistic Machine Learning

Interpretation of MLE in terms of KL divergence

Viewed as parametric density approximation, MLE minimizes the KL divergence

Kang Gyeonghun
Suppose that the true density of a random variable $x$ is $p(x)$. Since this is unknown, we can try to come up with an approximation $q(x)$. Then the KL divergence is a good measure of the mismatch between the distributions $p$ and $q$: $$ \begin{align*} \text{KL divergence:}\quad KL(p||q) = \int p(x)\log \dfrac{p(x)}{q(x)}dx \end{align*} $$ From the formula we can see that the KL divergence is a weighted average, with weight $p(x)$, of the error induced by the approximation ($\log p(x) - \log q(x)$).
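To make the connection to MLE concrete, here is a minimal numerical sketch (assuming NumPy/SciPy and an arbitrary Gaussian example of my own choosing): since $E_p[\log p(x)]$ does not depend on $q$, minimizing a Monte Carlo estimate of $KL(p\|q)$ over the parameters of $q$ is the same as maximizing the average log likelihood of samples drawn from $p$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical example: true density p = N(0, 1), approximation q = N(mu, 1).
# KL(p||q) = E_p[log p(x)] - E_p[log q(x)]; only the second term depends on mu,
# so minimizing the estimated KL over mu is exactly MLE on samples from p.
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)   # draws from p

def neg_avg_log_q(mu, x=samples):
    """Monte Carlo estimate of -E_p[log q(x)] for q = N(mu, 1)."""
    return -norm.logpdf(x, loc=mu, scale=1.0).mean()

# The MLE of mu is the sample mean, and it also minimizes the estimated KL
# (up to the constant E_p[log p(x)], which does not involve mu).
mus = np.linspace(-2, 2, 201)
best_mu = mus[np.argmin([neg_avg_log_q(m) for m in mus])]
print(best_mu, samples.mean())   # both close to 0
```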

Note on Kullback-Leibler Divergence

How much of a loss (i.e. extra coding length) you bear when you approximate

Kang Gyeonghun
How do we quantify the amount of information that some data $x$ contains? If the data is pretty much expected, it tells us nothing new. But if it is very rare, it carries some value. In this sense, we can think of the amount of information as a "degree of surprise", and define $$ \text{information content of data $x$:}\quad h(x) = -\log p(x) $$ where the logarithm ensures $h(x,y)=h(x)+h(y) \Leftrightarrow p(x,y)=p(x)p(y)$, and the negative sign makes $h(x)\geq 0$.
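A minimal sketch of this definition (assuming NumPy, and using log base 2 so information is measured in bits; the probabilities are made-up examples):

```python
import numpy as np

def information_content(p):
    """h(x) = -log p(x): rarer events (smaller p) carry more information."""
    return -np.log2(p)   # base 2 => measured in bits

# A fair-coin outcome (p = 0.5) carries 1 bit; a rare event (p = 1/1024) carries 10 bits.
print(information_content(0.5))       # 1.0
print(information_content(1 / 1024))  # 10.0

# Additivity for independent events: h(x, y) = h(x) + h(y) when p(x, y) = p(x) p(y).
px, py = 0.5, 0.25
print(information_content(px * py), information_content(px) + information_content(py))
```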

EM Algorithm for Latent Variable Models

At least you get a pretty tight lower bound

Kang Gyeonghun
For observed data $\mathbf{x}$, we might posit the existence of unobserved data $\mathbf{z}$ and include it in the model $p(\mathbf{x,z}\mid \theta)$. This is called a latent variable model. The question is, why bother? It turns out that in many cases, learning $\theta$ from the marginal log likelihood $p(\mathbf{x}\mid \theta)$ is hard, whereas learning from the joint likelihood of the complete data set, $p(\mathbf{x,z}\mid \theta)$, is relatively easy. GMM is one such case.
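A minimal numerical sketch of the contrast (assuming NumPy/SciPy and a hypothetical two-component 1-D GMM): the marginal log likelihood has the sum over $\mathbf{z}$ inside the logarithm, which couples all the parameters, while the complete-data log likelihood decomposes term by term.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

# Hypothetical 1-D mixture of two Gaussians with parameters theta = (pi, mu, sigma).
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 0.5])

x = np.array([-1.5, 2.8, 3.1])   # observed data
z = np.array([0, 1, 1])          # latent component labels (unobserved in practice)

# Marginal log likelihood log p(x | theta): the sum over z sits *inside* the log,
# so there is no closed-form maximizer.
log_marginal = logsumexp(
    np.log(pi) + norm.logpdf(x[:, None], mu, sigma), axis=1
).sum()

# Complete-data log likelihood log p(x, z | theta): the log passes through the
# product, everything splits by component, and MLE reduces to counting/averaging.
log_complete = (np.log(pi[z]) + norm.logpdf(x, mu[z], sigma[z])).sum()

print(log_marginal, log_complete)
```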

Mixtures of Gaussians and EM algorithm

When the typical Maximum-Likelihood approach leads you astray

Kang Gyeonghun
Mixtures of Gaussians (GMM)

GMM as a joint distribution

Suppose a random vector $\mathbf{x}$ follows a mixture of $K$ Gaussians, $$ p(\mathbf{x}) = \sum_{k=1}^K \pi_k N(\mathbf{x}\mid \boldsymbol{\mu_k, \Sigma_k}) $$ Knowing the distribution means we have complete information about the set of parameters $\pi_k, \boldsymbol{\mu_k}, \boldsymbol{\Sigma_k}$ for all $k$. Let us say that the parameter $\pi_k$ is shrouded, and that instead we have a random variable $\mathbf{z}$ with 1-of-$K$ coding, in which exactly one of the $K$ elements (say $z_k$) is $1$ while all the others are $0$.
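A minimal sketch of this generative view (assuming NumPy and hypothetical parameter values): draw $\mathbf{z}$ first with $p(z_k=1)=\pi_k$, then draw $\mathbf{x}$ from the Gaussian component that $\mathbf{z}$ selects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-component GMM in 2-D: pi_k, mu_k, Sigma_k for k = 1, 2.
pis = np.array([0.4, 0.6])
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]

def sample_gmm(n):
    """Sample via the latent variable: first z (1-of-K), then x | z."""
    ks = rng.choice(len(pis), size=n, p=pis)   # component index per sample
    zs = np.eye(len(pis))[ks]                  # 1-of-K encoding of z
    xs = np.stack([rng.multivariate_normal(mus[k], covs[k]) for k in ks])
    return xs, zs

xs, zs = sample_gmm(5)
print(zs)          # each row has exactly one 1 (the selected component)
print(xs.shape)    # (5, 2)
```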