

Kullback-Leibler (KL) Divergence

Kullback-Leibler (KL) divergence, also known as relative entropy, is a measure of how one probability distribution diverges from a second expected probability distribution. Originating from information theory, KL divergence quantifies the difference between two probability distributions in terms of the number of extra bits required to code samples from one distribution using a code optimized for another distribution.

How is KL Divergence Defined?

Given two probability distributions \(P\) and \(Q\) over the same discrete random variable, the KL divergence of \(Q\) from \(P\) is defined as:
$$
\begin{equation}
D_{KL}(P||Q) = \sum_{i} P(i) \log \left( \frac{P(i)}{Q(i)} \right)
\end{equation}
$$
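In the discrete case the sum can be evaluated directly. Below is a minimal Python sketch; the function name and the base-2 logarithm (which reports the result in bits, matching the coding interpretation above) are illustrative choices rather than part of the definition.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete D_KL(P || Q) in bits (base-2 logarithm)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(i) = 0 contribute nothing to the sum
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # ≈ 0.036 extra bits per symbol when coding P with a code built for Q
```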

For continuous distributions, the sum is replaced by an integral:

$$ \begin{equation}
D_{KL}(P||Q) = \int P(x) \log \left( \frac{P(x)}{Q(x)} \right) dx
\end{equation}
$$
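In practice the integral is usually evaluated through a known closed form rather than numerically. As an example, the sketch below uses the standard closed-form expression for the KL divergence between two univariate Gaussians, reported in nats; the function name is an illustrative choice.

```python
import numpy as np

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form D_KL(P || Q) in nats for univariate Gaussians
    P = N(mu_p, sigma_p^2) and Q = N(mu_q, sigma_q^2)."""
    return (np.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2.0 * sigma_q**2)
            - 0.5)

print(kl_gaussian(0.0, 1.0, 1.0, 2.0))  # ≈ 0.443 nats
```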

Properties of KL Divergence

1. Non-Negativity: \(D_{KL}(P||Q) \ge 0\). This derives from Gibbs’ inequality and is a cornerstone in information theory. The KL divergence equals zero if and only if \(P\) and \(Q\) are the same distribution in the case of discrete random variables, or equal “almost everywhere” for continuous random variables.
2. Asymmetry: An essential property to be aware of is that KL divergence is not symmetric; that is, \(D_{KL}(P||Q) \ne D_{KL}(Q||P)\). This asymmetry indicates that KL divergence is not a true metric in the mathematical sense (both properties are checked numerically in the sketch after this list).
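Both properties are easy to verify numerically. The sketch below assumes SciPy is available and uses `scipy.special.rel_entr`, which computes the elementwise terms \(P(i)\log(P(i)/Q(i))\) in nats:

```python
import numpy as np
from scipy.special import rel_entr

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

d_pq = rel_entr(p, q).sum()  # D_KL(P || Q)
d_qp = rel_entr(q, p).sum()  # D_KL(Q || P)

print(d_pq, d_qp)            # both non-negative, but not equal
print(rel_entr(p, p).sum())  # 0.0: the divergence of a distribution from itself
```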


Significance of KL Divergence

KL divergence finds extensive applications across various domains:

  1. Information Theory: It measures the inefficiency of assuming a distribution Q when the true distribution is P.
  2. Machine Learning: In unsupervised learning, especially in Variational Autoencoders (VAEs), a KL divergence term in the training objective keeps the learned latent distribution close to a chosen prior, typically a standard normal (a minimal sketch of this term follows the list).
  3. Natural Language Processing: For tasks like topic modeling, KL divergence provides insights into the similarity between different document distributions.
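To make the VAE case concrete: the usual setup pairs a Gaussian encoder \(q(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))\) with a standard normal prior, for which the KL term has a well-known closed form. The sketch below assumes that setup; `mu` and `log_var` stand in for hypothetical encoder outputs.

```python
import numpy as np

def vae_kl_term(mu, log_var):
    """KL divergence between a diagonal Gaussian N(mu, diag(exp(log_var)))
    and a standard normal prior N(0, I), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# Hypothetical encoder outputs for a 4-dimensional latent space.
mu = np.array([0.1, -0.2, 0.0, 0.3])
log_var = np.array([-0.1, 0.2, 0.0, -0.3])
print(vae_kl_term(mu, log_var))  # small positive penalty pulling q(z|x) toward the prior
```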

Interpretation

A smaller KL divergence indicates that the distributions P and Q are close to each other, while a larger KL divergence signals that they differ substantially. However, due to its asymmetry, the divergence of Q from P might give different insights than the divergence of P from Q.

Limitations

1. Undefined Values: KL divergence is not always finite for pairs of distributions. If there exists an \(i\) for which \(P(i) > 0\) and \(Q(i) = 0\), then \(D_{KL}(P||Q)\) is infinite (illustrated in the sketch after this list).
2. Asymmetry: Since \(D_{KL}(P||Q) \ne D_{KL}(Q||P)\), it’s essential to choose the order meaningfully based on the application.
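A short sketch illustrating both limitations at once, again with SciPy's `rel_entr`; the distributions are made up for the example:

```python
import numpy as np
from scipy.special import rel_entr

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.4, 0.3, 0.3])

print(rel_entr(p, q).sum())  # finite: Q covers every outcome that P can produce
print(rel_entr(q, p).sum())  # inf: Q puts mass on an outcome where P(i) = 0
```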

