# Kullback-Leibler Divergence and Cross-entropy Loss

Science is all about data and theories explaining those data. The theory behind coin tossing says that the probability of heads and tails is the same, and that given $N$ tosses we expect $k$ heads with probability $binomial(trials=N, successes=k)$.
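As a quick sketch of that binomial probability (using only Python's standard library; the name `binomial_pmf` is mine, not from the text):

```python
from math import comb

def binomial_pmf(n, k, p=0.5):
    """Probability of k heads in n tosses of a coin with head probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# For a fair coin, the probability of exactly 5 heads in 10 tosses:
print(binomial_pmf(10, 5))  # 0.24609375
```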

In Bayesian statistics, probability has a nice meaning: it measures how strongly you believe a certain outcome will happen. If this probability is normalized ($p_i > 0$ and $\sum_i p_i = 1$) we can also define a function which measures our surprise when a measurement is "$i$": $Surprise(i) = -\log(p_i)$. The rarer we thought the event was, the more surprised we are by the outcome.
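This surprise function is one line of code (a sketch; `surprise` is an illustrative name):

```python
from math import log

def surprise(p):
    """Surprise of observing an outcome we had assigned probability p."""
    return -log(p)

# A rare outcome (p = 0.01) is far more surprising than a common one (p = 0.5):
print(surprise(0.01))  # ~4.61
print(surprise(0.5))   # ~0.69
```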

The average surprise can then be defined as follows (when $p = q$ this is the entropy):

$\langle Surprise \rangle = \sum_i q_i \cdot (-\log(p_i))$

where $q_i$ is how often the measurements actually gave $i$, and $p_i$ is how strongly we believed (prior to running the experiment) that the measurement would be $i$. It's easy to prove Gibbs' inequality:

$\sum_i q_i \cdot (-\log(p_i)) > \sum_i q_i \cdot (-\log(q_i))$ if there is at least one $p_i \neq q_i$

which says: your average surprise is smaller if your model, your prior distribution, is closer to reality. If you are tossing a biased coin but believe it is fair, your average surprise at the results is bigger than that of someone who knew the coin was biased and used the appropriate distribution. The opposite also holds, of course: someone who thought the coin was biased when it was in fact fair would be more surprised on average, too.
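A quick numerical check of Gibbs' inequality on the biased-coin example (a sketch; the distributions and the name `avg_surprise` are illustrative):

```python
from math import log

def avg_surprise(q, p):
    """Average surprise when outcomes occur with frequencies q
    but we believed they had probabilities p."""
    return sum(qi * -log(pi) for qi, pi in zip(q, p))

q = [0.7, 0.3]        # the coin is actually biased
p_right = [0.7, 0.3]  # prior of someone who knew the bias
p_wrong = [0.5, 0.5]  # prior of someone assuming a fair coin

# Gibbs' inequality: the wrong prior yields a larger average surprise.
print(avg_surprise(q, p_wrong) > avg_surprise(q, p_right))  # True
```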

The Kullback-Leibler Divergence measures exactly that difference:

$D_{KL} = \sum_i q_i \cdot (-\log(p_i)) - \sum_i q_i \cdot (-\log(q_i))$

$D_{KL} = \sum_i q_i \cdot \log(q_i/p_i)$
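The two forms of $D_{KL}$ above can be checked against each other numerically (a sketch; the function names and the example distributions are mine):

```python
from math import log

def kl_divergence(q, p):
    """D_KL(q || p) = sum_i q_i * log(q_i / p_i)."""
    return sum(qi * log(qi / pi) for qi, pi in zip(q, p))

def cross_entropy(q, p):
    """Average surprise of a believer in p when data follow q."""
    return sum(qi * -log(pi) for qi, pi in zip(q, p))

def entropy(q):
    """Average surprise when the model matches reality."""
    return cross_entropy(q, q)

q = [0.7, 0.3]  # actual frequencies
p = [0.5, 0.5]  # believed probabilities

# The two forms agree: D_KL is the cross-entropy minus the entropy.
print(abs(kl_divergence(q, p) - (cross_entropy(q, p) - entropy(q))) < 1e-9)  # True
```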

The $D_{KL}$ can be large for two reasons: we are using the wrong family of distributions (e.g. Poisson instead of binomial), or the right family but with the wrong parameters (e.g. the binomial with the wrong probability of success).

Given that $D_{KL}$ measures how much more our measurements surprise us than they would under the true distribution, it's a good way of measuring how wrong our model is (usually we assume we have the right family of distributions and want to find the best values of its parameters).

Usually, we simply take the cross-entropy $\sum_i q_i \cdot (-\log(p_i))$, i.e. our actual average surprise given the observed frequencies $q_i$ of our data and the probabilities $p_i$ assigned by our model. Since the entropy term $\sum_i q_i \cdot (-\log(q_i))$ depends only on the data, not on the model, minimizing the cross-entropy is equivalent to minimizing $D_{KL}$:

$H(q, p) = \sum_i q_i \cdot (-\log(p_i))$
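This is exactly the loss used in classification. A minimal sketch, assuming a one-hot target and a model's predicted probabilities (the names and numbers are illustrative):

```python
from math import log

def cross_entropy_loss(q, p):
    """Cross-entropy H(q, p): average surprise of the model p on data q.
    Terms with q_i = 0 are skipped, since they contribute nothing."""
    return sum(qi * -log(pi) for qi, pi in zip(q, p) if qi > 0)

# One-hot target (the true class is class 0) and predicted probabilities:
target = [1.0, 0.0, 0.0]
predicted = [0.8, 0.1, 0.1]

# With a one-hot target the loss reduces to -log of the probability
# the model assigned to the true class:
print(cross_entropy_loss(target, predicted))  # ~0.223
```

The better the model's probability for the true class, the smaller the surprise, and the smaller the loss.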

Below are some books on the subject, plus the autobiography of Edward Thorp, a must-read if you are interested in entropy and information!