Kullback-Leibler Divergence and Cross-entropy Loss


Science is all about data and theories explaining those data. The theory behind coin tossing says that the probability of heads and tails is the same, and that given N tosses we expect k heads with a probability given by Binomial(trials=N, successes=k).
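
To make this concrete, here is a tiny sketch of that binomial probability, using only Python's standard library (the numbers are chosen purely for illustration):

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of seeing exactly 5 heads in 10 tosses of a fair coin
print(binomial_pmf(5, 10))  # ~0.246
```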

In Bayesian statistics, probability has a nice meaning: it expresses how strongly you believe a certain outcome will happen. If these probabilities are normalized (p_i > 0 and \sum_i p_i = 1), we can also define a function that measures our surprise when the measurement turns out to be "i": Surprise(i) = -log(p_i). The rarer we thought the event was, the more surprised we are by the outcome.
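
As a tiny sketch (the probabilities below are invented for illustration), the surprise is just the negative log of the probability we assigned to the outcome:

```python
from math import log

def surprise(p_i):
    """Surprise at observing an outcome to which we assigned probability p_i."""
    return -log(p_i)

print(surprise(0.9))   # ~0.105 -- an expected outcome is barely surprising
print(surprise(0.01))  # ~4.605 -- a rare one is very surprising
```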

The entropy can be defined as the average surprise:

<surprise> = \sum_i q_i \cdot (-log(p_i))
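
A minimal sketch of this average, with a made-up prior p and made-up observed frequencies q for a coin:

```python
from math import log

def average_surprise(q, p):
    """Average surprise: q are the observed frequencies, p our prior beliefs."""
    return sum(q_i * -log(p_i) for q_i, p_i in zip(q, p))

# We believed the coin was fair...
p_prior = [0.5, 0.5]
# ...but heads actually came up 70% of the time
q_observed = [0.7, 0.3]

print(average_surprise(q_observed, p_prior))     # ~0.693
print(average_surprise(q_observed, q_observed))  # ~0.611 -- the "right" model surprises us less
```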

where q_i is how often the measurements actually gave i, and p_i how strongly we believed (prior to running the experiment) that the measurement could be i. It's easy to prove Gibbs' inequality:

\sum_i q_i \cdot (-log(p_i)) > \sum_i q_i \cdot (-log(q_i)) if there is at least one p_i \neq q_i
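
A quick numerical sanity check of the inequality (a sketch with randomly generated distributions, not a proof):

```python
import random
from math import log

def cross_entropy(q, p):
    """Average surprise of someone who believes p when the true frequencies are q."""
    return sum(q_i * -log(p_i) for q_i, p_i in zip(q, p))

def random_distribution(n):
    """A random normalized distribution over n outcomes."""
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

random.seed(0)
for _ in range(5):
    q = random_distribution(4)
    p = random_distribution(4)
    # The cross-entropy H(q, p) is never smaller than the entropy H(q, q)
    assert cross_entropy(q, p) >= cross_entropy(q, q)
```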

which says: your average surprise is smaller the closer your model (your prior distribution) is to the "reality". If you are tossing a biased coin but believe it is fair, your average surprise at the results is bigger than that of someone who knew the coin was biased and used the appropriate distribution. The opposite holds as well, of course: someone who thought the coin was biased when in fact it was not would also be more surprised.

The Kullback-Leibler Divergence measures exactly that difference:

D_{KL} = \sum_i q_i \cdot (-log(p_i)) - \sum_i q_i \cdot (-log(q_i))

D_{KL} =\sum_i q_i \cdot log(q_i/p_i)
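
With the same toy coin as above (numbers purely illustrative), the extra surprise is:

```python
from math import log

def kl_divergence(q, p):
    """D_KL(q || p) = sum_i q_i * log(q_i / p_i): the extra average surprise from believing p instead of q."""
    return sum(q_i * log(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0)

# A coin that actually lands heads 70% of the time, modelled as fair
q_observed = [0.7, 0.3]
p_model = [0.5, 0.5]

print(kl_divergence(q_observed, p_model))     # ~0.082 (= 0.693 - 0.611, cross-entropy minus entropy)
print(kl_divergence(q_observed, q_observed))  # 0.0 -- no extra surprise if the model matches reality
```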

The D_{KL} can be large for two reasons: we are using the wrong distribution (e.g. a Poisson instead of a binomial), or the "right" distribution but with the wrong parameters (e.g. the binomial but with the wrong probability of success).
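
To make the two failure modes concrete, here is a sketch, assuming scipy.stats is available, comparing the KL divergence from a binomial "truth" to a Poisson with the right mean and to a binomial with the wrong success probability (all numbers are illustrative):

```python
from math import log
from scipy.stats import binom, poisson

def kl_divergence(q, p):
    """D_KL(q || p) for discrete distributions given as aligned lists of probabilities."""
    return sum(q_i * log(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0)

ks = range(11)  # possible numbers of heads in 10 tosses

# "True" distribution: 10 tosses of a fair coin
q_true = [binom.pmf(k, 10, 0.5) for k in ks]

# Wrong family: a Poisson with the right mean (5 heads on average)
p_poisson = [poisson.pmf(k, 5) for k in ks]

# Right family, wrong parameter: a binomial with success probability 0.7
p_biased = [binom.pmf(k, 10, 0.7) for k in ks]

print(kl_divergence(q_true, p_poisson))  # penalty for using the wrong family
print(kl_divergence(q_true, p_biased))   # penalty for using the wrong parameter
```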

Given that D_{KL} measures how much more our measurements surprise us, it's a good way of measuring how wrong our model is (usually we assume we have the right distribution and want to find the best values of its parameters).

Usually, we simply minimize the cross-entropy \sum_i q_i \cdot (-log(p_i)), i.e. our actual average surprise given the observed frequencies q_i of our data and the probabilities p_i assigned by our model. The two objectives differ only by the entropy of the data, \sum_i q_i \cdot (-log(q_i)), which does not depend on the model, so minimizing the cross-entropy is the same as minimizing D_{KL}:

H(q, p) =\sum_i q_i \cdot (-log(p_i))
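
In a classification setting q is typically a one-hot label and p the model's predicted probabilities; a minimal sketch (the predictions are invented for illustration):

```python
from math import log

def cross_entropy_loss(q, p):
    """H(q, p) = sum_i q_i * (-log(p_i)); q: target distribution, p: predicted probabilities."""
    return sum(q_i * -log(p_i) for q_i, p_i in zip(q, p))

# One-hot target: the true class is the second one
target = [0.0, 1.0, 0.0]

# A confident, correct prediction gives a small loss...
print(cross_entropy_loss(target, [0.05, 0.90, 0.05]))  # ~0.105

# ...while a confident, wrong prediction gives a large one
print(cross_entropy_loss(target, [0.90, 0.05, 0.05]))  # ~3.0
```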

Below are some books on the subject, plus the autobiography of Edward Thorp, a must-read if you are interested in entropy and information!

