Science is all about data and theories explaining those data. The theory behind coin tossing says that the probability of tail and head is the same, and that given $N$ tosses we expect $k$ heads with a probability $P(k) = \binom{N}{k} \left(\frac{1}{2}\right)^N$.
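As a quick sketch of that formula, here is the fair-coin probability computed directly from the binomial coefficient (function names are mine, for illustration):

```python
from math import comb

def prob_heads(k, n):
    """Probability of exactly k heads in n fair tosses: C(n, k) / 2**n."""
    return comb(n, k) / 2**n

# Ten tosses: five heads is the most likely count, yet happens only ~25% of the time.
print(prob_heads(5, 10))  # 0.24609375
```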
In Bayesian statistics, probability has a nice meaning – it measures how strongly you believe a certain outcome will happen. If this probability is normalized ($p_i \ge 0$ and $\sum_i p_i = 1$) we can also define a function which measures our surprise when a measurement is "$i$": $s_i = -\log p_i$. The more we thought the event was rare, the more we are surprised by the outcome.
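The surprise function is short enough to state in code (a minimal sketch; the example probabilities are mine):

```python
import math

def surprise(p_i):
    """Surprise s_i = -log(p_i): the rarer we thought the event, the bigger s_i."""
    return -math.log(p_i)

print(surprise(1.0))   # 0.0 — a certain event is not surprising at all
print(surprise(0.01))  # ~4.6 — a 1-in-100 event is very surprising
```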
The entropy can be defined as the average surprise:

$$\langle s \rangle = -\sum_i q_i \log p_i,$$

where $q_i$ is how often the measurements actually gave $i$, and $p_i$ how strongly we believed (prior to running the experiment) that the measurement could be $i$. It's easy to prove Gibbs' inequality:

$$-\sum_i q_i \log p_i \ge -\sum_i q_i \log q_i,$$

with strict inequality if there is at least one $p_i \neq q_i$,
which says: your average surprise is smaller if your model – your prior distribution – is closer to the "reality". If you model a biased coin as fair, your average surprise at the results is bigger than the surprise of someone who knew the coin was biased and used the appropriate distribution. The opposite is true as well, of course – if someone thought the coin was biased but it was not.
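Gibbs' inequality is easy to check numerically for the biased-coin example (the 70/30 bias is an illustrative choice of mine):

```python
import math

def avg_surprise(q, p):
    """Average surprise -sum_i q_i * log(p_i): outcomes occur with
    frequencies q, but we assigned them probabilities p."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p))

q = [0.7, 0.3]      # actual frequencies of a coin biased 70/30
fair = [0.5, 0.5]   # belief: the coin is fair
right = [0.7, 0.3]  # belief: the correct bias

# Gibbs' inequality: the correct model minimizes the average surprise.
print(avg_surprise(q, fair))   # ~0.693 (= log 2)
print(avg_surprise(q, right))  # ~0.611 — smaller, as the inequality promises
```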
The Kullback–Leibler divergence measures exactly that difference:

$$D_{KL}(q \,\|\, p) = \sum_i q_i \log \frac{q_i}{p_i} = -\sum_i q_i \log p_i + \sum_i q_i \log q_i \ge 0.$$
The $D_{KL}$ can be big for two reasons: we are using the wrong distribution (e.g. a Poisson instead of a binomial) or the "right" distribution but with the wrong parameters (e.g. the binomial but with the wrong probability of success).
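The "wrong parameters" case can be sketched as follows: the same binomial family, but evaluated with the wrong probability of success (the concrete parameters are mine, for illustration):

```python
import math
from math import comb

def binom(n, p):
    """Probability of k = 0..n successes for a binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def kl(q, p):
    """D_KL(q || p) = sum_i q_i * log(q_i / p_i); zero only when q == p."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = binom(10, 0.5)             # data generated by a fair coin
print(kl(q, q))                # 0.0 — right distribution, right parameter
print(kl(q, binom(10, 0.3)))   # > 0 — right distribution, wrong parameter
```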
Given that $D_{KL}$ measures how much "more" our measurements surprise us, it's a good way of measuring how wrong our model is (usually we think we have the right distribution and want to find the best values of the parameters).
Usually, we simply take the cross-entropy $H(q, p)$, i.e. our actual average surprise given the frequency $q_i$ of our data and the expected frequency $p_i$ given our model:

$$H(q, p) = -\sum_i q_i \log p_i = H(q) + D_{KL}(q \,\|\, p).$$
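The decomposition of the cross-entropy into the data's own entropy plus the KL penalty can be verified directly (the 70/30 frequencies are again an illustrative choice):

```python
import math

def entropy(q):
    """H(q) = -sum_i q_i * log(q_i): the irreducible average surprise."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def cross_entropy(q, p):
    """H(q, p) = -sum_i q_i * log(p_i): our actual average surprise."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)

def kl(q, p):
    """D_KL(q || p) = sum_i q_i * log(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.7, 0.3]  # observed frequencies
p = [0.5, 0.5]  # model: fair coin

# H(q, p) = H(q) + D_KL(q || p): minimizing cross-entropy in p
# is the same as minimizing the KL divergence.
print(cross_entropy(q, p) - (entropy(q) + kl(q, p)))  # ~0.0
```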