Finally, I have come out of the forest of introductory chapters. In mission 3.15, MacKay summons us to slay the mythical p-value. The weapons we carry are simple but sharp, and the insights look very promising.
This mission leads us to confront the p-value and its meaning when we compare two different models. In this post we go carefully through the exercise's solution and its insights. First we recalculate the statistician's conclusion, and then we move on to MacKay's approach of comparing two models by how likely the data is under each of them. In a way, this post is also meant to capture part of the essence of chapters 1 and 3.
Before reading on I suggest you work on the exercise yourself (for 15 minutes, even if you can't solve it):
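In the exercise, a coin spun 250 times comes up heads 140 times and tails 110 times, and a statistician comments: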
‘If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%’
We interpret this sentence as meaning that 7% bounds the probability of a fair coin giving 140 or more heads in 250 spins, together with the symmetric result of 140 or more tails (110 or fewer heads). To check this conclusion, we first need a way to calculate the probability of getting a given number of heads in a given number of coin spins. This is when we summon Mr. Binomial and the probability distribution that bears his name.
The binomial distribution is the first thing MacKay introduces in his book. It is very useful when we have a random variable that takes a binary value (e.g. heads or tails) in one or more independent experiments. The binomial function gives us the probability of getting Fh heads in F spins, where each spin comes up heads with probability ph:
$$P(F_h \mid p_h, F) = \binom{F}{F_h}\, p_h^{F_h}\, (1 - p_h)^{F - F_h}$$
The intuition behind this distribution is that we count all the ways (F choose Fh) in which Fh heads can be distributed over F spins and multiply that count by the probability of any one of these specific outcomes. All outcomes with Fh heads in F spins have the same probability because the spins are independent.
If it is still confusing, think of 4 fair coin spins and getting 2 heads. We have tthh, thth, thht, htth, htht, hhtt, and all of them have the same probability 0.5^4 = 0.0625 of happening. There are 6 of them (4 choose 2), so we multiply 0.0625 by 6 and get 0.375 (6/16), which is the probability of getting 2 heads in 4 spins.
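The four-spin example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `binomial_pmf` is my own choice for illustration, not MacKay's notation.

```python
from math import comb


def binomial_pmf(heads: int, spins: int, p_heads: float) -> float:
    """Probability of exactly `heads` heads in `spins` independent spins."""
    # Number of distinct head/tail sequences with that many heads.
    ways = comb(spins, heads)
    # Probability of any single one of those sequences.
    prob_one_sequence = p_heads**heads * (1 - p_heads) ** (spins - heads)
    return ways * prob_one_sequence


# 6 sequences, each with probability 0.5**4 = 0.0625
print(binomial_pmf(2, 4, 0.5))  # 0.375
```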
If we want the probability of getting Fh or more heads, we need to sum all the probabilities from Fh to F. Since we have F = 250, Fh = 140, and ph = (1 - ph) = 0.5 for a fair coin, the probability of getting 140 or more heads is given by:
$$P(F_h \ge 140 \mid p_h = 0.5, F = 250) = \sum_{k=140}^{250} \binom{250}{k}\, 0.5^{250} \approx 0.033$$
By symmetry (or by calculation, if you are a true skeptic) the probability of getting 110 or fewer heads (that is, 140 or more tails) is the same. Therefore, the total probability of this "extreme result" is 0.0664: less than 7%, as the statistician said.
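For the true skeptics, here is a small sketch of the tail calculation. It assumes a fair coin and a threshold above half the spins, so the two tails do not overlap; the function names are mine. Python's arbitrary-precision integers let us count the favourable sequences exactly before dividing.

```python
from math import comb


def fair_upper_tail(min_heads: int, spins: int) -> float:
    """P(heads >= min_heads) for a fair coin, counted with exact integers."""
    favourable = sum(comb(spins, k) for k in range(min_heads, spins + 1))
    # Every sequence of a fair coin has probability 1 / 2**spins.
    return favourable / 2**spins


def fair_two_sided(min_heads: int, spins: int) -> float:
    """Upper tail plus its mirror image (at least `min_heads` tails).

    Assumes min_heads > spins / 2 so the two tails are disjoint.
    """
    return 2 * fair_upper_tail(min_heads, spins)


print(fair_two_sided(140, 250))  # the "as extreme as that" probability
print(fair_two_sided(141, 250))  # the same with the bars moved out by one
```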
Figure: Binomial distribution for 250 spins of a fair coin. Bars at 110 and 140.
If we move the bars to 141 (and 109), we get a probability of 0.0497. At this value, the statistician would be a happy man and would reject the null hypothesis (that the coin is fair) at the 5% significance level.
If we assume a fair-coin model, it would produce data at least this extreme (140 or more heads, or 140 or more tails) less than 7% of the time. But notice that we started by assuming a model, and here is precisely MacKay's point: why not calculate which model the data favors instead of fixing a model a priori? After all, what we want to know here is:
Given the data, what model is more likely (fair or biased coin)?
This is a different question than:
Given a fair coin, what is the probability it would produce the data?
Think about this for a moment as this is a very important distinction.
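Now let's move to the Bayesian point of view and compare how well each model explains the data. We have two hypotheses, the data, and the variables introduced earlier for the outcome of the spins:
H0: the coin is fair (ph = 0.5)
H1: the coin is biased (ph unknown)
D: Fh = 140 heads in F = 250 spins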
Please note that the data D is the single outcome of getting exactly 140 heads in 250 spins and it does not represent the idea of 140 or more heads.
With Bayes' theorem we can calculate the probability of a hypothesis in terms of three other probabilities: the likelihood of the data given the hypothesis, the prior probability of the hypothesis itself, and the overall probability of the data:
$$P(H_i \mid D) = \frac{P(D \mid H_i)\, P(H_i)}{P(D)}$$
This may look like a lot of information to input in order to compare the models. But MacKay's point, in the words of the great man, is exactly that:
You can't do inference without making assumptions**
We assume a uniform prior over the two hypotheses and give 1/2 to each. This represents the fact that we have no information about which hypothesis is more likely (biased or fair) and that, together, these two hypotheses cover the whole space of possible hypotheses (either the coin is biased or it is fair).
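The probability of the data P(D) is the same for both models, so to compare the models we compute the ratio of their posterior probabilities. With equal priors, this reduces to the ratio of the likelihoods:
$$\frac{P(H_1 \mid D)}{P(H_0 \mid D)} = \frac{P(D \mid H_1)\, P(H_1)}{P(D \mid H_0)\, P(H_0)} = \frac{P(D \mid H_1)}{P(D \mid H_0)}$$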
This ratio represents how much the data favours hypothesis 1 that the coin is biased. Beware that the ratio is not necessarily a value between 0 and 1. The data can favor one hypothesis over the other by orders of magnitude as we will see below.
There are two likelihoods to compute, one for each hypothesis. We start with the one for hypothesis 1, which requires a bit more maths. The first thing to notice is that we can compute this conditional probability using the sum rule of probability theory. Since ph is a real value, the sum becomes an integral:
$$P(D \mid H_1) = \int_0^1 P(D \mid p_h, H_1)\, P(p_h \mid H_1)\, dp_h$$
The first probability inside the integral is the probability of getting the data given a coin with probability ph. We computed this above; it is simply the binomial probability, where Ft = F - Fh is the number of tails:
$$P(D \mid p_h, H_1) = \binom{F}{F_h}\, p_h^{F_h}\, (1 - p_h)^{F_t}$$
We then assume a uniform probability over the bias of the coin:
$$P(p_h \mid H_1) = 1, \qquad 0 \le p_h \le 1$$
This gives us
$$P(D \mid H_1) = \binom{F}{F_h} \int_0^1 p_h^{F_h}\, (1 - p_h)^{F_t}\, dp_h$$
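If we introduce the Beta integral
$$\int_0^1 p^{a}\, (1 - p)^{b}\, dp = \frac{\Gamma(a + 1)\, \Gamma(b + 1)}{\Gamma(a + b + 2)}$$
we can reformulate our probability as
$$P(D \mid H_1) = \binom{F}{F_h} \frac{\Gamma(F_h + 1)\, \Gamma(F_t + 1)}{\Gamma(F + 2)}$$
The Gamma function generalizes the factorial; for a non-negative integer n it is
$$\Gamma(n + 1) = n!$$
Finally, we can write our probability in terms of F, Fh and Ft (the number of tails, F - Fh), and we call it the likelihood of hypothesis 1 given the data:
$$P(D \mid H_1) = \binom{F}{F_h} \frac{F_h!\, F_t!}{(F + 1)!}$$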
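The Beta integral can be sanity-checked numerically for small exponents. The sketch below assumes integer exponents, so that the Gamma functions reduce to factorials, and uses a plain midpoint rule instead of a numerical library; the function names are mine.

```python
from math import factorial


def beta_integral_numeric(a: int, b: int, steps: int = 100_000) -> float:
    """Midpoint-rule estimate of the integral of p**a * (1 - p)**b over [0, 1]."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * width  # midpoint of the i-th slice
        total += p**a * (1 - p) ** b
    return total * width


def beta_integral_exact(a: int, b: int) -> float:
    """Closed form a! b! / (a + b + 1)! for non-negative integers a and b."""
    return factorial(a) * factorial(b) / factorial(a + b + 1)


# Both values should agree to many decimal places.
print(beta_integral_numeric(2, 3), beta_integral_exact(2, 3))
```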
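Next we move on to the likelihood of hypothesis 0 (fair coin). With a fair coin we have ph = 0.5, and the likelihood is then
$$P(D \mid H_0) = \binom{F}{F_h} \frac{1}{2^{F}}$$
We can then compute the likelihood ratio as
$$\frac{P(D \mid H_1)}{P(D \mid H_0)} = \frac{F_h!\, F_t!\, 2^{F}}{(F + 1)!} = \frac{140!\, 110!\, 2^{250}}{251!}$$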
Interestingly, we arrive at 0.4767, a number that tells us that the data gives approximately 2:1 evidence in favour of the coin being fair!
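The factorials involved are far too large to multiply naively in floating point, so a convenient way to reproduce this number is to work with logarithms. The following is a sketch under the same assumptions as above (uniform prior over ph, equal priors on the hypotheses); the function names are my own, and `math.lgamma` is used for log-factorials.

```python
from math import exp, lgamma, log


def log_factorial(n: int) -> float:
    """log(n!) via the Gamma function, since Gamma(n + 1) = n!."""
    return lgamma(n + 1)


def ratio_uniform_prior(heads: int, spins: int) -> float:
    """P(D | biased, uniform prior on ph) / P(D | fair).

    Any binomial coefficient is common to both likelihoods and cancels,
    leaving heads! * tails! * 2**spins / (spins + 1)!.
    """
    tails = spins - heads
    log_ratio = (
        log_factorial(heads)
        + log_factorial(tails)
        + spins * log(2)
        - log_factorial(spins + 1)
    )
    return exp(log_ratio)


print(ratio_uniform_prior(140, 250))  # roughly 0.48, about 2:1 for the fair coin
```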
One could argue that we had to set subjective values for the prior of the hypotheses and for the prior over the probability ph of the biased coin. Regarding the first, we explain above the rationale behind it. For the prior over ph, however, one could say that it shouldn't be uniform because we expect the coin to be biased (or we have a bias over its bias).
What is interesting, though, is that we can actually calculate the likelihood ratio for the best possible prior, that is, the prior that gives us the maximum likelihood ratio. Intuitively, it is the prior that puts all its probability on the value of ph that best explains the outcome we observed and none on any other value. In other words, the prior that assigns probability 1 to ph = 140/250 and 0 to all other values. That is a very unreasonable prior, but it helps make a point.
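With the best possible prior, the likelihood for hypothesis 1 is simply
$$P(D \mid H_1) = \binom{F}{F_h} \left(\frac{F_h}{F}\right)^{F_h} \left(1 - \frac{F_h}{F}\right)^{F - F_h} = \binom{250}{140}\, 0.56^{140}\, 0.44^{110}$$
Therefore our likelihood ratio becomes
$$\frac{P(D \mid H_1)}{P(D \mid H_0)} = \frac{0.56^{140}\, 0.44^{110}}{0.5^{250}} \approx 6.1$$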
With the best possible prior, the likelihood ratio is at most 6:1.
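This upper bound is easy to reproduce. A minimal sketch, assuming the prior puts all its mass on ph = Fh/F and that 0 < Fh < F so the logarithms are defined; the function name is mine.

```python
from math import exp, log


def ratio_best_prior(heads: int, spins: int) -> float:
    """Largest possible P(D | biased) / P(D | fair).

    Assumes the prior is concentrated on ph = heads / spins and that
    0 < heads < spins. Any binomial coefficient cancels in the ratio.
    """
    tails = spins - heads
    p_hat = heads / spins
    log_ratio = heads * log(p_hat) + tails * log(1 - p_hat) + spins * log(2)
    return exp(log_ratio)


print(ratio_best_prior(140, 250))  # about 6, the most the data can say
```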
Thinking about the mythical p-value of 5% (or 20:1 against the null hypothesis), we can then ask a final question:
What number of heads would give us 20:1 on the coin being biased?
The answer is Fh = 145, which gives us approximately 25:1. But please note that this answer relies on the (very unreasonable) best-case prior above. For a uniform prior, Fh = 145 would only move the ratio from 0.48:1 to about 1.9:1. Furthermore, with a uniform prior it would take Fh = 152 heads to get about 28:1, above the mythical 20:1. I leave it as an exercise for you to calculate the results in this paragraph.
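If you want to check your answers to that exercise, one possible approach is a simple search over Fh. The sketch below restates both likelihood ratios in log form so that it is self-contained; the helper names and the search function are my own, and the search assumes the ratio grows as Fh moves away from F/2.

```python
from math import lgamma, log


def log_ratio_uniform(heads: int, spins: int) -> float:
    """log of heads! * tails! * 2**spins / (spins + 1)! (uniform prior on ph)."""
    tails = spins - heads
    return (lgamma(heads + 1) + lgamma(tails + 1)
            + spins * log(2) - lgamma(spins + 2))


def log_ratio_best(heads: int, spins: int) -> float:
    """log of the ratio when the prior sits entirely on ph = heads / spins."""
    tails = spins - heads
    p_hat = heads / spins
    return heads * log(p_hat) + tails * log(1 - p_hat) + spins * log(2)


def smallest_heads(log_ratio, spins: int, odds: float):
    """Smallest head count above spins / 2 whose ratio reaches `odds`."""
    # Stop before heads == spins, where log(1 - p_hat) is undefined.
    for heads in range(spins // 2 + 1, spins):
        if log_ratio(heads, spins) >= log(odds):
            return heads
    return None


print(smallest_heads(log_ratio_best, 250, 20))
print(smallest_heads(log_ratio_uniform, 250, 20))
```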
Beware of p-values!
* Image: http://dnd.wizards.com/products/tabletop-games/rpg-products/dungeon-masters-screen
** p. 51