
Slaying The Mythical p




Finally I have come out of the forest of introductory chapters. In mission 3.15, MacKay summons us to slay the mythical p-value. The weapons we carry are simple but sharp, and the insights look very promising.

This mission leads us to confront the p-value and its meaning when we are comparing two different models. In this post we go carefully through the exercise solution and its insights. First we recalculate the statistician's conclusion, and then we move on to MacKay's approach of comparing two models by how likely the data are under each of them. In a way, this post is also intended to capture part of the essence of chapters 1 and 3.

Before reading on I suggest you work on the exercise yourself (for 15 minutes, even if you can't solve it):

‘If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%’

We interpret this sentence as meaning that 7% bounds the probability of a fair coin giving 140 or more heads in 250 spins, together with the symmetric result of 140 or more tails (110 or fewer heads). To check his conclusion we first need a way to calculate the probability of getting a given number of heads in a given number of coin spins. This is when we summon Mr. Binomial and the distribution that bears his name.

The binomial distribution is the first thing MacKay introduces in his book. It is very useful when we have a random variable that takes a binary value (e.g. heads or tails) in one or more independent experiments. It allows us to calculate the probability of getting Fh heads in F spins, where each spin comes up heads with probability ph.
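For reference, the standard binomial probability described here can be written as follows, with F_h and p_h standing for the post's Fh and ph (the exact notation of the original formula is an assumption):

```latex
P(F_h \mid p_h, F) = \binom{F}{F_h} \, p_h^{F_h} \, (1 - p_h)^{F - F_h}
```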


The intuition behind this distribution is that we count all the ways (F choose Fh) in which Fh heads can be distributed over F spins and multiply that count by the probability of any one of these specific outcomes. All outcomes with Fh heads in F spins have the same probability because the spins are independent.

If it is still confusing, think of 4 fair coin spins and getting 2 heads. We have tthh, thth, thht, htth, htht, hhtt and all of them have the same probability 0.5^4 = 0.0625 of happening. We have 6 of them (4 choose 2) so we multiply 0.0625 by 6 and get 0.375 (6/16), which is the probability of getting 2 heads in 4 spins.
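The counting argument can be checked with a few lines of Python. This is a minimal sketch; the function name `binomial_pmf` is just an illustrative choice.

```python
from math import comb

def binomial_pmf(heads: int, spins: int, p_heads: float) -> float:
    """Probability of exactly `heads` heads in `spins` independent spins."""
    ways = comb(spins, heads)  # number of distinct orderings with that many heads
    # Each such ordering has the same probability, so multiply the count by it
    return ways * p_heads**heads * (1 - p_heads) ** (spins - heads)

# The 4-spin example: 6 orderings, each with probability 0.0625
print(binomial_pmf(2, 4, 0.5))  # 0.375
```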

If we want the probability of getting Fh or more heads we need to sum all the probabilities from Fh to F. Since we have F = 250, Fh = 140, and ph = (1 - ph) = 0.5 for a fair coin, the probability of getting 140 or more heads is given by:
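As a sketch of the sum just described (plain binomial terms with p_h = 0.5, so every sequence of 250 spins has probability 0.5^250):

```latex
P(F_h \ge 140 \mid p_h = 0.5, F = 250)
  = \sum_{k=140}^{250} \binom{250}{k} \, 0.5^{250}
  \approx 0.0332
```

The value 0.0332 is simply half of the two-sided total of 0.0664, since the two tails are mirror images of each other.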



By symmetry (or calculation if you are a true skeptic) the probability of getting 110 or fewer heads (140 or more tails) is the same. Therefore, the total probability of this "extreme result" is 0.0664: less than 7%, as the statistician said.


Binomial distribution for 250 spins of a fair coin. Bars at 110 and 140.

If we move the bars to values of 141 (and 109), we get a probability of 0.0497. At this value, the statistician would be a happy man and reject the null hypothesis (that the coin is fair) with a significance level of 5%.
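Both tail probabilities can be recomputed exactly with integer arithmetic. The sketch below assumes a fair coin and a threshold above half the number of spins, so the two tails do not overlap; `two_sided_tail` is an illustrative name.

```python
from math import comb

def two_sided_tail(threshold: int, spins: int = 250) -> float:
    """P(heads >= threshold or tails >= threshold) for a fair coin.

    Assumes threshold > spins / 2 so the two tails are disjoint.
    """
    # For a fair coin every sequence has probability 0.5**spins,
    # so it is enough to count the sequences in the upper tail.
    upper = sum(comb(spins, k) for k in range(threshold, spins + 1))
    # The lower tail mirrors the upper one exactly, hence the factor 2
    return 2 * upper / 2**spins

print(f"{two_sided_tail(140):.4f}")  # about 0.0664, the "less than 7%"
print(f"{two_sided_tail(141):.4f}")  # about 0.0497, just under 5%
```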

If we assume a fair coin, it would produce data this extreme (140 or more heads, or 140 or more tails) less than 7% of the time. But notice that we start by assuming a model, and here is precisely MacKay's point: why not calculate which model the data favors instead of fixing a model a priori? After all, what we want to know here is:

Given the data, what model is more likely (fair or biased coin)?

This is a different question than:

Given a fair coin, what is the probability it would produce the data?

Think about this for a moment as this is a very important distinction.


Now let's move to the Bayesian point of view and compare how well the two models explain the data. We have two hypotheses, H0 (the coin is fair) and H1 (the coin is biased), the data D, and the variables introduced above for the outcome (Fh heads in F spins).




Please note that the data D is the single outcome of getting exactly 140 heads in 250 spins and it does not represent the idea of 140 or more heads.

With Bayes' theorem we can calculate the probability of a hypothesis in terms of three other probabilities: The likelihood of the data given the hypothesis, the prior probability of the hypothesis itself and the overall probability of the data:
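In symbols, writing H_1 for the biased-coin hypothesis and D for the data (the same expression holds with H_0, the fair coin, in place of H_1), Bayes' theorem reads:

```latex
P(H_1 \mid D) = \frac{P(D \mid H_1) \, P(H_1)}{P(D)}
```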



This may look like a lot of information to input in order to compare the models. But MacKay's point, in the words of the great man, is exactly that:

You can't do inference without making assumptions**

We assume a uniform prior over the two hypotheses and give 1/2 to each, to represent the fact that we have no information about which hypothesis is more likely (biased or fair) and that, together, these two hypotheses cover the whole space of possible hypotheses (the coin is either biased or fair).


The probability of the data P(D) is the same for both models, so, if we want to compare the models, we compute the ratio:


This ratio represents how much the data favours hypothesis 1 that the coin is biased. Beware that the ratio is not necessarily a value between 0 and 1. The data can favor one hypothesis over the other by orders of magnitude as we will see below.
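Spelled out, with H_1 the biased-coin hypothesis and H_0 the fair-coin one, P(D) cancels and equal priors of 1/2 cancel as well, leaving only the ratio of likelihoods:

```latex
\frac{P(H_1 \mid D)}{P(H_0 \mid D)}
  = \frac{P(D \mid H_1) \, P(H_1)}{P(D \mid H_0) \, P(H_0)}
  = \frac{P(D \mid H_1)}{P(D \mid H_0)}
  \quad \text{when } P(H_1) = P(H_0) = \tfrac{1}{2}
```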

There are two probabilities to be computed, one for each hypothesis. We are going to start with the first one, which requires a bit more maths. The first thing to notice is that we can compute the conditional probability using the sum rule of probability theory. Since ph is a real value, our sum becomes an integral and is given by:


The first probability inside the integral is the probability of getting the data given a coin with probability ph. We computed this above; the probability is simply


We then assume a uniform probability over the bias of the coin


This gives us


If we introduce the Beta integral


We can reformulate our probability as


The Gamma function is 


Finally, we can write our probability in terms of F, Fh and Ft (the number of tails, F - Fh), and we call it the likelihood of hypothesis 1 given the data
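The chain of steps above can be sketched as follows. This sketch assumes D is treated as one particular sequence of spins, so there is no binomial coefficient; including the coefficient would multiply both hypotheses' likelihoods by the same factor and leave their ratio unchanged. It uses the Beta integral \int_0^1 p^{a} (1-p)^{b} \, dp = \Gamma(a+1)\Gamma(b+1)/\Gamma(a+b+2) and \Gamma(n+1) = n! for integer n.

```latex
\begin{aligned}
P(D \mid H_1)
  &= \int_0^1 P(D \mid p_h, H_1) \, P(p_h \mid H_1) \, dp_h \\
  &= \int_0^1 p_h^{F_h} (1 - p_h)^{F_t} \, dp_h
     \qquad \text{(uniform prior, } P(p_h \mid H_1) = 1\text{)} \\
  &= \frac{\Gamma(F_h + 1) \, \Gamma(F_t + 1)}{\Gamma(F_h + F_t + 2)} \\
  &= \frac{F_h! \, F_t!}{(F + 1)!}
\end{aligned}
```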



Next we move on to the likelihood of hypothesis 0 (fair coin). With a fair coin we have ph = 0.5 and the likelihood is then


We can then compute the likelihood ratio as



Interestingly, we arrive at 0.4767, a number that tells us that the data give approximately 2:1 evidence in favour of the coin being fair!
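The ratio can be reproduced numerically. The sketch below works in log space with `math.lgamma` to avoid overflowing on the large factorials, and assumes the uniform-prior likelihood Fh! Ft! / (F + 1)! for the biased coin and 0.5^F for the fair one. The function names are illustrative.

```python
from math import exp, lgamma, log

def log_evidence_biased(heads: int, tails: int) -> float:
    """log P(D | H1) with a uniform prior on ph: heads! tails! / (F + 1)!"""
    # lgamma(n + 1) equals log(n!) and stays finite for large n
    return lgamma(heads + 1) + lgamma(tails + 1) - lgamma(heads + tails + 2)

def log_evidence_fair(heads: int, tails: int) -> float:
    """log P(D | H0): every sequence of F spins has probability 0.5**F."""
    return (heads + tails) * log(0.5)

def ratio_uniform(heads: int, tails: int) -> float:
    """P(D | H1) / P(D | H0) under the uniform prior on ph."""
    return exp(log_evidence_biased(heads, tails) - log_evidence_fair(heads, tails))

print(f"{ratio_uniform(140, 110):.4f}")  # about 0.48, i.e. roughly 2:1 for the fair coin
```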

One could argue that we had to set subjective values for the prior of the hypotheses and for the prior over the probability ph of the biased coin. Regarding the first, we explained the rationale above. For the prior over ph, however, one could say that it shouldn't be uniform because we expect the coin to be biased (or we have a bias over its bias).

What is interesting though is that we can actually calculate the likelihood ratio for the best possible prior, that is, the prior that gives us the largest possible likelihood ratio. Intuitively it would be the prior that puts the maximum probability on the outcome we observed and the lowest on all the other ones. In other words, the prior that assigns probability 1 to ph = 140/250 and 0 to all other values. That is a very unreasonable prior, but it helps make a point.

Using the best possible prior makes the likelihood for hypothesis 1 be simply

 

Therefore our likelihood ratio becomes



With the best possible prior, the likelihood ratio is at most 6:1.
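This upper bound is easy to check. The sketch assumes the "best possible prior" is a point mass at ph = Fh / F, so the likelihood of the biased coin is just ph^Fh (1 - ph)^Ft evaluated there; `ratio_best_prior` is an illustrative name, and both counts must be positive.

```python
from math import exp, log

def ratio_best_prior(heads: int, tails: int) -> float:
    """Likelihood ratio when H1 puts all prior mass on ph = heads / (heads + tails)."""
    spins = heads + tails
    p_hat = heads / spins  # the bias that makes the observed data most probable
    log_biased = heads * log(p_hat) + tails * log(1 - p_hat)
    log_fair = spins * log(0.5)
    return exp(log_biased - log_fair)

print(f"{ratio_best_prior(140, 110):.2f}")  # roughly 6
```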

Thinking about the mythical p-value of 5% (or 20:1 against the null hypothesis), we can then ask a final question:

What number of heads would give us 20:1 on the coin being biased?


The answer is Fh = 145 (with F held at 250), which gives approximately 25:1. But please note that this answer relies on the (very unreasonable) prior distribution above. For a uniform prior the picture is much less dramatic. If we keep the 110 tails and simply add heads, Fh = 145 only improves the 0.48:1 ratio to 0.86:1, and it takes Fh = 166 to get 22:1, above the mythical 20:1. If we instead hold F at 250, Fh = 145 gives about 1.9:1 and it takes Fh = 152 to pass 20:1. I leave it as an exercise for you to calculate the results in this paragraph.
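For readers who want to check their answers to the exercise, here is a hedged sketch that scans for the first number of heads reaching 20:1. The numbers depend on whether F stays at 250 (tails drop as heads rise) or the 110 tails are kept while heads are added, so the helper takes both counts. All names are illustrative.

```python
from math import exp, lgamma, log

def ratio_uniform(heads: int, tails: int) -> float:
    """Likelihood ratio with a uniform prior on ph: heads! tails! / (F + 1)! over 0.5**F."""
    log_biased = lgamma(heads + 1) + lgamma(tails + 1) - lgamma(heads + tails + 2)
    return exp(log_biased - (heads + tails) * log(0.5))

def ratio_best_prior(heads: int, tails: int) -> float:
    """Likelihood ratio with all prior mass on ph = heads / (heads + tails)."""
    spins = heads + tails
    p_hat = heads / spins
    log_biased = heads * log(p_hat) + tails * log(1 - p_hat)
    return exp(log_biased - spins * log(0.5))

def fixed_total(heads: int) -> int:
    return 250 - heads  # F stays at 250 spins

def fixed_tails(heads: int) -> int:
    return 110  # keep the observed tails, so F grows with the heads

def first_heads_reaching(ratio_fn, tails_for, target: float = 20.0):
    """Smallest number of heads (searched from 126 to 249) whose ratio reaches `target`."""
    for heads in range(126, 250):
        if ratio_fn(heads, tails_for(heads)) >= target:
            return heads
    return None

print(first_heads_reaching(ratio_best_prior, fixed_total))  # 145
print(first_heads_reaching(ratio_uniform, fixed_total))     # 152
print(first_heads_reaching(ratio_uniform, fixed_tails))     # 166
```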


Beware of p-values!


* Image: http://dnd.wizards.com/products/tabletop-games/rpg-products/dungeon-masters-screen
** p. 51
