
**Simulating a Football Season**

This is a nice example of how statistics are used in modeling – similar techniques are used when gambling companies are creating odds or when computer game designers are making football manager games. We start with some statistics. The soccer stats site has the data we need from the 2018-19 season, and we will use this to predict the outcome of the 2019-20 season (assuming teams stay at a similar level, and that no-one was relegated in 2018-19).

**Attack and defense strength**

For each team we need to calculate:

- Home attack strength
- Away attack strength
- Home defense strength
- Away defense strength.

For example for Liverpool (LFC)

LFC Home attack strength = (LFC home goals in 2018-19 season)/(average home goals in 2018-19 season)

LFC Away attack strength = (LFC away goals in 2018-19 season)/(average away goals in 2018-19 season)

LFC Home defense strength = (LFC home goals conceded in 2018-19 season)/(average home goals conceded in 2018-19 season)

LFC Away defense strength = (LFC away goals conceded in 2018-19 season)/(average away goals conceded in 2018-19 season)
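These ratios are simple to compute. Here is a minimal sketch with illustrative inputs (55 home goals for Liverpool against a league average of 1.57 home goals per game over 19 home matches – figures consistent with the attack strength of 1.84 quoted later, but not verified season data):

```python
# Attack/defense strength: a team's goals (scored or conceded) relative to
# the league average. The inputs below are illustrative, not verified data.

def strength(team_goals, league_average_goals):
    """Ratio of a team's goal tally to the league average (1.0 = average)."""
    return team_goals / league_average_goals

league_avg_home_goals = 1.57 * 19   # avg home goals per game x 19 home matches
lfc_home_goals = 55                 # hypothetical season total

lfc_home_attack = strength(lfc_home_goals, league_avg_home_goals)
print(round(lfc_home_attack, 2))    # ~1.84
```

The same function works for all four strengths – only the inputs change.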

**Calculating lambda**

We can then use a Poisson model to work out some probabilities. First, though, we need to find our lambda value. To make life easier we can use the fact that lambda is the mean of a Poisson distribution – and use this to give an approximate answer.

So, for example, if Liverpool are playing at home to Arsenal we work out Liverpool's lambda value as:

LFC home lambda = league average home goals per game x LFC home attack strength x Arsenal away defense strength.

We would work out Arsenal's away lambda as:

Arsenal away lambda = league average away goals per game x Arsenal away attack strength x Liverpool home defense strength.

Putting in some values gives a home lambda for Liverpool of 3.38 and an away lambda for Arsenal of 0.68. So we would expect Liverpool to win 3-1 (rounding each lambda to the nearest integer).
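Using the league averages and strength figures quoted later in the post, the two lambda calculations can be sketched as:

```python
# Expected goals (lambda) for a fixture, following the formulas above.
# Inputs: league averages 1.57 (home) / 1.25 (away) and the strength
# figures for Liverpool and Arsenal quoted in the text.

def home_lambda(league_avg_home, home_attack, away_defense):
    return league_avg_home * home_attack * away_defense

def away_lambda(league_avg_away, away_attack, home_defense):
    return league_avg_away * away_attack * home_defense

lfc = home_lambda(1.57, 1.84, 1.17)   # Liverpool at home to Arsenal
ars = away_lambda(1.25, 1.30, 0.42)   # Arsenal away at Liverpool

print(round(lfc, 2), round(ars, 2))                   # 3.38 0.68
print(f"Expected score: {round(lfc)}-{round(ars)}")   # Expected score: 3-1
```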

**Using Excel**

I then used an Excel spreadsheet to work out the home goals in each fixture in the league season (the green column represents the home teams),

and then used the same method to work out the away goals in each fixture (the yellow column represents the away teams).

I could then round these numbers to the nearest integer and fill in the scores for each match in the table:

Then I was able to work out the point totals to produce a predicted table:

Here we had both Liverpool and Manchester City on 104 points, but with Manchester City having a better goal difference, so winning the league again.

**Using a Poisson model**

The Poisson model allows us to calculate probabilities. The model is:

P(k goals) = (e^{-λ}λ^{k})/k!

λ is the lambda value which we calculated before.

So, for example with Liverpool at home to Arsenal we calculate

Liverpool’s home lambda = league average home goals per game x LFC home attack strength x Arsenal away defense strength.

**Liverpool’s home lambda = 1.57 x 1.84 x 1.17 = 3.38**

Therefore

P(Liverpool score 0 goals) = (e^{-3.38}3.38^{0})/0! = 0.034

P(Liverpool score 1 goal) = (e^{-3.38}3.38^{1})/1! = 0.12

P(Liverpool score 2 goals) = (e^{-3.38}3.38^{2})/2! = 0.19

P(Liverpool score 3 goals) = (e^{-3.38}3.38^{3})/3! = 0.22

P(Liverpool score 4 goals) = (e^{-3.38}3.38^{4})/4! = 0.19

P(Liverpool score 5 goals) = (e^{-3.38}3.38^{5})/5! = 0.13 etc.

**Arsenal’s away lambda = 1.25 x 1.30 x 0.42 = 0.68**

P(Arsenal score 0 goals) = (e^{-0.68}0.68^{0})/0! = 0.51

P(Arsenal score 1 goal) = (e^{-0.68}0.68^{1})/1! = 0.34

P(Arsenal score 2 goals) = (e^{-0.68}0.68^{2})/2! = 0.12

P(Arsenal score 3 goals) = (e^{-0.68}0.68^{3})/3! = 0.03 etc.
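These probabilities are straightforward to reproduce – a quick sketch of the Poisson formula applied to the two lambda values above:

```python
import math

# P(k goals) = e^(-lambda) * lambda^k / k! -- the Poisson model above.
def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

for k in range(6):
    # Liverpool (lambda = 3.38) and Arsenal (lambda = 0.68)
    print(k, round(poisson_pmf(k, 3.38), 3), round(poisson_pmf(k, 0.68), 3))
```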

**Probability that Arsenal win**

Arsenal can win if:

Liverpool score 0 goals and Arsenal score 1 or more

Liverpool score 1 goal and Arsenal score 2 or more

Liverpool score 2 goals and Arsenal score 3 or more etc.

i.e. the approximate probability of Arsenal winning is:

0.034 x 0.49 + 0.12 x 0.15 + 0.19 x 0.03 ≈ 0.04.

Using the same method we could work out the probability of a draw and a Liverpool win. This is the sort of method that bookmakers will use to calculate the probabilities that ensure they make a profit when offering odds.
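Summing over a grid of scorelines gives all three outcome probabilities at once – a sketch:

```python
import math

# Combining the two Poisson models: sum P(home = i) * P(away = j) over a
# grid of scorelines to get win/draw/loss probabilities for the fixture.
def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def outcome_probs(home_lam, away_lam, max_goals=15):
    home_win = draw = away_win = 0.0
    for i in range(max_goals + 1):
        for j in range(max_goals + 1):
            p = poisson_pmf(i, home_lam) * poisson_pmf(j, away_lam)
            if i > j:
                home_win += p
            elif i == j:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win

hw, d, aw = outcome_probs(3.38, 0.68)   # Liverpool at home to Arsenal
print(f"Home win {hw:.2f}, draw {d:.2f}, away win {aw:.2f}")
```

The away-win figure matches the ≈ 0.04 computed by hand above; a bookmaker would then shorten all three probabilities slightly to build in a profit margin.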

**Could Trump be the next President of America?**

There is a lot of statistical maths behind polling data to make it as accurate as possible – though poor sampling techniques can lead to unexpected results. For example, in the UK 2015 general election, even though Labour were predicted to win around 37.5% of the vote, they only polled 34%. This was a huge political shock and led to a Conservative government when all the pollsters were predicting a hung parliament. In the postmortem following this failure, YouGov concluded that their sampling methods were at fault – leading to big errors in their predictions.

**Trump versus Clinton**

The graph above from Real Clear Politics shows the current hypothetical face off between Clinton and Trump amongst American voters. Given that both are now clear favourites to win their respective party nominations, attention has started to turn to how they fare against each other.

**Normal distribution**

A great deal of statistics dealing with populations is based on the normal distribution. The normal distribution has the bell curve shape above – with the majority of the population bunched around the mean value, and with symmetrical tails at each end. For example most men in the UK will be between 5 foot 8 and 6 foot – with a symmetrical tail of men much taller and much shorter. For polling data mathematicians usually use a sample of 1000 people – this is large enough to give a good approximation to the normal distribution whilst not being so large as to be prohibitively expensive to conduct.

**A Polling Example**

The following example is from the excellent introduction to this topic from the University of Arizona.

So, say we sample 1000 people, asking them a simple Yes/No/Don’t Know type question. Say for example we asked 1000 people if they would vote for Trump, Clinton, or if they were undecided. In our poll 675 people say “Yes” to Trump – so what we want to know is the confidence interval for how accurate this prediction is. Here is where the normal distribution comes in. We use the following equations:

We have μ representing the mean.

n = the number of people we asked which is 1000

p_{0} = our sample probability of “Yes” for Trump which is 0.675

Therefore μ = 1000 x 0.675 = 675

We can use the same values to calculate the standard deviation σ:

σ = (1000(0.675)(1-0.675))^{0.5}

σ = 14.811

We now can use the following table:

This tells us that when we have a normal distribution, we can be 90% confident that the data will be within +/- 1.645 standard deviations of the mean.

So in our hypothetical poll we are 90% confident that the real number of people who will vote for Trump will be +/- 1.645 standard deviations from our sample mean of 675

This gives us the following:

upper bound estimate = 675 + 1.645(14.811) = 699.4

lower bound estimate = 675 – 1.645(14.811) = 650.6

Therefore we can convert this back to a percentage – and say that we can be 90% confident that between 65% and 70% of the population will vote for Trump. We therefore have a prediction of 67.5% with a margin of error of ± 2.5%. You will see most polls published with a ± 2.5% margin of error – which means they are using a sample of 1000 people and a confidence interval of 90%.
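The whole calculation can be reproduced in a few lines:

```python
import math

# The polling confidence interval above, step by step.
n, p0 = 1000, 0.675
z90 = 1.645  # z-value for 90% confidence

mu = n * p0                              # 675
sigma = math.sqrt(n * p0 * (1 - p0))     # ~14.811

lower = mu - z90 * sigma
upper = mu + z90 * sigma
print(round(lower, 1), round(upper, 1))            # 650.6 699.4
print(round(100 * z90 * sigma / n, 2))             # ~2.44%, quoted as 2.5% above
```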

**Real Life**

Back to the real polling data on the Clinton, Trump match-up. We can see that the current trend is a narrowing of the polls between the 2 candidates – 47.3% for Clinton and 40.8% for Trump. This data is an amalgamation of a large number of polls – so should be reasonably accurate. You can see some of the original data behind this:

This is a very detailed polling report from CNN – and as you can see above, they used a sample of 1000 adults in order to get a margin of error of around 3%. However with around 6 months to go it’s very likely these polls will shift. Could we really have President Trump? Only time will tell.

**Quantum Mechanics – Statistical Universe**

Quantum mechanics is the name for the mathematics that describes physical systems on extremely small scales. When we deal with the macroscopic – i.e. scales that we experience in our everyday physical world – Newtonian mechanics works just fine. However on the microscopic level of particles, Newtonian mechanics no longer works – hence the need for quantum mechanics.

Quantum mechanics is both very complicated and very weird – I’m going to try and give a very simplified (though not simple!) example of how *probabilities* are at the heart of quantum mechanics. Rather than speaking with certainty about a property of an object, as we can in classical mechanics, we need to talk about the probability that it holds that property.

For example, one property of particles is *spin*. We can create a particle with the property of either *up* spin or *down* spin. We can visualise this as an arrow pointing up or down:

We can then create an apparatus (say the slit below parallel to the z axis) which measures whether the particle is in either up state or down state. If the particle is in up spin then it will return a value of +1 and if it is in down spin then it will return a value of -1.

So far so normal. But here is where things get weird. If we then rotate the slit 90 degrees clockwise so that it is parallel to the x axis, we would expect from classical mechanics to get a reading of 0 – i.e. the “arrow” will not fit through the slit. However that is not what happens. Instead we still get readings of -1 or +1. However if we run the experiment a large number of times we find that the mean reading will indeed be 0!

What has happened is that the act of measuring the particle with the slit has changed the state of the particle. Say it was previously +1, i.e. in *up* spin; by measuring it with the newly rotated slit we have forced the particle into a new state of either pointing right (*right* spin) or pointing left (*left* spin). Our rotated slit will then return a value of +1 if the particle is in right spin, and a value of -1 if the particle is in left spin.

In this case the probability that the apparatus will return a value of +1 is 50% and the probability that it will return a value of -1 is also 50%. Therefore when we run this experiment many times we get an average value of 0. Classical mechanics is thus recovered as a probabilistic approximation of repeated particle interactions.

We can look at a slightly more complicated example – say we don’t rotate the slit 90 degrees, but instead rotate it an arbitrary number of degrees from the z axis as pictured below:

Here the slit was initially parallel to the z axis in the x,y plane (i.e y=0), and has been rotated Θ degrees. So the question is what is the probability that our previously *up* spin particle will return a value of +1 when measured through this new slit?

The probabilities of returning a +1 spin or a -1 spin depend on the angle of orientation: P(+1) = cos²(Θ/2) and P(-1) = sin²(Θ/2). So in the case of a 90 degree orientation we have both P(+1) and P(-1) = 1/2 as we stated earlier. An orientation of 45 degrees would have P(+1) = 0.85 and P(-1) = 0.15. An orientation of 10 degrees would have P(+1) = 0.99 and P(-1) = 0.01.

The statistical average is then P(+1) − P(-1) = cos²(Θ/2) − sin²(Θ/2) = cosΘ. If we rotate the slit by Θ degrees from the z axis in the x,z plane, then run the experiment many times, we will get a long-term average of cosΘ. As we have seen, when Θ = 90 this means we get an average value of 0. If Θ = 45 degrees we would get an average reading of √2/2.
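A quick simulation makes the "repeated measurements average to cosΘ" claim concrete. This is a sketch of the statistics only (each measurement is a biased ±1 coin flip), not of the underlying quantum dynamics:

```python
import math
import random

# Spin measurement through a slit rotated theta degrees from the prepared
# axis: P(+1) = cos^2(theta/2), P(-1) = sin^2(theta/2), so the long-run
# average reading is cos(theta).
def p_plus(theta_deg):
    return math.cos(math.radians(theta_deg) / 2) ** 2

def average_reading(theta_deg, trials=100_000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += 1 if rng.random() < p_plus(theta_deg) else -1
    return total / trials

print(round(p_plus(45), 2))           # 0.85, as quoted above
print(round(average_reading(90), 2))  # close to 0
```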

This gives a very small snapshot into the ideas of quantum mechanics and the crucial role that probability plays in understanding quantum states. If you found that difficult, then don’t worry you’re in good company. As Richard Feynman the legendary physicist once said, “If you think you understand quantum mechanics, you don’t understand quantum mechanics.”

**Reaction times – How fast are you?**

Go to the Human Benchmark site and test your reaction times. You have five attempts to press the mouse as soon as you see the screen turn green. You can then see how your reaction times compare with people around the world. According to the site there have been over 15 million clicks – with a median reaction time of 251 milliseconds and a mean reaction time of 262 milliseconds.

We can see how this data looks plotted on a chart. As we can see this is quite a good approximation of a bell curve – but with a longer tail to the right (some people have much longer reaction times than we would expect from a pure normal distribution). In a true normal distribution the mean and the median would be the same. Nevertheless this is close enough to model our data using a normal distribution.

From the data we can take the mean time as 255 milliseconds, and a standard deviation of around 35 (estimated by looking at the points within which around 68% of the data lies, i.e. within 1 s.d. of the mean).

So, with X ∼ N(255, 35²) we can then see how we compare with people around the world. Reaction times significantly faster than average would suggest an ability to do well in sports such as baseball or cricket where batters need to react to the ball in a fraction of a second.

I just tried this, and got an average of 272. I can work out what percentage of the population I’m faster than by doing the normal distribution calculation – which shows that around 31% of people would be slower than this. Trying it again gives an average of 261 – this time around 43% of people would be slower.
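The percentile calculation can be sketched using the normal CDF (here via the error function, to keep it in the standard library):

```python
import math

# Percentage of people slower than a given reaction time, under the
# N(255, 35^2) model above.
def normal_cdf(x, mu=255, sigma=35):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

for t in (272, 261):
    slower = 1 - normal_cdf(t)
    print(t, f"{slower:.0%} of people are slower")
```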

Have a go yourselves and see how you get on!

**Medical Data Mining**

It’s worth watching the video above, where Derren Brown manages to flip 10 heads in a row. With Derren being a professional magician, you might expect some magic or sleight of hand – but no, it’s all filmed with a continuous camera, and no tricks. So, how does he achieve something which should only occur with probability (0.5)^{10} ≈ 0.001, or 1 time in every thousand? Understanding this trick is essential to understanding the dangers of accepting data presented to you without being aware of how it was generated.

At 7 minutes in Derren reveals the trick – it’s very easy, but also a very persuasive way to convince people something unusual is happening. The trick is that Derren has spent the best part of an entire day tossing coins – and only showed the sequence in which he achieved 10 heads in a row. Suddenly with this new information the result looks much less remarkable.
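A simulation shows just how long a day that is. The expected number of flips before a fair coin first shows 10 heads in a row is 2^{11} − 2 = 2046 – this sketch estimates it empirically:

```python
import random

# How long does it take to flip 10 heads in a row? Keep flipping a fair
# coin and count the flips until a run of 10 heads appears.
def flips_until_run(run_length=10, rng=None):
    rng = rng or random.Random()
    flips, streak = 0, 0
    while streak < run_length:
        flips += 1
        streak = streak + 1 if rng.random() < 0.5 else 0
    return flips

rng = random.Random(42)
trials = [flips_until_run(rng=rng) for _ in range(200)]
print(sum(trials) / len(trials))  # averages around 2^11 - 2 = 2046 flips
```

At a few seconds per toss, 2000-odd flips is indeed most of a day's work.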

Scientific tests are normally performed at a 5% significance level – that is, if there is a less than 5% chance of something happening by chance then we regard the data as evidence to reject the null hypothesis and accept the alternative hypothesis. In the case of the coin toss we would, if we didn’t know better, reject the null hypothesis that this is a fair coin and conjecture that Derren is somehow affecting the results.

Selectively presenting results from trials is called *data mining* – and it’s a very powerful way to manipulate data. Unfortunately it is also a widespread technique in the pharmaceutical industry when companies release data on new drugs. Trials which show a positive effect are published; those which show no effect (or negative effects) are not. This is a massive problem – and one which has huge implications for people’s health. After all, we are prescribed drugs based on scientific trials which attest to their efficacy. If this data is being mined to skew results in the drug company’s favour then we may end up taking drugs that don’t work – or that even make us worse.

Dr Ben Goldacre has written extensively on this topic – and an extract from his article “The Drugs Don’t Work” is well worth a read:

**The Drugs Don’t Work**

*Reboxetine is a drug I have prescribed. Other drugs had done nothing for my patient, so we wanted to try something new. I’d read the trial data before I wrote the prescription, and found only well-designed, fair tests, with overwhelmingly positive results. Reboxetine was better than a placebo, and as good as any other antidepressant in head-to-head comparisons. It’s approved for use by the Medicines and Healthcare products Regulatory Agency (the MHRA), which governs all drugs in the UK. Millions of doses are prescribed every year, around the world. Reboxetine was clearly a safe and effective treatment. The patient and I discussed the evidence briefly, and agreed it was the right treatment to try next. I signed a prescription.*

*But we had both been misled. In October 2010, a group of researchers was finally able to bring together all the data that had ever been collected on reboxetine, both from trials that were published and from those that had never appeared in academic papers. When all this trial data was put together, it produced a shocking picture. Seven trials had been conducted comparing reboxetine against a placebo. Only one, conducted in 254 patients, had a neat, positive result, and that one was published in an academic journal, for doctors and researchers to read. But six more trials were conducted, in almost 10 times as many patients. All of them showed that reboxetine was no better than a dummy sugar pill. None of these trials was published. I had no idea they existed.*

*It got worse. The trials comparing reboxetine against other drugs showed exactly the same picture: three small studies, 507 patients in total, showed that reboxetine was just as good as any other drug. They were all published. But 1,657 patients’ worth of data was left unpublished, and this unpublished data showed that patients on reboxetine did worse than those on other drugs. If all this wasn’t bad enough, there was also the side-effects data. The drug looked fine in the trials that appeared in the academic literature; but when we saw the unpublished studies, it turned out that patients were more likely to have side-effects, more likely to drop out of taking the drug and more likely to withdraw from the trial because of side-effects, if they were taking reboxetine rather than one of its competitors.*

The whole article is a fantastic (and worrying) account of regulatory failure. At the heart of this problem lies a social and political misunderstanding of statistics which is being manipulated by drug companies for profit. A proper regulatory framework would ensure that all trials were registered in advance and their data recorded. Instead, trials are commissioned by drug companies, published if they are favourable and quietly buried if they are not. This sort of data mining would be rejected on mathematical grounds in an IB exploration coursework, yet these statistics still govern which pills doctors do and don’t prescribe.

When presented with data, therefore, your first question should be, “Where did this come from?” shortly followed by, “What about the data you’re not showing me?” Lies, damn lies and statistics indeed!

If you enjoyed this post, you might also like:

How contagious is Ebola? – how we can use differential equations to model the spread of the disease.