
**Simulating a Football Season**

This is a nice example of how statistics are used in modeling – similar techniques are used when gambling companies are creating odds or when computer game designers are making football manager games. We start with some statistics. The soccer stats site has the data we need from the 2018-19 season, and we will use this to predict the outcome of the 2019-20 season (assuming teams stay at a similar level, and that no-one was relegated in 2018-19).

**Attack and defense strength**

For each team we need to calculate:

- Home attack strength
- Away attack strength
- Home defense strength
- Away defense strength.

For example, for Liverpool (LFC):

LFC Home attack strength = (LFC home goals in 2018-19 season)/(average home goals in 2018-19 season)

LFC Away attack strength = (LFC away goals in 2018-19 season)/(average away goals in 2018-19 season)

LFC Home defense strength = (LFC home goals conceded in 2018-19 season)/(average home goals conceded in 2018-19 season)

LFC Away defense strength = (LFC away goals conceded in 2018-19 season)/(average away goals conceded in 2018-19 season)
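As a rough sketch, the four ratios can be computed with one small helper. The season totals below are made-up placeholder numbers for illustration, not figures from the stats site, and the function name `strength` is just a label chosen here.

```python
def strength(team_total: float, league_average: float) -> float:
    """Return a team's total divided by the league average for the same category."""
    if league_average <= 0:
        raise ValueError("league_average must be positive")
    return team_total / league_average


# Placeholder season totals (assumed for illustration only).
team = {"home_goals": 55, "away_goals": 34, "home_conceded": 10, "away_conceded": 12}
league_avg = {"home_goals": 30, "away_goals": 24, "home_conceded": 24, "away_conceded": 30}

# One ratio per category: home attack, away attack, home defense, away defense.
strengths = {category: strength(team[category], league_avg[category]) for category in team}

for category, value in strengths.items():
    print(f"{category}: {value:.2f}")
```

With these definitions an attack strength above 1 means the team scores more than the league average, while a defense strength below 1 means it concedes fewer goals than average.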

**Calculating lambda**

We can then use a Poisson model to work out some probabilities. First, though, we need to find our lambda value. To make life easier we can also use the fact that the lambda value of a Poisson distribution is its mean – and use this to give an approximate answer.

So, for example, if Liverpool are playing at home to Arsenal we work out Liverpool’s lambda value as:

LFC home lambda = league average home goals per game x LFC home attack strength x Arsenal away defense strength.

We would work out Arsenal’s away lambda as:

Arsenal away lambda = league average away goals per game x Arsenal away attack strength x Liverpool home defense strength.

Putting in some values gives a home lambda for Liverpool of 3.38 and an away lambda for Arsenal of 0.69. So we would expect Liverpool to win 3-1 (rounding to the nearest integer).
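The calculation can be sketched in a few lines of Python. The input values are the rounded figures quoted later in this post for the Liverpool v Arsenal fixture; the function name `expected_goals` is an illustrative choice, and because the inputs are rounded the results may differ slightly in the second decimal place.

```python
def expected_goals(league_avg_goals_per_game: float,
                   attack_strength: float,
                   opponent_defense_strength: float) -> float:
    """Mean number of goals for one side in one fixture."""
    return league_avg_goals_per_game * attack_strength * opponent_defense_strength


# Rounded inputs quoted in the post for Liverpool (home) v Arsenal (away).
lfc_home_mean = expected_goals(1.57, 1.84, 1.17)
arsenal_away_mean = expected_goals(1.25, 1.30, 0.42)

# Round each mean to the nearest integer to get a single predicted score.
predicted_score = (round(lfc_home_mean), round(arsenal_away_mean))
print(f"Liverpool {lfc_home_mean:.2f}, Arsenal {arsenal_away_mean:.2f}, score {predicted_score}")
```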

**Using Excel**

I then used an Excel spreadsheet to work out the home goals in each fixture in the league season (green column represents the home teams)

and then used the same method to work out the away goals in each fixture in the league (yellow column represents the away team)

I could then round these numbers to the nearest integer and fill in the scores for each match in the table:

Then I was able to work out the point totals to produce a predicted table:

Here we had both Liverpool and Manchester City on 104 points, but with Manchester City having a better goal difference, so winning the league again.
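The spreadsheet step of turning predicted scores into a table could be sketched as below. This is only an assumed equivalent of the Excel workflow, using three placeholder teams and invented scores, with the usual 3 points for a win and 1 for a draw, and goal difference as the tie-breaker.

```python
from collections import defaultdict


def build_table(results):
    """Rank teams by points, then goal difference.

    `results` is a list of (home_team, home_goals, away_team, away_goals).
    """
    table = defaultdict(lambda: {"points": 0, "goal_difference": 0})
    for home, home_goals, away, away_goals in results:
        table[home]["goal_difference"] += home_goals - away_goals
        table[away]["goal_difference"] += away_goals - home_goals
        if home_goals > away_goals:
            table[home]["points"] += 3
        elif home_goals < away_goals:
            table[away]["points"] += 3
        else:
            table[home]["points"] += 1
            table[away]["points"] += 1
    return sorted(
        table.items(),
        key=lambda item: (item[1]["points"], item[1]["goal_difference"]),
        reverse=True,
    )


# Placeholder predicted scores (invented for illustration).
predicted_results = [
    ("Team A", 3, "Team B", 1),
    ("Team B", 2, "Team C", 2),
    ("Team C", 0, "Team A", 1),
]

ranking = build_table(predicted_results)
for name, row in ranking:
    print(name, row["points"], row["goal_difference"])
```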

**Using a Poisson model.**

The Poisson model allows us to calculate probabilities. The model is:

P(k goals) = (e^{-λ}λ^{k})/k!

λ is the symbol lambda, which we calculated before.

So, for example, with Liverpool at home to Arsenal we calculate

Liverpool’s home lambda = league average home goals per game x LFC home attack strength x Arsenal away defense strength.

**Liverpool’s home lambda = 1.57 x 1.84 x 1.17 = 3.38**

Therefore

P(Liverpool score 0 goals) = (e^{-3.38}3.38^{0})/0! = 0.034

P(Liverpool score 1 goal) = (e^{-3.38}3.38^{1})/1! = 0.12

P(Liverpool score 2 goals) = (e^{-3.38}3.38^{2})/2! = 0.19

P(Liverpool score 3 goals) = (e^{-3.38}3.38^{3})/3! = 0.22

P(Liverpool score 4 goals) = (e^{-3.38}3.38^{4})/4! = 0.19

P(Liverpool score 5 goals) = (e^{-3.38}3.38^{5})/5! = 0.13 etc.

**Arsenal’s away lambda = 1.25 x 1.30 x 0.42 = 0.68**

P(Arsenal score 0 goals) = (e^{-0.68}0.68^{0})/0! = 0.51

P(Arsenal score 1 goal) = (e^{-0.68}0.68^{1})/1! = 0.34

P(Arsenal score 2 goals) = (e^{-0.68}0.68^{2})/2! = 0.12

P(Arsenal score 3 goals) = (e^{-0.68}0.68^{3})/3! = 0.03 etc.
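These probabilities are quick to check with a short script. This is a minimal sketch using only the standard library; the helper name `poisson_pmf` is an assumption made here for illustration.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(k goals) = e^(-lam) * lam^k / k! for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Means taken from the worked example above.
liverpool_mean = 3.38
arsenal_mean = 0.68

for k in range(6):
    print(f"P(Liverpool score {k}) = {poisson_pmf(k, liverpool_mean):.3f}")

for k in range(4):
    print(f"P(Arsenal score {k}) = {poisson_pmf(k, arsenal_mean):.3f}")
```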

**Probability that Arsenal win**

Arsenal can win if:

Liverpool score 0 goals and Arsenal score 1 or more

Liverpool score 1 goal and Arsenal score 2 or more

Liverpool score 2 goals and Arsenal score 3 or more etc.

i.e. the approximate probability of Arsenal winning is:

0.034 x 0.49 + 0.12 x 0.15 + 0.19 x 0.03 = 0.04.

Using the same method we could work out the probability of a draw and a Liverpool win. This is the sort of method that bookmakers will use to calculate the probabilities that ensure they make a profit when offering odds.
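A sketch of the full calculation is below. It assumes, as the hand calculation above does, that the two teams' scores are independent Poisson variables, and it cuts the sums off at 15 goals per side (an arbitrary limit chosen here, beyond which the probabilities are negligible for these means). Function names are illustrative.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of exactly k goals when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(home_mean: float, away_mean: float, max_goals: int = 15):
    """Return (home win, draw, away win), assuming independent scores."""
    home_win = draw = away_win = 0.0
    for home_goals in range(max_goals + 1):
        for away_goals in range(max_goals + 1):
            p = poisson_pmf(home_goals, home_mean) * poisson_pmf(away_goals, away_mean)
            if home_goals > away_goals:
                home_win += p
            elif home_goals == away_goals:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


home_win, draw, away_win = outcome_probabilities(3.38, 0.68)
print(f"Liverpool win {home_win:.3f}, draw {draw:.3f}, Arsenal win {away_win:.3f}")
```

Summing over every scoreline rather than the first three terms gives an Arsenal win probability close to the approximate 0.04 found above.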

**Quantum Mechanics – Statistical Universe**

Quantum mechanics is the name for the mathematics that can describe physical systems on extremely small scales. When we deal with the macroscopic – i.e. scales that we experience in our everyday physical world – then Newtonian mechanics works just fine. However on the microscopic level of particles, Newtonian mechanics no longer works – hence the need for quantum mechanics.

Quantum mechanics is both very complicated and very weird – I’m going to try and give a very simplified (though not simple!) example of how *probabilities* are at the heart of quantum mechanics. Rather than speaking with certainty about the property of an object as we can in classical mechanics, we need to talk about the probability that it holds such a property.

For example, one property of particles is *spin*. We can create a particle with the property of either *up* spin or *down* spin. We can visualise this as an arrow pointing up or down:

We can then create an apparatus (say the slit below parallel to the z axis) which measures whether the particle is in either up state or down state. If the particle is in up spin then it will return a value of +1 and if it is in down spin then it will return a value of -1.

So far so normal. But here is where things get weird. If we then rotate the slit 90 degrees clockwise so that it is parallel to the x axis, we would expect from classical mechanics to get a reading of 0, i.e. the “arrow” will not fit through the slit. However that is not what happens. Instead we will still get readings of -1 or +1. However if we run the experiment a large number of times we find that the mean average reading will indeed be 0!

What has happened is that the act of measuring the particle with the slit has changed the state of the particle. Say it was previously +1, i.e. in *up* spin, by measuring it with the newly rotated slit we have forced the particle into a new state of either pointing right (*right* spin) or pointing left (*left* spin). Our rotated slit will then return a value of +1 if the particle is in right spin, and will return a value of -1 if the particle is in left spin.

In this case the probability that the apparatus will return a value of +1 is 50% and the probability that the apparatus will return a value of -1 is also 50%. Therefore when we run this experiment many times we get the average value of 0. In this sense classical mechanics emerges as a probabilistic approximation of repeated particle interactions.

We can look at a slightly more complicated example – say we don’t rotate the slit 90 degrees, but instead rotate it an arbitrary number of degrees from the z axis as pictured below:

Here the slit was initially parallel to the z axis in the x,z plane (i.e. y=0), and has been rotated Θ degrees. So the question is what is the probability that our previously *up* spin particle will return a value of +1 when measured through this new slit?

The equations above give the probabilities of returning a +1 spin or a -1 spin depending on the angle of orientation. So in the case of a 90 degree orientation we have both P(+1) and P(-1) = 1/2 as we stated earlier. An orientation of 45 degrees would have P(+1) = 0.85 and P(-1) = 0.15. An orientation of 10 degrees would have P(+1) = 0.99 and P(-1) = 0.01.

The statistical average meanwhile is given by the above formula. If we rotate the slit by Θ degrees from the z axis in the x,z plane, then run the experiment many times, we will get a long term average of cosΘ. As we have seen before, when Θ = 90 this means we get an average value of 0. If Θ = 45 degrees we would get an average reading of √2/2.
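The formulas referred to are not reproduced in the text here. The sketch below assumes the standard spin-1/2 result P(+1) = cos²(Θ/2) and P(-1) = sin²(Θ/2), which agrees with every value quoted above (0.5 at 90 degrees, 0.85 at 45 degrees, 0.99 at 10 degrees) and gives a long term average of cosΘ. Function names are illustrative.

```python
import math


def spin_probabilities(theta_degrees: float):
    """Return (P(+1), P(-1)) for a slit rotated theta degrees from the z axis.

    Assumes the standard spin-1/2 formulas cos^2(theta/2) and sin^2(theta/2).
    """
    half_angle = math.radians(theta_degrees) / 2
    return math.cos(half_angle) ** 2, math.sin(half_angle) ** 2


def average_reading(theta_degrees: float) -> float:
    """Long term average: (+1) * P(+1) + (-1) * P(-1)."""
    p_plus, p_minus = spin_probabilities(theta_degrees)
    return p_plus - p_minus


for angle in (90, 45, 10):
    p_plus, p_minus = spin_probabilities(angle)
    print(f"{angle} degrees: P(+1) = {p_plus:.2f}, P(-1) = {p_minus:.2f}, "
          f"average = {average_reading(angle):.3f}")
```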

This gives a very small snapshot into the ideas of quantum mechanics and the crucial role that probability plays in understanding quantum states. If you found that difficult, then don’t worry you’re in good company. As Richard Feynman the legendary physicist once said, “If you think you understand quantum mechanics, you don’t understand quantum mechanics.”

**Predicting the UK election using linear regression**

The above data is the latest opinion poll data from the Guardian. The UK will have (another) general election on June 8th. So can we use the current opinion poll data to predict the outcome?

**Longer term data trends**

Let’s start by looking at the longer term trend following the aftermath of the Brexit vote on June 23rd 2016. I’ll plot some points for Labour and the Conservatives and see what kind of linear regression we get. To keep things simple I’ve looked at randomly chosen poll data approximately every 2 weeks – assigning 0 to July 1st 2016, 1 to mid July, 2 to August 1st etc. This has then been plotted using the fantastic Desmos.

**Labour**

You can see that this is not a very good fit – it’s a very weak correlation. Nevertheless let’s see what we would get if we used this regression line to predict the outcome in June. With the x axis scale I’ve chosen, mid June 2017 equates to 23 on the x axis. Therefore we predict the percentage as

y = -0.130(23) + 30.2

y = 27%

Clearly this would be a disaster for Labour – but our model is not especially accurate so perhaps nothing to worry about just yet.

**Conservatives**

As with Labour we have a weak correlation – though this time we have a positive rather than negative correlation. If we use our regression model we get a prediction of:

y = 0.242(23) + 38.7

y = 44%

So, we are predicting a crushing victory for the Conservatives – but could we get some more accurate models to base this prediction on?

**Using moving averages**

The Guardian’s poll tracker at the top of the page uses moving averages to smooth out poll fluctuations between different polls and to arrive at an averaged poll figure. Using this provides a stronger correlation:

**Labour**

This model doesn’t take into account a (possible) late surge in support for Labour but does fit better than our last graph. Using the equation we get:

y = -0.0764(23) + 28.8

y = 27%

**Conservatives**

We can have more confidence in using this regression line to predict the election. Putting in the numbers we get:

y = 0.411(23) + 36.48

y = 46%
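For anyone wanting to reproduce this kind of prediction without Desmos, here is a minimal least-squares sketch. The four regression lines are the ones quoted above; the `sample_x` and `sample_y` poll readings are invented placeholders (the actual poll data is not listed in this post), and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(slope: float, intercept: float, x: float) -> float:
    """Evaluate the regression line y = slope * x + intercept."""
    return slope * x + intercept


# Regression lines quoted in the post, evaluated at x = 23 (mid June 2017).
quoted_lines = {
    "Labour (raw polls)": (-0.130, 30.2),
    "Conservatives (raw polls)": (0.242, 38.7),
    "Labour (moving average)": (-0.0764, 28.8),
    "Conservatives (moving average)": (0.411, 36.48),
}
for label, (slope, intercept) in quoted_lines.items():
    print(f"{label}: {predict(slope, intercept, 23):.1f}%")

# Placeholder poll readings (invented) to show how a line would be fitted.
sample_x = [0, 1, 2, 3, 4]
sample_y = [39.0, 38.5, 40.0, 40.5, 41.0]
sample_slope, sample_intercept = fit_line(sample_x, sample_y)
```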

**Conclusion**

Our more accurate models merely confirm what we found earlier – and indeed what all the pollsters are predicting – a massive win for the Conservatives. Even allowing for a late narrowing of the polls the Conservatives could be on target for winning by over 10 percentage points – which would result in a very large majority. Let’s see what happens!

**Modelling Radioactive decay**

We can model radioactive decay of atoms using the following equation:

**N(t) = N_{0} e^{-λt}**

Where:

**N_{0}**: is the initial quantity of the element

**λ**: is the radioactive decay constant

**t**: is time

**N(t)**: is the quantity of the element remaining after time t.

So, for Carbon-14, which has a half life of 5730 years (this means that after 5730 years exactly half of the initial amount of Carbon-14 atoms will have decayed), we can calculate the decay constant **λ**.

After 5730 years, N(5730) will be exactly half of N_{0}, therefore we can write the following:

**N(5730) = 0.5N_{0} = N_{0} e^{-5730λ}**

therefore:

**0.5 = e^{-5730λ}**

and if we take the natural log of both sides and rearrange we get:

**λ = ln(1/2) / -5730**

**λ ≈ 0.000121**

We can now use this to solve problems involving Carbon-14 (which is used in Carbon-dating techniques to find out how old things are).

e.g. You find an old parchment and after measuring the Carbon-14 content you find that it is just 30% of what a new piece of paper would contain. How old is this parchment?

We have

**N(t) = N_{0} e^{-0.000121t}**

**N(t)/N_{0} = e^{-0.000121t}**

**0.30 = e^{-0.000121t}**

**t = ln(0.30)/(-0.000121)**

**t = 9950 years old.**
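The same two steps can be written as a short script. This is a sketch using only the standard library; the function names are illustrative. Note that using the unrounded decay constant gives an age a few years different from the rounded figure above.

```python
import math


def decay_constant(half_life_years: float) -> float:
    """Solve 0.5 = e^(-lambda * half_life) for lambda."""
    return math.log(2) / half_life_years


def age_from_fraction(fraction_remaining: float, lam: float) -> float:
    """Solve fraction = e^(-lambda * t) for t."""
    return -math.log(fraction_remaining) / lam


carbon14_lambda = decay_constant(5730)

# Age using the rounded constant from the worked example, then the unrounded one.
age_rounded = age_from_fraction(0.30, 0.000121)
age_unrounded = age_from_fraction(0.30, carbon14_lambda)
print(f"lambda = {carbon14_lambda:.6f}")
print(f"age = {age_rounded:.0f} years (rounded lambda), {age_unrounded:.0f} years (unrounded)")
```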

**Probability density functions**

We can also do some interesting maths by rearranging:

**N(t) = N_{0} e^{-λt}**

**N(t)/N_{0} = e^{-λt}**

and then plotting **N(t)/N_{0}** against time.

**N(t)/N_{0}** will have a range between 0 and 1, as when t = 0, **N(0) = N_{0}**, which gives **N(0)/N_{0} = 1**.

We can then manipulate this into the form of a probability density function – by finding the constant a which makes the area underneath the curve equal to 1:

**∫_{0}^{∞} a e^{-λt} dt = 1**

Solving this gives a = λ. Therefore the following integral:

**∫_{t1}^{t2} λ e^{-λt} dt**

will give the fraction of atoms which will have decayed between times t1 and t2.

We could use this integral to work out the half life of Carbon-14 as follows:

**∫_{0}^{t} 0.000121 e^{-0.000121x} dx = 0.5**

Which, if we solve, gives us t = 5728.5, which is what we’d expect (given our earlier rounding of the decay constant).
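As a quick check, the integral of λe^{-λt} between two times has the closed form e^{-λt1} - e^{-λt2}, so the half life calculation can be sketched as follows. The function name is illustrative and the rounded decay constant from earlier is assumed.

```python
import math

LAMBDA = 0.000121  # rounded Carbon-14 decay constant used in this post


def fraction_decayed(lam: float, t1: float, t2: float) -> float:
    """Integral of lam * e^(-lam * t) from t1 to t2, evaluated in closed form."""
    return math.exp(-lam * t1) - math.exp(-lam * t2)


# Setting fraction_decayed(LAMBDA, 0, t) = 0.5 and solving gives t = ln(2) / LAMBDA.
half_life = math.log(2) / LAMBDA
print(f"half life = {half_life:.1f} years")
print(f"fraction decayed by then = {fraction_decayed(LAMBDA, 0, half_life):.3f}")
```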

We can also now work out the expected (mean) time that an atom will exist before it decays. To do this we use the following equation for finding E(x) of a probability density function f(x):

**E(x) = ∫ x f(x) dx**

and if we substitute in our equation we get:

**E(t) = ∫_{0}^{∞} t λ e^{-λt} dt**

Now, we can integrate this by parts:
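The working for this step is not reproduced in the text here. A standard integration by parts, offered as a sketch and taking u = t and dv = λe^{-λt} dt (the usual textbook labels, assumed here), runs as follows:

```latex
\begin{aligned}
E(t) &= \int_{0}^{\infty} t\,\lambda e^{-\lambda t}\,dt \\
     &= \Big[-t\,e^{-\lambda t}\Big]_{0}^{\infty} + \int_{0}^{\infty} e^{-\lambda t}\,dt \\
     &= 0 + \Big[-\tfrac{1}{\lambda}\,e^{-\lambda t}\Big]_{0}^{\infty} \\
     &= \frac{1}{\lambda}
\end{aligned}
```

The boundary term vanishes because t e^{-λt} tends to 0 as t tends to infinity and is 0 at t = 0.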

So the expected (mean) life of an atom is given by 1/λ. In the case of Carbon-14, with a decay constant λ ≈ 0.000121, we have an expected life of a Carbon-14 atom as:

E(t) = 1 /0.000121

E(t) = 8264 years.

Now that may sound a little strange – after all the half life is 5730 years, which means that half of all atoms will have decayed after 5730 years. So why is the mean life so much higher? Well it’s because of the long right tail in the graph – we will have some atoms with very large lifespans – and this will therefore skew the mean to the right.
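To see the gap between the mean and the half life numerically, the sketch below estimates the mean with a simple midpoint rule and compares it with 1/λ. The cut-off of 200,000 years and the 10 year step are arbitrary choices made here, and the function names are illustrative.

```python
import math

LAMBDA = 0.000121  # rounded Carbon-14 decay constant used in this post


def pdf(t: float) -> float:
    """Probability density of an atom decaying at time t."""
    return LAMBDA * math.exp(-LAMBDA * t)


def mean_lifetime_numeric(upper: float = 200_000.0, step: float = 10.0) -> float:
    """Midpoint-rule estimate of the integral of t * pdf(t) from 0 to `upper`."""
    n = int(upper / step)
    return sum((i + 0.5) * step * pdf((i + 0.5) * step) * step for i in range(n))


mean_exact = 1 / LAMBDA
median = math.log(2) / LAMBDA  # the half life implied by the rounded constant
mean_numeric = mean_lifetime_numeric()
print(f"mean = {mean_exact:.0f} years, numeric estimate = {mean_numeric:.0f} years")
print(f"half life (median) = {median:.0f} years")
```

The mean is always 1/ln(2) ≈ 1.44 times the half life for this distribution, whatever the decay constant.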

**Modeling Volcanoes – When will they erupt?**

A recent post by the excellent Maths Careers website looked at how we can model volcanic eruptions mathematically. This is an important branch of mathematics, which looks to assign risk to events, and its methods are widely used by statisticians and insurers. Given that large-scale volcanic eruptions have the potential to end modern civilisation, it’s also useful to know how likely the next large eruption is.

The Guardian has recently run a piece on the dangers that large volcanoes pose to humans. Iceland’s Eyjafjallajökull volcano which erupted in 2010 caused over 100,000 flights to be grounded and cost the global economy over $1 billion – and yet this was only a very minor eruption historically speaking. For example, the Tambora eruption in Indonesia (1815) was so big that the explosion could be heard over 2000km away, and the 200 million tonnes of sulphur that were emitted spread across the globe, lowering global temperatures by 2 degrees Celsius. This led to widespread famine as crops failed – and tens of thousands of deaths.

**Super volcanoes**

Even this destruction is insignificant when compared to the potential damage caused by a super volcano. These volcanoes, like that underneath Yellowstone Park in America, have the potential to wipe out millions in the initial explosion and to send enough sulphur and ash into the air to cause a “volcanic winter” of significantly lower global temperatures. The graphic above shows that the ash from a Yellowstone eruption could cover the ground of about half the USA. The resultant widespread disruption to global food supplies and travel would be devastating.

So, how can we predict the probability of a volcanic eruption? The easiest model to use, if we already have an estimated probability of eruption, is the Poisson distribution:

P(X = k) = (e^{-λ}λ^{k})/k!

This formula calculates the probability that X equals a given value of k. λ is the mean of the distribution. If X represents the number of volcanic eruptions we have Pr(X ≥ 1) = 1 – Pr(X = 0). This gives us a formula for working out the probability of an eruption as 1 - e^{-λ}. For example, the Yellowstone super volcano erupts around every 600,000 years. Therefore if λ is the expected number of eruptions per year, we have λ = 1/600,000 ≈ 0.00000167 and 1 - e^{-λ} also ≈ 0.00000167. This gets more interesting if we then look at the probability over a range of years. We can do this by modifying the formula for probability to 1 - e^{-tλ}, where t is the number of years in our range.

So the probability of a Yellowstone eruption in the next 1000 years is 1 - e^{-0.00167} ≈ 0.00167, and the probability in the next 10,000 years is 1 - e^{-0.0167} ≈ 0.0165. So we have approximately a 2% chance of this eruption in the next 10,000 years.

A far smaller volcano, like Katla in Iceland, has erupted 16 times in the past 1100 years – giving an average of one eruption every ≈ 70 years. This gives λ = 1/70 ≈ 0.014. So we can expect this to erupt in the next 10 years with probability 1 - e^{-0.14} ≈ 0.13, and in the next 30 years with probability 1 - e^{-0.42} ≈ 0.34.
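These figures can be checked with a few lines of Python. This is a minimal sketch of the 1 - e^{-tλ} formula; the function name is an illustrative choice and the rates are the rounded values used in this post.

```python
import math


def eruption_probability(rate_per_year: float, years: float) -> float:
    """P(at least one eruption in `years`) = 1 - e^(-rate * years)."""
    # -expm1(-x) equals 1 - e^(-x) but stays accurate when x is very small.
    return -math.expm1(-rate_per_year * years)


yellowstone_rate = 1 / 600_000  # about one eruption every 600,000 years
katla_rate = 0.014              # rounded value of 1/70

for years in (1_000, 10_000):
    print(f"Yellowstone, next {years} years: {eruption_probability(yellowstone_rate, years):.5f}")

for years in (10, 30):
    print(f"Katla, next {years} years: {eruption_probability(katla_rate, years):.3f}")
```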

The models for volcanic eruptions can get a lot more complicated – especially as we often don’t have accurate data from which to estimate λ. λ can be estimated using a technique called Maximum Likelihood Estimation – which you can read about here.

If you enjoyed this post you might also like:

Black Swans and Civilisation Collapse. How effective is maths at guiding government policies?

**Are you Psychic?**

There have been people claiming to have paranormal powers for thousands of years. However, scientifically we can say that as yet we still have no convincing proof that any paranormal abilities exist. We can show this using some mathematical tests – such as the binomial or normal distribution.

**ESP Test**

You can test your ESP powers on this site (our probabilities will be a little different than their ones). You have the chance to try and predict what card the computer has chosen. After repeating this trial 25 times you can find out if you possess psychic powers. As we are working with discrete data and have a fixed probability of guessing (0.2) then we can use a binomial distribution. Say I got 6 correct, do I have psychic powers?

We have the Binomial model B(25, 0.2), 25 trials and 0.2 probability of success. So we want to find the probability that I could achieve 6 **or more** by luck.

The probability of getting exactly 6 right is 0.16. Working out the probability of getting 6 or more correct would take a bit longer by hand (though could be simplified by doing 1 – P(X ≤ 5)). Doing this, or using a calculator, we find the probability is 0.38. Therefore we would expect someone to get 6 or more correct just by guessing 38% of the time.

So, using this model, when would we have evidence for potential ESP ability? Well, a minimum bar for our percentages would probably be 5%. So how many do you need to get correct before there is less than a 5% chance of that happening by luck?

Using our calculator we can do trial and error to see that the probability of getting 9 or more correct by guessing is only 4.7%. So, someone getting 9 correct might be showing some signs of ESP. If we asked for a stricter threshold (such as 1%) we would want to see someone get 11 correct.
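The trial and error can be automated. The sketch below uses the B(25, 0.2) model from above and only the standard library; the function names are illustrative.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability of k or more successes in n trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


def smallest_significant_score(n: int, p: float, threshold: float):
    """Smallest score whose 'this many or more' probability is below the threshold."""
    for k in range(n + 1):
        if prob_at_least(k, n, p) < threshold:
            return k
    return None


print(f"P(exactly 6) = {binomial_pmf(6, 25, 0.2):.2f}")
print(f"P(6 or more) = {prob_at_least(6, 25, 0.2):.2f}")
print("5% threshold:", smallest_significant_score(25, 0.2, 0.05))
print("1% threshold:", smallest_significant_score(25, 0.2, 0.01))
```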

Now, in the video above, one of the Numberphile mathematicians manages to toss 10 heads in a row. Again, we can ask ourselves if this is evidence of some extraordinary ability. We can calculate this probability as 0.5^{10} = 0.001. This means that such an event would only happen 0.1% of the time. But, we’re only seeing a very small part of the total video. Here’s the full version:

Suddenly the feat looks less mathematically impressive (though still an impressive endurance feat!)

You can also test your psychic abilities with this video here.

**Statistics to win penalty shoot-outs**

With the World Cup nearly upon us we can look forward to another heroic defeat on penalties by England. England are in fact the worst country of any of the major footballing nations at taking penalties, having won only 1 out of 7 shoot-outs at the Euros and World Cup. In fact of the 35 penalties taken in shoot-outs England have missed 12 – which is a miss rate of over 30%. Germany by comparison have won 5 out of 7 – and have a miss rate of only 15%.

With the stakes in penalty shoot-outs so high there have been a number of studies to look at optimum strategies for players.

**Shoot left when ahead**

One study published in Psychological Science looked at all the penalties taken in penalty shoot-outs in the World Cup since 1982. What they found was pretty incredible – goalkeepers have a subconscious bias for diving to the right when their team is behind.

As is clear from the graphic, this is not a small bias towards the right, but a very strong one. When their team is behind the goalkeeper apparently favours his (likely) strong side 71% of the time. The strikers’ shot meanwhile continues to be placed either left or right with roughly the same likelihood as in the other situations. So, this built in bias makes the goalkeeper much less likely to help his team recover from a losing position in a shoot-out.

**Shoot high**

Analysis by Prozone looking at the data from the World Cups and European Championships between 1998 and 2010 compiled the following graphics:

The first graphic above shows the part of the goal that scoring penalties were aimed at. With most strikers aiming bottom left and bottom right it’s no surprise to see that these were the most successful areas.

The second graphic which shows where penalties were saved shows a more complete picture – goalkeepers made nearly all their saves low down. A striker who has the skill and control to lift the ball high makes it very unlikely that the goalkeeper will save his shot.

The last graphic also shows the risk involved in shooting high. This data shows where all the missed penalties (which were off-target) were being aimed. Unsurprisingly strikers who were aiming down the middle of the goal managed to hit the target! Interestingly strikers aiming for the right corner (as the goalkeeper stands) were far more likely to drag their shot off target than those aiming for the left side. Perhaps this is to do with them being predominantly right footed and the angle of their shooting arc?

**Win the toss and go first**

The Prozone data also showed the importance of winning the coin toss – 75% of the teams who went first went on to win. Equally, missing the first penalty is disastrous to a team’s chances – they went on to lose 81% of the time. The statistics also show a huge psychological role as well. Players who needed to score to keep their teams in the competition only scored a miserable 14% of the time. It would be interesting to see how these statistics are replicated over a larger data set.

**Don’t dive**

A different study which looked at 286 penalties from both domestic leagues and international competitions found that goalkeepers are actually best advised to stay in the centre of the goal rather than diving to one side. This had quite a significant effect on their ability to save the penalties – increasing the likelihood from around 13% to 33%. So, why don’t more goalkeepers stay still? Well, again this might come down to psychology – a diving save looks more dramatic and showcases the goalkeeper’s skill more than standing stationary in the centre.

**So, why do England always lose on penalties?**

There are some interesting psychological studies which suggest that England suffer more than other teams because English players are inhibited by their high public status (in other words, there is more pressure on them to perform – and hence that pressure is harder to deal with). One such study noted that the best penalty takers are the ones who compose themselves prior to the penalty. England’s players start to run to the ball only 0.2 seconds after the referee has blown – making them much less composed than other teams.

However, I think you can place too much weight on psychology – the answer is probably simpler – that other teams beat England because they have technically better players. English footballing culture revolves much less around technical skill than elsewhere in Europe and South America – and when it comes to the penalty shoot-outs this has a dramatic effect.

As we can see from the statistics, players who are technically gifted enough to lift their shots into the top corners give the goalkeepers virtually no chance of saving them. England’s less technically gifted players have to rely on hitting it hard and low to the corner – which gives the goalkeeper a much higher percentage chance of saving them.

**Test yourself**

You can test your penalty taking skills with this online game from the Open University – choose which players are best suited to the pressure, decide what advice they need and aim your shot in the best position.

If you liked this post you might also like:

Championship Wages Predict League Position? A look at how statistics can predict where teams finish in the league.

Premier League Wages Predict League Positions? A similar analysis of Premier League teams.

**Which Times Tables do Students Find Difficult?**

There’s an excellent article on today’s Guardian Datablog looking at a computer based study (with 232 primary school students) on which times tables students find easiest and hardest. Edited highlights (Guardian quotes in italics):

**Which multiplication did students get wrong most often?**

*The hardest multiplication was six times eight, which students got wrong 63% of the time (about two times out of three). This was closely followed by 8×6, then 11×12, 12×8 and 8×12.*

The graphic shows the questions that were answered correctly the greatest percentage of times as dark blue (eg 1×12 was answered 95% correctly). The colours then change through lighter shades of blue, then from lighter reds to darker reds. It’s interesting to see that the difficult multiplications cluster in the middle – perhaps due to how students anchor from either 5 or 10 – so numbers away from both these anchors are more difficult.

**Which times table multiplication did students take the longest time to answer?**

*Maybe unsurprisingly, 1×1 got answered the quickest (but perhaps illustrating the hazards of speed, pupils got it wrong about 10% of the time), at 2.4 seconds on average – while it was 12×9 which made them think for longest, at an average of 7.9 seconds apiece.*

It’s quite interesting to see that this data is somewhat different to the previous graph. You might have expected the most difficult multiplications to also take the longest time – however it looks as though some questions, whilst not intuitive, can be worked out through mental methods (e.g. doing 12×9 by doing 12×10 then subtracting 12).

**How did boys and girls differ?**

*On average, boys got 32% of answers wrong, and took 4.2 seconds to answer each question. Girls, by contrast, got substantially fewer wrong, at 22%, but took 4.6 seconds on average to answer.*

Another interesting statistic – boys were more reckless and less considered with their answers! The element of competition (i.e. having to answer against a clock) may well have encouraged this attitude. It would be interesting to see the gender breakdown to see whether boys and girls have any differences in which multiplication they find difficult.

**Which times table was the hardest?**

As you might expect, overall the 12 times table was found most difficult – closely followed by 8. The numbers furthest away from 5 and 10 (7, 8, 12) are also the most difficult. Is this down to how students are taught to calculate their tables – or because the sequence patterns are less memorable?

This would be a really excellent investigation topic for IGCSE, IB Studies or IB SL. It is something that would be relatively easy to collect data on in a school setting and then can provide a wealth of data to analyse. The full data spreadsheet is also available to download on the Guardian page.

If you enjoyed this post you may also like:

Finger Ratio Predicts Maths Ability? – a maths investigation about finger ratio and mathematical skill.

Premier League Finances – Debt and Wages – an investigation into the finances of Premier League clubs.