Anscombe’s Quartet was devised by the statistician Francis Anscombe to illustrate how important it is not to rely on statistical measures alone when analyzing data. To do this he created 4 data sets which produce nearly identical statistical measures. The scatter graphs above were generated by the Python code here.

**Statistical measures**

1) Mean of x values in each data set = 9.00

2) Standard deviation of x values in each data set = 3.32

3) Mean of y values in each data set = 7.50

4) Standard deviation of y values in each data set = 2.03

5) Pearson’s Correlation coefficient for each paired data set = 0.82

6) Linear regression line for each paired data set: y = 0.500x + 3.00

When looking at this data we would be forgiven for concluding that these data sets must be very similar – but really they are quite different.

**Data Set A:**

x = [10,8,13,9,11,14,6,4,12,7,5]

y = [8.04, 6.95,7.58,8.81,8.33, 9.96,7.24,4.26,10.84,4.82,5.68]

Data Set A does indeed fit a linear regression – so it would be appropriate to use the line of best fit for predictive purposes.

**Data Set B:**

x = [10,8,13,9,11,14,6,4,12,7,5]

y = [9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74]

You could fit a linear regression to Data Set B – but this is clearly not the most appropriate regression line for this data. Some quadratic or higher power polynomial would be better for predicting data here.

**Data Set C:**

x = [10,8,13,9,11,14,6,4,12,7,5]

y = [7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73]

In Data Set C we can see the effect of a single outlier – we have 10 points in pretty much a perfect linear correlation, and then a single outlier. For predictive purposes we would be best investigating this outlier (checking that it does conform to the mathematical definition of an outlier), and then potentially doing our regression with this removed.

**Data Set D:**

x = [8,8,8,8,8,8,8,19,8,8,8]

y = [6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89]

In Data Set D we can also see the effect of a single outlier – we have 10 points in a vertical line, and then a single outlier. Clearly here again drawing a line of best fit for this data is not appropriate: the fitted line is determined entirely by the single outlier, and without it every x value is identical, so no regression of y on x is possible.
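The summary statistics listed earlier can be checked directly from the four data sets. The following is a minimal sketch in plain Python (it is not the linked plotting code, and the function name `summarise` is just an illustrative choice). It uses the sample standard deviation, dividing by n - 1, which is the convention that reproduces the quoted 3.32 and 2.03.

```python
import math

# The four data sets exactly as listed above.
X_COMMON = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
DATASETS = {
    "A": (X_COMMON, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (X_COMMON, [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]),
    "C": (X_COMMON, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Return means, sample standard deviations, Pearson's r and the
    least-squares regression line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "sd_x": math.sqrt(sxx / (n - 1)),
        "mean_y": mean_y,
        "sd_y": math.sqrt(syy / (n - 1)),
        "r": sxy / math.sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARIES = {name: summarise(x, y) for name, (x, y) in DATASETS.items()}

for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four rows of output agree to two decimal places, which is exactly why the scatter graphs are needed to tell the data sets apart.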

**The moral of the story**

So – the moral here is always use graphical analysis alongside statistical measures. A very common mistake for IB students is to rely on Pearson’s product-moment correlation coefficient without really looking at the scatter graph to decide whether a linear fit is appropriate. If you do this then you could end up with a very low mark in the E category as you will not show good understanding of what you are doing. So always plot a graph first!

**Quantum Mechanics – Statistical Universe**

Quantum mechanics is the name for the mathematics that can describe physical systems on extremely small scales. When we deal with the macroscopic – i.e. scales that we experience in our everyday physical world – then Newtonian mechanics works just fine. However, on the microscopic level of particles, Newtonian mechanics no longer works – hence the need for quantum mechanics.

Quantum mechanics is both very complicated and very weird – I’m going to try and give a very simplified (though not simple!) example of how *probabilities* are at the heart of quantum mechanics. Rather than speaking with certainty about the property of an object as we can in classical mechanics, we need to talk about the probability that it holds such a property.

For example, one property of particles is *spin*. We can create a particle with the property of either *up* spin or *down* spin. We can visualise this as an arrow pointing up or down:

We can then create an apparatus (say the slit below parallel to the z axis) which measures whether the particle is in either up state or down state. If the particle is in up spin then it will return a value of +1 and if it is in down spin then it will return a value of -1.

So far so normal. But here is where things get weird. If we then rotate the slit 90 degrees clockwise so that it is parallel to the x axis, we would expect from classical mechanics to get a reading of 0. i.e the “arrow” will not fit through the slit. However that is not what happens. Instead we will still get readings of -1 or +1. However if we run the experiment a large number of times we find that the mean average reading will indeed be 0!

What has happened is that the act of measuring the particle with the slit has changed the state of the particle. Say it was previously +1, i.e. in *up* spin, by measuring it with the newly rotated slit we have forced the particle into a new state of either pointing right (*right* spin) or pointing left (*left* spin). Our rotated slit will then return a value of +1 if the particle is in right spin, and will return a value of -1 if the particle is in left spin.

In this case the probability that the apparatus will return a value of +1 is 50% and the probability that the apparatus will return a value of -1 is also 50%. Therefore when we run this experiment many times we get the average value of 0. Therefore classical mechanics is achieved as a probabilistic approximation of repeated particle interactions.

We can look at a slightly more complicated example – say we don’t rotate the slit 90 degrees, but instead rotate it an arbitrary number of degrees from the z axis as pictured below:

Here the slit was initially parallel to the z axis in the x,z plane (i.e. y=0), and has been rotated Θ degrees. So the question is what is the probability that our previously *up* spin particle will return a value of +1 when measured through this new slit?

The equations above give the probabilities of returning a +1 spin or a -1 spin depending on the angle of orientation. So in the case of a 90 degree orientation we have both P(+1) and P(-1) = 1/2 as we stated earlier. An orientation of 45 degrees would have P(+1) = 0.85 and P(-1) = 0.15. An orientation of 10 degrees would have P(+1) = 0.99 and P(-1) = 0.01.

The statistical average meanwhile is given by the above formula. If we rotate the slit by Θ degrees from the z axis in the x,z plane, then run the experiment many times, we will get a long term average of cosΘ. As we have seen before, when Θ = 90 this means we get an average value of 0. If Θ = 45 degrees we would get an average reading of √2/2.
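The values quoted here are consistent with the standard spin-1/2 result P(+1) = cos²(Θ/2) and P(-1) = sin²(Θ/2). Assuming those formulas, a short Python sketch reproduces the probabilities and the long-run average (the function names are illustrative only):

```python
import math


def spin_probabilities(theta_degrees):
    """Return (P(+1), P(-1)) for a slit rotated theta degrees from the z axis,
    assuming P(+1) = cos^2(theta/2) and P(-1) = sin^2(theta/2)."""
    half_angle = math.radians(theta_degrees) / 2
    return math.cos(half_angle) ** 2, math.sin(half_angle) ** 2


def expected_reading(theta_degrees):
    """Long-run average reading: (+1) * P(+1) + (-1) * P(-1)."""
    p_plus, p_minus = spin_probabilities(theta_degrees)
    return p_plus - p_minus


for angle in (90, 45, 10):
    p_plus, p_minus = spin_probabilities(angle)
    print(f"{angle:>2} degrees: P(+1) = {p_plus:.2f}, "
          f"P(-1) = {p_minus:.2f}, average = {expected_reading(angle):.3f}")
```

The average works out as cos²(Θ/2) - sin²(Θ/2), which is the double-angle identity for cosΘ, so the two statements in the text agree with each other.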

This gives a very small snapshot into the ideas of quantum mechanics and the crucial role that probability plays in understanding quantum states. If you found that difficult, then don’t worry you’re in good company. As Richard Feynman the legendary physicist once said, “If you think you understand quantum mechanics, you don’t understand quantum mechanics.”

Essential resources for IB students:

Revision Village has been put together to help IB students with topic revision both for during the course and for the end of Year 12 school exams and Year 13 final exams. I would strongly recommend students use this as a resource during the course (not just for final revision in Y13!) There are specific resources for HL and SL students for both Analysis and Applications.

The comprehensive Questionbank takes you to a breakdown of each main subject area (e.g. Algebra, Calculus etc) and then provides a large bank of graded questions. What I like about this is that you are given a difficulty rating, as well as a mark scheme and also a worked video tutorial. Really useful!

The Practice Exams section takes you to a large number of ready made quizzes, exams and predicted papers. These all have worked solutions and allow you to focus on specific topics or start general revision. This also has some excellent challenging questions for those students aiming for 6s and 7s.

Each course also has a dedicated video tutorial section which provides 5-15 minute tutorial videos on every single syllabus part – handily sorted into topic categories.

2) Exploration Guides and Paper 3 Resources

I’ve put together four comprehensive pdf guides to help students prepare for their exploration coursework and Paper 3 investigations. The exploration guides talk through the marking criteria, common student mistakes, excellent ideas for explorations, technology advice, modeling methods and a variety of statistical techniques with detailed explanations. I’ve also made 17 full investigation questions which are also excellent starting points for explorations. The Exploration Guides can be downloaded here and the Paper 3 Questions can be downloaded here.

**Modeling Volcanoes – When will they erupt?**

A recent post by the excellent Maths Careers website looked at how we can model volcanic eruptions mathematically. This is an important branch of mathematics – which looks to assign risk to events and these methods are very important to statisticians and insurers. Given that large-scale volcanic eruptions have the potential to end modern civilisation, it’s also useful to know how likely the next large eruption is.

The Guardian has recently run a piece on the dangers that large volcanoes pose to humans. Iceland’s Eyjafjallajökull volcano which erupted in 2010 caused over 100,000 flights to be grounded and cost the global economy over $1 billion – and yet this was only a very minor eruption historically speaking. For example, the Tambora eruption in Indonesia (1815) was so big that the explosion could be heard over 2000km away, and the 200 million tonnes of sulphur that were emitted spread across the globe, lowering global temperatures by 2 degrees Celsius. This led to widespread famine as crops failed – and tens of thousands of deaths.

**Super volcanoes**

Even this destruction is insignificant when compared to the potential damage caused by a super volcano. These volcanoes, like that underneath Yellowstone Park in America, have the potential to wipe out millions in the initial explosion and to send enough sulphur and ash into the air to cause a “volcanic winter” of significantly lower global temperatures. The graphic above shows that the ash from a Yellowstone eruption could cover the ground of about half the USA. The resultant widespread disruption to global food supplies and travel would be devastating.

So, how can we predict the probability of a volcanic eruption? The easiest model to use, if we already have an estimated rate of eruption, is the Poisson distribution:

Pr(X = k) = λ^{k} e^{-λ} / k!

This formula calculates the probability that X equals a given value of k, where λ is the mean of the distribution. If X represents the number of volcanic eruptions we have Pr(X ≥ 1) = 1 - Pr(X = 0). This gives us a formula for working out the probability of an eruption as 1 - e^{-λ}. For example, the Yellowstone super volcano erupts around every 600,000 years. Therefore if λ is the mean number of eruptions per year, we have λ = 1/600,000 ≈ 0.00000167 and 1 - e^{-λ} also ≈ 0.00000167. This gets more interesting if we then look at the probability over a range of years. We can do this by modifying the formula for probability as 1 - e^{-tλ} where t is the number of years for our range.

So the probability of a Yellowstone eruption in the next 1000 years is 1 - e^{-0.00167} ≈ 0.00167, and the probability in the next 10,000 years is 1 - e^{-0.0167} ≈ 0.0165. So we have approximately a 2% chance of this eruption in the next 10,000 years.

A far smaller volcano, like Katla in Iceland, has erupted 16 times in the past 1100 years – giving an average of one eruption every ≈ 70 years. This gives λ = 1/70 ≈ 0.014. So we can expect this to erupt in the next 10 years with probability 1 - e^{-0.14} ≈ 0.13, and in the next 30 years with probability 1 - e^{-0.42} ≈ 0.34.
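These calculations are easy to reproduce. The sketch below assumes the simple Poisson model described above, with a constant yearly rate λ, and uses the rates quoted in the text (the function name `eruption_probability` is just an illustrative choice):

```python
import math


def eruption_probability(rate_per_year, years):
    """Probability of at least one eruption within `years`,
    assuming a Poisson process with constant rate: 1 - e^(-t * lambda)."""
    return 1 - math.exp(-rate_per_year * years)


YELLOWSTONE_RATE = 1 / 600_000  # one eruption roughly every 600,000 years
KATLA_RATE = 0.014              # the rounded value of 1/70 used above

print(f"Yellowstone, 1,000 years:  {eruption_probability(YELLOWSTONE_RATE, 1_000):.5f}")
print(f"Yellowstone, 10,000 years: {eruption_probability(YELLOWSTONE_RATE, 10_000):.4f}")
print(f"Katla, 10 years:           {eruption_probability(KATLA_RATE, 10):.2f}")
print(f"Katla, 30 years:           {eruption_probability(KATLA_RATE, 30):.2f}")
```

For very small values of tλ the probability is almost exactly tλ itself, which is why the one-year Yellowstone probability matches λ to the precision shown.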

The models for volcanic eruptions can get a lot more complicated – especially as we often don’t have accurate data to give us an estimate for λ. λ can be estimated using a technique called Maximum Likelihood Estimation – which you can read about here.

If you enjoyed this post you might also like:

Black Swans and Civilisation Collapse. How effective is maths at guiding government policies?

**IB Revision**

If you’re already thinking about your coursework then it’s probably also time to start planning some revision, either for the end of Year 12 school exams or Year 13 final exams. There’s a really great website that I would strongly recommend students use – you choose your subject (HL/SL/Studies if your exam is in 2020 or Applications/Analysis if your exam is in 2021), and then have the following resources:

The Questionbank takes you to a breakdown of each main subject area (e.g. Algebra, Calculus etc) and each area then has a number of graded questions. What I like about this is that you are given a difficulty rating, as well as a mark scheme and also a worked video tutorial. Really useful!

The Practice Exams section takes you to ready made exams on each topic – again with worked solutions. This also has some harder exams for those students aiming for 6s and 7s and the Past IB Exams section takes you to full video worked solutions to every question on every past paper – and you can also get a prediction exam for the upcoming year.

I would really recommend everyone making use of this – there is a mixture of a lot of free content as well as premium content so have a look and see what you think.

**Which Times Tables do Students Find Difficult?**

There’s an excellent article on today’s Guardian Datablog looking at a computer-based study (with 232 primary school students) on which times tables students find easiest and most difficult. Edited highlights (Guardian quotes in italics):

**Which multiplication did students get wrong most often?**

*The hardest multiplication was six times eight, which students got wrong 63% of the time (about two times out of three). This was closely followed by 8×6, then 11×12, 12×8 and 8×12.*

The graphic shows the questions that were answered correctly the greatest percentage of times as dark blue (eg 1×12 was answered 95% correctly). The colours then change through lighter shades of blue, then from lighter reds to darker reds. It’s interesting to see that the difficult multiplications cluster in the middle – perhaps due to how students anchor from either 5 or 10 – so numbers away from both these anchors are more difficult.

**Which times table multiplication did students take the longest time to answer?**

*Maybe unsurprisingly, 1×1 got answered the quickest (but perhaps illustrating the hazards of speed, pupils got it wrong about 10% of the time), at 2.4 seconds on average – while it was 12×9 which made them think for longest, at an average of 7.9 seconds apiece.*

It’s quite interesting to see that this data is somewhat different to the previous graph. You might have expected the most difficult multiplications to also take the longest time – however it looks as though some questions, whilst not intuitive can be worked out through mental methods (eg doing 12×9 by doing 12×10 then subtracting 12.)

**How did boys and girls differ?**

*On average, boys got 32% of answers wrong, and took 4.2 seconds to answer each question. Girls, by contrast, got substantially fewer wrong, at 22%, but took 4.6 seconds on average to answer.*

Another interesting statistic – boys were more reckless and less considered with their answers! The element of competition (ie. having to answer against a clock) may well have encouraged this attitude. It would be interesting to see the gender breakdown to see whether boys and girls have any differences in which multiplication they find difficult.

**Which times table was the hardest?**

As you might expect, overall the 12 times table was found most difficult – closely followed by 8. The numbers furthest away from 5 and 10 (7,8,12) are also the most difficult. Is this down to how students are taught to calculate their tables – or because the sequence patterns are less memorable?

This would be a really excellent investigation topic for IGCSE, IB Studies or IB SL. It is something that would be relatively easy to collect data on in a school setting and then can provide a wealth of data to analyse. The full data spreadsheet is also available to download on the Guardian page.

If you enjoyed this post you may also like:

Finger Ratio Predicts Maths Ability? – a maths investigation about finger ratio and mathematical skill.

Premier League Finances – Debt and Wages – an investigation into the finances of Premier League clubs.

**Amanda Knox and Bad Maths in Courts**

This post is inspired by the recent BBC News article, “Amanda Knox and Bad Maths in Courts.” The article highlights the importance of good mathematical understanding when handling probabilities – and how mistakes by judges and juries can sometimes lead to miscarriages of justice.

**A scenario to give to students:**

*A murder scene is found with two types of blood – that of the victim and that of the murderer. As luck would have it, the unidentified blood has an incredibly rare blood disorder, only found in 1 in every million men. The capital and surrounding areas have a population of 20 million – and the police are sure the murderer is from the capital. The police have already started cataloging all citizens’ blood types for their new super crime-database. They already have nearly 1 million male samples in there – and bingo – one man, Mr XY, is a match. He is promptly marched off to trial, there is no other evidence, but the jury are told that the odds are 1 in a million that he is innocent. He is duly convicted. The question is, how likely is it that he did not commit this crime?*

**Answer:**

*We can be around 90% confident that he did not commit this crime. Assuming that there are approximately 10 million men in the capital, then were everyone cataloged on the database we would have on average 10 positive matches. Given that there is no other evidence, he therefore has only a 1 in 10 chance of being guilty. Even though P(Fail Test/Innocent) = 1/1,000,000, P(Innocent/Fail test) = 9/10.*
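The counting argument in the answer can be written out as a short Python sketch. It follows the same simplifying assumptions as the scenario: about 10 million men, the disorder present in 1 man per million, and no other evidence pointing at any one of the matching men. The variable names are illustrative only.

```python
MEN_IN_CAPITAL = 10_000_000   # roughly half of the 20 million population
ONE_IN = 1_000_000            # the disorder is found in 1 in every million men

# Expected number of men who would match if everyone were tested.
expected_matches = MEN_IN_CAPITAL / ONE_IN

# With no other evidence, each matching man is equally likely to be the murderer.
p_guilty_given_match = 1 / expected_matches
p_innocent_given_match = 1 - p_guilty_given_match

# The figure the jury was given is a different conditional probability.
p_match_given_innocent = 1 / ONE_IN

print(f"Expected matches:     {expected_matches:.0f}")
print(f"P(innocent | match) = {p_innocent_given_match:.2f}")
print(f"P(match | innocent) = {p_match_given_innocent:.6f}")
```

The two conditional probabilities differ by a factor of nearly a million, which is the whole point of the scenario.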

**Amanda Knox**

Eighteen months ago, Amanda Knox and Raffaele Sollecito, who were previously convicted of the murder of British exchange student Meredith Kercher, were acquitted. The judge at the time ruled out re-testing a tiny DNA sample found at the scene, stating that, “The sum of the two results, both unreliable… cannot give a reliable result.”

This logic, however, whilst intuitive, is not mathematically correct. As explained by mathematician Coralie Colmez in the BBC News article, by repeating relatively unreliable tests we can make them more reliable – the larger the pooled sample size, the more confident we can be in the result.
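One simple way to see why repetition helps is a majority vote over independent repeats of the same test. This is not the calculation from the BBC article, just an illustration under stated assumptions: each repeat is independent, and each is correct with the same probability. The 70% accuracy figure below is hypothetical.

```python
from math import comb


def majority_correct(accuracy, repeats):
    """Probability that more than half of `repeats` independent tests
    give the right answer (binomial tail). Use an odd number of repeats."""
    needed = repeats // 2 + 1
    return sum(
        comb(repeats, k) * accuracy**k * (1 - accuracy) ** (repeats - k)
        for k in range(needed, repeats + 1)
    )


HYPOTHETICAL_ACCURACY = 0.7  # an unreliable test: right only 70% of the time

for repeats in (1, 3, 11, 51):
    print(f"{repeats:>2} repeats: {majority_correct(HYPOTHETICAL_ACCURACY, repeats):.3f}")
```

As long as a single test is right more often than not, the combined result becomes steadily more reliable as the number of repeats grows.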

**Sally Clark**

One of the most (in)famous examples of bad maths in the court room is that of Sally Clark – who was convicted of the murder of her two sons in 1999. It has been described as, “one of the great miscarriages of justice in modern British legal history.” Both of Sally Clark’s children died from cot-death whilst still babies. Soon afterwards she was arrested for murder. The case was based on a seemingly incontrovertible statistic – that the chance of 2 children from the same family dying from cot-death was 1 in 73 million. Experts testified to this, the jury were suitably convinced and she was convicted.

The crux of the prosecutor’s case was that it was so statistically unlikely that this had happened by chance, that she must have killed her children. However, this was bad maths – which led to an innocent woman being jailed for four years before her eventual acquittal.

**Independent Events**

The 1 in 73 million figure was arrived at by simply looking at the probability of a single cot-death (1 in 8500) and then squaring it – because it had happened twice. However, this method only works if both events are independent – and in this case they clearly weren’t. Any biological or social factors which contribute to the death of a child due to cot-death will also mean that another sibling is also at elevated risk.
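The effect of the independence assumption can be sketched in a few lines of Python. The first calculation squares the rounded 1 in 8500 figure quoted above, which lands close to the 1 in 73 million presented in court. The second uses a purely hypothetical conditional risk for a second death (the 1 in 100 value is invented for illustration and is not from the case) to show how strongly the answer depends on that assumption.

```python
ONE_IN_SINGLE = 8500  # single cot-death risk quoted above (rounded)

# Naive calculation: treat the two deaths as independent and multiply.
one_in_if_independent = ONE_IN_SINGLE * ONE_IN_SINGLE

# HYPOTHETICAL: suppose that after one cot-death, shared biological or
# social factors raise the risk for a sibling to 1 in 100.
ONE_IN_SECOND_GIVEN_FIRST = 100
one_in_if_dependent = ONE_IN_SINGLE * ONE_IN_SECOND_GIVEN_FIRST

print(f"Assuming independence:          1 in {one_in_if_independent:,}")
print(f"With the hypothetical dependence: 1 in {one_in_if_dependent:,}")
```

Whatever the true conditional risk is, multiplying P(first) by P(second given first) rather than by P(second) is the correct rule when the events are not independent.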

**Prosecutor’s Fallacy**

Additionally this figure was presented in a way which is known as the “prosecutor’s fallacy” – the 1 in 73 million figure (even if correct) didn’t represent the probability of Sally Clark’s innocence, because it should have been compared against the probability of guilt for a double homicide. In other words, the probability of a false positive is not the same as the probability of innocence. In mathematical language, P(Fail Test/Innocent) is not equal to P(Innocent/Fail test).

Subsequent analysis of the Sally Clark case by a mathematics professor concluded that rather than having a 1 in 73 million chance of being innocent, actually it was about 4-10 times more likely this was due to natural causes rather than murder. Quite a big turnaround – and evidence of why understanding statistics is so important in the courts.

This topic has also been highlighted recently by the excellent ToK website, Lancaster School ToK.

If you enjoyed this topic you might also like:

Benford’s Law – Using Maths to Catch Fraudsters

The Mathematics of Cons – Pyramid Selling

**Premier League Finances – Debt and Wages**

This is a great article from the Guardian DataBlog analysing the finances for last season’s Premier League clubs. As the Guardian says, “More than two thirds of the Premier League’s record £2.4bn income in 2011-12 was paid out in wages, according to the most recently published accounts of all 20 clubs. The Guardian’s annual special report of Premier League clubs’ finances shows they spent £1.6bn on wages last season, most of it going to players.”

The first graph (above) shows the net debt levels for different clubs.

The second graph shows the total turnover:

The third graph shows wages as a proportion of turnover:

and the last one is particularly interesting – as it ranks clubs on their wage bills and their league position. This would be an interesting piece of data to test for the strength of correlation:

I’ve used an online scatter plot to calculate both the regression line and the correlation coefficient:

This clearly shows a strong positive correlation. This would be an interesting exercise for both IGCSE and IB students (especially Maths Studies).

For even more data, a club by club full breakdown is also provided by the Guardian here. I have also made the data above into a Word document to be used as some A4 posters – and you can download that here: Premier League Debt

If you enjoyed this post you might also like:

Which Times Tables do Students Find Difficult? An Investigation.

Why Study Maths? Careers Inspiration

This is a really nice worksheet and associated powerpoint for collecting a variety of data from the class – measuring reaction times, memory, head circumference etc. Everything is easily laid out ready for students to fill in. It would also be suitable for IGCSE, and even KS3 (you would just interpret to different levels).

Making Statistics Relevant is a brilliant website from the same creator as the RISPS resources – Jonny Griffiths. Each statistics topic has an extension task created to get students using their problem solving skills. Topics covered include measures of central tendency, probability, discrete random variables, Poisson distribution, binomial distribution, normal distribution and chi squared. Great for providing varied non-textbook material and for stretching gifted and talented students.

To give an idea of the kind of tasks provided, here is one below. (Worked answers are also provided).