Chi Square: Language Detection + Code Breaking


We can use the power of maths to allow computers to accurately recognise which language someone is writing in, without needing any understanding of the language at all.  How?  With the Chi Square goodness of fit test.  Each language has its own characteristic distribution of letter frequencies: in English E is the most commonly used letter (around 12.4%).  In Spanish E is also the most common, but has a different probability of occurrence (around 13.1%).  Given a text (the longer the better) we can conduct a Chi Square goodness of fit test to see which language it likely came from.

Test 1:  Spanish or English?

Pretend for a moment that you have no knowledge of any languages – and you want to test whether the following quote is from English:

This will test how useful the chi squared test is in identifying the correct sentence. The longer the sentence the more accurate this technique will be.  Here is something to always remember, as the great Einstein said, “Do not worry too much about your difficulties in mathematics, I can assure you that mine are still greater.”

We can now do a Chi Square goodness of fit test.  I used the Boxentriq site for the frequency analysis (and also the language frequencies).  I chose to group letters together so that all expected frequencies are at least 5.  This means that I meet the requirements that all expected frequencies are at least 1 and no more than 20% are less than 5.
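The counting and grouping step can be sketched in Python as below. This is only an illustration under stated assumptions: the ENGLISH table holds commonly quoted approximate percentages (not the Boxentriq figures used in this post), and the pooling rule in the hypothetical group_letters function (merge the rarest letters until each group has an expected count of at least 5) is one possible choice, not necessarily the grouping used in the calculation here.

```python
import string
from collections import Counter

# Approximate English letter frequencies in percent, a to z.
# Assumption: commonly quoted textbook values, not the Boxentriq figures.
ENGLISH = dict(zip(string.ascii_lowercase, [
    8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
    0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
    6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074,
]))


def letter_counts(text):
    """Count a-z occurrences, ignoring case, spaces and punctuation."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return {letter: counts.get(letter, 0) for letter in string.ascii_lowercase}


def group_letters(observed, percentages, min_expected=5.0):
    """Return [letters, observed, expected] groups with expected >= min_expected."""
    n = sum(observed.values())
    scale = n / sum(percentages.values())
    expected = {letter: pct * scale for letter, pct in percentages.items()}
    groups, pool = [], ["", 0, 0.0]
    for letter in sorted(expected, key=expected.get, reverse=True):
        if expected[letter] >= min_expected:
            groups.append([letter, observed[letter], expected[letter]])
            continue
        # Rare letter: add it to the current pool of rare letters.
        pool = [pool[0] + letter, pool[1] + observed[letter], pool[2] + expected[letter]]
        if pool[2] >= min_expected:
            groups.append(pool)
            pool = ["", 0, 0.0]
    if pool[0]:
        # A leftover pool is too small on its own, so fold it into the last group.
        if groups:
            last = groups[-1]
            groups[-1] = [last[0] + pool[0], last[1] + pool[1], last[2] + pool[2]]
        else:
            groups.append(pool)
    return groups


text = "This will test how useful the chi squared test is"
for letters, obs, exp in group_letters(letter_counts(text), ENGLISH):
    print(letters, obs, round(exp, 1))
```

With a longer text more letters clear the threshold on their own, so fewer of them need to be pooled.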

Our null hypothesis is that the letter distribution of this text fits the distribution for the English language.  Doing the calculation gives a Chi Square value of 28.6.
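For reference, the value being calculated is the standard Chi Square goodness of fit statistic.  The symbols are my own notation: O_i and E_i are the observed and expected counts in group i, and k is the number of groups.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad \text{degrees of freedom} = k - 1
```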

We have 20 - 1 = 19 degrees of freedom, therefore our critical value at the 5% level is 30.144.

28.6 < 30.144, therefore we have insufficient evidence at the 5% level to reject the null hypothesis that this is from the English language.
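The decision rule can be written as a short Python sketch.  The function names are hypothetical, and the critical values are simply the two quoted in this post rather than values computed from the Chi Square distribution.

```python
# Critical values at the 5% level as quoted in this post,
# keyed by degrees of freedom (assumption: only these two are needed).
CRITICAL_5_PERCENT = {19: 30.144, 17: 27.587}


def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all groups."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def rejects_null(statistic, degrees_of_freedom):
    """True if the statistic exceeds the 5% critical value."""
    return statistic > CRITICAL_5_PERCENT[degrees_of_freedom]


print(rejects_null(28.6, 19))  # False: no evidence against English
```

A statistic below the critical value does not prove the text is English, it only means the test found no evidence against it.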

Test 2:  Spanish or English?

Here is another quote (translated by Google – so maybe not perfect!)

Esto comprobará qué tan útil es la prueba de chi cuadrado para identificar la oración correcta. Cuanto más larga sea la frase, más precisa será esta técnica. Aquí hay algo para recordar siempre, como dijo   Einstein: “No te preocupes demasiado por tus dificultades en matemáticas, te puedo asegurar que las mías son aún mayores”.

This time, when we do a Chi Square test on this text against the English distribution, we get a Chi Square value of 84.1.

84.1 > 30.144, so this time we have evidence at the 5% level to reject our null hypothesis that this is from the English language.

So let’s test this quote against the distribution for Spanish letters.  Now the null hypothesis is that the letter distribution of this text fits the Spanish language.  (Note I had to regroup slightly to ensure all expected frequencies were 5 or more.)  The calculation gives a Chi Square value of 19.5.

This time we have 17 degrees of freedom, so the critical value at 5% is 27.587.

19.5 < 27.587, therefore we have insufficient evidence at the 5% level to reject the null hypothesis that this text fits the Spanish distribution.

We can see how powerful this tool is: in the hands of a computer, these calculations are extremely easy and can be repeated against the letter distribution of every language for which frequencies are available.  As long as the text is long enough, this method can be very effective at identifying the language, without ever having to understand a single word!
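Here is a rough Python sketch of how a computer might rank candidate languages.  Several things are assumptions on my part: the two frequency tables are commonly quoted approximate percentages (not the Boxentriq figures), accents are stripped so that á counts as a, and for brevity the letters are not grouped, so the scores will not match the values calculated above.

```python
import string
import unicodedata
from collections import Counter

LETTERS = string.ascii_lowercase

# Approximate letter frequencies in percent, a to z (assumed values).
TABLES = {
    "English": [8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
                0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
                6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074],
    "Spanish": [11.525, 2.215, 4.019, 5.010, 12.181, 0.692, 1.768, 0.703, 6.247,
                0.493, 0.011, 4.967, 3.157, 6.712, 8.683, 2.510, 0.877, 6.871,
                7.977, 4.632, 2.927, 1.138, 0.017, 0.215, 1.008, 0.467],
}


def letter_counts(text):
    """Observed counts for a to z, with accents stripped."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    counts = Counter(ch for ch in decomposed if ch in LETTERS)
    return [counts[ch] for ch in LETTERS]


def chi_square(observed, percentages):
    """Ungrouped Chi Square of observed counts against a percentage table."""
    n = sum(observed)
    if n == 0:
        raise ValueError("text contains no letters")
    scale = n / sum(percentages)  # turns percentages into expected counts
    return sum((o - p * scale) ** 2 / (p * scale)
               for o, p in zip(observed, percentages))


def rank_languages(text, tables=TABLES):
    """Return (language, score) pairs, best fit (lowest score) first."""
    observed = letter_counts(text)
    scores = {name: chi_square(observed, table) for name, table in tables.items()}
    return sorted(scores.items(), key=lambda item: item[1])


sample = "Cuanto más larga sea la frase, más precisa será esta técnica."
for language, score in rank_languages(sample):
    print(language, round(score, 1))
```

Adding another language only needs another row in TABLES.  Skipping the grouping step means very rare letters can dominate the score, so a more careful version would pool them first.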

Test 3:  Code breaking

Chi Square also allows computers to quickly break simple ciphers that preserve letter frequencies.  The simplest is a Caesar shift.  Say for example the following message is received:

uijtxjmmuftuipxvtfgvmuifdijtrvbsfeuftujtjojefoujgzjohuifdpssfdutfoufodfuifmpohfsuiftfoufodfuifnpsfbddvsbufuijtufdiojrvfxjmmcf.Ifsf jt tpnfuijoh up bmxbzt sfnfncfs, bt uif hsfbu fjotufjo tbje Ep opu xpssz upp nvdi bcpvu zpvs ejggjdvmujft jo nbuifnbujdt, J dbo bttvsf zpv uibu njof bsf tujmm hsfbufs.

All a computer has to do is perform a Chi Square goodness of fit test on this data, then shift all the letters along by one and perform another test.  Once it has performed all 26 tests, the shift with the lowest Chi Square score is most likely the correct message.

For example, running the test on the message exactly as received gives a very large Chi Square value.

Clearly not a good fit!  So we would then shift all observed letters back by 1 (so F goes to E etc).  This would then give a Chi Square of 28.6.

The computer would do this for all 26 combinations – the Chi Square of 28.6 is likely to be the lowest as this is the correct decoded message (the same as the message at the start of the post).
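The whole search can be sketched in a few lines of Python.  Again this rests on assumptions: the ENGLISH table holds commonly quoted approximate percentages rather than the Boxentriq figures, the function names are my own, and the letters are left ungrouped, so the scores will differ from the 28.6 above even though the winning shift should be the same.

```python
import string

LETTERS = string.ascii_lowercase

# Approximate English letter frequencies in percent, a to z (assumed values).
ENGLISH = [8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
           0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
           6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074]


def shift_back(text, k):
    """Shift every letter k places back in the alphabet (k=1 sends f to e)."""
    shifted = []
    for ch in text.lower():
        if ch in LETTERS:
            shifted.append(LETTERS[(LETTERS.index(ch) - k) % 26])
        else:
            shifted.append(ch)  # leave spaces and punctuation alone
    return "".join(shifted)


def chi_square_vs_english(text):
    """Ungrouped Chi Square of the text's letter counts against ENGLISH."""
    observed = [text.count(ch) for ch in LETTERS]
    scale = sum(observed) / sum(ENGLISH)
    return sum((o - p * scale) ** 2 / (p * scale)
               for o, p in zip(observed, ENGLISH))


def break_caesar(ciphertext):
    """Try all 26 shifts and return (shift, decoded text) for the best fit."""
    candidates = [(chi_square_vs_english(shift_back(ciphertext, k)), k)
                  for k in range(26)]
    _, best_shift = min(candidates)
    return best_shift, shift_back(ciphertext, best_shift)


message = ("Ifsf jt tpnfuijoh up bmxbzt sfnfncfs, bt uif hsfbu fjotufjo tbje "
           "Ep opu xpssz upp nvdi bcpvu zpvs ejggjdvmujft jo nbuifnbujdt, "
           "J dbo bttvsf zpv uibu njof bsf tujmm hsfbufs.")
shift, decoded = break_caesar(message)
print(shift, decoded)
```

The same idea extends to other ciphers only if there is a small enough set of keys to try one by one, which is what makes the Caesar shift so easy to break.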

So – there we go!  An extremely powerful algorithm for computers to employ in language recognition, pattern recognition and code breaking.


Andrew Chambers: (Resources for IB teachers)
