Medical Data Mining

It’s worth watching the video above, where Derren Brown manages to flip 10 heads in a row. With Derren being a professional magician, you might expect some magic or sleight of hand, but no: it’s all filmed in one continuous camera shot, with no tricks. So how does he achieve something which should only occur with probability $(0.5)^{10} \approx 0.001$, or about 1 time in every thousand? Understanding this trick is essential to understanding the dangers of accepting data presented to you without being aware of how it was generated.

At 7 minutes in, Derren reveals the trick. It’s very simple, but also a very persuasive way to convince people something unusual is happening. The trick is that Derren spent the best part of an entire day tossing coins, and only showed the sequence in which he achieved 10 heads in a row. With this new information, the result suddenly looks much less remarkable.
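To get a feel for how long this takes, here is a minimal simulation sketch. It assumes a fair coin and independent tosses, and the function names (`tosses_until_run`, `expected_tosses`) are purely illustrative. It counts how many tosses are needed before 10 heads appear in a row and compares the simulated average with the standard theoretical mean of 2^(n+1) − 2 tosses for a run of n heads.

```python
import random


def tosses_until_run(run_length=10, rng=None):
    """Toss a fair coin until `run_length` heads appear in a row.

    Returns the total number of tosses made.
    """
    if rng is None:
        rng = random.Random()
    streak = 0   # current number of consecutive heads
    tosses = 0   # total tosses so far
    while streak < run_length:
        tosses += 1
        if rng.random() < 0.5:   # heads
            streak += 1
        else:                    # tails resets the streak
            streak = 0
    return tosses


def expected_tosses(run_length=10):
    """Theoretical mean number of fair-coin tosses to see the run: 2^(n+1) - 2."""
    return 2 ** (run_length + 1) - 2


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    samples = [tosses_until_run(10, rng) for _ in range(500)]
    print("simulated mean tosses:", sum(samples) / len(samples))
    print("theoretical mean tosses:", expected_tosses(10))
```

On average a run of 10 heads needs about two thousand tosses, so anyone prepared to keep tossing for long enough will eventually get the "impossible" sequence on film.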

Scientific tests are normally performed at a 5% significance level: if the observed result (or something more extreme) would occur less than 5% of the time by chance alone when the null hypothesis is true, then we regard the data as evidence to reject the null hypothesis and accept the alternative hypothesis. In the case of the coin toss we would, if we didn’t know better, reject the null hypothesis that this is a fair coin and conjecture that Derren is somehow affecting the results.
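The arithmetic behind this can be sketched in a few lines. The example below is a simplification: it treats each attempt as a separate, independent block of 10 tosses (an assumption made for illustration, not a description of how the video was filmed), and the function names are hypothetical. It shows that a single block of 10 heads falls well below the 5% threshold, while the chance of seeing at least one such block grows quickly with the number of attempts.

```python
def p_all_heads(n=10, p_heads=0.5):
    """Probability that n independent tosses all come up heads."""
    return p_heads ** n


def p_at_least_one(p_single, attempts):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p_single) ** attempts


if __name__ == "__main__":
    p = p_all_heads(10)  # 1/1024 for a fair coin
    print("p-value for a single block of 10 heads:", p)
    print("significant at the 5% level?", p < 0.05)

    # How likely is at least one all-heads block as the attempts pile up?
    for attempts in (1, 100, 1000, 5000):
        print(attempts, "attempts ->", p_at_least_one(p, attempts))
```

The significance test is only meaningful for the single attempt it was designed for. Once many attempts are made and only the best one is reported, the 5% threshold no longer says anything about how surprising the reported result is.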

Selectively presenting only the favourable results from many trials is a form of data mining (often called cherry-picking, or publication bias when it happens in the research literature), and it’s a very powerful way to manipulate data. Unfortunately it is also a widespread practice in the pharmaceutical industry when data on new drugs is released. Trials which show a positive effect are published; those which show no effect (or negative effects) are not. This is a massive problem, and one which has huge implications for people’s health. After all, we are prescribed drugs based on scientific trials which attest to their efficacy. If this data is being mined to skew results in the drug company’s favour then we may end up taking drugs that don’t work, or that even make us worse.
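A small simulation makes the effect of selective publication concrete. The sketch below is purely illustrative and uses invented data: it assumes a drug with no true effect, normally distributed patient outcomes, 100 patients per arm, and a one-sided z-test using a normal approximation. All names (`run_trial`, `simulate`) are hypothetical. Only trials that reach significance at the 5% level are "published".

```python
import math
import random
import statistics


def run_trial(rng, n_per_arm=100, true_effect=0.0):
    """Simulate one two-arm trial; return (estimated effect, one-sided p-value)."""
    drug = [rng.gauss(true_effect, 1.0) for _ in range(n_per_arm)]
    placebo = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    effect = statistics.fmean(drug) - statistics.fmean(placebo)
    # Standard error of the difference between the two sample means
    se = math.sqrt(
        statistics.variance(drug) / n_per_arm
        + statistics.variance(placebo) / n_per_arm
    )
    z = effect / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z), normal approximation
    return effect, p_value


def simulate(n_trials=200, alpha=0.05, seed=0):
    """Run many trials of a useless drug; 'publish' only the significant ones."""
    rng = random.Random(seed)
    results = [run_trial(rng) for _ in range(n_trials)]
    all_effects = [effect for effect, _ in results]
    published = [effect for effect, p in results if p < alpha]
    return all_effects, published


if __name__ == "__main__":
    all_effects, published = simulate()
    print("trials run:", len(all_effects))
    print("trials published:", len(published))
    print("mean effect, all trials:", statistics.fmean(all_effects))
    if published:
        print("mean effect, published only:", statistics.fmean(published))
```

Across all the simulated trials the average effect sits close to zero, as it should for a drug that does nothing. The published subset, by construction, contains only positive effects, so a reader who sees just those trials would conclude that the drug works.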

Dr Ben Goldacre has written extensively on this topic – and an extract from his article “The Drugs Don’t Work” is well worth a read:

The Drugs Don’t Work

Reboxetine is a drug I have prescribed. Other drugs had done nothing for my patient, so we wanted to try something new. I’d read the trial data before I wrote the prescription, and found only well-designed, fair tests, with overwhelmingly positive results. Reboxetine was better than a placebo, and as good as any other antidepressant in head-to-head comparisons. It’s approved for use by the Medicines and Healthcare products Regulatory Agency (the MHRA), which governs all drugs in the UK. Millions of doses are prescribed every year, around the world. Reboxetine was clearly a safe and effective treatment. The patient and I discussed the evidence briefly, and agreed it was the right treatment to try next. I signed a prescription.

But we had both been misled. In October 2010, a group of researchers was finally able to bring together all the data that had ever been collected on reboxetine, both from trials that were published and from those that had never appeared in academic papers. When all this trial data was put together, it produced a shocking picture. Seven trials had been conducted comparing reboxetine against a placebo. Only one, conducted in 254 patients, had a neat, positive result, and that one was published in an academic journal, for doctors and researchers to read. But six more trials were conducted, in almost 10 times as many patients. All of them showed that reboxetine was no better than a dummy sugar pill. None of these trials was published. I had no idea they existed.

It got worse. The trials comparing reboxetine against other drugs showed exactly the same picture: three small studies, 507 patients in total, showed that reboxetine was just as good as any other drug. They were all published. But 1,657 patients’ worth of data was left unpublished, and this unpublished data showed that patients on reboxetine did worse than those on other drugs. If all this wasn’t bad enough, there was also the side-effects data. The drug looked fine in the trials that appeared in the academic literature; but when we saw the unpublished studies, it turned out that patients were more likely to have side-effects, more likely to drop out of taking the drug and more likely to withdraw from the trial because of side-effects, if they were taking reboxetine rather than one of its competitors.

The whole article is a fantastic (and worrying) account of regulatory failure. At the heart of this problem lies a social and political misunderstanding of statistics which is being exploited by drug companies for profit. A proper regulatory framework would ensure that all trials were registered in advance and their data recorded. Instead, trials are commissioned by drug companies, published if they are favourable and quietly buried if they are not. This kind of data mining would not be accepted in an IB exploration coursework, yet these statistics still govern which pills doctors do and don’t prescribe.

When presented with data, therefore, your first question should be, “Where did this come from?” shortly followed by, “What about the data you’re not showing me?” Lies, damned lies and statistics indeed!
