Statistics have always had a bad press. Ever since Disraeli’s supposed utterance “lies, damned lies and statistics”, the public have treated any claim based on statistical analysis with appropriate caution. But if it is indeed true that statistics can be used (or more likely abused) to ‘prove’ almost anything, it is equally true that without proper use of statistics almost nothing can be considered proven.
When it comes to drugs, that means comparing the response in the treated group with that in a placebo group to determine whether a given treatment has had any effect at all in that particular patient population.
With the rise of personalized medicine (the sound concept that what works for one patient may not work for the next), this well-accepted paradigm has started to face challenges – not least from patients brought up with a healthy dose of skepticism for statistics. Looking at the trial data, there always seems to be a proportion of patients who appear to have responded very well to the drug – even if, on average, the improvement was not statistically significant.
Inebriated by the principle of personalization, patients and advocacy groups are quick to demand access to a drug that appears to have provided benefit for a subset of the patients in the trial. But that conveniently forgets the purpose of statistics – the aggregation of data from lots of individuals so as to eliminate the operation of chance.
In an untreated population, some people get better and some people get worse over a given time interval. Statistics determines how likely it is that people getting treated did, on average, better than those who were untreated.
Assuming that anyone receiving a drug who is better at the end of the trial than they were before (without reference to a control population) has benefited from the treatment is very dangerous (if entirely understandable for the patient themselves and those who care about them). Unless the difference is statistically significant, the supposed benefit is as likely to be due to chance as to the costly drug.
Statistics are our only barrier of defense against old-fashioned quackery – where anecdotal evidence, powerful marketing and advocacy, rather than solid scientific data, are the basis for adopting new medicines
Of course, clinical trials can be (and often are) badly designed – looking at the wrong patient population or the wrong end-point. That is the fault of the drug developer, not infrequently aided and abetted in their folly by the regulators. Making such mistakes lets down the very people we are trying to help (as well as the investors in those companies). But we must not try to make up for those errors by undermining the primacy of the statistically significant clinical trial as the only acceptable metric for efficacy.
Dredging through the data of a failed trial to find subsets of ‘responders’ is becoming a favorite pastime. Drug companies are doing it in Phase 2 data, and progressing drugs into vast and expensive Phase 3 trials that are doomed to fail. Patient advocacy groups are doing it to Phase 3 data to plead for the approval (and sale) of drugs that more than likely do not work at all (often with the full support of the owner of the drug who would love to be able to sell their product).
Statisticians of the world must unite, rise up and find a voice to halt this madness.
If you are treated with an unknown white powder, and you get better, we can never know whether that would have happened anyway, or whether the powder is a wonder drug. Any given individual can only ever follow one timeline, one set of choices, through life. We can never know what would have happened if a different path had been taken.
Getting round this question is the purpose of statistics. By aggregating data from lots of individuals, identical (on average) in every respect save for the intervention, it is possible to say whether the treated group did better (on average) than those who were untreated.
It’s never possible to be certain – no matter how big the study, it’s always possible in principle that the patients selected for treatment would have done better than the untreated group even without the drug. But as the apparent benefit gets larger, the probability that it arose by chance falls dramatically. Statistics, then, simply puts a number on this probability.
This, the internal mechanics of statistics, is a mathematical certainty. There is no room for doubt. Statistics properly quantifies the chance that an apparent difference between two groups arose by chance alone.
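To make that concrete, here is a minimal sketch (in Python, using SciPy’s two-sample t-test and entirely invented numbers rather than any real trial data) of what putting a number on that probability looks like: a pre-specified outcome is compared between treated and placebo groups, and the p-value estimates how often a difference this large would arise by chance alone.

```python
# A minimal sketch, with invented numbers, of how a trial analysis puts a
# number on chance: compare a pre-specified outcome between treated and
# placebo groups using a two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical outcome scores: a modest true benefit in the treated group.
placebo = rng.normal(loc=0.0, scale=1.0, size=200)
treated = rng.normal(loc=0.3, scale=1.0, size=200)

t_stat, p_value = stats.ttest_ind(treated, placebo)
print(f"mean difference: {treated.mean() - placebo.mean():.2f}")
print(f"p-value:         {p_value:.4f}")
# A small p-value means: if the drug truly did nothing, a difference this
# large would only rarely arise by chance.
```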
If statistics is so powerful, how come it has earned such a grim reputation?
The issue is not in the answer but in the question. Statistics often provides the perfect answer to an injudicious question. Introduce very subtle errors in the way the question is asked, in the selection of the groups, or in the measurement of the outcome, and you find you have answered a question that was marginally, but critically, different from the one you intended. Worse still, people love to generalize: most statistical studies answer a very specific question – for that population, the thing you measured most likely changed in response to treatment. But for any other population, or any other measurement? Statistics is silent – without another trial.
This generalization is wide open to abuse. Politicians in particular love to take statistics that ‘prove’ one question, but use them – stretching generalization beyond breaking point – to answer a superficially similar question to which they actually have little applicability. This confidence trick, oft repeated, has led to the epidemic of doubt surrounding anything statistical.
On top of this landscape of general mistrust of statistics, another force has begun to challenge the value of a statistically significant clinical trial result: personalized medicine. Patients are not numerical averages, they are individuals. And individuals respond differently to medicines.
The temptation, armed with this knowledge (which is indisputably true), is to trawl through the data of any clinical study and search for the individuals who responded.
But for any dataset with a reasonable amount of random noise in it (for that, read almost every clinical trial dataset), it is possible to find a subset in the treated group who, once identified and analysed separately, clearly showed a much better outcome than the placebo group. You can even (erroneously) apply a new statistical test comparing this group to the placebo group and find the result massively significant.
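A toy simulation illustrates the trap. In the sketch below (illustrative Python using SciPy, with every number randomly generated and no drug effect whatsoever), the honest, whole-group comparison is – as it should be – unimpressive, yet cherry-picking the best third of the treated arm after the fact and re-testing that subset against the untouched placebo group yields a vanishingly small p-value.

```python
# A toy simulation of the data-dredging trap: here the "drug" does nothing
# at all, yet selecting the best "responders" after seeing the data and
# re-testing them against placebo looks hugely significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Both groups drawn from the same distribution: no true drug effect.
placebo = rng.normal(size=300)
treated = rng.normal(size=300)

# Honest, pre-specified comparison of the whole groups.
_, p_whole = stats.ttest_ind(treated, placebo)

# Post hoc: keep only the treated patients who "responded" (the top third),
# then compare that hand-picked subset with the untouched placebo group.
responders = np.sort(treated)[-100:]
_, p_dredged = stats.ttest_ind(responders, placebo)

print(f"whole-group p-value:        {p_whole:.3f}")    # typically not significant
print(f"'responder' subset p-value: {p_dredged:.2e}")  # spuriously tiny
```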
Armed with such a finding, it’s easy to believe the drug works, and demand its approval.
But such a conclusion is completely unreliable. Why? Because you only selected the responders AFTER you saw the data. Statisticians use the Latin “post hoc” (meaning ‘after the fact’) to describe such analyses. A more accessible term for the same trick is “data dredging”, which conjures up a mostly accurate image of desperately sifting through sludge to locate a lost wedding ring.
“There are lies, damned lies and post-hoc data analysis”
Statistics depends absolutely for its power on deciding who will be included in the analysis before you collect the data – “a priori”, in statistico-Latin.
If you are “allowed” to pick the best responders in the treatment group, why shouldn’t you also be “allowed” to remove the people who got sick in the placebo group? That would make any drug effect seem smaller. But, you argue, some of the people in the treated group also would have got sick, so you can’t eliminate them from one group and not the other! Precisely. Nor can you eliminate the poor responders from the treatment group and not the placebo group. But since the definition of ‘responder’ is based only on the response to the drug, it’s impossible to know who they would have been in the placebo group.
This kind of “post hoc” data dredging is now so prevalent, it has almost become the norm. And it’s not just the statistically unsophisticated who are guilty. It may be excusable (though no less misleading for that) for patients and advocates to engage in a little data dredging to demand better treatment, but global pharma companies do it too – with devastating consequences for their balance sheets.
The most egregious example, which DrugBaron has highlighted before, is the post hoc analysis of the solanezumab dataset collected by Eli Lilly. The Phase 3 trials of this anti-amyloid antibody failed by the only definition that matters: the pre-defined primary end-point showed no statistically significant difference between the treated and untreated patients.
There are (at least) two possible reasons for that: either solanezumab does nothing useful at all in Alzheimer’s Disease, or Lilly chose the wrong patient population or end-point.
Convinced (by the strength of the pre-existing amyloid hypothesis) that it was the latter, Lilly and others powered up the data dredger. And what did they find? In one of the studies, a sub-group of patients with the mildest symptoms showed an improvement. Even without the knowledge that the clinical definition of the sub-group matched the inclusion criteria for another Phase 3 trial that also failed, drawing this conclusion ‘post hoc’ was very unreliable.
The link between approval and a presumption of reimbursement is still too strong to allow approvals before proof of efficacy has been unambiguously obtained
To be fair, Lilly and the regulators have done the right thing: they have treated the ‘post hoc’ analysis as a hypothesis generator. This is the only permissible use for data dredging: to better formulate the question for next time. And so Lilly have embarked on yet another large and expensive Phase 3 programme that will properly test whether solanezumab works in very mild (or prodromal) Alzheimer’s Disease.
Maybe they will be right (and for all the many patients just beginning down the track to Alzheimer’s, DrugBaron sincerely hopes they are). After all, the data collected so far certainly doesn’t prove that they will fail. BUT that’s not the purpose of early development. The paradigm for drug development is to reduce the risk at each stage that the next (more expensive) step will fail. To make economic sense, the chance of success must be above a certain threshold (that’s determined by the cost of development and the future market size if you are successful). The problem for Lilly (and even more so for their investors) is that they have dramatically over-valued the ‘evidence’ from post hoc analysis. They may not fail, but it’s a blind gamble at very poor odds. If you like gambling against the house, go to Las Vegas not Wall Street.
The recent decision by the FDA not to approve the cystic fibrosis drug Bronchitol™ from Pharmaxis has highlighted the same problem. Unlike the European regulators, the FDA unanimously decided the Phase 3 dataset did not support approval – after all, the improvement in the primary end-point (FEV1) in the whole treated population was not statistically significant.
All kinds of ‘re-analysis’ then ensued to try to support approval: missing data was imputed, sub-groups were analysed, and so forth.
Oli Rayer, one of the most eloquent patient advocates for better treatment of CF, wondered “whether the standard of statistical significance should be lowered in this case on the basis that statistically insignificant improvements in FEV1 could still yield important benefits from a clinical perspective in addition to standard of care”.
He went on to say: “The FDA makes its decisions based on a hypothetical average risk tolerance but there is really no such person as the average patient. It is unfortunate that patients in the US will, for now, be deprived of this important treatment option partly because the FDA committee members felt it would be a bad risk/benefit trade for a hypothetical patient that exists only in their minds.”
But these arguments are based on the, as yet unproven, assumption that Bronchitol™ actually does some good in at least some patients. That is not proven precisely because the a priori analysis of the whole Phase 3 data set did not find a statistically significant effect of the treatment.
Implicit in the advocacy is the assumption that a small, but statistically insignificant, improvement on the average was caused by real improvement in a subset of responders – in other words, a post hoc analysis.
Relying on post hoc analyses is a blind gamble at very poor odds. If you like gambling against the house, go to Las Vegas not Wall Street
In marked contrast to the solanezumab debacle, where DrugBaron suspects the eventual outcome will be failure, here the position is different: it looks quite likely that, if the appropriate trial had been done, Bronchitol™ would have proven effective by the conventional standards of statistical proof.
But does that mean the drug should be approved? Can we blame the regulator for letting down CF patients? In reality, was it not Pharmaxis who let down the CF patients they were trying to help because the trial design and execution turned out to be inadequate? That may be harsh – perhaps no-one could have done a better job developing this drug – but the fact remains they didn’t cross the finishing line (with a positive Phase 3 trial) and that remains, rightly, the standard for approval.
DrugBaron has argued many times that regulators are too powerful, and dominate rather than regulate the drug development pathway. Together with @DShaywitz and @MichelleNMayer, he has argued that regulators should approve any drug that is reasonably safe and reasonably effective (versus placebo) and then leave it to very capable physicians to work with the individual patient in front of them to decide whether the drug offers a desirable risk:benefit trade-off for that individual.
But, and it is a big but, even such “minimal” approval standards should still require a demonstration of acceptable safety and efficacy versus placebo. And that means an a priori analysis of the primary end-point of a Phase 3 trial.
To remove such a standard threatens the entire fabric of our healthcare systems. Approval carries with it some expectation of reimbursement. Patient advocacy groups are quick to criticize when payers (whether insurers or national governments) refuse to fund an approved treatment – most recently, the same CF community that supports approval of Bronchitol™ were quick (and in DrugBaron’s opinion right) to challenge the reluctance of payers to meet the high price Vertex demanded for Kalydeco™ following its approval.
What if Pharmaxis wanted to charge a hundred thousand dollars per patient per year for Bronchitol™? Once approved, would the bandwagon demand we, the taxpayer, meet those costs too? Even though there is no definitive evidence that Bronchitol™, unlike Kalydeco™, actually works? For anyone? Surely not.
The link between approval and a presumption of reimbursement is still too strong to allow approvals before proof of efficacy has been unambiguously and robustly obtained.
If it is indeed true that statistics can be used (or more likely abused) to ‘prove’ almost anything, it is equally true that without proper use of statistics almost nothing can be considered proven
The take-home message, then, is this: the danger in statistics lies entirely in the unwarranted generalization that sees the answer to one, very limited, question applied inappropriately to other more general questions that look superficially similar. Being on guard against inappropriate generalization should not make us throw out statistics altogether though. Without it, we have no tool to ever prove any hypothesis.
Used properly, in well designed trials addressing the right question, with an a priori analysis of a primary end-point (in other words, the kind of Phase 3 trial that’s acceptable to regulators), statistics are no more dangerous than a Catherine Wheel on bonfire night. Indeed, they are our only barrier of defense against old-fashioned quackery – where anecdotal evidence from impressed patients and powerful marketing and advocacy, rather than solid scientific data, are the basis for adopting new medicines.
We must not let a culture of skepticism around statistics, nor a misunderstanding of personalized medicine, nor even eloquent, heart-rending pleading from patients and their supporters, demolish this last bastion of rationality trumping emotion.