Drug Baron

The Primacy of Statistics: In defense of the pivotal Phase 3 Clinical Trial


Statistics have always had a bad press.  Ever since Disraeli’s supposed utterance “lies, damn lies and statistics” the public have treated any claim based on statistical analysis with appropriate caution.  But if it is indeed true that statistics can be used (or more likely abused) to ‘prove’ almost anything, it is equally true that without proper use of statistics almost nothing can be considered proven.

When it comes to drugs, that means comparing the response in the treated group with a placebo group to determine whether a given treatment has had any effect at all in that particular patient population.

With the rise of personalized medicine (the sound concept that what works for one patient may not work for the next), this well-accepted paradigm has started to face challenges – not least from patients brought up with a healthy dose of skepticism for statistics.  Looking at the trial data, there always seems to be a proportion of patients who responded very well to the drug – even if, on average, the improvement was not statistically significant.

Inebriated by the principle of personalization, patients and advocacy groups are quick to demand access to a drug that appears to provide benefit for a subset of the patients in the trial.  But that conveniently forgets the purpose of statistics – the aggregation of data from lots of individuals so as to eliminate the operation of chance.

In an untreated population, some people get better and some people get worse over a given time interval.  Statistics determines how likely it is that people receiving treatment did, on average, better than those who were untreated.

Assuming that anyone receiving a drug who is better at the end of the trial than they were before (without reference to a control population) has benefited from the treatment is very dangerous (if entirely understandable for the patient themselves and those who care about them).  Unless the difference is statistically significant, the supposed benefit is as likely to be due to chance as to the costly drug.

Statistics are our only barrier of defense against old-fashioned quackery – where anecdotal evidence, powerful marketing and advocacy, rather than solid scientific data, are the basis for adopting new medicines.

Of course, clinical trials can be (and often are) badly designed – looking at the wrong patient population or the wrong end-point.  That is the fault of the drug developer, not infrequently aided and abetted in their folly by the regulators.  Making such mistakes lets down the very people we are trying to help (as well as the investors in those companies). But we must not try to make up for those errors by undermining the primacy of the statistically significant clinical trial as the only acceptable metric for efficacy.

Dredging through the data of a failed trial to find subsets of ‘responders’ is becoming a favorite pastime.  Drug companies are doing it in Phase 2 data, and progressing drugs into vast and expensive Phase 3 trials that are doomed to fail. Patient advocacy groups are doing it to Phase 3 data to plead for the approval (and sale) of drugs that more than likely do not work at all (often with the full support of the owner of the drug who would love to be able to sell their product).

Statisticians of the world must unite, rise up and find a voice to halt this madness.

If you are treated with an unknown white powder, and you get better, we can never know whether that would have happened anyway, or whether the powder is a wonder drug.  Any given individual can only ever follow one timeline, one set of choices, through life.  We can never know what would have happened if a different path had been taken.

Getting round this question is the purpose of statistics.  By aggregating data from lots of individuals, identical (on average) in every respect save for the intervention, it is possible to say whether the treated group did better (on average) than those who were untreated.

It’s never possible to be certain – no matter how big the study, it’s always possible in principle that the patients selected for treatment would have done better anyway than those who were untreated.  But as the apparent benefit gets larger, the probability that it arose by chance falls dramatically.  Statistics, then, simply puts a number on this probability.
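The point can be illustrated with a small simulation (a sketch in plain Python; the sample sizes, thresholds and trial counts are arbitrary illustrative choices, not drawn from any real study):

```python
import random
import statistics

random.seed(1)

def chance_of_difference(threshold, n=50, trials=2000):
    """Fraction of simulated null trials (the drug does nothing) in which
    the treated arm beats placebo by at least `threshold` purely by chance."""
    hits = 0
    for _ in range(trials):
        treated = [random.gauss(0, 1) for _ in range(n)]
        placebo = [random.gauss(0, 1) for _ in range(n)]
        if statistics.mean(treated) - statistics.mean(placebo) >= threshold:
            hits += 1
    return hits / trials

# The bigger the apparent benefit, the rarer it is under chance alone.
for threshold in (0.1, 0.3, 0.5):
    print(threshold, chance_of_difference(threshold))
```

With 50 patients per arm, chance alone produces a small apparent benefit quite often, but a large one only rarely – a p-value is exactly this kind of probability, computed analytically rather than by simulation.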

This, the internal mechanics of statistics, is a mathematical certainty.  There is no room for doubt.  Statistics properly quantifies the chance that two groups were different.

If statistics is so powerful, how come it has earned such a grim reputation?

The issue is not in the answer but in the question.  Statistics often provides the perfect answer to an injudicious question.  Introduce very subtle errors in the way the question is asked – in the selection of the groups, or in the measurement of the outcome – and you find you have answered a question that was marginally, but critically, different from the one you intended.  Worse still, people love to generalize: most statistical studies answer a very specific question – for that population, the thing you measured most likely changed in response to treatment.  But for any other population, or any other measurement? Statistics is silent – without another trial.

This generalization is wide open to abuse.  Politicians in particular love to take statistics that ‘prove’ one question, but use them – stretching generalization beyond breaking point – to answer a superficially similar question to which they actually have little applicability.  This confidence trick, oft repeated, has led to the epidemic of doubt surrounding anything statistical.

On top of this landscape of general mistrust of statistics, another force has begun to challenge the value of a statistically significant clinical trial result: personalized medicine.  Patients are not numerical averages, they are individuals.  And individuals respond differently to medicines.

The temptation, armed with this knowledge (which is indisputably true), is to trawl through the data of any clinical study and search for the individuals who responded.

But for any dataset with a reasonable amount of random noise in it (for that, read almost every clinical trial dataset), it is possible to find a subset of the treated group who, once identified and analysed separately, clearly showed a much better outcome than the placebo group.  You can even (erroneously) apply a new statistical test comparing this group to the placebo and find the result massively significant.

Armed with such a finding, it’s easy to believe the drug works, and demand its approval.

But such a conclusion is completely unreliable.  Why? Because you only selected the responders AFTER you saw the data.  Statisticians use the Latin “post hoc” (meaning ‘after the fact’) to describe such analyses.  A more accessible term for the same trick is “data dredging” which conjures up a mostly accurate image of desperately sifting through sludge to locate a lost wedding ring.
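A tiny simulation makes the danger concrete (a sketch in plain Python; all numbers are illustrative, not taken from any real trial):

```python
import random
import statistics

random.seed(0)

# A trial in which the drug does NOTHING: both arms are pure noise
# (change in symptom score, positive = improvement).
treated = [random.gauss(0, 1) for _ in range(200)]
placebo = [random.gauss(0, 1) for _ in range(200)]

# The honest, pre-specified (a priori) comparison: no real difference.
print("treated mean:", statistics.mean(treated))
print("placebo mean:", statistics.mean(placebo))

# Post hoc dredging: keep only the treated patients who 'responded'...
responders = [x for x in treated if x > 0]

# ...and compare them with the whole placebo arm.  Their mean improvement
# sits near 0.8 (the mean of a half-normal distribution) even though the
# drug has, by construction, no effect whatsoever.
print("'responder' mean:", statistics.mean(responders))
```

Any fresh statistical test run on this selected subset against the placebo arm will come out wildly ‘significant’, because the selection rule, not the drug, created the difference.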

“There are lies, damn lies and post-hoc data analysis”

Statistics absolutely depends for its power on doing your selection of who to include in the analysis before you collect the data – or “a priori” in statistico-Latin.

If you are “allowed” to pick the best responders in the treatment group, why shouldn’t you also be “allowed” to remove the people who got sick in the placebo group?  That would make any drug effect seem smaller.  But, you argue, some of the people in the treated group also would have got sick, so you can’t eliminate them from one group and not the other!  Precisely.  Nor can you eliminate the poor responders from the treatment group and not the placebo group.  But since the definition of ‘responder’ is based only on the response to the drug, it’s impossible to know who they would be in the placebo group.

This kind of “post hoc” data dredging is now so prevalent, it has almost become the norm.  And it’s not just the statistically unsophisticated who are guilty.  It may be excusable (though no less factually misleading) for patients and advocates to engage in a little data dredging to demand better treatment, but global pharma companies do it too – with devastating consequences for their balance sheets.

The most egregious example, which DrugBaron has highlighted before, is the post hoc analysis of the solanezumab dataset collected by Eli Lilly.  The Phase 3 trials of this anti-amyloid antibody failed by the only definition that matters: the pre-defined primary end-point showed no statistically significant difference between the treated and untreated patients.

There are (at least) two possible reasons for that: solanezumab does nothing useful at all in Alzheimer’s Disease, or Lilly chose the wrong patient population or end-point.

Convinced (by the strength of the pre-existing amyloid hypothesis) that it was the latter, Lilly and others powered up the data dredger.  And what did they find?  In one of the studies, a sub-group of patients with the mildest symptoms showed an improvement.  Even without the knowledge that the clinical definition of the sub-group matched the inclusion criteria for another Phase 3 trial that also failed, drawing this conclusion ‘post hoc’ was very unreliable.

The link between approval and a presumption of reimbursement is still too strong to allow approvals before proof of efficacy has been unambiguously obtained

To be fair, Lilly and the regulators have done the right thing: they have treated the ‘post hoc’ analysis as a hypothesis generator.  This is the only permissible use for data dredging: to better formulate the question for next time.  And so Lilly have embarked on yet another large and expensive Phase 3 programme that will properly test whether solanezumab works in very mild (or prodromal) Alzheimer’s Disease.

Maybe they will be right (and for all the many patients just beginning down the track to Alzheimer’s, DrugBaron sincerely hopes they are).  After all, the data collected so far certainly doesn’t prove that they will fail.  BUT that’s not the purpose of early development.  The paradigm for drug development is to reduce the risk at each stage that the next (more expensive) step will fail.  To make economic sense, the chance of success must be above a certain threshold (that’s determined by the cost of development and the future market size if you are successful).  The problem for Lilly (and even more so for their investors) is that they have dramatically over-valued the ‘evidence’ from post hoc analysis.  They may not fail, but it’s a blind gamble at very poor odds.  If you like gambling against the house, go to Las Vegas not Wall Street.

The recent decision by the FDA not to approve the cystic fibrosis drug Bronchitol™ from Pharmaxis has highlighted the same problem.  Unlike the European regulators, the FDA unanimously decided the Phase 3 dataset did not support approval – after all, the improvement in the primary end-point (FEV1) in the whole treated population was not statistically significant.

All kinds of ‘re-analysis’ then ensued to try to support approval.  Missing data were imputed, sub-groups were analysed, and so forth.

Oli Rayner, one of the most eloquent patient advocates for better treatment of CF, wondered “whether the standard of statistical significance should be lowered in this case on the basis that statistically insignificant improvements in FEV1 could still yield important benefits from a clinical perspective in addition to standard of care”.

He went on to say: “The FDA makes its decisions based on a hypothetical average risk tolerance but there is really no such person as the average patient. It is unfortunate that patients in the US will, for now, be deprived of this important treatment option partly because the FDA committee members felt it would be a bad risk/benefit trade for a hypothetical patient that exists only in their minds.”

But these arguments are based on the, as yet unproven, assumption that Bronchitol™ actually does some good in at least some patients.  That is not proven precisely because the a priori analysis of the whole Phase 3 data set did not find a statistically significant effect of the treatment.

Implicit in the advocacy is the assumption that a small, but statistically insignificant, improvement on the average was caused by real improvement in a subset of responders – in other words, a post hoc analysis.

Relying on post hoc analyses is a blind gamble at very poor odds.  If you like gambling against the house, go to Las Vegas not Wall Street

In marked contrast to the solanezumab debacle, where DrugBaron suspects the eventual outcome will be failure, here the position is different: it looks quite likely that, if the appropriate trial had been done, Bronchitol™ would have proven effective by the conventional standards of statistical proof.

But does that mean the drug should be approved? Can we blame the regulator for letting down CF patients?  In reality, was it not Pharmaxis who let down the CF patients they were trying to help because the trial design and execution turned out to be inadequate?  That may be harsh – perhaps no-one could have done a better job developing this drug – but the fact remains they didn’t cross the finishing line (with a positive Phase 3 trial) and that remains, rightly, the standard for approval.

DrugBaron has argued many times that regulators are too powerful, and dominate rather than regulate the drug development pathway.  Together with @DShaywitz and @MichelleNMayer, he has argued that regulators should approve any drug that is reasonably safe and reasonably effective (versus placebo) and then leave it to very capable physicians to work with the individual patient in front of them to decide whether the drug offers a desirable risk:benefit trade-off for that individual.

But, and it is a big but, even such “minimal” approval standards should still require a demonstration of acceptable safety and efficacy versus placebo.  And that means an a priori analysis of a primary end-point Phase 3 trial.

To remove such a standard threatens the entire fabric of our healthcare systems.  Approval carries with it some expectation of reimbursement.  Patient advocacy groups are quick to criticize when payers (whether insurers or national governments) refuse to fund an approved treatment – most recently, the same CF community that supports approval of Bronchitol™ was quick (and in DrugBaron’s opinion right) to challenge the reluctance of payers to meet the high price Vertex demanded for Kalydeco™ following its approval.

What if Pharmaxis wanted to charge a hundred thousand dollars per patient per year for Bronchitol™? Once approved, would the bandwagon demand we, the taxpayer, meet those costs too?  Even though there is no definitive evidence that Bronchitol™, unlike Kalydeco™, actually works? For anyone? Surely not.

The link between approval and a presumption of reimbursement is still too strong to allow approvals before proof of efficacy has been unambiguously and robustly obtained.

If it is indeed true that statistics can be used (or more likely abused) to ‘prove’ almost anything, it is equally true that without proper use of statistics almost nothing can be considered proven

The take-home message, then, is this: the danger in statistics lies entirely in the unwarranted generalization that sees the answer to one, very limited, question applied inappropriately to other more general questions that look superficially similar.  Being on guard against inappropriate generalization should not make us throw out statistics altogether though.  Without it, we have no tool to ever prove any hypothesis.

Used properly, in well designed trials addressing the right question, with an a priori analysis of a primary end-point (in other words, the kind of Phase 3 trial that’s acceptable to regulators), statistics are no more dangerous than a Catherine Wheel on bonfire night.  Indeed, they are our only barrier of defense against old-fashioned quackery – where anecdotal evidence from impressed patients and powerful marketing and advocacy, rather than solid scientific data, are the basis for adopting new medicines.

We must not let a culture of skepticism around statistics, nor a misunderstanding of personal medicine, nor even eloquent, heart-rending pleading from patients and their supporters, demolish this last bastion of rationality trumping emotion.

  • Oli Rayner


    You may be surprised to hear that I completely agree with what you say here. As a patient and someone with an interest in seeing new effective treatment options for everyone with CF (and other rare diseases actually), I certainly don’t want regulators to approve medicines unless we know they work and that risks can be managed. I watched the FDA meeting live and it is the first time I have had first-hand experience of these deliberations. What struck me was the highly paternalistic tone and, what seemed to me to be, the non-scientific decision-making process, albeit based on scientific evidence. The FDA, the sponsor and additional experts actually discussed the idea of re-framing the test of statistical significance and they also noted the uncertainty in some of the definitions of adverse events e.g. experts disagree on the definition of an “exacerbation”. In my amateur opinion, the thinking got very fuzzy at times. I actually think the FDA were quite right not to approve Bronchitol on the data in front of them – it would have sent a bad signal and it would have concerned me more if they had approved. The Bronchitol case is even more complicated because the current standard of care is nebulised hypertonic saline (HS) but patients in the clinical trials for Bronchitol had to discontinue this on the basis that HS is not FDA approved for the indication. It is approved as a treatment to induce sputum production for samples. Yet it is prescribed off-label by CF physicians as an on-going treatment to help airway clearance and it is standard of care. A more sensible trial would compare Bronchitol to HS.

    Perhaps my blog was clumsily written (although I would point to the word “partly” in the sentence you quote above) and perhaps I was trying too hard to shoehorn the Bronchitol decision into the fabric of a bigger debate. There is a lesson there for me. My real point is more subtle and speculative. Based on the language that was used and the thought processes revealed, I am willing to bet the FDA would not have approved Bronchitol even if the efficacy had been statistically significant. They seemed to be (a) characterising the risks inappropriately since the adverse events were all short-acting and patients could just stop the treatment with no lasting harm; and (b) under-estimating the ability of specialist physicians and patients to manage those risks effectively. I think the specialist care, infrastructure and level of patient expertise/conditioning is highly material and something that should be given more weight. I think this represents an additional firewall in rare diseases like CF that is not present for non-rare conditions and that it would justify a less paternalistic treatment of risks. I would not want to see treatments with proven efficacy (including perhaps an enhanced application of Bronchitol in the future) denied approval because risks are given unrealistic weight. That is my main point but I appreciate it is hypothetical since we will never know what the FDA would have decided if efficacy had been statistically significant.


    • http://twitter.com/sciencescanner David Grainger

      Thanks Oli – my comments were aimed specifically at your blog piece, it was just a convenient example to highlight one of the points I was trying to make.

      Both your piece and mine are trying to make wider points: you rightly highlight the overly paternalistic attitude of regulators (and in particular the FDA), and I think we both believe that drugs require less rather than more regulation (although the regulations that remain probably need to be applied even more rigidly and with greater transparency). Patients suffer (not just in rare diseases) if regulators insist on enforcing an ‘average’ risk tolerance – a point I know we agree on.

      My broader point was simply that the recent focus on personalized medicine (and even ‘personalized regulation’) risks undermining the absolute primacy of statistics. Statistics is by definition a game of averages. Personalized medicine is defined in precisely the opposite way. My point was simply that EVEN FOR PERSONALIZED MEDICINES a statistical demonstration of average efficacy is ABSOLUTELY required – and always will be. We must not let a ‘modern’ view of individualized medicines erode that barrier – or we will be back in the days of snake-oil salesmen! Again, it seems, we broadly agree on that point too!

  • Murali Apparaju

    A very compelling argument indeed – One factor though I think needs to be taken into account is the type of indication for which the drug is being pursued.

    While for the more prevalent indications such as metabolic disorders, cardio-vascular diseases et al, where the sample size is large, statistical rigor is highly relevant & decisive while approving the drug, can it be simultaneously argued that for orphan & other highly specific indications, where the sample is small, a solely statistical model will lose a lot of potential treatment options to statistical bias? – particularly since it’s being increasingly noted that an individual’s genetic make-up (presence or absence of mutations on a specific gene et al in the healthy or diseased tissue) can determine how the patient responds to the drug under evaluation? (case in point: Vemurafenib working for BRAFV600-mutation-positive metastatic melanoma patients)

    • davidgrainger

      Thanks for your comment. This goes right to the nub of the argument.

      You are exactly right that the slavish adherence to statistics will deny people (particularly in small indications) access to medicines that do actually work. In the limit, unless you are an identical twin, you are the only person with your genotype and maybe the drug would work brilliantly for you and for no-one else. With statistics as the gatekeeper, you will never get access to that drug.

      BUT the key point of the piece is to point out that without statistics there is no way to know if that drug really did work for you. There is no control. At present (and maybe always) there is no alternative to a statistical test to be sure that a drug works at all.

      Unless you are happy to approve drugs that MIGHT work, then we have no choice but to accept that, with a statistically significant Phase 3 trial as the “gatekeeper”, we will reject some drugs that actually work but whose efficacy we cannot prove.

      For me, I would rather have the current system – where drugs have to be proven to work – than the one that existed prior to regulators, when snake-oil salesmen could sell anything as long as they could assemble a compelling enough argument to persuade the purchaser. That was a bad model – but allowing drugs through that haven’t passed a statistical test simply because they may work in some people, and there aren’t enough people to do the proper test, is a big step backwards.

      Yet I see that happening more and more, particularly in the orphan drugs space that you plead as a “special case” – which is precisely why I wrote this article!

      • Murali Apparaju

        Thank you David – let me clarify that I totally agree with your line of thought which was comprehensive, but was a little unclear about the outlier drug candidates that may not generate much statistically significant data. Cheers, Murali (when you can, please visit my blog visrasayan.blogspot.in)