As reports highlighting the difficulty of replicating academic studies proliferate, the clamour for better frameworks to ensure the repeatability of published science is becoming deafening.
But is replication the incomparable paragon it is held up to be? Should the scientific community be wasting resources doing the same things over and over again, rather than exploring new avenues?
At first sight, it seems obvious that shoring up important conclusions by independent replication is a good thing. After all, if (for example) costly drug development activities are initiated on the basis of flawed experimental data, then the investment will be wasted.
But there is a problem with replication. When the first two attempts at the same experiment yield different results, who was correct? Without a large number of replicates, the true answer remains unknown. And there is also an alternative approach: weight-of-evidence. By looking at related questions, it's possible to determine how likely the original experimental result is to be accurate.
Seeking consistency with neighbouring questions, rather than strict replication, has another important advantage too: it tests whether an observation is generalizable rather than simply true. If a finding applies only under the strict conditions of the original experiment, it is very unlikely to be a useful conclusion at all.
The distinction between verifying a conclusion and replicating an experiment may be subtle, but to DrugBaron it is also very important.
It is important because verification is relatively straightforward, but replication is often extremely difficult. For a start, no matter how carefully a complex scientific experiment is described, the subtleties of the method are likely to be missed – and in most cases the methods sections of published papers are anything but a careful description. Even as journals moved on-line, committing space to a full description of the methods has remained a hurdle for cultural, as much as pragmatic, reasons.
But even armed with a perfect description of the experiment, other factors hamper true replication, at least in the life sciences. Experiments often depend on particular organic reagents (proteins, antibodies, cells, whole animals or even people). And different supplies of these complex reagents are rarely truly identical. If your cells are different from mine, is it any wonder that your result differs from mine?
Then there is the small matter of skill. The individuals publishing a study have very often invested years or even decades refining the skills necessary to execute the experiment with the care and attention to detail required to yield useful insights. In many cases, there are no other groups equipped with the same experience and ability. Finding someone capable of replicating the original experiment may therefore be a challenge, and a failure to reproduce the results in a lab with less of the relevant skill risks generating a “false negative” concern.
Even if the replication experiment is apparently straightforward, and it yields a different answer to the original publication, how do you arbitrate between the conclusions? The inference seems to be that the second result is somehow more robust than the first (perhaps because the replication lab has no vested interest in one outcome over the other), but equally the second lab may lack the skill and experience of the first. The only solution is to embark on a sequence of replications until the number of outcomes favouring one side or the other becomes overwhelming (but even that may lead to a flawed conclusion if the skill to execute the experiment properly is really rare).
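To see why that arbitration problem is so awkward, consider a minimal Bayesian sketch (not part of the original argument; the error rates below are invented purely for illustration) of how confidence in a conclusion shifts as replications accumulate. With two equally reliable labs, one positive result and one failed replication simply cancel out, and it takes a run of concordant outcomes before either side looks overwhelming:

```python
# A toy model of arbitrating between conflicting replications.
# Assumption (illustrative only): every lab has the same probability
# p_correct of obtaining the "right" answer, and outcomes are independent.

def posterior_true(prior, results, p_correct):
    """Posterior probability that conclusion A is true, given a list of
    outcomes (True = the experiment supported A, False = it did not)."""
    odds = prior / (1 - prior)
    for supported in results:
        # Each outcome multiplies the odds by its likelihood ratio.
        lr = p_correct / (1 - p_correct)
        odds *= lr if supported else 1 / lr
    return odds / (1 + odds)

# Original positive result plus one failed replication, labs equally reliable:
# the evidence cancels and we are back where we started.
print(posterior_true(0.5, [True, False], p_correct=0.8))                     # ~0.50

# Only after several concordant replications does the answer look overwhelming.
print(posterior_true(0.5, [True, False, True, True, True], p_correct=0.8))   # ~0.98
```

Of course, if the second lab lacks the skill of the first, its p_correct is lower and its failed replication should carry correspondingly less weight, which is exactly the judgement that is so hard to make in practice.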
Lastly, replication (particularly more than once) consumes valuable resources that could be used to learn new things. The human tendency to sheep-like behaviour already limits the productivity of the global research endeavour (because scientists are far more likely to probe a small variation of what was previously known than to explore brave new worlds). Encouraging even more resources to be consumed doing ostensibly the same thing over and over risks being counter-productive.
But if replication is not the answer, surely blindly accepting every published result as accurate is not the way forward either?
Absolutely. Accepting what is published as accurate is a sure-fire way to disappointment and disillusionment with science.
Fortunately, there is a third way. A way that, more often than not, delivers confidence without performing a single additional experiment. A way that delivers more value than simple replication, with fewer resources. A way that has been available for decades, requiring no new initiatives or frameworks or the like.
The optimal solution is to adopt a “weight-of-evidence” approach. If a study produces conclusion A, then instead of trying to replicate the study to bolster confidence in conclusion A, the alternative is to ask “what else must be true if – and only if – A is true?”. This leads to a series of predictions (B, C and D) that each apparently depend on A being true.
One attraction of this approach is that testing these predictions often requires no additional practical work. A search of the literature often locates a suitable experiment supporting (or disproving) some, if not all, of the predictions. And provided these experiments come from independent labs and do not depend on the same methodology as the original study, they provide as much confidence in the original conclusion as a full, independent replication would, if not more.
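As a rough illustration of that claim (a sketch added here, not part of the original argument; the likelihood ratios are invented for the sake of the example), combining several independent, individually modest lines of evidence for predictions B, C and D can move confidence in conclusion A at least as far as a single direct replication would:

```python
# A hedged sketch of the weight-of-evidence idea. Each piece of evidence is
# summarised by a likelihood ratio: P(observation | A true) / P(observation | A false).
# All numbers below are illustrative assumptions, not measured values.

def update(prior, likelihood_ratios):
    """Combine independent lines of evidence by multiplying likelihood ratios."""
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

prior = 0.5  # start agnostic about conclusion A

# A single direct, high-quality replication, assumed to carry a likelihood ratio of ~8.
print(update(prior, [8.0]))             # ~0.89

# Three independent predictions (B, C and D), each only modestly supported in the
# literature (likelihood ratios of 2-3), carry comparable or greater weight,
# while also probing whether A generalizes beyond the original conditions.
print(update(prior, [2.0, 3.0, 2.5]))   # ~0.94
```

The force of the argument rests on the independence of the supporting experiments: evidence that shares the original study's methodology or lab adds much less than these simple products suggest.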
Sometimes, though, there will be no prior published evidence to support or refute predictions B, C or D. Even in these cases, where additional experimentation is required, it is usually possible to select a prediction that can be tested with simpler, cheaper experiments than those in the original study. Any requirement for the “special skills” of the original researchers is thereby neatly side-stepped (and false negatives due to failed replications will be reduced as a consequence).
But the biggest advantage by far is that any additional resources expended are generating new information – information that not only increases (or decreases) confidence in the prior conclusion but also materially adds value by demonstrating the degree to which the original conclusion is generalizable.
This is very valuable information, because to be useful, a conclusion must not only be true (the only thing tested by multiple replications) but also generalizable beyond the specific conditions of the original experiment. To underpin a therapeutic, for example, a finding in mice must generalize to humans. So making and testing predictions in, say, rats adds value beyond an independent replication in mice, as would examining effects on different end-points or time-points. Using pharmacological approaches to test the validity of genetic findings – and vice versa – is similarly empowering.
None of this is rocket science. Indeed, it has been the time-honoured practice in basic science – and even in applied sciences such as toxicology (where “weight of evidence” for lack of toxicity is the accepted gold standard). The scientific literature grows as a body of evidence, inter-connected, self-referential and ultimately robust even if significant numbers of its individual nodes are flawed. Just how consistent new findings are with the rest of the global research oeuvre is the yardstick by which most scientists judge the individual work-product of their colleagues.
Sure, new findings break paradigms – but usually by refining the generalizability of prior models rather than by completely dismantling them (just as relativity showed that Newtonian mechanics is only an approximation valid at ‘normal’ scales). Things that lie at fundamental odds with swathes of other data are usually proven to be wrong (such as famous papers on cold fusion and homeopathy; indeed this correspondence on gamma rays produced by cold fusion provides a perfect example of marshalling consistency arguments to “disprove” the original experiment).
Best of all, even if an exciting new finding clashes with much of the existing knowledge base, it still makes more sense to seek predictions and extrapolations of the original observation to test than it does to attempt a replication.
Next time you hear the shrill, vocal minority castigating the scientific community for lack of repeatability – remember “thus was it ever so”. And the solution, as it always has been, lies in testing the generalizability of a finding rather than simply setting out to replicate it.