The all new, good old-fashioned, solution to the “replication crisis” in science

February 11, 2014 by admin | 7 Comments

As reports highlighting the difficulty of replicating academic studies proliferate, the clamour for better frameworks to ensure the repeatability of published science is becoming deafening.

But is replication the incomparable paragon it is held up to be? Should the scientific community be wasting resources doing the same things over and over again, rather than exploring new avenues?

At first sight, it seems obvious that shoring up important conclusions by independent replication is a good thing. After all, if (for example) costly drug development activities are initiated on the basis of flawed experimental data, then the investment will be wasted.

But there is a problem with replication. When the first two attempts at the same experiment yield different results, who was correct? Without a large number of replicates, the true answer remains unknown. And there is also an alternative approach: weight-of-evidence. By looking at related questions, it’s possible to determine how likely the original experimental result is to be accurate.

Seeking consistency with neighbouring questions, rather than strict replication, has another important advantage too: it tests whether an observation is generalizable rather than simply true. If a finding applies only under the strict conditions of the original experiment, it is very unlikely to be a useful conclusion at all.

The distinction between verifying a conclusion and replicating an experiment may be subtle, but to DrugBaron it is also very important.

It is important because verification is relatively straightforward, but replication is often extremely difficult. For a start, no matter how carefully a complex scientific experiment is described, the subtleties of the method are likely to be missed – and in most cases the methods sections of published papers are anything but a careful description. Even as journals moved on-line, committing space to a full description of the methods has remained a hurdle for cultural, as much as pragmatic, reasons.

But even armed with a perfect description of the experiment, other factors hamper true replication, at least in the life sciences. Experiments often depend on particular organic reagents (proteins, antibodies, cells, whole animals or even people). And different supplies of these complex reagents are rarely truly identical. If your cells are different from mine, is it any wonder that your result differs from mine?

Then there is the small matter of skill. The individuals publishing a study have very often invested years or even decades refining the skills needed to execute the experiment with the care and attention to detail required to yield useful insights. In many cases, there are no other groups equipped with the same experience and ability. Finding someone capable of replicating the original experiment may therefore be a challenge, and a failure to find the same results in a lab with fewer of the relevant skills risks generating “false negative” concern.

Even if the replication experiment is apparently straightforward, and it yields a different answer to the original publication, how do you arbitrate between the conclusions? The inference seems to be that the second result is somehow more robust than the first (perhaps because the replication lab has no vested interest in one outcome over the other), but equally the second lab may lack the skill and experience of the first. The only solution is to embark on a sequence of replications until the number of outcomes favouring one side or the other becomes overwhelming (but even that may lead to a flawed conclusion if the skill to execute the experiment properly is really rare).

Lastly, replication (particularly more than once) consumes valuable resources that could be used to learn new things. The human tendency to sheep-like behavior already limits the productivity of the global research endeavor (because scientists are far more likely to probe a small variation of what was previously known, rather than exploring brave new worlds). Encouraging even more resources to be consumed doing ostensibly the same thing over and over risks being counter-productive.

But if replication is not the answer, surely blindly accepting every published result as accurate is not the way forward either?

Absolutely. Accepting what is published as accurate is a sure-fire route to disappointment and disillusionment with science.

Fortunately, there is a third way. A way that, more often than not, delivers confidence without performing a single additional experiment. A way that delivers more value than simple replication, with fewer resources. A way that has been available for decades, requiring no new initiatives or frameworks or the like.

The optimal solution is to adopt a “weight-of-evidence” approach. If a study produces conclusion A, then instead of trying to replicate the study to bolster confidence in conclusion A, the alternative is to ask “what else must be true if – and only if – A is true?”. This leads to a series of predictions (B, C and D) that each apparently depend on A being true.

One attraction of this approach is that testing these predictions often requires no additional practical work. A search of the literature often locates a suitable experiment supporting (or disproving) some, if not all, of the predictions. And provided these experiments come from independent labs and do not depend on the same methodology as the original study, they provide as much if not more confidence in the original conclusion as would a full, independent replication.

Sometimes, though, there will be no prior published evidence to support or refute predictions B, C or D. Even in these cases, where additional experimentation is required, it is usually possible to select a prediction that can be tested with simpler, cheaper experiments than those in the original study. Any requirement for the “special skills” of the original researchers is thereby neatly side-stepped (and false-negatives due to failed replications will be reduced as a consequence).

But the biggest advantage by far is that any additional resources expended are generating new information – information that not only increases (or decreases) confidence in the prior conclusion but also materially adds value by demonstrating the degree to which the original conclusion is generalizable.

This is very valuable information, because to be useful, a conclusion must not only be true (the only thing tested by multiple replications) but also generalizable beyond the specific conditions of the original experiment. To underpin a therapeutic, for example, a finding in mice must generalize to humans. So making and testing predictions in, say, rats adds value beyond an independent replication in mice, as would examining effects on different end-points or time-points. Using pharmacological approaches to test validity of genetic findings – and vice versa – is similarly empowering.

None of this is rocket science. Indeed, it has been the time-honoured practice in basic science – and even in applied sciences such as toxicology (where “weight of evidence” for lack of toxicity is the accepted gold standard). The scientific literature grows as a body of evidence, inter-connected, self-referential and ultimately robust even if significant numbers of its individual nodes are flawed. Just how consistent new findings are with the rest of the global research oeuvre is the yardstick by which most scientists judge the individual work-product of their colleagues.

Sure, new findings break paradigms – but usually by refining the generalizability of prior models rather than by completely dismantling them (just as relativity showed that Newtonian mechanics is only an approximation valid at ‘normal’ scales). Things that lie at fundamental odds with swathes of other data are usually proven to be wrong (such as famous papers on cold fusion and homeopathy; indeed this correspondence on gamma rays produced by cold fusion provides a perfect example of marshalling consistency arguments to “disprove” the original experiment).

Best of all, even if an exciting new finding clashes with much of the existing knowledge base, it still makes more sense to seek predictions and extrapolations of the original observation to test than it does to attempt a replication.

Next time you hear the shrill, vocal minority castigating the scientific community for lack of repeatability – remember “thus was it ever so”. And the solution, as it always has been, lies in testing the generalizability of a finding rather than simply setting out to replicate it.

NJBiologist

“Next time you hear the shrill, vocal minority castigating the scientific community for lack of repeatability – remember ‘thus was it ever so’.”
I’m unable to repeat your results. In fact, I can’t remember a pre-1985 published result I couldn’t replicate; results published after 2000 have been the hardest for me to replicate.
- davidgrainger
  
  Your memory must be letting you down then! The two examples I gave (cold fusion and infinite dilution of antibodies) came from the 1980s. And during my PhD studies in the late 80s and early 90s we found much that we couldn’t replicate.
  
  I don’t think there’s much evidence that lack of reproducibility is getting worse with time. But even if it is, then the solutions I suggest remain the same. The way to deal with reproducibility issues is not replication, whether or not it’s more common today than in the past.
eugeneivanov101

Thanks for the interesting post, David. As I see it, a “replication problem” often morphs into a “replication crisis” because of a bold–if not outright sensational–claim being made by the authors of a controversial paper. Like reporting a molecule that can cure all cancers. When something that you claim can’t bring you millions or the Nobel Prize, no one usually wants to repeat your experiments.

That being said, I take exception with articles describing new methodologies. If Church & Gilbert’s sequencing method depended on the specific reagents they used, or Kary Mullis’ PCR protocol needed exclusively the tap water running in his lab, both technologies would never have revolutionized molecular biology. It’s their stellar performance independent of the specific circumstances that makes them great.
Ian Skidmore

If you were buying a used car, wouldn’t you take it for a test drive or get the AA to give it the once over? If I want to invest time and money trying to develop a finding into a new medicine, wouldn’t I want to be able to repeat the finding right at the start? Part of the problem may lie in the original observations. Is n greater than 1? How many negative results were there to balance out the positive ones? If I fail to repeat the finding, can the original investigator repeat it now in his/her lab? Your weight of evidence approach is an elegant one and, I think, very suited to the academic environment, but for drug discovery I think I would start by “kicking the tyres”.
- davidgrainger
  
  Thanks for your comment, Ian.
  
  The issue of how to interpret different outcomes when you attempt a replicate is interesting. Even one failure may be enough to convince you a finding is not robust (certainly in a drug discovery environment). And that’s fine, PROVIDED there was no asymmetric skill or experience that allowed the original observer to achieve their result. In other words, if you can do what they did as well as (or better than) they did it, that’s fine. Do the replication, and if it doesn’t come out the same you should ignore the original data point.
  
  My article was really focussed on the very many times when that isn’t true. The potential replicator doesn’t understand what they are doing as well as the initial team. In these circumstances, the weight of evidence approach provides a much more viable alternative to gain confidence than a strict replication.
  
  It’s also worth pointing out that the two approaches are not binary opposites, but the ends of a spectrum. Repeating a study with a slightly different animal model, or end-point, or species might – to some – be counted as replication, while to others it is clearly a prediction or extension based on the original. In that sense, I completely agree that what you need to do is “kick the tyres” – a good metaphor, since it captures the idea of doing some testing that may not be a strict replication but which tests the robustness of the concept.
Jakub

Great article, but in my head it just raised more questions! How can one balance the need for replicas vs. weighing the evidence? If one is a (good) researcher, one first needs to convince him/herself that the result of an experiment is indeed true. That requires either repeating the experiment ‘n’ times or, as you suggested, doing an experiment ‘B’ showing that the experiment ‘A’ was correct. When one takes a small ‘n’, an experiment ‘B’ might fail, but that won’t necessarily disprove experiment ‘A’; it will only make it less probable. On the other hand, taking a large ‘n’ increases the odds of ‘A’ being true AND limits the need to do an experiment ‘B’. I do agree, however, that the latter is more painful but could also save you some money.
http://www.thedivelab.com Antifragile

“Things that lie at fundamental odds with swathes of other data are usually proven to be wrong”. Generally agree, but watch out here: there are important exceptions, especially if the building block on which subsequent data expands is fundamentally flawed, or even taken for granted and untested, in which case potential gems & potentially profound implications might be overlooked. As a case in point, since forever and a day it’s been a steadfast dogma of physiology & medicine that human core temp & metabolism can’t down-regulate naturally in normal & healthy human beings to below purportedly basal rates (small effects from starvation & circadian rhythms aside); I’m talking about basic vital signs reaching off-the-chart values & on demand. Indeed the main protagonists screamed this loud & clear, with a comparison akin to attempting human flight …without aid. It took an exceedingly simple experiment (breath-holding!), going against an overwhelming tide, to show we were not only capable of it, but could do so to a substantial extent (N=5) & in most cases some three times faster than any living critter, i.e., diving & hibernating critters. My point: you might also care to question the weight of the surrounding evidence, going so far as to question the fundamentals, ’cause that’s where the gems are. I could cite at least two or three other rule-breakers that go completely against a really old & overwhelming tide.