DrugBaron woke up this morning to the latest chapter in the “blood test for Alzheimer’s Disease” soap opera, with this article from the BBC trumpeting the latest study to be published as a “major step forward”.
The study behind the headlines was published (unusually for a “major breakthrough”) in the journal Alzheimer’s & Dementia, and described a study based on the measurement of 26 candidate protein biomarkers in blood samples from 1,148 individuals, 476 with Alzheimer’s Disease, 220 with mild cognitive impairment (MCI), and 452 controls of similar age but no dementia.
The authors of the study, and their industrial collaborators at Proteome Sciences plc, were not backward in promoting this work as a significant advance: Dr Ian Pike, Chief Operating Officer at Proteome Sciences, declared “Having a protein test is really a major step forwards. [While it] will take several years and need many more patients before we can be certain these tests are suitable for routine clinical use, that process can start fairly quickly now.”
Even independent voices were quick to praise the new research: Eric Karran, Director of Research at Alzheimer’s Research UK, described the study as a “technical tour de force”. Really?
This is, after all, not the first 2014 paper to make such a claim: in March a paper in Nature Medicine made almost identical claims, and the BBC article reporting that study even used many of the same stock images! Even the headline claims of the two studies were similar (90% accuracy for the Nature Medicine paper versus 87% accuracy for the new study). And both used similar methodology: multivariate signatures, although the earlier study focused on metabolic biomarkers and the new study on proteins.
So did the new study justify the hype any more than the previous attempts to solve this important problem? DrugBaron reviewed the primary publication with interest.
Let’s start with the positive news: the analytical components of the study seem to be largely in order. The multiplexed ELISA assays employed are not particularly well validated, but in our experience they yield measurements that correlate with robust uniplex assays (although often only in terms of rank order of levels, as opposed to absolute levels). There is no reason to assume, therefore, that the table of 26 measures in 1,148 blood samples is not a reasonable snapshot of the relative levels of this small sample of blood proteins in those people.
The clinical definition of the subjects is also extensive and, as far as one can tell, robust. It’s worth noting that only a minority had either their rate of cognitive decline determined (n=342) or a brain MRI scan (n=476), so the ‘headline’ study size of over 1,000 only really applies to the crude classifications on the basis of medical history.
It’s also worth noting that the study cohort was pooled from three separate studies. While, generally, that is unlikely to create a spurious positive finding, it does introduce a lot of noise into the measurements (we know that some, at least, of these analytes are very sensitive to the precise methodology used to prepare the blood samples), and so decreases the chance of finding anything useful. However, an important control is not reported: the fraction of subjects with MCI or AD in each of the three sub-cohorts is not described. This can create a massive artifact, in the same way that including subjects of different ancestries can undermine a GWAS study. Imagine that more of the subjects from ANIM had AD than from KHP-DCR (two of the sub-cohorts used); if differences in blood sampling and sample handling in ANIM (or KHP-DCR) caused a difference in the level of an analyte, that analyte would now (artifactually) be found associated with AD.
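To make the point concrete, here is a minimal simulation (entirely invented data, not from the paper; the sub-cohort sizes, the AD proportions and the one-standard-deviation handling shift are all assumptions chosen for illustration) showing how unequal case proportions across pooled sub-cohorts can turn a purely technical batch effect into a spurious “disease association”:

```python
# Minimal simulation (invented data): a batch shift plus unequal AD proportions
# across pooled sub-cohorts creates a spurious "disease association".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_a = n_b = 400                       # hypothetical sub-cohort sizes
ad_a = rng.random(n_a) < 0.7          # sub-cohort A: 70% AD (assumed)
ad_b = rng.random(n_b) < 0.3          # sub-cohort B: 30% AD (assumed)

# The analyte has NO true relationship with AD, but sample handling in
# sub-cohort A shifts measured levels up by one standard deviation.
analyte_a = rng.normal(1.0, 1.0, n_a)
analyte_b = rng.normal(0.0, 1.0, n_b)

analyte = np.concatenate([analyte_a, analyte_b])
ad = np.concatenate([ad_a, ad_b])

# Pooled analysis "finds" a highly significant association (pure artifact)
t, p = stats.ttest_ind(analyte[ad], analyte[~ad])
print(f"pooled p = {p:.2e}")          # typically p << 0.001

# Stratifying by sub-cohort removes it
for name, a, d in [("A", analyte_a, ad_a), ("B", analyte_b, ad_b)]:
    t, p = stats.ttest_ind(a[d], a[~d])
    print(f"sub-cohort {name} p = {p:.2f}")
```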
This subtle problem, however, is the least of the concerns.
It is with the statistical analysis of the dataset that the real problems begin. The impressive-sounding statistical terminology belies some serious flaws.
Let’s begin with the simplest possible analysis: univariate (that is, one variable at a time) comparison of the markers between the three diagnostic groups. Only two of the 26 biomarkers were associated with dementia, and one of those (apoE) has been known as a key associate of Alzheimer’s Disease since the early 1990s.
The authors did not attempt to determine the extent to which the new information in their apoE and CFH measures added to a predictive model of dementia status based only on APOE genotype and age.
Instead, they tortured the data.
Since “many” of the measured biomarkers were associated with APOE genotype (a very strong associate of dementia status in this cohort, as in most others ever studied), they corrected the biomarker levels for APOE genotype and then performed partial correlations between the corrected biomarkers and a range of measures of brain atrophy. With lots of biomarkers and lots of MRI measures, not surprisingly, there were a lot of “significant” correlations (there are 25 in Table 2). They tested 364 correlations, so on average you would expect around 18 to be significant at the p<0.05 level by chance alone. Consistent with that, they mention, briefly, that only two of the correlations survived correction for multiple testing. One of those was apoE again. It’s quite unclear what value Table 2 serves other than to confuse readers unfamiliar with the perils of multiple testing.
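The arithmetic behind that expectation is worth spelling out; a back-of-envelope sketch, using only the numbers quoted above:

```python
# Expected false positives from 364 uncorrected tests at p < 0.05
n_tests = 364
alpha = 0.05
print(n_tests * alpha)    # ~18 "significant" correlations expected by chance alone

# The Bonferroni-corrected threshold for the same family of tests
print(alpha / n_tests)    # ~1.4e-4; very few nominal p < 0.05 hits survive this
```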
At this point, they should have concluded that there is little extra information in their dataset beyond APOE genotype (and maybe apoE levels).
Undeterred, however, they used multiple linear regression on what is really quite a small dataset and “identified” a signature associated with brain atrophy. The problem with this methodology is that there is no external prediction step to assess the generalizability of the model. That variation in six markers “explained” (rather than “predicted”, as they claim) 19.5% of the variability in hippocampal volume (their selected measure of brain atrophy) is true within the 476 people for whom they had MRI data, but it is likely to be a chance finding unique to this cohort.
The associations between the biomarkers and rate of cognitive decline suffered from the same flaws (the univariate correlations were not corrected for multiple testing, and the general linear model was not tested for generalizability).
The solution is to use multivariate statistics with a training set and a hold-out set, something they did for their prediction of conversion from MCI to AD, but inexplicably failed to do for the general linear models associating biomarkers with brain atrophy or cognitive decline. Without the external prediction test of generalizability, these conclusions carry little or no value. One can only assume the generalizability tests were negative.
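For readers unfamiliar with why the in-sample figure means so little, here is a hedged sketch on simulated data (the subject and marker counts are borrowed from the study purely for scale; everything else is invented): regressing a pure-noise outcome on pure-noise “biomarkers” yields a respectable in-sample R², while cross-validation, the crudest form of external prediction test, shows there is nothing to generalise. Any selection of a “best” subset of markers would inflate the in-sample figure further.

```python
# Sketch on simulated data: in-sample R^2 versus cross-validated R^2 when the
# "biomarkers" carry no real information about the outcome.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_subjects, n_markers = 476, 26                  # sizes borrowed from the study
X = rng.normal(size=(n_subjects, n_markers))     # biomarkers: pure noise
y = rng.normal(size=n_subjects)                  # "hippocampal volume": unrelated noise

model = LinearRegression().fit(X, y)
print(f"in-sample R^2:       {model.score(X, y):.3f}")    # typically ~0.05 even from noise

cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(f"cross-validated R^2: {cv_r2.mean():.3f}")         # typically at or below zero
```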
Then we come to the pièce de résistance: comparing the subset of 51 MCI patients who went on to develop AD within the next year or so with the 169 MCI patients who did not. This time, they correctly divided the group into a training set, used to build the model, and a test (or hold-out) set, used to assess the generalizability of the resulting model.
Unfortunately, the significance of the predictions made in this test set is not reported, and because the test set is now so small (with only 12 or 13 MCI converters to identify) even an ostensibly quite good prediction rate need not be significant. It’s unclear why the p values (and indeed confidence intervals) associated with the AUC of the Receiver Operating Characteristic curves are not given, but one is once again left assuming it’s because they are not significant (particularly after correction for multiple testing, since they assembled a dozen different ROC curves, even before they started playing with the cut-offs).
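To see how little an AUC estimated from a dozen converters constrains anything, here is a rough sketch on invented data (the 12-versus-42 split and the score distributions are assumptions for illustration, not numbers from the paper; only the “12 or 13 converters” figure comes from the text above):

```python
# Rough sketch (invented data): the uncertainty on an AUC estimated from a
# hold-out set containing only ~12 converters.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

y = np.array([1] * 12 + [0] * 42)                      # converters / non-converters (assumed split)
scores = np.concatenate([rng.normal(1.0, 1.0, 12),     # converters score higher on average
                         rng.normal(0.0, 1.0, 42)])
print(f"observed AUC: {roc_auc_score(y, scores):.2f}")

# Bootstrap the hold-out set to get a confidence interval on that AUC
aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y), len(y))
    if y[idx].min() == y[idx].max():                   # resample must contain both classes
        continue
    aucs.append(roc_auc_score(y[idx], scores[idx]))
lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"bootstrap 95% CI: {lo:.2f}-{hi:.2f}")          # typically very wide with so few converters
```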
Of even greater concern, the prediction of conversion to dementia based on APOE genotype alone is indistinguishable from that of the model using the protein biomarkers. There is no evidence whatsoever of additional value in the protein biomarkers compared with the well-established predictive marker of APOE genotype. Even if there were a visible difference in the performance of the models, such a small test set has no statistical power to establish whether the protein biomarker data really contain additional predictive information.
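The missing comparison is a standard one: fit a conversion model on APOE genotype alone, fit a second with the protein markers added, and test whether the additions improve the fit. A hedged sketch on simulated data follows (the 220 MCI subjects, the carrier frequency and the six markers are assumptions for illustration; in this simulation conversion is driven by APOE alone, mirroring the reading of the data above):

```python
# Sketch on simulated data: do added protein markers improve a conversion model
# beyond APOE? Tested with a likelihood-ratio test on nested logistic regressions.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
n = 220                                     # MCI subjects, as in the study
apoe_e4 = rng.binomial(1, 0.4, n)           # carrier status (assumed frequency)
proteins = rng.normal(size=(n, 6))          # six candidate markers (pure noise here)

# In this simulation, conversion risk depends on APOE alone
p_convert = 1 / (1 + np.exp(-(-1.5 + 1.2 * apoe_e4)))
converted = rng.binomial(1, p_convert)

X_base = sm.add_constant(apoe_e4.astype(float))
X_full = sm.add_constant(np.column_stack([apoe_e4, proteins]).astype(float))

base = sm.Logit(converted, X_base).fit(disp=0)
full = sm.Logit(converted, X_full).fit(disp=0)

# Likelihood-ratio test for the added value of the six protein markers
lr = 2 * (full.llf - base.llf)
p = stats.chi2.sf(lr, df=6)
print(f"LR = {lr:.2f}, p = {p:.2f}")        # non-significant: the proteins add nothing here
```

The same question can be asked of a hold-out set by comparing the two models’ ROC curves, but with so few converters neither approach would be well powered.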
“There are all sorts of headlines today about how there’s going to be a simple blood test for Alzheimer’s soon. Don’t believe them” – Derek Lowe
During the day, there has been some criticism of the hype surrounding this study, based on the limited clinical value of a test with 87% accuracy. In reality, even 90% accuracy is insufficient for population screening, but a test of that accuracy could be useful to select patients for admission to clinical trials or to monitor response to treatment. It would certainly be an advance over what is currently available, and for that reason alone would be welcome progress.
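The reason even ~90% accuracy fails for population screening is the base rate: at low prevalence most positives are false positives, whereas in an enriched trial-entry population the same test performs respectably. A back-of-envelope sketch (the prevalence figures are assumptions for illustration, and the reported ~87% accuracy is treated, crudely, as both sensitivity and specificity):

```python
# Positive predictive value of a test at different prevalences (illustrative only)
def ppv(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Population screening at an assumed 1% prevalence: most positives are false
print(f"screening, 1% prevalence:    PPV = {ppv(0.87, 0.87, 0.01):.2f}")   # ~0.06
# Enriched trial-entry population at an assumed 30% prevalence
print(f"trial entry, 30% prevalence: PPV = {ppv(0.87, 0.87, 0.30):.2f}")   # ~0.74
```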
But sadly, this study does not deliver a test with such properties. It does not deliver any evidence that the biomarkers they examined contribute any independent diagnostic or prognostic information beyond the simple APOE genotype test that has been known for years. A more accurate title for the paper would be: “Twenty-six of the best candidate protein biomarkers have no ability to predict conversion to dementia from prodromal disease beyond the information contained in the APOE genotype test”. That paper, however, may not have been accepted for publication even in Alzheimer’s & Dementia.
Far from being a “technical tour de force”, this study joins a long line of illustrious predecessors that have attempted to create a multivariate biomarker signature to predict disease, including the study published in March in Nature Medicine, which made the same claims while suffering from many similar methodological flaws.
Such is the complexity of multivariate statistics that it’s unsurprising such studies defeat the ability of journalists to critique them. The BBC, and other news organizations, can hardly be castigated for running “breakthrough” stories. The blame lies with a peer-review system that allows such blatant flaws to go uncorrected, and with “experts” such as Eric Karran of Alzheimer’s Research UK, who should look harder before offering an independent opinion on such research, but perhaps most of all with the authors, such as Ian Pike, who, perhaps hopeful of commercial gain, made claims that are completely unsupported by their data.