Some six years ago I wrote an essay about John Ioannidis, now at Stanford, who stirred up the medical research community with a paper arguing that more than half of all medical research papers could not be trusted because the work described in them could not be replicated. Ioannidis’s original work dates from 2005, but he and others have moved into other areas as well as medicine. The amount of money wasted because of poor research, both by private enterprise and by governments, is enormous.
From time to time since then I have heard murmurings that positive things are happening in medical science and elsewhere, occasionally even in climate science. Last week the National Association of Scholars in the US published a long paper on the whole issue, The Irreproducibility Crisis of Modern Science. It is sobering reading, if you are interested in the ethical structure of scientific endeavour. The NAS is politically conservative in the context of higher education, opposed to political correctness on campuses, as well as to age, sex and diversity quotas, and in favour of Great Books, rigour in argument and good data. It probably doesn’t need saying that these are by and large my own intellectual values too.
The introduction to the paper starts with the case of a postdoc, Oona Lönnstedt at Uppsala University. She and her supervisor, Peter Eklöv, published a paper in Science in June 2016, warning of the dangers of microplastic particles in the ocean. The microplastics, they reported, endangered fish. It turns out that Lönnstedt never performed the research that she and Eklöv reported.
The initial June 2016 article achieved worldwide attention and was heralded as the revelation of a previously unrecognized environmental catastrophe. When doubts about the research integrity began to emerge, Uppsala University investigated and found no evidence of misconduct. Critics kept pressing and the University responded with a second investigation that concluded in April 2017 and found both Lönnstedt and Eklöv guilty of misconduct. The university then appointed a new Board for Investigation of Misconduct in Research. In December 2017 the Board announced its findings: Lönnstedt had intentionally fabricated her data and Eklöv had failed to check that she had actually carried out her research as described. By then the postdoc had become an international environmental celebrity.
Deliberate fraud of this kind, the paper claims, is uncommon, though it may be increasing. At the heart of the problem is a failure both to follow good research design practices and to understand statistics properly (or at all, in some astonishing cases). Why does it happen? Why does so much research fail to replicate? Bad methodology, inadequate constraints on researchers, and a professional scientific culture that creates incentives to produce new results (innovative results, trailblazing results, exciting results) have combined to create the reproducibility crisis.
I have written about these issues before (for example, here), and have no pleasure in doing so again. Many of the examples cited come from the social sciences, which is most embarrassing to me, and I am sure to others of my old tribe. The extensive use of statistics is now almost universal in the social sciences and some of the natural sciences, and today’s researchers are able to employ computer statistical packages that allow the researcher, having assembled the data, to do little more than press a button. But what does the output mean, and does the researcher understand what the package actually does? According to the paper, far too frequently the answers are ‘Don’t know’ and ‘No’. Anyone who reads academic journal articles will see what looks like an obsession with finding low p values (a p value is the probability of seeing data at least as extreme as those observed if the null hypothesis were true, so a low value is read as evidence against the null hypothesis and, by extension, for the researcher’s own hypothesis). In March 2016 the American Statistical Association issued a “Statement on Statistical Significance and p-Values” to address common misconceptions. The Statement’s six enunciated principles included the warning that by itself a p value does not provide a good measure of evidence about a model or a hypothesis. That was drummed into me fifty years ago.
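A little arithmetic shows why a low p value on its own is weak evidence. The sketch below is my own illustration, not something taken from the NAS paper or the ASA Statement: the function name and all three inputs (the share of tested hypotheses that are really true, the power of the study, and the 0.05 threshold) are assumptions chosen only to make the point.

```python
def positive_predictive_value(prior: float, power: float, alpha: float) -> float:
    """Share of 'significant' findings that reflect a real effect.

    prior: assumed fraction of tested hypotheses that are actually true
    power: chance that a real effect produces p < alpha
    alpha: significance threshold (chance that a null effect produces p < alpha)
    """
    true_positives = prior * power
    false_positives = (1.0 - prior) * alpha
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative inputs only; none of these figures come from the NAS paper.
    for prior in (0.5, 0.1, 0.01):
        ppv = positive_predictive_value(prior=prior, power=0.5, alpha=0.05)
        print(f"prior={prior:<5} share of significant results that are real: {ppv:.2f}")
```

Under these assumed figures, if only one hypothesis in ten is true and studies detect a real effect half the time, just over half of the ‘significant’ results (0.05 / 0.095, about 0.53) are real. The p value is the same in every case; what changes is everything the p value does not measure.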
Also emphasised in those pre-desktop-computer days was the importance of deriving a hypothesis from one body of data and testing it on another. It was an absolute no-no to test the hypothesis on the same body of data, for obvious reasons. The NAS paper finds that testing on the same data is nonetheless widespread, and that it leads to other malpractices. Scientists also produce supportive statistical results from recalcitrant data by fiddling with the data itself. Researchers commonly edit their data sets, often by excluding apparently bizarre cases (“outliers”) from their analyses. But in doing this they can skew their results: scientists who systematically exclude data that undermines their hypotheses bias their data to show only what they want to see. Perhaps the BoM, the CSIRO and other climate data-mongers could pause and think harder about what they are doing in homogenisation.
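The ‘obvious reasons’ can be shown with pure noise. The sketch below is a hypothetical illustration of my own (the function names, the 30 cases and the 50 candidate predictors are all assumptions): it picks whichever of many random variables best fits a random outcome, then compares the correlation on that same data with the correlation on a fresh body of data.

```python
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient, written out by hand."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def one_trial(rng, n_cases=30, n_predictors=50):
    """Pick the best-fitting predictor in pure noise, then re-test it on fresh noise."""
    outcome = [rng.gauss(0, 1) for _ in range(n_cases)]
    predictors = [[rng.gauss(0, 1) for _ in range(n_cases)]
                  for _ in range(n_predictors)]
    # "Deriving the hypothesis": keep whichever predictor correlates best.
    best = max(predictors, key=lambda p: abs(pearson_r(p, outcome)))
    same_data_r = abs(pearson_r(best, outcome))
    # A fresh body of data. Nothing here is really related to anything else,
    # so new measurements of the chosen predictor and the outcome are new noise.
    fresh_predictor = [rng.gauss(0, 1) for _ in range(n_cases)]
    fresh_outcome = [rng.gauss(0, 1) for _ in range(n_cases)]
    fresh_data_r = abs(pearson_r(fresh_predictor, fresh_outcome))
    return same_data_r, fresh_data_r


def average_correlations(trials=200, seed=1):
    """Average absolute correlation on the same data versus on fresh data."""
    rng = random.Random(seed)
    results = [one_trial(rng) for _ in range(trials)]
    same = sum(r[0] for r in results) / trials
    fresh = sum(r[1] for r in results) / trials
    return same, fresh


if __name__ == "__main__":
    same, fresh = average_correlations()
    print(f"average |r| on the data that suggested the hypothesis: {same:.2f}")
    print(f"average |r| on a fresh body of data:                  {fresh:.2f}")
```

There is no real relationship anywhere in this simulation, yet the chosen predictor looks respectably correlated on the data that suggested it and falls back towards nothing on new data. The apparent finding comes entirely from selecting and testing on the same cases.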
Researchers can also bias their data by ceasing to collect data at an arbitrary point, perhaps the point when the data that has already been collected finally supports their hypothesis. Conversely, a researcher whose data doesn’t support his hypothesis can decide to keep collecting additional data until it yields a more congenial result. Such practices are all too common. [A] survey of 2,000 psychologists noted … found that 36% of those surveyed “stopped data collection after achieving the desired result”.
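Stopping when the result looks right does more damage than it might seem. The simulation below is a minimal sketch under assumptions of my own, not a reconstruction of any study the paper cites: the true effect is exactly zero, the standard deviation is known to be 1, a simple z-test is used, and the researcher peeks after every batch of 10 cases up to 100. All function names and parameters are hypothetical.

```python
import math
import random


def two_sided_p(sample):
    """Two-sided z-test p value for a mean of 0, assuming a known standard deviation of 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rates(studies=2000, batch=10, max_n=100, alpha=0.05, seed=1):
    """Return (rate with one planned analysis, rate with stop-when-significant)."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(studies):
        # The null hypothesis is true by construction: the mean really is 0.
        data = [rng.gauss(0, 1) for _ in range(max_n)]
        # Honest design: analyse once, at the planned sample size.
        if two_sided_p(data) < alpha:
            fixed_hits += 1
        # Optional stopping: test after every batch, stop at the first "success".
        if any(two_sided_p(data[:n]) < alpha
               for n in range(batch, max_n + 1, batch)):
            peeking_hits += 1
    return fixed_hits / studies, peeking_hits / studies


if __name__ == "__main__":
    fixed, peeking = false_positive_rates()
    print(f"false positives with one planned analysis: {fixed:.3f}")
    print(f"false positives when stopping on success:  {peeking:.3f}")
```

With a single planned analysis the false-positive rate stays near the nominal 5 per cent. With ten chances to stop on a ‘success’ it comes out several times higher, even though every one of these simulated studies is measuring nothing at all.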
Another sort of problem arises when scientists try to combine, or “harmonize,” multiple pre-existing data sets and models in their research—while failing to account sufficiently for how such harmonization magnifies the uncertainty of their conclusions. Claudia Tebaldi and Reto Knutti concluded in 2007 that the entire field of probabilistic climate projection, which often relies on combining multiple climate models, had no verifiable relation to the actual climate, and thus no predictive value. Absent “new knowledge about the [climate] processes and a substantial increase in computational resources,” adding new climate models won’t help: “our uncertainty should not continue to decrease when the number of models increases”.
And so it goes. Researchers are allowed too much freedom in current protocols, for example to go back and change the research design to something that better suits the data. Researchers are reluctant to share their data or their methodology with others, and that makes reproducibility difficult from the beginning. Data sought from researchers are said to be lost, or in the wrong format, or a victim of a shift in computers or offices, and so on. Many journals state they require openness with data and methods, but the requirement seems not to be policed. For all researchers there is a premium on positive results, both to get published and to get a grant renewal. So researchers strive to get or discover significant statistical relationships in their data. And of course there is ‘groupthink’, which is abundantly illustrated in climate science but by no means restricted to it.
What can be done? The NAS proposes a long list of recommendations, some of which I agree with, and some of which I think somewhat pie-in-the-sky. We cannot, for example, look to governments to solve the crisis, because governments are part of the problem: they search now for policy-based evidence. But there is said to be a general improvement in laboratory practices, or at least Nature says so, while there are a lot of new journals springing up which set out to follow the canons of reproducibility. And let us not forget the efforts of people like Anthony Watts, Judith Curry and others, who keep the doors open for those who want to argue with what should never be called ‘settled science’: if it’s science, it’s never settled, and if it’s settled, it isn’t science.
The NAS paper is long, but well written and most accessible. I recommend it to those who are interested in how we come to know things, and how best to do so. And, though it says little about the issue, the paper points to real problems with peer review, which include groupthink, the replacement of editors with others more favourable to the groupthink, and a sheer failure to think hard about what the proposed article is actually saying. That last failure has become so obvious that spoof papers have been offered and published, apparently without anyone’s even reading them closely. There is even a computer program available which will generate one for you. I doubt that the reproducibility problems are going to be solved quickly, but at least they are being recognised.