An interesting post in the context of data mining. William Easterly’s post *Maybe we should put rats in charge of foreign aid research* on his Aid Watch blog not only looks at how much better rats do at correctly predicting random occurrences; it also reflects on the research practice of data mining. Data mining is the process of extracting hidden patterns from data, but it is often misused by researchers who deliberately look for apparent but not necessarily representative patterns in large amounts of data (*data dredging* and *data snooping*).

*The post below republishes William Easterly’s post on the rather interesting rat experiment and its interpretation; I have also complemented it with some further reflections by Mark Thoma – for the serious statisticians 😉.*

Laboratory experiments show that rats outperform humans in interpreting data, which is why we have today the US aid agency, the Millennium Challenge Corporation. Wait, I am getting ahead of myself, let me explain.

The amazing finding on rats is described in an equally amazing book by Leonard Mlodinow. The experiment consists of drawing green and red balls at random, with the probabilities rigged so that greens occur 75 percent of the time. The subject is asked to watch for a while and then predict whether the next ball will be green or red. The rats followed the optimal strategy of always predicting green (I am a little unclear how the rats communicated, but never mind). But the human subjects did not always predict green; they usually wanted to do better and predict when red would come up too, engaging in reasoning like “after three straight greens, we are due for a red.” As Mlodinow says, “humans usually try to guess the pattern, and in the process we allow ourselves to be outperformed by a rat.”
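Why always predicting green is the better strategy can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the human "pattern guesser" can be modeled as predicting green 75 percent of the time, independently of the actual draw (so-called probability matching). That model is an assumption made here, not something the experiment description states, and the function names are hypothetical.

```python
import random


def expected_accuracy(p_green: float, q_guess_green: float) -> float:
    """Chance of a correct call when green occurs with probability p_green
    and the subject guesses green with probability q_guess_green,
    independently of the draw."""
    return p_green * q_guess_green + (1 - p_green) * (1 - q_guess_green)


def simulate(p_green: float, q_guess_green: float,
             draws: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo check of expected_accuracy."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        ball_is_green = rng.random() < p_green
        guess_is_green = rng.random() < q_guess_green
        hits += ball_is_green == guess_is_green
    return hits / draws


P_GREEN = 0.75

# "Rat" strategy: always predict green.
rat = expected_accuracy(P_GREEN, 1.0)

# Assumed "human" strategy: predict green 75% of the time (probability matching).
human = expected_accuracy(P_GREEN, P_GREEN)

print(f"rat: {rat:.3f}, human: {human:.3f}")
print(f"simulated rat: {simulate(P_GREEN, 1.0):.3f}, "
      f"simulated human: {simulate(P_GREEN, P_GREEN):.3f}")
```

Under these assumptions the rat is right 75 percent of the time, while the pattern guesser is right only 0.75 × 0.75 + 0.25 × 0.25 = 62.5 percent of the time.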

Unfortunately, spurious patterns show up in some important real world settings, like research on the effect of foreign aid on growth. Without going into any unnecessary technical detail, research looks for an association between economic growth and some measure of foreign aid, controlling for other likely determinants of economic growth. Of course, since there is some random variation in both growth and aid, there is always the possibility that an association appears by pure chance. The usual statistical procedures are designed to keep this possibility small. The convention is that we believe a result if there is only a 1 in 20 chance that the result arose at random. So if a researcher does a study that finds a positive effect of aid on growth and it passes this “1 in 20” test (referred to as a “statistically significant” result), we are fine, right?

Alas, not so fast. A researcher is very eager to find a result, and such eagerness usually involves running many statistical exercises (known as “regressions”). But the 1 in 20 safeguard only applies if you only did ONE regression. What if you did 20 regressions? Even if there is no relationship between growth and aid whatsoever, on average you will get one “significant result” out of 20 by design. Suppose you only report the one significant result and don’t mention the other 19 unsuccessful attempts.

You can do twenty different regressions by varying the definition of aid, the time periods, and the control variables. In aid research, the aid variable has been tried, among other ways, as aid per capita, logarithm of aid per capita, aid/GDP, logarithm of aid/GDP, aid/GDP squared, [log(aid/GDP) – aid loan repayments], aid/GDP*[average of indexes of budget deficit/GDP, inflation, and free trade], aid/GDP squared *[average of indexes of budget deficit/GDP, inflation, and free trade], aid/GDP*[ quality of institutions], etc. Time periods have varied from averages over 24 years to 12 years to 8 years to 4 years. The list of possible control variables is endless. One of the most exotic I ever saw was: the probability that two individuals in a country belonged to different ethnic groups TIMES the number of political assassinations in that country. So it’s not so hard to run many different aid and growth regressions and report only the one that is “significant.”

This practice is known as “data mining.” It is NOT acceptable practice, but this is very hard to enforce since nobody is watching when a researcher runs multiple regressions. It is seldom intentional dishonesty by the researcher. Because of our non-rat-like propensity to see patterns everywhere, it is easy for researchers to convince themselves that the failed exercises were just done incorrectly, and that they finally found the “real result” when they get the “significant” one.

Even more insidious, the 20 regressions could be spread across 20 different researchers. Each of them obediently runs only one pre-specified regression; 19 of them do not publish a paper since they had no significant results, but the 20th does publish their spuriously “significant” finding (this is known as “publication bias”).
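The arithmetic behind the 20-regression problem can be made concrete. The sketch below is an illustration under stated assumptions, not a reconstruction of any actual aid study: "growth" and every "aid" measure are pure noise, the 20 specifications are treated as independent, and a simple correlation t-test stands in for a full regression. In real research the specifications share data and are correlated, so the exact figure would differ. All names and sample sizes are hypothetical.

```python
import math
import random

ALPHA = 0.05    # the "1 in 20" convention
N_SPECS = 20    # number of specifications tried

# If the 20 tests are independent and there is no true effect:
expected_hits = ALPHA * N_SPECS        # average number of false "significant" results
p_any = 1 - (1 - ALPHA) ** N_SPECS     # chance that at least one spec is "significant"


def t_stat(x, y):
    """t-statistic for the correlation between x and y
    (equivalent to the slope t-test in a one-variable regression)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1 - r * r))


def mined_study(rng, n_obs=60, n_specs=N_SPECS, t_crit=2.002):
    """One researcher tries n_specs unrelated noise 'aid' measures against
    noise 'growth'. Returns True if any of them passes the 5% test.
    t_crit is approximately the two-sided 5% critical value for 58 degrees of freedom."""
    growth = [rng.gauss(0, 1) for _ in range(n_obs)]
    for _ in range(n_specs):
        aid = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(t_stat(aid, growth)) > t_crit:
            return True
    return False


rng = random.Random(42)
n_studies = 1000
share = sum(mined_study(rng) for _ in range(n_studies)) / n_studies

print(f"expected false positives in {N_SPECS} tries: {expected_hits:.1f}")
print(f"analytic chance of at least one: {p_any:.3f}")
print(f"simulated share of studies with a 'finding': {share:.3f}")
```

So although each individual test keeps its 1 in 20 error rate, a researcher (or a field of 20 researchers) trying 20 unrelated specifications on data with no effect at all has roughly a 64 percent chance of producing something publishable.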

But don’t give up on all damned lies and statistics: there ARE ways to catch data mining. A “significant result” that is really spurious will only hold in the original data sample, with the original time periods, with the original specification. If new data becomes available as time passes, you can test the result with the new data, where it will vanish if it was spurious “data mining.” You can also try different time periods, or slightly different but equally plausible definitions of aid and the control variables.
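The “new data” test can be illustrated with a small simulation. This is a sketch under assumptions made here for illustration: there is no true relationship at all, a result is “mined” by trying up to 20 noise specifications on the original sample, and the new sample is modeled as fresh independent noise. A correlation t-test again stands in for a regression, and the names and sample sizes are hypothetical.

```python
import math
import random


def t_stat(x, y):
    """t-statistic for the correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1 - r * r))


def new_data_test(n_studies=1000, n_specs=20, n_obs=60, t_crit=2.002, seed=7):
    """Return (mined, replicated): how many studies found a 'significant'
    result by specification search, and how many of those findings are
    still significant when retested once on fresh data."""
    rng = random.Random(seed)
    mined = replicated = 0
    for _ in range(n_studies):
        growth_old = [rng.gauss(0, 1) for _ in range(n_obs)]
        found = False
        for _ in range(n_specs):
            aid_old = [rng.gauss(0, 1) for _ in range(n_obs)]
            if abs(t_stat(aid_old, growth_old)) > t_crit:
                found = True
                break
        if not found:
            continue  # nothing to publish, nothing to retest
        mined += 1
        # Retest the reported specification ONCE on new observations.
        growth_new = [rng.gauss(0, 1) for _ in range(n_obs)]
        aid_new = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(t_stat(aid_new, growth_new)) > t_crit:
            replicated += 1
    return mined, replicated


mined, replicated = new_data_test()
print(f"mined 'findings': {mined}, still significant on new data: {replicated}")
```

Because the retest is a single pre-specified test, a spurious finding survives it only at about the nominal 5 percent rate, which is why new data is such an effective check.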

So a few years ago, some World Bank research found that “aid works {raises economic growth} in a good policy environment.” This study got published in a premier journal, got huge publicity, and eventually led President George W. Bush (in his only known use of econometric research) to create the Millennium Challenge Corporation, which he set up precisely to direct aid to countries with “good policy environments.”

Unfortunately, this result later turned out to fail the data mining tests. Subsequent published studies found that it failed the “new data” test, the different time periods test, and the slightly different specifications test.

The original result that “aid works in a good policy environment” was a spurious association. Of course, the MCC is still operating; it may be good or bad for other reasons.

Moral of the story: beware of these kinds of statistical “results” that are used to determine aid policy! Unfortunately, the media and policy community don’t really get this, and they take the original studies at face value (not only on aid and growth, but also in stuff on determinants of civil war, fixing failed states, peacekeeping, democracy, etc., etc.) At the very least, make sure the finding is replicated by other researchers and passes the “data mining” tests.

*And here is Mark Thoma’s supplement …*

I saw Milton Friedman provide an interesting example of avoiding data mining. I was at a SF Fed conference where he was a speaker, and his talk was about a paper he had written 20 years earlier on “The Plucking Model.” From a post in January 2006, New Support for Friedman’s Plucking Model:

Friedman found evidence for the Plucking Model of aggregate fluctuations in a 1993 paper in *Economic Inquiry*. One reason I’ve always liked this paper is that Friedman first wrote it in 1964. He then waited for more than twenty years for new data to arrive and retested his model using only the new data. In macroeconomics, we often encounter a problem in testing theoretical models. We know what the data look like and what facts need to be explained by our models. Is it sensible to build a model to fit the data and then use that data to test it to see if it fits? Of course the model will fit the data, it was built to do so. Friedman avoided this problem since he had no way of knowing if the next twenty years of data would fit the model or not. It did.

The other thing I’ll note is that there is a literature on how test statistics are affected by pretesting, but it is ignored for the most part (e.g. if you run a regression, then throw out an insignificant variable, anything you do later must take account of the fact that you could have made a type I or type II error during the pretesting phase). The bottom line is that the test statistics from the final version of the model are almost always non-normal, and the distribution of the test statistics is not generally known.

[One more note. I wrote a paper on Friedman’s Plucking Model, and had a revise and resubmit at a pretty good journal. I satisfied all the referees’ objections, at least I thought I had, and it was all set to go. I had sent the first version of the paper to Friedman, and he wrote back with a long, multi-page letter that was very encouraging, and I incorporated his suggestions into the revision (a reason I’ll always have a soft spot for him: his time was valuable, yet he took the time to do this). But the final results weren’t robust, and had come about through trying different specifications until one worked. The final specification worked well, very well in fact, but the results were pretty fragile. As a result, I pulled the paper and did not resubmit it. The paper was completely redone and rewritten, but after thinking it over I decided it wasn’t robust enough to publish. I find myself regretting that sometimes. The referees would probably have taken the paper since the final version satisfied all their objections, and it was a good journal – I told myself I had simply done what everyone else does, etc. But, hard as it was for an assistant professor in need of publications to pull a paper, especially one Friedman himself had endorsed – this was just before going up for tenure so it could have mattered a lot – pulling the paper was the right thing to do. The only way to solve this problem – and data mining in economics is a problem – is for the people involved in the research to self-police the integrity of the process.]

**Update**: Seems like a good time to rerun this graph on publications in political science journals:

Lies, Damn Lies, and….: Via Kieran Healy, … It is, at first glance, just what it says it is: a study of publication bias, the tendency of academic journals to publish studies that find positive results but not to publish studies that fail to find results. …

The chart on the right shows G&M’s basic result. In statistics jargon, a significant result is anything with a “z-score” higher than 1.96, and if journals accepted articles based solely on the quality of the work, with no regard to z-scores, you’d expect the z-score of studies to resemble a bell curve. But that’s not what Gerber and Malhotra found.

*Above* a z-score of 1.96, the results fit the bell curve pretty well, but *below* a z-score of 1.96 there are far fewer studies than you’d expect. Apparently, studies that fail to show significant results have a hard time getting published.

So far, this is unsurprising. Publication bias is a well-known and widely studied effect, and it would be surprising if G&M *hadn’t* found evidence of it. But take a closer look at the graph. In particular, take a look at the two bars directly adjacent to the magic number of 1.96. That’s kind of funny, isn’t it? They should be roughly the same height, but they aren’t even close. There are *a lot* of studies that just barely show significant results, and there are *hardly any* that fall just barely short of significance. There’s a pretty obvious conclusion here, and it has nothing to do with publication bias: data is being massaged on wide scale. A lot of researchers who *almost* find significant results are fiddling with the data to get themselves just over the line into significance. … Message to political science professors: you are being watched. And if you report results just barely above the significance level, we want to see your work….
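The comparison of the two bars next to 1.96 can be turned into a simple check, sometimes called a caliper test. The sketch below is a hedged illustration rather than G&M’s actual procedure: it assumes that, absent any massaging, a study falling in a narrow window around 1.96 is about equally likely to land on either side, and it uses an exact one-sided binomial test. The window width, the function names, and above all the z-scores are made up for illustration and are NOT the published data.

```python
from math import comb


def caliper_counts(z_scores, threshold=1.96, width=0.2):
    """Count studies just above and just below the threshold,
    within a window of the given width on each side."""
    below = sum(1 for z in z_scores if threshold - width <= abs(z) < threshold)
    above = sum(1 for z in z_scores if threshold <= abs(z) < threshold + width)
    return above, below


def caliper_p_value(above, below):
    """Exact one-sided binomial p-value: the chance of seeing at least
    `above` studies over the line out of above + below, if each study in
    the window were equally likely to fall on either side."""
    n = above + below
    return sum(comb(n, k) for k in range(above, n + 1)) / 2 ** n


# Made-up z-scores purely for illustration; NOT the Gerber and Malhotra data.
example_z = [1.80, 1.85, 1.97, 1.98, 1.99, 2.01,
             2.03, 2.05, 2.08, 2.10, 2.12, 2.14]

above, below = caliper_counts(example_z)
print(f"just above: {above}, just below: {below}, "
      f"p-value: {caliper_p_value(above, below):.4f}")
```

A small p-value says the lopsidedness near the threshold is unlikely to be chance. The 50/50 assumption is only approximate, since a bell curve is still declining across the window, which if anything would put slightly more studies just below the line than just above it.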