
Monday, April 13, 2009

More Problems With Hypothesis Testing (From a More Technical Perspective)

I had mentioned on Friday that I wanted to follow up this post with a wonkier one. The arrival of the whirlwind that is my 2-1/2 year old niece pushed that back to today.

In my earlier post, I argued that what we do in the social sciences is not deductively test hypotheses but inductively measure actual effect sizes; we simply mask the inductive endeavor with the appearance of falsification.

In this post, I want to touch on two canaries in the coal mine. The first is more a pet peeve, but it is indicative of the extent to which hypothesis testing is conducted rotely. The second has deeper implications for how we should conduct empirical work in the future.

What, then, are the two problems?

1. You only get one star. It is common to see a table of results with all sorts of stars: one star for the result that is significant at the 90% level, two for 95%, three for 99%. Some go further and differentiate between one- and two-tailed tests. Such results, however, reflect a misunderstanding of what a test statistic is and how it should be interpreted. 

The confidence level is set in advance. The resulting p-value is a random variable. To adjust the level in response to the p-value is an improper ex post move. So if the pre-set level of significance is 95% and the resulting p-value is 0.0001, the proper response is to say "The results are significant at the 95% level (p = 0.0001)." To readjust the claim to "The results are significant at the 99% level (since) p = 0.0001" is simply incorrect.
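The point that the p-value is a random variable can be made concrete with a quick simulation. This is a hedged sketch, not anything from the post: it repeatedly runs a two-sample comparison where the null hypothesis is true by construction, using a normal-approximation z-test, and shows that the p-value scatters across [0, 1] from sample to sample (roughly 5% of null p-values land below 0.05, by definition). The sample sizes and trial counts are illustrative choices.

```python
import math
import random
import statistics

random.seed(42)

def one_null_pvalue(n=100):
    # Two samples drawn from the SAME distribution: the null is true.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = statistics.mean(a) - statistics.mean(b)
    se = (statistics.pvariance(a) / n + statistics.pvariance(b) / n) ** 0.5
    z = diff / se
    # Two-sided p-value from the normal approximation.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

pvals = [one_null_pvalue() for _ in range(2000)]
# Under the null the p-value is (approximately) uniform on [0, 1],
# so about 5% of these null results fall below 0.05.
below_05 = sum(p < 0.05 for p in pvals) / len(pvals)
print(f"fraction of null p-values below 0.05: {below_05:.3f}")
```

Because the p-value bounces around like this, tailoring the number of stars to whatever p-value happened to come out is letting the noise pick the test.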

Thus: there should only be one star per table, at whatever level the analyst sets in advance. The only reason for multiple stars would be if there were a theoretical ex ante reason for setting different levels for different terms.

In the spirit of full disclosure, I will openly admit that I have made this very error myself. It is an easy one to make, and it is clearly a common one. I try to be careful about this now, but it is an issue that is almost wholly unaddressed in the literature.

At some level, this might just look like carping. After all, I'm free to simply ignore the extra stars and read everything as being significant at whatever level I think should be set in advance (assuming the paper provides the p-value). But I raise this because I think it reflects a bigger problem (thus the canary reference above): hypothesis testing is often conducted in a rather automatic manner, without careful reflection about what is really going on and what our tests really mean.

2. There is nothing special about 95%. I once thought I had run into a mathematical impossibility. I had transformed my dependent variable in a way that I thought--mistakenly, I later realized--should have had no effect on the sign of the effect. But the sign reversed. In effect, I thought I had changed y to y + 10 only to find the effect flip. I was talking this over with someone, and her response was "well, is the result significant at the 95% level? No? Then that explains it." That couldn't be right: that my result was insignificant didn't save it from the laws of math and logic. 

It is easy to forget how arbitrary the 95% value is. As Hubbard and Bayarri point out, Neyman and Pearson chose 95% and 99% in their work on hypothesis testing because R.A. Fisher had used those in his version of such testing. We are thus strongly committed to a 95% level that comes from a choice of expediency drawn from a different application--after all, Fisher's approach did not involve an alternative hypothesis. And there is no strong theoretical reason for 95% over any other level; it seems more to be the result of the historical accident that the first mover, Fisher, just happened to choose it.

And so it is disappointing that rarely if ever do we see a paper wrestle with the proper level of significance, such as by asking whether this is a case where a false negative is better or worse than a false positive. After all, it is not always clear that we are best served by the conservatism of the 95% confidence interval. A false positive may be worse than a false negative in criminal law ("better 10 guilty men go free..."), but a false negative may be worse in some medical situations, such as whether a particular pill works.
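The trade-off described above is mechanical, and a short sketch makes it visible. Assume (these numbers are purely illustrative) a one-sided z-test against a true effect of 2 standard errors: lowering the significance threshold to cut the false-positive rate necessarily raises the false-negative rate, which is why the choice of level should depend on which error is worse in the application at hand.

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def norm_ppf(q):
    # Bisection inverse of the normal CDF (keeps this dependency-free).
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if norm_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def false_negative_rate(alpha, delta=2.0):
    # Reject the null when z > z_alpha; we miss the true effect
    # (a false negative) whenever z falls at or below z_alpha.
    z_alpha = norm_ppf(1 - alpha)
    return norm_cdf(z_alpha - delta)

for alpha in (0.10, 0.05, 0.01):
    beta = false_negative_rate(alpha)
    print(f"alpha = {alpha:.2f}  ->  false-negative rate = {beta:.2f}")
```

Stricter levels buy fewer false positives only by accepting more false negatives; nothing in the math privileges 0.05 as the right exchange rate.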

Given that I am suggesting that we should move away from hypothesis testing altogether, it may seem strange to raise these points: why complain about the flaws in a system that I think should be replaced? Two reasons.

First, these problems indicate that our current hypothesis testing procedures are not things that have been rigorously debated and worked out. They are often misapplied, and some of their core concepts appear to be more the result of quick choices than deep thinking. Thus the failure to address these concerns in current work does not reflect the fact that the issues are proven and settled (i.e., we don't ask geologists to start off by proving that the earth is roughly round). We really need to think much more carefully about these issues.

Second, inductive reasoning still requires us to estimate confidence intervals: we need to put bounds on what the true effect is. And so the debate about how wide to make those intervals will remain. Wide bands are more likely to contain the true value, but they will be noisier. It will be easier to see patterns in narrow bands; plus, the more we synthesize across multiple studies, the easier it will be to see if the bands are scattered or clustered, further reducing the need for wide bands (since we can see uncertainty either through one wide band or several scattered narrow bands). Moreover, it may be easier to see precision with narrow bands.
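The width-versus-coverage trade-off can be sketched numerically. This is an illustration under assumed values (a true mean of 5, samples of 50, standard normal critical values), not a prescription: a 99% band is wider but captures the true value more often than a 90% band built from the same data.

```python
import random
import statistics

TRUE_MEAN, N, TRIALS = 5.0, 50, 2000
CRITICAL = {"90%": 1.645, "99%": 2.576}

results = {}
for label, z in CRITICAL.items():
    random.seed(7)  # identical draws so the two bands are compared fairly
    covered, widths = 0, []
    for _ in range(TRIALS):
        sample = [random.gauss(TRUE_MEAN, 2.0) for _ in range(N)]
        m = statistics.mean(sample)
        se = statistics.stdev(sample) / N ** 0.5
        lo, hi = m - z * se, m + z * se
        covered += lo <= TRUE_MEAN <= hi
        widths.append(hi - lo)
    results[label] = (covered / TRIALS, statistics.mean(widths))
    print(f"{label} band: coverage {results[label][0]:.3f}, "
          f"mean width {results[label][1]:.3f}")
```

The wider band earns its extra coverage by being less informative about where the effect actually sits, which is exactly the debate that will outlive hypothesis testing.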

With this, I will conclude my look at where we've been. I now want to turn my attention to where we need to go. So tomorrow I will start looking at how to develop rigorous evidence-based approaches to empirical evidence in the social sciences.

Posted by John Pfaff on April 13, 2009 at 11:34 PM | Permalink


Loved this post too.

I don't know about the use of the term "true effect" . . . . would you agree that we'll never really know a "true effect" until and unless we have perfect data (like Laplace's demon) and a perfect model? Which isn't going to happen. Isn't the best we can do, "if we ran an infinity of statistical tests on the imperfect model and imperfect data that we have, then the coefficient would probably lie within such-and-such interval"?

Posted by: Stuart Buck | Apr 15, 2009 3:41:48 PM
