
Thursday, April 09, 2009

A Misguided Philosophy of Science

In my last post, I discussed the first key problem empirical work is facing: an explosion in the number of empirical studies, with the likely effect that average quality is declining. In this post I want to turn to the second major problem that I think ELS is facing, namely an incorrect philosophy of science.

During my first year as an economics graduate student, I spent at most two minutes thinking about the philosophy behind empirical work. On the first day of my year-long econometrics sequence, our professor quickly reminded us that hypotheses cannot be proven, only disproven. That was it. I don't even think Karl Popper's name came up. This is simply not an issue that social scientists wrestle with. Which is a problem, since what we do is not what we think we do.


We think we are engaged in Popperian refutation. Popper's theory is relatively simple to explain: Induction and confirmation are impossible, and all we can do is refute hypotheses. In other words, it is impossible to prove that all swans are white, no matter how many white swans I see. In fact, to Popper, each additional white swan provides no additional confirmation of that hypothesis. But all it takes is one black swan to refute the hypothesis. To Popper, then, science consists of proposing daring hypotheses, testing them, rejecting them, and then moving on to the next, and even more daring, hypothesis. The best thing one can ever say about a theory is that it has not yet been refuted: surviving a test does not, to Popper, provide any indication of accuracy.

The appeal of Popper's approach is that it avoids the problems of induction, known to us since the time of David Hume. Popper's is a purely deductive logic. Our theory makes prediction X, we see that X is not so, so our theory is--and logically must be--wrong. (Logically speaking, this is modus tollens--the first time I've used that word since math class during my junior year in high school.)

There are, of course, some obvious problems with Popper's approach on its own terms. For example, he didn't entirely believe it himself (Susan Haack's work on Popper lays out his internal contradictions well). At one point he talks about science as being a cathedral built brick by brick over the centuries. But his theory is more about dynamiting that cathedral over and over and over again. And it may be impossible to know what you've refuted or why. Any meaningful scientific theory is a collection of assumptions and sub-theories. If the model predicts 6 and returns -12, it has been refuted, but what exactly is the "it"? What part of the theory is wrong? Popper gives us little guidance here.

But the real problem for me is that the social sciences are not engaged in the Popperian endeavor. If nothing else, our theoretical models are incompatible with it. Compare criminology to physics. In criminology, we may be able to make a guess about the direction of the effect, but that is all: "more people in prison will lead to less crime" is the best we can do. There is no theoretical reason to say "a 4% increase in the prison population will lead to a 7% decrease in crime." Physics theory, however, can do just this. The Big Bang model, for example, predicts that cosmic background radiation will have a particular pattern--not "it will get louder for a while then softer for a bit," but the precise pattern it should hold. So the Big Bang theory can be tested: observe the patterns of background radiation and see if they reject the prediction. NASA did just this with the CoBE satellite. The observed data fit the theoretical predictions almost perfectly; had they not, physicists would have known that some part of the Big Bang theory was wrong.

Physics produces genuinely testable predictions. The social sciences do not.

When we take a closer look, the Potemkin Village-ness of our allegedly Popperian approach becomes even clearer. Popper calls on us to test ever-more-daring hypotheses. We not only test the same hypothesis over and over--that of no effect--but the hypothesis we test is the least daring one possible. Superficially, it looks like we're doing what Popper asked: we're trying to reject our hypothesis. But unlike the physicists, we actually want to reject it. (The physics community was very happy the CoBE results were what they were.) 

We've inverted the whole point of the process, and in doing so we are trying to disguise confirmation as rejection.

And, most obviously, we treat our results as meaning something more than just that the null hypothesis was wrong. We draw policy conclusions from them. In effect, we treat our coefficients as confirming a true effect size, not as refuting the hypothesis of no effect. And maybe we have to. After all, in many cases the null hypothesis must be wrong. An increase in the prison population must have some effect on crime. Maybe a small one, maybe one too small to be detected in our data given our current techniques, but there must be an effect. In other words, from the very start we essentially know that our hypothesis is wrong, so refutation can't really be the goal of the process. And it isn't. What we're trying to do is estimate an effect size. And this is induction.

Perhaps this seems like a semantic debate. Okay, fine, we don't do what Popper tells us to do. But we've created some sort of hybrid system that works for us. Does it matter what we call it?

Absolutely. Because the hybrid system we have come up with doesn't actually work.

By thinking we're engaged in deductive falsification, we adopt a lone-wolf way of determining what we know. After all, it only takes one black swan to refute the hypothesis that all swans are white. So when my paper returns a statistically significant result, my paper has found the black swan. It doesn't matter what other people find. (Of course, especially in the social sciences, our instrumentation is imperfect, so we may want outside confirmation to make sure the swan really is black instead of our dirty instruments making a white swan look black.)

But this approach is completely antithetical to the inductive project we're actually engaged in. Induction requires us to add up a lot of results to put together the big picture. It lacks the logical purity of falsification, but so be it. This is what we do. Too often we hear empiricists say "My study shows that...." Not really. An individual study in an inductive world tells us very, very little--as compared to in a Popperian world, where it can tell us everything. Knowledge does not reside in a single study but in the synthesis of an entire literature. By convincing ourselves that we are falsifying, we relegate rigorous reviews to the second or third tier of research, rather than bringing them to the forefront where they belong.
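To make the "adding up" concrete, here is a minimal sketch of one standard way a literature can be pooled: a fixed-effect, inverse-variance weighted average. The three estimates and standard errors are invented for illustration and are not drawn from any real study, and the function name `pool_fixed_effect` is my own. The sketch also assumes the studies are independent and estimate the same underlying effect, which is a strong assumption for observational work.

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted average of study estimates.

    Returns (pooled_estimate, pooled_standard_error).
    """
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    # More precise studies (smaller standard errors) get larger weights.
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)

# Hypothetical prison-crime elasticities from three made-up studies.
estimates = [-0.10, -0.30, -0.20]
std_errors = [0.10, 0.10, 0.05]

pooled, pooled_se = pool_fixed_effect(estimates, std_errors)
print(f"pooled estimate: {pooled:.3f}, standard error: {pooled_se:.3f}")
```

The pooled standard error comes out smaller than that of any single study, which is the sense in which the synthesis, rather than the individual paper, carries the knowledge.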

Susan Haack provides a nice way to think about how we should approach knowledge production in the social sciences. She equates knowledge to a crossword puzzle. A single word, all alone in the grid, is not well-supported (especially if, as most scientific questions likely are, it is a Saturday NY Times). If 35-down is a seven-letter word for "Oceanographers' references," SEAMAPS fits, but we might not be all that confident it is right. But then we look at 46-across, which starts on the M, and we think that MATED could be the right answer for "Like shoes and socks;" we become a bit more sure that SEAMAPS is right. And our faith in MATED grows when we realize that 47-down, which starts on the D, is probably "DOSE," since the clue is "Recommended intake." So DOSE provides warrant for MATED, which in turn warrants SEAMAPS. Once the whole puzzle is filled, we're pretty sure we're right. We could be completely wrong--I've certainly erased entire quadrants of a crossword puzzle before--but the odds are low.

So too with empirical knowledge. When induction is the goal, we must see how all the pieces fit together. Our Popperian pretenses leave us poorly equipped to do this. 

In fact, while fields like epidemiology and medicine are developing rigorous evidence-based practices, including powerful techniques to measure study quality and synthesize entire literatures, the social sciences appear to be taking almost no steps whatsoever in this direction. We occasionally dabble in meta-analyses, but no real effort at systematic reviewing has occurred. I may be wrong--one reason for these posts is to find out what articles or points I have missed--but as far as I can tell there is no, absolutely no, work in economics on developing quality guidelines or systematic reviews.

Thus the costs of our failure to think about our philosophy of science are great, and the growth of ELS is only exacerbating the problem. I want to conclude by just briefly enumerating two key concerns:

1. A failure to synthesize. Too often we think one study tells us something. And when we do attempt to take a broader view, we tend to do so in a non-rigorous manner. This can lead to serious problems. For example: I attended a conference several years ago in which one of the speakers said that if we added up every paper that said "factor x is responsible for y% of the crime drop in the 1990s," we would see that we've explained about 250% of the crime decline. Everyone laughed but thought little of it. But this is a huge problem: the words in the crossword puzzle are not crossing properly. Something, somewhere, is wrong, but by taking a corpuscular view of empirical evidence, we miss it.

2. We use the wrong tools. I'll have a slightly wonkier post on this issue later today, but by thinking in terms of refutation, we provide the wrong results in our papers. In an inductionist world, p-values and t-statistics mean very little. What matters are confidence intervals, and these are almost never reported. Moreover, what we need are the right confidence intervals. Those given by Stata and other packages implicitly assume that the null hypothesis is correct. For small effect sizes these intervals may be sufficiently good approximations. But for larger effects, they are increasingly inaccurate, and alternate confidence intervals, built around different distributional assumptions, are needed.
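As a rough illustration of the reporting point only, the sketch below contrasts a bare p-value with a conventional normal-approximation confidence interval. The coefficient and standard error are invented, and the helper name `summarize` is my own. The interval here is the standard central one, so treat it as the baseline that packages report, not as the alternative interval argued for above.

```python
from statistics import NormalDist

def summarize(estimate, std_error, level=0.95):
    """Return the two-sided p-value against a zero null and a
    normal-approximation confidence interval for the estimate."""
    std_normal = NormalDist()
    z = estimate / std_error
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    critical = std_normal.inv_cdf(0.5 + level / 2.0)
    half_width = critical * std_error
    return p_value, (estimate - half_width, estimate + half_width)

# Hypothetical coefficient: a 1% rise in the prison population is
# associated with a 0.20% fall in crime, with standard error 0.08.
p_value, (low, high) = summarize(-0.20, 0.08)
print(f"p-value: {p_value:.4f}")
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Both numbers come from the same two inputs. The p-value says only that zero is excluded, while the interval shows that effects from about a fifth of the point estimate to nearly double it are compatible with the data.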

So it should be clear now why we need rigorous reviews, and why our failure to produce them is a fundamental epistemological failure. A single study can refute, but only an overview can confirm. And confirmation is what we do. The explosion in empirical work makes such overviews all the more important, since the larger the literature the harder it is to see the big picture, especially with an ever-growing pool of poorly-designed studies muddying the waters. Review essays are the very heart of empirical knowledge, and they should be treated as such. 

Looking at how to develop these reviews in the social sciences will, not surprisingly, be the focus of several of my upcoming posts.

Posted by John Pfaff on April 9, 2009 at 10:40 AM | Permalink


Comments

We not only test the same hypothesis over and over--that of no effect--but the hypothesis we test is the least daring one possible. Superficially, it looks like we're doing what Popper asked: we're trying to reject our hypothesis. But unlike the physicists, we actually want to reject it.

I'm not sure about the middle sentence, but you have to add into this the problem of the hypothesis itself, which is the result not of induction nor deduction, but of what Peirce called abduction, or inference to the best explanation. As Steve McJohn, my colleague, who with Lorie Graham, just published a little essay that touches on this, said to me in the hallway just a couple days ago, it's abductive reasoning that's still the black hole. So if your paper returns what appears to be the black swan, we have to go back to the hypothesis you are testing and assess it to see whether the black swan has any explanatory power (or in Pinker's terms, OOMPH). Finding a black swan that requires a re-thinking of the downward sloping demand curve strikes me as a completely different animal (bird?) from finding one that refutes Ron Gilson's theory of value creation by lawyers. That is, the former has earned our respect in the way that the latter has not.

John, if you are looking for materials dealing with the meta-issues in the legal literature, I'll recommend my own Models and Games: The Difference between Explanation and Understanding for Lawyers and Ethicists, 56 Clev. St. L. Rev. 613, 636-49 (2008). I've also touched on this in Law's Illusion: Scientific Jurisprudence and the Struggle with Judgment, Beetles, Frogs, and Lawyers: The Scientific Demarcation Problem in the Gilson Theory of Value Creation and Disclosure and Judgment.

Posted by: Jeff Lipshaw | Apr 9, 2009 12:10:53 PM

A few things about the post escape me. First, the comparison to physics seems a bit artificial, as social science cannot be reduced the same way. And that's true of natural sciences, too. When you write:
In criminology, we may be able to make a guess about the direction of the effect, but that is all: "more people in prison will lead to less crime" is the best we can do. There is no theoretical reason to say "a 4% increase in the prison population will lead to a 7% decrease in crime."
That sounds to me a lot like the process used in FDA approval of drugs, a lot more significant than what we do and without great objection to my knowledge. You have a very good point that we need to use null hypotheses other than "no effect" but that's hardly a huge indictment of ELS as compared to countless other fields.

As for "no synthesis," that is certainly untrue in my primary area of study -- judicial decisionmaking. And it's not true of my secondary area on institutional analysis of legal systems. I'm not an expert on crime research, but I do see a lot of back and forth debate.

As for "wrong tools," this seems a little obscure; are you talking about substantive significance rather than statistical significance? If so, this is an increasingly recognized problem but throughout nearly all fields, not just ELS. And it is increasingly being addressed, in part through the trend to graphic depictions of relationships.

Posted by: frankcross | Apr 9, 2009 12:28:42 PM

Frank: Thanks for your comments. My point isn't that we need a different null hypothesis but that we are not really proposing and refuting hypotheses at all, whether they are "no effect" or "the elasticity is one." We are trying to measure the effect directly, and this requires a different set of tools. The t-stat and the p-value are not very helpful, and putting a star next to a result does not tell us much.

And I am certainly not limiting my critique to ELS. I think a lot of fields suffer from the problems I lay out here, even if they don't acknowledge it. Thus my comparison to physics: there are certain fields for which Popperian falsification may make sense, and physics may be one of them. But the social sciences, and perhaps the biological sciences more generally, are not those fields.

(Perhaps in the FDA case, what is taking place is an effort to test the broader null hypothesis of "there is not a particularly large effect." In this case, falsification may make more sense. The null hypothesis is not a straw man, and for regulatory purposes the precise estimate is not so important.)

I'm glad to hear that there is synthesis taking place in other fields. But what is it like? In criminology, for example, there is plenty of back and forth, and plenty of what I will call "informal" or authoritative reviews. But outside of some systematic reviews of experimental work gathered by the Campbell Collaboration, I have seen no effort to develop rigorous evidence-based syntheses in the criminal justice literature, certainly not when it comes to observational work, where such guidelines are most needed.

And perhaps I wasn't so clear about the last point. I'm not really talking about the statistical vs. substantive point. Instead, I'm concerned that the way we calculate our confidence intervals, whether to report them numerically or graphically, is poisoned by the null hypothesis assumption (we use a central, rather than a non-central, distribution). This is an issue I've only started to wrestle with, so I'm not sure in the end how important it will be. But it struck me as one of those ways that Popper's influence slinks into our techniques without us even realizing it.

Posted by: John Pfaff | Apr 9, 2009 12:44:24 PM

Well, I'm a little loath to get into the philosophy of science, on which I am not well read. But I think you are setting an unduly high standard here. You write re the FDA:

for regulatory purposes the precise estimate is not so important

I think that's true for our research too. I don't think you can get a precise estimate in the noisy world of social science and I don't think it's essential to get one. This can't be physics. Just providing a little more information to nudge our knowledge.

The debates in law involve a question like: "does wrongful discharge law increase unemployment?" There's a "no effect" null hypothesis, but it's useful to know if it can be rejected. If it can, we won't know precisely how much unemployment is caused but we have some information (that we previously lacked) that needs to be considered in the adoption of wrongful discharge laws. Then, it would be good to bracket the magnitude of that effect and see if it was replicable in further research. That's not everything, but it's a fairly valuable something.

Posted by: frankcross | Apr 9, 2009 1:24:07 PM

Frank: I actually think we're more or less on the same page. In criminal law, for example, we don't just want to know that more prisoners lead to less crime, but we want some sort of estimate of the magnitude. It is impossible to know whether a particular policy makes sense unless we can measure the costs and benefits, and that requires us to know the size of the effect. It is impossible, of course, to measure it exactly, but we need to have a sense of the bounds on it. Just like you said about unemployment.

Perhaps the difference between our positions is that your "then it would be good..." is to me what the real goal is.

But my point is that that is not hypothesis testing. Setting bounds on what we believe to be the true effect size is inductive confirmation, not deductive falsification. And that requires us to use different tools and to be much more careful about how we synthesize large bodies of empirical findings--and much more reliant on well-designed, rigorous syntheses.

In other words, no one study tells us very much, and the actuarial turn we've seen over the past thirty years suggests that, left unguided, most people, even experts, do a surprisingly poor job adding up all the little nudges to develop the big picture.

Evidence-based medicine and evidence-based policy, which are built on this very concern, are revolutionizing fields such as, well, medicine (clearly) and epidemiology. And so far, I just haven't seen that level of rigor in the social sciences.

But I should be careful not to set too high a goal. You're right that precision like CoBE is definitely impossible. But even deriving bounds is ultimately inductive.

Posted by: John Pfaff | Apr 9, 2009 1:38:43 PM

When I was a fellow at Stanford a couple years ago, Jeff Strnad presented a paper on Bayesian empirical analysis (here: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=991335 ) that I thought was simply fascinating. The idea is that you combine a variety of models and data in a way that allows you to see what role different variables are playing in different models. By doing this, you can see what happens as you insert and remove variables into the model. The idea is that you get a more "objective" look at the data and its results.

Perhaps this is what you have in mind as the tools, although even this will likely not get specific numbers for the values. If I recall correctly, Strnad's point was in part if you are going to "follow the results" rather than test falsifiable theories, then Bayesian analysis was the better way to do that.

Separately, I agree with most of what you write - we were always taught in my econometrics classes to come up with a theory first, and then test it, rather than plugging in variables until you get something that is significant.

However, I do disagree with this statement (which you actually seem to back off from later): "Physics produces genuinely testable predictions. The social sciences do not."

That's just not true. "If prices go up, fewer people will buy x." That's a falsifiable prediction, at least with respect to discrete markets. "If prices go up by 10%, 20% fewer people will buy x." Even this is falsifiable, so long as you have data where prices went up by 10% in the market.

The problem comes, I think, when the predictions get more complex and the external forces that might affect the outcome are greater. If we had a theory for those forces, though, we could test it.

Posted by: Michael Risch | Apr 10, 2009 11:46:30 AM

Michael: I'm a big fan of Jeff's paper. In fact, I've been working on a BMA paper with Jeff Fagan and Ethan Cohen-Cole for a little bit now.

Given your comment and some of Frank's, I think I may have been a bit unclear in my argument. I'm not trying to get more precise results. If anything, I want to focus more on confidence intervals than point estimates for this very reason. One of the things I find appealing about BMA is that it *expands* the confidence intervals, giving us a better sense of the limitations of what we can show.

As for your refutation point, you're right that we can propose falsifiable hypotheses, but I don't think that is what we are really doing. The problem is that, unlike in physics, in the social sciences our theories can't give us predictions like "a 10% increase in price will lead to a 20% decline in demand." If they could, then Popper may be more applicable.

Unfortunately, social science theories remain generic. Sometimes theory may suggest a point estimate--perhaps there is a theoretical reason to think an elasticity is one--but this is likely rare. At best our theories can suggest relationships (x will be larger than y, x will die off over time).

In fact, so generic are our theories that they are rarely wrong--demand falls with prices, crime falls with prison populations (at least in the short run), etc., etc. Peter Schmidt has some interesting articles along these lines suggesting that the focus on false positives in hypothesis testing is misplaced since usually the probability of a false positive is zero: we *know* the no-effect null is wrong. What we're doing isn't falsification.

So, since we can't generate the "10%-20%" theory, we try to find what the relationship is in the data: we try to estimate an effect size. Almost every paper makes the policy jump: "My coefficient of x means that a 1% change in z leads to a y% change in w." This is what we really care about--what is the elasticity of demand, not the generic sign of the elasticity--and this is an inductive question.

Posted by: John Pfaff | Apr 10, 2009 1:23:41 PM

John --

You might remember me from Judge Williams' chambers. Great post. Have you read Ziliak and McCloskey's "The Cult of Statistical Significance"? Perhaps a bit hyperbolic now and then, but still a good read.

Posted by: Stuart Buck | Apr 10, 2009 2:45:22 PM
