
Wednesday, April 22, 2009

What is Quality Empirical Work, Part 1

In this post, I want to start fleshing out some answers to the challenges posed by evidence-based legal empirical scholarship that I raised earlier, or at least examine how to go about addressing them.


The key problem we face is the methodological complexity of observational work. The appeal of the randomized clinical trial is that (at least in theory, if not always in practice) it frees the analyst from having to think about who is in the various groups. As long as a few key steps are taken--blinding the analyst to treatment status and outcome, for example--random assignment and exogenous treatment eliminate many if not all of the potential sources of bias.
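To make the point about random assignment concrete, here is a minimal simulation sketch. Everything in it is a hypothetical illustration rather than anything drawn from a real study: the true effect of 2.0, the unobserved trait `u`, and the selection rule are all assumptions. It compares a naive difference in means when treatment is assigned by coin flip with the same calculation when units with a high `u` select themselves into treatment.

```python
import random
import statistics


def difference_in_means(treated, outcomes):
    """Naive estimate: mean outcome of the treated minus mean outcome of the untreated."""
    y1 = [y for t, y in zip(treated, outcomes) if t]
    y0 = [y for t, y in zip(treated, outcomes) if not t]
    return statistics.fmean(y1) - statistics.fmean(y0)


def simulate(randomized, n=20000, true_effect=2.0, seed=0):
    """Simulate n units and return the naive difference-in-means estimate.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    treated, outcomes = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)  # unobserved trait that also drives the outcome
        if randomized:
            t = rng.random() < 0.5       # coin flip, unrelated to u
        else:
            t = u + rng.gauss(0, 1) > 0  # high-u units select into treatment
        y = true_effect * t + 3.0 * u + rng.gauss(0, 1)
        treated.append(t)
        outcomes.append(y)
    return difference_in_means(treated, outcomes)


rct_estimate = simulate(randomized=True)
observational_estimate = simulate(randomized=False)

print("true effect:            2.00")
print(f"randomized estimate:    {rct_estimate:.2f}")
print(f"self-selected estimate: {observational_estimate:.2f}")
```

Under these assumptions the randomized estimate lands close to the true effect, while the self-selected estimate is inflated because it also picks up the difference in `u` between the two groups. An observational analyst would have to measure or otherwise account for `u` to recover the same answer.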

This strong argument in favor of randomized trials has led groups like the Cochrane and Campbell Collaborations to classify non-randomized, observational studies as--for all intents and purposes--unreliable. It is an attitude strongly reflected in the epidemiology literature on evidence-based practices as well.


But the unpleasant truth is that experimentation, or its analogs, is not always possible, for several reasons (which I raised in my earlier post, but want to expand on a bit here):

1. A randomized trial may be politically or ethically impossible. This is, of course, a problem that toxicologists face: they cannot randomly poison some people and not others (though, as evidenced by the Tuskegee experiments, this has happened in the past). But it is also hard--though not impossible--to randomize police interventions.

2. A randomized trial may be technically impossible. RCTs are impractical for studying, say, the effect of exposure to a particular drug or chemical over ten years. Maintaining control conditions for long periods of time is simply too hard a task, even if theoretically possible.

3. A randomized trial can measure only certain effects. As Heckman and Smith demonstrate, RCTs are effective at measuring the average difference between treated and untreated groups, but not other relationships of interest, such as the variation in response.

4. People know they're being tested. Levitt and List have shown that people, sensing that they are being experimented upon, consciously adjust their behavior. As a result, experimental studies of behavior are harder to interpret.

5. We just don't have time to wait. Experiments often take time to conduct. If we need to know what the relationship is between x and y, and we need to know it quickly, we need observational work if no experiment has already been conducted.


As a result of these factors, it is essential to figure out how to identify and aggregate high-quality observational work rather than facially dismissing it as insufficiently rigorous. To dismiss it is to consign large swaths of important issues of public policy to the realm of the "unknowable." Perhaps in the end we'll discover that this is the right thing to do, but we are not there yet, and those making this point have by no means met their burden of proof (and I believe that they bear it).

So that forces us to ask the key question: what is a high-quality observational study, and how do we measure it? In this post, I just want to point out the difficulty of the first question. In my next post, I'll turn to the second half.

At first blush, "what is quality?" appears to be a trivial question. At the very least, a high-quality study is one that controls for all potential sources of bias, and does so in the most efficient (i.e., precise) way possible. But it quickly becomes clear that the triviality is ephemeral.

In particular, how do we balance the two key traits, accuracy and precision? An instrumental variable approach, for example, reduces or eliminates bias in the presence of certain defects (such as endogeneity), but does so in a way that leads to less precise estimates (larger standard errors). Should we always prioritize unbiasedness? This seems to be the view of most economists, but it need not be the best answer. After all, which is more helpful to a policy planner: an unbiased estimate of 7, with a confidence interval of -10 to 24, or a slightly biased estimate of 5, with a confidence interval of 4 to 6?
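One hedged way to put numbers on that comparison is root mean squared error, which combines bias and standard error into a single figure: RMSE = sqrt(bias^2 + SE^2). The sketch below rests on assumptions the example above does not state: that both intervals are 95% normal-based intervals, and that the bias of the second estimate is 2 (the gap between the two point estimates, used purely as a hypothetical). RMSE is also only one possible yardstick, not the definition of quality.

```python
import math

Z_95 = 1.96  # assumption: both intervals are 95% normal-based intervals


def standard_error_from_ci(low, high):
    """Back out the standard error implied by a symmetric 95% interval."""
    return (high - low) / (2 * Z_95)


def rmse(bias, low, high):
    """Root mean squared error: sqrt(bias^2 + standard_error^2)."""
    se = standard_error_from_ci(low, high)
    return math.sqrt(bias ** 2 + se ** 2)


# Unbiased estimate of 7 with an interval of -10 to 24.
unbiased_rmse = rmse(bias=0.0, low=-10, high=24)

# Estimate of 5 with an interval of 4 to 6. The true bias is unknown;
# 2.0 is a hypothetical value chosen for illustration.
assumed_bias = 2.0
biased_rmse = rmse(bias=assumed_bias, low=4, high=6)

# Largest bias the tighter estimate could carry and still match the
# unbiased estimate's RMSE.
break_even_bias = math.sqrt(
    standard_error_from_ci(-10, 24) ** 2 - standard_error_from_ci(4, 6) ** 2
)

print(f"RMSE of unbiased estimate: {unbiased_rmse:.2f}")
print(f"RMSE of biased estimate:   {biased_rmse:.2f}")
print(f"break-even bias:           {break_even_bias:.2f}")
```

Under these assumptions the tighter but biased estimate has a much smaller RMSE, and it would keep that advantage unless its bias were several times larger than the assumed value. A planner who cares more about the sign of the effect, or about worst-case error, might reasonably weigh the two estimates differently.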

In other words, the tradeoff between bias and efficiency is not a trivial matter, despite the general lack of attention it receives. Thus while the components of quality may be easy to define, assembling those components into a workable and measurable definition of "quality" is much harder. In fact, there is likely no one definition; instead, it will turn on the type of question being asked, the nature of the data, and how the results are intended to be used.

What is essential, then, is that we start thinking about this issue more and addressing it more explicitly in empirical work. 

This is rarely done in original studies. What is surprising is that it appears to be an overlooked issue in the literature on how to develop quality guidelines. Rather than developing a definition of quality and then deducing the standards from it, most quality guidelines simply bring together a collection of terms thought to be correlated with an inchoate sense of "quality," but likely drawn from several different, and possibly contradictory, conceptions of it. As a result, it becomes unclear what, if anything, such guidelines are measuring. And so it should come as no surprise that different guidelines reach different conclusions, given that they almost surely are implicitly measuring different things.

It is impossible to screen for quality if we do not know what quality is. And if empiricists have not yet reached any sort of definition, how can we expect non-empiricists to? At some level, this points to a gaping hole in Daubert. Daubert asks a lay judge to assess whether a result is sufficiently reliable--whether it is of sufficient quality. Yet if a judge asked me "what is a reliable study?" I could not give him a generally agreed-upon answer, and I do not think I am alone in this. I find this very disturbing.

Up next: even if we agree on a definition of quality, how do we evaluate it?




 

Posted by John Pfaff on April 22, 2009 at 03:43 PM


Comments

Hi,

Not sure if you're getting a chance to read the comments here, but there's an interesting recent paper that compares randomized experiments to non-randomized:

http://pubs.amstat.org/doi/abs/10.1198/016214508000000733

"Can Nonrandomized Experiments Yield Accurate Answers? A Randomized Experiment Comparing Random and Nonrandom Assignments," William R. Shadish, M. H. Clark, Peter M. Steiner. Journal of the American Statistical Association.

Posted by: Stuart Buck | Apr 23, 2009 5:09:37 PM
