Thursday, April 16, 2009
The Path Ahead: Evidence Based Empirical Work
In my posts so far, I have tried to point out some serious issues I have with how empiricists go about drawing statistical conclusions. My key concern is that while our hypothesis-testing procedures are built around the idea of deductive falsification, what we are actually doing is more inductive confirmation. The null hypothesis is a straw man: we set it up just to knock it down, because what we really believe is the alternative.
Confirmation requires a broader view than refutation. One black swan refutes the claim "all swans are white," but a single white swan does not confirm it; it provides very, very little evidence for the claim. In fact, "are all swans white?" is not the question we are really asking: "what fraction of swans are white?" is. And so confirmation requires us to look at how all the various pieces of evidence fit together: how well do my ornithological observations mesh with yours?
There are two levels at which this matters. First is at that of an individual research question. Different researchers looking at the same issue can reach different conclusions. One, for example, may find a strong effect of incarceration on crime, another a weak one. (Both may be statistically significant, but that's not really the focus of the papers.) We need to think carefully and rigorously about why they reach different conclusions: is one methodology stronger than the other? do the different results reflect differences in the data? or unstable effects? and so on. There has been some work done along these lines, but not enough, and our methods remain too crude.
The second level has been almost wholly unaddressed as far as I can tell: looking at how different research questions fit together. Perhaps there is a consensus about the effect of incarceration on crime. And perhaps there are also consensuses about the effects of policing on crime, of parole supervision on crime, of drug treatment on crime, etc., etc. But how do all these estimated effects fit together? If combined they explain 200% of the crime drop since 1992, that's a clear sign that something is wrong somewhere. Perhaps they all correctly refuted the null--perhaps all of these have some sort of impact--but if we care about effect size we need to make sure all the numbers add up.
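The consistency check described above is simple arithmetic once the estimates are expressed on a common scale. A minimal sketch, with entirely hypothetical effect shares (none of these numbers come from the actual literatures):

```python
# Hypothetical numbers, purely for illustration: suppose separate
# literatures attribute these shares of the post-1992 crime drop
# to each policy lever, treated as distinct (non-overlapping) causes.
estimated_shares = {
    "incarceration": 0.45,
    "policing": 0.55,
    "parole supervision": 0.40,
    "drug treatment": 0.60,
}

total = sum(estimated_shares.values())
print(f"Combined explained share: {total:.0%}")

# If the causes really are distinct, anything over 100% signals that
# at least one estimate is too large -- or that the effects overlap.
if total > 1.0:
    print("Inconsistent: the estimates jointly explain more than 100% of the drop")
```

As one commenter notes below, exceeding 100% is only damning if the factors have been shown not to overlap; the check is a flag for further scrutiny, not a verdict.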
For now, however, I just want to focus on the first part: how can we accurately assess what we know about a particular research question? The standard approach has been the meta-analysis--gather a lot of studies, try to convert the results into a common metric, and statistically "add up" the results to see what the aggregate effect is.
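The "add up" step can be sketched in a few lines. This is a minimal illustration of standard inverse-variance (fixed-effect) pooling, one common way of aggregating results once they are on a common metric; the study names, estimates, and standard errors are all invented:

```python
import math

# Invented example data: each study reports an effect estimate and a
# standard error, already converted to a common metric.
studies = [
    ("Study A", 0.30, 0.10),
    ("Study B", 0.12, 0.05),
    ("Study C", 0.25, 0.15),
]

# Fixed-effect meta-analysis: weight each estimate by the inverse of
# its variance, so more precise studies count for more.
weights = [1 / se**2 for _, _, se in studies]
pooled = sum(w * est for (_, est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

print(f"Pooled effect: {pooled:.3f} (SE {pooled_se:.3f})")
```

Note that nothing in this calculation asks whether any individual study is any good--which is exactly the GIGO problem discussed next.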
And at a basic level, this is the right way to approach the problem. But the traditional way of producing them suffers from several flaws, as those working in the evidence based medicine and evidence based policy fields have pointed out. In particular:
1. GIGO. "Garbage In, Garbage Out." A meta-analysis is only as reliable as the work that goes into it. (If the past year has taught us anything, it is that you generally cannot aggregate worthless things into something worthwhile.) Thus what is needed is not a synthesis of everything, but a synthesis of everything that is reliable.
2. We measure reliability poorly. This is the actuarial turn: our own personal judgments of quality are often biased. What we need is some sort of structured guidance for assessing quality.
3. We miss a lot. A review should include every relevant study, regardless of quality (although the low-quality studies should only be acknowledged for completeness and transparency, not included in the final synthesis). But most meta-analyses rely on the analyst's personal view of the literature.
So we need something better than the meta-analysis. The solution has been the evidence-based systematic review. The process consists of three basic steps:
1. Develop quality guidelines. These guidelines determine what constitutes good or bad (or better or worse) work, and they should be developed before the review process begins. Moreover, to the greatest extent possible, the quality criteria should themselves be empirically validated, not just asserted. Developing these guidelines is remarkably challenging, and I will talk about this in more detail soon.
2. Gather the entire literature. The goal is to get every article: published in academic or professional journals, published in government reports, as many unpublished articles as possible (which is increasingly doable thanks to sites like SSRN and databases such as Dissertation Abstracts International). There are efforts to create ever-more powerful computer algorithms to aid in these searches as well.
3. Apply the guidelines and draw conclusions. Intuitively, this is straightforward, though there are some challenging questions of methodology: how should we handle poor studies, for example (exclude them altogether or underweight their results)? should we produce a quantitative or qualitative summary? and so on.
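The exclude-versus-underweight question in step 3 can be made concrete. The sketch below pools invented estimates three ways: ignoring quality, dropping low-quality studies below a threshold, and weighting by a hypothetical quality score. None of these numbers, scores, or thresholds come from real studies or real guidelines; the point is only that the choice changes the answer.

```python
# Invented scores: each study gets an effect estimate, a standard error,
# and a hypothetical quality score in [0, 1] from the guidelines.
studies = [
    ("Study A", 0.30, 0.10, 0.9),
    ("Study B", 0.12, 0.05, 0.8),
    ("Study C", 0.45, 0.15, 0.3),  # low quality
]

def pooled(studies, min_quality=0.0, quality_weighted=False):
    """Inverse-variance pooling, optionally filtered or scaled by quality."""
    kept = [s for s in studies if s[3] >= min_quality]
    weights = [(q if quality_weighted else 1.0) / se**2 for _, _, se, q in kept]
    num = sum(w * est for (_, est, _, _), w in zip(kept, weights))
    return num / sum(weights)

print(f"All studies, quality ignored: {pooled(studies):.3f}")
print(f"Exclude low quality:          {pooled(studies, min_quality=0.5):.3f}")
print(f"Down-weight by quality:       {pooled(studies, quality_weighted=True):.3f}")
```

Three defensible rules, three different pooled effects--which is why the methodology must be fixed before the review begins, not after the results are in.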
The resulting reviews are (believed to be) substantially superior to the traditional meta-analysis or literature review. At least right now, it is hard--and I appreciate the irony of what I am about to say--to empirically test the superiority of evidence-based approaches, since that requires meta- (or possibly even meta-meta-) evidence that we simply do not yet have. But the spreading adoption of this approach, from medicine to epidemiology all the way to baseball and, basically, the Presidential election, is at least some evidence of its superiority.
There are at least three key benefits to such reports.
1. Comprehensiveness. There is no reason to assume that the partial sample of studies in the current literature review is an unbiased sample of the entire literature. Systematic reviews get us much closer to completeness.
2. Transparency. Every paper is reviewed, and the analyst should explain why each paper scores high or low. The guidelines facilitate this, by providing a clear system for demonstrating why different papers are treated differently. The traditional review was much more opaque.
3. Objectivity. Judgment remains in all systems, but it is more greatly cabined here (and, if nothing else, made much more explicit). There are examples of reviewers reaching different conclusions about the same question from the same literature based on disagreements about how to score particular studies. But such disagreements are likely rarer, and when they happen we have a much better understanding of where the disagreement exists.
But there is a profound short-sightedness to the current evidence based policy movement. Its general focus is on experimental work. Many proponents and developers of evidence based quality guidelines effectively exclude observational work, defining it collectively as methodologically weak. Unfortunately, there are a lot of issues that can be examined only by observational work. Experimentation may be politically or ethically impossible. Experimentation is good at estimating differences in average effects but not other variables of interest; it is also not an effective way to measure long-run effects (could we maintain experimental controls for 25 years?). And sometimes we simply don't have time to wait for an experiment. This is particularly true in legal settings: the case must be decided, and decided (relatively) quickly.
So quality guidelines must be developed for observational work. But to overcome the problems introduced by the lack of randomization, observational work is substantially more methodologically complex than experimental work, and so the guidelines are also much tougher to develop.
What issues demand attention? These are some of the concerns that I will be addressing in future posts:
1. What is quality? This question is surprisingly unaddressed anywhere in the evidence based policy field. Most guidelines do not define the term and then deduce standards from it, but rather just assemble a collection of "quality terms" that do not necessarily add up to a single coherent view of quality.
2. How do we measure quality? While there is perhaps consensus on the problems observational studies face, there is less on the solutions. This makes guideline design tricky. For example: should guidelines simply point to a general problem and leave it to the reviewer's judgment whether the paper successfully addresses it, or should guidelines point to specific solutions? Thus, should guidelines just ask "Did the paper control for endogeneity?" or should they specify "Did the paper control for endogeneity using technique (1) or (2)?" or something else altogether?
3. How do we verify guideline terms? Terms should not just be asserted, but tested. Yet this requires meta-evidence. Such evidence is difficult to acquire in the observational setting because the number of methodological issues is likely greater than the number of studies on the topic--if we have 10 papers and 15 quality criteria, it becomes very hard to determine what is going on. (Technically, the problem is one of underdetermination: more criteria to evaluate than studies to evaluate them with.)
4. How do we handle low-quality studies? Do we just exclude them altogether? Do we underweight them? The answer turns in part on what we define as quality--if all that matters is unbiasedness, then we should include only the least-biased study, but if efficiency matters then perhaps we should be more inclusive.
5. How do we handle competing guidelines? Different guidelines applied to the same literature can reach different conclusions. This could appear to be a problem--all we're doing is rolling back the locus of problematic judgment to the choice of guidelines. But there is reason to be optimistic as well. Guideline divergence, if it arises between empirically validated guidelines, is actually informative, helping us appreciate better exactly what it is we know and what it is we do not know.
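The meta-evidence problem in item 3 can be seen in a toy linear model: with 15 quality criteria and only 10 studies, the data mathematically cannot pin down each criterion's separate contribution. A sketch, using randomly generated and purely illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)

n_studies, n_criteria = 10, 15
# Each study either passes (1) or fails (0) each quality criterion.
X = rng.integers(0, 2, size=(n_studies, n_criteria)).astype(float)
# Each study's reported effect (here just noise, for illustration).
y = rng.normal(size=n_studies)

# With more criteria than studies, the design matrix cannot have full
# column rank: the data cannot tell the criteria's effects apart.
rank = np.linalg.matrix_rank(X)
print(f"rank {rank} < {n_criteria} criteria: the system is underdetermined")

# Least squares still returns *a* set of criterion effects, but it is
# one of infinitely many that fit the 10 observations equally well.
beta, residuals, _, _ = np.linalg.lstsq(X, y, rcond=None)
```

Any validation exercise in this regime rests on assumptions the data cannot check, which is precisely why observational guidelines are so hard to verify empirically.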
Systematic reviews have the ability to radically improve how we produce empirical knowledge, but much work lies ahead to develop them for the observational work that makes up the bulk of empirical work in the social sciences and the law. In the posts to come, I will flesh out these issues in more depth, and I will also consider the extent to which we can incorporate such reviews into the legal system.
Posted by John Pfaff on April 16, 2009 at 04:13 PM | Permalink
Having various factors add up to 200% is only a problem when the factors have been shown to be distinct.
If I prove that accidents are 20% related to green vehicles and 40% related to trucks and 50% related to Volkswagens, that's 110%. So there are 10% green trucks or green Volkswagens or Volkswagen trucks. Or green Volkswagen trucks.
Posted by: Dal Jeanis | Apr 17, 2009 2:02:16 PM
Although, I frequently lovingly criticize the lawyer profession, this post shows promise. I commend it. It shows awareness. The law is human experimentation. Lawmakers, including common law makers, should begin to comply with standards of methods for human experimentation. Science has been rare in Supreme Court decisions (Brown v Bd of Ed, and Mass v USA, and that is it, as far as I know).
Posted by: Supremacy Claus | Apr 18, 2009 10:37:55 AM