Saturday, January 02, 2010
Data collection, the pursuit of knowledge, and intellectual property rights
First, I'd like to thank Dan and the rest of the Prawfs gang for inviting me to guest blog here - it is truly an honor. My posts are usually relegated to a blog with a much smaller following that I run with co-editor Andy Whitford. As Dan noted, I am a professor of political science at Binghamton University. However, before I went to graduate school and began my second career as a social scientist, I was an attorney. It is this intersection of law and social science that has always intrigued me, and my question for this post has a lot to do with both topics.
In conducting social science research I perform a lot, and I mean a lot, of data collection and coding. This is a process that is annoyingly both mundane and challenging (at least at times). Since most of you have at least some experience with this process I will not bore you with the challenges and pitfalls of performing quality data collection and coding. What I do want to stress though is that it is work. It requires time, expertise (this varies of course by context), and effort (i.e. it's not fun). Finally, this work adds value.
In political science there are very strong professional mores to share data with other researchers. In fact, it is usually expected immediately after publication of your first article using the data, if not before that time (e.g. after presenting a working paper at a conference). I profess some ignorance of the social mores on data sharing in empirical legal studies, but from my few conversations on this point, I think that they might be somewhat different. In political science this norm of sharing your data is usually rationalized along the lines of "don't you want to aid the pursuit of knowledge?" or "surely you support the advancement of science, right?" or a similar call to a higher good. Now, don't get me wrong, I have asked people for data and they have shared it with me (and vice versa) - this process is all fine and good. But isn't this generalized rationalization a bit simplistic? If people begin losing incentives to spend significant time collecting data, then doesn't that inhibit the pursuit of knowledge? It seems that within this norm there is a tacit winner (data analyzers) and an implicit, well, loser-chump (data collectors). Isn't a more nuanced discussion in order? Can't we find some way to satisfactorily compensate/protect data collection efforts? I'd be very interested to hear what intellectual property scholars think about this situation. Okay, a few observations to clarify my discussion here:
- Of course, I am not talking about data collected with the help or aid of a funding entity such as the National Science Foundation - clearly such data should be shared and my understanding is that it is part of the grant agreement.
- I am also not talking about a situation in which a researcher has access to information that other researchers do not have. The usual situation is publicly available data that has been collected, organized, and put into a spreadsheet - all with some toil and sweat.
- I am not suggesting that data be kept forever - just that we might think about a protection period or guidelines that adequately compensate the investment of time that data collection requires, and perhaps that we have some uniformity in the process and an enforcement mechanism.
- Re the duty to further the pursuit of knowledge - aren't there competing duties to the entities providing the opportunity to collect the data? I'm wondering: if your university pays you summer money to collect data (or for a research assistant to do so), then doesn't the institution have a proprietary interest in that data, and shouldn't it be compensated when the effort it funded is used? Isn't this what happens when science professors get patents on things they discover or develop on university-funded projects? It is my understanding that the entities producing Pacer and Trac data charge users for their products.
- One defense of requiring "quick" data sharing is that we must do this to make sure that the researcher has competently collected and analyzed the data. Really? I'm pretty sure that a journal editor could require that authors submit data for peer review with an agreement that the person performing the robustness/accuracy testing does not publish with the data or release it - simple enough.
- Another defense is that data collectors are adequately compensated by the social capital and/or citations (acclaim) that they receive by making their data available to other researchers - curiously, we don't apply this rationale (at least to the same degree) to music, writing, trademarks, or product design.
I'll shut up for now, but I would be interested in hearing other people's thoughts on these matters. As I said before, I've been on both sides of the data sharing situation, so I actually have somewhat mixed feelings on the subject. It just seems to me that this is a very underexplored area of IP, especially given the increase in the importance of data-driven processes in recent years in the private sector. A (very) brief search on Lexis revealed some law review articles on data and IP, but not as many as I had expected.
In my view one of the more difficult problems of IP is what to do with non-secret sweat of the brow data. I have argued in a couple of places that courts should protect such information in some cases. At the same time, publicly available information should be, well, publicly available, such that we shouldn't allow some contracts that purport to limit data use. Several people have attacked this problem, but I've yet to be convinced by any proposed solution. I've planned to write an article on this topic, but haven't solved it in my own mind yet.
That said, data collection for research is a much easier topic. Just don't share it if you don't want to. I guess your peers might get mad at you, but I suppose that's a cost worth bearing if it keeps you incentivized to go get the data. I've gotten data from some folks, not gotten it from others, and worked out when (in the future) data will be pooled. I think it's completely reasonable to have such discussions, especially when the money is coming from a general research fund of the school. My dean, for example, has gone above and beyond in supporting my current empirical project - I think I owe at least some duty to the school to expand the wealth of knowledge through my attributional use of the data for a little while before everyone else uses it to expand knowledge.
Posted by: Michael Risch | Jan 2, 2010 7:31:03 PM
Something to add to the discussion: what about crowd-sourced data? I don't have a link handy, but there is at least one example in the NYT Magazine's "Year in Ideas" issue that talks about a proof of an extremely mathematical theorem derived not by a mathematician, but by the comments section of a mathematician's blog. They're going to publish the resulting paper under the name D.H.J. (for the theorem's name) Polymath.
So who has the rights to that sort of data? Or the data collected by other mechanisms, like Tenure-Matic (I read your blog) or RECAP (the crowd-sourced circumvention of PACER)?
Posted by: Matthew Reid Krell | Jan 2, 2010 10:27:33 PM
"Re the duty to further the pursuit of knowledge - aren't there competing duties to the entities providing the opportunity to collect the data? I'm wondering: if your university pays you summer money to collect data (or for a research assistant to do so), then doesn't the institution have a proprietary interest in that data, and shouldn't it be compensated when the effort it funded is used? Isn't this what happens when science professors get patents on things they discover or develop on university-funded projects? It is my understanding that the entities producing Pacer and Trac data charge users for their products."
If you are taking university funding to collect and collate data, then you should have a conversation with your university about releasing the data. If they approve of releasing it, then everything is hunky-dory. If they want to commercialize it, you should have a conversation with them about why that's an inappropriate condition that thwarts their and your academic and public missions. If they still refuse, you should return their money, so that you can release your data freely.
Pacer charges, but it should not. It also doesn't limit the redistribution of data once the downloader has paid any applicable per-page fees.
It's unfortunate that Trac has to charge, but it claims to need to do so in order to make the datasets it produces available at all. For an academic researcher to sell access to her own data is an unethical conflict of interest.
The right way to "compensate" academic data collectors is credit.
Posted by: James Grimmelmann | Jan 2, 2010 11:55:52 PM
Thanks so much to Michael, Matthew, and James for their thoughtful comments. You have certainly given me some food for thought on this topic and raised some interesting new questions.
Posted by: Jeff Yates | Jan 3, 2010 12:05:15 PM
I understand your point, but in an academic context, the "money" usually comes from sharing the data, not selling it directly. For instance, if I discover how to decipher an ancient text, I can do the following once I announce my discovery:
* Obtain a better academic position based on my expertise
* Attract grants and tuition-paying graduate students to my department
* Obtain a visiting fellowship somewhere enticing
* Write a book and possibly obtain some royalties
* Become an invited keynote speaker with an honorarium
For this to happen, though, my colleagues have to agree that my work is valuable, and the accepted method for doing this is publishing in the "right" journals and allowing others to analyze my data. The more my work is cited, the more prestige I (the academic) obtain and the more opportunities open up.
If the discovery is attractive to the public, then you can also
* Obtain larger speaker fees
* Obtain book royalties
* Appear on TV
In one sense you still own the data because you have the first analysis of it. Anyone else who works with it will usually have to mention you as the source (even if they are trying to debunk you).
In many disciplines, keeping data secret gives you a paranoid or flaky reputation, so it's not recommended. A famous case in archaeology was a debate over the slow pace of releasing the Dead Sea Scrolls.
There are some conflicts in this model. For instance, instructors usually want their publications widely disseminated, even if that means giving them away for free. The publisher, on the other hand, wants a profit, and so wants more traditional corporate copyright restrictions.
Posted by: E. Pyatt | Jan 21, 2010 11:31:07 AM
There's a lot of incentive to release data.
1. The person you give it to owes you a favor.
2. You get cited for it, and this is a way to potentially get far more cites than with an ordinary published article.
3. People trust your work more.
4. People check your work and give you "good" cites, that is, the kind that engage with it, whether positively or negatively.
5. Once you've written using the data, you have little reason to keep it secret.
I've been wondering: is there any field *except* pro-warming climate science where data sharing post-publication is not the norm?
Posted by: Eric Rasmusen | May 17, 2010 11:33:18 AM