Thursday, March 24, 2011

Google Takes One on the Chin

Unless you’ve been living under a rock for the past five years, you’ve heard about Google Book Search, an online database that Google is filling by scanning books from the collections of multiple academic libraries. Google partnered up with the libraries for access to their collections, concluding that it didn’t need to ask permission from copyright owners to copy the books or make them available online (with restrictions at Google’s discretion). Author and publisher groups brought a class action lawsuit, and Google sat down with the plaintiffs to hammer out a settlement agreement. The agreement, as presented to Judge Chin, then of the Southern District of New York, not only proposed to settle claims of prior copyright infringement, but also set up a future business relationship between Google and copyright owners, using an opt-out mechanism instead of securing a license.

Judge Chin, sitting by designation as a newly appointed Second Circuit judge, issued an opinion Tuesday denying the motion to approve the settlement agreement. While the opinion does not rule on the merits of Google’s fair use defense, some of the court’s language suggests it would cast a skeptical eye on Google’s behavior in a case on the merits. Judge Chin used words like “blatant” to describe Google’s copying activity, and quoted hostile language from objectors to the settlement, who called the opt-out strategy “a shortcut” and “a calculated disregard of authors’ rights.”

I disagree that there was calculated disregard of authors’ rights. The way many scholars have viewed the case, it was a close call whether Google could copy whole books and place them in an online database for search (and eventual commercialization). Under a standard view of fair use analysis, Google took a calculated risk, and one that could still pay dividends if the case is litigated. I don’t think Google’s fair use argument is a close call at all, but not for the standard reasons.

In a paper now available for consideration by your local law review editorial board, I argue that built into the Copyright Act are the seeds of a limited right of first online publication. Historically, the right of first publication protected the ability of the copyright owner to decide when to bring a work to market, and the right was strong enough to trump an otherwise reasonable fair use defense. A right of first online publication would thus dispose of (or at least weigh heavily against) a fair use defense raised to excuse the unauthorized dissemination of a work online—even those works previously disseminated in a more restricted format.

Courts and scholars tend to treat publication as a “one-bite” right, exhausted once the work is released in any format. A reexamination of its history indicates instead that the right of first publication often protected successive market entries. Where courts perceived a significant difference in scope and exposure to risk between a limited initial publication and a more expansive subsequent publication, the right of first publication protected that subsequent entry.

In addition to the historical analysis, network theory sheds light on the difference in scope between print and online publication. The dissemination of print books occurs as conserved spread: while the physical embodiment of the content moves from point to point, the total number of copies in the network remains stable, allowing the owner to correctly assess the risks inherent in market entry. Online dissemination occurs as nonconserved spread: any holder of a digital copy can instantly disseminate it to any point online while retaining the original. The differences are significant enough that print dissemination should not be held to exhaust or abandon the right of first online dissemination.
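
To make the contrast concrete, here is a toy sketch of the two kinds of spread. It is only an illustration under simplifying assumptions of my own: the function names, the fixed number of steps, and the `shares_per_holder` parameter are hypothetical and not drawn from the paper or from any particular network model.

```python
from typing import List


def conserved_spread(initial_copies: int, steps: int) -> List[int]:
    """Track total copies when each copy only changes hands (print-like)."""
    counts = [initial_copies]
    for _ in range(steps):
        # A transfer moves a copy from one holder to another;
        # the giver no longer has it, so the total is unchanged.
        counts.append(counts[-1])
    return counts


def nonconserved_spread(
    initial_copies: int, steps: int, shares_per_holder: int = 1
) -> List[int]:
    """Track total copies when each holder shares and keeps the original."""
    counts = [initial_copies]
    for _ in range(steps):
        current = counts[-1]
        # Every holder sends out new copies while retaining their own
        # (assumes, for simplicity, that every recipient is a new holder).
        counts.append(current + current * shares_per_holder)
    return counts


if __name__ == "__main__":
    print("conserved:   ", conserved_spread(1000, 5))
    print("nonconserved:", nonconserved_spread(1, 5))
```

In this sketch a print run of 1,000 copies stays at 1,000 however often the books change hands, while a single digital copy doubles at every step if each holder shares it once. Real dissemination is far messier, but the difference in the owner's ability to predict exposure is the point.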

Copyright law will likely need to adapt to multiple format changes over the effective life of a copyrighted work. Recognizing the right of first publication as a rule governing transitions into new formats will provide courts, copyright owners, and technology innovators with firm rules allowing the copyright owner to decide if and when to adopt a new technology. Intra-format fair use is an important part of the bargain between copyright owners and society. We should be solicitous of intra-format fair use, but much less solicitous of inter-format fair use, particularly in those cases where unauthorized use imports the work into a previously unexploited format, and the new format is significantly broader in scope.

There are close fair use calls in disputes over unauthorized use of copyrighted works. Google’s opt-out copyright strategy, for books never before made available online, was no close call at all.

Posted by Jake Linford on March 24, 2011 at 10:05 PM


Great post title. Your points about the choice to enter or not to enter digital markets are interesting; you have brought some useful rigor to informal arguments being thrown around about the differences between print and online.

But I'm concerned that your post and your paper conflate Google Book Search (which displays only short snippets) with the Google Partner Program and the proposed settlement, which involve full copies of works. Google claimed fair use only as to scanning, indexing, and snippet display. In that context, where Google is publishing only short excerpts, the unique risks associated with online publication are significantly attenuated. You can't use the full-text settlement programs -- which Google only would have attempted if the settlement were approved -- to measure the fairness of the search engine.

In the paper, you discuss security concerns, and you cite two examples of people downloading full books. But the first example, from Kuro5hin, involves a user who could download full books that were available for partial preview -- that is, books that were part of the Partner Program, which lets users view up to 20% of the pages in a given book. His hack is a circumvention of the page limit. (I'm not certain, as I haven't run the software, but the Google Book Downloader appears to be the same.) Google never makes entire pages available for books whose copyright owners haven't opted in, only Snippet View. So these hacks don't appear to apply to the case in which the fair use issues would actually be relevant at trial: the scanning and searching of books without explicit opt-in permission. True, there is the possibility of a wholesale database breach -- but (a) I suspect (though don't know for sure) that books in Snippet View are far less accessible from user-facing computers than books in Partial Preview, and (b) if we're debating the risks from databases rather than what's shown to users, this isn't about online publication any more.

Posted by: James Grimmelmann | Mar 25, 2011 12:20:08 AM

James, thanks for your comments. I’m sure you noticed how Judge Chin relied on two of the articles you gathered for the recent New York Law School symposium on the topic. Kudos!

To respond to your concern, I’m not sure that the risks posed by snippet view are substantively different from the risks posed by limited preview view. [Google discusses the differences here.]

When you search Google Books for a term, like the wingardium leviosa spell that J.K. Rowling concocted for use by wizards in her Harry Potter novels, you get information about the novel where it appears: Harry Potter and the Sorcerer’s Stone. You also get a snippet of how the words you searched appear in context.

This appears to me to be a partial reproduction of the scan of the full page where the word appears – a portion of what you’d get if Google had provided limited preview and you could see the full page. The book is thus “available” for snippet view in the same context—a scan from the relevant page—as it is in limited preview or full view. It's just a smaller slice of the page.

Google decides you get part of a page instead of a full page, but I don’t see any indication that what’s behind the curtain is functionally different than what’s behind the curtain in limited page view. I could be wrong about that, but for the search engine to do its job, the text of the whole book is most likely available to search queries. If the text of the whole book can be disclosed in snippets, one snippet at a time, and the snippets are scans from the actual page, as they appear to be in my search above, then the text of the whole book can be hacked with the same result as the hack of a book accessible through limited preview.

Let’s assume that Google is a particularly responsible unauthorized user, and it invests a significantly greater amount of energy in securing from breach the books for which it provides a snippet view than it does those books available for limited page view. [Mind you, I’m skeptical about the second assumption because Google has a contractual relationship with the owners of limited preview books that it does not have with owners of books accessible only through snippet view.] Even assuming Google is a good online citizen, I’m still troubled that Google is the one making the call about how and when the book is first presented online. Every day we see more indications that connecting to the Internet dramatically increases the risk of harmful unauthorized exploitation. For example, the New York Times recently reported that cars with Internet connectivity may be subject to unauthorized remote control.

I'd like to think copyright owners will get better insulation against those risks if those who put the work online have to license the right to do so from owners. Owners would at least be able to negotiate with distributors over security ex ante. I’m more certain that owners will get the opportunity to make a reasoned decision about the risks posed when they place a work online if everyone knows that the decision rests with them, and not with an unauthorized user, regardless of how otherwise fair the unauthorized use appears to be. Of course, using the right of first publication as a transitional rule has all the advantages and flaws that come with rules over standards, but I think the clarity would be a net plus in this instance.

Posted by: Jake Linford | Mar 25, 2011 7:40:26 AM

My comment doesn't question your argument about online first publication, which I found interesting and need to think more about. I'm taking it as given and questioning some of the details about its application to Google Books. Maybe the best way to put it is that there are some unacknowledged questions of degree when it comes to snippets and full pages, and that your paper would be stronger if they were acknowledged questions.

For example, your hypothetical that Google invests a "significantly greater amount of energy in securing from breach the books for which it provides a snippet view" is actually fairly close to reality, for reasons that depend on the details of how the current Google Books programs work. If you search and find a book in snippet view, Google embeds the snippets as PNG images, which are completely noninteractive. If you search and find a book in Preview, Google embeds images of the pages within a much more full-featured Javascript viewer, which supports scrolling, navigation, and other features. It can also make additional requests back to Google's servers for additional pages as necessary. It relies on cookies on the user's computer (and possibly other things) to tell it when the user has reached a preview limit for the book. The downloader programs exploit this by doing clever things with cookies to prevent Google from realizing that a user has exceeded the limits Google is trying to set. This is much, much easier to automate, first because the actual display and download can use Google's own Javascript to simplify the task, and second because the limits consist of attempts to enforce a cap, rather than not making some material easily available in the first place.

Posted by: James Grimmelmann | Mar 25, 2011 8:18:00 AM

Very helpful, James. From your description, it may be much more difficult to hack the snippet view than the limited preview, given the differences between using PNG and Javascript to populate the search results and provide access to book text.

Does anyone have a sense (assuming this hasn't become a dialogue between James and me) for whether Google is populating the PNG images directly in response to each search query, or whether they've archived two- and three-line PNG snippets for the text of every book available in search? I suspect it's the former. If so, that indicates a direct connection between the search query and the PNG image.

The PNG image appears to populate from the actual scan of the text. This leads me also to wonder whether Google needed to scan two different versions of each book to create PNG and Javascript access to text, or whether one could be derived from the other. Or both from a third format?

Posted by: Jake Linford | Mar 26, 2011 4:16:48 PM

This is not a complete answer to your question, but some technical details are discussed here, and some of the technical issues came up during questions at the keynote of the D is for Digitize conference.

One minor correction: PNG is an actual image file format, whereas Javascript is a programming language that Preview uses to request and display the images. I have not checked what file format is actually used within Preview.

Posted by: James Grimmelmann | Mar 27, 2011 7:29:56 PM

