
Google Books: Department of Justice and ReCaptcha

September 21st, 2009 by Chris Strauber

Department of Justice Brief Filed on Google Books Settlement

Not surprisingly, there has been a lot of activity around the Google Books settlement as the filing deadlines approached. The US Department of Justice filed a brief (PDF) (link via searchengineland) with the court on Friday. DOJ objects to the settlement as proposed on several grounds, while recognizing the cultural significance of what Google is working on. The objections are: 1) the scope and value of the settlement are matters of public interest and may not be appropriately settled by a private lawsuit; really, Congress should be drafting legislation to do this, 2) the structure of the proposed Book Rights Registry may make it impossible for anyone else to compete in the new market, 3) not all of the interested parties may have been adequately notified of the settlement and of possible changes to their rights, 4) it would be more appropriate for Google to have rights-holders opt in to the agreement, rather than the current arrangement, which assumes consent. More detailed summaries are available at the New York Times and Search Engine Land.

Significantly, DOJ is not rejecting the deal outright, and is apparently working with the parties to modify the agreement. A hearing is scheduled for October 7th.

Improving OCR

As an example of how quickly things can change, while I was composing my book-length post on problems with optical character recognition, Google was buying a company which has been working on that problem in a really innovative way. Captchas are the odd, squiggly bits of text that websites make you type in when you log in, in order to screen out spam bots. ReCaptcha takes advantage of this common mechanism to proofread troublesome documents. Instead of one word, users are offered two. The first word is known to be correct; the second is one that was flagged as questionable in an online archive of texts. If you get the first one right, your reading of the second one is likely to be right, too, and that data can be fed back to improve both the source text and the OCR software. The technology has already been used to improve the New York Times historical archive. Google announcement (via CNET). How ReCaptcha works.
I suspect it will be a while before German in Gothic typeface or Ancient Greek is prioritized. Still, since the ReCaptcha technology is designed to be installed on a variety of different websites, there's no reason that specialized web communities like H-Net or Voice of the Shuttle, for example, couldn't install it and let their expert users contribute their expertise.
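
For the curious, the two-word check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ReCaptcha's actual code: the class name `ReadingPool`, the `normalize` helper, and the vote threshold are all assumptions of mine, and the real service surely weighs answers in more sophisticated ways.

```python
from collections import Counter, defaultdict


def normalize(text):
    """Compare readings case-insensitively and ignore stray whitespace."""
    return text.strip().lower()


class ReadingPool:
    """Collects human readings of words the OCR software flagged as doubtful.

    Hypothetical sketch: votes_needed is an invented threshold, not a value
    taken from the real ReCaptcha service.
    """

    def __init__(self, votes_needed=3):
        self.votes_needed = votes_needed
        # Maps an unknown word's id to a tally of the readings users gave it.
        self.votes = defaultdict(Counter)

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        """Return True if the user passes; record the second reading if so."""
        if normalize(control_answer) != normalize(control_truth):
            # Failed the known word, so the second reading is not trusted.
            return False
        self.votes[unknown_id][normalize(unknown_answer)] += 1
        return True

    def resolved(self, unknown_id):
        """Return the accepted reading once enough users agree, else None."""
        tally = self.votes.get(unknown_id)
        if not tally:
            return None
        reading, count = tally.most_common(1)[0]
        return reading if count >= self.votes_needed else None
```

A site would call `submit` once per login attempt and periodically check `resolved` to see which doubtful words have collected enough agreeing readings to be sent back to the archive.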
