
Google Books Practicalities (Part II)

September 8th, 2009 by Chris Strauber

[Image: from a Greek/Latin New Testament on Google Books]

Professionally speaking I have mixed feelings about the Google Books project, for reasons I will try to explain below. It has the potential to completely change how research in the Humanities is done. I’m not given to hyperbole; I’m simply saying that a full-text searchable database of book-length texts the size of a research library (much less two dozen of them) has never existed before, and there are all sorts of possible unintended consequences. Most of them are good for scholarship. The proposed settlement was a surprise (I had been placidly assuming that the lawsuits would go on forever), and I quite literally spent about forty-five minutes staring at my computer screen the day it was announced last fall, trying to figure out what the ramifications were. This series of posts contains some of my thoughts, and I’ll share more as I read more.

What Google Books Does Well

Full text search. It provides searchable access to a vast quantity of published literature. Library catalogs are built to facilitate browsing. Books are described in general terms, and once you’re in the right spot on the shelf you’re surrounded by related materials. This is exactly what you need for some projects. But for other projects, ones where you’re looking for an obscure fact or name, or (and this is a library school classic) trying to identify the original version of a particular quote or phrase, it doesn’t work as well. Going that extra step requires you to use the index and table of contents of each book. It’s entirely possible to flip through an entire shelf of books in a few minutes if all you’re looking for is references to a person or idea…but standard indexing never covers every word or concept mentioned. A computerized index (ideally at least, see below) does. A computer can search thousands or millions of items in the time it takes you to open one.
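
To make that last point concrete, here is a minimal sketch in Python of an inverted index, the basic structure behind full-text search. It is purely illustrative and is not how Google actually builds its index; the titles and snippets are invented. The point is that every word of every text gets an entry, so a query for an obscure name is a single lookup rather than a trip through a shelf of back-of-the-book indexes.

from collections import defaultdict

# Toy "library": invented titles and snippets, for illustration only.
books = {
    "Book A": "In the beginning was the Word",
    "Book B": "The quality of mercy is not strained",
    "Book C": "Mercy and truth are met together",
}

# Map every word to the set of books that contain it.
index = defaultdict(set)
for title, text in books.items():
    for word in text.lower().split():
        index[word].add(title)

# Every word is indexed, not just the ones a human indexer chose to list.
print(sorted(index["mercy"]))   # ['Book B', 'Book C']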

Scale. There are ebook collections which can be searched in bulk, but most of those built for scholarly purposes are fairly small. Even Early English Books Online, a massive collection, includes only about 125,000 items. Google is scanning the entire contents of dozens of research libraries worldwide: books, journals, everything. Current estimates are that ten million items have been scanned so far.

Scope. Modern research has grown more and more interdisciplinary for years. In contrast, most ebook collections, like the 1700 titles in ACLS Humanities, are small and fairly narrowly scoped.

What Google Books Does Poorly

Again, I’ll refer you to the article by Geoff Nunberg I mentioned in my previous post. He eloquently makes the case for metadata, which is to say that it really does matter how well you can describe the book and relate it to others.

Typefaces and Languages
The search index is generated by running Optical Character Recognition (OCR) on the scanned page images of the texts. OCR works best on clean texts with modern typefaces. Old books fare less well. Books in foreign languages (especially in non-Roman alphabets) fare less well. Old books in foreign languages are very, very hard to do. Here are a few extreme examples. Google does get credit for providing access to the raw text of the search index, something most traditional library vendors do not do.
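
A rough sense of why this goes wrong: the OCR engine has to be told (or guess) what language and script it is looking at, and it has to segment the page layout before it reads anything. The sketch below uses the open-source Tesseract engine through the pytesseract library, which is an assumption for illustration only; Google runs its own OCR pipeline, the file name is hypothetical, and the "grc" (Ancient Greek) and "lat" language packs have to be installed separately.

from PIL import Image
import pytesseract

# Hypothetical scan of a Greek-Latin parallel page.
page = Image.open("parallel_page.png")

# Told the page is Greek only, the engine renders the Latin column as
# Greek-looking gibberish -- roughly what seems to have happened in the
# New Testament example below.
greek_only = pytesseract.image_to_string(page, lang="grc")

# Declaring both languages gives the engine at least a fighting chance,
# though the two-column layout still has to be segmented correctly.
both = pytesseract.image_to_string(page, lang="grc+lat")

print(both[:500])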

The process is similar in concept to what JSTOR and ACLS do, but because JSTOR (about 1500 journals) and ACLS (about 1700 books) cover vastly smaller collections of texts, their descriptive and cataloging data tends to be of much higher quality. From conversations with JSTOR reps at conferences, I know they are aware of the problem I am about to describe, but none of the database aggregators and vendors handles this kind of material particularly well. The only solution that works on a large scale is Project Gutenberg’s collaborative proofreading model.

From Google Books:
A Greek-Latin parallel text New Testament (1821)
what the page looks like
what the search engine sees
Comment: I love several things about this, while being really sympathetic to how hard it is to get a machine to make any sense of what’s going on here. The first is that, as is not uncommon, the introduction to this Greek-Latin parallel New Testament is in Latin. As is half the text. Yet the language that seems to have been used for OCR is Greek. So all the Latin gets processed into Greek gibberish (as is most of the Greek), and the division of the page is not noted by the computer interpreter at all.
Liddell and Scott Greek Dictionary (1848)
what the page looks like
what the search engine sees

Lewis and Short’s Elementary Latin Dictionary (1894)
what the page looks like
what the search engine sees

Comment: This is almost usable, though it’s not a completely reliable version. Latin seems to work better, but italics and all the other tricks typesetters use to make a huge body of text legible to humans really aren’t legible to a computer.

Immanuel Kant’s Gesammelte Schriften, volume 5
what the page looks like
what the search engine sees

Comment: This is, to borrow Nunberg’s phrase, a complete train wreck. For practical purposes it’s not searchable…and it’s the canonical version of Kant’s complete works.

Things That Reassure Me I Will Continue To Have A Job

1) Optical Character Recognition cannot do everything yet. In particular, until it can reliably read a Latin text from the 19th century or earlier, much less anything in Greek or in German Gothic type, it is not yet a replacement for the tools traditionally in use.

2) Google does not know that volume five of Kant’s Collected Works is related to anything else in the eight-volume edition Tufts has. So if I wanted the Critique of Pure Reason, in volume 3, Google assumes I will be able to find it by searching for it. Except see #1.

3) The settlement only applies to out-of-print books. Current editions will still be searchable only on whatever terms Google agrees to with publishers. The library will still be your most convenient free source.

And yet, like Nunberg, I am optimistic. There are explicit terms in the agreement which allow Google to make the data available for research and study. Once the huge collection of data exists it will be possible to make it better.
Questions or concerns about Google Books or the future of libraries? Ask in the comments.
