
The Book, Terms of Service

September 10th, 2009 by Chris Strauber

A nice thought experiment in what the licensing terms for a book would look like if they were spelled out in the terms used for computer software, music, and movies (via librarian.net).

Example:

I. Privacy
What takes place in the exchange between your brain and the contents of The Book is your exclusive private concern. The Book will never download the contents of your brain, either whole or in part.

Google Books Practicalities (Part II)

September 8th, 2009 by Chris Strauber

(Image from a Greek/Latin New Testament on Google Books)

Professionally speaking I have mixed feelings about the Google Books project, for reasons which I will try to explain below. It has the potential to completely change how research in the Humanities is done. I’m not given to hyperbole; I am simply saying that a full-text searchable database of book-length texts the size of a research library (let alone two dozen) has never existed, and there are all sorts of possible unintended consequences. Most of them are good for scholarship. The proposed settlement was a surprise–I had been placidly assuming that the lawsuits would go on forever–and when it was announced last fall I quite literally spent about forty-five minutes staring at my computer screen, trying to figure out what the ramifications were. This series of posts contains some of my thoughts, and I’ll share more as I read more.

What Google Books Does Well

Full text search. It provides searchable access to a vast quantity of published literature. Library catalogs are built to facilitate browsing. Books are described in general terms, and once you’re in the right spot on the shelf you’re surrounded by related materials. This is exactly what you need for some projects. But for other projects, ones where you’re looking for an obscure fact or name, or (and this is a library school classic) trying to identify the original version of a particular quote or phrase, browsing doesn’t work as well. Going that extra step requires you to use the index and table of contents of each book. It’s entirely possible to flip through an entire shelf of books in a few minutes if all you’re looking for is references to a person or idea…but standard indexing never covers every word or concept mentioned. A computerized index (ideally at least, see below) does. A computer can search thousands or millions of items in the time it takes you to open one.
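To make the contrast with a back-of-the-book index concrete, here is a minimal sketch in Python of the kind of word-by-word index a computer can build. It is not a description of how Google Books works internally; the function names and the sample pages are invented for illustration.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map every lowercased word to the set of (title, page) locations containing it."""
    index = defaultdict(set)
    for (title, page_number), text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add((title, page_number))
    return index


def search(index, query):
    """Return the locations that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


# Hypothetical sample data: three pages from two made-up books.
pages = {
    ("Book A", 12): "The quality of mercy is not strained",
    ("Book A", 13): "It droppeth as the gentle rain from heaven",
    ("Book B", 4): "Mercy and truth are met together",
}
index = build_index(pages)

print(search(index, "mercy"))        # both pages that mention mercy
print(search(index, "gentle rain"))  # only the page containing both words
```

Unlike a human-made index, nothing is left out: every word on every page gets an entry, so a lookup costs about the same whether the collection holds three pages or three million. The catch is that the index is only as good as the text it was built from.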

Scale. There are ebook collections which can be searched in bulk, but most built for scholarly purposes are fairly small. Even Early English Books Online, a massive collection, includes only about 125,000 items. Google is scanning the entire contents of dozens of research libraries worldwide: books, journals, everything. Current estimates are that ten million items have been scanned so far.

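Scope. Modern research has been becoming more and more interdisciplinary for years, and a collection drawn from the entire contents of research libraries cuts across disciplines by default. In contrast, most ebook collections, like the 1700 titles in ACLS Humanities, are small and fairly narrowly scoped.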

What Google Books Does Poorly

Again, I’ll refer you to the article by Geoff Nunberg I mentioned in my previous post. He eloquently makes the case for metadata, which is to say that it really does matter how well you can describe the book and relate it to others.

Typefaces and Languages
The search index is generated by running Optical Character Recognition (OCR) on the PDF images of the texts produced by scanners. OCR works best on clean texts with modern typefaces. Old books fare less well. Books in foreign languages (especially those in non-Roman alphabets) fare less well. Old books in foreign languages are very, very hard to do. Here are a couple of extreme examples. Google gets credit for providing access to the raw text of the search index, which most traditional library vendors do not.
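Because the raw text is available, a reader can run a crude sanity check on it. The following Python sketch counts which alphabet the letters in a piece of OCR output belong to, so that a page expected to be in Latin but recognized mostly as Greek characters stands out. It is an illustration only, not anything Google is known to do; the function names and the 0.5 threshold are assumptions.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters by script, using the first word of each Unicode character name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            script = name.split(" ")[0] if name else "UNKNOWN"
            counts[script] += 1
    return counts


def looks_mismatched(text, expected_script, threshold=0.5):
    """Flag text where fewer than `threshold` of the letters are in the expected script.

    The threshold is an arbitrary assumption chosen for illustration.
    """
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return False  # nothing to judge
    return counts[expected_script] / total < threshold


# A Latin sentence as it should be read, and plain Greek text of the sort
# that would be expected on the Greek half of a parallel page.
latin_page = "In principio erat Verbum"
greek_page = "εν αρχη ην ο λογος"

print(script_counts(latin_page))
print(looks_mismatched(greek_page, "LATIN"))
```

A check like this only catches the grossest failures, such as an entire page processed in the wrong alphabet. It says nothing about whether the words themselves were read correctly.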

The process is similar in concept to what JSTOR and ACLS do, but because JSTOR (about 1500 journals) and ACLS (about 1700 books) cover vastly smaller collections of texts, their descriptive and cataloging data tends to be of much higher quality. From conversations with JSTOR reps at conferences, I know they are aware of the problem I am about to describe, but none of the database aggregators and vendors works particularly well with this kind of material. The only solution on a large scale is Project Gutenberg’s collaborative proofreading model.

From Google Books:
A Greek-Latin parallel text New Testament (1821)
what the page looks like
what the search engine sees
Comment: I love several things about this, while being really sympathetic to how hard it is to get a machine to make any sense of what’s going on here. The first is that, as is not uncommon, the introduction to this Greek-Latin parallel New Testament is in Latin. As is half the text. Yet the language that seems to have been used for OCR is Greek. So all the Latin gets processed into Greek gibberish (as is most of the Greek), and the division of the page is not noted by the computer interpreter at all.
Liddell and Scott Greek Dictionary (1848)
what the page looks like
what the search engine sees

Lewis and Short’s Elementary Latin Dictionary (1894)
what the page looks like
what the search engine sees

Comment: This is almost usable, though it’s not a completely reliable version. Latin seems to work better, but italics and all the other tricks typesetters use to make a huge body of text legible to humans really aren’t legible to a computer.

Immanuel Kant’s Gesammelte Schriften, volume 5
what the page looks like
what the search engine sees

Comment: This is, to borrow Nunberg’s phrase, a complete train wreck. For practical purposes it’s not searchable…and it’s the canonical version of Kant’s complete works.

Things That Reassure Me I Will Continue To Have A Job

1) Optical Character Recognition cannot do everything yet. In particular, until it can reliably search a Latin text from the 19th century or prior, much less anything in Greek or in German Gothic type, it is not yet a replacement for the tools traditionally in use.

2) Google does not know that volume five of Kant’s Collected Works is related to anything else in the eight-volume edition Tufts has. So if I wanted the Critique of Pure Reason, volume 3, Google thinks I will be able to find that by searching for it. Except for #1.

3) The settlement only applies to out-of-print books. Current editions will still only be searchable on whatever terms Google agrees to with publishers. The library will still be your most convenient free source.

And yet, like Nunberg, I am optimistic. There are explicit terms in the agreement which allow Google to make the data available for research and study. Once the huge collection of data exists it will be possible to make it better.
Questions or concerns about Google Books or the future of libraries? Ask in the comments.

Google Books Settlement (Part I)

September 4th, 2009 by Chris Strauber

Today (September 4th, 2009) is the last day for authors to opt out of the proposed class action settlement between Google and publishers concerning the Google Books project. Since the settlement was announced last fall there has been a blizzard of press coverage and arguments for and against the settlement. Part I is an introduction to the project, the players, and what’s under discussion.

The Story So Far
Starting in 2004 Google, in cooperation with several libraries worldwide, including Stanford, the University of Michigan, Oxford, Harvard, and the New York Public Library, began the largest digitization project in history. The goal was to comprehensively scan and make available online the collections of these major research institutions. Google claimed that “fair use” allowed it to make copies of the millions of works involved as long as it displayed no more than brief excerpts online. Publishers disagreed, claiming that the act of making the copies was itself copyright infringement on a massive scale, and sued Google in 2005. While the case has been pending, a number of other libraries have joined the project, including the Bibliothèque Nationale, the Bavarian State Library, and the National Library of Catalonia. A proposed settlement was announced last fall, with a time limit for objections set by the court overseeing the case.

The Settlement
The settlement would set up a Book Rights Registry to manage royalties from the sale of digital books and advertising. Current publications and pre-1923 publications would be handled the same way they are now. A major problem with past digitization efforts has been determining who owns the copyright to large numbers of “orphan works”, those whose copyright holders cannot easily (or at all) be identified. The settlement would hold royalties in trust where copyright is unclear, and therefore provide an incentive for publishers to claim their orphan works.
Objections to the settlement largely center on monopoly power, pricing, and privacy. Robert Darnton of Harvard started the discussion of serious objections to the settlement in an article in the New York Review of Books in February. While the settlement is not an exclusive one, the path Google has taken is not one open to many others. Anyone wanting to replicate the Google Books project would have to begin scanning books, get sued by every author and publisher on the planet, and come to a settlement. Google could also abuse its monopoly position by raising prices to whatever level it wanted. The Internet Archive and Amazon object that the way to do this is to change the law to make it possible, not to use a court case to completely change how copyright law works to the benefit of one company. The FTC and the American Library Association have expressed concerns about the privacy implications of such a large digital library.

A second set of concerns, unrelated to the legalities, is about the quality of the data produced by Google’s massive and rapid scanning (almost 10 million books in five years). A recent article in the Chronicle of Higher Education by Geoff Nunberg of UC-Berkeley points out a range of problems which need to be addressed to make the Google library useful for scholarship.

Additional overviews
Tome Raider (brief). By the Economist. Good coverage of the European angle, which is important but not being discussed in US publications as much, because the settlement would apply to US publishers only.
Google’s Moon Shot (lengthy). By Jeffrey Toobin, in the New Yorker.

Ontological Proof and hedonism

August 24th, 2009 by Chris Strauber

(Image of Anselm of Canterbury; photo from Wikimedia Commons)

Coverage of the “self-thinking thought” in the New York Times online. It’s been a while since I’ve seen discussion of this proof of God’s existence, but describing it as a hedonistic argument got my immediate attention.

For more on arguments for God’s existence, see coverage in these sources, which Tufts subscribes to. Note: if you’re off-campus or on wi-fi you’ll be asked to log in.

Routledge Encyclopedia of Philosophy

Encyclopedia of Religion

For non-Tufts readers, and for comparison, you might also be interested in the (freely available and web-based) Stanford Encyclopedia of Philosophy’s coverage of Anselm and ontological arguments, and Wikipedia’s article on the ontological proof.

Future of Newspapers (again)

August 10th, 2009 by Chris Strauber

Rupert Murdoch recently announced in an earnings call with investors that he intends to charge for access to all his news websites. No details on how this would be managed have been announced, nor any time frame. The Wall Street Journal, a recent acquisition, is one of very few online publications which have successfully charged for access.

This raises, from a different angle, a question I discussed on this blog recently: is this the future of newspapers?
I think the question here is about what kinds of information people are willing to pay for. Readers of the Wall Street Journal are paying for information they think will make them money. Most newspaper stories do not fit into this category. Robert Andrews, writing for the Guardian, comments that a better approach might be to charge for unique and special interest things like crossword puzzles and soccer memorabilia. Why charge for that and not the paper as a whole? Because putting the paper behind a pay wall renders it invisible to search (and therefore makes any advertising on the page, probably, less valuable). Also because most of the news in any given newspaper is not unique and can easily be found elsewhere. Charging for what makes your paper unique *does* seem like a reasonable strategy.

If you followed any of the links in paragraphs one or two, your exposure to the advertising on the pages I linked to makes money for, respectively, the New York Times, Motley Fool, and The Guardian. Free to the reader doesn’t necessarily mean free–see discussion by John Gruber.

Don’t worry: Tisch Library has you covered either way. Most of News Corporation’s newspapers are in our collection, typically via Lexis-Nexis and/or Factiva–including less well-known publications like the Fiji Times and the Sunday Tasmanian. One limitation of Lexis-Nexis and Factiva is that they don’t include photographs, just text, which makes tabloids like The Sun and the Daily Mirror much less exciting (and renders almost pointless baby elephant stories like this). So for historical purposes, we also keep microfilm copies of key newspapers like the Wall Street Journal (1959-present) and London Times (1785-present). Yes, you read that correctly: our coverage of the London Times starts during the reign of Louis XVI.

We subscribe to thousands of newspapers in a variety of formats. Examples for this article are exclusively from News Corporation properties. For others, just do a title search in the library catalog.

40th Anniversary of Apollo 11

July 17th, 2009 by Chris Strauber

(image courtesy of NASA)
The 40th anniversary of the first moon landing is kicking up a storm on the Internet. There are a variety of sites and services to look at if you’re interested in more information about this particular piece of history.
The most spectacular is We Choose The Moon, a project of the JFK Presidential Library, which is streaming telemetry and communications between ground control and the spacecraft in real time–which is to say that as of this moment, the crew will land on the moon in a little over seventy-seven hours and they are two hours away from a course-correction burn. There is a live audio stream of conversations between Houston and the spacecraft, but you can also follow the conversation on Twitter (Capcom or the capsule).

NASA also has a photo gallery; some of these photos appear in a photo essay in this week’s Big Picture column in the Boston Globe. My favorite is above.

For the sordid story of how NASA for several years lost the original footage of the moon landing and then found it in time to have it restored, see this Associated Press story. When asked whether restoring the original tapes would contribute to conspiracy theories about the landing, the president of Lowry Digital (the company doing the restoration) said “if there had been a conspiracy to fake a moon landing, NASA surely would have created higher-quality film.” The argument from bureaucracy is almost always the conclusive one.

You can also watch a digital facsimile of the original news coverage of the landing at kottke.org, starting at about 4:10 EDT on Monday July 20th. Kottke also has a much more comprehensive list of resources than this meager post.
Tisch Library has a variety of things which might be of interest:

eBay and the Economics of Looting

May 8th, 2009 by Chris Strauber

Charles Stanish of UCLA, writing for archaeology.org, has a fascinating description of eBay’s surprisingly positive effect on the black market for antiquities. It turns out that eBay, rather than making it easier to run a black market for real antiquities, has instead helped flood the market with reasonably high quality yet inexpensive fakes. By the operation of Gresham’s Law, the fakes are destroying the market for the really expensive actual antiquities…which are expensive because they are risky and expensive to acquire and sell.

Tisch owns several of Stanish’s works on Peruvian archaeology.

Sources for Audio Books

April 27th, 2009 by Chris Strauber

One question I get routinely and frequently toward the end of a semester is what the library offers in the way of audiobooks. We have about 500 titles here, but I’m also happy to recommend outside websites I’ve found to be good for this sort of thing.

Tisch Library’s Media Center has an assortment of audio books and recordings which might be of interest. Complete list of spoken word titles, in alphabetical order. About 500 titles.

Librivox provides volunteer recordings (often of very high quality) of works in the public domain. For practical purposes this means “published prior to 1923”. This can be very good for classic works like Thucydides’ History of the Peloponnesian War (to which your author contributed a couple of chapters), or Shakespeare’s plays and poetry, or most of Dickens’ novels, or several versions of Jane Austen’s major works. Search the Librivox catalog. About 2000 titles.

The Internet Archive’s audio archive contains a random and wonderful assortment of music, audiobooks and poetry, old time radio shows, Grateful Dead recordings, recorded sermons and religious teaching, even a few recordings from wax cylinders and 78s, and a variety of ancient and modern philosophical works and lectures. Thousands and thousands of titles.

The Future of Newspapers

March 30th, 2009 by Chris Strauber

The economic downturn seems to be accelerating conversations about the future of the newspaper business. Recent announcements that the Christian Science Monitor and Seattle Post-Intelligencer would no longer produce print editions, but would instead be online-only publications, have accompanied news that other newspapers will close altogether. The Rocky Mountain News simply ceased publication, as have many smaller newspapers across the country.

The implications of this have spawned a series of conversations on the web. Here’s a small sampling of my favorite additions to the standard theme, which tends to run: “The death of newspapers will be the death of an important part of our democracy and civilization.”

Clay Shirky argues quite forcefully in Thinking The Unthinkable that a business based on the expectation that the means of publishing and distribution are expensive and scarce cannot survive when, with the Internet, neither of those things is true. He suggests distinguishing between journalism, which is essential, and newspapers, which are merely one means to that end.

Caveat Lector points out that libraries face similar questions as they manage a transition to a future which is already here–and points to Tom Scheinfeldt at Found History, who suggests that conversations among humanists about whether technology is relevant to teaching and research are disturbingly similar to the divide between pragmatists and realists that Shirky describes.

Newspapers are an important part of our common culture, but from the perspective of a library which manages dozens of formats, I am less concerned about the form information takes than I am about the content. My last library had music recorded on wax cylinders. The National Archives has a working version of the machine used to record the Nixon tapes. Every library in the country has VHS tapes which increasingly few patrons have the equipment to handle. Formats change. Newspapers have a long history, but ultimately they are just a way of displaying and distributing a particular kind of information. Hang on, we’ll all work this out together.

Mapping Mutual Incomprehension

March 18th, 2009 by Chris Strauber

Classicists love it when people say “It’s all Greek to me” at parties. Really.

The blog Strange Maps has a diagram showing which languages speakers of various other languages consider gibberish. For the French, Javanese. For Croatians, Spanish.