Getting to open data for Classical Greek and Latin: breaking old habits and undoing the damage — a call for comment!

Gregory Crane
Professor of Classics and Winnick Family Chair of Technology and Entrepreneurship
Tufts University

Alexander von Humboldt Professor of Digital Humanities
Open Access Officer
University of Leipzig

March 4, 2015

Philologists must for at least two reasons open up the textual data upon which they base their work. First, researchers need to be able to download, modify and redistribute their textual data if they are to fully exploit both new methods that center around algorithmic analysis (e.g., corpus linguistics, computational linguistics, text mining, and various applications of machine learning) and new scholarly products and practices that computational methods enable (e.g., on-going and decentralized production of micro-publications by scholars from around the world, as well as scalable evaluation systems to facilitate contributions from, and learning by, citizen scientists). In some cases, issues of privacy may come into play (e.g., where we study Greek and Latin data produced by our students) but our textual editions of, and associated annotations on, long-dead authors do not fall into this category. Second, open data is essential if researchers working with historical languages such as Classical Greek and Latin are to realize either their obligation to conduct the most effective (as well as transparent) research or their obligation to advance the role that those languages can play in the intellectual life of society as a whole. It is not enough to make our 100 EUR monographs available under an Open Access license. We must also make as accessible as possible the primary sources upon which those monographs depend.

This blog post addresses two barriers that prevent students of historical languages such as Classical Greek and Latin from shifting to a fully open intellectual ecosystem: (1) the practice of giving control of scholarly work to commercial entities that then use their monopoly rights to generate revenue and (2) the legacy rights over critical editions that scholars have already handed over to commercial entities. The field has the rights, the skills, and the labor so that it can immediately and permanently address the first challenge. The second challenge is much less tractable. We may never be able to place recent work in a form where it can fully support new scholarship. That form involves both the rights that restrict its distribution and, often, the digital format in which textual editions have been produced (e.g., where editors used word processing files rather than best practices such as well-implemented Text Encoding Initiative XML markup). Both the rights and the format together make it unlikely that we will be able in the immediate future (if ever) to make recent critical editions fully available (under a CC-BY-SA license, with TEI XML markup representing the logical structure of both the reconstructed text and the textual notes). The question before us is to determine how much we can in the immediate future recover for the full range of scholarly use and public discourse.

First, the decision to stop handing over ownership of new textual data (and especially any textual data produced with any significant measure of public funding) is, in 2015, a purely political one. There is no practical reason not to make this change immediately. If it takes editors an extra six months or a year (and it should not) because they need to learn how to produce a digital edition, the delay is insignificant in comparison to the damage that scholars suffer when they hand over control of the reconstructed text for 25 years and of the textual notes, introduction and other materials for 70 years after their death.

The Text Encoding Initiative began publishing interoperable methods for machine actionable digital editions in the late 1980s (Historical Editing was already a topic at the 1987 Poughkeepsie Planning meeting that laid the foundations for the TEI: http://www.tei-c.org/Vault/ED/edp01.htm). Students of Classical Greek and Latin, the largest community of historical philologists, already have all the resources in expertise and infrastructure with which to conduct this shift immediately. The second problem is recovering, insofar as possible, textual data that researchers have already given over to commercial interests which, in turn, exploit monopoly ownership to generate revenue. How many textual decisions in this commercial zone do we need to reference within the open data upon which we base our analysis of Greek and Latin and the cultures that these languages directly influenced? This blog post proposes a two-fold strategy: (1) beginning a series of openly licensed (CC-BY-SA) textual commentaries that are aligned to openly licensed editions and to which members of the community can suggest inclusion of important new editorial choices or conjectures only available in editions controlled by commercial interests; (2) identifying, if absolutely necessary, a small list of editions that commercial entities control but that are of such compelling importance that funding should be solicited to buy the rights to digitize, mark up with TEI XML, and distribute their contents.

Many traditional scholars may argue that we should preserve the present system (1) because only specialists in Greek and Latin philology need access to new editions and (2) because students of Greek and Latin have no need of the computational methods that require open data for their full expression as instruments of scholarship. Scholars are free to argue that the primary goal of humanities research is to enable specialist publication along small, effectively closed networks of intellectual exchange, that the results of our work on Greek and Latin do not really have enough broader impact to warrant worrying about open access and open data, that the study of historical languages does not require that researchers have the ability to download, analyze, modify and redistribute textual data, and that publicly funded scholarship is not ultimately answerable to the public which provides that funding.

From a pragmatic point of view, such arguments would be problematic for anyone who wishes to replace retiring faculty in Greek or Latin, to attract the most ambitious minds to the study of these languages or to justify research support for the study of Greek and Latin from any private foundation or governmental agency that could invest its research support elsewhere. There is never enough money to support all the research that would advance human understanding, much less so-called STEM disciplines (science, technology, engineering and mathematics — the corresponding German acronym is MINT) that materially advance the economic prosperity and biological health of society. But the privilege of academic freedom and the right of free expression that we enjoy in nations such as Germany and the United States exist so that we can follow our principles and add our opinions to public debate.

There are two fundamental reasons for scholars to make openly useful both their conclusions (open access publications) and the data upon which those conclusions depend.

The first bears most directly upon those of us who receive most, if not all, of our salary and research support either from public money or from private foundations that require us to make our results available under an open license. There is our obligation as humanists to advance the intellectual life of humanity. Of course, in 2015, this point of view is finding its way into regulations of government research funding in various countries while private foundations increasingly insist that the results from work that they fund be published under an open license. Ironically, the smallest and the largest disciplines seem to have adapted most rapidly to this much more open model of research. Students of Greek papyrology, for example, have already made the transition to open data and on-going, decentralized editing. Those who feel that commercial entities provide the only channel by which to publish Greek and Latin textual editions need first to understand fully the infrastructure to which the papyrologists already have access (http://papyri.info/). In fact, the services at http://papyri.info/ go beyond what editors need if they wish to create individual, single-authored, static editions. For editors of Latin editions, help is on the way from the Digital Latin Library Project. If editors wish to work on their own to create editions of Greek and Latin texts, they should buy a TEI-aware XML editor and learn how to produce a modern edition. Anyone smart enough to edit an edition of Greek and Latin is smart enough to understand the necessary TEI XML (or EpiDoc subset of TEI XML: epidoc.sourceforge.net/). My colleagues at the Humboldt Chair of Digital Humanities are also there to do what we can to help.

Second, there is the scholarly need for open data. This need is not new. More than a decade ago, pioneering philologists badgered me to release the textual data that we had accumulated at Perseus. Licenses for private use were not enough. They argued tirelessly that they needed, as part of their fundamental research, the right to analyze, modify, and then redistribute some or all of those texts in their altered form. After dragging my feet for years, I finally began to open up the TEI XML source for Perseus texts. The initial release of the TEI XML Greek and Latin texts under a CC-BY-SA-NC license (now simplified to a CC-BY-SA license) took place in March 2006, almost a decade ago. The Classicists who demanded that open data — Chris Blackwell, Gabby Bodard, Helma Dik, Tom Elliott, Sebastian Heath, Ross Scaife, and Neel Smith, among others — were pioneers and earned for themselves by their visionary work a permanent place in the history of Greco-Roman studies. In 2015, we are beyond the vision thing. We Greek and Latin Philologists are playing catch-up as a field as we struggle to integrate into our work the best methods available for analyzing textual data.

We have gone beyond the point where we can any longer reasonably argue that computational methods are unimportant, or even optional, instruments within Greek and Latin philology as a whole. Not every professional student of Greek and Latin will master the foundational new methods already available to us from fields such as corpus linguistics, computational linguistics, text mining and various applications of machine learning. But those who do master the results of such new fields will play a crucial role in determining what all students of Greek and Latin at all levels will be able to do in their personal learning and published research. Open textual data is a foundational need for modern scholarship. The question before us is how to free ourselves from our dependence upon closed data and to establish a comprehensive, open, extensible textual space for the study of Greek and Latin. It is time to return, yet again, ad fontes — back to the sources.

It is not difficult to see how the field of Greek and Latin can, and will, shift so that new textual editions appear in proper TEI XML under an open license (ideally CC-BY-SA). For commercial (and especially for for-profit) companies, the shift to an open publication model simply reflects a shift in business models and the most profitable presses have already begun to build new (and reportedly quite profitable) open access tracks. Of course, the editors of Greek and Latin as a whole are perfectly capable of providing the editorial support for each other: the ability to write is a selling point of liberal arts degrees and professors of Greek and Latin would be ill-advised to argue that they needed professional editors in the same way as their colleagues in Computer Science or Physics. We can also build publishing workflows that simplify the use of TEI XML (such as the Leiden plus front end that papyrologists have been using for years). But such a streamlined system is a convenience, not a necessity.

The real problem is, of course, one of academic politics. Many faculty believe that they need to publish their work under an established corporate brand name if they are to receive formal academic credit. In some institutions, this belief may even be true, but I think that many faculty would find that their administrations were not only supportive but relieved to see their humanities faculty taking a stand on behalf of open access and open data, especially where faculty are public servants and/or their universities have strong policies in support of Open Access and open data.

I am confident that the administrations at Tufts University (where I am in the department of Classics) and at Leipzig, for example, (where I am the Open Access officer) would enthusiastically work with any department that wanted to establish a framework for fairly assessing an edition that was published under a CC-BY-SA license. If anything, editors at these institutions would have a chance to earn even more prestige by taking an (apparent) risk to advance the role of Greek and Latin in the intellectual life of society beyond specialist researchers and to enable Greek and Latin philology to exploit evolving new forms of research based on progress in various computational fields. When senior faculty with permanent positions hand over their work to corporate entities, the situation is much more problematic. Certainly, as a senior professor who is not subject to existential pressures that junior scholars may feel, I don’t see how I can justify handing my work over to commercial entities. I feel that I have an obligation to help the next generation have the freedom to keep the results of their work open and available both to the intellectual life of society as a whole and to the most advanced analytical methods available to researchers.

But even when our field does the right thing for scholarship and society (and I would be disingenuous if I put it any other way), we face the consequences of our past actions. Commercial interests now control a substantial amount of the work that we have done, whether or not we did that work with public money or even if we may have ignored clear conditions on research funding that the results needed to be available under an open access license. (A review of funding decisions at various agencies may reveal a systematic pattern where domain experts voted to fund research projects that they knew would be handed over to commercial interests even when the regulations governing that funding prioritized, even where they did not explicitly mandate, publishing research results under an open license).

I was fortunate in that I began my own work developing corpora after legal issues began to emerge from the first efforts at sharing digital corpora. When humanists first began developing textual databases in the 1970s and 1980s, scholars had little understanding of copyright law (which, one could argue, really means that copyright law often does not reflect scholarly standards). Many assumed that the reconstructed texts in Classical Greek and Latin critical editions are in the public domain. The fact that a preponderance of experts in the field made this decision — in fact, operated under this assumption — provides evidence about what copyright law should dictate. In fact, explicit legislation does enable editors in some countries to exercise monopoly control over reconstructed texts for a period of time. I don’t know any editors who personally use that right to restrict access to their work — all the editors I know want their work to circulate as widely as possible. But editors sign contracts that give commercial publishers exclusive rights to their work. These publishers have lawyers and, if the perceived loss justifies the investment in legal fees, they can sue individual scholars. Even when textual data is in the public domain, commercial vendors (whether belonging to a for-profit corporation or a non-profit university) can (and often will) sue those who redistribute that public domain data on the basis of contract law. We work hard to make sure that we respect both copyright and contract law.

Given sufficient funding, the following categories of data can be digitized and made available as open data under the kind of CC license upon which modern philology must depend:

Reconstructed texts: Reconstructed texts constitute the running text as reconstructed in an edition without accompanying textual notes, modern language translations, introduction, etc. We can use scientific editions from Germany that were published 25 or more years ago (thus, in early 2015 we can use scientific editions published through the beginning of 1990). The EU has passed a regulation allowing its member nations to exert such copyright for up to 30 years but Germany has not taken advantage of this EU opportunity nor has any other major producer of Greek and Latin texts. For pragmatic purposes, we will initially assume that every other nation but Germany (where open access and open data have strong public and political support) is liable to enact such a law. We will thus focus in 2015 on digitizing European editions outside of Germany published through 1985, in 2016 through 1986 etc. Here the goal is to have as many TEI XML transcriptions as possible, to help researchers visualize the degree to which different editions differ, and to let them compare different editions.

Textual notes: The argument has been made that the textual notes are not part of the reconstructed text and constitute a separate copyrightable work. Insofar as textual notes are a scholarly activity, they should aspire to be an annotated database and thus should be receiving only 15 years of protection under EU database regulations (http://ec.europa.eu/internal_market/copyright/prot-databases/). The argument has also been made that the textual notes not only do not belong to a scientific edition but also constitute another form of creative expression and that commercial publishers should be able to monopolize them for the life of the editor plus 70 years. We will, for now, focus on mining textual notes from editions where the editor died 70 or more years ago. In practice, that means that we are working with the apparatus criticus of editions published in the 1920s and 1930s. Here our goal is to have a maximally clean searchable text but not to add substantive TEI XML markup that captures the structure of the textual notes, since the structure of these notes tends to be complicated and inconsistent. Our pragmatic goal is to support “image front searching,” so that scholars can find words in the textual notes and then see the original page images.
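The two working rules above can be sketched in a few lines of Python. This is only an illustration of the cutoffs as stated here (25 years for reconstructed texts from Germany, a cautious 30 years elsewhere, and the editor's death plus 70 years for textual notes); the class, the function names and the "DE" country code are hypothetical, and the sketch deliberately ignores finer points such as whether a term runs to the end of a calendar year.

```python
from dataclasses import dataclass
from typing import Optional

# Working assumptions taken from the two categories above (a sketch, not a legal rule).
GERMAN_TEXT_TERM = 25        # years of protection for a reconstructed text in Germany
OTHER_TEXT_TERM = 30         # cautious assumption for every other country
NOTES_TERM_AFTER_DEATH = 70  # textual notes: life of the editor plus 70 years


@dataclass
class Edition:
    title: str                               # hypothetical label for the edition
    country: str                             # "DE" for Germany, anything else otherwise
    year_published: int
    editor_death_year: Optional[int] = None  # None if the editor is living or unknown


def text_is_usable(edition: Edition, current_year: int) -> bool:
    """Reconstructed text: usable once the assumed term has elapsed."""
    term = GERMAN_TEXT_TERM if edition.country == "DE" else OTHER_TEXT_TERM
    return edition.year_published <= current_year - term


def notes_are_usable(edition: Edition, current_year: int) -> bool:
    """Textual notes: usable only if the editor died 70 or more years ago."""
    if edition.editor_death_year is None:
        return False
    return edition.editor_death_year <= current_year - NOTES_TERM_AFTER_DEATH
```

Under these assumptions, a German edition from 1990 passes the first check in 2015, while an edition from elsewhere in Europe published in 1986 passes only from 2016 onward. Changing the constants is enough to model a different legal assumption.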

Given the legal constraints outlined above and assuming that we had the resources to create machine actionable versions of all publicly accessible textual data, what is the best way of representing the data commercial licenses restrict?

Strategy one: Support advanced graduate students and a handful of supervisory faculty to go through reviews of recent editions, identifying those editorial decisions that were deemed most significant. The output of this work would be an initial CC-BY-SA series of machine-actionable commentaries that could automatically flag all passages in the CC-BY-SA editions where copyrighted editions made significant decisions. In effect, we would be creating a new textual review series. Because the textual commentaries would be open and available under a CC-BY-SA, members of the community could suggest additions to them or create new expanded versions or create completely new, but interoperable, textual commentaries that could be linked to the CC-BY-SA texts.
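One way to picture such a machine-actionable commentary is as a list of small records, each tied to a citation in an open edition. The Python sketch below is an assumption about what the minimum might look like, not a description of any existing system: the record fields, the citation strings such as "1.2", and the function names are all hypothetical, and the readings are placeholders rather than real variants.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TextualNote:
    passage: str           # citation into the open edition (hypothetical scheme, e.g. "1.2")
    open_reading: str      # what the CC-BY-SA edition prints
    proposed_reading: str  # the decision reported from a restricted edition
    source: str            # pointer to the review or edition that reports it
    contributor: str       # who added the note to the commentary


def index_notes(notes):
    """Group notes by the passage they are aligned to."""
    by_passage = defaultdict(list)
    for note in notes:
        by_passage[note.passage].append(note)
    return dict(by_passage)


def flag_passages(edition_passages, notes):
    """Return (passage, notes) pairs, in edition order, for every flagged passage."""
    index = index_notes(notes)
    return [(p, index[p]) for p in edition_passages if p in index]


# Placeholder data: three passages of an open edition, one note from a contributor.
passages = ["1.1", "1.2", "1.3"]
notes = [TextualNote("1.2", "reading-A", "reading-B", "review-X", "student-1")]
flagged = flag_passages(passages, notes)
```

Because each note is an independent record, a second commentary built by other members of the community could be merged simply by concatenating the two lists before flagging.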

Here the goal is to create an initial set of data about textual decisions in copyrighted editions and a framework that members of the community can extend. If members of the community feel that important textual data should be made available, then they can make it available. If no one feels that it is important to make the data available, then the data is, by definition, not that important. The plan is to create a self-regulating environment. An open framework can evolve as members of the community wish. In this plan, we start a light-weight, easily expanded and duplicated process that others can copy.

We can summarize this as a Darwinian strategy. We may have to take a step back and lose some more recent textual data to open up the overall corpus, but the lost textual data is not, itself, subject to copyright (copyright protects original expression). The hypothesis is that an open field will outperform a closed field: the open field will replace what it considers to be lost textual data and ultimately (perhaps very quickly) outperform the closed system.

This strategy has at least two advantages. First, if funding were secured, that funding could help rising Greek and Latin philologists perform the task of creating the initial textual commentaries, thus immersing a new generation in the basic methods of representing textual data in a machine actionable form (and giving them a position where they have an opportunity to learn quite a bit of Greek and/or Latin). Second, we do not need to create a comprehensive set of textual commentaries. We need to create a critical mass that demonstrates the utility of such commentaries.

Strategy two: How many editions that are owned by commercial entities are so crucial to the mainstream study of Greek and Latin that it is worth trying to negotiate the rights and expend the time/money to produce CC-licensed TEI XML versions? The upper bound for such a purchase might be the cost of paying for production of a new open access book (up to 10.000 British pounds). Since commercial publishers have published several hundred editions in the last 25 or 30 years, paying for the rights for all recent editions would cost millions of euros and is clearly not a reasonable option. If publishers do not offer reasonable terms and the new editions are of critical importance, then members of the community will simply have to create new editions that integrate the most valuable findings from the restricted editions — that is, after all, the sort of thing that we are paid to do. But it might be possible to justify purchasing the rights to a few.

What editions might warrant such special treatment and why?

Conversely, how worthwhile is it for us to worry about editions published after c. 1985? Would it be better to focus on providing comprehensive coverage of editions through 1985 with the assumption that if the recent data is sufficiently important, then we can let members of the community fill in the gaps?

Ironically, I think that the best way to liberate textual data from corporate control is to demonstrate that life will go on without it and thus to destroy its value as a revenue-generating asset. We can use the reconstructed texts from Germany through 1990 and from the rest of Europe at least through 1985. While much has been done since then and it would be a shame if we could not immediately use it in our analysis of the ancient world, I became a professor in 1985 and I do not think that the quality of the textual editions available to us was a major limiting factor on the quality of our research at the time. We can start the process of identifying significant textual decisions in copyrighted editions. Where editors have produced radically new editions, we can try to secure the rights but the best way to free commercially controlled texts is to move forward with what we have.

Members of the community are, of course, free to make a case that research funding from private and public sources should be used to subsidize commercial services or even websites that provide free services but do not make their data available. Those who feel this way should make the case as fully as possible. I have heard the argument that we must under no circumstances go backwards and lose access to the most up-to-date texts but, unfortunately, we have already lost control over that access and have done so for years after it was possible that we could do otherwise (the Text Encoding Initiative was documenting methods for machine actionable editions in the late 1980s) and after generalized models for open licenses had appeared (CreativeCommons.org released its first licenses in 2002). We could have acted differently a decade ago and we have, for the most part, not chosen to produce editions that are modern in format and accessible to a global audience. If we think that specialists at well-funded academic institutions alone need access to the best textual data, we should express that position clearly so that the federally funded agencies and private foundations know where we stand.

I don’t see an easy solution for rescuing data that we have given to commercial organizations but we should hear the arguments and proposals — and then act. Business as usual simply digs us into a deeper hole. Even if some of us may disagree with the case as a whole, a well-articulated case for sticking with privatized textual data may more clearly articulate issues that we need to address in shifting to an open philology.

Please send your suggestions to crane@informatik.uni-leipzig.de — or, better still, send a link to a public version of your thoughts. I will summarize initial suggestions in a subsequent blog post in May 2015.

Perseus Digital Library Updates

News and announcements from the Perseus Digital Library
