Where did holders of German Chairs in Greek, Latin, Ancient History and Classical Archaeology get their PhDs?

Gregory Crane
June 28, 2015

As a by-product of a larger study of Greco-Roman studies in Germany and the United States, I have published some figures on where faculty from ranked PhD-granting departments got their PhDs.

For the data, click here.


Update: Where did faculty in US Classics Departments with top-ranked graduate programs get their own PhDs?

Gregory Crane
Leipzig and Tufts Universities

Maryam Foradi
Leipzig University

June 25, 2015

[For the earlier version, see: http://tinyurl.com/pg7lloc]

On June 11, we published a discussion of the programs from which the top ten departments in a particular ranking of Classics PhD programs chose their faculty. We here publish an analysis based on all thirty-one departments in that ranking. We have sorted the results based upon numbers of assistant professors — this reflects activity over the past decade or so (if we consider both the PhD training itself and the period that assistant professors have served).

The statistics do change a bit. Four PhD programs — all in the US — now stand out, each having placed 6 or 7 of 65 assistant professors in these 31 programs and, altogether, accounting for 40% (26 out of 65) of all the assistant professors in this group. A second cluster of 11 programs (including Oxford and Cambridge) placed 2 or 3 of their PhDs in these departments. A third cluster of 12 programs (8 of them situated outside the US) placed one of their PhDs in these departments.

For the full update, check here.


Where did faculty in US Classics Departments with top-ranked graduate programs get their own PhDs?

Gregory Crane
Leipzig and Tufts Universities

Maryam Foradi
Leipzig University

June 11, 2015

Abstract: As part of another study, we analyzed websites for the ten US graduate programs in Classics most highly rated at http://www.phds.org/rankings/classics. These ten departments represent 3.6% of the Classics Departments in the US, but the 159 Assistant, Associate and full Professors that we identified would (if we count all Assistant Professors and Associate Professors as tenured or tenure-track) account for 11.3% of the 1,410 tenured and tenure-track faculty in Classical Studies Programs identified in a 2014 American Academy of Arts and Sciences study. Roughly half of all Assistant and Associate Professors from these ten departments (13 of 26 and 17 of 36, respectively) come from five institutions, only three of which (Harvard, Berkeley and Princeton) appear in the US rankings; the other two consistent members of the top five, Cambridge and Oxford, do not, of course, appear on the US list. For full professors, the bias towards the top five was even more noticeable, with nearly 60% (58 of 97) of full Professors coming from this set of five institutions. Of the Assistant Professors (i.e., junior faculty hired in the past six years), just over a third, 9 out of 26, had PhDs from Harvard (5) or Berkeley (4). Whether this pattern will hold true in coming years is unclear, given the changing nature of United States higher education. Programs would do well to pay close attention to the ways in which the new graduate program at the Institute for the Study of the Ancient World has begun developing a broader view of the ancient world and including new digital methodologies in the research that it supports (http://isaw.nyu.edu/).

Details here.


The Big Humanities, National Identity and the Digital Humanities in Germany

Gregory Crane
June 8, 2015

Alexander von Humboldt Professor of Digital Humanities
Universität Leipzig (Germany)

Professor of Classics
Winnick Family Chair of Technology and Entrepreneurship
Tufts University (USA)


Alexander von Humboldt Professors are formally and explicitly “expected to contribute to enhancing Germany’s sustained international competitiveness as a research location”. And it is as an Alexander von Humboldt Professor of Digital Humanities that I am writing this essay. Two busy years of residence in Germany have allowed me to make at least some preliminary observations, but most of my colleagues in Germany have spent their entire careers here, often in fields where they have grown up with their colleagues around the country. I offer initial reflections rather than conclusions and write in order to initiate, rather than to finish, discussions about how the Digital Humanities in Germany can be as attractive outside of Germany as possible. The big problem that I see is the tension between the aspiration to attract more international research talent to Germany and the necessary and proper task of educating the students in any given nation in at least one of their national languages, as well as in their national literatures and histories. The Big Humanities, that is, German language, literature and history, drive Digital Humanities in Germany (as they do in the US and every other country with which I am familiar).

In my experience, however, the best way to draw new talent into Germany is to develop research teams that run in English and capitalize on a global investment in the use of English as an academic language — our short term experience bears out the larger pattern, in which a large percentage of the students who come to study in Germany enjoy their stay, develop competence in the language and stay in Germany. Big Humanities in Germany, however, bring with them the assumption that work will be done in German and have a corresponding — and entirely appropriate — national and hence inwardly directed focus.

But if it makes sense to have a German Digital Humanities, that also means that Germany may have its own national infrastructure to which only German-speaking developers may contribute. Some 77% of the Arts and Humanities publications in Elsevier’s Scopus publication database are in English, very few developers outside of the German-speaking world learn German, and the Big Humanities in the English-speaking world tend to cite French as their second language (only 0.3% of the citations of the US Publications of the Modern Language Association pointed to German, while the Transactions of the American Philological Association, with 10% of its citations pointing to German, made the most use of German scholarship).

The best way to have a sustainable digital infrastructure is to have as many stakeholders as possible and, ideally, to be agile enough to draw on funding support from different sources, including (especially including) international sources of funding. We also need to decide what intellectual impact we wish German investments in Digital Humanities to have outside of the German-speaking world, and to address the related question of how the Digital Humanities can expand the role that German language, literature and culture play beyond the German-speaking world.

Details and the full text are available here.


And the News for Greek and Latin in France is not good either

Gregory Crane
Comments to gcrane2008@gmail.com
May 2015

Just as I had finished off a blog post about bad news on enrollments for Greek and Latin in the US (and Germany), I saw a story on Al Jazeera about big cuts being planned for Latin and Ancient Greek in France. BBC News reports that “the government wants to reduce teaching of Latin and ancient Greek, scrap an intensive language scheme and change the history curriculum.”

The BBC reports that the plan to reduce the teaching of Ancient Greek and Latin in France has been among the most disputed proposals.

[Image drawn from the BBC story: http://www.bbc.com/news/world-europe-32792564]

There does seem to have been some good news on this, with Greek and Latin reemerging in at least one proposal, but the fact that Greek and Latin are so vulnerable is the issue — if not now, when will they be hit?

I don’t know the details of what is happening in France (and I would welcome pointers to blog coverage) but, whatever the details, I don’t see how “business-as-usual” is going to help us. The time for change was ten years ago. Let’s not go down without a fight — but a fight must mean fighting to use the new tools at our disposal to reimagine and redesign what our students — and what society as a whole — can get from the study of Ancient Greek and Latin.


Bad News for Latin in the US, worse for Greek

A note on Modern Language Association’s Enrollments in Languages Other than English in United States Institutions for Higher Education: Fall 2013 (released, February 2015)
Gregory Crane
Comments to gcrane2008@gmail.com
May 2015

Summary: According to statistics published by the Modern Language Association in February 2015, between fall 2009 and fall 2013, enrollments in Ancient Greek and Latin at US postsecondary institutions suffered their worst decline since 1968, the earliest year for which the MLA offers such statistics. The number of enrollments in Greek and Latin declined from 52,484 to 40,109, a drop of 24%. This precipitous and rapid decline may reflect the lingering aftershocks of the financial crisis of 2008, which certainly raised student anxiety levels and may have driven students away from intellectually idealistic activities such as the study of Ancient Greek and Latin. The study of Greek and Latin in the United States weathered a tremendous challenge between 1960 and 1976, when secondary school enrollments in Latin declined from a steady state of c. 7.5% of all secondary school students to c. 1 – 1.5%. The opening years of the twenty-first century saw an overall surge in Greek and Latin enrollments (from 42,000 in 1998 to 55,000 in 2006) but the current total of c. 40,000 is the lowest since the MLA began providing data in 1968. Even if enrollments prove to have recovered in 2014 and 2015, no supporter of these languages — whether we are professionalized faculty members or not — should assume that the downturn is temporary and that we have developed a model for survival in what has always been, and always will be, the Darwinian space of human intellectual life. If we weathered the challenges of the late 20th century, we now must face the challenges and seize the opportunities of the early twenty-first century.

The full text of the discussion is here.

The Modern Language Association’s report on “Enrollments in Languages Other Than English in United States Institutions of Higher Education, Fall 2013” is here.


The road to Perseus 5 – why we need infrastructure for the digital humanities

Bridget Almas
Tufts University

For the last few years we have been laboring to bring the Perseus Digital Library into the next generation of digital environments, freeing the data it offers for others to easily use, reuse, and improve upon, while continuing to offer highly curated, contextualized and optimized views of the data for the public, students and researchers. This has proven challenging in an environment where standards for open data are still evolving, the infrastructure to support it is still nascent or non-existent, and funding to improve and sustain pre-existing solutions is hard to come by. We have been tackling it bit by bit, and have been making real progress, but still have far to go.

Today’s vision of the ultimate solution might look something like this:


In this vision, Perseus 5 is an up-to-date version of the current Perseus 4 interface, which still offers all the texts and related analytic and search services across highly curated collections (and readily supports this on a variety of different screen sizes and devices) but also:

  • provides the ability for users to annotate and add their own contextual information
  • can easily incorporate data as it comes off the OCR pipeline
  • seamlessly incorporates data from other open access platforms
  • in return easily makes its own data available to these platforms (both inside and outside the Perseus ecosystem)
  • archives all data used by and in Perseus for the long term in institutional repositories

Making this happen requires thinking of each and every item in the Perseus Digital Library as a distinct, addressable, shareable and preservable piece of data. This includes:

  • Primary source texts and their translations
  • Secondary source texts and reference works
  • Bibliographic metadata
  • Lexical entities
  • Lexical tokens
  • Person entities
  • Place entities
  • Dates
  • Images
  • Artifacts
  • Linguistic and Textual Analyses
  • Assertions of relationships between any of the above data types
  • Assertions of occurrences of any of the above data types as a fragment of or region of interest in another type
  • Collections of like and disparate groupings of any of the above data types

At any given point in time, any data object in the library may be in a different stage of its digitization or curation lifecycle. We want to make what we have available as soon as it can be put online, and offer progressive improvements to the data as they become available, so the targets and scope of our publications and citations are constantly changing. We want to represent this data in a way that allows us to incorporate the millennia of accumulated data and scholarship on classical texts and languages seamlessly with the newly generated representations of today. And we at Perseus do not want to be the only stewards of this data: we want our institutional repositories to help us preserve it for the generations to come.

To do this scalably, we need an infrastructure which offers us general purpose solutions for the things that are common about each of the data types, while at the same time giving us the flexibility to treat each type of data as distinct when the need arises. For example, when dealing with a citation of a passage in a text, we need a solution that understands canonical citation schemes (Hom. Il. 1.1) and how to translate those into a string of lexical tokens from a specific version of the cited text. And when dealing with a citation of a region of interest on an image, we need a solution that can translate x and y coordinates into a box or polygon on an image itself. But we would also like the systems that manage the persistent identifiers for our data, and those that retrieve the metadata and objects associated with those identifiers, to be general enough to apply to all our objects, regardless of type. In this way, we don’t have to constantly reinvent core common functionality for each data type. And we would like the interfaces to such systems to be consistent not only within Perseus itself, but also across the ecosystem of data providers and consumers in the wider world of classical and modern texts, as well as linguistic and humanities data, so that we can share and interoperate.
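The image case mentioned above can be made concrete with a small sketch. It assumes, purely for illustration, that a region of interest is stored as fractions of the image's width and height, so that the same citation stays valid at any resolution. The names RegionOfInterest and to_pixel_box are hypothetical and do not come from Perseus or CITE.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RegionOfInterest:
    """A rectangle in fractional image coordinates (an assumed convention)."""
    x: float       # left edge, as a fraction of the image width
    y: float       # top edge, as a fraction of the image height
    width: float   # region width, as a fraction of the image width
    height: float  # region height, as a fraction of the image height


def to_pixel_box(region: RegionOfInterest,
                 image_width: int,
                 image_height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) in pixels for an image of the given size."""
    tolerance = 1e-9  # allow for floating-point rounding when summing fractions
    for value in (region.x, region.y, region.width, region.height):
        if not 0.0 <= value <= 1.0:
            raise ValueError("region values must be fractions between 0 and 1")
    if (region.x + region.width > 1.0 + tolerance
            or region.y + region.height > 1.0 + tolerance):
        raise ValueError("region extends beyond the image")
    left = round(region.x * image_width)
    top = round(region.y * image_height)
    right = round((region.x + region.width) * image_width)
    bottom = round((region.y + region.height) * image_height)
    return left, top, right, bottom


# A region covering the middle half of the width, starting halfway down the image.
box = to_pixel_box(RegionOfInterest(0.25, 0.5, 0.5, 0.25), 800, 600)
print(box)
```

A polygon could be handled the same way by scaling each vertex; the rectangle is only the simplest case.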

The Center for Hellenic Studies’ Homer Multitext project did pioneering work in developing the CITE architecture to define machine-actionable, technology-independent standards for identifying, citing and retrieving texts and text-related data objects. This has given us a solid framework within which to begin addressing some of these needs, especially when it comes to working with canonical texts and citations to them. Implementing the Canonical Text Services (CTS) URN specification component of CITE allows us to produce a semantically meaningful identifier which represents the position of a text in the hierarchy in which it is traditionally cited. This same identifier scheme can also be used to cite into the text at the passage level, within a specific version or instance of that text, or within the notional work the text represents. So, for example, while a traditional reference to Book 1 Line 1 in Homer’s Iliad as cited in literature might be “Hom. Il. 1.1”, this can be represented as urn:cts:greekLit:tlg0012.tlg001:1.1, as a citation to the notional work The Iliad, or as urn:cts:greekLit:tlg0012.tlg001.perseus-grc1:1.1 in the specific ‘perseus-grc1’ edition of this work. (The CTS specification and the Perseus Catalog documentation explain these components more fully, but briefly, the other components of the URN here are a namespace, greekLit; a textgroup identifier, tlg0012, for the group of texts attributed to the author Homer; a work identifier, tlg001, for the work The Iliad; and a passage identifier, 1.1.)
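To make the structure of these identifiers concrete, here is a minimal sketch of a parser for the two example URNs above. It is an illustration only, not the CTS reference implementation: the class CtsUrn and the function parse_cts_urn are hypothetical names, and the sketch handles only the textgroup, work, version and passage components discussed here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CtsUrn:
    """The components of a CTS URN as described in the text (simplified)."""
    namespace: str                  # e.g. greekLit
    textgroup: str                  # e.g. tlg0012 (texts attributed to Homer)
    work: Optional[str] = None      # e.g. tlg001 (the Iliad)
    version: Optional[str] = None   # e.g. perseus-grc1 (a specific edition)
    passage: Optional[str] = None   # e.g. 1.1 (Book 1, Line 1)

    def __str__(self) -> str:
        hierarchy = ".".join(
            part for part in (self.textgroup, self.work, self.version) if part
        )
        urn = f"urn:cts:{self.namespace}:{hierarchy}"
        return f"{urn}:{self.passage}" if self.passage else urn

    def notional_work(self) -> "CtsUrn":
        """Drop the version, so the citation points at the notional work."""
        return CtsUrn(self.namespace, self.textgroup, self.work, None, self.passage)


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a URN such as urn:cts:greekLit:tlg0012.tlg001.perseus-grc1:1.1."""
    parts = urn.split(":", 4)
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace = parts[2]
    hierarchy = parts[3].split(".")
    if not namespace or not hierarchy[0] or len(hierarchy) > 3:
        raise ValueError(f"unsupported CTS URN in this sketch: {urn!r}")
    work = hierarchy[1] if len(hierarchy) > 1 else None
    version = hierarchy[2] if len(hierarchy) > 2 else None
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace, hierarchy[0], work, version, passage)


edition = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc1:1.1")
print(edition.version)           # the edition-level component
print(edition.notional_work())   # the same passage in the notional work
```

Keeping the version as a separate, optional field is what lets one citation be resolved either against a specific edition or against the notional work.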

But as a domain-specific protocol, CITE has also introduced interoperability questions, particularly when we want to leverage more general data management solutions from other domains and to be interoperable with institutional repositories. It is essential that we be able to leverage software and tools for working with our data that come both from within and outside our domain, and that are backed by communities of developers, in order to ensure the long term sustainability of those solutions. We need solutions which allow us to implement the domain-specific advantages of CITE within the context of a broader, more general framework.

The following use cases offer a closer look at a few of our highest priorities.

Persistent, domain-sensitive, identification of a text throughout its lifecycle

We need to be able to apply the aforementioned CTS URN scheme through the entire lifecycle of a text in the digital library, starting at the point at which a digital image of a manuscript exits the OCR process and is available in uncurated hOCR. At this point we should be able to put this text online, have it catalogued and assigned a CTS URN identifier so that it can be citable and reusable as data under this identifier. As the text is further curated, whether by the crowd or by individual scholars or groups of students, and undergoes revision and change on its way to a fully curated TEI XML edition, new versions are created, requiring new version-level identifiers, all of which should be resolvable backwards or forwards to their ancestors or descendants. Annotations and derivative versions and analyses which are made on the early versions should be easily and automatically portable to the newer, improved versions as they come online. Citations which reference fragments of the text should be robust and automatically resolvable across versions. And while perhaps not every distinct version of a text in this lifecycle should be preserved for the long term, certain points will be flagged as requiring an archive copy, for example at the end of a semester after a classroom of students has undertaken collaborative curation as a scholarly exercise. These archive copies, which might normally be assigned handles as persistent identifiers by the institutional repository, nonetheless need to retain a link to the CTS URN-based identity.
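The requirement that version-level identifiers be resolvable backwards and forwards can be sketched as a simple lineage registry. This is only an illustration under stated assumptions: VersionRegistry is a hypothetical class, the in-memory dictionaries stand in for whatever catalogue would hold these links, and the version identifiers ending in example-ocr1, example-tei1 and example-tei2 are invented for the example.

```python
from collections import deque
from typing import Dict, List, Optional


class VersionRegistry:
    """Tracks which version of a text was derived from which (illustrative only)."""

    def __init__(self) -> None:
        self._parent: Dict[str, Optional[str]] = {}
        self._children: Dict[str, List[str]] = {}

    def register(self, version: str, derived_from: Optional[str] = None) -> None:
        """Record a new version and, optionally, the version it revises."""
        if version in self._parent:
            raise ValueError(f"already registered: {version}")
        if derived_from is not None and derived_from not in self._parent:
            raise KeyError(f"unknown parent version: {derived_from}")
        self._parent[version] = derived_from
        self._children[version] = []
        if derived_from is not None:
            self._children[derived_from].append(version)

    def ancestors(self, version: str) -> List[str]:
        """Resolve backwards: nearest ancestor first."""
        result = []
        current = self._parent[version]
        while current is not None:
            result.append(current)
            current = self._parent[current]
        return result

    def descendants(self, version: str) -> List[str]:
        """Resolve forwards: breadth-first over all later versions."""
        result = []
        queue = deque(self._children[version])
        while queue:
            current = queue.popleft()
            result.append(current)
            queue.extend(self._children[current])
        return result


# Hypothetical version identifiers for one text moving from raw OCR to TEI XML.
OCR = "urn:cts:greekLit:tlg0012.tlg001.example-ocr1"
TEI1 = "urn:cts:greekLit:tlg0012.tlg001.example-tei1"
TEI2 = "urn:cts:greekLit:tlg0012.tlg001.example-tei2"

registry = VersionRegistry()
registry.register(OCR)
registry.register(TEI1, derived_from=OCR)
registry.register(TEI2, derived_from=TEI1)

print(registry.ancestors(TEI2))    # earlier versions, nearest first
print(registry.descendants(OCR))   # later versions derived from the OCR output
```

Porting annotations or passage citations across versions would need an alignment step on top of this lineage; the sketch covers only the identifier resolution.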

Concurrent annotation and curation in a distributed architecture

Going hand-in-hand with the need to be able to persistently identify a text and its derivative versions throughout its lifecycle is a need for annotation and curation tools which can operate both independently and together on texts as data, retaining the identity of the original data source(s), capturing and adding to the provenance chain details of any transformation or other operations the tool or its user performed on the text, and returning the improved data immediately back to its source repository for versioning and archiving. The following diagram depicts such a workflow in which a text is identified in a repository, a passage of it is extracted for annotation (in this case treebanking and translation alignment), in the process of annotation corrections to the underlying text are made, the improved text is returned to its source and the annotations are preserved separately:


Thinking about the requirements implied by these use cases, there is a core set that can be applied regardless of which type of data we are talking about. We want actors (be they people or machines) to be able to:

  • assign a persistent identifier to a data object
  • associate descriptive metadata with a data object
  • reference a data object
  • reference a fragment of a data object
  • associate provenance information with a data object
  • aggregate like objects
  • aggregate disparate objects
  • create templates of object types for reuse
  • reference a specific version of an object
  • reference an object before it has been published and have the reference be valid throughout the object’s lifecycle
  • create data objects which reference other data objects
  • reference a data object which comes from an external source
  • update a data object which comes from an external source
  • update a data object which we create
  • assert relationships between data objects
  • reference assertions of relationships between objects
  • preserve data objects
  • perform analyses across sets of data objects
  • produce visualizations of collections of data objects and their relationships
  • reference visualizations of collections of data objects and their relationships
  • preserve visualizations of collections of data objects and their relationships
  • notify consumers when new versions of data objects are available
  • consume updated information about data objects from external sources
  • provide data object metadata in a variety of standard output formats
  • associate users with data objects
  • search and filter data objects by various criteria, including access rights and provenance data
  • identify users
  • authenticate users
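A few of the requirements above (a persistent identifier, descriptive metadata, provenance, and assertions of relationships that are themselves referenceable) can be sketched in a type-agnostic way. This is a rough illustration, not a description of any Perseus system: DataObject and Relationship are hypothetical names, and the random urn:uuid identifiers stand in for whatever persistent identifier scheme an infrastructure would actually assign.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Tuple


def new_pid() -> str:
    """Stand-in for a real persistent identifier service (assumption)."""
    return f"urn:uuid:{uuid.uuid4()}"


@dataclass
class DataObject:
    """Any item in the library: a text, image, place, person, and so on."""
    object_type: str
    metadata: Dict[str, str] = field(default_factory=dict)
    pid: str = field(default_factory=new_pid)
    # Each provenance entry is (timestamp, actor, action).
    provenance: List[Tuple[str, str, str]] = field(default_factory=list)

    def record(self, actor: str, action: str) -> None:
        """Append a provenance entry describing who did what, and when."""
        timestamp = datetime.now(timezone.utc).isoformat()
        self.provenance.append((timestamp, actor, action))


@dataclass(frozen=True)
class Relationship:
    """An assertion linking two objects; it has its own identifier so it can be cited."""
    subject_pid: str
    predicate: str
    object_pid: str
    pid: str = field(default_factory=new_pid)


text = DataObject("text", {"title": "Iliad"})
image = DataObject("image", {"description": "manuscript page"})
text.record(actor="ocr-pipeline", action="created from page image")
link = Relationship(image.pid, "depicts", text.pid)

print(text.pid, len(text.provenance), link.predicate)
```

The point of the sketch is that none of these operations depends on what kind of object is being described, which is the sense in which the core requirements apply across data types.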

This is by no means an exhaustive list, but it’s enough to give us an idea of what we might need when we talk about infrastructure for managing this data, particularly when we look at these requirements in the context of our workflows. And we are not alone in these needs. The Data Fabric Interest Group of the Research Data Alliance (RDA) recently issued a call for use cases for data management across a variety of different research domains in the sciences and humanities. Analysis of these use cases resulted in a position paper identifying core components of a data management infrastructure. These include:

  • Persistent Identifier (PID) Systems
  • Identity Systems for Actors
  • Registry Systems for Trusted Repositories
  • Metadata Systems and Registries
  • Schema Registries
  • Category/Vocabulary Registries
  • Data Type Registries
  • Practical Policy Registries
  • Reusable Policy Modules
  • Distributed Authentication Systems
  • Authorization Record Registries
  • Protocols for Aggregating and Harvesting Metadata
  • Workflow engines and components
  • Conversion/Transformation Tool registries
  • Repository APIs
  • Repository Systems
  • Training on and Documentation of Solutions

There are some things that may be missing from this list as well, particularly around the needs for dealing with collections, referencing data fragments, and annotations on data which is undergoing curation or change, but the point is that the need is real, it transcends domains, and solutions will be developed.

If at Perseus we had access to these solutions today, we could focus on the things that are unique about our data, designing the user interfaces, visualizations, annotation, curation and analytical services that would drive new research, without worrying about building the underlying infrastructure to support the data. But in order to take advantage of the solutions as they are built, we must be part of the discussion about the requirements, push for our use cases to be considered in their design, and take part in testing, implementing and sustaining the solutions.

I see participation in RDA’s interest and working groups, presenting our use cases and helping to build the collective solutions, as a long-range tactic, but we also need to make concrete progress today with those tools and services that are available now and that might be able to become part of a broader digital infrastructure supporting the humanities. With the Perseids project we are building a platform for collaborative editing, annotation and publication from a core of existing tools, services and standards. In the process we are experimenting to see how we can use APIs and data transformations to connect the tools and produce sharable data. It is a messy process at times, but we are beginning to see real results.

Our strategy therefore is to participate at both ends of the spectrum, so that when things meet up in the middle we will have a solution that is sustainable for the future.


Seven reasons why we need an independent Digital Humanities

[Full draft available as a Google Doc (tinyurl)]

Gregory Crane
[DRAFT as of April 28, 2015]

Alexander von Humboldt Professor of Digital Humanities
Department of Computer Science
Leipzig University

Professor of Classics
Winnick Family Chair of Technology and Entrepreneurship
Tufts University


This paper describes two issues, the need for an independent Digital Humanities and the opportunity to rethink within a digital space the ways in which Humanists can contribute to society and redefine the social contract upon which they depend.

The paper opens by articulating seven cognitive challenges that the Humanities can, in some cases only, and in other cases much more effectively, combat insofar as we have an independent Digital Humanities: (1) the assumption that new research will look like research that we would like to do ourselves; (2) the assumption that we should be able to exploit the results of new methods without having to learn much and without rethinking the skills that at least some senior members of our field must have; (3) we focus on the perceived quality of Digital Humanities work rather than the larger forces and processes now in play (which would only demand more and better Digital Humanities work if we do not like what we see); (4) we assume that we have already adapted new digital methods to existing departmental and disciplinary structures and assume that the rate of change over the next thirty years will be similar to, or even slower than, that which we experienced in the past thirty years, rather than recognizing that the next step will be for us to adapt ourselves to exploit the digital space of which we are a part; (5) we may support interdisciplinarity but the Digital Humanities provide a dynamic and critically needed space of encounter between not only established humanistic fields but between the humanities and a new range of fields including, but not limited to, the computer and information sciences (and thus I use the Digital Humanities as a plural noun, rather than a collective singular); (6) we lack the cultures of collaboration and of openness that are increasingly essential for the work of the humanities and that the Digital Humanities have proven much better at fostering; (7) we assert all too often that a handful of specialists alone define what is and is not important rather than understanding that our fields depend upon support from society as a whole and that academic communities operate in a Darwinian space.

The Digital Humanities offer a marginal advantage in this seventh and most critical point because the Digital Humanities (and the funders which support them) have a motivation to think about and articulate what they contribute to society. The question is not whether the professors in the Digital Humanities or traditional departments of Literature and History do scholarship of higher quality. The question is why society supports the study of the Humanities at all and, if so, at what level and in what form. The Digital Humanities are important because they enable all of us in the Humanities to reestablish the social contracts upon which we always must depend for our existence.

The Digital Humanities provide a space in which we can attack the three fundamental constraints that have limited our ability to contribute to the public good: the distribution problem, the library problem, and the comprehension problem. First, all Humanities have the power to solve the distribution problem by insisting upon Open Access (and Open Data) as essential elements of modern publication. Here the Digital Humanities arguably provide a short-term example of leadership because of the greater prevalence of open publication. The second challenge has two components. On the one hand, we need to rethink how we document our publications with the assumption that our readers will, sooner or later, have access to digital libraries of the primary and secondary sources upon which we base our conclusions. At the same time, developing comprehensive digital libraries requires a tremendous amount of work, including fundamental research on document analysis, optical character recognition, and text mining, as well as analysis of the economics and sociology of the Humanities. Third, the comprehension problem challenges us to think about how we can make the sources upon which we base our conclusions intellectually accessible: what happens when people in Indonesia confront a text in Greek or viewers in America view a Farsi sermon from Tehran, artifacts of high art from Europe or of religious significance from Sri Lanka, a Cantata of Bach or music played on an Armenian duduk?

The basic questions that we ask in the Humanities will not change. We will still, as Livy pointed out in the opening to his History of Rome, confront the human record in all its forms, ask how we got from there to where we are now and then where we want to go. And we may still, like Goethe, decide that the best thing about the past is simply how much enthusiasm it can kindle within us. But the speed and creativity with which we answer the distribution, library and comprehension problems determines the degree to which our specialist research can feed outwards into society and serve the public good.

The more we labor to open up our work (even the most specialized work) and to articulate its importance, the better we ourselves understand what we are doing and why. Non-specialists include other professional researchers as well as the general public. We may think that we are giving up, in practice if not in law, something of our perceived (and always only conditional and always short-term) disciplinary autonomy but, in so doing, we win the freedom to serve, each of us according to the possibilities of our individual small subfields within the humanities, the intellectual life of society.

For the full text, see the Google Doc (tinyurl).


Getting to open data for Classical Greek and Latin: breaking old habits and undoing the damage — a call for comment!

Gregory Crane
Professor of Classics and Winnick Family Chair of Technology and Entrepreneurship
Tufts University

Alexander von Humboldt Professor of Digital Humanities
Open Access Officer
University of Leipzig

March 4, 2015

Philologists must for at least two reasons open up the textual data upon which they base their work. First, researchers need to be able to download, modify and redistribute their textual data if they are to fully exploit both new methods that center around algorithmic analysis (e.g., corpus linguistics, computational linguistics, text mining, and various applications of machine learning) and new scholarly products and practices that computational methods enable (e.g., on-going and decentralized production of micro-publications by scholars from around the world, as well as scalable evaluation systems to facilitate contributions from, and learning by, citizen scientists). In some cases, issues of privacy may come into play (e.g., where we study Greek and Latin data produced by our students) but our textual editions of, and associated annotations on, long-dead authors do not fall into this category. Second, open data is essential if researchers working with historical languages such as Classical Greek and Latin are to realize either their obligation to conduct the most effective (as well as transparent) research or their obligation to advance the role that those languages can play in the intellectual life of society as a whole. It is not enough to make our 100 EUR monographs available under an Open Access license. We must also make as accessible as possible the primary sources upon which those monographs depend.

This blog post addresses two barriers that prevent students of historical languages such as Classical Greek and Latin from shifting to a fully open intellectual ecosystem: (1) the practice of giving control of scholarly work to commercial entities that then use their monopoly rights to generate revenue and (2) the legacy rights over critical editions that scholars have already handed over to commercial entities. The field has the rights, the skills, and the labor so that it can immediately and permanently address the first challenge. The second challenge is much less tractable. We may never be able to place recent work in a form where it can fully support new scholarship. That form includes not only the rights that restrict its distribution but also, often, the digital format in which textual editions have been produced (e.g., where editors used word processing files rather than best practices such as well-implemented Text Encoding Initiative XML markup). Both the rights and the format together make it unlikely that we will be able in the immediate future (if ever) to make recent critical editions fully available (under a CC-BY-SA license, with TEI XML markup representing the logical structure of both the reconstructed text and the textual notes). The question before us is to determine how much we can in the immediate future recover for the full range of scholarly use and public discourse.

First, the decision to stop handing over ownership of new textual data (and especially any textual data produced with any significant measure of public funding) is, in 2015, a purely political one. There is no practical reason not to make this change immediately. If it takes editors an extra six months or a year (and it should not) because they need to learn how to produce a digital edition, the delay is insignificant in comparison to the damage that scholars suffer when they hand over control of the reconstructed text for 25 years and of the textual notes, introduction and other materials for 70 years after their death.

The Text Encoding Initiative began publishing interoperable methods for machine actionable digital editions in the late 1980s (Historical Editing was already a topic at the 1987 Poughkeepsie Planning meeting that laid the foundations for the TEI: http://www.tei-c.org/Vault/ED/edp01.htm). Students of Classical Greek and Latin, the largest community of historical philologists, already have all the resources in expertise and infrastructure with which to conduct this shift immediately. The second problem is recovering, insofar as possible, textual data that researchers have already given over to commercial interests which, in turn, exploit monopoly ownership to generate revenue. How many textual decisions in this commercial zone do we need to reference within the open data upon which we base our analysis of Greek and Latin and the cultures that these languages directly influenced? This blog post proposes a two-fold strategy: (1) beginning a series of openly licensed (CC-BY-SA) textual commentaries that are aligned to openly licensed editions and to which members of the community can suggest inclusion of important new editorial choices or conjectures only available in editions controlled by commercial interests; (2) identifying, if absolutely necessary, a small list of editions that commercial entities control but that are of such compelling importance that funding should be solicited to buy the rights to digitize, mark up with TEI XML, and distribute their contents.

Many traditional scholars may argue that we should preserve the present system (1) because only specialists in Greek and Latin philology need access to new editions and (2) because students of Greek and Latin have no need of the computational methods that require open data for their full expression as instruments of scholarship. Scholars are free to argue that the primary goal of humanities research is to enable specialist publication along small, effectively closed networks of intellectual exchange, that the results of our work on Greek and Latin do not really have enough broader impact to warrant worrying about open access and open data, that the study of historical languages does not require that researchers have the ability to download, analyze, modify and redistribute textual data, and that publicly funded scholarship is not ultimately answerable to the public which provides that funding.

From a pragmatic point of view, such arguments would be problematic for anyone who wishes to replace retiring faculty in Greek or Latin, to attract the most ambitious minds to the study of these languages or to justify research support for the study of Greek and Latin from any private foundation or governmental agency that could invest its research support elsewhere. There is never enough money to support all the research that would advance human understanding, much less so-called STEM disciplines (science, technology, engineering and mathematics — the corresponding German acronym is MINT) that materially advance the economic prosperity and biological health of society. But the privilege of academic freedom and the right of free expression that we enjoy in nations such as Germany and the United States exist so that we can follow our principles and add our opinions to public debate.

There are two fundamental reasons for scholars to make openly useful both their conclusions (open access publications) and the data upon which those conclusions depend.

The first bears most directly upon those of us who receive most, if not all, of our salary and research support either from public money or from private foundations that require us to make our results available under an open license. There is our obligation as humanists to advance the intellectual life of humanity. Of course, in 2015, this point of view is finding its way into regulations of government research funding in various countries while private foundations increasingly insist that the results from work that they fund be published under an open license. Ironically, the smallest and the largest disciplines seem to have adapted most rapidly to this much more open model of research. Students of Greek papyrology, for example, have already made the transition to open data and on-going, decentralized editing — those who feel that commercial entities provide the only channel by which to publish Greek and Latin textual editions need first to understand fully the infrastructure to which the papyrologists already have access (http://papyri.info/). In fact, the services at http://papyri.info/ go beyond what editors need if they wish to create individual, single-authored, static editions. For editors of Latin editions, help is on the way from the Digital Latin Library Project. If editors wish to work on their own to create editions of Greek and Latin texts, they should buy a TEI-aware XML editor and learn how to produce a modern edition. Anyone smart enough to edit an edition of Greek and Latin is smart enough to understand the necessary TEI XML (or EpiDoc subset of TEI XML: epidoc.sourceforge.net/). My colleagues at the Humboldt Chair of Digital Humanities are also there to do what we can to help.
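
To give a rough sense of how little markup a machine actionable textual note requires, here is a minimal sketch in Python. It reads a tiny TEI fragment that uses the TEI critical apparatus elements (app, lem, rdg) and lists the variants. The line of text, the readings, and the witness sigla #A and #B are invented placeholders for illustration, not data from any real edition, and the function name list_variants is likewise just an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; "tei" is only a local prefix for the queries below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented placeholder line: one word with two competing readings,
# attributed to the hypothetical witnesses #A and #B.
SAMPLE = """<l xmlns="http://www.tei-c.org/ns/1.0" n="1">verba
  <app>
    <lem wit="#A">prima</lem>
    <rdg wit="#B">altera</rdg>
  </app>
</l>"""


def list_variants(xml_text):
    """Return one record per <app>: the adopted reading and its alternatives."""
    root = ET.fromstring(xml_text)
    variants = []
    for app in root.iterfind(".//tei:app", TEI_NS):
        lem = app.find("tei:lem", TEI_NS)
        readings = [
            (rdg.get("wit"), (rdg.text or "").strip())
            for rdg in app.findall("tei:rdg", TEI_NS)
        ]
        variants.append(
            {
                "lemma": (lem.text or "").strip(),
                "lemma_wit": lem.get("wit"),
                "readings": readings,
            }
        )
    return variants


if __name__ == "__main__":
    for record in list_variants(SAMPLE):
        print(record)
```

Once the apparatus is encoded this way, the same few lines of code work across every edition that follows the convention, which is the practical meaning of "interoperable" here.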

Second, there is the scholarly need for open data. This need is not new. More than a decade ago, pioneering philologists badgered me to release the textual data that we had accumulated at Perseus. Licenses for private use were not enough. They argued tirelessly that they needed, as part of their fundamental research, the right to analyze, modify, and then redistribute some or all of those texts in their altered form. After dragging my feet for years, I finally began to open up the TEI XML source for Perseus texts. The initial release of the TEI XML Greek and Latin texts under a CC-BY-SA-NC license (now simplified to a CC-BY-SA license) took place in March 2006, almost a decade ago. The Classicists who demanded that open data — Chris Blackwell, Gabby Bodard, Helma Dik, Tom Elliott, Sebastian Heath, Ross Scaife, and Neel Smith, among others — were pioneers and earned for themselves by their visionary work a permanent place in the history of Greco-Roman studies. In 2015, we are beyond the vision thing. We Greek and Latin Philologists are playing catch-up as a field as we struggle to integrate into our work the best methods available for analyzing textual data.

We have gone beyond the point where we can any longer reasonably argue that computational methods are unimportant, or even optional, instruments within Greek and Latin philology as a whole. Not every professional student of Greek and Latin will master the foundational new methods already available to us from fields such as corpus linguistics, computational linguistics, text mining and various applications of machine learning. But those who do master the results of such new fields will play a crucial role in determining what all students of Greek and Latin at all levels will be able to do in their personal learning and published research. Open textual data is a foundational need for modern scholarship. The question before us is how to free ourselves from our dependence upon closed data and to establish a comprehensive, open, extensible textual space for the study of Greek and Latin. It is time to return, yet again, ad fontes — back to the sources.

It is not difficult to see how the field of Greek and Latin can, and will, shift so that new textual editions appear in proper TEI XML under an open license (ideally CC-BY-SA). For commercial — and especially for for-profit — companies, the shift to an open publication model simply reflects a shift in business models and the most profitable presses have already begun to build new (and reportedly quite profitable) open access tracks. Of course, the editors of Greek and Latin as a whole are perfectly capable of providing the editorial support for each other — the ability to write is a selling point of liberal arts degrees and professors of Greek and Latin would be ill-advised to argue that they needed professional editors in the same way as their colleagues in Computer Science or Physics. We can also build publishing workflows that simplify the use of TEI XML (such as the Leiden plus front end that papyrologists have been using for years). But such a streamlined system is a convenience, not a necessity.

The real problem is, of course, one of academic politics. Many faculty believe that they need to publish their work under an established corporate brand name if they are to receive formal academic credit. In some institutions, this belief may even be true, but I think that many faculty would find that their administrations were not only supportive but relieved to see their humanities faculty taking a stand on behalf of open access and open data, especially where faculty are public servants and/or their universities have strong policies in support of Open Access and open data.

I am confident that the administrations at Tufts University (where I am in the department of Classics) and at Leipzig (where I am the Open Access officer), for example, would enthusiastically work with any department that wanted to establish a framework for fairly assessing an edition that was published under a CC-BY-SA license. If anything, editors at these institutions would have a chance to earn even more prestige by taking an (apparent) risk to advance the role of Greek and Latin in the intellectual life of society beyond specialist researchers and to enable Greek and Latin philology to exploit evolving new forms of research based on progress in various computational fields. When senior faculty with permanent positions hand over their work to corporate entities, the situation is much more problematic. Certainly, as a senior professor who is not subject to existential pressures that junior scholars may feel, I don’t see how I can justify handing my work over to commercial entities. I feel that I have an obligation to help the next generation have the freedom to keep the results of their work open and available both to the intellectual life of society as a whole and to the most advanced analytical methods available to researchers.

But even when our field does the right thing for scholarship and society (and I would be disingenuous if I put it any other way), we face the consequences of our past actions. Commercial interests now control a substantial amount of the work that we have done, whether or not we did that work with public money or even if we may have ignored clear conditions on research funding that the results needed to be available under an open access license. (A review of funding decisions at various agencies may reveal a systematic pattern where domain experts voted to fund research projects that they knew would be handed over to commercial interests even when the regulations governing that funding prioritized, even where they did not explicitly mandate, publishing research results under an open license).

I was fortunate in that I began my own work developing corpora after legal issues began to emerge from the first efforts at sharing digital corpora. When humanists first began developing textual databases in the 1970s and 1980s, scholars had little understanding of copyright law (which, one could argue, really means that copyright law often does not reflect scholarly standards). Many assumed that the reconstructed texts in Classical Greek and Latin critical editions are in the public domain. The fact that a preponderance of experts in the field made this decision — in fact, operated under this assumption — provides evidence about what copyright law should dictate. In fact, explicit legislation does enable editors in some countries to exercise monopoly control over reconstructed texts for a period of time. I don’t know any editors who personally use that right to restrict access to their work — all the editors I know want their work to circulate as widely as possible. But editors sign contracts that give commercial publishers exclusive rights to their work. These publishers have lawyers and, if the perceived loss justifies the investment in legal fees, they can sue individual scholars. Even when textual data is in the public domain, commercial vendors (whether belonging to a for-profit corporation or a non-profit university) can (and often will) sue those who redistribute that public domain data on the basis of contract law. We work hard to make sure that we respect both copyright and contract law.

Given sufficient funding, the following categories of data can be digitized and made available as open data under the kind of CC license upon which modern philology must depend:

    Reconstructed texts: Reconstructed texts constitute the running text as reconstructed in an edition without accompanying textual notes, modern language translations, introduction, etc. We can use scientific editions from Germany that were published 25 or more years ago (thus, in early 2015 we can use scientific editions published through the beginning of 1990). The EU has passed a regulation allowing its member nations to exert such copyright for up to 30 years but Germany has not taken advantage of this EU opportunity nor has any other major producer of Greek and Latin texts. For pragmatic purposes, we will initially assume that every other nation but Germany (where open access and open data have strong public and political support) is liable to enact such a law. We will thus focus in 2015 on digitizing European editions outside of Germany published through 1985, in 2016 through 1986 etc. Here the goal is to have as many TEI XML transcriptions as possible and to help researchers compare different editions and visualize the degree to which they differ.

    Textual notes: The argument has been made that the textual notes are not part of the reconstructed text and constitute a separate copyrightable work. Insofar as textual notes are a scholarly activity, they should aspire to be an annotated database and thus should be receiving only 15 years of protection under EU database regulations (http://ec.europa.eu/internal_market/copyright/prot-databases/). The argument has also been made that the textual notes not only do not belong to a scientific edition but also constitute another form of creative expression and that commercial publishers should be able to monopolize them for the life of the editor plus 70 years. We will, for now, focus on mining textual notes from editions where the editor died 70 or more years ago. In practice, that means that we are working with the apparatus criticus of editions published in the 1920s and 1930s. Here our goal is to have a maximally clean searchable text but not to add substantive TEI XML markup that captures the structure of the textual notes — the structure of these notes tends to be complicated and inconsistent. Our pragmatic goal is to support “image front searching,” so that scholars can find words in the textual notes and then see the original page images.
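
The working rules in the two categories above reduce to simple year arithmetic, which the following Python sketch makes explicit. It is a simplification under the assumptions stated in this post (25 years for reconstructed texts from Germany, a cautious 30 years for the rest of Europe, and 70 years after the editor's death for textual notes); it is not legal advice, and the class, function, and country-code names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Working assumptions taken from the discussion above, not statements of law.
GERMAN_TEXT_TERM_YEARS = 25          # reconstructed texts published in Germany
ASSUMED_OTHER_TEXT_TERM_YEARS = 30   # cautious assumption for the rest of Europe
NOTES_TERM_AFTER_DEATH_YEARS = 70    # textual notes: life of the editor plus 70


@dataclass
class Edition:
    title: str
    country: str                      # "DE" for Germany; anything else is "other"
    year_published: int
    editor_death_year: Optional[int] = None  # None if unknown or still living


def text_cutoff_year(current_year: int, country: str) -> int:
    """Latest publication year whose reconstructed text we treat as usable."""
    if country == "DE":
        return current_year - GERMAN_TEXT_TERM_YEARS
    return current_year - ASSUMED_OTHER_TEXT_TERM_YEARS


def reconstructed_text_usable(edition: Edition, current_year: int) -> bool:
    return edition.year_published <= text_cutoff_year(current_year, edition.country)


def textual_notes_usable(edition: Edition, current_year: int) -> bool:
    """Notes are treated as usable only if the editor died 70 or more years ago."""
    if edition.editor_death_year is None:
        return False
    return current_year - edition.editor_death_year >= NOTES_TERM_AFTER_DEATH_YEARS


if __name__ == "__main__":
    # Placeholder edition, not a real bibliographic record.
    sample = Edition("Placeholder edition", "DE", 1988, editor_death_year=None)
    print(text_cutoff_year(2015, "DE"), text_cutoff_year(2015, "FR"))
    print(reconstructed_text_usable(sample, 2015), textual_notes_usable(sample, 2015))
```

Keeping the terms as named constants means that a change in the law, or in our assumptions about it, requires changing one number rather than re-auditing the whole digitization list.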

Given the legal constraints outlined above and assuming that we had the resources to create machine actionable versions of all publicly accessible textual data, what is the best way of representing the data commercial licenses restrict?

Strategy one: Support advanced graduate students and a handful of supervisory faculty to go through reviews of recent editions, identifying those editorial decisions that were deemed most significant. The output of this work would be an initial CC-BY-SA series of machine-actionable commentaries that could automatically flag all passages in the CC-BY-SA editions where copyrighted editions made significant decisions. In effect, we would be creating a new textual review series. Because the textual commentaries would be open and available under a CC-BY-SA, members of the community could suggest additions to them or create new expanded versions or create completely new, but interoperable, textual commentaries that could be linked to the CC-BY-SA texts.

Here the goal is to create an initial set of data about textual decisions in copyrighted editions and a framework that members of the community can extend. If members of the community feel that important textual data should be made available, then they can do so. If no one feels that it is important to make the data available, then the data is, by definition, not that important. The plan is to create a self-regulating environment. An open framework can evolve as members of the community wish. In this plan, we start a light-weight, easily expanded and duplicated process that others can copy.

We can summarize this as a Darwinian strategy. We may have to take a step back and lose some more recent textual data to open up the overall corpus, but the lost textual data is not, itself, subject to copyright (copyright protects original expression). The hypothesis is that an open field will replace what it considers to be lost textual data and ultimately (perhaps very quickly) outperform the closed system.

This strategy has at least two advantages. First, if funding were secured, that funding could help rising Greek and Latin philologists perform the task of creating the initial textual commentaries, thus immersing a new generation in the basic methods of representing textual data in a machine actionable form (and giving them a position where they have an opportunity to learn quite a bit of Greek and/or Latin). Second, we do not need to create a comprehensive set of textual commentaries. We need to create a critical mass that demonstrates the utility of such commentaries.
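
As an illustration of what a machine-actionable commentary entry under strategy one might look like, here is a minimal Python sketch. Each entry records a passage, the reading in the open edition, the reading reported from a restricted edition, and where that report comes from; a small function then flags the passages of an open edition that the commentary touches. The field names, the citation scheme ("demo.work:1.2"), and the sample values are all hypothetical placeholders, not a proposed standard.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TextualDecision:
    passage: str             # hypothetical citation, e.g. "demo.work:1.2"
    open_reading: str        # reading found in the CC-BY-SA edition
    restricted_reading: str  # reading reported from the restricted edition
    source: str              # pointer to the edition or review reporting it
    contributor: str         # who added the entry to the open commentary


def flag_passages(
    open_edition: Dict[str, str], commentary: List[TextualDecision]
) -> Dict[str, List[TextualDecision]]:
    """Map each passage of the open edition to the decisions recorded for it."""
    flags: Dict[str, List[TextualDecision]] = {}
    for decision in commentary:
        # Entries citing passages absent from this edition are simply skipped.
        if decision.passage in open_edition:
            flags.setdefault(decision.passage, []).append(decision)
    return flags


if __name__ == "__main__":
    # Placeholder data only; no real readings or editions are represented.
    edition = {"demo.work:1.1": "text of 1.1", "demo.work:1.2": "text of 1.2"}
    commentary = [
        TextualDecision("demo.work:1.2", "reading X", "reading Y",
                        "placeholder review", "placeholder contributor"),
        TextualDecision("demo.work:9.9", "reading P", "reading Q",
                        "placeholder review", "placeholder contributor"),
    ]
    for passage, decisions in flag_passages(edition, commentary).items():
        print(passage, [d.restricted_reading for d in decisions])
```

Because every entry is a self-contained record, members of the community can append entries, fork the list, or publish a competing commentary against the same passage references without coordinating with anyone.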

Strategy two: How many editions that are owned by commercial entities are so crucial to the mainstream study of Greek and Latin that it is worth trying to negotiate the rights and expend the time/money to produce CC-licensed TEI XML versions? The upper bound for such a purchase might be the cost of paying for production of a new open access book (up to 10,000 British pounds). Since commercial publishers have published several hundred editions in the last 25 or 30 years, paying for the rights for all recent editions would cost millions of euros and is clearly not a reasonable option. If publishers do not offer reasonable terms and the new editions are of critical importance, then members of the community will simply have to create new editions that integrate the most valuable findings from the restricted editions — that is, after all, the sort of thing that we are paid to do. But it might be possible to justify purchasing the rights to a few.

What editions might warrant such special treatment and why?

Conversely, how worthwhile is it for us to worry about editions published after c. 1985? Would it be better to focus on providing comprehensive coverage of editions through 1985 with the assumption that if the recent data is sufficiently important, then we can let members of the community fill in the gaps?

Ironically, I think that the best way to liberate textual data from corporate control is to demonstrate that life will go on without it and thus to destroy its value as a revenue-generating asset. We can use the reconstructed texts from Germany through 1990 and from the rest of Europe at least through 1985. While much has been done since then and it would be a shame if we could not immediately use it in our analysis of the ancient world, I became a professor in 1985 and I do not think that the quality of the textual editions available to us was a major limiting factor on the quality of our research at the time. We can start the process of identifying significant textual decisions in copyrighted editions. Where editors have produced radically new editions, we can try to secure the rights but the best way to free commercially controlled texts is to move forward with what we have.

Members of the community are, of course, free to make a case that research funding from private and public sources should be used to subsidize commercial services or even websites that provide free services but do not make their data available. Those who feel this way should make the case as fully as possible. I have heard the argument that we must under no circumstances go backwards and lose access to the most up-to-date texts but, unfortunately, we have already lost control over that access and continued to give it up for years after it became possible to do otherwise (the Text Encoding Initiative was documenting methods for machine actionable editions in the late 1980s) and after generalized models for open licenses had appeared (CreativeCommons.org released its first licenses in 2002). We could have acted differently a decade ago and we have, for the most part, chosen not to produce editions that are modern in format and accessible to a global audience. If we think that specialists at well-funded academic institutions alone need access to the best textual data, we should express that position clearly so that the federally funded agencies and private foundations know where we stand.

I don’t see an easy solution for rescuing data that we have given to commercial organizations but we should hear the arguments and proposals — and then act. Business as usual simply digs us into a deeper hole. Even if some of us may disagree with the case as a whole, a well-articulated case for sticking with privatized textual data may more clearly articulate issues that we need to address in shifting to an open philology.

Please send your suggestions to crane@informatik.uni-leipzig.de — or, better still, send a link to a public version of your thoughts. I will summarize initial suggestions in a subsequent blog post in May 2015.
