The Humanities, RDA, and Infrastructure

Earlier this year I blogged about why I think Perseus, and the digital humanities in general, needs infrastructure. In that post I discussed one strategy we’ve been following at Perseus – that of participating in the efforts of the Research Data Alliance (RDA).

I was recently fortunate to be able to attend the Triangle Scholarly Communications Institute, an Andrew W. Mellon Foundation funded workshop where the theme was “Validating and valuing digital scholarship.” While we did not spend much time talking specifically about “infrastructure”, it was implicit in all the solutions we discussed, from specific data models for representing scholarly assertions as a graph, to taxonomies for crediting work, to approaches for assessing quality. Cameron Neylon, a researcher at Curtin University and one of the institute participants, blogged some thoughts upon returning home, including the following:

“Each time I look at the question of infrastructures I feel the need to go a layer deeper, that the real solution lies underneath the problems of the layer I’m looking at. At some level this is true, but it’s also an illusion. The answers lie in finding the right level of abstraction and model building (which might be in biology, chemistry, physics or literature depending on the problem). Principles and governance systems are one form of abstraction that might help but it’s not the whole answer. It seems like if we could re-frame the way we think about these problems, and find new abstractions, new places to stand and see the issues we might be able to break through at least some of those that seem intractable today.” (Cameron Neylon, “Abundance Thinking”)

I think this neatly sums up what I hope to gain from participation in the multidisciplinary community of RDA. It’s easy, especially when time and resources are constrained, to get locked into thinking that our problems are unique and that we need to design custom solutions, but when we examine the problem from other perspectives, the abstractions begin to rise to the surface.

One of the challenges I took on when I agreed to serve as a liaison between RDA and the Alliance of Digital Humanities Organizations (ADHO) was to try to engage humanities researchers in participating in designing solutions for data sharing infrastructure. Collaboration is not easy, even when you’re all working on the same team. We have been struggling with this even in our own small group of developers at Perseus and our sister community in Leipzig, the Open Philology Project. In a recent developers’ meeting we talked about how hard it is to find the time to consider another developer’s solution before going off and designing your own.

But the rewards might be great. Imagine if by adding support to your project to acquire Persistent Identifiers for your data objects and for communicating the types of data represented by those identifiers you got for free the ability to, with the same code, interoperate with data objects from thousands of other projects. Or if implementing support for building and managing collections of objects meant that your data could participate seamlessly with collections of other projects. These are not abstract use cases. The New York Times just reported on the millions of dollars museums worldwide are spending to digitize their collections. What if we had a standard approach to managing digital objects in collections that allowed us to easily write software that could build new virtual museums from these collections? And are the requirements for such a solution very different from the needs of the Perseids project to manage collections of diverse types of annotations?

A disillusioned colleague said to me not long ago that nothing we build is going to save the world. While this might be true, I think that we all work too hard to keep reinventing the wheel. It’s going to take us a little longer to build a solution for managing our collections of annotations if we do it in the context of an RDA working group, but to me the benefit of having the expertise and perspective of colleagues from other communities and disciplines, such as researchers from DKRZ (climate science), NoMaD (materials science) and the PECE project (ethnography), is worth the effort. Even if we don’t save the collections of the world, at the very least we’re certain to end up with a solution that is a little better than one we would have built on our own. I urge everyone to be part of the conversation and the solutions. Feel free to start by commenting on the RDA Research Data Collections Working Group case statement or subscribing to contribute to the effort!

Posted in Uncategorized | Comments Off on The Humanities, RDA, and Infrastructure

Altertumswissenschaften in a Digital Age: Egyptology, Papyrology and Beyond

Dear Colleagues,

We are very pleased to invite you to attend the conference and workshops
Altertumswissenschaften in a Digital Age: Egyptology, Papyrology and Beyond

Leipzig, November 4-6, 2015
Hashtag: #DHEgypt15
Annotated Corpora | 3D | Input of Hieroglyphics, Demotic, Greek, Coptic

Felix-Klein-Hörsaal, Paulinum (4th floor)
Augustusplatz 10, 04109 Leipzig, Germany

Wednesday, November 4


08:30-09:30 – Registration
09:30-10:15 – Welcome (Monica Berti and Franziska Naether)
                        Keynote (Gregory R. Crane)

Research Area 1: How to Structure and Organize Data? Workflow
Chair: Felicitas Weber
10:15-10:45 – Simon Schweitzer (Berlin): The Text Encoding Software of the Thesaurus Linguae Aegyptiae
10:45-11:15 – Frank Feder (Göttingen): Cataloguing and editing Coptic Biblical texts in an online database system
11:15-11:45 – Tom Gheldof (Leuven): Trismegistos: identifying and aggregating metadata of Ancient World texts

11:45-12:00 – Coffee Break

12:00-12:30 – Stephan Seidlmayer (Berlin/Cairo): Medienuniversum Aswan
12:30-13:00 – Monica Berti, Franziska Naether, Julia Jushaninowa, Giuseppe G.A. Celano, Polina Yordanova (Leipzig/New York): The Digital Rosetta Stone: textual alignment and linguistic annotation

13:00-15:00 – Lunch Break and time for individual appointments

15:00-15:30 – Camilla Di Biase-Dyson, Stefan Beyer, Nina Wagenknecht (Göttingen/Leipzig): Annotating figurative language: Another perspective for digital Altertumswissenschaften
15:30-16:00 – Jochen Tiepmar (Leipzig): Release of the MySQL based implementation of the CTS protocol

16:00-16:15 – Coffee Break

Chair: Holger Essler
16:15-16:45 – Simon Schweitzer (Berlin), Simone Gerhards (Mainz): Auf dem Weg zu einem TEI-Austauschformat für ägyptisch-koptische Texte
16:45-17:15 – Lajos Berkes (Heidelberg): Integrating Greek, Coptic and Arabic in the Duke Databank of Documentary Papyri
17:15-17:45 – Nicola Reggiani (Heidelberg/Parma): The Corpus of Greek Medical Papyri and Digital Papyrology: new perspectives from an ongoing project
18:15-19:30 – Public Lecture introduced by Monica Berti and Franziska Naether (Hörsaal 8):
– Keynote by Gregory R. Crane
– Felix Schäfer (DAI Berlin, IANUS): Ein langes Leben für Deine Daten!

19:30 – Reception in the Egyptian Museum, Kroch-Hochhaus, Goethestraße 2
Welcome address by Dietrich Raue, Buffet, Get together, short guided tours
(by Dietrich Raue and Franziska Naether)

Thursday, November 5


Chair: Camilla Di Biase-Dyson
09:15-09:45 – Marc Brose, Josephine Hensel, Gunnar Sperveslage (Leipzig/Berlin): Von Champollion bis Erman – Lexikographiegeschichte im Digitalen Zeitalter, Projekt “Altägyptische Wörterbücher im Verbund”
09:45-10:15 – Lucia Vannini (London): Virtual reunification of papyrus fragments
10:15-10:45 – Matthias Schulz (Leipzig): What remains behind – on the virtual reconstruction of dismembered manuscripts

10:45-11:00 – Coffee Break

Research Area 2: Which Fields of Research are Relevant? Established and Emerging Use Cases
11:00-11:30 – Anne Herzberg (Berlin): Prosopographia Memphitica. Individuelle Identitäten und Kollektive Biographien einer Residenzstadt des Neuen Reiches
11:30-12:00 – Felicitas Weber (Swansea): The Ancient Egyptian Demonology Project: Second Millennium BCE
12:00-12:30 – Holger Essler, Vincenzo Damiani (Würzburg): Anagnosis – automatisierte Buchstabenverknüpfung von Transkript und Papyrusabbildung

12:30-14:30 – Lunch Break and time for individual appointments

Chair: Simon Schweitzer
14:30-15:00 – So Miyagawa (Göttingen/Kyoto): An Intuitive Unicode Input Method for Ancient Egyptian Hieroglyphic Writing: Applying the Input Technology of the Japanese Writing System
15:00-15:30 – Mark-Jan Nederhof (St. Andrews): OCR of hand-written transcriptions of hieroglyphic text
15:30-16:00 – Claudia Maderna-Sieben, Fabian Wespi, Jannik Korte (Heidelberg): Deciphering Demotic Digitally

16:00-16:15 – Coffee Break

16:15-16:45 – Christopher Waß (München): Demotisch, Hieratisch und SQL: Ein Beispiel für die Anwendung von DH in der Ägyptologie

Research Area 3: How to Train Next Generations? Teaching
16:45-17:15 – Julia Jushaninowa (Leipzig): E-learning Kurs “Verarbeitung digitaler Daten in der Ägyptologie”

Research Area 4: How to Impact Society? Citizen Science and Public Engagement
17:15-17:45 – Usama Gad (Heidelberg/Cairo): The Digital Challenges and Chances: The Case of Papyri and Papyrology in Egypt
17:45-18:15 – Aris Legowski (Bonn): The Project is completed! What now? The Ancient Egyptian Book of the Dead – A Digital Textzeugenarchiv

19:00 – Dinner at “Pascucci”

Friday, November 6

10:00-10:15 – Introduction to workshops

10:15-12:15 – Workshops

Workshop 1: Disruptive Technologies: Feature on 3D in Egyptian Archaeology
(Chair: Felix Schäfer)
short 10-min-presentations by:
Hassan Aglan (Luxor): 3D tombs modeling by simple tools
Rebekka Pabst (Mainz): Neue Bilder, neue Möglichkeiten. Chancen für die Ägyptologie durch das 3D-Design
Room: Seminar Room next to Felix-Klein-Hörsaal

Workshop 2 – Annotated Corpora: Trends and Challenges
(Chair: tba)
Room: Felix-Klein-Hörsaal

12:15-13:00 – Summary and final Discussion, Outlook

13:00 – Lunch Break & Departure of Participants

Poster Presentations

Isabelle Marthot (Universität Basel):
Papyri of the University of Basel (together with Sabine Huebner and Graham Claytor)
University of Minnesota Project: Ancient Lives, a crowd-sourced Citizen Science project

Uta Siffert (Universität Wien)
Project Meketre: From Object to Icon (together with Lubica Hudakova, Peter Jánosy and Claus Jurman)

Journal “Digital Classics Online”

Organizers & Contact

Dr. Monica Berti
Alexander von Humboldt-Lehrstuhl für Digital Humanities – Institut für Informatik
Augustusplatz 10, 04109 Leipzig, Germany

Dr. Franziska Naether
Ägyptologisches Institut/Ägyptisches Museum – Georg Steindorff
Goethestraße 2, 04109 Leipzig, Germany
Telefon 0341 97-37146
Telefax 0341 97-37029
September 1, 2015 – August 31, 2016
Volkswagen Visiting Research Fellow
Institute for the Study of the Ancient World (ISAW), New York

Collaborating Courses on Fifth-Century Greek History in Spring 2016?

Call for Collaboration

Gregory Crane
Leipzig and Tufts Universities
September 1, 2015

This is a preliminary call for comment and for participation.

I expect to be teaching an advanced Greek course in Spring 2016, quite possibly on Thucydides. I would like to explore the possibility of coordinating that teaching with others so that our students can interact and, ideally, collaborate across institutions and even across countries. Courses may also be in advanced Greek but can be on history, archaeology or any other subject relevant to the period. This model of collaboration can be applied quite broadly and others may pursue such collaborations in many subjects. My particular goal is to get something started in North America that focuses on Fifth-Century Greek History and that could feed into the Sunoikisis DC efforts that my colleague Monica Berti began in 2015.

The goals of this collaboration are

(1) to connect students, who may be in small and somewhat isolated classes or in larger lecture classes and who often have little sense that they are participating in a larger community of learners.

(2) to enable students to contribute something as they learn and to leave behind contributions with their names attached. Larger lecture classes could, for example, contribute by analyzing people and places in English translations and thus participate in social network and/or geospatial analysis. Advanced language courses could, for example, contribute by treebanking texts.

(3) to link courses in Europe, North America and elsewhere by exploiting the differing academic schedules in different countries. Students in North America who begin their semesters in January 2016 can aim to develop course projects and presentations that will feed into courses that begin at Leipzig in April, with US students presenting via videoconference to introduce topics and methods to students in Germany. In January and early February 2017, European students, who began their classes in October 2016, can reciprocate, presenting topics and methods to North American students, who can, in turn, present in April to the next semester of European students. Here we hope not only to help students develop ties across national boundaries but to recognize that learning is an on-going and cumulative process.

One method of collaboration would be to participate in the 2016 version of Sunoikisis Digital Classics, which in turn builds upon the long-term efforts of the program that Harvard’s Center for Hellenic Studies supports. Collaborations can, however, take various forms. Different classes could, for example, focus upon a single task (e.g., Treebanking selections of Greek historical sources or focusing upon comprehensive Treebanking of a particular author). Different classes might create shared discussion lists or complementary projects (e.g., one class focusing on language and another on the material record). I particularly welcome anyone from the Boston area who would be interested in the possibility of having our students meet jointly in person one or more times.

I welcome both public discussions in venues such as the Digital Classicist and private inquiries (which can be sent to


Open Patrologia Graeca 1.0

August 8, 2015
Comments to

Federico Boschetti, CNR, Pisa
Gregory Crane, Leipzig/Tufts
Matt Munson, Leipzig/Tufts
Bruce Robertson, Mount Allison
Nick White, Durham (UK) (and Tufts during 2014)

A first stab at producing OCR-generated Greek and Latin for the complete Patrologia Graeca (PG) is now available on GitHub. This release provides raw textual data that will be of service to those with programming expertise and to developers with an interest in Ancient Greek and Latin. The Patrologia Graeca has as much as 50 million words of Ancient Greek produced over more than 1,000 years, along with an even larger amount of scholarship and accompanying translations in Latin.

Matt Munson started a new organization for this data because it is simply too large to put into the existing OGL organization. Each volume can contain 250MB or more of .txt and .hocr files, so it is impossible to put everything in one repository or even several dozen repositories. So he decided to create a new organization where all the OCR results for each volume would be contained within its own repository. This will also allow us to add more OCR data as necessary (e.g., from Bruce Robertson, of Mt. Allison University, or from nidaba, our own OCR pipeline) at the volume level.

The repositories are being created and populated automatically by a Python script, so if you notice any problems or strange happenings, please let us know either by opening an issue on the individual volume repository or by sending us an email. This is our first attempt at pushing this data out. Please let us know what you think.

Available data includes:

Greek and Latin text generated by two open source OCR engines, OCRopus and Tesseract. The output format for both engines is hOCR, a format that contains links to the coordinates on the original page image from which the OCR was generated.

OCR results for as many scans of each volume of the Patrologia Graeca as we could find in the HathiTrust. We discovered that the same OCR engine applied to scans of different copies of the same book would generate different errors (even when the scans seemed identical to most human observers). This means that if OCR applied to copy X incorrectly analyzed a particular word, there was a good chance that the same word would be correctly analyzed when the OCR engine was applied to copy Y. A preliminary study of this phenomenon is available. In most cases, the OCRopus/Lace OCR contains results for four different scanned copies while the Tesseract/AncientGreekOCR output contains results for up to 10 different copies. All of the Patrologia Graeca volumes are old enough that HathiTrust members in Europe and North America can download the PDFs for further analysis. Anyone should be able to see the individual pages used for OCR via the public HathiTrust interface.

Initial page-level metadata for the various authors and works in the PG, derived from the core index at columns 13-114 of Cavallera’s 1912 index to the PG (which Roger Pearse cites). A working TEI XML transcription, which has begun capturing the data within the print source, is available for inspection. All figures are preliminary and subject to modification (that is one motivation for posting this call for help), but we do not expect that the figures will change much at this point. At present, we have identified 658 authors and 4,287 works. The PG contains extensive introductions, essays, indices etc. and we have tried to separate these out by scanning for keywords (e.g., praefatio, monitum, notitia, index). We estimate that there are 204,129 columns of source text and 21,369 columns of secondary sources, representing roughly 90% and 10% respectively. Since a column in Migne contains about 500 words and since the Greek texts (almost) always have accompanying Latin translations, the PG contains up to 50 million words of Greek text. Many authors, however, have extensive Latin notes and in some cases no Greek text, so there should be even more Latin.
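The estimate above can be checked with a few lines of arithmetic. The following is only a sketch: the column counts and the figure of about 500 words per column come from the text, while the function name and the assumption that roughly half of the source columns are Greek (the other half being the facing Latin translation) are ours.

```python
# Figures quoted in the post (preliminary, per the authors).
SOURCE_COLUMNS = 204_129
SECONDARY_COLUMNS = 21_369
WORDS_PER_COLUMN = 500  # "about 500 words" per column in Migne


def estimate_pg_size(source_columns, secondary_columns,
                     words_per_column, greek_share=0.5):
    """Rough size estimate for the PG.

    greek_share is an ASSUMPTION: Greek and its Latin translation are
    taken to split the source columns about evenly.
    """
    total_columns = source_columns + secondary_columns
    source_words = source_columns * words_per_column
    return {
        "total_columns": total_columns,
        "source_share": source_columns / total_columns,
        "secondary_share": secondary_columns / total_columns,
        "source_words": source_words,
        "greek_words_upper_bound": int(source_words * greek_share),
    }


estimate = estimate_pg_size(SOURCE_COLUMNS, SECONDARY_COLUMNS, WORDS_PER_COLUMN)
for name, value in estimate.items():
    print(f"{name}: {value}")
```

Under these assumptions the source text comes to a little over 100 million words, about half of it Greek, which is consistent with the figure of up to 50 million words of Greek.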

Next Steps

Developing high-recall searching by combining the results for each scanned page of the PG. This entails several steps. First, we need to align the OCR pages with each other: page 611 for one volume may correspond to page 605 in another, depending upon how the front matter is treated and upon pages that one scan may have missed. Second, we need to create an index of all forms in the OCR-generated text available for each page in each PG volume. Since one of the two OCR engines applied to multiple scans of the same page is likely to produce a correct transcription, a unified index for all the text for all the scans of a page will capture a very high percentage of the words on that page.

Running various forms of text mining and analysis over the PG. Many text mining and analysis techniques work by counting frequently repeated features. Such techniques can be relatively insensitive to error rates in the OCR (i.e., you get essentially the same results if your text is 96% accurate or if it is 99.99% accurate). Many methods for topic modelling and stylistic analysis should produce immediately useful results.

Using the multiple scans to identify and correct errors and to create a single optimized transcription. In most cases, bad OCR produces nonsense forms that are not legal Greek or Latin. When one OCR run has a valid Greek or Latin word and others do not, that valid word is usually correct. Where two different scans produce valid Greek or Latin words (e.g., the common confusion of eum and cum), we can use the hOCR feature that allows us to include multiple possibilities. We can do quite a bit to encode the confidence that we have in the accuracy of each transcribed word.

Providing a public error correction interface. One error correction interface already does exist and has been used to correct millions of words of OCR-generated Greek but two issues face us. First, we need to address the fact that we cannot ourselves serve page images from HathiTrust scans. HathiTrust members could use the system that we have by downloading the scans of the relevant volumes to their own servers but that does not provide a general solution. Second, our correction environment deals with OCR for one particular scanned copy. Ideally, the correction environment would allow readers to draw upon the various different scans from different copies and different OCR engines.
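Two of the steps above, the unified index for high-recall searching and the merging of readings from multiple scans, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: the function names, the page identifiers, the tiny lexicon and the use of vote share as a confidence value are all hypothetical, and the sketch assumes that pages and word positions have already been aligned across scans. It does not read or write hOCR.

```python
import re
import unicodedata
from collections import Counter, defaultdict

WORD = re.compile(r"\w+")


def build_unified_index(ocr_pages):
    """Map every form seen in any OCR run to the aligned pages it occurs on.

    ocr_pages: iterable of (page_id, text) pairs, one per OCR run per scan.
    page_id is assumed to be the same for all scans of the same page.
    """
    index = defaultdict(set)
    for page_id, text in ocr_pages:
        # NFC keeps accented Greek letters together as single characters.
        normalized = unicodedata.normalize("NFC", text).lower()
        for form in WORD.findall(normalized):
            index[form].add(page_id)
    return index


def find_pages(index, form):
    """Return the pages on which at least one OCR run produced this form."""
    key = unicodedata.normalize("NFC", form).lower()
    return sorted(index.get(key, ()))


def merge_readings(readings, lexicon):
    """Rank the candidate forms for one aligned word position.

    readings: one form per OCR run. lexicon: set of valid forms
    (hypothetical; a real list of Greek and Latin forms would go here).
    Valid forms are preferred; if no run produced one, all readings are
    kept. The score is the share of runs that produced the form.
    """
    valid = [form for form in readings if form in lexicon]
    pool = valid if valid else readings
    total = len(readings)
    return [(form, count / total) for form, count in Counter(pool).most_common()]


# Hypothetical data: two runs over the same page, one over another page.
runs = [
    ("page-1", "cum deo"),
    ("page-1", "eum deo"),
    ("page-2", "deo gratias"),
]
index = build_unified_index(runs)
print(find_pages(index, "eum"))

lexicon = {"eum", "cum", "deus"}
print(merge_readings(["dcus", "deus", "dens"], lexicon))
print(merge_readings(["eum", "cum", "eum"], lexicon))
```

The index favours recall over precision: a search finds a page if any run read the word correctly, at the cost of also indexing the nonsense forms. The merge step keeps every valid alternative, such as eum and cum, rather than forcing a single choice, so that the remaining ambiguity can be recorded alongside the text.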


Essays on Digital Classics and Digital Humanities

Gregory Crane
Professor of Classics and Winnick Family Chair of Technology and Entrepreneurship
Tufts University

Alexander von Humboldt Professor of Digital Humanities
Open Access Officer
University of Leipzig

Comments to

This link points to a list and short summary of essays that I have written in 2014 and 2015 on Digital Classics and Digital Humanities. While most of these essays concentrate on Greco-Roman studies, I consider Digital Classics to include all Classical languages (and in practice I would include all historical languages, whether they are considered Classical or not) and the distinction between Greco-Roman and Classical Studies emerges as a periodic subtheme. The general theme of these essays is the challenge that professional students of Greco-Roman culture face in opening up their field so that it can serve the needs of our students and that it can participate in a global dialogue among civilizations.


“Privashing” vs. publishing — in search of an accurate term for a problematic academic tradition

Gregory Crane
Comments to

[Draft as of July 28, 2015]

Alexander von Humboldt Professor of Digital Humanities
Leipzig University

Professor of Classics
Winnick Family Chair of Technology and Entrepreneurship
Tufts University

This is a short piece reflecting on the need to distinguish publication that actually maximizes the degree to which we make academic work public from the traditional practice of handing control over to commercial, often for-profit, companies that make money by restricting publication. Of course, publishing under an open license has no necessary relation to any particular editorial workflow — researchers can, of course, carry out peer review or any other mechanism they choose. Certainly professors have wide latitude in deciding what does and does not count in a field and we can reward editorial work if we so choose. (I don’t know many academic administrators or members of the public who feel that we don’t produce enough publication and that we could not divert some of that labor to editorial work.) The point is that publishing should (in my view) mean that we have chosen that instrument that will help us advance the intellectual life of society as broadly and as deeply as possible.

I suggest, somewhat tongue-in-cheek, using the term “privashing” instead of publishing for this traditional habit. Of course, it is an ugly term but that certainly suits the traditional practice in a digital age. The full (not very long) text is here.


Revision to “The Big Humanities, National Identity and the Digital Humanities in Germany”

I have integrated a number of comments and clarifications from my colleagues in Dariah-DE and TextGrid in revising “The Big Humanities, National Identity and the Digital Humanities in Germany”; a good deal more has gone into making Dariah-DE international than had been clear to me when I first wrote this draft. I also have integrated information from Domenico Fiormonte’s essay, “Towards monocultural (digital) Humanities?” This essay outlines some of the problems of relying upon the Scopus database and rightly emphasizes the challenge of maintaining linguistic (as well as cultural and intellectual) diversity. I have no good answers to this last problem except to say that we have our work cut out for us. For more on the Fiormonte essay, see here. This document remains subject to revision in light of comments and suggestions. (GRC, July 20, 2015)


The Work of Classical Studies in a Digital Age

Gregory Crane
[Draft as of July 20, 2015]

Alexander von Humboldt Professor of Digital Humanities
Leipzig University

Professor of Classics
Winnick Family Chair of Technology and Entrepreneurship
Tufts University

[This is hopefully the last essay, at least for a while, in a series that I have published on Digital Classics.]

Abstract: Who is the audience for the work that we professional researchers conduct on Greco-Roman culture? Frequently heard remarks, observed practices and published survey results indicate most of us still assume that only specialists and revenue-generating students really matter. If the public outside of academia does not have access to up-to-date data about the Greco-Roman world, whose problem is it? If we specialists do not believe that we have a primary responsibility to open up the field as is now possible in a digital age, then I am not sure why we should expect support from anyone other than specialists or the students who enroll in our classes. If we do believe that we have an obligation to open up the field, then that has fundamental implications for our daily activities, for our operational theory justifying the existence of our positions, and for the hermeneutics (following a term that is still popular in Germany) that we construct about who can know what.

The full text is here.


Resisting a monocultural (digital) Humanities

Gregory Crane
July 17, 2015

I would very much recommend reading Domenico Fiormonte’s piece “Towards a monocultural (digital) Humanities” (with reference as well to the plea by Miran Hladnik for linguistic diversity). First, Fiormonte’s piece gives some very specific correctives to the Scopus data that I used when I wrote “The Big Humanities, National Identity and the Digital Humanities in Germany” a month ago. Second, this piece articulates, in its title and its content, the fraught question of how we support linguistic and cultural diversity in a transnational space. The question before us is, I think, how we maintain our various linguistic and cultural identities while communicating with one another in a dynamic and enriching dialogue across boundaries of language and culture.

For more on this, look here.
