Call for Papers – “Open Conference on Digital Infrastructures for Global Philology”

February 20-23, 2017 Leipzig, Germany

The Alexander von Humboldt Chair for Digital Humanities at the University of Leipzig, Germany, will host an “Open Conference on Digital Infrastructures for Global Philology” from February 20 to 23, 2017. The purpose of this conference is to bring together members of the larger scholarly community, both within and outside of Germany, with a focus on, but not limited to, scholars working with historical languages. The conference should help both to advance the discussions already under way between large and medium-sized infrastructure-building projects on the one hand and (digital) humanities scholars on the other, and to introduce new topics that have yet to find a forum for public discussion.

Possible topics for proposed papers include, but are not limited to, the following questions:

●  What digital services, collections and curricula have emerged from particular funded projects that are of such general utility that they can be adopted as part of a long-term infrastructure upon which students of a field, at every level of expertise, can depend for years and decades?

●  What infrastructure developments within larger fields (including large European infrastructure projects such as Clarin, Dariah and Europeana but also substantive efforts in the natural and life sciences) provide foundations upon which historical languages can build?

●  What digital services, collections or curricula need to be developed so that a field of study can flourish in a digital society?

●  What funding mechanisms and organizational structures are in place/need to be put in place in libraries, computing centers, and academic departments?

The deadline for paper submissions is November 15, 2016. Submissions and review will be handled through the EasyChair system.

Please submit papers for review through EasyChair. Decisions about submissions will be made by November 30, 2016. Limited funding will be available to reimburse the travel expenses of presenters.

The context for this conference is a planning project funded by the German Ministry of Education and Research. An English version of the project proposal is also available.


Considering a post-bac in Classics? Think about the new MA in Digital Tools for Premodern Studies at Tufts.

This post focuses on what the program offers to students who have traditionally entered post-baccalaureate programs to prepare for a PhD in Greco-Roman studies. Two years ago I published a post entitled “So you want to become a professor of Greek and/or Latin? Think hard about a PhD in Digital Humanities.” Here I describe something we have done at Tufts to improve the situation: creating an MA in Digital Tools for Premodern Studies that allows students to address two common challenges: the need to read more Greek and Latin, and the need to familiarize themselves with the digital methods upon which their teaching and research will increasingly depend in the decades to come. Graduates would then be in a position to pursue a PhD in those more traditional departments where faculty realize that junior scholars must adapt and that their own programs are not yet in a position to provide that training.

Before focusing on this particular topic, I do want to emphasize that the new MA in Digital Humanities for Premodern Studies, of course, also provides opportunities for a range of different subsequent career tracks. Libraries are being reinvented and demand personnel who can work with born-digital data about the past. All PhD programs that engage with the human textual record need students who can exploit the latest digital methods. And the methods that students encounter in this program come from fields such as corpus and computational linguistics, text mining and visualization, geospatial and social network analysis, citizen science, and other areas of general and emerging importance. The MA is also intended to support a growing range of historical languages and contexts; the Tufts Department of Classics already offers classes in Sanskrit (thanks to Anne Mahoney) as well as Greek and Latin, and supports research in Classical Arabic (thanks to Riccardo Strobino). The two chairs of Classics who led the development of this program, past-chair Vickie Sullivan and current-chair Ioannis Evrigenis, are political philosophers with primary appointments in Political Science, and their research offers opportunities for students who wish to explore early modern culture and its connections to the ancient world. Certainly this connects to my own belief that we must redefine the meaning of Classics to include all Classical languages from around the world (if we don’t just jettison this value-laden term in favor of historical languages or something more descriptive).

Our hope is to support an increasing range of languages and faculty will work with potential applicants to find ways to address their interests. But for those students who are looking for a program to prepare them for PhD programs in Greco-Roman studies or in fields where advanced knowledge of Greek or Latin are particularly helpful, the new MA in Digital Humanities for Premodern Studies offers a new approach.

Over the past generation a number of post-bac programs have emerged to help students expand their knowledge of Greek and Latin in preparation for PhD study. More recently, a new challenge has emerged: to exploit the possibilities and meet the challenges of a digital age, the study of Greek, Latin and all historical languages needs to be rebuilt from the ground up. In a very real sense, we have no modern editions, no modern lexica, no modern commentaries, no modern encyclopedias and no modern publications because our scholarship and the infrastructure upon which it resides still reflects, even when it appears in digital form, the limitations of print rather than the possibilities of digital media. The study of historical languages — even languages like Greek and Latin, which have been the object of analysis for thousands of years — is in the process of reinventing itself. The challenge is to exploit the best from millennia of work, but to do so critically, identifying and transcending problematic assumptions about what we do and why. And if we are to do so, we need a new generation of researchers and teachers who have a command of emerging digital methods. Few PhD programs in Greek, Latin, or any other historical language are in a position to provide such expertise — the Digital Classicists who have emerged have been largely self-taught, and many of those considered to be Digital Classicists (myself included) wish that they had had an opportunity for more formal training.

The new MA in Premodern Studies at Tufts thus addresses two different challenges, and does so in a way where work on each challenge reinforces the other. If students wish to improve their command of texts in historical languages such as Greek and Latin, one of the best ways is to take charge of a text and create the beginnings of its first truly digital edition.

What constitutes a truly digital edition?

  • A truly digital edition does not simply have digitized textual notes, modern language translation, and indices for people, places and primary sources that quote a text (e.g., the Greek texts that quote a particular passage of the Iliad) or that the text itself quotes (e.g., the authors whom writers such as Plutarch or Athenaeus quote). A truly digital edition contains links to digital representations of the manuscripts, papyri, inscribed stones or other textual witnesses.
  • A truly digital edition does not simply add upper- and lower-case, paragraph breaks, and modern punctuation but explicitly encodes the morphological, syntactic, and semantic judgments upon which these print-culture conventions of annotation depend and to which they loosely allude. A truly digital edition encodes the best available data about which Alexander or which Alexandria a particular passage in a particular text designates and then captures social and geographical relationships in a format that can be automatically analyzed and dynamically visualized.
  • A truly digital edition encodes quotations within and references to a text as hypertextual links among evolving digital editions.
  • A truly digital edition can accommodate translations into multiple different modern languages, with each translation aligned, as appropriate, at the word and phrase level, both to help readers more effectively work with the original and to support new forms of scholarly analysis (e.g., using translation alignments to study changes in word sense over time).
  • At Tufts you can work with the emerging digital publication environment developed by the Perseids Project, create geospatial publications with Pelagios Commons, develop a project within the collaborative framework of the Homer Multitext project, or contribute to any other open digital project. If you want to demonstrate to a potential PhD program your capacity to understand Greek and Latin, as well as your mastery of new digital methods, you can create a portfolio of your work and contribute to the next version of the Perseus Digital Library, which is now under development at Tufts, Leipzig and elsewhere. The two-year program allows you to develop a mature portfolio by the time PhD applications are due in December of your second year.
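The word- and phrase-level translation alignment described above can be represented with very simple data structures. The following is a minimal, purely illustrative sketch in Python; the tokens and index pairs are invented for the example and do not reflect any particular project's format:

```python
# Minimal sketch of a word-level translation alignment (illustrative data).
# Each pair (i, j) links token i of the Greek line to token j of the English.
greek = ["μῆνιν", "ἄειδε", "θεά"]                 # opening words of Iliad 1.1
english = ["Sing", "goddess", "of", "the", "wrath"]

# μῆνιν↔wrath, ἄειδε↔Sing, θεά↔goddess
alignment = [(0, 4), (1, 0), (2, 1)]

def aligned_pairs(src, tgt, links):
    """Return the aligned word pairs, e.g. for studying word senses over time."""
    return [(src[i], tgt[j]) for i, j in links]

print(aligned_pairs(greek, english, alignment))
```

Stored this way, alignments support both reading aids (highlight the English words when a reader selects a Greek word) and corpus-scale analysis of how a given word is translated across editions.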

    Gregory Crane
    Program Director
    MA in Digital Humanities for Premodern Studies
    Winnick Family Chair of Technology and Entrepreneurship
    Professor of Classics
    Editor-in-Chief, Perseus Digital Library
    Adjunct Professor of Computer Science
    Tufts University

    Alexander von Humboldt Professor of Digital Humanities
    Leipzig University


    Share your Perseus Story!

    The Perseus Digital Library is interested in hearing from you.

    Whether you are a long-time Perseus user or a new visitor, we want to know your story.

    We know our audience is using Perseus in many interesting and unusual ways and we’d like to hear about them.

    Did Perseus help you solve a problem? reach a goal? make a connection? Have a story you want to share?

    Please use the form here or send us an email with the subject line “My Perseus Story.”


    The Humanities, RDA, and Infrastructure

    Earlier this year I blogged about why I think Perseus, and the digital humanities in general, needs infrastructure. In that post I discussed one strategy we’ve been following at Perseus – that of participating in the efforts of the Research Data Alliance (RDA).

    I was recently fortunate to be able to attend the Triangle Scholarly Communications Institute, an Andrew W. Mellon Foundation funded workshop where the theme was “Validating and valuing digital scholarship.” While we did not spend much time talking specifically about “infrastructure”, it was implicit in all the solutions we discussed, from specific data models for representing scholarly assertions as a graph, to taxonomies for crediting work, to approaches for assessing quality. Cameron Neylon, a researcher at Curtin University and one of the institute participants, blogged some thoughts upon returning home, including the following:

    “Each time I look at the question of infrastructures I feel the need to go a layer deeper, that the real solution lies underneath the problems of the layer I’m looking at. At some level this is true, but it’s also an illusion. The answers lie in finding the right level of abstraction and model building (which might be in biology, chemistry, physics or literature depending on the problem). Principles and governance systems are one form of abstraction that might help but it’s not the whole answer. It seems like if we could re-frame the way we think about these problems, and find new abstractions, new places to stand and see the issues we might be able to break through at least some of those that seem intractable today.” (Cameron Neylon, “Abundance Thinking“)

    I think this neatly sums up what I hope to gain from participation in the multidisciplinary community of RDA. It’s easy, especially when time and resources are constrained, to get locked into thinking that our problems are unique and that we need to design custom solutions, but when we examine the problem from other perspectives, the abstractions begin to rise to the surface.

    One of the challenges I took on when I agreed to serve as a liaison between RDA and the Alliance of Digital Humanities Organizations (ADHO) was to try to engage humanities researchers in participating in designing solutions for data sharing infrastructure. Collaboration is not easy, even when you’re all working on the same team. We have been struggling with this even in our own small group of developers at Perseus and our sister community in Leipzig, the Open Philology Project. In a recent developers’ meeting we talked about how hard it is to find the time to consider another developer’s solution before going off and designing your own.

    But the rewards might be great. Imagine if by adding support to your project to acquire Persistent Identifiers for your data objects and for communicating the types of data represented by those identifiers you got for free the ability to, with the same code, interoperate with data objects from thousands of other projects. Or if implementing support for building and managing collections of objects meant that your data could participate seamlessly with collections of other projects. These are not abstract use cases. The New York Times just reported on the millions of dollars museums worldwide are spending to digitize their collections. What if we had a standard approach to managing digital objects in collections that allowed us to easily write software that could build new virtual museums from these collections? And are the requirements for such a solution very different from the needs of the Perseids project to manage collections of diverse types of annotations?
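As a toy illustration of the persistent-identifier idea above (this is not an actual RDA or Handle System API; all identifiers, fields, and names here are invented), a shared registry that maps PIDs to typed records lets the same resolution code work for objects minted by any participating project:

```python
# Hypothetical sketch: a shared registry maps persistent identifiers (PIDs)
# to typed data objects, so one piece of resolution code can interoperate
# with objects from many projects. All PIDs and fields are illustrative.
registry = {
    "hdl:11022/perseus-text-0001": {"type": "text", "project": "Perseus"},
    "hdl:11022/museum-image-0042": {"type": "image", "project": "Museum X"},
}

def resolve(pid):
    """Resolve a PID to its typed record, regardless of which project minted it."""
    record = registry.get(pid)
    if record is None:
        raise KeyError(f"unknown PID: {pid}")
    return record

print(resolve("hdl:11022/perseus-text-0001"))
```

The point of the sketch is the abstraction: code written against the registry interface neither knows nor cares which project produced a given object, which is exactly the interoperability the paragraph above imagines.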

    A disillusioned colleague said to me not long ago that nothing we build is going to save the world. While this might be true, I think that we all work too hard to keep reinventing the wheel. It’s going to take us a little longer to build a solution for managing our collections of annotations if we do it in the context of an RDA working group, but to me the benefit of having the expertise and perspective of colleagues from other communities and disciplines, such as researchers from DKRZ (climate science), NoMaD (materials science) and the PECE project (ethnography) is worth the effort. Even if we don’t save the collections of the world, at the very least we’re certain to end up with a solution that is a little better than one we would have built on our own. I urge everyone to be part of the conversation and the solutions. Feel free to start by commenting on the RDA Research Data Collections Working Group case statement or subscribing to contribute to the effort!


    Altertumswissenschaften in a Digital Age: Egyptology, Papyrology and Beyond

    Dear Colleagues,

    We are very pleased to invite you to attend the conference and workshops
    Altertumswissenschaften in a Digital Age: Egyptology, Papyrology and Beyond

    Leipzig, November 4-6, 2015
    Hashtag: #DHEgypt15
    Annotated Corpora | 3D | Input of Hieroglyphics, Demotic, Greek, Coptic

    Felix-Klein-Hörsaal, Paulinum (4th floor)
    Augustusplatz 10, 04109 Leipzig, Germany

    Wednesday, November 4


    08:30-09:30 – Registration
    09:30-10:15 – Welcome (Monica Berti and Franziska Naether)
                            Keynote (Gregory R. Crane)

    Research Area 1: How to Structure and Organize Data? Workflow
    Chair: Felicitas Weber
    10:15-10:45 – Simon Schweitzer (Berlin): The Text Encoding Software of the Thesaurus Linguae Aegyptiae
    10:45-11:15 – Frank Feder (Göttingen): Cataloguing and editing Coptic Biblical texts in an online database system
    11:15-11:45 – Tom Gheldof (Leuven): Trismegistos: identifying and aggregating metadata of Ancient World texts

    11:45-12:00 – Coffee Break

    12:00-12:30 – Stephan Seidlmayer (Berlin/Kairo): Medienuniversum Aswan
    12:30-13:00 – Monica Berti, Franziska Naether, Julia Jushaninowa, Giuseppe G.A. Celano, Polina Yordanova (Leipzig/New York): The Digital Rosetta Stone: textual alignment and linguistic annotation

    13:00-15:00 – Lunch Break and time for individual appointments

    15:00-15:30 – Camilla Di Biase-Dyson, Stefan Beyer, Nina Wagenknecht (Göttingen/Leipzig): Annotating figurative language: Another perspective for digital Altertumswissenschaften
    15:30-16:00 – Jochen Tiepmar (Leipzig): Release of the MySQL based implementation of the CTS protocol

    16:00-16:15 – Coffee Break

    Chair: Holger Essler
    16:15-16:45 – Simon Schweitzer (Berlin), Simone Gerhards (Mainz): Auf dem Weg zu einem TEI-Austauschformat für ägyptisch-koptische Texte
    16:45-17:15 – Lajos Berkes (Heidelberg): Integrating Greek, Coptic and Arabic in the Duke Databank of Documentary Papyri
    17:15-17:45 – Nicola Reggiani (Heidelberg/Parma): The Corpus of Greek Medical Papyri and Digital Papyrology: new perspectives from an ongoing project
    18:15-19:30 – Public Lecture introduced by Monica Berti and Franziska Naether (Hörsaal 8):
    – Keynote by Gregory R. Crane
    – Felix Schäfer (DAI Berlin, IANUS): Ein langes Leben für Deine Daten!

    19:30 – Reception in the Egyptian Museum, Kroch-Hochhaus, Goethestraße 2
    Welcome address by Dietrich Raue, Buffet, Get together, short guided tours
    (by Dietrich Raue and Franziska Naether)

    Thursday, November 5


    Chair: Camilla Di Biase-Dyson
    09:15-09:45 – Marc Brose, Josephine Hensel, Gunnar Sperveslage, (Leipzig/Berlin): Von Champollion bis Erman – Lexikographiegeschichte im Digitalen Zeitalter, Projekt “Altägyptische Wörterbücher im Verbund”
    09:45-10:15 – Lucia Vannini (London): Virtual reunification of papyrus fragments
    10:15-10:45 – Matthias Schulz (Leipzig): What remains behind – on the virtual reconstruction of dismembered manuscripts

    10:45-11:00 – Coffee Break

    Research Area 2: Which Fields of Research are Relevant? Established and Emerging Use Cases
    11:00-11:30 – Anne Herzberg (Berlin): Prosopographia Memphitica. Individuelle Identitäten und Kollektive Biographien einer Residenzstadt des Neuen Reiches
    11:30-12:00 – Felicitas Weber (Swansea): The Ancient Egyptian Demonology Project: Second Millennium BCE
    12:00-12:30 – Holger Essler, Vincenzo Damiani (Würzburg): Anagnosis – automatisierte Buchstabenverknüpfung von Transkript und Papyrusabbildung

    12:30-14:30 – Lunch Break and time for individual appointments

    Chair: Simon Schweitzer
    14:30-15:00 – So Miyagawa (Göttingen/Kyoto): An Intuitive Unicode Input Method for Ancient Egyptian Hieroglyphic Writing: Applying the Input Technology of the Japanese Writing System
    15:00-15:30 – Mark-Jan Nederhof (St. Andrews): OCR of hand-written transcriptions of hieroglyphic text
    15:30-16:00 – Claudia Maderna-Sieben, Fabian Wespi, Jannik Korte (Heidelberg): Deciphering Demotic Digitally

    16:00-16:15 – Coffee Break

    16:15-16:45 – Christopher Waß (München): Demotisch, Hieratisch und SQL: Ein Beispiel für die Anwendung von DH in der Ägyptologie

    Research Area 3: How to Train Next Generations? Teaching
    16:45-17:15 – Julia Jushaninowa (Leipzig): E-learning Kurs “Verarbeitung digitaler Daten in der Ägyptologie”

    Research Area 4: How to Impact Society? Citizen Science and Public Engagement
    17:15-17:45 – Usama Gad (Heidelberg/Cairo): The Digital Challenges and Chances: The Case of Papyri and Papyrology in Egypt
    17:45-18:15 – Aris Legowski (Bonn): The Project is completed! What now? The Ancient Egyptian Book of the Dead – A Digital Textzeugenarchiv

    19:00 – Dinner at “Pascucci”

    Friday, November 6

    10:00-10:15 – Introduction to workshops

    10:15-12:15 – Workshops

    Workshop 1: Disruptive Technologies: Feature on 3D in Egyptian Archaeology
    (Chair: Felix Schäfer)
    short 10-min-presentations by:
    Hassan Aglan (Luxor): 3D tombs modeling by simple tools
    Rebekka Pabst (Mainz): Neue Bilder, neue Möglichkeiten. Chancen für die Ägyptologie durch das 3D-Design
    Room: Seminar Room next to Felix-Klein-Hörsaal

    Workshop 2 – Annotated Corpora: Trends and Challenges
    (Chair : tba)
    Room: Felix-Klein-Hörsaal

    12:15-13:00 – Summary and final Discussion, Outlook

    13:00 – Lunch Break & Departure of Participants

    Poster Presentations

    Isabelle Marthot (Universität Basel):
    Papyri of the University of Basel (together with Sabine Huebner and Graham Claytor)
    University of Minnesota Project: Ancient Lives, a crowd-sourced Citizen Science project

    Uta Siffert (Universität Wien)
    Project Meketre: From Object to Icon (together with Lubica Hudakova, Peter Jánosy and Claus Jurman)

    Journal “Digital Classics Online”
    Organizers & Contact

    Dr. Monica Berti
    Alexander von Humboldt-Lehrstuhl für Digital Humanities – Institut für Informatik
    Augustusplatz 10, 04109 Leipzig, Germany

    Dr. Franziska Naether
    Ägyptologisches Institut/Ägyptisches Museum – Georg Steindorff
    Goethestraße 2, 04109 Leipzig, Germany
    Telefon 0341 97-37146
    Telefax 0341 97-37029
    September 1, 2015 – August 31, 2016
    Volkswagen Visiting Research Fellow
    Institute for the Study of the Ancient World (ISAW), New York

    Collaborating Courses on Fifth-Century Greek History in Spring 2016?

    Call for Collaboration

    Gregory Crane
    Leipzig and Tufts Universities
    September 1, 2015

    This is a preliminary call for comment and for participation.

    I expect to be teaching an advanced Greek course in Spring 2016, quite possibly on Thucydides. I would like to explore the possibility of coordinating that teaching with others so that our students can interact and, ideally, collaborate across institutions and even across countries. Participating courses may also be in advanced Greek, or they can be on history, archaeology or any other subject relevant to the period. This model of collaboration can be applied quite broadly, and others may pursue such collaborations in their own subjects. My particular goal is to get something started in North America that focuses on Fifth-Century Greek History and that could feed into the Sunoikisis DC efforts that my colleague Monica Berti began in 2015.

    The goals of this collaboration are

    (1) to connect students, who may be in small and somewhat isolated classes or in larger lecture classes and who often have little sense that they are participating in a larger community of learners.

    (2) to enable students to contribute something as they learn and to leave behind contributions with their names attached. Larger lecture classes could, for example, contribute by analyzing people and places in English translations and thus participate in social network and/or geospatial analysis. Advanced language courses could, for example, contribute by treebanking texts.

    (3) to link courses in Europe, North America and elsewhere by exploiting the differing academic schedules in different countries. Students in North America who begin their semesters in January 2016 can aim to develop course projects and presentations that will feed into courses that begin at Leipzig in April, with US students presenting via videoconference to introduce topics and methods to students in Germany. In January and early February 2017, European students, who began their classes in October 2016, can reciprocate, presenting topics and methods to North American students, who can, in turn, present in April to the next semester of European students. Here we hope not only to help students develop ties across national boundaries but also to recognize that learning is an ongoing and cumulative process.

    One method of collaboration would be to participate in the 2016 version of Sunoikisis Digital Classics, which in turn builds upon the long-term efforts of the program that Harvard’s Center for Hellenic Studies supports. Collaborations can, however, take various forms. Different classes could, for example, focus upon a single task (e.g., treebanking selections of Greek historical sources, or comprehensive treebanking of a particular author). Different classes might create shared discussion lists or complementary projects (e.g., one class focusing on language and another on the material record). I particularly welcome anyone from the Boston area who would be interested in the possibility of having our students meet jointly in person one or more times.

    I welcome both public discussion in venues such as the Digital Classicist and private inquiries sent to me directly.


    Open Patrologia Graeca 1.0

    August 8, 2015
    Comments to

    Federico Boschetti, CNR, Pisa
    Gregory Crane, Leipzig/Tufts
    Matt Munson, Leipzig/Tufts
    Bruce Robertson, Mount Allison
    Nick White, Durham (UK) (and Tufts during 2014)

    A first stab at producing OCR-generated Greek and Latin for the complete Patrologia Graeca (PG) is now available on GitHub. This release provides raw textual data that will be of service to those with programming expertise and to developers with an interest in Ancient Greek and Latin. The Patrologia Graeca contains as much as 50 million words of Ancient Greek produced over more than 1,000 years, along with an even larger amount of scholarship and accompanying translations in Latin.

    Matt Munson started a new GitHub organization for this data because it is simply too large to fit into the existing OGL organization. Each volume can contain 250MB or more of .txt and .hocr files, so it is impossible to put everything in one repository or even several dozen repositories. He therefore created a new organization in which all the OCR results for each volume are contained within their own repository. This will also allow us to add more OCR data as necessary (e.g., from Bruce Robertson of Mount Allison University, or from nidaba, our own OCR pipeline) at the volume level.

    The repositories are being created and populated automatically by a Python script, so if you notice any problems or strange happenings, please let us know either by opening an issue on the individual volume repository or by sending us an email. This is our first attempt at pushing
    this data out. Please let us know what you think.

    Available data includes:

    Greek and Latin text generated by two open source OCR engines, OCRopus and Tesseract. The output format for both engines is hOCR, a format that contains links to the coordinates on the original page image from which the OCR was generated.

    OCR results for as many scans of each volume of the Patrologia Graeca as we could find in the HathiTrust. We discovered that the same OCR engine applied to scans of different copies of the same book would generate different errors (even when the scans seemed identical to most human observers). This means that if OCR applied to copy X incorrectly analyzed a particular word, there was a good chance that the same word would be correctly analyzed when the OCR engine was applied to copy Y. A preliminary study of this phenomenon is also available. In most cases, the OCRopus/Lace OCR contains results for four different scanned copies, while the Tesseract/AncientGreekOCR output contains results for up to 10 different copies. All of the Patrologia Graeca volumes are old enough that HathiTrust members in Europe and North America can download the PDFs for further analysis, and anyone should be able to see the individual pages used for OCR via the public HathiTrust interface.

    Initial page-level metadata for the various authors and works in the PG, derived from the core index at columns 13-114 of Cavallera’s 1912 index to the PG (also cited by Roger Pearse). A working TEI XML transcription, which has begun capturing the data within the print source, is available for inspection. All figures are preliminary and subject to modification (that is one motivation for posting this call for help), but we do not expect that they will change much at this point. At present, we have identified 658 authors and 4,287 works. The PG contains extensive introductions, essays, indices, etc., which we have tried to separate out by scanning for keywords (e.g., praefatio, monitum, notitia, index). We estimate that there are 204,129 columns of source text and 21,369 columns of secondary material, representing roughly 90% and 10% respectively. Since a column in Migne contains about 500 words and since the Greek texts (almost) always have accompanying Latin translations, the PG contains up to 50 million words of Greek text; many authors also have extensive Latin notes, and some have no Greek text at all, so there should be even more Latin.

    Next Steps

    Developing high-recall searching by combining the results for each scanned page of the PG. This entails several steps. First, we need to align the OCR pages with each other — page 611 in one volume may correspond to page 605 in another, depending upon how the front matter is treated and upon pages that one scan may have missed. Second, we need to create an index of all forms in the OCR-generated text available for each page in each PG volume. Since at least one of the OCR engines applied to multiple scans of the same page is likely to produce a correct transcription, a unified index of all the text from all the scans of a page will capture a very high percentage of the words on that page.
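The unified index described above can be sketched in a few lines of Python. The volume and column identifiers below are invented for illustration; the point is that a word garbled in one scan is still findable if any other scan or engine read it correctly:

```python
# Sketch of a unified high-recall index: every word form produced by any
# OCR run of a page is indexed, so a search succeeds if *any* scan or
# engine read the word correctly. All identifiers and text are illustrative.
from collections import defaultdict

ocr_runs = {
    ("PG-vol-35", "col-611", "scan-A"): "in principio erat uerbum",
    ("PG-vol-35", "col-611", "scan-B"): "in prlncipio erat verbum",  # OCR errors differ per scan
}

index = defaultdict(set)
for (volume, column, _scan), text in ocr_runs.items():
    for form in text.lower().split():
        index[form].add((volume, column))

# "verbum" was misread in scan B's "principio" but both readings are indexed,
# so a search for either correct form still finds the page.
print(sorted(index["verbum"]), sorted(index["principio"]))
```

Because the index is a union over runs, its recall only improves as more scans and engines are added, at the cost of some noise forms that a later correction pass can prune.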

    Running various forms of text mining and analysis over the PG. Many text mining and analysis techniques work by counting frequently repeated features. Such techniques can be relatively insensitive to error rates in the OCR (i.e., you get essentially the same results whether your texts are 96% or 99.99% accurate). Many methods for topic modelling and stylistic analysis should produce immediately useful results.
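A tiny numerical illustration of why frequency-based methods tolerate OCR noise (the text and error rate here are invented for the example):

```python
# Why counting-based analysis tolerates OCR noise: the relative frequency
# of a common feature barely moves when a small fraction of tokens is garbled.
from collections import Counter

clean = ("arma virumque cano troiae qui primus ab oris " * 50).split()
noisy = list(clean)
noisy[0] = "arrna"   # simulate a single OCR error among 400 tokens

freq_clean = Counter(clean)["arma"] / len(clean)
freq_noisy = Counter(noisy)["arma"] / len(noisy)
print(freq_clean, freq_noisy)   # nearly identical relative frequencies
```

With one error in 400 tokens the frequency of "arma" shifts by only 0.25 percentage points, so downstream statistics such as topic distributions or stylometric profiles are essentially unchanged.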

    Using the multiple scans to identify and correct errors and to create a single optimized transcription. In most cases, bad OCR produces nonsense forms that are not legal Greek or Latin. When one OCR run has a valid Greek or Latin word and others do not, that valid word is usually correct. Where two different scans produce valid Greek or Latin words (e.g., the common confusion of eum and cum), we can use the hOCR feature that allows us to include multiple possibilities. We can do quite a bit by encoding the confidence that we have in the accuracy of each transcribed word.
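The voting strategy just described can be sketched as follows; the tiny lexicon here is a stand-in for a real morphological analyzer of Greek and Latin, and the candidate readings are invented:

```python
# Sketch of choosing among OCR readings from different scans: prefer the
# reading that is a legal word; when several readings are legal (eum vs.
# cum), keep them all, as hOCR allows multiple alternatives per word.
# The lexicon is a hypothetical stand-in for a real morphological analyzer.
LEXICON = {"in", "principio", "erat", "verbum", "eum", "cum"}

def choose_reading(candidates):
    """Return the single legal reading, or all candidates if ambiguous."""
    valid = [c for c in candidates if c in LEXICON]
    if len(valid) == 1:
        return valid[0]          # exactly one run produced a legal form
    return candidates            # zero or several legal forms: keep everything

print(choose_reading(["verbum", "uerbvm"]))
print(choose_reading(["eum", "cum"]))
```

A production version would also attach a confidence score per reading (e.g., the number of runs agreeing), which is exactly the per-word confidence encoding mentioned above.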

    Providing a public error correction interface. One error correction interface already exists and has been used to correct millions of words of OCR-generated Greek, but two issues face us. First, we need to address the fact that we cannot ourselves serve page images from HathiTrust scans. HathiTrust members could use the system that we have by downloading the scans of the relevant volumes to their own servers, but that does not provide a general solution. Second, our correction environment deals with OCR for one particular scanned copy. Ideally, the correction environment would allow readers to draw upon the various scans from different copies and different OCR engines.
