The Open Philology Project and Humboldt Chair of Digital Humanities at Leipzig

Initial Research Plan (April 2013)
Alexander von Humboldt Chair of Digital Humanities
The University of Leipzig

Abstract: The Humboldt Chair of Digital Humanities at the University of Leipzig sees in the rise of digital technologies an opportunity to re-assess and re-establish how the humanities can advance the understanding of the past and support a dialogue among civilizations. Philology, which uses surviving linguistic sources to understand the past as deeply and broadly as possible, is central to these tasks, because languages, present and historical, are central to human culture. To advance this larger effort, the Humboldt Chair focuses upon enabling Greco-Roman culture to realize the fullest possible role in intellectual life. Greco-Roman culture is particularly significant because it contributed to both Europe and the Islamic world; the study of Greco-Roman culture and its influence thus entails Classical Arabic as well as Ancient Greek and Latin. The Humboldt Chair inaugurates an Open Philology Project with three complementary efforts that produce open philological data, educate a wide audience about historical languages, and integrate open philological data from many sources: the Open Greek and Latin Project organizes content (including translations into Classical Arabic and modern languages); the Historical Language e-Learning Project explores ways to support learning across barriers of language and culture as well as space and time; and the Scaife Digital Library focuses on integrating cultural heritage sources available under open licenses.

The Humboldt Chair of Digital Humanities at Leipzig will create the Open Philology Project. In this we advance a digital successor to that philology which sees in language a source for what Augustus Boeckh in 1822 termed “the understanding of all antiquity, including the events of both the physical and intellectual world.”[1] Philology brings the past to life as deeply and as broadly as possible through the use of surviving linguistic sources. From the human perspective, philology constitutes a set of language-based critical scholarly skills — not only annotating (annotation is the basic genre), but also comparing, connecting, interpreting, proving or rejecting hypotheses, and finding evidence. Critical apparatuses and commentaries often preserve the condensed fruits of such reasoning, and Open Philology will not let this scholarly heritage of manuscript and print culture vanish: it converts that heritage into digital form and uses it as a training ground for the next generations.

The Open Philology Project will initially focus particularly upon pre-modern society, but its methods and goals apply to any society for which traces of its languages survive. Philology provides an opportunity to advance the intellectual life of individual societies and, equally important, dialogue across civilizations, transcending not only barriers of space and time but also those of language and culture. Digital technology plays a critical role as a catalyst because — and only because — it allows us to re-imagine how we can more fully achieve, and indeed transform our ability to achieve, these ancient goals of philology. This is not a digital philology or digital humanities project. The Open Philology Project is about philology.

To address the vast challenge of an Open Philology that embraces all historical languages, the Humboldt Chair begins by advancing within a European and a global space the role of that Greco-Roman culture out of which Europe largely emerged. Greco-Roman culture has also contributed significantly to the Islamic world, and Europe in turn depended upon Arabic sources. Our goal in this activity is not only to increase the intellectual accessibility of European cultural heritage but also to foster exchange with other bodies of cultural heritage sources, such as Persian, Sanskrit, Classical Chinese, Egyptian from its earliest forms through Coptic, the cuneiform languages of the Ancient Near East, and Classical Mayan from the Western Hemisphere. As a platform for this activity, the Open Philology Project builds upon, and helps develop, the Perseus Digital Library, working with colleagues in Europe, North America and elsewhere to expand open collections and services and to reach an increasingly global audience.

The greatest challenge of humanistic scholarship lies, in our view, in making the human cultural heritage available to the global community. Digitization is a necessary but, by itself, insufficient step in this process. Human cultural heritage must be represented in a way that supports intellectual access across barriers of language and culture. This requirement in turn has implications for the technologies but also for the rights regime that we choose. Open data provides the best strategy by which to promote the circulation of sources within a global context. Collections that are protected behind subscription barriers may serve the interests of specialist communities. Collections that cannot be freely modified and re-circulated may be useful for reference. But scholarship in general and philology in particular must build upon open data if it is to realize its intellectual and social obligations to advance the common understanding of human culture. The Humboldt Chair is therefore committed to open publication, with machine-actionable Creative Commons licenses that require attribution and sharing of data while allowing commercial reuse (CC-BY-SA) as the preferred mode of distribution.

The larger Open Philology Project begins with three specific, complementary activities: creating comprehensive open resources, providing the education needed to understand and to contribute to those resources, and integrating open resources from many different sources into a unified computational framework for analysis, annotation, and preservation.

First, the Open Greek and Latin Project makes Greek and Latin sources freely accessible, both digitally and intellectually, to a global public. Second, the Historical Language e-Learning Project provides distributed e-learning of historical languages such as Greek and Latin so that as many as possible may penetrate as deeply as they choose into the sources from which the present has been fashioned. Third, support from the Humboldt Foundation allows us to contribute, after years of planning, to the Scaife Digital Library. The SDL develops methods to aggregate and integrate from various sources open data, textual and archaeological alike, in any medium, about human cultural heritage, including, but not limited to, the Greco-Roman world.

All three of these projects focus on the production, analysis, and preservation of machine-actionable annotations. All data about historical records is based upon transcriptions, whether from text-bearing objects or from sound recordings; transcriptions are themselves annotations that describe the textual content of a region of a written surface or a time interval in a recording. We will continue to make arguments in the digital successors to notes, articles and monographs, but we should increasingly integrate into those arguments, and use as their foundation, machine-actionable links to the sources upon which they are based. These links include not only citations of particular sources (e.g., a machine-actionable link to a particular reading in a particular edition of Aeschylus) but also links to aggregate data (e.g., the results of a search posed as they appeared at a particular time). In the end, born-digital notes, articles and monographs — even if they preserve labels inherited from the form of the book — may retain a family resemblance to their predecessors, but they will surely evolve into something qualitatively different as they adapt to the different gravity, if not fundamentally different physics, of a digital space.


1. The Open Greek and Latin Project.

The ultimate goal is to represent every source text produced in Classical Greek or Latin from antiquity through the present, including texts preserved in the manuscript tradition as well as on inscriptions, papyri, ostraca and other written artifacts. Over the course of the next five years, we will focus upon converting as much Greek and Latin as possible, currently available only as scanned printed books, into an open, dynamic corpus, continuously augmented and improved by a combination of automated processes and human contributions of many kinds. The focus upon Greek and Latin reflects both the belief that we have an obligation to disseminate European cultural heritage and the observation that recent advances in OCR technology for Greek and Latin make these intertwined languages ready for large-scale work. This focus also builds upon years of work by many projects in Greek and Latin, including Papyri.info, the Homer Multitext Project, the Inscriptions of Aphrodisias, and CIL Open Access.

The Open Greek and Latin Project aims to provide at least one version of all Greek and Latin sources produced during antiquity (through c. 600 CE) and a growing collection from the vast body of post-classical Greek and Latin that still survives. Perhaps 150 million words of Greek and Latin, preserved in manuscripts, on stone, on papyrus or on other writing surfaces, survive from antiquity. Analysis of 10,000 books in Latin, downloaded from Archive.org, identified more than 200 million words of post-classical Latin. With 70,000 public-domain books listed in the HathiTrust as being in Ancient Greek or Latin, the amount of Greek and Latin already available will almost certainly exceed 1 billion words.

Where existing corpora of Greek and Latin have generally included one edition of a work, the Open Greek and Latin corpus is designed to manage multiple versions of, and to represent the complete textual history of, a work: every manuscript, every papyrus fragment, and every printed edition is a version within the history of a text. In the short run, this involves using OCR technology optimized for Classical Greek and Latin to create an open corpus that is reasonably comprehensive for the c. 150 million words produced through c. 600 CE and that begins to make available the billions of words of Greek and Latin that survive from after 600 CE.

The Open Greek and Latin Project comprises the following modules:

A. The Philological Workflow Module enables a digital representation of a written source, available in 2D or 3D form, to be converted into machine-actionable text, corrected, and annotated with an increasing range of information: named entities, morphology, syntax, and other linguistic features; alignments between different versions of the same text, whether in the same language or translated across multiple languages; and text re-use detection, including quotation, paraphrase and citation. Automated methods include optical character recognition, text alignment, syntactic parsing, etc. In each case, human annotation can augment automated annotations or substitute for them altogether where automated methods are not yet able to produce adequate initial results (e.g., manual transcription of inscriptions and medieval manuscripts).
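To make this workflow concrete, here is a minimal sketch in Python of how an annotation might carry its provenance through such a pipeline so that low-confidence automated output can be routed to human correctors. The field names, example text, and threshold are all hypothetical, not part of any existing implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Annotation:
    """A single machine-actionable annotation attached to a span of text."""
    target: str          # citation of the annotated span, e.g. a passage URN
    layer: str           # "transcription", "morphology", "syntax", "entity", ...
    value: str           # the annotation content itself
    method: str          # "automated" (OCR, parser, ...) or "manual"
    confidence: Optional[float] = None  # estimated reliability, if known

def needs_review(ann: Annotation, threshold: float = 0.9) -> bool:
    """Manual annotations are trusted; automated annotations below the
    threshold (or with unknown confidence) are queued for correction."""
    if ann.method == "manual":
        return False
    return ann.confidence is None or ann.confidence < threshold

ocr_line = Annotation("page12.line3", "transcription", "arma uirumque cano",
                      method="automated", confidence=0.72)
print(needs_review(ocr_line))  # True: noisy OCR output gets flagged
```

The same record structure lets later modules ask not just *what* an annotation says but *how* it came to be, which is the precondition for the review and repository modules below.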

B. The Distributed Review Module provides a range of options by which to assess and represent the reliability of data produced, whether by automated systems or by human contributors, as part of the Philological Workflow. In many cases annotations can be released even when their reliability is not necessarily high (e.g., noisy OCR-generated text). The point is to identify annotations that most require subsequent attention, whether manual correction or action of some other kind (e.g., poor OCR data may reflect the need to create a new scan of a printed book). The Distributed Review Module assumes that multiple annotations may be equally trustworthy (i.e., experts back different interpretations) and can track inter-annotator disagreement among experts. The Distributed Review Module provides default values but also allows for different weights to be placed upon different validations (e.g., include all readings in a particular version of a text, whether these are the readings in a particular manuscript or the readings chosen and emendations proposed by a particular editor, or include all prosopographical identifications proposed by one particular scholar). The Distributed Review Module should support searching by text characteristics (specific passages, authors), annotator characteristics (expert, novice, native language, etc.), and annotation characteristics (emendations, grammatical or interpretive comments, degree of inter-annotator disagreement, etc.). It should also permit browsing the history of annotation by passage, annotator, magnitude of disagreement, etc.
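One standard way to quantify the inter-annotator disagreement mentioned above is chance-corrected agreement. The sketch below is an assumption about how such a measure might be computed, not a description of existing module code; it implements Cohen's kappa for two annotators who tagged the same tokens:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators who labelled
    the same items (e.g. morphological tags for the same tokens)."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[lab] * counts_b[lab]
                   for lab in set(a) | set(b)) / (n * n)
    if expected == 1.0:          # both annotators used a single label
        return 1.0
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators tag six tokens as noun/verb/adjective:
ann1 = ["N", "V", "N", "A", "N", "V"]
ann2 = ["N", "V", "A", "A", "N", "N"]
print(round(cohens_kappa(ann1, ann2), 2))  # 0.48: moderate agreement
```

A score near 1.0 marks passages experts agree on; low or negative scores flag exactly the annotations that most deserve subsequent attention.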

C. The Philological Repository Module can preserve all published philological data, including the transcriptions and all subsequent annotations (e.g., identifying a transcribed word as being in Latin, a place name, in the accusative case etc.) as well as the provenance of each annotation (e.g., the annotation is born-digital and was published by a particular individual at a given time or the annotation was extracted from a print book by a particular author and published at a given time, with or without human verification, and with an estimated accuracy). The repository is based upon the Canonical Text Services/CITE Architecture for textual sources developed by researchers at the Center for Hellenic Studies within the larger framework developed by the DataConservancy.org.
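The Canonical Text Services architecture underlying the repository identifies passages with CTS URNs. A small illustrative parser follows; the handling here is a simplification of the full specification (which also allows exemplars and subreferences), and the example URN cites Iliad 1.1 in a Perseus edition:

```python
def parse_cts_urn(urn):
    """Split a CTS URN into its components.

    General form: urn:cts:<namespace>:<textgroup>.<work>[.<version>][:<passage>]
    (a simplification of the full CTS specification).
    """
    parts = urn.split(":")
    if parts[:2] != ["urn", "cts"] or len(parts) < 4:
        raise ValueError(f"not a CTS URN: {urn}")
    work = parts[3].split(".")
    return {
        "namespace": parts[2],
        "textgroup": work[0],
        "work": work[1] if len(work) > 1 else None,
        "version": work[2] if len(work) > 2 else None,
        "passage": parts[4] if len(parts) > 4 else None,
    }

# Iliad 1.1 in a specific Perseus edition of the Greek text:
iliad = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1")
print(iliad["textgroup"], iliad["passage"])  # tlg0012 1.1
```

Because every component is optional below the textgroup, the same scheme can cite a whole work, one edition of it, or a single line, which is what lets annotations and their provenance attach at any granularity.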

D. The e-Portfolio Module aggregates and distributes particular subsets of user contributions for particular audiences. The e-Portfolio Module can identify any published contributions according to type, date, and author (e.g., all syntactic analyses published by a particular person during a particular time interval). The e-Portfolio Module can also make selected materials that are not yet published available to selected audiences (e.g., an editorial board or the admission committee for a degree program). The Perseids Project from Tufts University provides a starting point for this work.
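A portfolio query of the kind described might, in a minimal sketch, reduce to filtering published contributions by author, type, and date. The names and records below are invented for illustration:

```python
from datetime import date

contributions = [
    # (author, kind, published) -- hypothetical published contributions
    ("anna",  "syntax",      date(2013, 2, 10)),
    ("anna",  "translation", date(2013, 3, 1)),
    ("marco", "syntax",      date(2013, 2, 20)),
]

def portfolio(author, kind=None, since=None):
    """Select a user's published contributions by type and date."""
    return [c for c in contributions
            if c[0] == author
            and (kind is None or c[1] == kind)
            and (since is None or c[2] >= since)]

print(portfolio("anna", kind="syntax"))  # anna's one syntactic analysis
```

The same selection logic, applied to unpublished records with an audience filter added, would cover the editorial-board and admissions use cases mentioned above.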


2. The Historical Language e-Learning Project.

Anyone, anywhere, regardless of their linguistic or cultural background, and whether or not they are students in a formal curriculum, should be able to learn as much of a historical language as they need to work directly with original-language primary materials. Work in this context entails not only learning but also contributing early and in increasingly sophisticated ways. Students can add new data, or correct existing data, as they learn to type in an unfamiliar language; they can, in the language of gaming, “level up” to tasks such as linguistic annotation of new materials and the production of aligned, modern-language translations; and they can see their growing proficiency concretely visualized in a way that permits comparison with others and documents their achievement for use in e-portfolios and other records.

In the short run, building upon existing collections and services, we will support students working with Greek, Latin and Classical Arabic texts in a system readily localized for speakers of multiple modern languages (with Croatian, English, German and French emerging as initial languages of interest). The Historical Language e-Learning Project is based upon the existence of extensible richly annotated corpora. Learners draw from the start on existing richly annotated corpora and on images of sources such as manuscripts and inscriptions. They use morpho-syntactic annotation, dictionary links, and aligned modern language translations, so that they immediately work with primary sources in the original. They learn grammar by comparing their morpho-syntactic analyses with vetted analyses already available, by creating their own aligned translations, and by using annotations and alignments to develop active as well as passive mastery of morphology, syntax, and vocabulary. They demonstrate advanced ability by expanding the corpus of richly annotated materials, proposing new annotations of their own and reviewing annotations proposed by others.

Large collections such as Gallica, Google Books, and the Internet Archive have already made billions of words of Greek and Latin available to a global audience — a far larger collection than the small handful of advanced researchers can document and a far broader collection in terms of genre and style than the classical corpora on which current programs in Greek and Latin still focus. While the amount of openly licensed Classical Arabic is not yet as extensive, more than enough sources are available and require documentation and analysis. We need to train a new generation of students who can directly analyze sources in the original languages and make substantive contributions earlier and on a wider range of sources than has previously been feasible.

Traditional programs of Ancient Greek and Latin are not designed to support students who first develop an interest in these languages during their undergraduate careers — by the time students are able to begin interacting proficiently with the primary sources, they are ready to graduate. Traditional class schedules are rigid and rarely can an institution offer more than one section of an ancient language. As for Classical Arabic, few institutions offer any formal instruction at all — Modern Language Association statistics report only 285 students enrolled in Classical Arabic in the United States in 2009.

At the same time, Ancient Greek, Latin, and Classical Arabic must also compete for students with fields where students regularly contribute as members of laboratory teams and can often expect to develop their own research projects as undergraduates. The strongest academic programs not only demand that students master complex disciplinary knowledge but also provide students with an opportunity to use that knowledge to make substantive contributions and to develop significant research projects of their own.

Open Greek and Latin creates an inexhaustible range of substantive activity to which any student of these languages can aspire — whether working on manuscripts of well-known authors (e.g., the Homer Multitext Project), creating the first modern language translations of Greek and Latin sources (e.g., Tufts’ Medieval Latin), or adding critical linguistic annotation (e.g., the Perseus Greek and Latin Treebanks).

The Historical Language e-Learning project depends upon the following:

A. Global Editions of Historical Languages include all features of a traditional edition (including textual notes) but are designed to make primary sources available to the widest possible audience. Global editions are richly encoded source materials that include enough annotation so that readers with a general understanding of grammar and of language are able to work directly with primary sources in a historical language that they have not studied. Core elements of this infrastructure include morphological and syntactic analyses, links to machine-readable dictionaries (ideally with data about the sense of a given word in a given context), and one or more aligned modern-language translations that themselves have substantial annotation and are designed to facilitate machine translation into many other modern languages.
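The aligned translations at the heart of a global edition can be represented very simply: each source token points at the translation tokens that render it. A toy sketch follows; the alignment itself is illustrative, not an editorial claim:

```python
# Opening of the Aeneid with an English rendering; the alignment maps
# source-token indices to translation-token indices.
source = ["arma", "virumque", "cano"]
translation = ["I", "sing", "of", "arms", "and", "the", "man"]
alignment = {0: [3], 1: [4, 5, 6], 2: [0, 1]}  # e.g. "cano" -> "I sing"

def translated_words(src_index):
    """Return the translation tokens aligned to one source token."""
    return [translation[j] for j in alignment.get(src_index, [])]

print(translated_words(2))  # ['I', 'sing']
```

A reader who has never studied Latin can follow such links word by word, and a learner can demonstrate progress by producing new alignments of this kind.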

B. Preliminary Source Texts are digital texts that do not yet have the mature annotations needed for global editions. Students of a language can aspire to begin adding new annotations within the opening weeks of study, working at first with each other and with their instructors but ultimately working to level up to roles with more trust and responsibility as they demonstrate their increasing skills. It is both a goal and a necessity to engage students as collaborators, because we believe that this is a good thing in itself, because we believe that this increases learning, and because so many historical sources are already available that we cannot depend upon a handful of professionals to analyze and annotate them all.

C. Machine Actionable Models of Language Competence provide methods by which to assess knowledge of historical languages at every level, from introductory exposure through standardized examinations (e.g., the US-based National Latin Exam, the German Graecum and Latinum) to the various PhD-level examinations (e.g., US PhD programs in Greek and Latin commonly have combined reading lists of between 500,000 and 1,000,000 words of Greek and Latin). Just as the city of Arpino holds the Certamen Ciceronianum Arpinas, a multinational competition in which students from various nations can each compete in their own national language, so we can create an on-going contest where students from around the world and from widely disparate backgrounds meet to compare their skills and compete to shed light upon Greco-Roman culture. Machine Actionable Models of Language Competence can be configured for various purposes and pedagogical perspectives. The Competence Models also provide mechanisms for the evaluation of competence across national languages: examinations on morphology and syntax provide a powerful measure of competence and can be effectively localized, whether the student speaks Arabic or Croatian, English or Lithuanian. The Distributed Review Module provides an environment for the assessment of language competence as well as for advanced publications.
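One simple, language-neutral proxy for reading competence, sketched below under the assumption that both the learner's knowledge and the target text are available as lemmatized data, is the fraction of running words in a text whose lemma the learner already knows:

```python
def coverage(known_lemmas, text_lemmas):
    """Fraction of running words in a text whose lemma the learner knows --
    one simple, machine-actionable proxy for reading competence."""
    if not text_lemmas:
        return 0.0
    return sum(l in known_lemmas for l in text_lemmas) / len(text_lemmas)

# Hypothetical learner vocabulary and a lemmatized passage:
known = {"sum", "et", "in", "dico", "homo"}
passage = ["homo", "sum", "et", "nihil", "humanus"]
print(f"{coverage(known, passage):.0%}")  # 60%
```

Because the measure works on lemmas rather than on any national language, the same score is meaningful whether the learner's grammar and glosses are in Arabic, Croatian, English or Lithuanian.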

D. Localized Learning Materials include grammars, lexica, and translations in a national language. Localized Learning Materials need to be able to be shared across, and customized for, many different languages. Within Europe alone, Greek and Latin, for example, are taught in more than thirty different national languages.[2] We need not only to maintain learning materials in dozens of languages but also to provide learning materials in languages where Greek and Latin are not part of formal academic curricula. To accomplish this we must represent as much information about the language as possible in a machine-actionable form that can be efficiently rendered in many languages. We also need to provide an architecture that supports customization for particular languages, especially the creation of aligned translations that contain from the start links between the source text and the modern-language translation.

E. Dynamic Syllabi can be analyzed to track the linguistic phenomena that students have encountered (e.g., vocabulary, grammar) and the content that they have covered. As students pursue different dynamic syllabi at different times, they can track their overall background and identify the prerequisites for subsequent courses. Instructors in structured classes can generate personalized background readings and examinations that reflect both what students brought with them to, and what they covered in, the class. The e-Portfolio Module uses Dynamic Syllabi to accomplish these goals.
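Because syllabi are machine-actionable, the gap analysis described here reduces to set arithmetic over the phenomena a student has encountered. A minimal sketch with invented grammar topics and course names:

```python
# Phenomena covered by two syllabi a student has completed (hypothetical):
syllabus_a = {"ablative absolute", "indirect statement", "cum clauses"}
syllabus_b = {"indirect statement", "gerundive", "conditions"}

# Phenomena assumed by a course the student wants to take next:
caesar_course = {"ablative absolute", "indirect statement",
                 "gerundive", "fear clauses"}

covered = syllabus_a | syllabus_b   # everything the student has met
gap = caesar_course - covered       # what still needs targeted review
print(sorted(gap))  # ['fear clauses']
```

The instructor's personalized background readings then follow directly: generate review material for exactly the phenomena in the gap.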

F. Personalized E-learning Tools analyze the individual behaviors of particular learners and provide personalized analyses and suggestions, reflecting the strengths and learning styles of particular students. Personalized e-learning tools allow learners not only to track their progress towards target proficiencies but also to personalize the target proficiencies themselves: students of Homer will have different targets than those of the New Testament or of Plato, while students aspiring to fluent comprehension will have different needs than intellectual historians who wish to explore word usage or linguists interested in syntactic phenomena. The goal is to provide as much feedback as possible, as quickly as possible, and as closely adapted to the needs and interests of each learner as possible.

3. The Scaife Digital Library (SDL)

The Scaife Digital Library (SDL) commemorates Ross Scaife (March 31, 1960 – March 15, 2008), who did pioneering work for the study of Greco-Roman culture in a digital age, who was committed to collaborative scholarship, and who was a champion of open data. The SDL is designed as a service, as an experiment, and as a space for research. An increasing number of Ancient Greek, Latin, Classical Arabic and other sources are available under appropriate open licenses. The SDL builds upon the services and collections listed above. The SDL provides a mechanism by which to compare these services and collections with those available elsewhere, allowing research at the Humboldt Chair to explore new methods while making its own work more visible.

As a service, the SDL will aggregate as much content in these languages as possible, converting it, where necessary and feasible, into interoperable formats. The goal of this service is to provide a single space to represent all published Ancient Greek, Latin, and Classical Arabic. In this context, publication entails release under an open license. Proprietary collections are neither public nor published.

As an experiment, the SDL will track how many sources in how many historical languages and of how many types it can identify and integrate and then track this data over time. This experiment attempts to measure both our ability to find materials and the change in what is available. It is our hope that the growth in available resources in languages beyond Greek, Latin, and Arabic will greatly outstrip the ability of the SDL to aggregate and analyze them.

As a research space, the SDL collects not only metadata but also source materials into a single environment. Within this space researchers can explore customized collections (e.g., all available versions and translations of the Odes of Horace or the aligned corpus of Classical Arabic translations and Greek sources) or simply analyze all available Greek. While the SDL will collect as widely as possible from open collections representing cultures from around the world, it intends to provide the most comprehensive possible coverage and services for students of Ancient Greek, Latin, and Classical Arabic.

[1] Augustus Boeckh, “Oratio nataliciis Friderici Guilelmi III.” (1822): “Itaque ubi, quae et qualis philologia meo iudicio sit, quaeritis, simplicissima ratione respondeo, si non latiore, quae in ipso vocabulo inest, potestate accipitur, sed ut solet ad antiquas litteras refertur, universae antiquitatis cognitionem historicam et philosophicam.”

[2] http://www.eduhi.at/gegenstand/EuroClassica/?TITEL=EuroClassica+in+Europe&modul=europamap

Posted in Announcement, General, Historical languages | Tagged | 7 Comments

Possible Jobs in Digital Humanities at Leipzig [Please forward]

The Humboldt Chair of Digital Humanities and the Department of Computer Science at the University of Leipzig are looking for candidates for two possible collaborating research groups, one focused on reinventing scholarly communication for Greek and Latin, as a case study for historical languages in general, the other helping the University Library develop methods to manage and visualize billions of words and associated annotations of many kinds. Details of the funding are being finalized, but positions will ideally start in May 2013 with an initial one-year contract that could be extended to a second year and that could include one semester of residence at a US university.

Candidates must have received their most recent degree after January 4, 2011. Current degree candidates may also be considered. We are building a team with varied backgrounds: team members have expertise in Greek and Latin, in software analysis and development, and in working with metadata models that are relatively well established (TEI XML, Functional Requirements for Bibliographic Records, CIDOC CRM) as well as those that are just beginning to be exploited (e.g., the full potential of the Europeana Data Model). Project members should be prepared to participate in all forms of intellectual life, including research, both within the humanities and the information sciences, software development, supervising student researchers, delivering presentations before specialist and general audiences, writing, and teaching.

Interested candidates should send a letter of interest, briefly describing how they could contribute to one of these teams, a CV, and the names of three references to dig-hum-jobs@e-humanities.net.

The work will have several complementary tracks:

  1. Open Greek and Latin: Classicists need comprehensive, open collections of Greek and Latin that anyone can download, modify, and then republish. The long-term goal of the Open Greek and Latin Project is to represent the full surviving corpus of Greek and Latin sources, including transcriptions from every print source. This will include not only print books but also manuscripts, inscriptions, ostraca, papyri, vases, etc., and will cover the full range of Greek and Latin sources, from the Homeric epics through post-classical Greek and Latin to the present. In the short run, we focus on providing comprehensive coverage for the c. 100 million words of Greek and Latin that survive through c. 600 CE and opportunistic coverage for the billions of words of surviving post-classical Greek and Latin. The Open Greek and Latin Project integrates the growing body of Greek and Latin available under a Creative Commons license while drawing upon vast collections of scanned editions and new Canadian-Italian research on generating and correcting Greek and Latin. Coverage will include TEI XML transcriptions of editions that are in the public domain and machine-actionable RDF equivalents for traditional indices of the places where a new edition differs from its most significant predecessor. The whole collection — including textual transcriptions, structural metadata, as well as linguistic and other machine-actionable data — will be available in an RDF format developed to interoperate as closely as possible with the Europeana Data Model.
  2. Decentralized editing and annotation: The Open Greek and Latin corpora represent a foundation and starting point for further work. We need methods by which to support annotations of every kind, including not only corrections of OCR errors but also new translations (which are a kind of annotation), data driven studies of textual transmission, textual reuse and the general circulation of ideas across time, space, language and culture, prosopography, and morphological, syntactic, semantic and lexical analyses. We need to support a growing range of machine actionable annotations, each of which represents a nano-publication that may be accompanied by expository prose argumentation and/or additional machine actionable annotations. As students of Greek and Latin begin to confront the opportunities and challenges presented by global access to collections measured in billions — rather than millions — of words, we need to be able to manage contributions from student researchers and citizen scholars as well as from faculty and library professionals.
  3. Transnational systems for Greek, Latin, and other historical languages: Greek and Latin are fundamentally transnational languages — no nation has a unique claim to the intellectual and linguistic heritage of these languages, which together provide a major cultural foundation for what is now the European community. More than 20 organizations representing communities speaking Croatian, Czech, Danish, Dutch, German, English, French, Italian, Lithuanian, Macedonian, Polish, Portuguese, Romanian, Spanish, and Swedish have representatives in the General Assembly of Euroclassica, a European federation of associations of teachers of classical languages and civilisation, while a European Curriculum Framework for Classical Languages (ECFRCL) is already under active development. Language learning based upon interacting with, and then contributing to, richly annotated corpora can provide rapid and plentiful feedback, allowing learners to engage immediately with primary sources while also enabling them to begin making substantive contributions to the field early and often. At the same time, many scholarly arguments include statements that can be represented in a machine actionable format that can be made available to many different language communities. In some cases (as in publications about prosopography or textual criticism) the conclusions of the argument can be represented as machine actionable annotations. Greek and Latin studies provide a particularly interesting space within which to develop methods by which speakers of many languages can collaborate in learning and research because these classical languages are not (unlike English or French) associated with modern hegemonic nations.
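The machine-actionable apparatus entries mentioned in the first track can be pictured as simple subject-predicate-object statements. The sketch below uses plain Python tuples rather than an RDF library, and the variant reading, edition identifier, and predicate names are all invented for illustration:

```python
# A hypothetical apparatus entry as subject-predicate-object triples,
# in the spirit of the RDF encoding described above: where a new
# edition differs from its predecessor, record the locus, the lemma,
# and the rejected reading with its witness.
variant = [
    ("entry:1", "locus",   "urn:cts:latinLit:phi0690.phi003.ed2:1.1"),
    ("entry:1", "lemma",   "arma"),
    ("entry:1", "reading", "r1"),
    ("r1",      "text",    "arva"),
    ("r1",      "witness", "editio-prior"),
]

def objects(triples, subject, predicate):
    """All objects attached to a subject by a given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects(variant, "r1", "text"))  # ['arva']
```

Because each statement stands on its own, such entries can be merged across editions, queried by locus, and exported to a real triple store mapped onto the Europeana Data Model.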
Posted in Job(s) | 2 Comments

Querying the Perseus Ancient Greek and Latin Treebank Data in ANNIS

We are pleased to announce the availability of the Perseus ANNIS Environment for searching the syntactical data in the Perseus Ancient Greek and Latin Treebanks.

ANNIS is an open source, versatile web browser-based search and visualization architecture for complex multilevel linguistic corpora with diverse types of annotation.

The Perseus Ancient Greek Dependency Treebank includes the entirety of Homer’s Iliad and Odyssey; Sophocles’ Ajax, Antigone, Electra, Oedipus Tyrannus and Trachiniae; Plato’s Euthyphro; and all of the works of Hesiod and Aeschylus – a total of 354,529 words. The Perseus Latin Dependency Treebank includes 53,143 words from eight authors, including Caesar, Cicero, Jerome, Ovid, Petronius, Propertius, Sallust and Vergil.

Posted in Treebanks | 2 Comments

Marie-Claire Beaulieu Named Associate Editor of the Perseus Digital Library

We are pleased to announce that Marie-Claire Beaulieu, Assistant Professor of Classics, has been named Associate Editor of the Perseus Digital Library.

Hired in 2010, Professor Beaulieu has research interests in Greek religion and myth, epigraphy, and medieval Latin. Upon her arrival at Tufts, Professor Beaulieu immediately used digital technology to give undergraduates and MA students opportunities to make contributions and to conduct research, thus advancing an intellectual culture in which learning and research are intermeshed and reinforce each other. The thirteen undergraduate and graduate students—plus one community auditor—enrolled in her “Medieval Latin” class in Winter 2011 deciphered, translated, and annotated rare Latin manuscripts and early printed book leaves from “The Tisch Library Miscellany.” Their work was published online on the Tisch Library Miscellany page and has made a significant contribution to scholarship in the field by providing a valuable resource for professors, librarians, and scholars at Tufts and around the world, who can now access translations and annotations for their teaching and research.

Her work with the Tisch Library led her to receive both an internal Tufts award and an NEH Startup Grant to generalize the infrastructure needed to support—and manage over time—the increasingly varied and complex contributions of students at Tufts and elsewhere. These contributions include editorial work, translations, commentaries, and annotations. In addition, Prof. Beaulieu is working on ways to enhance the teaching of Greek mythology and religion with the use of digital technology.

Perseus staff and collaborators enthusiastically welcome her aboard!

Posted in Announcement, General | Comments Off on Marie-Claire Beaulieu Named Associate Editor of the Perseus Digital Library

Suggestions for new Greek, Latin texts? English translations?

[Please repost!]

We are preparing for a new set of texts to be entered by the data entry firm with which we work (http://www.digitaldividedata.org/). The next order will be sent in mid-December, but a more substantial order will be placed early in 2013.

What would you like to see added to the Perseus Digital Library, both for use within the Perseus site and for download as TEI XML under a Creative Commons license? Note that we only enter materials that are in the public domain and that can be freely redistributed for re-use by others.

Some possibilities — but please suggest other things that you find important!

* Scholia of Greek and Latin authors

* Collections of fragmentary authors

* Sources from later antiquity (esp. Christian sources)

* More English translations

Please think about (1) individual authors and texts and (2) what you would want to see if we could do something big.

If you have individual suggestions, please write to gcrane2008@gmail.com. A public discussion via the Digital Classicist would probably be best.

Let us know what you want!

Posted in Suggestions | Comments Off on Suggestions for new Greek, Latin texts? English translations?

Annotation Service Beta

Announcing the beta availability of the Tufts Syntactic Annotation service. This service provides RESTful and Service Layer APIs for requesting (a) syntactic annotations from supported annotation repositories and (b) templates for creation of syntactic annotations of passages and texts. The beta instance at http://sosol.perseus.tufts.edu/bsp/annotationservice is currently configured to provide access to the Alpheios repository of the Perseus Ancient Greek and Latin treebank annotations. These annotations can be identified and retrieved by CTS URN. The service can also retrieve text for which to create annotation templates from any CTS-API compliant repository, and can be configured to access other annotation and text repositories. The Syntactic Annotation service leverages the Salt-n-Pepper Framework for converting between annotation formats. The current release supports the Perseus import format and the Perseus and PAULA export formats.
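As a sketch of how a client might call the service, the snippet below builds a request URL for an annotation identified by CTS URN. The base URL comes from the announcement above, but the `/annotation` path segment and the `urn` and `format` parameter names are illustrative assumptions, not the documented API:

```python
from urllib.parse import urlencode

# Base URL of the beta annotation service instance (from the announcement).
BASE = "http://sosol.perseus.tufts.edu/bsp/annotationservice"

def annotation_request_url(urn, fmt="treebank"):
    """Build a request URL for a syntactic annotation identified by CTS URN.

    The path and parameter names here are illustrative; consult the
    service documentation for the actual RESTful API signature.
    """
    query = urlencode({"urn": urn, "format": fmt})
    return f"{BASE}/annotation?{query}"

# A CTS URN for Iliad 1.1 in the Perseus Ancient Greek treebank (illustrative).
url = annotation_request_url("urn:cts:greekLit:tlg0012.tlg001:1.1")
print(url)
```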

Currently available instances of the Bamboo Services Platform are unsecured, are operated with no explicit SLA, and should be considered stateless: that is, data may be wiped from persistent stores at any time.

Funding for the development of this service was provided by Tufts University, Project Bamboo and the Andrew W. Mellon Foundation.

Posted in Release, Technology | Comments Off on Annotation Service Beta

Morphology Service Beta

Announcing the beta availability of service-based access to the morphological engines used by Perseus for Latin and Greek (Morpheus) and Arabic (Buckwalter). This service leverages a standard Morphology Service API and is made available on an instance of the Bamboo Services Platform at http://services.perseids.org/bsp/morphologyservice/analysis/word

Examples:

http://services.perseids.org/bsp/morphologyservice/analysis/word?lang=lat&engine=morpheuslat&word=mare

http://services.perseids.org/bsp/morphologyservice/analysis/word?lang=grc&engine=morpheusgrc&word=ἱστορίης

The Morphological Analysis Service responds to requests for morphological analysis of texts, submits them to the appropriate morphology engine for processing, and returns the results in XML adhering to a standard morphology schema. The Service supports retrieval of texts for analysis from remote repositories as well as user-supplied chunks of text. URL based and CTS repositories are supported. Where retrieval from a CTS enabled repository is requested, CTS URNs are supported as document identifiers. Where retrieval from a URL based repository is requested, URIs are supported as document identifiers.
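The two example URLs above show the service's actual request shape (`lang`, `engine`, and `word` parameters). A small Python helper can build such requests, handling the URL-encoding of Greek input automatically:

```python
from urllib.parse import urlencode

# Endpoint from the announcement.
BASE = "http://services.perseids.org/bsp/morphologyservice/analysis/word"

def analysis_url(word, lang, engine):
    """Build a word-analysis request URL using the parameters shown in the
    announcement's examples (lang, engine, word)."""
    return BASE + "?" + urlencode({"lang": lang, "engine": engine, "word": word})

# Reproduce the Latin example from the post.
latin = analysis_url("mare", lang="lat", engine="morpheuslat")
# Non-ASCII Greek input is percent-encoded as UTF-8 by urlencode.
greek = analysis_url("ἱστορίης", lang="grc", engine="morpheusgrc")
print(latin)
```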

Currently available instances of the Bamboo Services Platform are insecure, are operated with no explicit SLA, and should be considered stateless: that is, data may be wiped from persistent stores at any time.

A secure instance of the BSP for which data will be preserved on future upgrades is anticipated in Fall 2012.

Funding for the development of this service was provided by Tufts University, Project Bamboo and the Andrew W. Mellon Foundation.

Note: links updated (LMC).

Posted in Release, Technology | Comments Off on Morphology Service Beta

Help the SDL!

Advance our understanding of the Greco-Roman World!
Contribute to the Scaife Digital Library — improve existing materials and create new ones!

If you want to understand the present and invent the future then FREE THE PAST!

Everyone can make a difference! Read the sources more closely yourself, learn something for your own enjoyment and at the same time enhance what is available to everyone else!

Translate that text! Only a fraction of the billions of words of surviving Greek and Latin — no more than 5 or 10% — has been translated into modern languages such as English. The idea of Europe was invented in Greek and then in Latin and developed over thousands of years — but currently less than ⅓ of 1 percent of all US college graduates enroll in a Greek or Latin class. Help the other 99.6% understand much of what made Europe and the Americas so that they can help decide where they want to go!

What does that word really mean? We have been studying Greek and Latin but how well do we really understand either the languages or the cultures that they represent? Open content lexica and grammars for Classical Greek, Latin, and Arabic are available, along with commentaries with thousands of notes on Classical Greek and Latin texts. And wholly new instruments such as parallel corpora that align source texts with modern translations allow us to detect patterns of meaning in ways that were never before possible. You can help and learn more about Greek, Latin and thousands of years of cultural history as you do it.

Who/what/where is that? Wikipedia provides broad coverage for many ancient topics, but the reference works available in the SDL contain direct citations into the primary sources that general resources such as Wikipedia often leave out. The Smith Biographical and Geographical Dictionaries contain fundamental information about more than 30,000 people and places. All of these reference works contain written citations to the primary sources upon which they are based. Many of these citations have been converted to machine actionable links, but the automated programs are never perfect. Check the citations that are there — how do they relate to the text as it stands? What citations were missed — you might be surprised at the range of sources that we do have. And are there important sources that should be cited?

Fix that text! Help us add more text — not only can you correct errors in OCR-generated text in front of you but you can create training data that can improve the performance of OCR on millions of words in a similar font or genre! And correcting OCR output allows you to learn how to type Classical Greek accurately and quickly!

Decode that manuscript! OCR software can do a lot with well printed books but it can’t do much with manuscripts or inscriptions, papyri or even early printed books. Adopt a text, read it carefully, share what you find and make that document visible in a digital world centuries or thousands of years after chisel met stone or stylus met parchment! Start with something simple — and then see if you can become a palaeographer and decode the shorthand of writers who lived in a world long vanished.

More carefully structured text! Digital texts don’t just use italics and bold — they can precisely describe their contents, making them easier for readers to understand and supporting more sophisticated forms of analysis. Join the Athenaeus Project to see what ancient audiences had read and to trace a network that connects tens of thousands of passages from almost a thousand years of Greek. Become a scholarly CSI force and help reconstruct sources that survive only because they are quoted, paraphrased or mentioned!

Where is that place? Is that Alexandria the one in Virginia, the city in Egypt — or one of many other cities that Alexander planted around the Middle East? Help us generate accurate maps of the places in our sources to help others understand what they are reading and to support new ways of understanding how ancient authors conceptualized their world!

Who is that person? Which Caesar is that, anyway? Is that the famous Cleopatra or one of her many namesakes? And help us distinguish the several Alexanders of Macedon from their descendant, the famous conqueror!

What does that word mean? Help reinvent our understanding of Greek and Latin! You can tell us which word sense in one of many dictionaries a particular passage intends us to understand — and you can learn a lot about Greek and Latin! Or you can help line up Greek and Latin words with their corresponding words in modern translations — and you can not only help readers read more closely but also build parallel texts, one of the most important tools in the modern arsenal to support research and the next reader!

What is going on in that sentence? Students of Greek and Latin since Cicero and Erasmus have quailed as their teachers asked them “What is that form, and on what does it depend?” You can record your answers for generations of readers to come – and you can contribute to Greek and Latin Treebanks — our closest equivalent to the genomic databases of biology!

Posted in Contribution, Scaife Viewer | Comments Off on Help the SDL!

Digital Humanities in the Classroom – Technical Approach to Platform Integration

Bridget Almas, The Perseus Project, Tufts University
Professor Marie-Claire Beaulieu, Tufts University
July 25, 2012

SoSOL and CITE are two separate frameworks, developed independently, for working with digital representations of ancient sources. They each approach the problem set from different directions, resulting in little overlap between what the two offer, and a great deal of potential for integration.

The SoSOL platform was designed to provide support for the collaborative editing of the different types of XML data being integrated from multiple sources under the Papyri.info platform. Supported data types include transcriptions, translations, metadata, commentary and bibliographies, each adhering to the TEI/EpiDoc schema, but with different conventions and restrictions applied. Publications made up of one or more of these data types are guided through an editing lifecycle by a workflow engine built on top of a git repository. Support for a simple role-based user model is provided, leveraging the OpenID specification by delegating authentication to Social Identity Providers. Editors can search a catalog of pre-established publication identifiers to select items to edit, or can create their own publications. Each user works on the publications in their own clone of the underlying git source repository until they are ready to submit a revised publication for approval, at which point their submissions are passed to an editorial board for review, and can either be returned to the editor for further work and corrections, or finalized and updated in the master branch of the repository.

The CITE (Collections, Indexes, and Texts, with Extensions) architecture provides a framework both for digitizing textual sources and for creating mappings between those sources and their digital facsimiles on the level of the citation. It consists of technology-independent but machine-actionable URN schemas for canonical citation, APIs for network services that identify and retrieve objects identified by canonical URN, and implementations of those APIs on a variety of platforms. This architecture was developed by the Center for Hellenic Studies (CHS) in part to enable the work of the Homer Multitext Project (HMT). In developing the architecture, the CHS team intended to support a wide range of ancient source material in addition to manuscripts, and with the CTS (Canonical Text Services) URN syntax we are able to express in a single identifier both the position of the work in a FRBR-like hierarchy, and the position of a node or continuous range of nodes within a work. The CITE URN syntax applies the same theory to non-document objects, and supports a citation scheme for images, enabling, in a single identifier, identification of both the image itself and specific coordinates on that image.
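To make the URN syntax concrete, here is a simplified Python sketch that splits a CTS URN into the components described above (namespace, FRBR-like work hierarchy, and passage or range). The example URN is illustrative, and the full CTS specification covers cases this sketch ignores (e.g. subreferences):

```python
def parse_cts_urn(urn):
    """Split a CTS URN into its components: namespace, work hierarchy,
    and optional passage reference. A simplified sketch of the syntax."""
    parts = urn.split(":")
    if parts[:2] != ["urn", "cts"]:
        raise ValueError("not a CTS URN: " + urn)
    namespace = parts[2]
    work = parts[3].split(".")            # textgroup, work, optional version
    passage = parts[4] if len(parts) > 4 else None
    if passage and "-" in passage:
        passage = tuple(passage.split("-", 1))   # a continuous range of nodes
    return {"namespace": namespace, "work": work, "passage": passage}

# Iliad book 1, lines 1-10 (illustrative URN).
ref = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1-1.10")
```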

We have several separate but related needs driving our work on integrating these two platforms at Perseus. Most of our work focuses on the first two of these with a view to supporting the third and fourth goals in subsequent work.

  1. To support collaborative work by students, along the model of the HMT project, thus allowing students to conduct substantive linguistic research with a tangible outcome, the publication of a digital edition of their work.
  2. To work not only with inscriptions and papyri but with more general textual sources, such as the Greek, Latin, and Arabic collections in the Perseus Digital Library, for which subsets of the TEI Guidelines such as the TEI-Analytics subset (being developed by the Abbott Project) are more suitable.
  3. To support work on a growing range of historical sources in multiple formats and languages. These include more than 1,200 medieval manuscripts for which the Walters Art Gallery (250 MSS) and the Swiss e-codices project (900 MSS) have published high resolution scans under a Creative Commons license.
  4. To support a large and international community of digital editors, including students, advanced researchers and citizen scholars. The spring 2012 user base for the Perseus Digital Library exceeded 300,000 users, with c. 10% (30,000) working directly with Greek and Latin sources. The 90-9-1 rule predicts that 9% of an online community will contribute occasionally and 1% will make the majority of new contributions. This would imply active communities of 30,000 for Perseus as a whole and 3,000 for the Greek and Latin collections.
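The community estimates in point 4 follow directly from the 90-9-1 rule; a quick check of the arithmetic:

```python
def contributor_estimate(users, occasional=0.09, core=0.01):
    """Apply the 90-9-1 rule: 90% of an online community lurk, 9% contribute
    occasionally, and 1% make the majority of new contributions."""
    return round(users * (occasional + core))

# Figures from the post: 300,000 Perseus users, c. 10% using Greek and Latin.
perseus = contributor_estimate(300_000)   # active community for Perseus overall
greco_latin = contributor_estimate(30_000)  # for the Greek and Latin collections
```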

Professor Beaulieu’s project to engage students in work on ancient funerary inscriptions provides an excellent opportunity to explore this work. The job of mapping her collection of images to transcriptions in order to produce digital editions leveraging those mappings parallels in many ways the work of the HMT project and is a good fit for the CITE services and APIs. In addition, the TEI-based Epidoc XML standard to be used for digitizing the inscriptions is already well-supported by the SoSOL platform. We are able to reuse large parts of the XML validation and display code from the papyri publication support on SoSOL while focusing on the addition of support for the CTS identifiers. This incremental approach allows us to lay the groundwork for the eventual integration of the full collection of Perseus texts while at the same time producing something more immediately applicable and available for use by a smaller, controlled community of students who can effectively serve as beta testers for the platform.

In keeping with agile development methodologies, we are taking an iterative approach to the integration. We started with the following code bases:

  1. a forked clone of the git repository of the SoSOL platform’s JRuby code base
  2. the Groovy/Java/Google App Engine reference implementation of CTS and CITE APIs from the HMT Project

The first deliverable was to create a prototype implementation that re-used the existing SoSOL code for Epidoc transcriptions almost in its entirety by sub-classing it and changing only the structure of the document identifiers to correspond more closely to the CTS URN syntax. We also substituted a CTS text inventory for the Papyri.info catalog. Coding the prototype gave us a means to explore the design of the SoSOL platform’s code and assess its viability for reuse. The concrete deliverable of a working user-interface gave Professors Beaulieu and Crane a means to explore the viability from the perspective of the user (both student and reviewer).

The next step was to analyze whether we could also extend this work to support the larger Perseus corpus, which will be using the TEI-Analytics XML schema instead of Epidoc, and for which we will need to support collaborative editing not only at the level of the entire text but also at the level of a citation or passage. The latter leverages the CTS API heavily. However, as CTS is a read-only API, we needed to develop a set of parallel write/update/delete functionality that could be used to update and create new editions of CTS-compatible texts. To experiment with this, we augmented the XQuery based implementation of the CTS APIs from the Alpheios project, which was written by the developer working on this project. We also coded prototypes of additional extensions to the SoSOL code to work with texts and passages that use the TEI-A XML schema rather than Epidoc, and to present a passage selection interface.
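The write/update/delete extension described above was implemented against the XQuery-based Alpheios CTS code; the Python sketch below only illustrates the shape of such a parallel CRUD interface alongside the read-only CTS API, keyed by CTS URN (the class and method names are ours, not the project's):

```python
class EditableTextStore:
    """Minimal in-memory sketch of write/update/delete operations added
    alongside the read-only CTS API, keyed by CTS URN."""

    def __init__(self):
        self._passages = {}

    # CTS itself covers the read side; the methods below are the extension.
    def get(self, urn):
        return self._passages[urn]

    def create(self, urn, xml):
        if urn in self._passages:
            raise KeyError(urn + " already exists; use update()")
        self._passages[urn] = xml

    def update(self, urn, xml):
        self._passages[urn] = xml   # records a new edition of the passage

    def delete(self, urn):
        del self._passages[urn]

store = EditableTextStore()
urn = "urn:cts:greekLit:tlg0012.tlg001:1.1"
store.create(urn, "<l n='1'>initial transcription</l>")
store.update(urn, "<l n='1'>revised transcription</l>")
```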

Completing these two deliverables gave us confidence that the integration was in fact viable, and funding as an NEH startup project enables us to move the work beyond the prototype stage to actual implementation.

Through the work on the prototype, we were able to identify some key interoperability challenges for the two platforms.

For SoSOL this has centered around identifying and isolating the Papyri-specific assumptions of the platform. These have primarily been in the following areas:

  • identifier scheme
  • cataloging system
  • stylesheets for display
  • differing concepts of what makes up a “Publication”

For CTS the primary integration challenge so far has been in augmenting it with a compatible Create/Update/Delete system.

The challenges also include the need to identify or define a canonical citation scheme for the inscriptions, although this is not specifically a platform integration issue but instead a more general one related to the creation of digital editions.

The first deliverable of the implementation stage of the project was to integrate the prototype code with the master branch of the SoSOL repository that had continued to evolve during our prototyping efforts, and with which our forked clone was now out of sync. Through this process, we were able to both take advantage of various enhancements made to the SoSOL code in the interim and reduce the number of changes necessary to the main code base to support the new data and identifier types. This process also required some significant rewriting of the prototype code, but this was not surprising as the creation of production quality code was not the main objective of the prototype. We are now working on a branch of the master SoSOL repository, rather than a fork, and expect to be able to integrate the branched code back into the master branch fairly soon.

Once the above process was completed, the next deliverable was to deploy the SoSOL and CTS services on a Perseus server with a functioning interface that Professor Beaulieu and her assistants could use to select an inscription upon which to work and then enter the XML for the transcription, translation and commentary of that inscription. This deliverable has been fulfilled and they have been able to complete creation of a digital transcription and translation of the Nedymos epigram through the SoSOL interface.

Although initially we had also planned to include integration with the ImageJ tool in this iteration, the development in the meantime by the HMT of a superior web-based Image Citation tool for working with the images, along with the expanding adoption of the Open Annotation Core (OAC) Data Model specification for annotations, has led us to change course on that part of the design. We have begun the work of integrating the Image Citation tool into the SoSOL interface, and it can now be used from within this interface to select a region of interest on an image and create a CITE URN for that selection when editing or viewing the transcription. We are currently using a shared Google Drive spreadsheet to record these URNs, and the corresponding CTS URNs for the mapped text, in an index. The next step will be for the SoSOL tool to automatically record and store these mappings as annotations on the text in the form of OAC RDF triples.
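As a sketch of the planned OAC representation, the snippet below expresses one image-to-text mapping as plain (subject, predicate, object) tuples. The predicate names follow the OAC data model; the identifiers are illustrative stand-ins for real CITE/CTS URNs, and taking the image region as body and the text as target is one plausible modeling, not necessarily the one the project will adopt:

```python
# OAC namespace from the Open Annotation Core data model.
OA = "http://www.openannotation.org/ns/"

def image_to_text_annotation(ann_id, image_urn, text_urn):
    """Express an image-region-to-text mapping as OAC-style RDF triples:
    the CITE URN for the image region is the body, the CTS URN for the
    transcribed text is the target."""
    return [
        (ann_id, "rdf:type", OA + "Annotation"),
        (ann_id, OA + "hasBody", image_urn),
        (ann_id, OA + "hasTarget", text_urn),
    ]

triples = image_to_text_annotation(
    "urn:example:annotation1",               # hypothetical annotation id
    "urn:cite:hmt:vaimg.exampleImage",       # image region (illustrative)
    "urn:cts:greekLit:tlg0012.tlg001:1.1",   # text passage (illustrative)
)
```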

Deploying and using the SoSOL interface for this inscription has enabled us to better understand the actual workflow we will need to support for the work on the inscriptions, and uncovered some differences between this workflow and the one currently supported by the SoSOL platform for the Papyrological work. Among other things, we have identified the need to make some decisions about how we want to handle the commentary and bibliography for the inscriptions, and we have also recognized the need for some design changes to the interface introduced by the CTS approach of keeping the translations in separate documents from the source editions. These changes will be included in the next iteration, during which we will also begin to work on adding support for storing image to text mappings as OAC annotations and continue to move forward with the support for TEI-Analytics and citation-based editing that will be required for the larger Perseus corpus.

Having used these tools to produce the XML and image mapping data for the Nedymos inscription, we are now also able to begin scoping the requirements for the eventual display of the digital edition. We have used the Groovy based reference implementation of a facsimile browser from the HMT project and the Alpheios browser plugins to experiment with the options and to produce screenshots through which we are able to review and discuss the requirements in a concrete way. In the next iteration we will decide upon an implementation approach for the display code and for supporting automatic integration of the display and editing environments.

Posted in Digital Humanities | 1 Comment