Bridget Almas, The Perseus Project, Tufts University
Professor Marie-Claire Beaulieu, Tufts University
July 25, 2012
SoSOL and CITE are two separate frameworks, developed independently, for working with digital representations of ancient sources. They approach the problem from different directions, so there is little overlap between what the two offer and a great deal of potential for integration.
The SoSOL platform was designed to provide support for the collaborative editing of the different types of XML data being integrated from multiple sources under the Papyri.info platform. Supported data types include transcriptions, translations, metadata, commentary and bibliographies, each adhering to the TEI/EpiDoc schema, but with different conventions and restrictions applied. Publications made up of one or more of these data types are guided through an editing lifecycle by a workflow engine built on top of a git repository. Support for a simple role-based user model is provided, leveraging the OpenID specification by delegating authentication to Social Identity Providers. Editors can search a catalog of pre-established publication identifiers to select items to edit, or can create their own publication. Each user works on publications in their own clone of the underlying git source repository until they are ready to submit a revised publication for approval. At that point the submission is passed to an editorial board for review, and can either be returned to the editor for further work and corrections, or finalized and updated in the master branch of the repository.
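The editing lifecycle described above can be pictured as a small state machine. The following is a minimal sketch for illustration only, not SoSOL's actual workflow engine (which is JRuby code built on git); the state names, action names and the `advance` function are assumptions introduced here.

```python
from enum import Enum


class State(Enum):
    # Hypothetical state names; SoSOL's own vocabulary may differ.
    EDITING = "editing"        # user works in their own clone
    SUBMITTED = "submitted"    # awaiting review by the editorial board
    FINALIZED = "finalized"    # updated in the master branch


# Allowed transitions, keyed by (current state, action).
TRANSITIONS = {
    (State.EDITING, "submit"): State.SUBMITTED,
    (State.SUBMITTED, "return"): State.EDITING,
    (State.SUBMITTED, "finalize"): State.FINALIZED,
}


def advance(state: State, action: str) -> State:
    """Return the next state, or raise ValueError if the action is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"cannot {action!r} a publication that is {state.value}"
        ) from None


if __name__ == "__main__":
    state = State.EDITING
    # A publication that is returned once for corrections, then accepted.
    for action in ("submit", "return", "submit", "finalize"):
        state = advance(state, action)
        print(action, "->", state.value)
```

Keeping the transitions in a single table makes it easy to see that a finalized publication cannot be resubmitted without an explicit new rule.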
The CITE (Collections, Indexes, and Texts, with Extensions) architecture provides a framework both for digitizing textual sources and for creating mappings between those sources and their digital facsimiles on the level of the citation. It consists of technology-independent but machine-actionable URN schemas for canonical citation, APIs for network services that identify and retrieve objects identified by canonical URN, and implementations of those APIs on a variety of platforms. This architecture was developed by the Center for Hellenic Studies (CHS) in part to enable the work of the Homer Multitext Project (HMT). In developing the architecture, the CHS team intended to support a wide range of ancient source material in addition to manuscripts. With the CTS (Canonical Text Services) URN syntax we are able to express in a single identifier both the position of the work in a FRBR-like hierarchy and the position of a node or continuous range of nodes within a work. The CITE URN syntax applies the same theory to non-document objects, and supports a citation scheme for images, enabling, in a single identifier, identification of both the image itself and specific coordinates on that image.
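To make the idea of a single identifier carrying both a work hierarchy and a passage range concrete, here is a minimal sketch of a CTS URN parser. It is a simplification written for this illustration, not part of any CITE reference implementation: the `CtsUrn` class and `parse_cts_urn` function are hypothetical names, the sketch handles at most three hierarchy levels (text group, work, version), and it does not cover CITE image URNs or their coordinate syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CtsUrn:
    namespace: str
    textgroup: str
    work: Optional[str] = None
    version: Optional[str] = None
    passage_start: Optional[str] = None
    passage_end: Optional[str] = None


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into its work hierarchy and optional passage range."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work_part = parts[2], parts[3]
    levels = work_part.split(".")
    if not namespace or len(levels) > 3 or not all(levels):
        raise ValueError(f"malformed work component in {urn!r}")
    # Pad missing levels with None so that a bare text group also parses.
    textgroup, work, version = (levels + [None, None])[:3]

    start = end = None
    if len(parts) == 5 and parts[4]:
        # A hyphen separates the two ends of a continuous range.
        start, _, end = parts[4].partition("-")
        if not start:
            raise ValueError(f"malformed passage component in {urn!r}")
        end = end or None
    return CtsUrn(namespace, textgroup, work, version, start, end)


if __name__ == "__main__":
    # Illustrative URN: a range of lines in a work, with no version given.
    print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1-1.10"))
```

The same parsed structure serves for retrieval of a whole work (no passage component) or of a single node or range within it.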
We have several separate but related needs driving our work on integrating these two platforms at Perseus. Most of our work focuses on the first two of these with a view to supporting the third and fourth goals in subsequent work.
- To support collaborative work by students, along the model of the HMT project, thus allowing students to conduct substantive linguistic research with a tangible outcome: the publication of a digital edition of their work.
- To work not only with inscriptions and papyri but with more general textual sources, such as the Greek, Latin, and Arabic collections in the Perseus Digital Library, for which subsets of the TEI Guidelines such as the TEI-Analytics subset (being developed by the Abbott Project) are more suitable.
- To support work on a growing range of historical sources in multiple formats and languages. These include more than 1,200 medieval manuscripts for which the Walters Art Museum (250 MSS) and the Swiss e-codices project (900 MSS) have published high-resolution scans under a Creative Commons license.
- To support a large and international community of digital editors, including students, advanced researchers and citizen scholars. The spring 2012 user base for the Perseus Digital Library exceeded 300,000 users, with c. 10% (30,000) working directly with Greek and Latin sources. The 90-9-1 rule predicts that 9% of an online community will contribute occasionally and 1% will make the majority of new contributions. Taking these two groups together (10%), this would imply active communities of 30,000 for Perseus as a whole and 3,000 for the Greek and Latin collections.
Professor Beaulieu’s project to engage students in work on ancient funerary inscriptions provides an excellent opportunity to explore this work. The job of mapping her collection of images to transcriptions, in order to produce digital editions leveraging those mappings, parallels in many ways the work of the HMT project and is a good fit for the CITE services and APIs. In addition, the TEI-based EpiDoc XML standard to be used for digitizing the inscriptions is already well supported by the SoSOL platform. We are able to reuse large parts of the XML validation and display code from the papyri publication support on SoSOL while focusing on the addition of support for the CTS identifiers. This incremental approach allows us to lay the groundwork for the eventual integration of the full collection of Perseus texts while at the same time producing something more immediately applicable and available for use by a smaller, controlled community of students who can effectively serve as beta testers for the platform.
In keeping with agile development methodologies, we are taking an iterative approach to the integration. We started with the following code bases:
- a forked clone of the git repository of the SoSOL platform’s JRuby code base
- the Groovy/Java/Google App Engine reference implementation of CTS and CITE APIs from the HMT Project
The first deliverable was to create a prototype implementation that re-used the existing SoSOL code for EpiDoc transcriptions almost in its entirety by sub-classing it and changing only the structure of the document identifiers to correspond more closely to the CTS URN syntax. We also substituted a CTS text inventory for the Papyri.info catalog. Coding the prototype gave us a means to explore the design of the SoSOL platform’s code and assess its viability for reuse. The concrete deliverable of a working user interface gave Professors Beaulieu and Crane a means to explore the viability from the perspective of the user (both student and reviewer).
The next step was to analyze whether we could also extend this work to support the larger Perseus corpus, which will be using the TEI-Analytics XML schema instead of EpiDoc, and for which we will need to support collaborative editing not only at the level of the entire text but also at the level of a citation or passage. The latter leverages the CTS API heavily. However, as CTS is a read-only API, we needed to develop a parallel set of write/update/delete functions that could be used to update existing editions of CTS-compatible texts and to create new ones. To experiment with this, we augmented the XQuery-based implementation of the CTS APIs from the Alpheios project, which was written by the developer working on this project. We also coded prototypes of additional extensions to the SoSOL code to work with texts and passages that use the TEI-A XML schema rather than EpiDoc, and to present a passage selection interface.
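As an illustration of what pairing a read-only retrieval service with a parallel set of write operations can look like, consider the toy in-memory store below. It is a sketch under stated assumptions, not the Alpheios XQuery code and not part of the CTS API: the `EditionStore` class, its method names, and the identifiers used in the demo are all hypothetical.

```python
from typing import Dict, Optional


class EditionStore:
    """Toy in-memory store mapping edition id -> {citation: text}."""

    def __init__(self) -> None:
        self._editions: Dict[str, Dict[str, str]] = {}

    # Read side: the kind of operation a read-only service already offers.
    def get_passage(self, edition: str, citation: str) -> str:
        return self._editions[edition][citation]

    # Write side: hypothetical companions to the read operation.
    def create_edition(self, edition: str, source: Optional[str] = None) -> None:
        """Create an empty edition, or a copy of an existing one."""
        if edition in self._editions:
            raise ValueError(f"edition already exists: {edition}")
        base = self._editions[source] if source is not None else {}
        self._editions[edition] = dict(base)

    def put_passage(self, edition: str, citation: str, text: str) -> None:
        """Add or replace the text at one citation of an edition."""
        self._editions[edition][citation] = text

    def delete_passage(self, edition: str, citation: str) -> None:
        del self._editions[edition][citation]


if __name__ == "__main__":
    store = EditionStore()
    # Placeholder identifiers, not real CTS URNs.
    base = "urn:cts:demo:group.work.base"
    revised = "urn:cts:demo:group.work.student1"
    store.create_edition(base)
    store.put_passage(base, "1.1", "original reading")
    # A new edition starts as a copy, so edits never touch the base text.
    store.create_edition(revised, source=base)
    store.put_passage(revised, "1.1", "corrected reading")
    print(store.get_passage(base, "1.1"))
    print(store.get_passage(revised, "1.1"))
```

Editing a copy rather than the base edition mirrors the passage-level editing described above: each revision becomes a new, separately citable edition.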
Completing these two deliverables gave us confidence that the integration was in fact viable, and funding as an NEH startup project enables us to move the work beyond the prototype stage to actual implementation.
Through the work on the prototype, we were able to identify some key interoperability challenges for the two platforms.
For SoSOL this has centered around identifying and isolating the Papyri-specific assumptions of the platform. These have primarily been in the following areas:
- identifier scheme
- cataloging system
- stylesheets for display
- differing concepts of what makes up a “Publication”
For CTS the primary integration challenge so far has been in augmenting it with a compatible Create/Update/Delete system.
The challenges also include the need to identify or define a canonical citation scheme for the inscriptions, although this is not specifically a platform integration issue but instead a more general one related to the creation of digital editions.
The first deliverable of the implementation stage of the project was to integrate the prototype code with the master branch of the SoSOL repository, which had continued to evolve during our prototyping efforts and with which our forked clone was now out of sync. Through this process, we were able both to take advantage of various enhancements made to the SoSOL code in the interim and to reduce the number of changes needed in the main code base to support the new data and identifier types. This process also required some significant rewriting of the prototype code, but this was not surprising as the creation of production-quality code was not the main objective of the prototype. We are now working on a branch of the master SoSOL repository, rather than a fork, and expect to be able to integrate the branched code back into the master branch fairly soon.
Once the above process was completed, the next deliverable was to deploy the SoSOL and CTS services on a Perseus server with a functioning interface that Professor Beaulieu and her assistants could use to select an inscription upon which to work and then enter the XML for the transcription, translation and commentary of that inscription. This deliverable has been fulfilled and they have been able to complete creation of a digital transcription and translation of the Nedymos epigram through the SoSOL interface.
Although initially we had also planned to include integration with the ImageJ tool in this iteration, the development in the meantime by the HMT of a superior web-based Image Citation tool for working with the images, along with the expanding adoption of the Open Annotation Core (OAC) Data Model specification for annotations, has led us to change course on that part of the design. We have begun the work of integrating the Image Citation tool into the SoSOL interface, and it can now be used from within this interface to select a region of interest on an image and create a CITE URN for that selection when editing or viewing the transcription. We are currently using a shared Google Drive spreadsheet to record these URNs, and the corresponding CTS URNs for the mapped text, in an index. The next step will be for the SoSOL tool to automatically record and store these mappings as annotations on the text in the form of OAC RDF triples.
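One way such an index row could be expressed as annotation triples is sketched below, serialized as N-Triples with plain string formatting. This is an illustration under several assumptions, not the SoSOL implementation: the `mapping_to_ntriples` function is hypothetical, the Open Annotation namespace IRI used is the `http://www.w3.org/ns/oa#` form and may differ from the draft in use at a given time, the choice of the image region as target and the text passage as body is ours, and the identifiers in the demo are placeholders rather than real URNs in exact CITE syntax.

```python
from typing import List

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OA = "http://www.w3.org/ns/oa#"  # assumed namespace for the annotation terms


def _check_iri(iri: str) -> str:
    """Reject characters that would break an N-Triples IRI reference."""
    if not iri or any(ch in iri for ch in '<>" \t\n'):
        raise ValueError(f"unsafe IRI: {iri!r}")
    return iri


def mapping_to_ntriples(annotation_iri: str, image_urn: str, text_urn: str) -> List[str]:
    """Return three N-Triples lines linking an image region to a text passage."""
    anno = _check_iri(annotation_iri)
    image = _check_iri(image_urn)
    text = _check_iri(text_urn)
    return [
        f"<{anno}> <{RDF_TYPE}> <{OA}Annotation> .",
        f"<{anno}> <{OA}hasBody> <{text}> .",      # the transcribed passage
        f"<{anno}> <{OA}hasTarget> <{image}> .",   # the region on the image
    ]


if __name__ == "__main__":
    # One row of a hypothetical index: (image region URN, text passage URN).
    rows = [("urn:cite:demo:images.img001@0.1,0.2,0.3,0.4",
             "urn:cts:demo:group.work.edition:1")]
    for n, (image_urn, text_urn) in enumerate(rows, start=1):
        iri = f"http://example.org/annotations/{n}"
        print("\n".join(mapping_to_ntriples(iri, image_urn, text_urn)))
```

Because each row becomes an independent annotation, the spreadsheet index could be converted row by row without changing the URNs already recorded in it.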
Deploying and using the SoSOL interface for this inscription has enabled us to better understand the actual workflow we will need to support for the work on the inscriptions, and has uncovered some differences between this workflow and the one currently supported by the SoSOL platform for the papyrological work. Among other things, we have identified the need to make some decisions about how we want to handle the commentary and bibliography for the inscriptions, and we have also recognized the need for some design changes to the interface introduced by the CTS approach of keeping the translations in separate documents from the source editions. These changes will be included in the next iteration, during which we will also begin to work on adding support for storing image-to-text mappings as OAC annotations and continue to move forward with the support for TEI-Analytics and citation-based editing that will be required for the larger Perseus corpus.
Having used these tools to produce the XML and image mapping data for the Nedymos inscription, we are now also able to begin scoping the requirements for the eventual display of the digital edition. We have used the Groovy-based reference implementation of a facsimile browser from the HMT project and the Alpheios browser plugins to experiment with the options and to produce screenshots through which we are able to review and discuss the requirements in a concrete way. In the next iteration we will decide upon an implementation approach for the display code and for supporting automatic integration of the display and editing environments.