Announcing the Gardener Theme for Treebank Self-publication

Treebanking has distinct merit as a pedagogical tool. The entire process is useful for language learners of all levels, whether as an introduction to more complex sentence structure or as a practice exercise to hone skills. It can sometimes be challenging to convince educators to use treebanking tools, not because they cannot see the merit, but rather because they are concerned about being able to use the tools effectively. They may feel they lack the technical ability to manipulate the data, files, and annotation platform. We want educators to have confidence in their ability to use the tools and teach others to do the same. Treebanking with Perseids and Arethusa is fairly simple, and most people will learn how to use these tools by sitting down and using them.

With this in mind, I set myself to answering another question I continue to get each time I introduce new teachers to treebanking, “what do I do with all this data?” With Perseids we want to empower users to own their data. One avenue would be for teachers to connect with projects which aggregate treebanks, with the hopes they might turn each classroom into part of a larger crowdsourcing project. However, larger treebanking projects prefer tagsets based on international standards for dependency grammar, which are sometimes unintelligible to the average student of Greek or Latin and can be overwhelming. For this reason, we designed Arethusa to give users the ability to customize the tagsets to coincide with the grammars and map to textbooks they are already familiar with.

Along the same lines, we want to put in end users’ hands the ability to publish their work themselves, whether they’ve used a standard tagset for their annotations or a customized version, and then to extend their publications with other resources as appropriate for their choices. The work we did with Professor Matthew Harrington’s Latin AP Pedagogical Treebanking project allowed us to explore how we might use GitHub to support this. We can take advantage of the services GitHub provides freely for publishing versioned repository resources as web sites, and for connecting with Zenodo to assign digital object identifiers (DOIs) to these resources. We developed a customization of a Jekyll theme with predefined templates for displaying treebank data in a GitHub Pages site. Perseids users can download their treebank files directly from Perseids and import them into a GitHub repository using this theme. This allows users with minimal technical expertise to easily get a site up to publish and display their work.

The way this works is simple. Users download their treebank data, add them to their GitHub repository and then create HTML files for each tree that contain just a small .yaml header (which we call ‘tbpages’). Jekyll uses this .yaml header to populate a table of all the trees in the collection on the main page, as well as a display page for each individual tree. The theme uses a widgetized version of the JavaScript-based Arethusa application to display interactive representations of the trees. The Arethusa widget runs in the page served by the GitHub Pages site and retrieves the tree data directly from the underlying GitHub repository. The only dependency on Perseids is for retrieval of the tagset data files, although these can be configured to reside locally as well.
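For users who have many trees, creating the tbpages by hand can get repetitive. The following is a minimal sketch of how that step could be scripted, not part of the theme itself. The header keys (`layout`, `title`, `treebank`), the `trees` folder name, and the function names are assumptions made for illustration; the real keys are whatever the Gardener Theme's templates expect.

```python
import json
from pathlib import Path


def tbpage_text(tree_path: str, title: str) -> str:
    """Return a minimal tbpage: a YAML header followed by an empty body."""
    header = [
        "---",
        "layout: treebank",                      # hypothetical layout name
        f"title: {json.dumps(title)}",           # JSON string quoting is also valid YAML
        f"treebank: {json.dumps(tree_path)}",    # hypothetical key for the data file
        "---",
    ]
    return "\n".join(header) + "\n"


def write_tbpages(repo_dir: Path, tree_dir: str = "trees") -> list:
    """Write one tbpage in repo_dir for every XML file found in repo_dir/tree_dir."""
    written = []
    for xml_file in sorted((repo_dir / tree_dir).glob("*.xml")):
        page = repo_dir / f"{xml_file.stem}.html"
        rel_path = xml_file.relative_to(repo_dir).as_posix()
        page.write_text(tbpage_text(rel_path, xml_file.stem), encoding="utf-8")
        written.append(page)
    return written
```

Calling `write_tbpages(Path("my-garden"))` would then leave one small HTML file per tree in the repository, and anything a student writes below the header stays untouched by Jekyll's table generation.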

The goal was to create something that people could use with a minimal understanding of the underlying technology. Users with a little more technical expertise can explore the options Jekyll offers for additional customizations. Setting the whole thing up as a theme allows us to roll out improvements to the widget and distribute them to our users easily. Bringing together the widget and the Jekyll theme, we created the Gardener Theme, which allows users to plant gardens for their collections of trees.

One of the initial goals of the project was to document the pedagogical techniques that we have seen work with the Perseids Project in classrooms around the world. Additionally, the project was intended to help students experience having their work published. Gardener was designed not just to be easy to use, but to be flexible enough to facilitate publishing both trees and associated student work. Jekyll only needs to read the .yaml header to generate the page; the rest of the tbpage file can contain whatever additional content the user wishes. In talking with educators, we found that response essays or analytical write-ups are commonly used as a way to assess the skills learned by creating the treebank. If a classroom used this theme to publish their work, each published page could also contain essays which would help explain the shape of the trees. This creates a fantastic place for students to display work and create fixed proof of their skills in Greek and Latin.

We are pleased to announce that the first open version of the Gardener Theme is now up and running. Go and check out our demo blog here or fork the theme and get started here. And don’t forget to send us feedback!

Using Plokamos and Social Networks in the Classical Mythology Classroom

How can undergraduates contribute to research in a large lecture-hall mythology class? More importantly, how can such a class get beyond the rote memorization of stories and genealogies to engage with the primary documents and understand mythology in its own context?

The Perseids team has been experimenting with annotation to tackle these questions, because annotation is well known to produce deep engagement with a text in the form of close reading while promoting collaboration and conversation among students. However, one big pedagogical challenge is to design a workflow that is simple and lightweight so as not to get in the way of learning. On the technical side, the challenge is to produce good data that can then be preserved and aggregated easily.

Our first effort had students annotating Smith’s Dictionary of Greek and Roman Biography and Mythology with the Hypothes.is web annotation tools. The assignment was to collate the relationships among the figures in an entry of the Dictionary by annotating them using Hypothes.is. For instance, in the entry for Achilles, Thetis would be tagged as “MotherOf” and Peleus “FatherOf”. These tags used the SNAP ontology as a controlled vocabulary. The annotations were then harvested via the Hypothes.is API and serialized according to the OA model. In further passes, students documented attestations of relationships, i.e. which ancient text says that this relationship existed. They did so by inserting a Perseus URI in the annotation pointing to the specific passage attesting the relationship. Students also documented places associated with mythological figures using Pleiades URIs. Finally, students associated each mythological figure with the words that ancient texts used to describe them. These characterizations were produced following the “Word Study” exercise in the “Breaking the Language Barrier” series by Anna Krohn and Gregory Crane. Students looked up the Greek and Latin words used to describe a mythological figure and associated it with an English equivalent in the annotations using Perseus citation URIs.
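To make the harvesting step more concrete, here is a rough sketch of turning one harvested annotation into an Open Annotation (OA) shaped record. The input row is a simplified, hand-written stand-in for what a harvest might return, and the exact serialization Perseids produced is not shown in this post, so the output shape, the function names, and the example URIs are all assumptions. The `oa:` and `cnt:` terms themselves come from the Open Annotation vocabulary.

```python
OA_CONTEXT = {
    "oa": "http://www.w3.org/ns/oa#",
    "cnt": "http://www.w3.org/2011/content#",
}


def quoted_text(row: dict):
    """Return the highlighted text from the first TextQuoteSelector, if any."""
    for target in row.get("target", []):
        for selector in target.get("selector", []):
            if selector.get("type") == "TextQuoteSelector":
                return selector.get("exact")
    return None


def to_open_annotation(row: dict) -> list:
    """Turn one harvested row into one OA-shaped annotation per tag (assumed shape)."""
    annotations = []
    for tag in row.get("tags", []):
        annotations.append({
            "@context": OA_CONTEXT,
            "@type": "oa:Annotation",
            "oa:annotatedBy": {"@id": row.get("user")},
            # The relationship tag (e.g. "MotherOf") becomes the annotation body.
            "oa:hasBody": {"@type": "oa:Tag", "cnt:chars": tag},
            # The target is the highlighted name inside the dictionary entry.
            "oa:hasTarget": {
                "@type": "oa:SpecificResource",
                "oa:hasSource": {"@id": row["uri"]},
                "oa:hasSelector": {
                    "@type": "oa:TextQuoteSelector",
                    "oa:exact": quoted_text(row),
                },
            },
        })
    return annotations


# Hypothetical harvested row: "Thetis" highlighted in the entry for Achilles.
example_row = {
    "user": "acct:student@example.org",
    "uri": "http://example.org/smith/achilles",
    "tags": ["MotherOf"],
    "target": [{
        "source": "http://example.org/smith/achilles",
        "selector": [{"type": "TextQuoteSelector", "exact": "Thetis"}],
    }],
}
```

Note that a tag on its own says nothing about direction: the record only states that "Thetis" was tagged "MotherOf" within the entry for Achilles, and the reader has to infer who is the mother of whom.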

At the end of this multi-part assignment, students had thoroughly researched their mythological figure. They learned who the figure was associated with, not just in strict genealogical terms, but also other associations such as EnemyOf, Companion, etc. They also gained an understanding of the geographical associations of the figure, since Greek mythology is heavily based on local legends. Finally, the students got a sense for the literary treatment of the figures by looking at the original texts.

However, after using this workflow with two different groups of students, we found that while the assignment was valuable, the limitations of the tools affected the data gathered. For instance, the lack of a visualization in real time led to issues with the directionality of the relationships, so a mother could be labeled as the son of her child. Also, our instructions to the students had become very complex as we expanded the assignment with characterizations and attestations.

In order to continue and improve this work, our team began development of the Plokamos application. Plokamos is Greek for “something woven” and it allows students to build a network graph as they annotate. The application also allows users to see their annotations as a table, and the data will soon be downloadable as a CSV and as RDF.
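The directionality problem and the planned CSV export can both be illustrated with a few lines of code. This is only a sketch of the idea, not Plokamos code: the edge format (source, relation, target), the small set of parent relations, and the function names are assumptions chosen for the example.

```python
import csv
import io

# Illustrative subset of relationship tags; a parent relation cannot hold in both directions.
PARENT_RELATIONS = {"MotherOf", "FatherOf"}


def direction_conflicts(edges):
    """Return sorted pairs of figures linked by a parent relation in both directions."""
    parent_pairs = {(s, o) for s, rel, o in edges if rel in PARENT_RELATIONS}
    return sorted(
        {tuple(sorted(pair)) for pair in parent_pairs
         if (pair[1], pair[0]) in parent_pairs}
    )


def edges_to_csv(edges) -> str:
    """Serialize the network as CSV text with one row per relationship."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "relation", "target"])
    writer.writerows(edges)
    return buffer.getvalue()


edges = [
    ("Thetis", "MotherOf", "Achilles"),
    ("Peleus", "FatherOf", "Achilles"),
    ("Achilles", "MotherOf", "Thetis"),  # the kind of reversed edge students produced
]
```

Here `direction_conflicts(edges)` flags the Achilles and Thetis pair, which is the sort of mistake a live graph makes visible at a glance.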

Plokamos has an intuitive and minimalist interface which cuts down on the time needed for annotation and the possibilities for user error. As a result, our instructions to the students became much shorter and simpler. Plokamos also has an attractive interactive visualization which helps users see the characterizations in the context of the network and make sense of the two together.

For instance, students working on Odysseus and Amymone noticed that both these figures, who appear on each side of a Classical pelike in the Boston Museum of Fine Arts, are connected to Poseidon and his offspring of aquatic monsters (fig. 1). These monsters are further connected to Odysseus because they are all eventually pitted against him and defeated. The characterizations strengthen these connections, as Odysseus is depicted with seafaring epithets, bravery, and sound thinking, while Poseidon is depicted with sea epithets and words indicating fertility and progeny. Finally, Amymone is associated with bodies of water such as springs and lakes, and with her descendants, the Danaids, who carry water eternally in Hades. In this way, Plokamos helped students to gain a better understanding of mythology at the conceptual level, and then apply this knowledge to a specific piece of ancient artwork.

Fig. 1 Social network of Odysseus and Amymone, by Christopher Duff and Patrick Margey

Announcing Plokamos, a Semantic Annotation Tool

Plokamos is a new text annotation framework developed by Frederik Baumgardt and the Perseids project. It is a browser-based tool that can be used to support scholars and students of literary disciplines in their work with text. Plokamos provides users with the ability to assign meaning to segments of text, to share their assertions with colleagues and classmates and to visualize the result of their work in aggregate. We have been using Plokamos as a plugin to our Nemo text browser in the classroom over the last 2-3 months and are looking forward to making it generally available to everyone for use on any source texts in early 2017.

Plokamos is really a continuation of our previous work in building a comprehensive toolset to enable our users to create and use semantically meaningful textual annotations. Our goal in this next step was to better integrate the individual components we used previously, to provide data validation assistance at annotation time, and to be in a better position to adapt our tools to new use cases. In the process we also wanted to make it easier for the users to enter data from a shared and controlled vocabulary. Furthermore, we aimed to add data versioning functionality to the infrastructure to follow students’ progress, to enable parallelism between text and annotations, and to provide this functionality as a tool for scholarly work. Finally, we planned for the application to be easily extensible to allow us to expand into more use cases over time as well as allow collaborators to tailor the annotations and the user interface to their own needs.

Figure 1: Plokamos tooltip embedded into a web article

In more technical terms, Plokamos is made up of an almost fully self-contained Javascript client application to be loaded inside a browser window, and a server-side linked-data named graph store with a SPARQL endpoint. In addition to annotation data, the quad store also serves configuration data that enables the client to validate, interpret and adequately visualize the annotations.

The Plokamos client consists of three layers, which handle the annotations at different levels of abstraction. Each layer provides its own mechanism to extend the application and use it for new kinds of sources, data types, forms of presentation or editing interfaces.

The annotator/applicator layer is the central piece of a Plokamos client application. It manages a local copy of the annotation data, adds interactive highlights to the source text and keeps a history of modified and newly created annotations. Its core logic uses SPARQL and the Open Annotation linked data framework to retrieve the available annotations and place them at their correct locations within the text. It can be extended to handle different types of locations (“Selectors”) and different shapes of annotation payloads (“Annotation bodies”).
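As an illustration of the retrieval step, the sketch below builds a SPARQL query that asks a named graph store for the annotations targeting a given source text. It is not the query Plokamos actually sends; the function names, variable names, and query shape are assumptions. Only the `oa:` terms (`oa:Annotation`, `oa:hasTarget`, `oa:hasBody`, `oa:hasSource`, `oa:hasSelector`) are taken from the Open Annotation vocabulary.

```python
import re
from typing import Optional

# Characters that must not appear inside a SPARQL IRI reference.
_UNSAFE = re.compile(r'[<>"{}|^`\\\s]')


def _iri(value: str) -> str:
    """Wrap a URI in angle brackets, refusing input that could break the query."""
    if not value or _UNSAFE.search(value):
        raise ValueError(f"not a safe IRI: {value!r}")
    return f"<{value}>"


def annotations_query(source_uri: str, graph_uri: Optional[str] = None) -> str:
    """Build a query for annotations on source_uri, optionally limited to one named graph."""
    graph = _iri(graph_uri) if graph_uri else "?graph"
    return "\n".join([
        "PREFIX oa: <http://www.w3.org/ns/oa#>",
        "SELECT ?annotation ?selector ?body",
        "WHERE {",
        f"  GRAPH {graph} {{",
        "    ?annotation a oa:Annotation ;",
        "                oa:hasTarget ?target ;",
        "                oa:hasBody ?body .",
        f"    ?target oa:hasSource {_iri(source_uri)} ;",
        "            oa:hasSelector ?selector .",
        "  }",
        "}",
    ])
```

The selectors returned by such a query are what a client would use to place highlights in the text, while the bodies carry the payload to display.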

While the previous layer interprets annotations as just a network of entities and relations, and is agnostic to specific meaning (“Ontologies”) that is embedded in the network, the ontology layer is there to find and extract meaning from it. It can shape parts of the network into objects, translate URIs into easier to understand descriptions, and vice-versa. This is an essential step to negotiate between Plokamos’ general-purpose nature and its goal to provide user-friendly interactions. The ontology layer can be extended with new templates to extract objects from the graph and with additional dictionaries that provide translations between machine- and human-readable representations of the annotations.
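The two extension points named here, dictionaries and templates, can be pictured with a small sketch. This is a simplified model of the idea and not the Plokamos implementation: the class and function names are invented for the example, and the `example.org` URIs are placeholders rather than real ontology terms.

```python
class Dictionary:
    """Two-way lookup between machine-readable URIs and human-readable labels."""

    def __init__(self, labels):
        self._label_for = dict(labels)
        self._uri_for = {label: uri for uri, label in self._label_for.items()}

    def describe(self, uri):
        """URI to label; unknown URIs are passed through unchanged."""
        return self._label_for.get(uri, uri)

    def resolve(self, label):
        """Label to URI; unknown labels are passed through unchanged."""
        return self._uri_for.get(label, label)


def extract_relations(triples, dictionary, predicates):
    """Template: shape the triples that use a known predicate into readable objects."""
    return [
        {
            "subject": dictionary.describe(s),
            "relation": dictionary.describe(p),
            "object": dictionary.describe(o),
        }
        for s, p, o in triples
        if p in predicates
    ]


EX = "http://example.org/"  # placeholder namespace for the example
names = Dictionary({
    EX + "person/thetis": "Thetis",
    EX + "person/achilles": "Achilles",
    EX + "ontology#MotherOf": "mother of",
})
triples = [
    (EX + "person/thetis", EX + "ontology#MotherOf", EX + "person/achilles"),
    (EX + "person/thetis", EX + "ontology#unrelated", EX + "person/achilles"),
]
```

With these definitions, `extract_relations(triples, names, {EX + "ontology#MotherOf"})` yields a single object reading "Thetis", "mother of", "Achilles", and `names.resolve("mother of")` goes back to the URI when a user enters data.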

The plugin layer takes the extracted objects and creates user interfaces for them which allow users to read and edit the data in different forms. Plugins can either let the ontology layer automatically select ontologies for the object conversion or specify them explicitly. The annotator/applicator layer provides placeholders for plugins to insert themselves into during Plokamos’ initialization. Currently there are two such placeholders, for annotations on phrases and on whole documents, respectively. Inside the placeholders, plugins can be designed freely using HTML and JavaScript, including libraries such as Bootstrap and d3.js.

Figure 2: Visualization of corpus-level annotations filtered by family relationships

Over the course of the fall semester this architecture has proven itself to be useful and flexible for timely adaptations. We were able to develop new, unobtrusive and intuitive user interfaces for both the annotation reading and editing on single text passages as well as annotation visualization on a corpus. We also achieved our goal of improving the (syntactic) quality of the annotation data by providing the users with suggestions and visual feedback about the plausibility of the entered data. This last step benefited from the feedback that our students gave us while using the tool for their coursework and which we were able to quickly implement as additions to our plugins.

In 2017 we plan to focus on two particular features for Plokamos which we think will help make it a useful tool for many applications. The first one is a refactoring of the component in Plokamos that anchors annotations in their source data (the aforementioned Selectors) to enable higher-level annotations, i.e. annotating annotations. The obvious use case is for educators to grade and comment on their students’ annotations, but we’re sure that this will unlock further very interesting ways for scholars to express ideas. The second planned feature is the ability to run multiple instances of Plokamos on different regions of a website and let them interact to annotate relationships between segments of the regions. Those relationships can be, for example, assertions or translations, but again we’re convinced that this provides a foundation for new types of annotations that will emerge with time.

In addition to these features, we will round out the support for open, standards-based access to annotations created through Plokamos. First, we will add full API support, through an implementation of the RDA Collections API. Second, we will work towards updating the annotation data model as needed to be in compliance with the latest version of the Open Annotation specification, the Web Annotation data model.

We’re excited to watch Plokamos play its part as both a platform for data entry as well as experimentation with new kinds of scholarly concepts, as the Digital Humanities continue to reshape scholarship in the digital era.

Preserving Digital Scholarship in Perseids: An Exploration

Fernando Rios, Data Management Services, The Sheridan Libraries, Johns Hopkins University
Bridget Almas, Perseids Project, Tufts University
10.5281/zenodo.159569

Introduction

Software is an important part of many kinds of scholarship. However, it is often an invisible part of the knowledge generation process. As a result, software’s lack of visibility within the scholarly record inhibits the understanding and future use of the scholarship which is dependent on it. One way to mitigate that outcome is to preserve not only the final result but also the actual platform, services and tools upon which it depends.

In order to guide preservation of these platforms and services, Data Management Services at Johns Hopkins University is exploring several aspects of software preservation, one of which is investigating how preservation needs can be determined for particular projects such as Perseids. The Perseids Project at Tufts University is a web-based platform that is being used to produce new forms of digital scholarship for the humanities. Consequently, examining how this scholarship might be preserved by preserving the underlying software is of practical importance.

One of the outputs of the Perseids Project has been a series of prototypes of new forms of data-driven publications and digital editions. The data for these online publication prototypes have been produced through the use of a variety of software tools and services that combine dynamically provided data through orchestrated calls to web services. The software tools and underlying services have gone through several iterations of development throughout the lifetime of the project and publications have been produced at different stages of that development. This scenario poses a series of interesting challenges for preservation of these digital publications, the underlying data, and the tools and services that are intrinsic to them.

Objectives

This exploratory project had two objectives. The first was to give structure to thinking about how the data-driven publications and digital editions enabled by Perseids could be preserved. The primary concerns were these: what should be considered in determining how to adequately capture the collection of services and tools that make up Perseids, and should the entire collection even be captured? The second objective was to develop and trial a set of questions, presented in the form of a questionnaire, that could be used to elicit information to help address the first objective.

Methods

The Perseids platform and the publications produced on it rely upon complex pieces of software with many moving parts. In order to begin addressing the question of how such a platform and its publications might be preserved, we had several informal discussions of what the major parts of Perseids were, along with general approaches to preservation and the associated challenges. We focused our investigation on a prototype digital publication that was developed on an early version of the platform and that used versions of the annotation tools and services from Perseids which have been significantly updated or replaced since the prototype was first produced.

In order to understand how we might proceed with a potential software preservation activity, we decided it was important to answer three questions. First, we agreed it was important to have clarity on what the purpose of preservation is and who would benefit. Second, we determined that understanding what the pieces of the software are and how they are interdependent was critical. Third, we decided that being clear on what the costs versus benefits of preserving the Perseids software were, in relation to alternative approaches (e.g., website capture), was the most important question to address, from a practical perspective.

To structure the information, we used two questionnaires developed by Fernando for the purpose of providing consulting services for software archiving by the Data Management Services group at Johns Hopkins University. The first questionnaire asked very general questions in order to appraise the state of the software and gauge any potential gaps which may hinder its preservation and future reuse. Questions included asking the purpose of the Perseids project, its intended audience, the state of user- and developer-oriented documentation, general information about external software dependencies, and questions meant to gauge the general attitude with respect to software preservation and credit. After Bridget completed the questionnaire, we decided to move forward with determining what might need to be done in order to preserve the scholarship that the target use case represents (i.e., the prototype digital publication) and how it might be carried out.

To do this, a second, more focused questionnaire was developed (by Fernando, using feedback given by Bridget on the first questionnaire) in order to get us thinking about the specifics of preservation, including most importantly, the why. The figure below shows the sequence in which different aspects of preservation were addressed. The questions are loosely grouped by what kind of information they capture: why, what, when, how long, who, and how.

Although the questionnaires are still undergoing refinement and are not (yet) publicly distributed, a brief description of the information captured by the questionnaire we used is shown in the table below.

Why	Questions in this part revolved around really thinking about the true purpose of preserving software (e.g., enabling reproducibility, reuse, or continued access to scholarship) as well as the intended audience.
What	This section attempted to help us think through two things. First, at what level of granularity should the software be described and preserved in order to fulfil the preservation goal? This is important because different goals may require different levels of granularity in the description of the software. An example of a highly granular description is describing not only the software as a whole but also describing and documenting the individual pieces that comprise it as well as their interrelationships. Second, once an appropriate level of granularity was determined, a series of questions elicited information on those pieces.
When	This section attempts to determine what an appropriate time to preserve software is. For normal grant-funded projects, this will likely be at the end of the project or at the time of publication.
How Long	This part simply asks for the minimum length of time the software should be preserved. It is a simple question with a potentially difficult answer. Ideally, the answer is ‘a long time’ but the longer the time span, the more effort must go into ensuring the software remains not only accessible but also usable. Therefore, it is important to come up with a number based on available resources.
Who	This section is meant to determine not only who is responsible for the software but also who bears responsibility for archiving it, making it citable, assigning unique identifiers, etc. This section is also meant to help in identifying a suitable archive where it may be stored.
How	This section elicits what approach seems reasonable to preserve the software (e.g., by archiving the source code as-is, using virtualization or emulation technology, or by continued development). In addition, this section determines the kind of documentation that will be included and how it will be attached to the software (e.g., readme file, wiki, structured metadata). Although not part of the questionnaire, the Pathways of Research Software Preservation (Rios, 2016) gives an overview of how different parts of research software might be preserved and how different approaches are related.
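For projects that want to keep their answers alongside the code as structured metadata, the six groups in the table could be recorded in a small data structure. The sketch below is one possible shape, assuming field names that simply mirror the table headings; it is not the questionnaire itself, which asks more detailed questions, and the example values a project fills in would be its own.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PreservationPlan:
    """One field per group of questions in the table (names are illustrative)."""
    why: str             # purpose of preservation and intended audience
    what: list           # pieces of software to describe and preserve
    when: str            # point in the project at which to preserve
    how_long_years: int  # minimum retention period
    who: str             # party responsible for archiving and identifiers
    how: str             # chosen approach and form of documentation

    def missing(self) -> list:
        """Names of the fields that have not been answered yet."""
        return [name for name, value in asdict(self).items() if not value]

    def to_json(self) -> str:
        """Serialize the plan so it can be stored next to the source code."""
        return json.dumps(asdict(self), indent=2)
```

A `missing()` check of this kind makes unanswered groups visible early, for example an empty `who` field when a prototype has no clear owner.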

Lessons Learned

We learned, first, the importance of sorting through the “why” and “what” to identify those pieces of software which warrant preservation activity and to define exactly what approach to take to preservation. Having the framework of the questionnaire to guide our thinking about those issues helped to focus what felt at the beginning like a daunting task.

Bridget entered into the discussions with Fernando with a pragmatic motivation: as development progresses on Perseids, having to support multiple earlier versions of services in order to support the prototype publications becomes increasingly unmanageable. We wanted to be able to retire the earlier service versions that these prototypes depend upon, but the cost versus benefit ratio for upgrading prototype code does not always allow for that. In considering the options for preserving a functioning version of a prototype, some of which themselves imply a fair amount of work (such as creating and preserving a Docker container image of all the supporting pieces), thinking about the true purpose for preservation helped to put the problem in perspective and also to identify gaps in our planning and preservation capabilities.

While each of the suggested motivations from Fernando’s questionnaire could be considered to be an ideal to which to aspire in general, when held up against the specific software, they didn’t all make practical sense. For example, while in theory, reproducibility of the exact display of the annotations and textual data from our target use case seemed desirable, we had to ask if that was essential for preserving and reproducing the scholarship. The answer to that might have been yes if we had amassed large quantities of data for the use case, and expanded it beyond the initial prototype. But as we have not yet been able to do that, and the tools and services in question have since evolved, the small dataset we have accumulated for our publication would be better reproduced and expanded via newer tools. With this consideration in mind, it seems the remaining value of the prototype code would be as a demonstration of a methodology for annotation and a proposed service-based infrastructure to support that methodology. The code itself is of less consequence than a documentation of the ideas and dependencies would be.

This problem is discussed in the context of scientific workflows in “Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?” (Thain, Ivie and Meng, 2015). The authors found that preservation of distributed environments is still very much an open question and they suggest various approaches. In our case, a Docker image would allow an end-user to see the prototype functioning as it did when published but would provide little insight into the methodology or infrastructure. As we don’t intend to reproduce this environment exactly, we might consider just preserving the “working principle”, providing a description of the setup, using a controlled vocabulary.

It also became clear, in reviewing the questionnaire, that simply having code in GitHub or another open source versioning repository is not sufficient. All code we write is available in the project’s GitHub repository. However, because of the complex history and dependencies of open source software development, what exists in the repository represents, in many cases, only the tip of the iceberg. In addition, the GitHub repository, as it currently stands, doesn’t present a true picture of all the people who contributed intellectually to these efforts, because the code is just one piece of the puzzle. As discussed in Matthew Turk’s excellent post, “The Royal ‘We’ in Scientific Software Development”, we need to do a much better job of recording and crediting this intellectual work. Further, we need to be cognizant of the need to do this as the work takes place. An ontology such as TaDiRAH would be worth considering here.

The “who” section of the questionnaire also raised some interesting questions. Where does the responsibility for preservation lie, between the software developer and the scholar? Many of the use cases we work on in Perseids are not explicitly funded projects in and of themselves. Our approach has been to try to do as much as possible to serve as many real scholarly workflow needs as possible. This has provided the opportunity for us to explore various questions around what humanities infrastructure needs to support, while hopefully still also providing real value to our users. At the same time, we have learned that without adequate planning for governance and sustainability, things can and do fall through the cracks. Prototype code which we have developed, such as for the use case we examined here, does not always have a clear owner. For future projects of this nature, we need to take the time at the beginning to ask ourselves these questions about who will take ownership and responsibility for ensuring the preservation in order to eliminate this ambiguity.

Conclusions and Next Steps

Although data preservation and sharing have received much attention from funders, publishers, libraries and research communities in the past 10 years or so, methods, tools, and best practices for preserving and curating the associated software have not been as fully developed. The evaluation of the Perseids project served to contextualize some of the ideas and workflows around capturing information to enable the archiving of research software that are being developed in the Data Management Services group at Johns Hopkins University. From the Perseids Project’s perspective, the iterative approach we took gave us a clearer idea of the unique requirements and challenges of preserving the scholarship embedded in this software.

We learned that while having an ideal to shoot for is good, the ideal isn’t always the best or most practical approach. We have, however, identified some concrete next steps we can take to move closer to where we would like to be with preservation of the platform components and outputs.

First, we will explore ontologies and approaches for describing the distributed infrastructure we have envisioned for our publications. We have started with an analysis of the Ontosoft Ontology, although at first glance, it does not seem possible to express with it all the layers of intent and dependencies in our environment. We also intend to explore the Linked Resource Model ontology developed by the Pericles-EU project for this purpose.

In order to preserve the end-user experience of our publications, we expect to use the Webrecorder.io service to create web archive snapshots of their current state. This will allow us to preserve the visual representation of the scholarly output without a dependency upon the software behind it being available in perpetuity.

Finally, we hope to do a better job planning for the sustainability and stewardship of future undertakings on the platform from the outset, including identifying all participants and the nature of their contributions.