Progress towards Perseus 6

Gregory Crane
November 15, 2024

A great deal of work towards the new Perseus (Perseus 6) has been going on behind the scenes and that work is now beginning to become visible. James Tauber of Signum University has led the development, with collaboration from Charles Pletcher and Clifford Wulfman. In most cases, the “we” in this narrative means “James Tauber did the work with the rest of us making comments from the peanut gallery.”

 

Figure: the ATLAS backend architecture

We are not yet in the final stage of implementation, but we are finishing up a major task, and probably the biggest one: creating a sustainable backend to manage a growing range of machine-actionable textual data from a widening set of open data projects in multiple countries. As we refine the backend data, we can begin to move to the final stage of work: implementing within the Scaife Viewer the frontend services that were developed in Beyond Translation and that support treebanks, grammatical annotations, aligned translations, metrical analyses, automatic mapping, and links from editions to manuscripts.

For now, I point out the work done to make the existing Scaife/Perseus Greek and Latin texts accessible in the ATLAS architecture.

Every Greek and Latin text available in Scaife is now available in the new ATLAS architecture. If you start with the Scaife ATLAS homepage (https://atlas.perseus.tufts.edu/) and drill down into the CTS library (https://atlas.perseus.tufts.edu/library/), you can find your way to a list of all editions currently available in Scaife (a dozen versions of Thucydides in various languages are available at https://atlas.perseus.tufts.edu/library/urn:cts:greekLit:tlg0003.tlg001/), and then down to each of the lowermost citable nodes (e.g., https://atlas.perseus.tufts.edu/library/passage/urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:1.1.1/). At this point, you can request either plain text or XML by adjusting the final path segment: https://atlas.perseus.tufts.edu/library/passage/urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:1.1.1/text/ vs. https://atlas.perseus.tufts.edu/library/passage/urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:1.1.1/xml/.
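The URL pattern above can be captured in a few lines of Python. This is only a sketch based on the example links in this post: the function names passage_url and fetch_passage are made up for illustration, and the assumption that every passage URN follows the same /library/passage/<urn>/<format>/ pattern is inferred from the Thucydides example rather than from any published API documentation.

```python
from typing import Optional
from urllib.request import urlopen

# Base of the CTS library, taken from the links in this post.
ATLAS_LIBRARY = "https://atlas.perseus.tufts.edu/library/"


def passage_url(urn: str, fmt: Optional[str] = None) -> str:
    """Build the URL for a passage (hypothetical helper).

    fmt is None for the browsable page, or "text" / "xml" for the
    two alternate renderings shown in the examples above.
    """
    if fmt not in (None, "text", "xml"):
        raise ValueError(f"unsupported format: {fmt!r}")
    url = f"{ATLAS_LIBRARY}passage/{urn}/"
    if fmt is not None:
        url += f"{fmt}/"
    return url


def fetch_passage(urn: str, fmt: str = "text") -> str:
    """Download a passage; assumes the server responds with UTF-8."""
    with urlopen(passage_url(urn, fmt)) as response:
        return response.read().decode("utf-8")
```

Calling fetch_passage("urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:1.1.1") would then request the plain-text form of Thucydides 1.1.1, and passing "xml" as the second argument would request the XML form instead.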

If you explore the GitHub repository, you will find not only the full text library stored in the simpler ATLAS form (identifier+textchunk) but also complete morphosyntactic analyses for all 40 million words of Greek (using GreCy) and 16 million words of Latin (using LatinCy). Navigating to particular authors is a bit cumbersome because there was too much data for a single repository and we had to split it across multiple GitHub repositories. The information you need to find the right repository for any given Scaife Greek or Latin author is at https://github.com/scaife-viewer/tagging-pipeline/tree/main/data.

We now have three layers of morpho-syntactic data.

  1. Actively curated Treebanks (e.g., Francesco Mambrini’s Daphne).
  2. The growing repository of curated and automatically generated treebanks from Alek Keersmaekers’ Glaux Trees (currently 20 million words of Greek, including some works that are not yet in Perseus).
  3. The comprehensive morpho-syntactic data generated for any texts added to Perseus.

The opening sections of Thucydides (source here):

The identifier+textchunk format is all that we need to add new texts into Perseus. While the richer structure of the TEI XML format that we have used in Scaife has advantages, we can now add much more content much more quickly and work with larger corpora than has been practical in the more demanding XML framework. The goal is not to abandon XML but to provide the XML as time allows while also being able to work with larger, less structured collections.
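To give a sense of how little machinery this format needs, here is a minimal Python sketch of a reader for it. The exact file layout is an assumption on my part for the purposes of illustration: one chunk per line, with the citation identifier first, then whitespace, then the text. The names Chunk and parse_chunks are hypothetical and are not taken from the ATLAS code.

```python
from typing import Iterable, Iterator, NamedTuple


class Chunk(NamedTuple):
    ref: str   # citation identifier, e.g. "1.1.1"
    text: str  # the text of that citable unit


def parse_chunks(lines: Iterable[str]) -> Iterator[Chunk]:
    """Yield (identifier, text) pairs from identifier+textchunk lines.

    Assumes one chunk per line, with the identifier separated from the
    text by the first run of whitespace. Blank lines are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split(None, 1)
        ref = parts[0]
        text = parts[1] if len(parts) > 1 else ""
        yield Chunk(ref, text)


# Illustrative input: the opening words of Thucydides, then a placeholder.
sample = [
    "1.1.1 Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον",
    "1.1.2 (text of the next section)",
]
for chunk in parse_chunks(sample):
    print(chunk.ref, "->", chunk.text)
```

Under this assumption, a new text needs nothing more than a stable identifier and a string for each citable unit, which is why it can be produced much faster than a full TEI XML edition.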

The morpho-syntactic analysis (link here):

A draft of a fuller write-up for this work is available in Zenodo: The Sixth Generation of the Perseus Digital Library and a Workflow for Open Philology.
