The Dynamic Variorum Editions project aims to identify and track topics about the Greco-Roman world in large, multi-lingual public document collections and make this information available to digital libraries. We began as a successful entry in the first round of the Digging Into Data Challenge and have continued to work together and grow. The original team comprised researchers from Tufts University, Imperial College London and Mount Allison University.
Aims and Objectives
DVE aims to identify and track topics about the Greco-Roman world as they appear in more than a million documents produced across thousands of years and in several languages. Using data from the Internet Archive (IA), JSTOR, the HathiTrust and other sources, DVE aims to create a framework to produce “dynamic variorum” editions of texts. To do so, it mines the primary and secondary sources contained in these million-book collections to locate where people and places from the Greco-Roman world are discussed, which Greek and Latin works are cited, quoted, and alluded to, and what kinds of things are said about the people, places, and texts of the Greco-Roman world over time.
The objectives of DVE are:
- To create a reference database to the written culture of Classical Antiquity using statistical methods.
- To refine existing statistical and machine-learning methods for feature extraction in the domain of Classical Antiquity (ancient Greek and Latin literature).
- To provide a robust, scalable infrastructure for access to this information in the form of services supported by on-demand computing power (‘Cloud’).
- To enable continued community contribution to this body of information via services.
DVE builds on three decades of work in making the texts of Classical Antiquity available in electronic form and creating access methods to this data that meet the long-term needs of research, education and general knowledge. Work on the digitization of Greek and Latin texts began in the 1980s on a variety of fronts, most notably with the Perseus Project and the Thesaurus Linguae Graecae. Both projects focused not only on digitizing text but also on creating the structured, persistent access to the digitized material without which research would be impossible.
The current project can be seen as a continuation of the work that began with the first linguistic parser for ancient Greek, built at the Perseus Project in the early 1980s. The goal is to identify and correlate features in the texts of Classical Antiquity, and to publish this information in a sustainable and scalable form. A feature may be anything from a word (‘amo’, ‘amas’, ‘amat’ can all be correlated with the Latin verb ‘amare’; ‘est’ can mean either ‘he is’ or ‘he eats’) through places (‘Thebae’ refers to several physical locations) to people (there are several people with the name ‘Plinius’) and events (e.g. the Peloponnesian War).
The goal of DVE is to identify these features, to disambiguate them when necessary, and to correlate them with one another. For instance, when we need to find all of the references to the Thebes that is in Boeotia, and not the Thebes that is in Egypt, we will get a reliable list of pointers to passages in texts.
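The two steps described above, correlating word forms with a lemma and disambiguating a place name, can be illustrated with a minimal Python sketch. This is not DVE's implementation: the lexicon, the gazetteer, the cue words and the function names are all hypothetical, and simple cue-word overlap stands in for the statistical and machine-learning methods the project actually relies on.

```python
# Toy illustration only: all data and names below are hypothetical,
# not taken from DVE or from any real lexicon or gazetteer.

# Surface form -> candidate lemmas. 'est' is ambiguous between two verbs.
LEMMAS = {
    "amo": ["amare"],
    "amas": ["amare"],
    "amat": ["amare"],
    "est": ["esse", "edere"],
}

# Place name -> candidate referents, each with lowercase context cue words.
GAZETTEER = {
    "Thebae": {
        "Thebes (Greece)": {"cadmus", "oedipus", "boeotia"},
        "Thebes (Egypt)": {"nile", "egypt", "karnak"},
    },
}


def candidate_lemmas(form):
    """Return every lemma the surface form could belong to."""
    return LEMMAS.get(form.lower(), [])


def disambiguate_place(name, context_words):
    """Pick the referent whose cue words overlap most with the context.

    Returns None when the name is unknown, when no cue word matches,
    or when two referents tie for the best score.
    """
    candidates = GAZETTEER.get(name, {})
    context = {word.lower() for word in context_words}
    scores = sorted(
        ((len(cues & context), referent) for referent, cues in candidates.items()),
        reverse=True,
    )
    if not scores or scores[0][0] == 0:
        return None
    if len(scores) > 1 and scores[0][0] == scores[1][0]:
        return None
    return scores[0][1]


print(candidate_lemmas("amat"))
print(disambiguate_place("Thebae", ["the", "Nile", "floods"]))
```

A passage that mentions ‘Thebae’ near words such as ‘Nile’ is resolved to the Egyptian city, while a passage with no matching cue words is left unresolved rather than guessed at.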
Our long-term aim is to create a kind of electronic reference database to the written evidence-base for Classical Antiquity. A central index of this kind makes it possible to create so-called ‘variorum editions’ for any author or any stretch of text, in which features in that text are linked in a meaningful way to their uses in other collections. This opens up exciting new possibilities not only for research in the field of Classical Antiquity, but also for cross-fertilization between research fields as well as between research and other forms of cultural activity.
To build such a reference database entirely by hand for all of classical antiquity and its Nachleben far exceeds the capabilities of researchers, even if we were able to overcome the considerable organizational difficulties posed by such a project. The scale of the project requires powerful statistical methods for data analysis, which have been the subject of intensive research in machine learning and data mining. And here, where human researchers are almost certain to be overwhelmed, statistical methods thrive: the more data we have, the more reliable our results.
Implementing such a statistics-based reference system requires substantial computing power. This is especially true if the data which the system will process and reference can be expected to grow, as is more than likely, given that few European cultural artifacts are without some reference to Classical Antiquity. In addition to the computing power needed to extract the data, infrastructure in the form of services will be needed to access it in a meaningful form. For this reason, a major goal of the project is to improve infrastructure and access methods to the point where dynamic real-time extraction of statistical information becomes feasible. Even without this, the successful implementation of a reference database of this kind on a robust and extensible infrastructure which provides community access to data would be a major contribution not only to the field of Classical Studies, but to the larger domain of Cultural Heritage as well.