Findings

The results of DVE were published in a white paper submitted to the NEH as well as in a number of papers and presentations. Information about ongoing work can be found on the updates page and on the DVE blog.

White Paper

B. Almas, Alison Babeu, David Bamman, Federico Boschetti, Lisa Cerrato, Gregory Crane, Brian Fuchs, David Mimno, Bruce Robertson, and David Smith (2011). “What Did We Do With A Million Books: Rediscovering the Greco-Ancient world and reinventing the Humanities.” White paper submitted to the National Endowment for the Humanities (NEH), 2011. http://hdl.handle.net/10427/75558

Papers

B. Robertson (2012). “Optical Character Recognition of 19th-Century Polytonic Greek Texts: Results of a Preliminary Survey.” Dept. of Classics, Mount Allison University, 2012-01-19.

Abstract

This is a quantitative overview of a strategy for performing optical character recognition on text images comprising ancient Greek. We produced 22 different classifiers to conduct OCR on 19th-century ancient Greek texts from around the world. For each classifier, we processed 10 page images from 158 books. The output was scored for its ‘Greekness’ on phonetic and lexical grounds, and summarized in a table. In the majority of cases, the output of each text’s highest-scoring classifier is of sufficient quality to be useful in further research and image-fronted search engines. There is a good correlation between the best classifier or group of classifiers and the publisher and publication date. This confirms the usefulness of our approach, and will simplify OCR of occasional Greek words in other texts by the same publishers. Better line-segmentation strategies will provide the greatest single improvement in this process.
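The abstract's idea of scoring OCR output for its "Greekness" on phonetic and lexical grounds can be illustrated with a minimal sketch. The lexicon, the Unicode ranges, and the equal weighting of the two signals below are assumptions for illustration, not the survey's published metric:

```python
import re

# Toy lexicon of known Greek word forms; the survey's actual lexical
# resources are not described in the abstract, so this is a placeholder.
LEXICON = {"μῆνιν", "ἄειδε", "θεὰ"}

# Polytonic Greek spans the basic Greek block plus Greek Extended.
GREEK_LETTER = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")

def greekness(text, lexicon=LEXICON):
    """Score OCR output by the share of tokens that are phonetically
    plausible (Greek code points only) and that appear in a lexicon."""
    tokens = text.split()
    if not tokens:
        return 0.0
    phonetic = sum(1 for t in tokens if all(GREEK_LETTER.match(c) for c in t))
    lexical = sum(1 for t in tokens if t in lexicon)
    # Equal weighting of the two signals is an assumption, not the
    # paper's actual formula.
    return 0.5 * phonetic / len(tokens) + 0.5 * lexical / len(tokens)
```

Ranking the 22 classifiers' outputs for a given book by such a score is what lets the best classifier be chosen per text without ground-truth transcriptions.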


Bamman, David, and David Smith (2012). “Extracting Two Thousand Years of Latin from a Million Book Library.” Journal of Computing and Cultural Heritage, 5(1), 2012. http://doi.acm.org/10.1145/2160165.2160167

Abstract

With the rise of large open digitization projects such as the Internet Archive and Google Books, we are witnessing an explosive growth in the number of source texts becoming available to researchers in historical languages. The Internet Archive alone contains over 27,014 texts catalogued as Latin, including classical prose and poetry written under the Roman Empire, ecclesiastical treatises from the Middle Ages, and dissertations from 19th-century Germany written—in Latin—on the philosophy of Hegel. At one billion words, this collection eclipses the extant corpus of Classical Latin by several orders of magnitude. In addition, the much larger collection of books in English, German, French, and other languages already scanned contains unknown numbers of translations for many Latin books, or parts of books.

The sheer scale of this collection offers a broad vista of new research questions, and we focus here on both the opportunities and challenges of computing over such a large space of heterogeneous texts. The works in this massive collection do not constitute a finely curated (or much less balanced) corpus of Latin; it is, instead, simply all the Latin that can be extracted, and in its reach of twenty-one centuries (from approximately 200 BCE to 1922 CE) arguably spans the greatest historical distance of any major textual collection today. While we might hope that the size and historical reach of this collection can eventually offer insight into grand questions such as the evolution of a language over both time and space, we must contend as well with the noise inherent in a corpus that has been assembled with minimal human intervention.

Bamman, David, and Gregory Crane (2011). “Measuring Historical Word Sense Variation.” In Proceedings of the 11th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2011), pp. 1-10, ACM, 2011. Preprint available at: http://hdl.handle.net/10427/75561


Bamman, David, Alison Babeu, and Gregory Crane (2010). “Transferring Structural Markup Across Translations Using Multilingual Alignment and Projection.” In JCDL ’10: Proceedings of the 10th Annual Joint Conference on Digital Libraries, New York, NY, USA, pp. 11-20. ACM. Preprint available at: http://hdl.handle.net/10427/70398

Abstract

We present here a method for automatically projecting structural information across translations, including canonical citation structure (such as chapters and sections), speaker information, quotations, markup for people and places, and any other element in TEI-compliant XML that delimits spans of text that are linguistically symmetrical in two languages. We evaluate this technique on two datasets, one containing perfectly transcribed texts and one containing errorful OCR, and achieve an accuracy rate of 88.2% projecting 13,023 XML tags from source documents to their transcribed translations, with an 83.6% accuracy rate when projecting to texts containing uncorrected OCR. This approach has the potential to allow a highly granular multilingual digital library to be bootstrapped by applying the knowledge contained in a small, heavily curated collection to a much larger but unstructured one.
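The core projection step described above can be sketched in a few lines: given word alignments between a source text and its translation, a tag's source token span maps to the extent of the aligned target tokens. The function name, the alignment representation, and the toy data here are all illustrative assumptions, not the paper's implementation:

```python
def project_span(span, alignment):
    """Project a tagged token span from a source text to its translation.
    `span` is an inclusive (start, end) pair of source token indices;
    `alignment` maps each source index to a list of target indices."""
    targets = [t for s in range(span[0], span[1] + 1)
               for t in alignment.get(s, [])]
    if not targets:
        return None  # nothing aligned: the tag cannot be projected
    return (min(targets), max(targets))

# Toy word alignment between a 4-token source and a 5-token translation.
alignment = {0: [0], 1: [2], 2: [1], 3: [3, 4]}
print(project_span((1, 2), alignment))  # reordered tokens still yield a span
```

Because only span boundaries are projected, the method works for any TEI element that delimits linguistically symmetrical text, from chapter divisions to quotations.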


Boschetti, F., Romanello, M., Babeu, A., Bamman, D., and Crane, G. (2009). “Improving OCR Accuracy for Classical Critical Editions.” In Agosti et al. (eds.), Research and Advanced Technology for Digital Libraries, Springer, Berlin, 2009, Vol. 5714, pp. 156-167.

Abstract

This paper describes a workflow designed to populate a digital library of ancient Greek critical editions with highly accurate OCR-scanned text. While the most recently available OCR engines are now able, after suitable training, to deal with the polytonic Greek fonts used in 19th- and 20th-century editions, further improvements can be achieved with postprocessing. In particular, the progressive multiple alignment method applied to different OCR outputs based on the same images is discussed in this paper.
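Once multiple OCR outputs of the same page have been aligned into a common coordinate system (the alignment itself, the hard part, is assumed here rather than computed), a consensus reading can be recovered by voting column by column. This is a toy stand-in for the paper's progressive multiple alignment, not its algorithm:

```python
from collections import Counter

def vote(columns):
    """Per-column majority vote over already-aligned OCR outputs;
    '-' marks an alignment gap inserted by the aligner."""
    merged = []
    for col in columns:
        ch, _ = Counter(col).most_common(1)[0]
        if ch != "-":
            merged.append(ch)
    return "".join(merged)

# Three gapped readings of one line, as a multiple aligner might emit
# them: one engine misread an accent, another dropped a letter.
aligned = ["μηνιν", "μῆνιν", "μην-ν"]
print(vote(zip(*aligned)))  # consensus reading
```

Where two engines agree against a third, the majority reading wins, which is why combining independently trained classifiers can outperform the single best one.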


Cohen, Jeremy, Ioannis Filippis, Mark Woodbridge, Daniela Bauer, Neil Chue Hong, Mike Jackson, Sarah Butcher, et al. (2013). “RAPPORT: Running Scientific High-Performance Computing Applications on the Cloud.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1983), 2013. http://dx.doi.org/10.1098/rsta.2012.0073

Abstract

Cloud computing infrastructure is now widely used in many domains, but one area where there has been more limited adoption is research computing, in particular for running scientific high-performance computing (HPC) software. The Robust Application Porting for HPC in the Cloud (RAPPORT) project took advantage of existing links between computing researchers and application scientists in the fields of bioinformatics, high-energy physics (HEP) and digital humanities, to investigate running a set of scientific HPC applications from these domains on cloud infrastructure. In this paper, we focus on the bioinformatics and HEP domains, describing the applications and target cloud platforms. We conclude that, while there are many factors that need consideration, there is no fundamental impediment to the use of cloud infrastructure for running many types of HPC applications and, in some cases, there is potential for researchers to benefit significantly from the flexibility offered by cloud platforms.

Presentations

B. Fuchs, F. Boschetti, and J. Darlington (2011), “From Scan to Scholarly Resource: A Greco-Roman Index for the Internet Archive.” Presented at CAA 2011, Beijing, China.

B. Robertson and F. Boschetti (2012), “Large-Scale Polytonic Greek Optical Character Recognition in Practice and Theory.” E-Humanities Seminar, University of Leipzig, October 2012.

 
