August 8, 2015
http://tinyurl.com/nlvhy9b
Comments to munson@dh.uni-leipzig.de
Federico Boschetti, CNR, Pisa
Gregory Crane, Leipzig/Tufts
Matt Munson, Leipzig/Tufts
Bruce Robertson, Mount Allison
Nick White, Durham (UK) (and Tufts during 2014)
A first stab at producing OCR-generated Greek and Latin for the complete Patrologia Graeca (PG) is now available on GitHub at https://github.com/OGL-PatrologiaGraecaDev. This release provides raw textual data that will be of service to those with programming expertise and to developers with an interest in Ancient Greek and Latin. The Patrologia Graeca has as many as 50 million words of Ancient Greek produced over more than 1,000 years, along with an even larger amount of scholarship and accompanying translations in Latin.
Matt Munson started a new organization for this data because it is simply too large to put into the existing OGL organization. Each volume can contain 250MB or more of .txt and .hocr files, so it is impossible to put everything in one repository or even several dozen repositories. Instead, all the OCR results for each volume are contained in their own repository within the new organization. This will also allow us to add more OCR data as necessary (e.g., from Bruce Robertson, of Mt. Allison University, or from nidaba, our own OCR pipeline) at the volume level.
The repositories are being created and populated automatically by a Python script, so if you notice any problems or strange happenings, please let us know either by opening an issue on the individual volume repository or by sending us an email. This is our first attempt at pushing this data out. Please let us know what you think.
Available data includes:
Greek and Latin text generated by two open source OCR engines, OCRopus (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr). For work done optimizing OCRopus, see http://heml.mta.ca/lace. For work done optimizing Tesseract, see http://ancientgreekocr.org/. The output format for both engines is hOCR (https://en.wikipedia.org/wiki/HOCR), a format that contains links to the coordinates on the original page image from which the OCR was generated.
OCR results for as many scans of each volume of the Patrologia Graeca as we could find in the HathiTrust. We discovered that the same OCR engine applied to scans of different copies of the same book would generate different errors (even when the scans seemed identical to most human observers). This means that if OCR applied to copy X incorrectly analyzed a particular word, there was a good chance that the same word would be correctly analyzed when the OCR engine was applied to copy Y. A preliminary study of this phenomenon is available here: http://tinyurl.com/ppyfdfj. In most cases, the OCRopus/Lace OCR contains results for four different scanned copies while the Tesseract/AncientGreekOCR output contains results for up to 10 different copies. All of the Patrologia Graeca volumes are old enough that HathiTrust members in Europe and North America can download the PDFs for further analysis. Anyone should be able to see the individual pages used for OCR via the public HathiTrust interface.
Initial page-level metadata for the various authors and works in the PG, derived from the core index at columns 13-114 of Cavallera’s 1912 index to the PG (which Roger Pearse cites at http://www.roger-pearse.com/weblog/patrologia-graeca-pg-pdfs/). A working TEI XML transcription, which has begun capturing the data within the print source, is available for inspection at: https://www.dropbox.com/s/mldhu4okpq4i7r8/pg_index2.xml. All figures are preliminary and subject to modification (that is one motivation for posting this call for help), but we do not expect that the figures will change much at this point. At present, we have identified 658 authors and 4,287 works. The PG contains extensive introductions, essays, indices, etc., and we have tried to separate these out by scanning for keywords (e.g., praefatio, monitum, notitia, index). We estimate that there are 204,129 columns of source text and 21,369 columns of secondary sources, representing roughly 90% and 10% respectively. A column in Migne contains about 500 words, so the source text comes to roughly 100 million words. Since the Greek texts (almost) always have accompanying Latin translations, about half of that, or up to 50 million words, is Greek. Many authors have extensive Latin notes and in some cases no Greek text, so there should be even more Latin. For more information, see http://tinyurl.com/ppyfdfj.
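The word-count estimate in the last item can be reproduced with a few lines of arithmetic. The sketch below uses only the column counts and the 500-words-per-column figure quoted above; the even split between Greek and Latin is a simplifying assumption (it treats every source column of Greek as having a matching Latin column), and the function name is ours, not part of any released tooling.

```python
# Figures quoted in the text; all are preliminary.
SOURCE_COLUMNS = 204_129      # columns of source text
SECONDARY_COLUMNS = 21_369    # columns of introductions, essays, indices
WORDS_PER_COLUMN = 500        # approximate words per Migne column


def estimate_word_counts():
    """Return rough word-count estimates for the PG."""
    total_columns = SOURCE_COLUMNS + SECONDARY_COLUMNS
    source_words = SOURCE_COLUMNS * WORDS_PER_COLUMN
    return {
        # Share of all columns that hold source text (about 0.905).
        "source_share": SOURCE_COLUMNS / total_columns,
        "source_words": source_words,
        # Assumption: Greek and Latin translation split the source
        # columns evenly, so this is an upper bound on the Greek.
        "greek_upper_bound": source_words // 2,
        "secondary_words": SECONDARY_COLUMNS * WORDS_PER_COLUMN,
    }


if __name__ == "__main__":
    for name, value in estimate_word_counts().items():
        print(f"{name}: {value}")
```

Under these assumptions the ceiling for Greek comes out at about 51 million words, in line with the "up to 50 million" figure; works that have no Greek text push the real number lower.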
Next Steps
Developing high-recall searching by combining the results for each scanned page of the PG. This entails several steps. First, we need to align the OCR pages with each other: page 611 in one scan of a volume may correspond to page 605 in another, depending upon how the front matter is treated and upon pages that one scan may have missed. Second, we need to create an index of all forms in the OCR-generated text available for each page in each PG volume. Since at least one of the two OCR engines, applied to multiple scans of the same page, is likely to transcribe any given word correctly, a unified index for all the text for all the scans of a page will capture a very high percentage of the words on that page.
Running various forms of text mining and analysis over the PG. Many text mining and analysis techniques work by counting frequently repeated features. Such techniques can be relatively insensitive to error rates in the OCR (i.e., you get essentially the same results whether your texts are 96% accurate or 99.99% accurate). Many methods for topic modelling and stylistic analysis should produce immediately useful results.
Using the multiple scans to identify and correct errors and to create a single optimized transcription. In most cases, bad OCR produces nonsense forms that are not legal Greek or Latin. When one OCR run has a valid Greek or Latin word and others do not, that valid word is usually correct. Where two different scans produce valid Greek or Latin words (e.g., the common confusion of eum and cum), we can use the hOCR feature that allows us to include multiple possibilities. We can do quite a bit by encoding the confidence that we have in the accuracy of each transcribed word.
Providing a public error correction interface. One error correction interface already exists and has been used to correct millions of words of OCR-generated Greek, but two issues face us. First, we need to address the fact that we cannot ourselves serve page images from HathiTrust scans. HathiTrust members could use the system that we have by downloading the scans of the relevant volumes to their own servers, but that does not provide a general solution. Second, our correction environment deals with OCR for one particular scanned copy. Ideally, the correction environment would allow readers to draw upon the various different scans from different copies and different OCR engines.
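Two of the steps above, the unified index of forms and the choice among competing readings, can be sketched in a few lines of Python. This is an illustration under strong assumptions rather than a description of our pipeline: it assumes the pages have already been aligned across scans, that the readings for a word position have already been matched up (a hard problem in its own right), and that `lexicon` is some set of valid Greek and Latin forms. All function names here are hypothetical.

```python
from collections import Counter, defaultdict


def build_form_index(readings_by_page):
    """Map each word form to the set of pages on which any scan or
    engine produced it.

    readings_by_page: {page_number: [list_of_tokens, ...]}, one token
    list per OCR run (engine + scanned copy) for that page.
    """
    index = defaultdict(set)
    for page, readings in readings_by_page.items():
        for tokens in readings:
            for token in tokens:
                index[token].add(page)
    return index


def choose_reading(candidates, lexicon):
    """Pick a reading for one word position from several OCR runs.

    Forms found in the lexicon are preferred; among those, the most
    frequent wins. Returns (best, alternatives), where alternatives
    are other surviving forms that could be kept as hOCR alternatives.
    """
    valid = [form for form in candidates if form in lexicon]
    # Fall back to all candidates if no run produced a valid form.
    pool = valid if valid else list(candidates)
    ranked = [form for form, _ in Counter(pool).most_common()]
    return ranked[0], ranked[1:]


if __name__ == "__main__":
    lexicon = {"eum", "cum", "in", "principio"}  # toy lexicon
    pages = {611: [["in", "principio"], ["iu", "principio"]]}
    print(sorted(build_form_index(pages)["principio"]))
    print(choose_reading(["eum", "cum", "cnm", "cum"], lexicon))
```

The index deliberately keeps every form, including nonsense such as "iu", because the goal of that step is recall: a search for a word should succeed if any run read it correctly. The voting step is where invalid forms are filtered out, and cases such as eum/cum, where more than one valid form survives, are the ones that would need confidence scores or human correction.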