New OCR methods for early printings » Perseus Digital Library Updates

Uwe Springmann
Digital Humanist
Ludwig-Maximilians-University, Munich

Personal Website
Digital Humanities Site

Access to texts as machine-actionable data is a prerequisite for research in the Digital Humanities. For large volumes of text, manual transcription is too slow and costly to be employed at scale, while state-of-the-art automatic OCR methods give good results only for printings from the 19th century onward. A method for the automatic, high-accuracy transcription of images of early printed pages has therefore been missing.

The recent successful application of recurrent neural networks (RNNs) with long short-term memory (LSTM) to the OCR of 19th-century books printed in Antiqua and Fraktur types by Breuel et al. has prompted an investigation of whether this method also applies to very early printings, covering the range from the incunabula period (1450-1500) to the present. Experiments on scripts and alphabets, and on the role of training and postcorrection, are described. The results show that character recognition rates above 95% can be expected for good publicly available scans of historical books, provided their page layout has been properly analyzed.
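
To make a figure such as "above 95%" concrete, here is a minimal sketch of one common way to measure a character recognition rate: one minus the edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. The text above does not say how its rates were computed, so this definition, the function names, and the sample line are assumptions chosen for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Assumed definition: 1 - (edit distance / ground-truth length)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return 1.0 - levenshtein(ground_truth, ocr_output) / len(ground_truth)


# Hypothetical example line, not taken from the experiments described above.
truth = "In principio creavit Deus caelum et terram"
ocr = "In principio creauit Deus caelnm et terram"
print(f"{character_accuracy(truth, ocr):.3f}")
```

Under this definition the value can drop below zero when the OCR output contains many spurious insertions, so evaluation tools often report the error rate (edit distance divided by ground-truth length) instead.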