
Extracting Text from Early Modern Books

Early modern printed books pose significant difficulties for optical character recognition. The amount of ink and the pressure applied varied from copy to copy, and the passage of time may have introduced smudges, tears, holes, and stains:

Example of a tear.

Heavily annotated page. The notes in the left margin contain numerous abbreviations.
Examples of stains.

Greek and Hebrew characters pose a different kind of difficulty, since they were often misunderstood during typesetting: