Feedback OCR Bodin Book 1

Data Entry Errors

Requiring Manual Correction:

1) The following bit of XML had been copied from the English text and pasted at the top of each chapter in the Latin Book 1 Word document, but it should have appeared only once, at the top of the document. Only the DIV3 with TYPE="chapter" should have been repeated at the top of each chapter:

<DIV1 TYPE="text">
<DIV2 N="1" TYPE="book">
<PB N="1" REF="3"/>
<HEAD>THE FIRST BOOKE OF A COMMONWEALE.</HEAD>

2) An invalid Unicode space character (0x1F) appeared a number of times in the document.

3) The biggest problem was a variety of errors in the use of the <NOTE PLACE="marg"></NOTE> tag:

missing closing and/or opening tag
missing = before PLACE attribute
curly quotes (“ and ”) instead of straight quotes (") used around the PLACE attribute value
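
Although these errors had to be corrected by hand, a script can at least locate them. The following is a minimal sketch, not the procedure actually used: the function name find_problems is hypothetical, and it assumes each NOTE tag sits on a single line and that the invalid character is one of the control characters XML 1.0 forbids.

```python
import re

# Control characters that XML 1.0 does not allow (everything below 0x20
# except tab, line feed and carriage return).
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
# PLACE followed by something other than "=", e.g. <NOTE PLACE"marg">.
MISSING_EQUALS = re.compile(r"<NOTE\s+PLACE\b\s*[^=\s>]", re.IGNORECASE)
# Curly quotes anywhere inside a NOTE start tag.
CURLY_QUOTES = re.compile(r"<NOTE\b[^>]*[“”][^>]*>", re.IGNORECASE)


def find_problems(text):
    """Return (line number, description) pairs; line 0 means whole document."""
    problems = []
    # split("\n") rather than splitlines(), which would also split on
    # some of the control characters we are looking for.
    for lineno, line in enumerate(text.split("\n"), start=1):
        if INVALID_XML_CHARS.search(line):
            problems.append((lineno, "invalid control character"))
        if MISSING_EQUALS.search(line):
            problems.append((lineno, "missing = after PLACE"))
        if CURLY_QUOTES.search(line):
            problems.append((lineno, "curly quotes in NOTE tag"))
    # A simple count comparison catches missing opening or closing tags.
    opens = len(re.findall(r"<NOTE\b", text, re.IGNORECASE))
    closes = len(re.findall(r"</NOTE\s*>", text, re.IGNORECASE))
    if opens != closes:
        problems.append((0, f"{opens} opening vs {closes} closing NOTE tags"))
    return problems
```

The tag count only reports that the document is unbalanced; finding the exact note that lacks a tag still requires reading the surrounding text.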

Correctable via automatic methods:

Page xx should be <PB n="xx"/>
Several page markers appeared as a bare XXX instead of Page XXX
All chapter DIV3 elements used N="1" rather than the correct chapter number
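
A sketch of how these automatic corrections might be scripted is shown here. It rests on assumptions not stated in the notes: page markers stand alone on their own line and are numeric, every DIV3 is a chapter, and chapters appear in document order. The function names are hypothetical.

```python
import re


def page_markers_to_pb(text):
    """Turn a line reading 'Page 12' (or a bare '12') into <PB n="12"/>."""
    # Assumption: a page marker is the only content on its line.
    return re.sub(r"(?m)^[ \t]*(?:Page[ \t]+)?(\d+)[ \t]*$", r'<PB n="\1"/>', text)


def renumber_chapters(text):
    """Replace the N value of each DIV3 start tag with a running count."""
    counter = 0

    def repl(match):
        nonlocal counter
        counter += 1
        return f"{match.group(1)}{counter}{match.group(2)}"

    # Assumption: attribute values already use straight double quotes.
    return re.sub(r'(<DIV3\b[^>]*\bN=")[^"]*(")', repl, text)
```

Treating any all-digit line as a page marker is the riskiest assumption here, so the output of the first function is worth checking against the page images.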

Steps to convert to TEI-Analytics

Saved the Word document as plain text
added a wrapping <TEI/> root element
corrected the errors listed above
added closing tags for <DIV3> elements
replaced & with &amp;
lower-cased all element and attribute names
changed divX elements to div
added TEI header, text, body, p elements
replaced [A], [B], etc. with <milestone n="A"/>, etc.
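
The text-level steps above can be sketched roughly as follows. This is an illustration under assumptions, not the conversion that was actually run: the function names are hypothetical, tags are assumed to be well formed with straight-quoted attribute values, and section markers are assumed to be single capital letters in square brackets.

```python
import re


def escape_bare_ampersands(text):
    # Leave existing entities such as &amp; or &#160; untouched.
    return re.sub(r"&(?!#?\w+;)", "&amp;", text)


def add_milestones(text):
    # [A], [B], ... become empty milestone elements.
    return re.sub(r"\[([A-Z])\]", r'<milestone n="\1"/>', text)


def lowercase_tags(text):
    def lower_tag(match):
        tag = match.group(0)
        # Lower-case the element name.
        tag = re.sub(r"^(</?)([A-Za-z][\w.-]*)",
                     lambda m: m.group(1) + m.group(2).lower(), tag)
        # Lower-case attribute names (the word before ="), not their values.
        tag = re.sub(r'([A-Za-z][\w.-]*)(\s*=\s*")',
                     lambda m: m.group(1).lower() + m.group(2), tag)
        return tag

    return re.sub(r"</?[A-Za-z][^<>]*>", lower_tag, text)


def flatten_divs(text):
    # div1, div2, div3 ... all become plain div (run after lower-casing).
    return re.sub(r"<(/?)div\d+\b", r"<\1div", text)


def convert(text):
    text = escape_bare_ampersands(text)
    text = add_milestones(text)
    text = lowercase_tags(text)
    return flatten_divs(text)
```

The order matters in one place only: flatten_divs expects lower-case tag names, so it runs after lowercase_tags.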

Additional Cleanup on XML needed:

Fill in teiHeader info (bibl, publication statement, etc.)
Correct chapter <HEAD/> text (still same as English)
If [A], [B], etc. sections are to become part of citation scheme, they need to be converted to divs
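
If the lettered sections do become divs, the conversion could start from the original bracketed markers. The sketch below is only one possible approach: the function name and the type="section" value are hypothetical, and it assumes each section runs from its marker to the next marker or the end of the chapter text, without crossing other element boundaries.

```python
import re


def sections_to_divs(chapter_text):
    """Wrap the text following each [A], [B], ... marker in its own div."""
    # With a capturing group, re.split returns
    # [text before first marker, "A", text of A, "B", text of B, ...].
    parts = re.split(r"\[([A-Z])\]", chapter_text)
    out = [parts[0]]
    for label, body in zip(parts[1::2], parts[2::2]):
        out.append(f'<div type="section" n="{label}">{body}</div>')
    return "".join(out)
```

Where a marker falls in the middle of a paragraph, the paragraph would have to be split by hand or by a further rule before this wrapping yields well-formed XML.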