Feedback OCR Bodin Book 1


Data Entry Errors

Requiring Manual Correction:

1) The following bit of XML had been copied from the English version and pasted at the top of every chapter in the Latin Book 1 Word document, but it should have appeared only once, at the top of the document. Only the DIV3 with TYPE="chapter" should have been repeated at the top of each chapter:

<DIV1 TYPE="text">
<DIV2 N="1" TYPE="book">
<PB N="1" REF="3"/>

2) An invalid Unicode space character (0x1F) appeared a number of times in the document.
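
Characters like this can be located before any XML parser sees the file. The following is a minimal sketch, assuming the character in question is U+001F and that the target is XML 1.0; the function names are illustrative and are not part of the original workflow.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_chars(text):
    """Return (offset, code point) pairs for characters XML 1.0 forbids."""
    return [
        (m.start(), "U+%04X" % ord(m.group()))
        for m in INVALID_XML_CHARS.finditer(text)
    ]


def replace_invalid_chars(text, replacement=" "):
    """Swap each forbidden character for a plain space (an assumed policy)."""
    return INVALID_XML_CHARS.sub(replacement, text)
```

Replacing with a plain space is only one possible policy; deleting the character outright may suit some passages better.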

3) The biggest problem was a variety of errors in the use of the <NOTE PLACE="marg"></NOTE> tag:

  • missing closing and/or opening tag
  • missing = before PLACE attribute
  • curly quotes (“ and ”) instead of straight quotes (") used around PLACE attribute value
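
These were fixed by hand, but two of the three problems can be repaired mechanically, and the third can at least be located. The sketch below is an illustration under stated assumptions: marginal notes never nest, tags do not span lines, and all function names are hypothetical.

```python
import re

# Curly double quotes and the double-prime mark, mapped to a straight quote.
CURLY = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2033": '"'})
ANY_TAG = re.compile(r"<[^<>]+>")
MISSING_EQ = re.compile(r'(<NOTE\s+PLACE)\s*(")')
NOTE_TAG = re.compile(r"<(/?)NOTE\b[^<>]*>")


def straighten_quotes_in_tags(text):
    """Straighten quotes inside tags only; quotes in the prose are kept."""
    return ANY_TAG.sub(lambda m: m.group().translate(CURLY), text)


def add_missing_equals(text):
    """Turn <NOTE PLACE"marg"> into <NOTE PLACE="marg">."""
    return MISSING_EQ.sub(r"\1=\2", text)


def find_unbalanced_notes(text):
    """Return (line number, problem) pairs for unmatched NOTE tags."""
    problems = []
    open_line = None  # line of the NOTE currently open, if any
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in NOTE_TAG.finditer(line):
            is_closing = match.group(1) == "/"
            if is_closing and open_line is None:
                problems.append((lineno, "closing tag without opening tag"))
            elif is_closing:
                open_line = None
            else:
                if open_line is not None:
                    problems.append((open_line, "opening tag never closed"))
                open_line = lineno
    if open_line is not None:
        problems.append((open_line, "opening tag never closed"))
    return problems
```

Running straighten_quotes_in_tags before add_missing_equals matters, because the second function looks for a straight quote after PLACE. Unbalanced tags are only reported, since deciding where a missing tag belongs still needs a human reader.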

Correctable via automatic methods:

  1. Page xx should be <PB n="xx"/>
  2. There were several instances of XXX instead of Page XXX
  3. All chapter DIV3 elements used N="1" rather than the correct chapter number
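
As an illustration of what such automatic corrections might look like, here is a minimal sketch. It assumes that page markers sit on a line of their own, that page numbers are Arabic digits, and that chapter tags are written with N before TYPE (as in the DIV2 tag quoted earlier); the function names are hypothetical.

```python
import itertools
import re

# "Page 12" or a bare "12" alone on a line (spaces and tabs tolerated).
PAGE_LINE = re.compile(r"^[ \t]*(?:Page[ \t]+)?(\d+)[ \t]*$", re.MULTILINE)
CHAPTER_DIV = re.compile(r'<DIV3\s+N="[^"]*"\s+TYPE="chapter"\s*>')


def mark_page_breaks(text):
    """Replace page-marker lines with <PB n="..."/> elements."""
    return PAGE_LINE.sub(r'<PB n="\1"/>', text)


def renumber_chapters(text):
    """Number chapter DIV3 tags 1, 2, 3, ... in document order."""
    counter = itertools.count(1)
    return CHAPTER_DIV.sub(
        lambda m: '<DIV3 N="%d" TYPE="chapter">' % next(counter), text
    )
```

Treating every bare number on its own line as a page marker is a guess that could misfire on other numeric lines, so the output is worth spot-checking against the page images.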

Steps to convert to TEI-Analytics

  1. Saved Word document as plain text
  2. Added wrapping <TEI/> root element
  3. Corrected the above errors
  4. Added closing tags for <DIV3> elements
  5. Replaced & with &amp;
  6. Lower-cased all element and attribute names
  7. Changed divX elements to div
  8. Added TEI header, text, body, p elements
  9. Replaced [A], [B], etc. with <milestone n="A"/> etc.
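
The purely textual steps in this list (escaping ampersands, lower-casing names, flattening numbered divs, and inserting milestones) can be sketched with regular expressions. This is an illustration rather than the script actually used: the function names are hypothetical, section labels are assumed to be single capital letters, and the ampersand step deliberately skips existing entity references, which is a small variation on a plain replace.

```python
import re

ENTITY_TAIL = r"(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);"
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")
ATTR = re.compile(r'([A-Za-z][A-Za-z0-9]*)(\s*=\s*"[^"]*")')


def escape_ampersands(text):
    """Replace & with &amp; unless it already starts an entity reference."""
    return re.sub(r"&(?!" + ENTITY_TAIL + r")", "&amp;", text)


def lowercase_markup(text):
    """Lower-case element and attribute names, leaving values untouched."""
    def fix(match):
        slash, name, rest = match.groups()
        rest = ATTR.sub(lambda a: a.group(1).lower() + a.group(2), rest)
        return "<%s%s%s>" % (slash, name.lower(), rest)
    return TAG.sub(fix, text)


def flatten_divs(text):
    """Rename div1, div2, div3, ... (already lower-cased) to div."""
    return re.sub(r"<(/?)div[0-9]+\b", r"<\1div", text)


def letters_to_milestones(text):
    """Turn [A], [B], ... into <milestone n="A"/>, <milestone n="B"/>, ..."""
    return re.sub(r"\[([A-Z])\]", r'<milestone n="\1"/>', text)


def convert(text):
    """Apply the four steps in the order they are listed."""
    text = escape_ampersands(text)
    text = lowercase_markup(text)
    text = flatten_divs(text)
    return letters_to_milestones(text)
```

For example, convert('<DIV3 N="2" TYPE="chapter">[A] Lex & ius</DIV3>') yields '<div n="2" type="chapter"><milestone n="A"/> Lex &amp; ius</div>'. Adding the header, wrapping elements, and closing tags is left out here because those steps depend on the document structure rather than on simple patterns.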

Additional Cleanup on XML needed:

  1. Fill in teiHeader info (bibl, publication statement, etc.)
  2. Correct chapter <HEAD/> text (still same as English)
  3. If [A], [B], etc. sections are to become part of citation scheme, they need to be converted to divs