FRBR Catalog SIP Creation Process

Overview


Motivations

  1. Normalize the MODS data
  2. Add CTS urns and TextInventory metadata
  3. Make the catalog data available in Fedora
  4. Create a prototype of a search/browse interface for the catalog

Outputs

1. One ATOM feed for each work that contains the following ATOM entries:

  • a CTS inventory for the work, listing all catalogued versions and translations, with online elements for any versions/translations that exist in Perseus
  • one expression-level MODS record for each catalogued version
  • one expression-level MODS record for each catalogued translation
  • one expression-level MODS record for each catalogued related item (summary, commentary, etc.)
  • an automatically created MARC version of each MODS record

2. One set of Fedora Update Directives per work, created from the ATOM Feeds

  • An Ant build script iterates through the work feeds, querying the Fedora repository for existing records for each work and creating the appropriate add/update/delete directives for ingest of the data into Fedora.

3. One set of XC catalog records pulled from the CTS Metadata in Fedora

  • XC MetadataServicesToolkit issues OAI-PMH ListRecords requests to the Fedora Repository for metadataPrefix=cts
  • Processes through CTS Service
    • To Do: describe details of this process

4. One set of XC catalog records pulled from the MARC Metadata in Fedora

  • XC MetadataServicesToolkit issues OAI-PMH ListRecords requests to the Fedora Repository for metadataPrefix=marc
  • Processes through MarcNormalization Service
    • To Do: describe details of this process
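
The ListRecords requests in items 3 and 4 follow the standard OAI-PMH request form. A minimal sketch of how such a request URL is put together is shown here; the endpoint URL is a hypothetical placeholder (the actual Fedora OAI provider address is not given in this document), and the function name is ours, not part of the XC toolkit.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the real Fedora OAI provider URL.
BASE_URL = "http://example.org/fedora/oai"


def list_records_url(metadata_prefix, base_url=BASE_URL, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    OAI-PMH treats resumptionToken as an exclusive argument, so when a
    token is supplied the metadataPrefix is left out of the request.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(list_records_url("cts"))
    print(list_records_url("marc"))
```

The same helper covers both harvests; only the metadataPrefix value (cts or marc) changes.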

Inputs

  • Authors-Abbreviations-Editions.xls (Latin and Greek sheets only)
  • IDMatches.xls (Anonymous and IDMatches sheets)
  • Perseus CTS Inventory

Process

  1. Create 4 separate lists of works (one per input sheet) with identifiers to search for in the MODS records.
    1. The identifiers come from the input spreadsheets listed above
    2. The following additional information is also taken from the 4 input sheets:
      1. author url
      2. author names
      3. titles
      4. perseus status
    3. The entire FRBR collection (Greek and Latin) is now searched for all works.
  2. Iterate through these work lists, searching and aggregating MODS records matching one of the possible identifiers for the work (identifier search order is phi, then stoa, then tlg). Search/aggregation strategy is as follows:
    1. First find all the MODS records in the FRBR collection which:
      1. have a value matching the identifier in a top-level mods:identifier field that either does not have the @displayLabel attribute defined or has it set to ‘isTranslationOf’ [the purpose of the @displayLabel filter is to limit the results to versions and translations, excluding related items such as commentaries]
      2. have a mods:relatedItem whose @type attribute equals ‘constituent’ and that has a child mods:identifier field whose value matches the identifier and that either does not have the @displayLabel attribute defined or has it set to ‘isTranslationOf’ [to find all catalogued constituent records for the work, limited to versions and translations, excluding related items such as commentaries]
      3. have a mods:relatedItem whose @type attribute equals ‘constituent’ and that has a child mods:identifier field whose value matches the identifier and whose @displayLabel attribute is set to something other than ‘isTranslationOf’ [to find all catalogued constituent records for the work, limited to related items such as commentaries]
      4. nested related items are included as of the 10-24-2012 run.
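
The three search criteria above can be sketched as a single pass over one MODS record. This is an illustration only, not the code used in the actual run: it assumes exact string equality between the work identifier and the mods:identifier value, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

MODS = "{http://www.loc.gov/mods/v3}"


def is_version_or_translation(ident):
    """@displayLabel is absent or set to 'isTranslationOf'."""
    label = ident.get("displayLabel")
    return label is None or label == "isTranslationOf"


def search_record(mods, work_id):
    """Return (top_level_match, version_items, related_items) for one record.

    Assumption: identifiers are compared by exact string equality.
    """
    # Criterion 1: a matching top-level mods:identifier.
    top = any(
        (i.text or "").strip() == work_id and is_version_or_translation(i)
        for i in mods.findall(MODS + "identifier")
    )
    versions, related = [], []
    # iter() also walks nested relatedItem elements.
    for item in mods.iter(MODS + "relatedItem"):
        if item.get("type") != "constituent":
            continue
        for ident in item.findall(MODS + "identifier"):
            if (ident.text or "").strip() != work_id:
                continue
            # Criterion 2 (version/translation) or criterion 3 (related item).
            if is_version_or_translation(ident):
                versions.append(item)
            else:
                related.append(item)
            break
    return top, versions, related
```

Records with a top-level match and the items in versions feed the Expression Level group; the items in related feed the Related Item group.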
    2. Next create 2 groups of MODS records from the search results from the above as follows:
      1. Expression Level Records:
        1. taking the records matched by criterion 1 above without modification
        2. creating new top-level MODS records from the mods:relatedItem elements matched by criterion 2
      2. Related Item Records:
        1. creating new top-level MODS records from the mods:relatedItem elements matched by criterion 3
    3. Expand and refine the Expression records by:
      1. splitting records with more than 1 distinct mods:languageTerm value from mods:language parents which either don’t have @objectPart defined or have it set to ‘text’
        1. This step is skipped for MODS records with Perseus identifiers, under the assumption that these records have already been split
        2. the following fields are not copied into the record for the source language (i.e. grc or lat): mods:name/mods:role/mods:roleTerm=translator, and any mods:subject whose @authority='lcsh' and whose mods:topic matches ‘translation’ case-insensitively
      2. Deduping the entire resulting set for the work, using the location URLs as the primary deduping key. If a MODS record A for a work has the same mods:location/mods:url value as another MODS record B for that work, A is considered a duplicate if both of the following criteria are also met:
        1. Record A has only 1 mods:location/mods:url field, OR Record A has more than 1 mods:location/mods:url field and all of the locations in Record A match a mods:location field with the same @displayLabel value in Record B
        2. All of the languages identified in the mods:language/mods:languageTerm fields of Record A match a language in the mods:language/mods:languageTerm fields of Record B (only mods:language elements without @objectPart, or where @objectPart equals ‘text’, are considered).
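
The duplicate test described above can be expressed as a small predicate. The sketch below assumes each record has already been reduced to a plain dictionary with a list of (displayLabel, url) pairs and a list of text-language codes; that data shape and the function name are ours, chosen for illustration.

```python
def is_duplicate(a, b):
    """Return True if record `a` counts as a duplicate of record `b`.

    Each record is a dict with:
      "locations": list of (displayLabel, url) tuples
      "languages": list of language codes (text languages only)
    """
    a_urls = {url for _, url in a["locations"]}
    b_urls = {url for _, url in b["locations"]}
    # Primary key: the two records must share at least one location URL.
    if not a_urls & b_urls:
        return False
    # Criterion 1: with more than one location, every (label, url) pair
    # of A must also be present in B.
    if len(a["locations"]) > 1 and not set(a["locations"]) <= set(b["locations"]):
        return False
    # Criterion 2: every language of A must also be a language of B.
    return set(a["languages"]) <= set(b["languages"])
```

Note that the test is not symmetric: a record with one location and one language can be a duplicate of a richer record without the reverse being true.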
    4. If we have found at least one Expression Level MODS record (OR if no MODS records were found but we do have a version or translation for a work matching the identifier in the Perseus CTS inventory), then we proceed to make the ATOM Feed using the identifier we matched on as the base identifier for the CTS urns referenced in the feed.
    5. The versions and translations in the CTS Text inventory part of the feed are created from:
      1. any edition and translation entries from the original Perseus CTS inventory
      2. a new edition or translation for each MODS record, unless the MODS record has a Perseus location URL matching a document name for the work in the original Perseus inventory. It is identified as a translation if the MODS record has:
        1. mods:name/mods:role/mods:roleTerm=translator OR
        2. mods:subject whose @authority='lcsh' and matches(mods:topic,'translation','i') OR
        3. mods:language[@objectPart='text' or not(@objectPart)]/mods:languageTerm is not equal to the language being processed
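
The three translation tests above could be checked as follows. This is a sketch under assumptions: it treats mods:languageTerm values as language codes comparable to the language being processed (e.g. grc or lat), and the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}


def is_translation(mods, source_lang):
    """Apply the three translation tests to one MODS record."""
    # 1. Any name with the role term 'translator'.
    for term in mods.findall("mods:name/mods:role/mods:roleTerm", NS):
        if (term.text or "").strip() == "translator":
            return True
    # 2. An LCSH subject whose topic matches 'translation', ignoring case.
    for subject in mods.findall("mods:subject[@authority='lcsh']", NS):
        for topic in subject.findall("mods:topic", NS):
            if re.search("translation", topic.text or "", re.IGNORECASE):
                return True
    # 3. A text language other than the language being processed.
    for lang in mods.findall("mods:language", NS):
        if lang.get("objectPart") not in (None, "text"):
            continue
        for term in lang.findall("mods:languageTerm", NS):
            if (term.text or "").strip() != source_lang:
                return True
    return False
```

A record failing all three tests would be added to the inventory as an edition rather than a translation.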
    6. The MODS records included in the feed include the following normalizations:
      1. the first title pulled from the spreadsheets is added as a uniform title (mods:titleInfo[@type='uniform']) if it isn’t already present
      2. mods:identifier[@type='cts-urn'] fields are normalized to mods:identifier[@type='ctsurn']
      3. a mods:identifier[@type='ctsurn'] is added if one doesn’t already exist
      4. cts urns are normalized from greekLang and latinLang to greekLit and latinLit
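
The four normalizations above are simple enough to show in one function. The sketch below is illustrative only; it reads "if it isn’t already present" as "no uniform title with that text exists yet", which is our interpretation, and the function and parameter names are hypothetical.

```python
import xml.etree.ElementTree as ET

MODS = "{http://www.loc.gov/mods/v3}"


def normalize(mods, uniform_title, cts_urn):
    """Apply the feed normalizations to one MODS record, in place."""
    # 1. Add the spreadsheet title as a uniform title if it is missing.
    existing = [
        t.findtext(MODS + "title")
        for t in mods.findall(MODS + "titleInfo")
        if t.get("type") == "uniform"
    ]
    if uniform_title not in existing:
        info = ET.SubElement(mods, MODS + "titleInfo", {"type": "uniform"})
        ET.SubElement(info, MODS + "title").text = uniform_title

    urns = []
    for ident in mods.findall(MODS + "identifier"):
        # 2. Rename the identifier type cts-urn to ctsurn.
        if ident.get("type") == "cts-urn":
            ident.set("type", "ctsurn")
        if ident.get("type") == "ctsurn":
            # 4. Rewrite greekLang/latinLang to greekLit/latinLit.
            text = ident.text or ""
            text = text.replace("greekLang", "greekLit")
            ident.text = text.replace("latinLang", "latinLit")
            urns.append(ident)

    # 3. Add a ctsurn identifier if the record has none.
    if not urns:
        ET.SubElement(mods, MODS + "identifier", {"type": "ctsurn"}).text = cts_urn
    return mods
```

Running the function twice on the same record leaves it unchanged the second time, which is a useful property when feeds are regenerated.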
    7. A set of refindex entries is added to the Feed for all location urls found across all MODS records. These aren’t currently used at all by downstream processing and should probably be handled differently.
    8. The related items MODS records are now included at the end of the Feeds (no attempt to dedupe these is currently being made)
  3. We then have a set of 4 different feeds, which we combine. Where a work feed was created from more than one spreadsheet:
    1. work feeds created from IDMatches take precedence over feeds from Anonymous, AAE-Greek, and AAE-Latin
    2. work feeds created from Anonymous take precedence over feeds from AAE-Greek and AAE-Latin
    3. work feeds from AAE-Greek take precedence over AAE-Latin only if the identifier starts with tlg
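
The precedence rules for combining the four sets of feeds can be sketched as a small selection function. The sheet labels are taken from the rules above; the function name is hypothetical, and the fallback to AAE-Latin for identifiers that do not start with tlg is our reading of rule 3 rather than something the rules state outright.

```python
def pick_feed(work_id, sources):
    """Choose which sheet's feed to keep for a work.

    `sources` is the collection of sheet names that produced a feed for
    `work_id`. Returns the winning sheet name, or None if there is none.
    """
    sources = set(sources)
    # Rules 1 and 2: IDMatches first, then Anonymous.
    for sheet in ("IDMatches", "Anonymous"):
        if sheet in sources:
            return sheet
    # Rule 3: AAE-Greek beats AAE-Latin only for tlg identifiers.
    # Assumption: otherwise AAE-Latin is kept.
    if {"AAE-Greek", "AAE-Latin"} <= sources:
        return "AAE-Greek" if work_id.startswith("tlg") else "AAE-Latin"
    # Only one of the AAE sheets produced a feed.
    for sheet in ("AAE-Greek", "AAE-Latin"):
        if sheet in sources:
            return sheet
    return None
```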

Necessary Corrections and Improvements

  • Works with identifiers containing Xs are not handled consistently. This may be due to odd character encoding in the text. We need to investigate this and correct it.
  • We need to add special handling for Nepos and Suetonius for which we have created extended Phi identifiers.
  • phi1017.phi1012 is the PHI identifier for a work of Seneca the Younger, but it actually comprises nine different works, each of which has a unique STOA identifier as well as a Perseus:abo. Specify an exception in the code for these works (stoa0255-stoa004, stoa0255-stoa006 through stoa0255-stoa013) so that they are pulled individually by their STOA identifiers instead of the PHI identifier.
  • When adding versions and translations to the CTS Inventory for the MODS records we should also double check any existing ctsurns in the MODS record
  • We need to include the MADS records and leverage them for the aggregating on Author for a browse display
  • Add word counts to the MODS records from the TLG and PHI spreadsheets that Greg provided
  • Use the newly aggregated Expression level MODS records as a new base for work going forward??
  • Fedora ingest process currently is not picking up all the records. There are a variety of different errors being reported, some with the data and some with the Fedora environment. We need to review and correct the errors.
  • Review and revise the Fedora PID scheme and datastream contents
  • Analyze whether we should upgrade to the latest XC software or use a completely different catalog search interface for the data.
  • Make search services available which provide access to data metrics and visualizations
