Saturday, 26 of July of 2014

Jane & Ade Stevenson as well as LOCAH and Linking Lives


Eye candy by Eric

A little less than a week ago I had a chat with Jane and Ade Stevenson on the topic of linked open data and archival materials. This blog posting outlines some of my “take aways”. In short, I learned that implementing linked open data (LOD) against archival materials may not be straightforward, but it does offer significant potential.

Jane Stevenson works for the Archives Hub, and Ade Stevenson works for Mimas. Both of them worked on a pair of projects surrounding LOD and archival materials. The first project was called LOCAH, and its purpose was to expose archival metadata as LOD. The second project, Linking Lives, grew out of the first, and its purpose was to provide a usable researcher interface for discovering new knowledge based on LOD from archives. Both of these projects are right on target when it comes to the LiAM project.

We had an easily flowing Skype conversation, and below are some of the things I heard Jane and Ade say about an implementation process:

  1. Ask yourself, “What is the goal I want to accomplish, and what am I trying to achieve?”
  2. Decide what you want to include in your output. Do not expect to include all of the metadata in your EAD files.
  3. Create a data model. The aim is not a one-to-one transformation of EAD into RDF but rather a triple store built from the EAD metadata. “Don’t think ‘records’ or ‘files’”; think models instead.
  4. Select the vocabulary(ies) you are going to use to structure your data. There are pros and cons to using existing vocabularies as well as creating your own. LOCAH and Linking Lives did both: they used existing vocabularies and also created their own, specifically when it came to identifying “creators”.
  5. Do data clean up. Archives Hub is an aggregation of EAD files. While the EAD files may validate against the EAD DTD or schema, they are not necessarily encoded consistently or thoroughly. (This is what Dorothea Salo calls “variety”. The EAD files exemplify “variety”, which makes them difficult to compute against.) Normalizing and enhancing EAD content may be a necessary evil.
  6. Transform your EAD files into RDF; the LOCAH project’s output included an XSL stylesheet doing just this sort of work. Be forewarned. The stylesheet is not for the faint of heart.
  7. Save the resulting RDF to some sort of triple store for access.
  8. Consider enhancing the data in the triple store with additional metadata, specifically from other LOD sites.
  9. Make the resulting data accessible via a SPARQL endpoint.
  10. Build a Web-based interface to access the SPARQL endpoint and interact directly with a researcher.
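
To make a few of the steps above more concrete, here is a minimal sketch in Python of steps 3 and 5 through 7, plus a toy stand-in for step 9. It is not the LOCAH stylesheet, which is written in XSL and is far more thorough. The EAD fragment, the `http://example.org/id/` URI base, the choice of Dublin Core terms as predicates, and every function name are assumptions I made up for illustration. The "triple store" is only a Python set, and `match` is a simple pattern match, not SPARQL.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up EAD-like fragment; real EAD files are far richer.
EAD = """
<ead>
  <archdesc level="fonds">
    <did>
      <unitid>  GB-1-ABC </unitid>
      <unittitle>Papers of   an Example Person</unittitle>
      <origination>Example Person</origination>
    </did>
  </archdesc>
</ead>
"""

BASE = "http://example.org/id/"    # hypothetical URI base, not a real service
DCT = "http://purl.org/dc/terms/"  # Dublin Core terms namespace


def clean(text):
    """Step 5: collapse stray whitespace so inconsistent encoding matters less."""
    return " ".join((text or "").split())


def ead_to_triples(xml_text):
    """Steps 3 and 6: map a few selected EAD elements onto (s, p, o) triples."""
    did = ET.fromstring(xml_text.strip()).find("./archdesc/did")
    unit_id = clean(did.findtext("unitid")).lower()
    creator = clean(did.findtext("origination")).lower().replace(" ", "-")
    subject = BASE + "archivalresource/" + unit_id
    return {
        (subject, DCT + "identifier", unit_id),
        (subject, DCT + "title", clean(did.findtext("unittitle"))),
        (subject, DCT + "creator", BASE + "agent/" + creator),
    }


def match(store, s=None, p=None, o=None):
    """Steps 7 and 9: a toy pattern match standing in for a SPARQL query."""
    return sorted(
        t for t in store
        if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])
    )


store = ead_to_triples(EAD)              # the in-memory "triple store"
titles = match(store, p=DCT + "title")   # every title statement in the store
```

Notice that only three elements are carried over, which echoes step 2: the output is a model of the things you care about, not a copy of the whole record. In practice the set would be replaced by a real triple store and `match` by a SPARQL endpoint.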

There are a number of challenges in the process. Some of them are listed below, and some of them have been alluded to above:

  • Create useful LOD, meaning, create LOD that links to other LOD. LOD does not live in a world by itself. Remember, the “L” stands for “linked”. For example, try to include URIs that are the URIs used in other LOD data sets. Sometimes this is not possible, for example, with the names of people in archival materials. When possible, they used VIAF, but other times they needed to create their own URI denoting an individual.
  • There is a level of rigor involved in creating the data model, and there may be many discussions regarding semantics. For example, what is a creator? Or, when is a term intended to be an index term as opposed to a reference? When does one term in one vocabulary equal a different term in a different vocabulary?
  • Balance the creation of your own vocabulary with the need to speak the language of others using their vocabulary.
  • Consider “fixing” the data as it comes in or goes out because it might be neither consistent nor thorough.
  • Provenance is an issue. People — especially scholars — will want to know where the LOD came from and whether or not it is authoritative. How to solve or address this problem? The jury is still out on this one.
  • Creating and maintaining LOD is difficult because it requires the skills of a number of different types of people. Computer programmers. Database designers. Subject experts. Metadata specialists. Archivists. Etc. A team is all but necessary.
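
The first challenge, reusing somebody else's URI when you can and minting your own when you cannot, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how LOCAH or Linking Lives did it. The lookup table, the `http://example.org/` addresses, and the function names are all hypothetical. In particular, the "external" URI below is a placeholder standing in for something like a VIAF URI, not a real authority record.

```python
import re

# Hypothetical table of names already matched to an external authority.
# The URI is a placeholder standing in for, say, a VIAF URI.
KNOWN_AUTHORITY_URIS = {
    "example person": "http://example.org/external-authority/0000",
}

LOCAL_BASE = "http://example.org/id/person/"  # hypothetical local namespace


def normalize(name):
    """Lower-case the name and collapse whitespace before looking it up."""
    return " ".join(name.lower().split())


def slug(name):
    """Turn a name into a URI-safe token for a locally minted identifier."""
    return re.sub(r"[^a-z0-9]+", "-", normalize(name)).strip("-")


def person_uri(name):
    """Prefer a shared external URI; otherwise mint a local one.

    Returns the URI plus a label recording where it came from, which is
    one small way of keeping provenance alongside the data.
    """
    key = normalize(name)
    if key in KNOWN_AUTHORITY_URIS:
        return KNOWN_AUTHORITY_URIS[key], "external"
    return LOCAL_BASE + slug(name), "local"


linked = person_uri("  Example   Person ")
minted = person_uri("Another, Name")
```

Returning the source label with the URI is a design choice. It keeps the "did we link or did we mint?" question answerable later, which matters when scholars ask where a statement came from.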

I asked about Google and its indexing abilities. Both Jane and Ade expressed appreciation for Google, but they also thought there was room for improvement. If there weren’t then things like would not have been manifested. I also asked what they thought success might look like in a project like LiAM’s, and they said that maybe their XSL stylesheet could be made applicable to wider sets of EAD files, and thus be taken to another level.

Thanks go to Jane and Ade. Their experience was truly beneficial. “Thank you!”
