Saturday, 25 of October of 2014

Initial pile of RDF

I have created an initial pile of RDF, mostly.

I am in the process of experimenting with linked data for archives. My goal is to use existing (EAD and MARC) metadata to create RDF/XML, and then to expose this RDF/XML using linked data principles. Once I get that far I hope to slurp up the RDF/XML into a triple store, analyse the data, and learn how the whole process could be improved.

This is what I have done to date:

  • accumulated sets of EAD files and MARC records
  • identified and cached a few XSL stylesheets transforming EAD and MARCXML into RDF/XML
  • wrote a couple of Perl scripts that combine Bullet #1 and Bullet #2 to create HTML and RDF/XML
  • wrote a mod_perl module implementing rudimentary content negotiation
  • made the whole thing (scripts, sets of data, HTML, RDF/XML, etc.) available on the Web

You can see the fruits of these labors at http://infomotions.com/sandbox/liam/, and there you will find a few directories:

  • bin – my Perl scripts live here as well as a couple of support files
  • data – full of RDF/XML files — about 4,000 of them
  • etc – mostly stylesheets
  • id – a placeholder for the URIs and content negotiation
  • lib – where the actual content negotiation script lives
  • pages – HTML versions of the original metadata
  • src – a cache for my original metadata
  • tmp – things of brief importance; mostly trash

My Perl scripts read the metadata, create HTML and RDF/XML, and save the result in the pages and data directories, respectively. A person can browse these directories, but browsing will be difficult because there is nothing there except cryptic file names. Selecting any of the files should return valid HTML or RDF/XML.

Each cryptic name is the leaf of a URI prefixed with “http://infomotions.com/sandbox/liam/id/”. For example, if the leaf is “mshm510”, then the combined prefix and leaf form a resolvable URI: http://infomotions.com/sandbox/liam/id/mshm510. When the user-agent says it can accept text/html, the HTTP server redirects the user-agent to http://infomotions.com/sandbox/liam/pages/mshm510.html. If the user-agent does not request a text/html representation, then the RDF/XML version is returned: http://infomotions.com/sandbox/liam/data/mshm510.rdf. This is rudimentary content negotiation. Here are a few actionable URIs:

For a good time, feed them to the W3C RDF Validator.
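
The content negotiation described above can be sketched in a few lines. This is not my mod_perl module; it is a minimal Python illustration of the same rule, and the function names (accepts_html, negotiate) and the handling of q-values are assumptions made for the sake of the example. Only the URL patterns come from the description above.

```python
# Illustrative sketch only; the real implementation is a mod_perl module.
BASE = "http://infomotions.com/sandbox/liam"


def accepts_html(accept_header):
    """Return True if the Accept header names text/html with a non-zero q-value.

    Assumption: wildcards such as */* are ignored, so only an explicit
    text/html entry counts as a request for HTML.
    """
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != "text/html":
            continue
        q = 1.0  # a media type without an explicit q parameter defaults to 1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        if q > 0:
            return True
    return False


def negotiate(leaf, accept_header=""):
    """Map a leaf such as 'mshm510' to the URL the user-agent is sent to."""
    if accepts_html(accept_header):
        return f"{BASE}/pages/{leaf}.html"
    # Anything that does not ask for text/html gets the RDF/XML version.
    return f"{BASE}/data/{leaf}.rdf"


if __name__ == "__main__":
    print(negotiate("mshm510", "text/html,application/xhtml+xml"))
    print(negotiate("mshm510", "application/rdf+xml"))
```

A Web browser typically sends an Accept header that includes text/html and therefore lands on the HTML page, while a tool asking only for application/rdf+xml (or sending no Accept header at all) gets the RDF/XML.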

The next step is to figure out how to handle file-not-found errors when a URI does not exist. Another thing to figure out is how to make potential robots aware of the dataset. The bigger problem is simply to make the dataset more meaningful through the inclusion of more URIs in the RDF/XML as well as the use of a more consistent and standardized set of ontologies.
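
One possible way to handle the file-not-found case is to check that the target file exists before redirecting. The following is a hedged Python sketch, not something I have implemented: the function name resolve, the use of a 303 status code for the redirect, and the assumption that every leaf is purely alphanumeric (like mshm510) are all hypothetical choices made for illustration.

```python
import os


def resolve(leaf, wants_html, root):
    """Return (status, path) for a leaf under the given root directory.

    Assumptions: 'root' holds the pages and data directories described
    earlier, redirects use status 303, and leaves are alphanumeric.
    """
    # Reject anything that does not look like a leaf (also blocks '../').
    if not leaf.isalnum():
        return 404, None
    if wants_html:
        relative = os.path.join("pages", leaf + ".html")
    else:
        relative = os.path.join("data", leaf + ".rdf")
    # Redirect only when the representation actually exists on disk.
    if os.path.isfile(os.path.join(root, relative)):
        return 303, "/" + relative.replace(os.sep, "/")
    return 404, None
```

With a check like this, a request for a URI whose leaf has no corresponding file gets a plain 404 instead of a redirect to a missing page.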

Fun with linked data?

