Friday, 19 December 2014

Publishing linked data by way of EAD files

[This blog posting comes from a draft of the Linked Archival Metadata: A Guidebook --ELM ]

If you have used EAD to describe your collections, then you can easily make your descriptions available as valid linked data, but the result will be less than optimal. This is true not because of a lack of technology but rather because of the inherent purpose and structure of EAD files.

A few years ago an organisation in the United Kingdom called the Archives Hub was funded by a granting agency called JISC to explore the publishing of archival descriptions as linked data. One of the outcomes of this effort was the creation of an XSL stylesheet transforming EAD into RDF/XML. The terms used in the stylesheet originate from quite a number of standardized, widely accepted ontologies, and with only the tiniest bit of configuration/customization the stylesheet can transform a generic EAD file into valid RDF/XML. The resulting XML files can then be made available on a Web server or incorporated into a triple store. This goes a long way toward publishing archival descriptions as linked data. The only additional things needed are a transformation of EAD into HTML and the configuration of a Web server to do content-negotiation between the XML and HTML.
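
Under the hood, the work is a single XSLT pass. Below is a minimal sketch, in Perl, of what such a transformation might look like using XML::LibXML and XML::LibXSLT; the file names (collection.xml and ead2rdf.xsl) are placeholders for your own EAD file and your local copy of the Archives Hub stylesheet, not the stylesheet's actual name.

#!/usr/bin/perl
# a minimal sketch: transform a single EAD file into RDF/XML with an EAD-to-RDF stylesheet;
# the file names below are placeholders
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

my $parser     = XML::LibXML->new;
my $xslt       = XML::LibXSLT->new;
my $source     = $parser->parse_file( 'collection.xml' );        # a generic EAD file
my $stylesheet = $xslt->parse_stylesheet_file( 'ead2rdf.xsl' );  # the EAD-to-RDF/XML stylesheet
my $results    = $stylesheet->transform( $source );

# save the serialized RDF/XML where the Web server can find it
open my $out, '>', 'collection.rdf' or die "Unable to open output: $!";
print {$out} $stylesheet->output_string( $results );
close $out;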

For the smaller archive with only a few hundred EAD files whose content does not change very quickly, this is a simple, feasible, and practical solution to publishing archival descriptions as linked data. With the exception of doing some content-negotiation, this solution does not require any computer technology that is not already being used in archives, and it only requires a few small tweaks to a given workflow:

  1. implement a content-negotiation solution
  2. edit EAD file
  3. transform EAD into RDF/XML
  4. transform EAD into HTML
  5. save the resulting XML and HTML files on a Web server
  6. go to step #2

On the other hand, an EAD file is the combination of a narrative description with a hierarchical inventory list, and this data structure does not lend itself very well to the triples of linked data. For example, EAD headers are full of controlled vocabulary terms, but there is no way to link these terms with specific inventory items. This is because the vocabulary terms are expected to describe the collection as a whole, not individual things. This problem could be overcome if each individual component of the EAD were associated with controlled vocabulary terms, but this would significantly increase the amount of work needed to create the EAD files in the first place.

The common practice of using literals (“strings”) to denote the names of people, places, and things in EAD files would also need to be changed in order to fully realize the vision of linked data. Specifically, it would be necessary for archivists to supplement their EAD files with commonly used URIs denoting subject headings and named authorities. These URIs could be inserted into id attributes throughout an EAD file, and the resulting RDF would be more linkable, but the labor to do so would increase, especially since many of the named authorities will not exist in standardized authority lists.
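
To give a concrete sense of what such supplementing might look like, here is a minimal sketch in Perl using XML::LibXML. The input file name, the hand-built lookup table, and the choice of EAD's authfilenumber and source attributes as the home for the URIs are all assumptions on my part, not a prescription.

#!/usr/bin/perl
# a minimal sketch: decorate persname elements in an EAD file with authority URIs;
# the input file (ead.xml) and the lookup table are hypothetical
use strict;
use warnings;
use XML::LibXML;

# a hand-built mapping from name strings to authority URIs
my %uris = (
    'Jefferson, Thomas, 1743-1826' => 'http://id.loc.gov/authorities/names/n79089957',
);

my $dom = XML::LibXML->load_xml( location => 'ead.xml' );
foreach my $persname ( $dom->findnodes( '//*[local-name() = "persname"]' ) ) {
    my $label = $persname->textContent;
    next unless exists $uris{ $label };
    $persname->setAttribute( 'authfilenumber', $uris{ $label } );
    $persname->setAttribute( 'source', 'lcnaf' );
}
print $dom->toString( 1 );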

Despite these shortcomings, transforming EAD files into some sort of serialized RDF goes a long way towards publishing archival descriptions as linked data. This particular process is a good beginning and outputs valid information, just not information that is as accurate as it could be. The process lends itself to iterative improvements, and outputting something is better than outputting nothing. But this particular process is not for everybody. The archive whose content changes quickly, the archive with copious numbers of collections, or the archive wishing to publish the most accurate linked data possible will probably not want to use EAD files as the root of its publishing system. Instead, some sort of database application is probably the better solution.


Initial pile of RDF

I have created an initial pile of RDF, mostly.

I am in the process of experimenting with linked data for archives. My goal is to use existing (EAD and MARC) metadata to create RDF/XML, and then to expose this RDF/XML using linked data principles. Once I get that far I hope to slurp up the RDF/XML into a triple store, analyse the data, and learn how the whole process could be improved.

This is what I have done to date:

  • accumulated sets of EAD files and MARC records
  • identified and cached a few XSL stylesheets transforming EAD and MARCXML into RDF/XML
  • wrote a couple of Perl scripts that combine Bullet #1 and Bullet #2 to create HTML and RDF/XML
  • wrote a mod_perl module implementing rudimentary content negotiation
  • made the whole thing (scripts, sets of data, HTML, RDF/XML, etc.) available on the Web

You can see the fruits of these labors at http://infomotions.com/sandbox/liam/, and there you will find a few directories:

  • bin – my Perl scripts live here as well as a couple of support files
  • data – full of RDF/XML files — about 4,000 of them
  • etc – mostly stylesheets
  • id – a placeholder for the URIs and content negotiation
  • lib – where the actual content negotiation script lives
  • pages – HTML versions of the original metadata
  • src – a cache for my original metadata
  • tmp – things of brief importance; mostly trash

My Perl scripts read the metadata, create HTML and RDF/XML, and save the result in the pages and data directories, respectively. A person can browse these directories, but browsing will be difficult because there is nothing there except cryptic file names. Selecting any of the files should return valid HTML or RDF/XML.

Each cryptic name is the leaf of a URI prefixed with “http://infomotions.com/sandbox/liam/id/”. For example, if the leaf is “mshm510”, then the combined prefix and leaf form a resolvable URI — http://infomotions.com/sandbox/liam/id/mshm510. When a user-agent says it can accept text/html, the HTTP server redirects the user-agent to http://infomotions.com/sandbox/liam/pages/mshm510.html. If the user-agent does not request a text/html representation, then the RDF/XML version is returned — http://infomotions.com/sandbox/liam/data/mshm510.rdf. This is rudimentary content negotiation.

For a good time, resolve a URI like the one above and feed the resulting RDF/XML to the W3C RDF Validator.
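
For the curious, the content negotiation need not be any more complicated than the mod_perl sketch below. It is only an approximation of the module living in the lib directory; the package name is made up, and the URL patterns simply echo the directories described above.

package LiAM::Negotiate;

# a minimal sketch of rudimentary content negotiation under mod_perl 2;
# the package name is hypothetical
use strict;
use warnings;
use Apache2::RequestRec ();
use Apache2::Const -compile => qw( REDIRECT NOT_FOUND );
use APR::Table ();

sub handler {
    my $r = shift;

    # pluck the leaf (e.g. "mshm510") from a URI like .../sandbox/liam/id/mshm510
    my ( $leaf ) = $r->uri =~ m{/id/([^/]+)$};
    return Apache2::Const::NOT_FOUND unless $leaf;

    # HTML for user-agents that ask for it, RDF/XML for everybody else
    my $accept   = $r->headers_in->get( 'Accept' ) || '';
    my $location = ( $accept =~ m{text/html} )
        ? "http://infomotions.com/sandbox/liam/pages/$leaf.html"
        : "http://infomotions.com/sandbox/liam/data/$leaf.rdf";

    $r->headers_out->set( Location => $location );
    return Apache2::Const::REDIRECT;
}

1;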

The next step is to figure out how to handle file-not-found errors when a URI does not exist. Another thing to figure out is how to make potential robots aware of the data set. The bigger problem is simply to make the dataset more meaningful through the inclusion of more URIs in the RDF/XML as well as the use of a more consistent and standardized set of ontologies.

Fun with linked data?


Illustrating RDF

I have had some success converting EAD and MARC into RDF/XML, and consequently I am able to literally illustrate the resulting RDF triples.

I have acquired sets of EAD files and MARC records of an archival nature. When it comes to EAD files I am able to convert them into RDF/XML with a stylesheet from the Archives Hub. I then fed the resulting RDF/XML to the W3C RDF Validation Service and literally got an illustration of the RDF, below:

[RDF graph illustration: hou00096.xml → hou00096.rdf → illustration]

Transforming MARC into RDF was a bit more complicated. I first converted a raw MARC record into MARCXML with a Perl module called MARC::File::XML. I then transformed the result into MODS with MARC21slim2MODS3.xsl, and finally into RDF/XML with mods2rdf.xslt. Again, I validated the results and got the following illustration:

[RDF graph illustration: 003078076.marc → 003078076.xml → 003078076.mods → 003078076.rdf → illustration]
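
Strung together in Perl, the whole pipeline might look something like the sketch below. It assumes local copies of the two stylesheets named above (and whatever they include) and uses XML::LibXSLT as the transformation engine, which is an assumption on my part; the input file name is taken from the example above.

#!/usr/bin/perl
# a minimal sketch of the MARC -> MARCXML -> MODS -> RDF/XML pipeline;
# the stylesheet locations are placeholders
use strict;
use warnings;
use MARC::Batch;
use MARC::File::XML ( BinaryEncoding => 'utf8' );
use XML::LibXML;
use XML::LibXSLT;

# step #1: read the raw MARC record and convert it to MARCXML
my $batch   = MARC::Batch->new( 'USMARC', '003078076.marc' );
my $record  = $batch->next or die 'No MARC record found';
my $marcxml = $record->as_xml_record;

# step #2 and step #3: MARCXML to MODS, and then MODS to RDF/XML
my $parser  = XML::LibXML->new;
my $xslt    = XML::LibXSLT->new;
my $to_mods = $xslt->parse_stylesheet_file( 'MARC21slim2MODS3.xsl' );
my $to_rdf  = $xslt->parse_stylesheet_file( 'mods2rdf.xslt' );

my $mods = $to_mods->transform( $parser->parse_string( $marcxml ) );
my $rdf  = $to_rdf->transform( $mods );

print $to_rdf->output_string( $rdf );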

The resulting images are huge, and the astute/diligent reader will see a preponderance of literals in the results. This is not a good thing, but it is all that is available right now.

On the other hand, the same astute/diligent reader will see the root of the RDF/XML pointing to a meaningful URI. This URI will be resolvable in the near future via content negotiation. This is a simple first step. The next step will be to apply this process to an entire collection of EAD files and MARC records. After that, two other things can happen: 1) the original metadata files can begin to include URIs, and 2) the XSL used to process the metadata can employ a more standardized ontology. It is not an easy process, but it is a beginning.

Right now, something is better than nothing.


Transforming MARC to RDF

I hope somebody can give me some advice for transforming MARC to RDF.

I am in the midst of writing a book describing the benefits of linked data for archives. Archival metadata usually comes in two flavors: EAD and MARC. I found a nifty XSL stylesheet from the Archives Hub (that’s in the United Kingdom) transforming EAD to RDF/XML. With a bit of customization I think it could be used quite well by just about anybody with EAD files. I have saved a resulting RDF/XML file online.

Converting MARC to RDF has been more problematic. There are various tools enabling me to convert my original MARC into MARCXML and/or MODS. After that I can reportedly use a few tools to convert to RDF:

  • MARC21slim2RDFDC.xsl – functions, but even for my tastes the resulting RDF is too vanilla.
  • modsrdf.xsl – optimal, but when I use my transformation engine (Saxon), I do not get XML but rather plain text
  • BIBFRAME Tools – sports nice ontologies, but the online tools won’t scale for large operations

In short, I have discovered nothing that is “easy-to-use”. Can you provide me with any other links allowing me to convert MARC to serialized RDF?


Simple linked data recipe for libraries, museums, and archives

Participating in the Semantic Web and providing content via the principles of linked data is not “rocket surgery”, especially for cultural heritage institutions — libraries, archives, and museums. Here is a simple recipe for their participation:

  1. use existing metadata standards (MARC, EAD, etc.) to describe collections
  2. use any number of existing tools to convert the metadata to HTML, and save the HTML on a Web server
  3. use any number of existing tools to convert the metadata to RDF/XML (or some other “serialization” of RDF), and save the RDF/XML on a Web server (see the sketch after this list)
  4. rest, congratulate yourself, and share your experience with others in your domain
  5. after the first time through, go back to Step #1, but this time work with other people inside your domain, making sure you use as many of the same URIs as possible
  6. after the second time through, go back to Step #1, but this time supplement access to your linked data with a triple store, thus supporting search
  7. after the third time through, go back to Step #1, but this time use any number of existing tools to expose the content in your other information systems (relational databases, OAI-PMH data repositories, etc.)
  8. for dessert, cogitate ways to exploit the linked data in your domain to discover new and additional relationships between URIs, and thus make the Semantic Web more of a reality
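
Steps #2 and #3 are where the computing happens. Below is a minimal sketch, in Perl, of a batch job that does both at once for a directory full of EAD files; the directory layout and the stylesheet names (ead2html.xsl and ead2rdf.xsl) are placeholders for whatever tools you choose.

#!/usr/bin/perl
# a minimal sketch of Steps #2 and #3: transform every EAD file in a directory into
# both HTML and RDF/XML; the directory and stylesheet names are placeholders
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

my $parser  = XML::LibXML->new;
my $xslt    = XML::LibXSLT->new;
my $to_html = $xslt->parse_stylesheet_file( 'ead2html.xsl' );
my $to_rdf  = $xslt->parse_stylesheet_file( 'ead2rdf.xsl' );

foreach my $file ( glob 'ead/*.xml' ) {
    my ( $leaf ) = $file =~ m{([^/]+)\.xml$};
    my $ead      = $parser->parse_file( $file );

    # save the HTML and the RDF/XML where the Web server can find them
    save( "pages/$leaf.html", $to_html, $to_html->transform( $ead ) );
    save( "data/$leaf.rdf",   $to_rdf,  $to_rdf->transform( $ead ) );
}

sub save {
    my ( $filename, $stylesheet, $results ) = @_;
    open my $out, '>', $filename or die "Unable to open $filename: $!";
    print {$out} $stylesheet->output_string( $results );
    close $out;
}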

I am in the process of writing a guidebook on the topic of linked data and archives. In the guidebook I will elaborate on this recipe and provide instructions for its implementation.


OAI2LOD

The other day I discovered a slightly dated application called OAI2LOD, and I think it works quite nicely. Its purpose? To expose OAI data repositories as linked open data. Installation was all but painless. Download the source from GitHub. Build with ant. Done. Getting it up and running was just as easy. Select a sample configuration file. Edit some values so OAI2LOD knows about your (or maybe somebody else’s) OAI repository. Run OAI2LOD. The result is an HTTP interface to the OAI data repository in the form of linked data. A reader can browse by items or by sets. OAI2LOD supports “content negotiation” so it will return RDF when requested. It also supports a SPARQL endpoint. The only flaw I found with the tool is its inability to serve more than one data repository at a time. For a limited period of time, I’ve made one of my OAI repositories available for perusing. Enjoy.


RDF triple stores

Less than a week ago I spent a lot of time trying to install and configure a few RDF triple stores, with varying degrees of success:

  • OpenRDF – This was the easiest system to install so far. First, identify, download, install, configure, and turn on a Java servlet container. I used Tomcat. Then copy the OpenRDF .war files into the webapps directory. Restart the servlet container. Use your Web browser to connect to OpenRDF. Using the Web interface I was able to import RDF/XML and then query it.
  • Virtuoso Open-Source Edition – This took a long time to compile, but it seemingly compiled flawlessly. I have yet to actually install it and give it a whirl.
  • 4store – Many of my colleagues suggested this application, but I had one heck of a time installing it on my Linux host. When I finally finished installation I was able to fill the store with triples from the command-line and then query it from an HTTP interface.
  • Jena/Fuseki – After all the time I spent on the previous applications, I ran out of energy. Installing Jena/Fuseki is still on my to-do list.
  • D2R Server – I should probably give this one a whirl too.

What’s the point? Well, quite frankly, I’m not sure yet. All of these “stores” are really databases of RDF triples. Once the triples are in the store, a person can submit queries to find them. The query language used is SPARQL, and SPARQL sort of feels like Yet Another Kewl Hack. What is the problem being solved here? The only truly forward-thinking answer I can figure out is the implementation of an inference engine used to find relationships that weren’t obvious previously. Granted, my work has just begun and I’m more ignorant than anything else.
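
For a taste of what a SPARQL query looks like in practice, here is a hedged sketch that sends a trivial query to an HTTP endpoint and prints whatever comes back. The endpoint URL is made up; each of the stores above publishes its endpoint at a different address, so substitute your own.

#!/usr/bin/perl
# a minimal sketch: send a trivial SPARQL query to an HTTP endpoint and print the response;
# the endpoint URL is hypothetical
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $endpoint = 'http://localhost:8080/sparql';
my $sparql   = 'SELECT ?subject ?predicate ?object WHERE { ?subject ?predicate ?object } LIMIT 10';

my $uri = URI->new( $endpoint );
$uri->query_form( query => $sparql );

my $ua       = LWP::UserAgent->new;
my $response = $ua->get( $uri, 'Accept' => 'application/sparql-results+xml' );
die $response->status_line unless $response->is_success;

print $response->decoded_content;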

Please enlighten me?


Initialized a list of tools in the LiAM Guidebook, plus other stuff

Added lists of tools to the LiAM Guidebook. [0, 1] Added a Filemaker file to the repository. The file keeps track of my tools. Added tab2text.pl, which formats the output of my Filemaker data into plain text. Added a very rudimentary PDF document — a version of the Guidebook designed for human consumption. For a good time, added an RDF output of my Zotero database.

“Librarians love lists.”

[0] About the LiAM Guidebook – http://sites.tufts.edu/liam/
[1] Git repository of Guidebook – https://github.com/liamproject


Guidebook moved to liamproject

The Github staging location for the Linked Archival Metadata (LiAM) Guidebook has been moved to liamproject — https://github.com/liamproject/.

We now return you to your regularly scheduled programming.


What is Linked Data and why should I care?

[Image: peonies, eye candy by Eric]

Linked Data is a process for manifesting the ideas behind the Semantic Web. The Semantic Web is about encoding data, information, and knowledge in computer-readable fashions, making these encodings accessible on the World Wide Web, allowing computers to crawl the encodings, and finally, employing reasoning engines against them for the purpose of discovering and creating new knowledge. The canonical article describing this concept was written by Tim Berners-Lee, James Hendler, and Ora Lassila in 2001.

In 2006 Berners-Lee more concretely described how to make the Semantic Web a reality in a text called “Linked Data — Design Issues”. In it he outlined four often-quoted expectations for implementing the Semantic Web. Each of these expectations is listed below along with some elaboration:

  1. “Use URIs as names for things” – URIs (Uniform Resource Identifiers) are unique identifiers, and they are expected to have the same shape as URLs (Uniform Resource Locators). These identifiers are expected to represent things such as people, places, institutions, concepts, books, etc. URIs are monikers or handles for real-world or imaginary objects.
  2. “Use HTTP URIs so that people can look up those names.” – The URIs are expected to look like URLs and, ideally, to function on the World Wide Web through the Hypertext Transfer Protocol (HTTP), meaning the URIs point to things on Web servers.
  3. “When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)” – When URIs are sent to Web servers by Web browsers (or “user-agents” in HTTP parlance), the response from the server should be in a conventional, computer-readable format. This format is usually a version of RDF (Resource Description Framework) — a notation looking much like a rudimentary sentence composed of a subject, a predicate, and an object.
  4. “Include links to other URIs. So that they can discover more things.” – Simply put, try very hard to use URIs that other people have used. This way the relationships you create can literally be linked to the relationships other people have created. These links may represent new knowledge.

In the same text (“Linked Data — Design Issues”) Berners-Lee also outlined a sort of reward system — sets of stars — for levels of implementation. Unfortunately, nobody seems to have taken up the stars very seriously. A person gets:

  • 1 star for making data available on the web (whatever format) but with an open licence, to be Open Data
  • 2 stars for making the data machine-readable structured data (e.g. excel instead of image scan of a table)
  • 3 stars for making the data available in non-proprietary format (e.g. CSV instead of excel)
  • 4 stars for using open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff
  • 5 stars for linking your data to other people’s data to provide context

The whole idea works like this. Suppose I assert the following statement:

The Declaration Of Independence was authored by Thomas Jefferson.

This statement can be divided into three parts. The first part is a subject (Declaration Of Independence). The second part is a predicate (was authored by). The third part is an object (Thomas Jefferson). In the language of the Semantic Web and Linked Data, these combined parts are called a triple, and they are expected to denote a fact. Triples are the heart of RDF.

Suppose further that the subject and object of the triple are identified using URIs (as in Expectations #1 and #2, above). This would turn our assertion into something like this with carriage returns added for readability:

http://www.archives.gov/exhibits/charters/declaration_transcript.html
was authored by
http://www.worldcat.org/identities/lccn-n79-89957

Unfortunately, this assertion is not easily read by a computer. Believe it or not, something like the XML below is much more amenable, and if it were the sort of content returned by a Web server to a user-agent, then it would satisfy Expectations #3 and #4 because the notation is standardized and because it points to other people’s content:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.archives.gov/exhibits/charters/declaration_transcript.html">
    <dc:creator rdf:resource="http://www.worldcat.org/identities/lccn-n79-89957"/>
  </rdf:Description>
</rdf:RDF>

Suppose we had a second assertion:

Thomas Jefferson was a man.

In this case, the subject is “Thomas Jefferson”. The predicate is “was”. The object is “man”. This assertion can be expressed in a more computer-readable fashion like this:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://www.worldcat.org/identities/lccn-n79-89957">
    <foaf:gender>male</foaf:gender>
  </rdf:Description>
</rdf:RDF>

Looking at the two assertions, a reasonable person can deduce a third assertion, namely, the Declaration Of Independence was authored by a man. Which brings us back to the point of the Semantic Web and Linked Data. If everybody uses URIs (read “URLs”) to describe things, if everybody denotes relationships (through the use of predicates) between URIs, if everybody makes their data available on the Web in standardized formats, and if everybody uses similar URIs, then new knowledge can be deduced from the original relationships.

Unfortunately, to date, too little Linked Data has been made available, and too few people have earned enough stars, to really make the Semantic Web a reality. The purpose of this guidebook is to provide the means for archivists to do their part: to make their content available on the Semantic Web through Linked Data, all in the hopes of facilitating the discovery of new knowledge. On our mark. Get set. Go!