Monday, 22 September 2014

Linked Archival Metadata: A Guidebook (version 0.99)

I have created and made available version 0.99 of Linked Archival Metadata: A Guidebook. It is distributed here in two flavors: PDF and ePub (just because I can). From the Executive Summary:

Linked data is a process for embedding the descriptive information of archives into the very fabric of the Web. By transforming archival description into linked data, an archivist will enable other people as well as computers to read and use their archival description, even if the others are not a part of the archival community. The process goes both ways. Linked data also empowers archivists to use and incorporate the information of other linked data providers into their local description. This enables archivists to make their descriptions more thorough, more complete, and more value-added. For example, archival collections could be automatically supplemented with geographic coordinates in order to make maps, images of people or additional biographic descriptions to make collections come alive, or bibliographies for further reading.

Publishing and using linked data does not represent a change in the definition of archival description, but it does represent an evolution of how archival description is accomplished. For example, linked data is not about generating a document such as an EAD file. Instead it is about asserting sets of statements about an archival thing, and then allowing those statements to be brought together in any number of ways for any number of purposes. A finding aid is one such purpose. Indexing is another purpose. Use by a digital humanist is yet another purpose. While EAD files are encoded as XML documents and are therefore quite computer readable, the reader must know the structure of EAD in order to make the most out of the data. EAD is archives-centric. The way data is manifested in linked data is domain-agnostic.

The objectives of archives include collection, organization, preservation, description, and oftentimes access to unique materials. Linked data is about description and access. By taking advantage of linked data principles, archives will be able to improve their descriptions and increase access. This will require a shift in the way things get done but not in what gets done. The goal remains the same.

Many tools already exist for transforming data in existing formats into linked data. This data can reside in Excel spreadsheets, database applications, MARC records, or EAD files. There are tiers of linked data publishing, so one does not have to do everything all at once. But to transform existing information or to maintain information over the long haul requires the skills of many people: archivists & content specialists, administrators & managers, metadata specialists & catalogers, computer programmers & systems administrators.

Moving forward with linked data is a lot like traveling to Rome. There are many ways to get there, and there are many things to do once you arrive, but the result will undoubtedly improve your ability to participate in the discussion of the human condition on a worldwide scale.

Thank-yous go to all the people who provided feedback along the way. Thanks!


Trends and gaps in linked data for archives

“A funny thing happened on the way to the forum.”

Two travelogues

Two recent professional meetings have taught me that — when creating some sort of information service — linked data will reside alongside and be mixed with data collected from any number of Internet sites. Linked data interfaces will coexist with REST-ful interfaces, or even things as rudimentary as FTP. To the archivist, this means linked data is not the be-all and end-all of information publishing. There is no such thing. To the application programmer, this means you will need to have experience with an ever-growing number of Internet protocols. To both it means, “There is more than one way to get there.”

In October of 2013 I had the opportunity to attend the Semantic Web In Libraries conference. It was a three-day event attended by approximately three hundred people who could roughly be divided into two equally sized groups: computer scientists and cultural heritage institution employees. The bulk of the presentations fell into two categories: 1) publishing linked data, and 2) creating information services. The publishers talked about ontologies, human-computer interfaces for data creation/maintenance, and systems exposing RDF to the wider world. The people creating information services were invariably collecting, homogenizing, and adding value to data gathered from a diverse set of information services. These information services were not limited to sets of linked data. They also included services accessible via REST-ful computing techniques, OAI-PMH interfaces, and there were probably a few locally developed file transfers or relational database dumps described as well. These people were creating lists of information services, regularly harvesting content from the services, writing cross-walks, locally storing the content, indexing it, providing services against the result, and sometimes republishing any number of “stories” based on the data. For the second group of people, linked data was certainly not the only game in town.

In February of 2014 I had the opportunity to attend a hackathon called GLAM Hack Philly. A wide variety of data sets were presented for “hacking” against. Some were TEI files describing Icelandic manuscripts. Some was linked data published by the British Museum. Some was XML describing digitized journals created by a vendor-based application. Some of it resided in proprietary database applications describing the location of houses in Philadelphia. Some of it had little or no computer-readable structure and described plants. Some of it was the wiki mark-up for local municipalities. After the attendees (there were about two dozen of us) learned about each of the data sets, we self-selected and hacked away at projects of our own design. The results fell into roughly three categories: geo-referencing objects, creating searchable/browsable interfaces, and data enhancement. With the exception of the hack repurposing journal content into visual art, the results were pretty typical for cultural heritage institutions. But what fascinated me was the way we hackers selected our data sets. Namely, the more complete and well-structured the data, the more hackers gravitated towards it. Of all the data sets, the TEI files were the most complete, accurate, and computer-readable. Three or four projects were done against the TEI. (Heck, I even hacked on the TEI files.) The linked data from the British Museum — very well structured but not quite as thorough as the TEI — attracted a large number of hackers who worked together for a common goal. All the other data sets had only one or two people working on them. What is the moral to the story? There are two of them. First, archivists, if you want people to process your data and do “kewl” things against it, then make sure the data is thorough, complete, and computer-readable. Second, computer programmers, you will need to know a variety of data formats. Linked data is not the only game in town.

The technologies described in this Guidebook are not the only way to accomplish the goals of archivists wishing to make their content more accessible. Instead, linked data is just one of many protocols in the toolbox. It is open, standards-based, and simpler rather than more complex. On the other hand, other protocols exist, each with a different set of strengths and weaknesses. Computer technologists will need to have a larger rather than smaller knowledge of various Internet tools. For archivists, the core of the problem is still the collection and description of content. This — the what of archival practice — remains constant. It is the how of archival practice — the technology — that changes at a much faster pace.

With great interest I read the Spring/Summer issue of Information Standards Quarterly entitled “Linked Data in Libraries, Archives, and Museums” where there were a number of articles pertaining to linked data in cultural heritage institutions. Of particular interest to me were the loosely enumerated challenges of linked data. Some of them included:

  • the apparent Tower of Babel when it comes to vocabularies used to describe content, coupled with the need for “ontology mindfulness”
  • dirty or inconsistent data and wide variations in data integrity
  • persistent URIs
  • the “chicken & egg” problem of justifying linked data when there is no killer application

There are a number of challenges in the linked data process. Some of them are listed below, and some of them have been alluded to previously:

  • Create useful linked data, meaning, create linked data that links to other linked data. Linked data does not live in a world by itself. Remember, the “l” stands for “linked”. For example, try to include URIs that are used in other linked data sets. Sometimes this is not possible, as with the names of people in archival materials; when possible use a shared authority such as VIAF, but other times you will need to create your own URI denoting an individual.
  • There is a level of rigor involved in creating the data model, and there may be many discussions regarding semantics. For example, what is a creator? When is a term intended to be an index term as opposed to a reference? When does one term in one vocabulary equal a different term in a different vocabulary? Balance the creation of your own vocabulary with the need to speak the language of others using their vocabulary.
  • Consider “fixing” the data as it comes in or goes out because it might not be consistent nor thorough.
  • Provenance is an issue. People — especially scholars — will want to know where the linked data came from and whether or not it is authoritative. How to solve or address this problem? The jury is still out on this one.
  • Creating and maintaining linked data is difficult because it requires the skills of a number of different types of people: computer programmers, database designers, subject experts, metadata specialists, archivists, etc. A team is all but necessary.

Linked data represents a modern way of making your archival descriptions accessible to the wider world. In that light, it represents a different way of doing things but not necessarily a different what of doing things. You will still be doing inventory. You will still be curating collections. You will still be prioritizing what goes and what stays.

Gaps

Linked data makes a lot of sense, but there are some personnel and technological gaps needing to be filled before it can truly be widely adopted by archives (or libraries or museums). They include but are not limited to: hands-on training, “string2URI” tools, database-to-RDF interfaces, mass RDF editors, and maybe “killer applications”.

Hands-on training

Different people learn in different ways, and hands-on training on what linked data is and how it can be put into practice would go a long way towards the adoption of linked data in archives. These hands-on sessions could be as short as an hour or as long as one or two days. They would include a mixture of conceptual and technological topics. For example, there could be a tutorial on how to search RDF triple stores using SPARQL. Another tutorial would compare & contrast the data models of databases with the RDF data model. A class could be facilitated on how to transform XML files (MARCXML, MODS, EAD) to any number of RDF serializations and publish the result on a Web server. There could be a class on how to design URIs. A class on how to literally draw an RDF ontology would be a good idea. Another class would instruct people on how to formally read & write an ontology using OWL. Yet another hands-on workshop would demonstrate to participants the techniques for creating, maintaining, and publishing an RDF triple store. Etc. Linked data might be a “good thing”, but people are going to need to learn how to work more directly with it. These hands-on trainings could be aligned with hack-a-thons, hack-fests, or THATCamps so a mixture of archivists, metadata specialists, and computer programmers would be in the same spaces at the same times.
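
For instance, a first SPARQL exercise might look like the following minimal sketch, written in Python and assuming the rdflib library is installed; the file name and the query are hypothetical examples, not anything prescribed by the Guidebook.

  # A first SPARQL exercise: load a local RDF file and list a few titles.
  # Assumes Python 3 and rdflib (pip install rdflib); "description.ttl" is a
  # hypothetical file name.
  from rdflib import Graph

  g = Graph()
  g.parse("description.ttl", format="turtle")

  query = """
      PREFIX dcterms: <http://purl.org/dc/terms/>
      SELECT ?thing ?title
      WHERE { ?thing dcterms:title ?title }
      LIMIT 10
  """

  for thing, title in g.query(query):
      print(thing, title)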

string2URI

There is a need for tools enabling people and computers to automatically associate string literals with URIs. If nobody (or relatively few people) share URIs across their published linked data, then the promises of linked data won’t come to fruition. Archivists (and librarians and people who work in museums) take things like controlled vocabularies and name authority lists very seriously. Identifying the “best” URI for a given thing, subject term, or personal name is something the profession is going to want to do and do well.

Fabian Steeg and Pascal Christoph at the 2013 Semantic Web in Libraries conference asked the question, “How can we benefit from linked data without being linked data experts?” Their solution was the creation of a set of tools enabling people to query a remote service and get back a list of URIs which were automatically inserted into a text. This is an example of a “string2URI” tool that needs to be written and widely adopted. These tools could be as simple as a one-box, one-button interface where a person enters a word or phrase and one or more URIs are returned for selection. A slightly more complicated version would include a drop-down menu allowing the person to select places to query for the URI. Another application suggested by quite a number of people would use natural language processing to first extract named entities (people, places, things, etc.) from texts (like abstracts, scope notes, biographical histories, etc.). Once these entities were extracted, they would then be fed to string2URI. The LC Linked Data Service, VIAF, and WorldCat are very good examples of string2URI tools. The profession needs more of them. SNAC’s use of EAC-CPF is something to watch in this space.
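
Here is a minimal sketch of what the simplest string2URI tool might look like, written in Python against VIAF’s AutoSuggest service; the endpoint and its JSON field names (“result”, “term”, “viafid”) are recalled from memory and should be verified against current VIAF documentation before being relied upon.

  # A minimal "string2URI" sketch: given a personal name, ask VIAF for
  # candidate identifiers and return (label, URI) pairs for a human to choose
  # from. The AutoSuggest endpoint and its JSON field names are assumptions
  # to be verified against current VIAF documentation.
  import json
  import urllib.parse
  import urllib.request

  def string2uri(name):
      url = "https://viaf.org/viaf/AutoSuggest?query=" + urllib.parse.quote(name)
      with urllib.request.urlopen(url) as response:
          data = json.loads(response.read().decode("utf-8"))
      candidates = data.get("result") or []
      return [(c["term"], "http://viaf.org/viaf/" + c["viafid"]) for c in candidates]

  for label, uri in string2uri("Mark Twain"):
      print(label, uri)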

Database to RDF publishing systems

There are distinct advantages and disadvantages to the current ways of creating and maintaining the descriptions of archival collections. They fill a particular purpose and function. Nobody is going to suddenly abandon well-known techniques for ones seemingly unproven. Consequently, there is a need to easily migrate existing data to RDF. One way towards this goal is to transform or export archival descriptions from their current containers to RDF. D2RQ could go a long way towards publishing the underlying databases of PastPerfect, Archon, Archivist’s Toolkit, or ArchivesSpace as RDF. A seemingly little used database-to-RDF modeling language — R2RML — could be used for similar purposes. These particular solutions are rather generic. Either a great deal of customization needs to be done using D2RQ, or new interfaces to the underlying databases need to be created. Regarding the latter, this will require a large amount of specialized work. An ontology & vocabulary would need to be designed or selected. The data and the structure of the underlying databases would need to be closely examined. A programmer would need to write reports against the database to export RDF and publish it in one form or another. Be forewarned. Software, like archival description, is never done. On the other hand, this sort of work could be done once, shared with the wider archival community, and then applied to local implementations of Archivist’s Toolkit, ArchivesSpace, etc.
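
To give a sense of what such a report might look like, here is a minimal sketch in Python using the rdflib library; the SQLite file, table, and column names are hypothetical stand-ins for whatever schema your collection management system actually uses, and the URI pattern and vocabulary choices are simply illustrative.

  # A minimal sketch of a home-grown database-to-RDF report. The database
  # file, table, and column names are hypothetical; a real script would be
  # written against the actual schema of Archivist's Toolkit, ArchivesSpace, etc.
  import sqlite3
  from rdflib import Graph, Literal, Namespace
  from rdflib.namespace import DCTERMS, RDF

  ARCH = Namespace("http://archive.example.org/id/")  # hypothetical URI pattern

  g = Graph()
  g.bind("dcterms", DCTERMS)

  connection = sqlite3.connect("archive.db")
  for key, title, abstract in connection.execute(
      "SELECT id, title, abstract FROM collections"
  ):
      subject = ARCH[f"collection/{key}"]
      g.add((subject, RDF.type, ARCH.Collection))  # hypothetical local class
      g.add((subject, DCTERMS.title, Literal(title)))
      if abstract:
          g.add((subject, DCTERMS.abstract, Literal(abstract)))

  g.serialize(destination="collections.ttl", format="turtle")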

Mass RDF editors

Archivists curate physical collections as well as descriptions of those collections. Ideally, the descriptions would reside in a triple store as if it were a database. The store would be indexed. Queries could be applied against the store. Create, read, update, and delete operations could be easily done. As RDF is amassed it will almost definitely need to be massaged and improved. URIs may need to be equated. Controlled vocabulary terms may need to be related. Supplementary statements may need to be asserted enhancing the overall value of the store. String literals may need to be normalized or even added. This work will not be done on a one-by-one, statement-by-statement basis. There are simply too many triples — hundreds of thousands, if not millions of them. Some sort of mass RDF editor will need to be created. If the store was well managed, and if a person was well-versed in SPARQL, then much of this work could be done through SPARQL statements. But SPARQL is not for the faint of heart, and despite what some people say, it is not easy to write. Tools will need to be created — akin to the tools described by Diane Hillmann and articulated through the experience with the National Science Foundation Digital Library — making it easy to do large-scale additions and updates to RDF triple stores.
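
As a small illustration of the kind of operation such an editor would perform in bulk, here is a minimal sketch in Python with rdflib that replaces every occurrence of one URI with another throughout a graph; the two URIs and the file names are hypothetical.

  # A minimal "mass edit" sketch: replace a locally minted URI with an
  # equivalent external URI everywhere it appears. The URIs and file names
  # are hypothetical.
  from rdflib import Graph, URIRef

  old = URIRef("http://archive.example.org/id/person/twain")
  new = URIRef("http://viaf.org/viaf/0000000")  # stand-in for a real VIAF URI

  g = Graph()
  g.parse("triples.ttl", format="turtle")

  for s, p, o in list(g):  # copy the triples so the graph can be modified safely
      if s == old or o == old:
          g.remove((s, p, o))
          g.add((new if s == old else s, p, new if o == old else o))

  g.serialize(destination="triples-edited.ttl", format="turtle")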

“Killer” applications

To some degree, the idea of the Semantic Web and linked data has been oversold. We were told, “Make very large sets of RDF freely available and new relationships between resources will be discovered.” The whole thing smacks of artificial intelligence which simultaneously scares people and makes them laugh out loud. On the other hand, a close reading of Allemang’s and Hendler’s book Semantic Web For The Working Ontologist describes exactly how and why these new relationships can be discovered, but these discoveries do take some work and a significant volume of RDF from a diverse set of domains.

So maybe the “killer” application is not so much a sophisticated super-brained inference engine but something less sublime. A number of examples come to mind. Begin by recreating the traditional finding aid. Transform an EAD file into serialized RDF, and from the RDF create a finding aid. This is rather mundane and redundant, but it will demonstrate and support a service model going forward. Go three steps further. First, create a finding aid, but supplement it with data and information from your other finding aids. Your collections do not exist in silos nor isolation. Second, supplement these second-generation finding aids with images, geographic coordinates, links to scholarly articles, or more thorough textual notes all discovered from outside linked data sources. Third, establish relationships between your archival collections and the archival collections outside your institution. Again, relationships could be created between collections and between items in the collections. These sorts of “killer” applications enable the archivist to stretch the definition of finding aid.
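
A minimal sketch of the first of these ideas — regenerating a bare-bones finding aid from RDF — might look like the following; the graph file, the collection URI, and the use of Dublin Core terms are hypothetical choices made for the example.

  # A minimal sketch of rendering a bare-bones finding aid from RDF.
  # The file name, collection URI, and vocabulary are hypothetical.
  from rdflib import Graph, URIRef
  from rdflib.namespace import DCTERMS

  g = Graph()
  g.parse("collections.ttl", format="turtle")

  collection = URIRef("http://archive.example.org/id/collection/42")
  title = g.value(collection, DCTERMS.title) or "Untitled collection"
  abstract = g.value(collection, DCTERMS.abstract) or ""

  parts = sorted(
      str(g.value(part, DCTERMS.title) or part)
      for part in g.objects(collection, DCTERMS.hasPart)
  )

  html = ["<html><body>", f"<h1>{title}</h1>", f"<p>{abstract}</p>", "<ul>"]
  html += [f"<li>{part}</li>" for part in parts]
  html += ["</ul>", "</body></html>"]

  with open("finding-aid.html", "w", encoding="utf-8") as handle:
      handle.write("\n".join(html))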

Another “killer” application may be a sort of union catalog. Each example of the union catalog will have some sort of common domain. The domain could be topical: Catholic studies, civil war history, the papers of a particular individual or organization. Collect the RDF from archives of these domains, put it into a single triple store, clean up and enhance the RDF, index it, and provide a search engine against the index. The domain could be regional. For example, the RDF from an archive, library, and museum of an individual college or university could be created, amalgamated, and presented. The domain could be professional: all archives, all libraries, or all museums.

Another killer application, especially in an academic environment, would be the integration of archival description into course management systems. Manifest archival descriptions as RDF. Manifest course offerings across the academy in the form of RDF. Manifest student and instructor information as RDF. Discover and provide links between archival content and people in specific classes. This sort of application will make archival collections much more relevant to the local population.

Tell stories. Don’t just provide links. People want answers as much as they want lists of references. After search queries are sent to indexes, provide search results in the form of lists of links, but also mash together information from search results into a “named graph” that includes an overview of the apparent subject queried, images of the subject, a depiction of where the subject is located, and a few very relevant and well-respected links to narrative descriptions of the subject. You can see these sorts of enhancements in many Google and Facebook search results.

Support the work of digital humanists. Amass RDF. Clean, normalize, and enhance it. Provide access to it via searchable and browsable interfaces. Provide additional services against the results such as timelines built from the underlying dates found in the RDF. Create word clouds based on statistically significant entities such as names of people or places or themes. Provide access to search results in the form of delimited files so the data can be imported into other tools for more sophisticated analysis. For example, support a search results to Omeka interface. For that matter, create an Omeka to RDF service.
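
For the delimited-file idea in particular, a minimal sketch might look like this, again in Python with rdflib; the graph file and the query are hypothetical.

  # A minimal sketch of exporting SPARQL results as a delimited file so the
  # data can be imported into other analysis tools. The file name and query
  # are hypothetical.
  import csv
  from rdflib import Graph

  g = Graph()
  g.parse("triples.ttl", format="turtle")

  query = """
      PREFIX dcterms: <http://purl.org/dc/terms/>
      SELECT ?thing ?title ?date
      WHERE { ?thing dcterms:title ?title . OPTIONAL { ?thing dcterms:date ?date } }
  """

  with open("results.csv", "w", newline="", encoding="utf-8") as handle:
      writer = csv.writer(handle)
      writer.writerow(["thing", "title", "date"])
      for row in g.query(query):
          writer.writerow([str(value) if value is not None else "" for value in row])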

The “killer” application for linked data is only as far away as your imagination. If you can articulate it, then it can probably be created.

Last word

Linked data changes the way your descriptions get expressed and distributed. It is a lot like taking a trip across the country. The goal was always to get to the coast to see the ocean, but instead of walking, going by stage coach, taking a train, or driving a car, you will be flying. Along the way you may visit a few cities and have a few layovers. Bad weather may even get in the way, but sooner or later you will get to your destination. Take a deep breath. Understand that the process will be one of learning, and that learning will be applicable in other aspects of your work. The result will be two-fold: a greater number of people will have access to your collections, and consequently, more people will be using them.


LiAM Guidebook: Executive summary

Linked data is a process for embedding the descriptive information of archives into the very fabric of the Web. By transforming archival description into linked data, an archivist will enable other people as well as computers to read and use their archival description, even if the others are not a part of the archival community. The process goes both ways. Linked data also empowers archivists to use and incorporate the information of other linked data providers into their local description. This enables archivists to make their descriptions more thorough, more complete, and more value-added. For example, archival collections could be automatically supplemented with geographic coordinates in order to make maps, images of people or additional biographic descriptions to make collections come alive, or bibliographies for further reading.

Publishing and using linked data does not represent a change in the definition of archival description, but it does represent an evolution of how archival description is accomplished. For example, linked data is not about generating a document such as an EAD file. Instead it is about asserting sets of statements about an archival thing, and then allowing those statements to be brought together in any number of ways for any number of purposes. A finding aid is one such purpose. Indexing is another purpose. Use by a digital humanist is yet another purpose. While EAD files are encoded as XML documents and are therefore quite computer readable, the reader must know the structure of EAD in order to make the most out of the data. EAD is archives-centric. The way data is manifested in linked data is domain-agnostic.

The objectives of archives include collection, organization, preservation, description, and oftentimes access to unique materials. Linked data is about description and access. By taking advantage of linked data principles, archives will be able to improve their descriptions and increase access. This will require a shift in the way things get done but not in what gets done. The goal remains the same.

Many tools already exist for transforming data in existing formats into linked data. This data can reside in Excel spreadsheets, database applications, MARC records, or EAD files. There are tiers of linked data publishing, so one does not have to do everything all at once. But to transform existing information or to maintain information over the long haul requires the skills of many people: archivists & content specialists, administrators & managers, metadata specialists & catalogers, computer programmers & systems administrators.

Moving forward with linked data is a lot like traveling to Rome. There are many ways to get there, and there are many things to do once you arrive, but the result will undoubtedly improve your ability to participate in the discussion of the human condition on a worldwide scale.


Rome in three days, an archivist’s introduction to linked data publishing

If you go to Rome for a few days, do everything you would do in a single day, eat and drink in a few cafes, see a few fountains, and go to a museum of your choice.

Linked data in archival practice is not new. Others have been here previously. You can benefit from their experience and begin publishing linked data right now using tools with which you are probably already familiar. For example, you probably have EAD files, sets of MARC records, or metadata saved in database applications. Using existing tools, you can transform this content into RDF and put the result on the Web, thus publishing your information as linked data.

EAD

If you have used EAD to describe your collections, then you can easily make your descriptions available as valid linked data, but the result will be less than optimal. This is true not for lack of technology but rather because of the inherent purpose and structure of EAD files.

A few years ago an organisation in the United Kingdom called the Archives Hub was funded by a granting agency called JISC to explore the publishing of archival descriptions as linked data. The project was called LOCAH. One of the outcomes of this effort was the creation of an XSL stylesheet (ead2rdf) transforming EAD into RDF/XML. The terms used in the stylesheet originate from quite a number of standardized, widely accepted ontologies, and with only the tiniest bit of configuration/customization the stylesheet can transform a generic EAD file into valid RDF/XML for use by anybody. The resulting XML files can then be made available on a Web server or incorporated into a triple store. This goes a long way towards publishing archival descriptions as linked data. The only additional things needed are a transformation of EAD into HTML and the configuration of a Web server to do content negotiation between the XML and HTML.

For the smaller archive with only a few hundred EAD files whose content does not change very quickly, this is a simple, feasible, and practical solution to publishing archival descriptions as linked data. With the exception of doing some content negotiation, this solution does not require any computer technology that is not already being used in archives, and it only requires a few small tweaks to a given workflow (a scripted sketch of steps #3 through #5 follows the list):

  1. implement a content negotiation solution

  2. create and maintain EAD files
  3. transform EAD into RDF/XML

  4. transform EAD into HTML

  5. save the resulting XML and HTML files on a Web server

  6. go to step #2
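
Steps #3 through #5 can be scripted. Here is a minimal sketch, assuming Python 3 with the lxml library, a local copy of the LOCAH ead2rdf stylesheet, an EAD-to-HTML stylesheet of your choosing, and hypothetical file and directory names.

  # A minimal sketch of steps #3 through #5: transform each EAD file into
  # RDF/XML and HTML, then write both to a Web server's document root.
  # Stylesheet, file, and directory names are hypothetical.
  from pathlib import Path
  from lxml import etree

  ead2rdf = etree.XSLT(etree.parse("ead2rdf.xsl"))    # the LOCAH stylesheet
  ead2html = etree.XSLT(etree.parse("ead2html.xsl"))  # any EAD-to-HTML stylesheet

  docroot = Path("/var/www/html/findingaids")

  for ead_file in Path("ead").glob("*.xml"):
      ead = etree.parse(str(ead_file))
      rdf = etree.tostring(ead2rdf(ead), pretty_print=True)
      html = etree.tostring(ead2html(ead), pretty_print=True)
      (docroot / f"{ead_file.stem}.rdf").write_bytes(rdf)
      (docroot / f"{ead_file.stem}.html").write_bytes(html)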

EAD is a combination of narrative description and a hierarchical inventory list, and this data structure does not lend itself very well to the triples of linked data. For example, EAD headers are full of controlled vocabulary terms, but there is no way to link these terms with specific inventory items. This is because the vocabulary terms are expected to describe the collection as a whole, not individual things. This problem could be overcome if each individual component of the EAD were associated with controlled vocabulary terms, but this would significantly increase the amount of work needed to create the EAD files in the first place.

The common practice of using literals to denote the names of people, places, and things in EAD files would also need to be changed in order to fully realize the vision of linked data. Specifically, it would be necessary for archivists to supplement their EAD files with commonly used URIs denoting subject headings and named authorities. These URIs could be inserted into id attributes throughout an EAD file, and the resulting RDF would be more linkable, but the labor to do so would increase, especially since many of the named items will not exist in standardized authority lists.

Despite these shortcomings, transforming EAD files into some sort of serialized RDF goes a long way towards publishing archival descriptions as linked data. This particular process is a good beginning and outputs valid information, just information that is not as linkable as possible. This process lends itself to iterative improvements, and outputting something is better than outputting nothing. But this particular process is not for everybody. The archive whose content changes quickly, the archive with copious numbers of collections, or the archive wishing to publish the most complete linked data possible will probably not want to use EAD files as the root of their publishing system. Instead, some sort of database application is probably the best solution.

MARC

In some ways MARC lends itself very well to being published via linked data, but in the long run it is not really a feasible data structure.

Converting MARC into serialized RDF through XSLT is at least a two-step process. The first step is to convert MARC into MARCXML and then MARCXML into MODS. This can be done with any number of scripting languages and toolboxes. The second step is to use a stylesheet such as the one created by Stefano Mazzocchi (mods2rdf.xsl) to transform the MODS into RDF/XML. From there a person could save the resulting XML files on a Web server, enhance access via content negotiation, and call it linked data.
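
Here is a minimal sketch of that pipeline in Python, assuming the pymarc and lxml libraries plus local copies of a MARCXML-to-MODS stylesheet and Mazzocchi’s mods2rdf.xsl; the file names are hypothetical, and pymarc’s record_to_xml helper should be verified against the version you have installed.

  # A minimal sketch of the MARC -> MARCXML -> MODS -> RDF/XML pipeline.
  # File names are hypothetical; verify the record_to_xml helper against
  # your installed version of pymarc.
  from lxml import etree
  from pymarc import MARCReader, record_to_xml

  marcxml2mods = etree.XSLT(etree.parse("MARCXML-to-MODS.xsl"))
  mods2rdf = etree.XSLT(etree.parse("mods2rdf.xsl"))

  with open("records.mrc", "rb") as handle:
      for number, record in enumerate(MARCReader(handle)):
          marcxml = etree.fromstring(record_to_xml(record, namespace=True))
          rdfxml = mods2rdf(marcxml2mods(marcxml))
          with open(f"record-{number}.rdf", "wb") as out:
              out.write(etree.tostring(rdfxml, pretty_print=True))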

Unfortunately, this particular approach has a number of drawbacks. First and foremost, the MARC format has no place to denote URIs; MARC records are made up almost entirely of literals. Sure, URIs can be constructed from various control numbers, but things like authors, titles, subject headings, and added entries will most certainly be literals (“Mark Twain”, “Adventures of Huckleberry Finn”, “Bildungsroman”, or “Samuel Clemens”), not URIs. This issue can be overcome if the MARCXML were first converted into MODS and URIs were inserted into id or xlink attributes of bibliographic elements, but this is extra work. If an archive were to take this approach, then it would also behoove them to use MODS as their data structure of choice, not MARC. Continually converting from MARC to MARCXML to MODS would be expensive in terms of time. Moreover, with each new conversion the URIs from previous iterations would need to be re-created.

EAC-CPF

Encoded Archival Context for Corporate Bodies, Persons, and Families (EAC-CPF) goes a long way towards implementing a named authority database that could be linked from archival descriptions. These XML files could easily be transformed into serialized RDF and therefore linked data. The resulting URIs could then be incorporated into archival descriptions, making the descriptions richer and more complete. For example, the FindAndConnect site in Australia uses EAC-CPF under the hood to disseminate information about people in its collection. Similarly, “SNAC aims to not only make the [EAC-CPF] records more easily discovered and accessed but also, and at the same time, build an unprecedented resource that provides access to the socio-historical contexts (which includes people, families, and corporate bodies) in which the records were created”. More than a thousand EAC-CPF records are available from the RAMP project.

METS, MODS, OAI-PMH service providers, and perhaps more

If you have archival descriptions in either the METS or MODS formats, then transforming them into RDF is as far away as your XSLT processor and a content negotiation implementation. As of this writing there do not seem to be any METS-to-RDF stylesheets, but there are a couple of stylesheets for MODS. The biggest issue with these sorts of implementations is the URIs. It will be necessary for archivists to include URIs in as many MODS id or xlink attributes as possible. The same thing holds true for METS files, except the id attribute is not designed to hold pointers to external sites.

Some archives and libraries use a content management system called ContentDM. Whether they know it or not, ContentDM comes complete with an OAI-PMH (Open Archives Initiative – Protocol for Metadata Harvesting) interface. This means you can send a REST-ful URL to ContentDM, and you will get back an XML stream of metadata describing digital objects. Some of the digital objects in ContentDM (or any other OAI-PMH service provider) may be worth exposing as linked data, and this can easily be done with a system called oai2lod. It is a particular implementation of D2RQ, described below, and works quite well. Download the application, feed oai2lod the “home page” of the OAI-PMH service provider, and oai2lod will publish the OAI-PMH metadata as linked open data. This is another quick & dirty way to get started with linked data.
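
If you want to see what the raw OAI-PMH interaction looks like before handing it to a tool like oai2lod, here is a minimal sketch in Python; the base URL is hypothetical, while the ListRecords verb and the oai_dc metadata prefix are standard parts of the OAI-PMH protocol.

  # A minimal sketch of harvesting Dublin Core metadata from an OAI-PMH
  # service provider. The base URL is hypothetical.
  import urllib.parse
  import urllib.request

  base_url = "http://repository.example.edu/oai/oai.php"
  parameters = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

  url = base_url + "?" + urllib.parse.urlencode(parameters)
  with urllib.request.urlopen(url) as response:
      xml = response.read().decode("utf-8")

  # The result is an XML stream of metadata records which could then be
  # transformed into RDF or handed off to oai2lod.
  print(xml[:500])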

Databases

Publishing linked data through XML transformation is functional but not optimal. Publishing linked data from a database comes closer to the ideal but requires a greater amount of technical computer infrastructure and expertise.

Databases — specifically, relational databases — are the current best practice for organizing data. As you may or may not know, relational databases are made up of many tables of data joined together with keys. For example, a book may be assigned a unique identifier. The book has many characteristics such as a title, number of pages, size, descriptive note, etc. Some of the characteristics are shared by other books, like authors and subjects. In a relational database these shared characteristics would be saved in additional tables, and they would be joined to a specific book through the use of unique identifiers (keys). Given this sort of data structure, reports can be created from the database describing its content. Similarly, queries can be applied against the database to uncover relationships that may not be apparent at first glance or that are buried in reports. The power of relational databases lies in the use of keys to make relationships between rows in one table and rows in other tables. The downside of relational databases as a data model is the infinite variety of field/table combinations, which makes them difficult to share across the Web.

Not coincidentally, relational database technology is very much the way linked data is expected to be implemented. In the linked data world, the subjects of triples are URIs (think database keys). Each URI is associated with one or more predicates (think the characteristics in the book example). Each triple then has an object, and these objects take the form of literals or other URIs. In the book example, the object could be “Adventures Of Huckleberry Finn” or a URI pointing to Mark Twain. The reports of relational databases are analogous to RDF serializations, and SQL (the relational database query language) is analogous to SPARQL, the query language of RDF triple stores. Because of the close similarity between well-designed relational databases and linked data principles, the publishing of linked data directly from relational databases makes a whole lot of sense, but the process requires the combined time and skills of a number of different people: content specialists, database designers, and computer programmers. Consequently, the process of publishing linked data from relational databases may be optimal, but it is more expensive.
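
To make the analogy concrete, here is a tiny sketch of the book example expressed as RDF triples using Python and rdflib; the subject URI and the Mark Twain URI are hypothetical stand-ins for the kinds of shared identifiers linked data encourages.

  # The book example as a handful of RDF triples. The URIs are hypothetical.
  from rdflib import Graph, Literal, URIRef
  from rdflib.namespace import DCTERMS

  book = URIRef("http://archive.example.org/id/book/1")

  g = Graph()
  g.add((book, DCTERMS.title, Literal("Adventures of Huckleberry Finn")))
  g.add((book, DCTERMS.creator, URIRef("http://viaf.org/viaf/0000000")))  # stand-in URI for Mark Twain
  g.add((book, DCTERMS.extent, Literal("366 p.")))

  print(g.serialize(format="turtle"))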

Thankfully, many archivists probably use some sort of behind-the-scenes database to manage their collections and create their finding aids. Moreover, archivists probably use one of three or four tools for this purpose: Archivist’s Toolkit, Archon, ArchivesSpace, or PastPerfect. Each of these systems has a relational database at its heart. Reports could be written against the underlying databases to generate serialized RDF and thus begin the process of publishing linked data. Doing this from scratch would be difficult as well as inefficient, because many people would be starting out with the same database structure but creating a multitude of varying outputs. Consequently, there are two alternatives. The first is to use a generic database-to-RDF publishing platform called D2RQ. The second is for the community to join together and create a holistic RDF publishing system based on the database(s) used in archives.

D2RQ is a very powerful software system. It is supported, well-documented, executable on just about any computing platform, open source, focused, functional, and at the same time does not try to be all things to all people. Using D2RQ it is more than possible to quickly and easily publish a well-designed relational database as RDF. The process is relatively simple:

  • download the software

  • use a command-line utility to map the database structure to a configuration file

  • edit the configuration file to meet your needs

  • run the D2RQ server using the configuration file as input thus allowing people or RDF user-agents to search and browse the database using linked data principles

  • alternatively, dump the contents of the database to an RDF serialization and ingest the result into your favorite RDF triple store

The downside of D2RQ is its generic nature. It will create an RDF ontology whose terms correspond to the names of database fields. These field names do not map to widely accepted ontologies & vocabularies and therefore will not interact well with communities outside the ones using a specific database structure. Still, the use of D2RQ is quick, easy, and accurate.

If you are going to be in Rome for only a few days, you will want to see the major sites, and you will want to adventure out & about a bit, but at the same time it will be a wise idea to follow the lead of somebody who has been there previously. Take the advice of these people. It is an efficient way to see some of the sights.


Rome in a day, the archivist on a linked data pilgrimage

If you go to Rome for a day, then walk to the Colosseum and Vatican City. Everything you see along the way will be extra.

Linked data is not a fad. It is not a trend. It makes a lot of computing sense, and it is a modern way of fulfilling some of the goals of archival practice. Just like Rome, it is not going away. An understanding of what linked data has to offer is akin to experiencing Rome first hand. Both will ultimately broaden your perspective. Consequently it is a good idea to make a concerted effort to learn about linked data, as well as visit Rome at least once. Once you have returned from your trip, discuss what you learned with your friends, neighbors, and colleagues. The result will be enlightening for everybody.

The previous sections of this book described what linked data is and why it is important. The balance of the book describes more of the hows of linked data. For example, there is a glossary to help reinforce your knowledge of the jargon. You can learn about HTTP “content negotiation” to understand how actionable URIs can return HTML or RDF depending on the way you instruct remote HTTP servers. RDF stands for “Resource Description Framework”, and the “resources” are represented by URIs. A later section of the book describes ways to design the URIs of your resources. Learn how you can transform existing metadata records like MARC or EAD into RDF/XML, and then learn how to put the RDF/XML on the Web. Learn how to exploit your existing databases (such as the ones under Archon, Archivist’s Toolkit, or ArchivesSpace) to generate RDF. If you are the do-it-yourself type, then play with and explore the guidebook’s tools section. Get the gentlest of introductions to searching RDF using a query language called SPARQL. Learn how to read and evaluate ontologies & vocabularies. They are manifested as XML files, and they are easily readable and visualizable using a number of programs. Read about and explore applications using RDF as the underlying data model. There are a growing number of them. The book includes a complete publishing system written in Perl, and if you approach the code of the publishing system as if it were a theatrical play, then the “scripts” read like scenes. (Think of the scripts as if they were a type of poetry, and they will come to life. Most of the “scenes” are less than a page long. The poetry even includes a number of refrains. Think of the publishing system as if it were a one-act play.) If you want to read more, and you desire a vetted list of books and articles, then a later section offers suggestions for further reading.

After you have spent some time learning a bit more about linked data, discuss what you have learned with your colleagues. There are many different aspects of linked data publishing, such as but not limited to:

  • allocating time and money
  • analyzing your own RDF as well as the RDF of others
  • articulating policies
  • cleaning and improving RDF
  • collecting and harvesting the RDF of others
  • deciding what ontologies & vocabularies to use
  • designing local URIs
  • enhancing RDF triple stores by asserting additional relationships
  • finding and identifying URIs for the purposes of linking
  • making RDF available on the Web (SPARQL, RDFa, data dumps, etc.)
  • project management
  • provisioning value-added services against RDF (catalogs, finding aids, etc.)
  • storing RDF in triple stores

In archival practice, each of these things would be done by different sets of people: archivists & content specialists, administrators & managers, computer programmers & systems administrators, metadata experts & catalogers. Each of these sets of people has a piece of the publishing puzzle and something significant to contribute to the work. Read about linked data. Learn about linked data. Bring these sets of people together to discuss what you have learned. At the very least you will have a better collective understanding of the possibilities. If you don’t plan to “go to Rome” right away, you can always reconsider the “vacation” at another time.

Even Michelangelo, when he painted the Sistine Chapel, worked with a team of people each possessing a complementary set of skills. Each had something different to offer, and the discussion among them was key to their success.


Four “itineraries” for putting linked data into practice for the archivist

If you go to Rome for a day, then walk to the Colosseum and Vatican City. Everything you see along the way will be extra. If you go to Rome for a few days, do everything you would do in a single day, eat and drink in a few cafes, see a few fountains, and go to a museum of your choice. For a week, do everything you would do in a few days, and make one or two day-trips outside Rome in order to get a flavor of the wider community. If you can afford two weeks, then do everything you would do in a week, and in addition befriend somebody in the hopes of establishing a life-long relationship.

When you read a guidebook on Rome — or any travel guidebook — there are simply too many listed things to see & do. Nobody can see all the sites, visit all the museums, walk all the tours, nor eat at all the restaurants. It is literally impossible to experience everything a place like Rome has to offer. So it is with linked data. Despite this fact, if you were to do everything linked data had to offer, then you would do all of the things on the following list, starting at the first item, going all the way down to evaluation, and repeating the process over and over:

  • design the structure of your URIs
  • select/design your ontology & vocabularies — model your data
  • map and/or migrate your existing data to RDF
  • publish your RDF as linked data
  • create a linked data application
  • harvest other people’s data and create another application
  • evaluate
  • repeat

Given that it is quite possible you do not plan to immediately dive head-first into linked data, you might begin by getting your feet wet or dabbling in a bit of experimentation. That being the case, here are a number of different “itineraries” for linked data implementation. Think of them as strategies. They are ordered from least costly and most modest to most expensive and most complete:

  1. Rome in a day – Maybe you can’t afford to do anything right now, but if you have gotten this far in the guidebook, then you know something about linked data. Discuss (evaluate) linked data with your colleagues, and consider revisiting the topic in a year.
  2. Rome in three days – If you want something relatively quick and easy, but with the understanding that your implementation will not be complete, begin migrating your existing data to RDF. Use XSLT to transform your MARC or EAD files into RDF serializations, and publish them on the Web. Use something like OAI2RDF to make your OAI repositories (if you have them) available as linked data. Use something like D2RQ to make your archival description stored in databases accessible as linked data. Create a triple store and implement a SPARQL endpoint. As before, discuss linked data with your colleagues.
  3. Rome in a week – Begin publishing RDF, but at the same time think hard about and document the structure of your future RDF’s URIs as well as the ontologies & vocabularies you are going to use. Discuss it with your colleagues. Migrate and re-publish your existing data as RDF using the documentation as a guide. Re-implement your SPARQL endpoint. Discuss linked data not only with your colleagues but with people outside archival practice.
  4. Rome in two weeks – First, do everything you would do in one week. Second, supplement your triple store with the RDF of others. Third, write an application against the triple store that goes beyond search. In short, tell stories and you will be discussing linked data with the world, literally.

Italian Lectures on Semantic Web and Linked Data


Koha Gruppo Italiano has organized the following free event that may be of interest to linked data aficionados in cultural heritage institutions:

Italian Lectures on Semantic Web and Linked Data: Practical Examples for Libraries, Wednesday May 7, 2014 at The American University of Rome – Auriana Auditorium (Via Pietro Roselli, 16 – Rome, Italy)

  • 9.00 – Welcome
    • Andrew Thompson (Executive Vice President and Provost, AUR)
    • Juan Diego Ramírez (Director, Biblioteca Pontificia Università della Santa Croce)
  • 9.15 – “So many opportunities! Which ones to choose?”, Eric Lease Morgan (University of Notre Dame)
  • 10.00 – “SKOS, Nuovo Soggettario, and Wikidata: notes on the evolution of bibliographic information management systems”, Giovanni Bergamin (Biblioteca Nazionale di Firenze)
  • 10.30 – “Open, Big, and Linked Data”, Stefano Bargioni (Biblioteca Pontificia Università della Santa Croce)
  • 11.00 – “The digitization of archival and library materials: a further element for adding value to open data”, Bucap SpA
  • 11.15 – Coffee break
  • 11.45 – “xDams RELOADed: Cultural Heritage to the Web of Data”, Silvia Mazzini (Regesta.exe)
  • 12.00 – Discussion panel: “The advent of linked data and the end of MARC”
    • Federico Meschini, moderator (Università della Tuscia)
    • Lucia Panciera (Camera dei Deputati)
    • Fabio Di Giammarco (Biblioteca di Storia moderna e contemporanea)
    • Michele Missikoff and Marco Fratoddi (Stati Generali dell’Innovazione)
  • 13.00 – Close of proceedings

Please RSVP to f.wallner at aur.edu by May 5.

This event is generously sponsored by regesta.exe, Bucap Document Imaging SpA, and SOS Archivi e Biblioteche.



Linked Archival Metadata: A Guidebook

A new but still “pre-published” version of the Linked Archival Metadata: A Guidebook is available. From the introduction:

The purpose of this guidebook is to describe in detail what linked data is, why it is important, how you can publish it to the Web, how you can take advantage of your linked data, and how you can exploit the linked data of others. For the archivist, linked data is about universally making accessible and repurposing sets of facts about you and your collections. As you publish these facts you will be able to maintain a more flexible Web presence as well as a Web presence that is richer, more complete, and better integrated with complementary collections.

And from the table of contents:

  • Executive Summary
  • Introduction
  • Linked data: A Primer
  • Getting Started: Strategies and Steps
  • Projects
  • Tools and Visualizations
  • Directories of ontologies
  • Content-negotiation and cURL
  • SPARQL tutorial
  • Glossary
  • Further reading
  • Scripts
  • A question from a library school student
  • Out takes

There are a number of versions.

Feedback desired and hoped for.


What is linked data and why should I care?

“Tell me about Rome. Why should I go there?”

Linked data is a standardized process for sharing and using information on the World Wide Web. Since the process of linked data is woven into the very fabric of the way the Web operates, it is standardized and will be applicable as long as the Web is applicable. The process of linked data is domain agnostic, meaning its scope is equally apropos to archives, businesses, governments, etc. Everybody can participate, and everybody is equally invited to do so. Linked data is application independent. As long as your computer is on the Internet and knows about the World Wide Web, then it can take advantage of linked data.

Linked data is about sharing and using information (not mere data but data put into context). This information takes the form of simple “sentences” which are intended to be literally linked together to communicate knowledge. The form of linked data is similar to the forms of human language, and like human languages, linked data is expressive, nuanced, dynamic, and exact all at once. Because of its atomistic nature, linked data simultaneously simplifies and transcends previous information containers. It reduces the need for profession-specific data structures, but at the same time it does not negate their utility. This makes it easy for you to give your information away, and for you to use other people’s information.

The benefits of linked data boil down to two things: 1) it makes information more accessible to both people and computers, and 2) it opens the doors to any number of knowledge services limited only by the power of human imagination. Because it is standardized, agnostic, independent, and mimics human expression, linked data is more universal than the current processes of information dissemination. Universality implies decentralization, and decentralization promotes dissemination. On the Internet anybody can say anything at anytime. In the aggregate, this is a good thing, and it enables information to be combined in ways yet to be imagined. Publishing information as linked data enables you to seamlessly enhance your own knowledge services as well as simultaneously enhance the knowledge of others.

“Rome is the Eternal City. After visiting Rome you will be better equipped to participate in the global conversation of the human condition.”


Impressed with ReLoad

I’m impressed with the linked data project called ReLoad. Their data is robust, complete, and full of URIs as well as human-readable labels. From the project’s home page:

The ReLoad project (Repository for Linked open archival data) will foster experimentation with the technology and methods of linked open data for archival resources. Its goal is the creation of a web of linked archival data.
LOD-LAM, which is an acronym for Linked Open Data for Libraries, Archives and Museums, is an umbrella term for the community and active projects in this area.

The first experimental phase will make use of W3C semantic web standards, mash-up techniques, software for linking and for defining the semantics of the data in the selected databases.

The archives that have made portions of their institutions’ data and databases openly available for this project are the Central State Archive, and the Cultural Heritage Institute of Emilia Romagna Region. These will be used to test methodologies to expose the resources as linked open data.

For example, try these links:

Their data is rich enough that things like LodLive can visualize resources well.