Friday, 21 of September of 2018

Three RDF data models for archival collections

Listed and illustrated here are three examples of RDF data models for archival collections. It is interesting to literally see the complexity or thoroughness of each model, depending on your perspective.

This one was designed by Aaron Rubinstein. I don’t know whether or not it was ever put into practice.

This is the model used in Project LOACH by the Archives Hub.

This final model — OAD — is being implemented in a project called ReLoad.

There are other ontologies of interest to cultural heritage institutions, but these three seem to be the most apropos to archivists.

This work is a part of a yet-to-be published book called the LiAM Guidebook, a text intended for archivists and computer technologists interested in the application of linked data to archival description.

LiAM Guidebook – a new draft

I have made available a new draft of the LiAM Guidebook. Many of the lists of things (tools, projects, vocabulary terms, Semantic browsers, etc.) are complete. Once the lists are done I will move back to the narratives. Thanks go to various people I’ve interviewed lately (Gregory Colati, Karen Gracy, Susan Pyzynski, Aaron Rubinstein, Ed Summers, Diane Hillman, Anne Sauer, and Eliot Wilczek) because without them I would to have been able to get this far nor see a path forward.

Linked data projects of interest to archivists (and other cultural heritage personnel)

While the number of linked data websites is less than the worldwide total number, it is really not possible to list every linked data project but only things that will presently useful to the archivist and computer technologist working in cultural heritage institutions. And even then the list of sites will not be complete. Instead, listed below are a number of websites of interest today. This list is a part of the yet-to-be published LiAM Guidebook.


The following introductions are akin to directories or initial guilds filled with pointers to information about RDF especially meaningful to archivists (and other cultural heritage workers).

  • Datahub ( – This is a directory of data sets. It includes descriptions of hundreds of data collections. Some of them are linked data sets. Some of them are not.
  • LODLAM ( – LODLAM is an acronym for Linked Open Data in Libraries Archives and Museums. is a community, both virtual and real, of linked data aficionados in cultural heritage institutions. It, like OpenGLAM, is a good place to discuss linked data in general.
  • OpenGLAM ( – GLAM is an acronym for Galleries, Libraries, Archives, and Museums. OpenGLAM is a community fostered by the Open Knowledge Foundation and a place to to discuss linked data that is “free”. for It, like LODLAM, is a good place to discuss linked data in general.
  • ( – is a portal for publishing information on research and development related to the topics Semantic Web and Wikis. Includes and

Data sets and projects

The data sets and projects range from simple RDF dumps to full-blown discovery systems. In between some simple browsable lists and raw SPARQL endpoints.

  • 20th Century Press Archives ( – This is an archive of digitized newspaper articles which is made accessible not only as HTML but a number of other metadata formats such as RDFa, METS/MODS and OAI-ORE. It is a good example of how metadata publishing can be mixed and matched in a single publishing system.
  • AGRIS ( – Here you will find a very large collection of bibliographic information from the field of agriculture. It is accessible via quite a number of methods including linked data.
  • D2R Server for the CIA Factbook ( – The content of the World Fact Book distributed as linked data.
  • D2R Server for the Gutenberg Project ( – This is a data set of Project Gutenburgh content — a list of digitized public domain works, mostly books.
  • Dbpedia ( – In the simplest terms, this is the content of Wikipedia made accessible as RDF.
  • Getty Vocabularies ( – A set of data sets used to “categorize, describe, and index cultural heritage objects and information”.
  • Library of Congress Linked Data Service ( – A set of data sets used for bibliographic classification: subjects, names, genres, formats, etc.
  • LIBRIS ( – This is the joint catalog of the Swedish academic and research libraries. Search results are presented in HTML, but the URLs pointing to individual items are really actionable URIs resolvable via content negotiation, thus support distribution of bibliographic information as RDF. This initiative is very similar to OpenCat.
  • Linked Archives Hub Test Dataset ( – This data set is RDF generated from a selection of archival finding aids harvested by the Archives Hub in the United Kingdom.
  • Linked Movie Data Base ( – A data set of movie information.
  • Linked Open Data at Europeana ( – A growing set of RDF generated from the descriptions of content in Europeana.
  • Linked Open Vocabularies ( – A linked data set of linked data sets.
  • Linking Lives ( – While this project has had no working interface, it is a good read on the challenges of presenting link data people (as opposed to computers). Its blog site enumerates and discusses issues from provenance to unique identifiers, from data clean up to interface design.
  • LOCAH Project ( – This is/was a joint project between Mimas and UKOLN to make Archives Hub data available as structured Linked Data. (All three organizations are located in the United Kingdom.). EAD files were aggregated. Using XSLT, they were transformed into RDF/XML, and the RDF/XML was saved in a triple store. The triple store was then dumped as a file as well as made searchable via a SPARQL endpoint.
  • New York Times ( – A list of New York Times subject headings.
  • OCLC Data Sets & Services ( – Here you will find a number of freely available bibliographic data sets and services. Some are available as RDF and linked data. Others are Web services.
  • OpenCat ( – This is a library catalog combining the authority data (available as RDF) provided by the National Library of France with works of a second library (Fresnes Public Library). Item level search results have URIs whose RDF is available via content negotiation. This project is similar to LIBRIS.
  • PELAGIOS ( – A data set of ancient places.
  • ReLoad ( – This is a collaboration between the Central State Archive of Italy, the Cultural Heritage Institute of Emilia Romagna Region, and Regesta.exe. It is the aggregation of EAD files from a number of archives which have been transformed into RDF and made available as linked data. Its purpose and intent are very similar to the the purpose and intent of the combined LOCAH Project and Linking Lives.
  • VIAF ( – This data set functions as a name authority file.
  • World Bank Linked Data ( – A data set of World Bank indicators, climate change information, finances, etc.

RDF tools for the archivist

This posting lists various tools for archivists and computer technologists wanting to participate in various aspects of linked data. Here you will find pointers to creating, editing, storing, publishing, and searching linked data. It is a part of yet-to-be published LiAM Guidebook.


The sites listed in this section enumerate linked data and RDF tools. They are jumping off places to other sites:

RDF converters, validators, etc.

Use these tools to create RDF:

  • ead2rdf ( – This is the XST stylesheet previously used by the Archives Hub in their LOCAH Linked Archives Hub project. It transforms EAD files into RDF/XML. A slightly modified version of this stylesheet was used to create the LiAM “sandbox”.
  • Protégé ( – Install this well-respected tool locally or use it as a hosted Web application to create OWL ontologies.
  • RDF2RDF ( – A handy Java jar file enabling you to convert various versions of serialized RDF into other versions of serialized RDF.
  • Vapour, a Linked Data Validator ( – Much like the W3C validator, this online tool will validate the RDF at the other end of a URI. Unlike the W3C validator, it echoes back and forth the results of the content negotiation process.
  • W3C RDF Validation Service ( – Enter a URI or paste an RDF/XML document into the text field, and a triple representation of the corresponding data model as well as an optional graphical visualization of the data model will be displayed.

Linked data frameworks and publishing systems

Once RDF is created, use these systems to publish it as linked data:

  • 4store ( – A linked data publishing framework for managing triple stores, querying them locally, querying them via SPARQL, dumping their contents to files, as well as providing support via a number of scripting languages (PHP, Ruby, Python, Java, etc.).
  • Apache Jena ( – This is a set of tools for creating, maintaining, and publishing linked data complete a SPARQL engine, a flexible triple store application, and inference engine.
  • D2RQ ( – Use this application to provide a linked data front-end to any (well-designed) relational database. It supports SPARQL, content negotiation, and RDF dumps for direct HTTP access or uploading into triple store.
  • oai2lod ( – This is a particular implementation D2RQ Server. More specifically, this tool is an intermediary between a OAI-PMH data providers and a linked data publishing system. Configure oai2lod to point to your OAI-PMH server and it will publish the server’s metadata as linked data.
  • OpenLink Virtuoso Open-Source Edition ( – An open source version of OpenLink Virtuoso. Feature-rich and well-documented.
  • OpenLink Virtuoso Universal Server ( – This is a commercial version of OpenLink Virtuoso Open-Source Edition. It seems to be a platform for modeling and accessing data in a wide variety of forms: relational databases, RDF triples stores, etc. Again, feature-rich and well-documented.
  • openRDF ( – This is a Java-based framework for implementing linked data publishing including the establishment of a triple store and a SPARQL endpoint.

Semantic Web browsers

This is a small set of Semantic Web browsers. Give them URIs and they allow you to follow and describe the links they include.

  • LOD Browser Switch ( – This is really a gateway to other Semantic Web browsers. Feed it a URI and it will create lists of URLs pointing to Semantic Web interfaces, but many of the URLs (Semantic Web interfaces) do not seem to work. Some of the resulting URLs point to RDF\ serialization converters
  • LodLive ( – This Semantic Web browser allows you to feed it a URI and interactively follow the links associated with it. URIs can come from DBedia, Freebase, or one of your own.
  • Open Link Data Explorer ( – The most sophisticated Semantic Web browser in this set. Given a URI it creates various views of the resulting triples associated with including lists of all its properties and objects, networks graphs, tabular views, and maps (if the data includes geographic points).
  • Quick and Dirty RDF browser ( – Given the URL pointing to a file of RDF statements, this tool returns all the triples in the file and verbosely lists each of their predicate and object values. Quick and easy. This is a good for reading everything about a particular resource. The tool does not seem to support content negotiation.

If you need some URIs to begin with, then try some of these:

Writing A Book

writing a book

Semantic Web application

This posting outlines the implementation of a Semantic Web application.

Many people seem to think the ideas behind the Semantic Web (and linked data) are interesting, but many people are also waiting to see some of the benefits before committing resources to the effort. This is what I call the “chicken & egg problem of the linked data”.

While I have not created the application outlined below, I think it is more than feasible. It is a sort of inference engine feed with a URI and integer, both supplied by a person. Its ultimate goal is to find relationships between URIs that were not immediately or readily apparent.* It is a sort of “find more like this one” application. Here’s the algorithm:

  1. Allow the reader to select an actionable URI of personal interest, ideally a URI from the set of URIs you curate
  2. Submit the URI to an HTTP server or SPARQL endpoint and request RDF as output
  3. Save the output to a local store
  4. For each subject and object URI found the output, go to Step #2
  5. Go to step #2 n times for each newly harvested URI in the store where n is a reader-defined integer greater than 1; in other words, harvest more and more URIs, predicates, and literals based on the previously harvested URIs
  6. Create a set of human readable services/reports against the content of the store, and think of these services/reports akin to a type of finding aid, reference material, or museum exhibit of the future. Example services/reports might include:
    • hierarchal lists of all classes and properties – This would be a sort of semantic map. Each item on the map would be clickable allowing the reader to read more and drill down.
    • text mining reports – collect into a single “bag of words” all the literals saved in the store and create: word clouds, alphabetical lists, concordances, bibliographies, directories, gazetteers, tabulations of parts of speech, named entities, sentiment analyses, topic models, etc.
    • maps – use place names and geographic coordinates to implement a geographic information service
    • audio-visual mash-ups – bring together all the media information and create things like slideshows, movies, analyses of colors, shapes, patterns, etc.
    • search interfaces – implement a search interface against the result, SPARQL or otherwise
    • facts – remember SPARQL queries can return more than just lists. They can return mathematical results such as sums, ratios, standard deviations, etc. It can also return Boolean values helpful in answering yes/no questions. You could have a set of canned fact queries such as, how many ontologies are represented in the store. Is the number of ontologies greater than 3? Are there more than 100 names represented in this set? The count of languages used in the set, etc.
  7. Allow the reader to identify a new URI of personal interest, specifically one garnered from the reports generated in Step #6.
  8. Go to Step #2, but this time have the inference engine be more selective by having it try to crawl back to your namespace and set of locally curated URIs.
  9. Return to the reader the URIs identified in Step #8, and by consequence, these URIs ought to share some of the same characteristics as the very first URI; you have implemented a “find more like this one” tool. You, as curator of the collection of URIs might have thought the relations between the first URI and set of final URIs was obvious, but those relationships would not necessarily be obvious to the reader, and therefore new knowledge would have been created or brought to light.
  10. If there are no new URIs from Step #7, then go to Step #6 using the newly harvested content.
  11. Done. If a system were created such as the one above, then the reader would quite likely have acquired some new knowledge, and this would be especially true the greater the size of n in Step #5.
  12. Repeat. Optionally, have a computer program repeat the process with every URI in your curated collection, and have the program save the results for your inspection. You may find relationships you did not perceive previously.

I believe many people perceive the ideas behind the Semantic Web to be akin to investigations in artificial intelligence. To some degree this is true, and investigations into artificial intelligence seem to come and go in waves. “Expert systems” and “neural networks” were incarnations of artificial intelligence more than twenty years ago. Maybe the Semantic Web is just another in a long wave of forays.

On the other hand, Semantic Web applications do not need to be so sublime. They can be as simple as discovery systems, browsable interfaces, or even word clouds. The ideas behind the Semantic Web and linked data are implementable. It just a shame that nothing is catching the attention of the wider audiences.

* Remember, URIs are identifiers intended to represent real world objects and/or descriptions of real-world objects. URIs are perfect for cultural heritage institutions because cultural heritage institutions maintain both.

SPARQL tutorial

This is the simplest of SPARQL tutorials. The tutorial’s purpose is two-fold: 1) through a set of examples, introduce the reader to the syntax of SPARQL queries, and 2) to enable the reader to initially explore any RDF triple store which is exposed as a SPARQL endpoint.

SPARQL (SPARQL protocol and RDF query language) is a set of commands used to search RDF triple stores. It is modeled after SQL (structured query language), the set of commands used to search relational databases. If you are familiar with SQL, then SPARQL will be familiar. If not, then think of SPARQL queries as formalized sentences used to ask a question and get back a list of answers.

Also, remember, RDF is a data structure of triples: 1) subjects, 2) predicates, and 3) objects. The subjects of the triples are always URIs — identifiers of “things”. Predicates are also URIs, but these URIs are intended to denote relationships between subjects and objects. Objects are preferably URIs but they can also be literals (words or numbers). Finally, RDF objects and predicates are defined in human-created ontologies as a set of classes and properties where classes are abstract concepts and properties are characteristics of the concepts.

Try the following steps with just about any SPARQL endpoint:

  1. Get an overview– A good way to begin is to get a list of all the ontological classes in the triple store. In essence, the query below says, “Find all the unique objects in the triple store where any subject is a type of object, and sort the result by object.”

  2. Learn about the employed ontologies– Ideally, each of the items in the result will be an actionable URI in the form of a “cool URL”. Using your Web browser, you ought to be able to go to the URL and read a thorough description of the given class, but the URLs are not always actionable.
  3. Learn more about the employed ontologies– Using the following query you can create a list of all the properties in the triple store as well as infer some of the characteristics of each class. Unfortunately, this particular query is intense. It may require a long time to process or may not return at all. In English, the query says, “Find all the unique predicates where the RDF triple has any subject, any predicate, or any object, and sort the result by predicate.”

  4. Guess– Steps #2 and Step #3 are time intensive, and consequently it is sometimes easier just browse the triple store by selecting one of the “cool URLs” returned in Step #1. You can submit a modified version of Step #1’s query. It says, “Find all the subjects where any RDF subject (URI) is a type of object (class)”. Using the
    LiAM triple store, the following query tries to find all the things that are EAD finding aids.

  5. Learn about a specific thing– The result of Step #4 ought to be a list of (hopefully actionable) URIs. You can learn everything about that URI with the following query. It says, “Find all the predicates and objects in the triple store where the RDF triple’s subject is a given value and the predicate and object are of any value, and sort the result by predicate”. In this case, the given value is one of the items returned from Step #4.

  6. Repeat a few times– If the results from Step #5 returned seemingly meaningful and complete information about your selected URI, then repeat Step #5 a few times to get a better feel for some of the “things” in the triple store. If the results were not meaningful, then got to Step #4 to browser another class.
  7. Take these hints– The first of these following two queries generates a list of ten URIs pointing to things that came from MARC records. The second query is used to return everything about a specific URI whose data came from a MARC record.

  8. Read the manual– At this point, it is a good idea to go back to Step #2 and read the more formal descriptions of the underlying ontologies.
  9. Browse some more– If the results of Step #3 returned successfully, then browse the objects in the triple store by selecting a predicate of interest. The following queries demonstrate how to list things like titles, creators, names, and notes.

  10. Read about SPARQL– This was the tiniest of SPARQL tutorials. Using the
    LiAM data setas an example, it demonstrated how to do the all but simplest queries against a RDF triple store. There is a whole lot more to SPARQL than SELECT, DISTINCT, WHERE, ORDER BY, AND LIMIT commands. SPARQL supports a short-hand way of denoting classes and properties called PREFIX. It supports Boolean operations, limiting results based on “regular expressions”, and a few mathematical functions. SPARQL can also be used to do inserts and deletes against the triple store. The next step is to read more about SPARQL. Consider reading the
    canonical documentationfrom the W3C, ”
    SPARQL by example“, and the Jena project’s ”
    SPARQL Tutorial“. [1, 2, 3]

Finally, don’t be too intimidated about SPARQL. Yes, it is possible to submit SPARQL queries by hand, but in reality, person-friendly front-ends are expected to be created making search much easier.

Linked data and archival practice: Or, There is more than one way to skin a cat.

Two recent experiences have taught me that — when creating some sort of information service — linked data will reside and be mixed in with data collected from any number of Internet techniques. Linked data interfaces will coexist with REST-ful interfaces, or even things as rudimentary as FTP. To the archivist, this means linked data is not the be-all and end-all of information publishing. There is no such thing. To the application programmer, this means you will need to have experience with a ever-growing number of Internet protocols. To both it means, “There is more than one way to skin a cat.”

Semantic Web in Libraries, 2013

Hamburg, Germany

Hamburg, Germany

In October of 2013 I had the opportunity to attend the Semantic Web In Libraries conference. [1, 2] It was a three-day event attended by approximately three hundred people who could roughly be divided into two equally sized groups: computer scientists and cultural heritage institution employees. The bulk of the presentations fell into two categories: 1) publishing linked data, and 2) creating information services. The publishers talked about ontologies, human-computer interfaces for data creation/maintenance, and systems exposing RDF to the wider world. The people creating information services were invariably collecting, homogenizing, and adding value to data gathered from a diverse set of information services. These information services were not limited to sets of linked data. They also included services accessible via REST-ful computing techniques, OAI-PMH interfaces, and there were probably a few locally developed file transfers or relational database dumps described as well. These people where creating lists of information services, regularly harvesting content from the services, writing cross-walks, locally storing the content, indexing it, providing services against the result, and sometimes republishing any number of “stories” based on the data. For the second group of people, linked data was certainly not the only game in town.

GLAM Hack Philly

Philadelphia, United States

Philadelphia, United States

In February of 2014 I had the opportunity to attend a hackathon called GLAM Hack Philly. [3] A wide variety of data sets were presented for “hacking” against. Some where TEI files describing Icelandic manuscripts. Some was linked data published from the British museum. Some was XML describing digitized journals created by a vendor-based application. Some of it resided in proprietary database applications describing the location of houses in Philadelphia. Some of it had little or no computer-readable structure at all and described plants. Some of it was the wiki mark-up for local municipalities. After the attendees (there were about two dozen of us) learned about each of the data sets we self-selected and hacked away at projects of our own design. The results fell into roughly three categories: geo-referencing objects, creating searchable/browsable interfaces, and data enhancement. With the exception of the resulting hack repurposing journal content to create new art, the results were pretty typical for cultural heritage institutions. But what fascinated me was way us hackers selected our data sets. Namely, the more complete and well-structured the data was the more hackers gravitated towards it. Of all the data sets, the TEI files were the most complete, accurate, and computer-readable. Three or four projects were done against the TEI. (Heck, I even hacked on the TEI files. [4]) The linked data from the British Museum — very well structured but not quite as through at the TEI — attracted a large number of hackers who worked together for a common goal. All the other data sets had only one or two people working on them. What is the moral to the story? There are two of them. First, archivists, if you want people to process your data and do “kewl” things against it, then make sure the data is thorough, complete, and computer-readable. Second, computer programmers, you will need to know a variety of data formats. Linked data is not the only game in town.


In summary, the technologies described in this Guidebook are not the only way to accomplish the goals of archivists wishing to make their content more accessible. [5] Instead, linked data is just one of many protocols in the toolbox. It is open, standards-based, and simpler rather than more complex. On the other hand, other protocols exist which have a different set of strengths and weaknesses. Computer technologists will need to have a larger rather than smaller knowledge of various Internet tools. For archivists, the core of the problem is still the collection and description of content. This — a what of archival practice — continues to remain constant. It is the how of archival practice — the technology — that changes at a much faster pace.


  1. SWIB13 –
  2. SWIB3 travelogue –
  3. hackathon –
  4. my hack –
  5. Guidebook –

Archival linked data use cases

What can you do with archival linked data once it is created? Here are three use cases:

  1. Do simple publishing – At its very root, linked data is about making your data available for others to harvest and use. While the “killer linked data application” has seemingly not reared its head, this does not mean you ought not make your data available at linked data. You won’t see the benefits immediately, but sooner or later (less than 5 years from now), you will see your content creeping into the search results of Internet indexes, into the work of both computational humanists and scientists, and into the hands of esoteric hackers creating one-off applications. Internet search engines will create “knowledge graphs”, and they will include links to your content. The humanists and scientists will operate on your data similarly. Both will create visualizations illustrating trends. They will both quantifiably analyze your content looking for patterns and anomalies. Both will probably create network diagrams demonstrating the flow and interconnection of knowledge and ideas through time and space. The humanist might do all this in order to bring history to life or demonstrate how one writer influenced another. The scientist might study ways to efficiently store your data, easily move it around the Internet, or connect it with data set created by their apparatus. The hacker (those are the good guys) will create flashy-looking applications that many will think are weird and useless, but the applications will demonstrate how the technology can be exploited. These applications will inspire others, be here one day and gone the next, and over time, become more useful and sophisticated.

  2. Create a union catalog – If you make your data available as linked data, and if you find at least one other archive who is making their data available as linked data, then you can find a third somebody who will combine them into a triple store and implement a rudimentary SPARQL interface against the union. Once this is done a researcher could conceivably search the interface for a URI to see what is in both collections. The absolute imperative key to success for this to work is the judicious inclusion of URIs in both data sets. This scenario becomes even more enticing with the inclusion of two additional things. First, the more collections in the triple store the better. You can not have enough collections in the store. Second, the scenario will be even more enticing when each archive publishes their data using similar ontologies as everybody else. Success does not hinge on similar ontologies, but success is significantly enhanced. Just like the relational databases of today, nobody will be expected to query them using their native query language (SQL or SPARQL). Instead the interfaces will be much more user-friendly. The properties of classes in ontologies will become facets for searching and browsing. Free text as well as fielded searching via drop-down menus will become available. As time goes on and things mature, the output from these interfaces will be increasingly informative, easy-to-read, and computable. This means the output will answer questions, be visually appealing, as well as be available in one or more formats for other computer programs to operate upon. 

  3. Tell a story – You and your hosting institution(s) have something significant to offer. It is not just about you and your archive but also about libraries, museums, the local municipality, etc. As a whole you are a local geographic entity. You represent something significant with a story to tell. Combine your linked data with the linked data of others in your immediate area. The ontologies will be a total hodgepodge, at least at first. Now provide a search engine against the result. Maybe you begin with local libraries or museums. Allow people to search the interface and bring together the content of everybody involved. Do not just provide lists of links in search results, but instead create knowledge graphs. Supplement the output of search results with the linked data from Wikipedia, Flickr, etc. You don’t have to be a purist. In a federated search sort of way, supplement the output with content from other data feeds such as (licensed) bibliographic indexes or content harvested from OAI-PMH repositories. Creating these sorts of things on-the-fly will be challenging. On the other hand, you might implement something that is more iterative and less immediate, but more thorough and curated if you were to select a topic or theme of interest, and do your own searching and story telling. The result would be something that is at once a Web page, a document designed for printing, or something importable into another computer program.

This text is a part of a draft sponsored by LiAM — the Linked Archival Metadata: A Guidebook.