Wednesday, 16 of April of 2014

Four “itineraries” for putting linked data into practice for the archivist

If you go to Rome for a day, then walk to the Colosseum and Vatican City. Everything you see along the way will be extra. If you go to Rome for a few days, do everything you would do in a single day, eat and drink in a few cafes, see a few fountains, and go to a museum of your choice. For a week, do everything you would do in a few days, and make one or two day-trips outside Rome in order to get a flavor of the wider community. If you can afford two weeks, then do everything you would do in a week, and in addition befriend somebody in the hopes of establishing a life-long relationship.

When you read a guidebook on Rome, or any travel guidebook, there are simply too many listed things to see & do. Nobody can see all the sites, visit all the museums, walk all the tours, nor eat at all the restaurants. It is literally impossible to experience everything a place like Rome has to offer. So it is with linked data. Despite this fact, if you were to do everything linked data had to offer, then you would do all of the things on the following list, starting at the first item, going all the way down to evaluation, and repeating the process over and over:

  • design the structure of your URIs
  • select/design your ontology & vocabularies — model your data
  • map and/or migrate your existing data to RDF
  • publish your RDF as linked data
  • create a linked data application
  • harvest other people’s data and create another application
  • evaluate
  • repeat

Given that it is quite possible you do not plan to immediately dive head-first into linked data, you might begin by getting your feet wet or dabbling in a bit of experimentation. That being the case, here are a number of different “itineraries” for linked data implementation. Think of them as strategies. They are ordered from the least costly and most modest to the most expensive and most complete:

  1. Rome in a day – Maybe you can’t afford to do anything right now, but if you have gotten this far in the guidebook, then you know something about linked data. Discuss (evaluate) linked data with your colleagues, and consider revisiting the topic in a year.
  2. Rome in three days – If you want something relatively quick and easy, but with the understanding that your implementation will not be complete, begin migrating your existing data to RDF. Use XSLT to transform your MARC or EAD files into RDF serializations, and publish them on the Web. Use something like OAI2RDF to make your OAI repositories (if you have them) available as linked data. Use something like D2RQ to make your archival description stored in databases accessible as linked data. Create a triple store and implement a SPARQL endpoint. As before, discuss linked data with your colleagues.
  3. Rome in a week – Begin publishing RDF, but at the same time think hard about and document the structure of your future RDF’s URIs as well as the ontologies & vocabularies you are going to use. Discuss it with your colleagues. Migrate and re-publish your existing data as RDF using the documentation as a guide. Re-implement your SPARQL endpoint. Discuss linked data not only with your colleagues but with people outside archival practice.
  4. Rome in two weeks – First, do everything you would do in one week. Second, supplement your triple store with the RDF of others. Third, write an application against the triple store that goes beyond search. In short, tell stories and you will be discussing linked data with the world, literally.
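
As a hedged illustration of the migration step in “Rome in three days”, here is a minimal sketch of turning one archival description into RDF. The itinerary suggests XSLT for this work; the sketch swaps in Python's standard library only to keep the example self-contained. The EAD fragment is drastically simplified and has no namespaces, the base URI (http://example.org/id/) is a placeholder, and the choice of Dublin Core terms is an assumption made for illustration, not a recommended ontology.

```python
import xml.etree.ElementTree as ET

# A drastically simplified, namespace-free EAD fragment (illustrative only).
EAD = """
<ead>
  <eadheader><eadid>example-001</eadid></eadheader>
  <archdesc>
    <did>
      <unittitle>Papers of an Example Person</unittitle>
      <unitdate>1900-1950</unitdate>
    </did>
  </archdesc>
</ead>
"""

BASE = "http://example.org/id/"          # hypothetical URI pattern
DCTERMS = "http://purl.org/dc/terms/"    # Dublin Core terms namespace


def literal(text):
    """Escape a string for use as a plain N-Triples literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def ead_to_ntriples(xml_text):
    """Return a list of N-Triples lines describing one finding aid."""
    root = ET.fromstring(xml_text.strip())
    subject = "<" + BASE + root.findtext("eadheader/eadid").strip() + ">"
    # Assumed mapping from EAD elements to predicates (an illustration only).
    mapping = {
        "archdesc/did/unittitle": DCTERMS + "title",
        "archdesc/did/unitdate": DCTERMS + "date",
    }
    triples = []
    for path, predicate in mapping.items():
        value = root.findtext(path)
        if value:
            triples.append(f"{subject} <{predicate}> {literal(value.strip())} .")
    return triples


if __name__ == "__main__":
    print("\n".join(ead_to_ntriples(EAD)))
```

The resulting lines could be saved to a file and published on the Web as-is, or loaded into a triple store. Deciding on the real URI pattern and vocabulary is the work of the one-week itinerary.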

Italian Lectures on Semantic Web and Linked Data


Koha Gruppo Italiano has organized the following free event that may be of interest to linked data aficionados in cultural heritage institutions:

Italian Lectures on Semantic Web and Linked Data: Practical Examples for Libraries, Wednesday May 7, 2014 at The American University of Rome – Auriana Auditorium (Via Pietro Roselli, 16 – Rome, Italy)

  • 9.00 – Welcome
    • Andrew Thompson (Executive Vice President and Provost AUR)
    • Juan Diego Ramírez (Library Director, Pontificia Università della Santa Croce)
  • 9.15 – “So many opportunities! Which ones to choose?”, Eric Lease Morgan (University of Notre Dame)
  • 10.00 – “SKOS, Nuovo Soggettario e Wikidata: appunti per l’evoluzione dei sistemi di gestione dell’informazione bibliografica”, Giovanni Bergamin (Biblioteca Nazionale di Firenze)
  • 10.30 – “Open, Big, and Linked Data”, Stefano Bargioni (Biblioteca Pontificia Università della Santa Croce)
  • 11.00 – “La digitalizzazione di materiale archivistico e bibliotecario: un ulteriore elemento per valorizzare gli open data”, Bucap Spa
  • 11.15 – Coffee break
  • 11.45 – “xDams RELOADed: Cultural Heritage to the Web of Data”, Silvia Mazzini (Regesta.exe)
  • 12.00 – Discussion Panel: “L’avvento dei linked data e la fine del MARC”
    • Federico Meschini, moderator (Università della Tuscia)
    • Lucia Panciera (Camera dei Deputati)
    • Fabio Di Giammarco (Biblioteca di Storia moderna e contemporanea)
    • Michele Missikoff and Marco Fratoddi (Stati Generali dell’Innovazione)
  • 13.00 – Close of proceedings

Please RSVP to f.wallner at aur.edu by May 5.

This event is generously sponsored by regesta.exe, Bucap Document Imaging SpA, and SOS Archivi e Biblioteche.



Linked Archival Metadata: A Guidebook

A new but still “pre-published” version of the Linked Archival Metadata: A Guidebook is available. From the introduction:

The purpose of this guidebook is to describe in detail what linked data is, why it is important, how you can publish it to the Web, how you can take advantage of your linked data, and how you can exploit the linked data of others. For the archivist, linked data is about universally making accessible and repurposing sets of facts about you and your collections. As you publish these facts you will be able to maintain a more flexible Web presence as well as a Web presence that is richer, more complete, and better integrated with complementary collections.

And from the table of contents:

  • Executive Summary
  • Introduction
  • Linked data: A Primer
  • Getting Started: Strategies and Steps
  • Projects
  • Tools and Visualizations
  • Directories of ontologies
  • Content-negotiation and cURL
  • SPARQL tutorial
  • Glossary
  • Further reading
  • Scripts
  • A question from a library school student
  • Out takes

There are a number of versions:

Feedback desired and hoped for.


What is linked data and why should I care?

“Tell me about Rome. Why should I go there?”

Linked data is a standardized process for sharing and using information on the World Wide Web. Since the process of linked data is woven into the very fabric of the way the Web operates, it is standardized and will be applicable as long as the Web is applicable. The process of linked data is domain agnostic meaning its scope is equally apropos to archives, businesses, governments, etc. Everybody can and everybody is equally invited to participate. Linked data is application independent. As long as your computer is on the Internet and knows about the World Wide Web, then it can take advantage of linked data.

Linked data is about sharing and using information (not mere data but data put into context). This information takes the form of simple “sentences” which are intended to be literally linked together to communicate knowledge. The form of linked data is similar to the forms of human language, and like human languages, linked data is expressive, nuanced, dynamic, and exact all at once. Because of its atomistic nature, linked data simultaneously simplifies and transcends previous information containers. It reduces the need for profession-specific data structures, but at the same time it does not negate their utility. This makes it easy for you to give your information away, and for you to use other people’s information.
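
To make the idea of linked “sentences” a bit more concrete, here is a small sketch in Python. Each sentence is a subject-predicate-object triple, and because the object of one sentence can be the subject of another, the sentences link together. The predicates come from well-known vocabularies (Dublin Core, FOAF, RDF Schema), but every example.org URI and every name below is invented for illustration.

```python
# Each "sentence" is a (subject, predicate, object) triple. An object that is
# itself described elsewhere acts as a link from one sentence to the next.
# Every example.org URI below is made up.
TRIPLES = [
    ("http://example.org/collection/1", "http://purl.org/dc/terms/title", "Papers of an Example Person"),
    ("http://example.org/collection/1", "http://purl.org/dc/terms/creator", "http://example.org/person/7"),
    ("http://example.org/person/7", "http://xmlns.com/foaf/0.1/name", "Example Person"),
    ("http://example.org/person/7", "http://xmlns.com/foaf/0.1/based_near", "http://example.org/place/rome"),
    ("http://example.org/place/rome", "http://www.w3.org/2000/01/rdf-schema#label", "Rome"),
]


def describe(subject, triples):
    """Return every (predicate, object) pair asserted about a subject."""
    return [(p, o) for s, p, o in triples if s == subject]


def follow(subject, triples, seen=None):
    """Collect all subjects reachable from a starting subject by following links."""
    seen = set() if seen is None else seen
    if subject in seen:
        return seen
    seen.add(subject)
    subjects = {s for s, _, _ in triples}
    for _, obj in describe(subject, triples):
        if obj in subjects:  # the object is itself described, so it is a link
            follow(obj, triples, seen)
    return seen
```

Starting from the collection, follow() reaches the person and then the place, which is the “literally linked together” part: three small sets of facts, possibly published by three different institutions, read as one.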

The benefits of linked data boil down to two things: 1) it makes information more accessible to both people as well as computers, and 2) it opens the doors to any number of knowledge services limited only by the power of human imagination. Because it is standardized, agnostic, independent, and mimics human expression, linked data is more universal than the current processes of information dissemination. Universality implies decentralization, and decentralization promotes dissemination. On the Internet anybody can say anything at any time. In the aggregate, this is a good thing and it enables information to be combined in ways yet to be imagined. Publishing information as linked data enables you to seamlessly enhance your own knowledge services as well as simultaneously enhance the knowledge of others.

“Rome is the Eternal City. After visiting Rome you will be better equipped to participate in the global conversation of the human condition.”


Impressed with ReLoad

I’m impressed with the linked data project called ReLoad. Their data is robust, complete, and full of URIs as well as human-readable labels. From the project’s home page:

The ReLoad project (Repository for Linked open archival data) will foster experimentation with the technology and methods of linked open data for archival resources. Its goal is the creation of a web of linked archival data.
LOD-LAM, which is an acronym for Linked Open Data for Libraries, Archives and Museums, is an umbrella term for the community and active projects in this area.

The first experimental phase will make use of W3C semantic web standards, mash-up techniques, software for linking and for defining the semantics of the data in the selected databases.

The archives that have made portions of their institutions’ data and databases openly available for this project are the Central State Archive, and the Cultural Heritage Institute of Emilia Romagna Region. These will be used to test methodologies to expose the resources as linked open data.

For example, try these links:

Their data is rich enough so things like LodLive can visualize resources well:


Three RDF data models for archival collections

Listed and illustrated here are three examples of RDF data models for archival collections. It is interesting to literally see the complexity or thoroughness of each model, depending on your perspective.

This one was designed by Aaron Rubinstein. I don’t know whether or not it was ever put into practice.

This is the model used in the LOCAH Project by the Archives Hub.

This final model, OAD, is being implemented in a project called ReLoad.

There are other ontologies of interest to cultural heritage institutions, but these three seem to be the most apropos to archivists.

This work is a part of a yet-to-be published book called the LiAM Guidebook, a text intended for archivists and computer technologists interested in the application of linked data to archival description.


LiAM Guidebook – a new draft

I have made available a new draft of the LiAM Guidebook. Many of the lists of things (tools, projects, vocabulary terms, Semantic Web browsers, etc.) are complete. Once the lists are done I will move back to the narratives. Thanks go to various people I’ve interviewed lately (Gregory Colati, Karen Gracy, Susan Pyzynski, Aaron Rubinstein, Ed Summers, Diane Hillman, Anne Sauer, and Eliot Wilczek) because without them I would not have been able to get this far nor see a path forward.


Linked data projects of interest to archivists (and other cultural heritage personnel)

While the number of linked data websites is smaller than the total number of websites worldwide, it is still not possible to list every linked data project. Instead, listed below are a number of websites of interest today, limited to things that will presently be useful to the archivist and computer technologist working in cultural heritage institutions. Even then the list of sites will not be complete. This list is a part of the yet-to-be published LiAM Guidebook.

Introductions

The following introductions are akin to directories or initial guides filled with pointers to information about RDF especially meaningful to archivists (and other cultural heritage workers).

  • Datahub (http://datahub.io/) – This is a directory of data sets. It includes descriptions of hundreds of data collections. Some of them are linked data sets. Some of them are not.
  • LODLAM (http://lodlam.net/) – LODLAM is an acronym for Linked Open Data in Libraries Archives and Museums. LODLAM.net is a community, both virtual and real, of linked data aficionados in cultural heritage institutions. It, like OpenGLAM, is a good place to discuss linked data in general.
  • OpenGLAM (http://openglam.org) – GLAM is an acronym for Galleries, Libraries, Archives, and Museums. OpenGLAM is a community fostered by the Open Knowledge Foundation and a place to discuss linked data that is “free”. It, like LODLAM, is a good place to discuss linked data in general.
  • semanticweb.org (http://semanticweb.org) – semanticweb.org is a portal for publishing information on research and development related to the topics Semantic Web and Wikis. Includes data.semanticweb.org and data.semanticweb.org/snorql.

Data sets and projects

The data sets and projects range from simple RDF dumps to full-blown discovery systems. In between are some simple browsable lists and raw SPARQL endpoints.

  • 20th Century Press Archives (http://zbw.eu/beta/p20) – This is an archive of digitized newspaper articles which is made accessible not only as HTML but also in a number of other metadata formats such as RDFa, METS/MODS, and OAI-ORE. It is a good example of how metadata publishing can be mixed and matched in a single publishing system.
  • AGRIS (http://agris.fao.org/openagris/) – Here you will find a very large collection of bibliographic information from the field of agriculture. It is accessible via quite a number of methods including linked data.
  • D2R Server for the CIA Factbook (http://wifo5-03.informatik.uni-mannheim.de/factbook/) – The content of the World Fact Book distributed as linked data.
  • D2R Server for the Gutenberg Project (http://wifo5-03.informatik.uni-mannheim.de/gutendata/) – This is a data set of Project Gutenberg content: a list of digitized public domain works, mostly books.
  • DBpedia (http://dbpedia.org/About) – In the simplest terms, this is the content of Wikipedia made accessible as RDF.
  • Getty Vocabularies (http://vocab.getty.edu) – A set of data sets used to “categorize, describe, and index cultural heritage objects and information”.
  • Library of Congress Linked Data Service (http://id.loc.gov/) – A set of data sets used for bibliographic classification: subjects, names, genres, formats, etc.
  • LIBRIS (http://libris.kb.se) – This is the joint catalog of the Swedish academic and research libraries. Search results are presented in HTML, but the URLs pointing to individual items are really actionable URIs resolvable via content negotiation, and thus support the distribution of bibliographic information as RDF. This initiative is very similar to OpenCat.
  • Linked Archives Hub Test Dataset (http://data.archiveshub.ac.uk) – This data set is RDF generated from a selection of archival finding aids harvested by the Archives Hub in the United Kingdom.
  • Linked Movie Data Base (http://linkedmdb.org/) – A data set of movie information.
  • Linked Open Data at Europeana (http://pro.europeana.eu/datasets) – A growing set of RDF generated from the descriptions of content in Europeana.
  • Linked Open Vocabularies (http://lov.okfn.org/dataset/lov/) – A linked data set of linked data sets.
  • Linking Lives (http://archiveshub.ac.uk/linkinglives/) – While this project has had no working interface, it is a good read on the challenges of presenting linked data to people (as opposed to computers). Its blog site enumerates and discusses issues from provenance to unique identifiers, from data clean up to interface design.
  • LOCAH Project (http://archiveshub.ac.uk/locah/) – This is/was a joint project between Mimas and UKOLN to make Archives Hub data available as structured Linked Data. (All three organizations are located in the United Kingdom.) EAD files were aggregated. Using XSLT, they were transformed into RDF/XML, and the RDF/XML was saved in a triple store. The triple store was then dumped as a file as well as made searchable via a SPARQL endpoint.
  • New York Times (http://data.nytimes.com/) – A list of New York Times subject headings.
  • OCLC Data Sets & Services (http://www.oclc.org/data/) – Here you will find a number of freely available bibliographic data sets and services. Some are available as RDF and linked data. Others are Web services.
  • OpenCat (http://demo.cubicweb.org/opencatfresnes/) – This is a library catalog combining the authority data (available as RDF) provided by the National Library of France with works of a second library (Fresnes Public Library). Item level search results have URIs whose RDF is available via content negotiation. This project is similar to LIBRIS.
  • PELAGIOS (http://pelagios-project.blogspot.com/p/about-pelagios.html) – A data set of ancient places.
  • ReLoad (http://labs.regesta.com/progettoReload/en) – This is a collaboration between the Central State Archive of Italy, the Cultural Heritage Institute of Emilia Romagna Region, and Regesta.exe. It is the aggregation of EAD files from a number of archives which have been transformed into RDF and made available as linked data. Its purpose and intent are very similar to the purpose and intent of the combined LOCAH Project and Linking Lives.
  • VIAF (http://viaf.org/) – This data set functions as a name authority file.
  • World Bank Linked Data (http://worldbank.270a.info/.html) – A data set of World Bank indicators, climate change information, finances, etc.
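
Several of the entries above (LIBRIS and OpenCat, for example) mention URIs whose RDF is available via content negotiation. As a rough sketch, and not a description of how any particular site is implemented, the following Python builds an HTTP request that asks for RDF instead of HTML by way of the Accept header. The function names are my own, and nothing is sent over the network until fetch() is called with a real URI.

```python
import urllib.request

# Media types for a few common RDF serializations.
RDF_TYPES = {
    "rdfxml": "application/rdf+xml",
    "turtle": "text/turtle",
    "ntriples": "application/n-triples",
}


def build_request(uri, serialization="rdfxml"):
    """Build (but do not send) a request asking for an RDF serialization."""
    return urllib.request.Request(uri, headers={"Accept": RDF_TYPES[serialization]})


def fetch(uri, serialization="rdfxml"):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(build_request(uri, serialization)) as response:
        return response.read().decode("utf-8")
```

Whether a given server honors the header, and which serializations it offers, varies from site to site. A browser asking for HTML and a script asking for RDF can use the very same URI, which is the point of the technique.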

RDF tools for the archivist

This posting lists various tools for archivists and computer technologists wanting to participate in various aspects of linked data. Here you will find pointers to creating, editing, storing, publishing, and searching linked data. It is a part of the yet-to-be published LiAM Guidebook.

Directories

The sites listed in this section enumerate linked data and RDF tools. They are jumping off places to other sites:

RDF converters, validators, etc.

Use these tools to create RDF:

  • ead2rdf (http://data.archiveshub.ac.uk/xslt/ead2rdf.xsl) – This is the XSLT stylesheet previously used by the Archives Hub in their LOCAH Linked Archives Hub project. It transforms EAD files into RDF/XML. A slightly modified version of this stylesheet was used to create the LiAM “sandbox”.
  • Protégé (http://protege.stanford.edu) – Install this well-respected tool locally or use it as a hosted Web application to create OWL ontologies.
  • RDF2RDF (http://www.l3s.de/~minack/rdf2rdf/) – A handy Java jar file enabling you to convert various versions of serialized RDF into other versions of serialized RDF.
  • Vapour, a Linked Data Validator (http://validator.linkeddata.org/vapour) – Much like the W3C validator, this online tool will validate the RDF at the other end of a URI. Unlike the W3C validator, it echoes back and forth the results of the content negotiation process.
  • W3C RDF Validation Service (http://www.w3.org/RDF/Validator/) – Enter a URI or paste an RDF/XML document into the text field, and a triple representation of the corresponding data model as well as an optional graphical visualization of the data model will be displayed.

Linked data frameworks and publishing systems

Once RDF is created, use these systems to publish it as linked data:

  • 4store (http://4store.org/) – A linked data publishing framework for managing triple stores, querying them locally, querying them via SPARQL, dumping their contents to files, as well as providing support via a number of scripting languages (PHP, Ruby, Python, Java, etc.).
  • Apache Jena (http://jena.apache.org/) – This is a set of tools for creating, maintaining, and publishing linked data, complete with a SPARQL engine, a flexible triple store application, and an inference engine.
  • D2RQ (http://d2rq.org/) – Use this application to provide a linked data front-end to any (well-designed) relational database. It supports SPARQL, content negotiation, and RDF dumps for direct HTTP access or uploading into a triple store.
  • oai2lod (https://github.com/behas/oai2lod) – This is a particular implementation of D2RQ Server. More specifically, this tool is an intermediary between OAI-PMH data providers and a linked data publishing system. Configure oai2lod to point to your OAI-PMH server and it will publish the server’s metadata as linked data.
  • OpenLink Virtuoso Open-Source Edition (https://github.com/openlink/virtuoso-opensource/) – An open source version of OpenLink Virtuoso. Feature-rich and well-documented.
  • OpenLink Virtuoso Universal Server (http://virtuoso.openlinksw.com) – This is a commercial version of OpenLink Virtuoso Open-Source Edition. It seems to be a platform for modeling and accessing data in a wide variety of forms: relational databases, RDF triple stores, etc. Again, feature-rich and well-documented.
  • openRDF (http://www.openrdf.org/) – This is a Java-based framework for implementing linked data publishing including the establishment of a triple store and a SPARQL endpoint.
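
Most of the systems above can expose a SPARQL endpoint. As a hedged sketch, a query can be sent to such an endpoint as an HTTP GET request with the query text carried in a parameter named “query”, which is the pattern described by the SPARQL protocol. The endpoint address below is hypothetical, the function names are my own, and the query assumes the triple store uses Dublin Core titles; substitute the details of your own installation.

```python
import urllib.parse
import urllib.request

ENDPOINT = "http://example.org/sparql"   # hypothetical endpoint address

# List ten things in the triple store along with their titles.
QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?thing ?title
WHERE { ?thing dcterms:title ?title }
LIMIT 10
"""


def build_query_url(endpoint, query):
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query.strip()})


def run_query(endpoint, query):
    """Send the query and return the raw JSON results as text."""
    request = urllib.request.Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Only build_query_url() can be tried without a running server. Calling run_query() requires a real endpoint, and the formats an endpoint is willing to return depend on the software behind it.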

Semantic Web browsers

This is a small set of Semantic Web browsers. Give them URIs and they allow you to follow and describe the links they include.

  • LOD Browser Switch (http://browse.semanticweb.org) – This is really a gateway to other Semantic Web browsers. Feed it a URI and it will create lists of URLs pointing to Semantic Web interfaces, but many of the URLs (Semantic Web interfaces) do not seem to work. Some of the resulting URLs point to RDF serialization converters.
  • LodLive (http://en.lodlive.it) – This Semantic Web browser allows you to feed it a URI and interactively follow the links associated with it. URIs can come from DBpedia, Freebase, or one of your own.
  • Open Link Data Explorer (http://demo.openlinksw.com/rdfbrowser2/) – The most sophisticated Semantic Web browser in this set. Given a URI, it creates various views of the resulting triples associated with it, including lists of all its properties and objects, network graphs, tabular views, and maps (if the data includes geographic points).
  • Quick and Dirty RDF browser (http://graphite.ecs.soton.ac.uk/browser/) – Given the URL pointing to a file of RDF statements, this tool returns all the triples in the file and verbosely lists each of their predicate and object values. Quick and easy. This is good for reading everything about a particular resource. The tool does not seem to support content negotiation.
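
The idea behind something like the Quick and Dirty RDF browser can be sketched in a few lines. The following is not that tool's code. It is a minimal illustration, under stated assumptions, that reads N-Triples (URIs and plain literals only; no blank nodes, language tags, or datatypes) and lists the predicate and object values for each subject. The sample data is invented.

```python
import re
from collections import defaultdict

# Matches a simplified N-Triples line: <subject> <predicate> <uri> or "literal", then a period.
LINE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"((?:[^"\\]|\\.)*)")\s*\.\s*$'
)

# Invented sample data.
SAMPLE = """
<http://example.org/collection/1> <http://purl.org/dc/terms/title> "Papers of an Example Person" .
<http://example.org/collection/1> <http://purl.org/dc/terms/creator> <http://example.org/person/7> .
<http://example.org/person/7> <http://xmlns.com/foaf/0.1/name> "Example Person" .
"""


def group_by_subject(ntriples):
    """Return {subject: [(predicate, object), ...]} for the lines that parse.

    Lines that do not match the simplified pattern are silently skipped, and
    literal values are kept as written (escape sequences are not decoded).
    """
    grouped = defaultdict(list)
    for line in ntriples.splitlines():
        match = LINE.match(line.strip())
        if match:
            subject, predicate, uri, text = match.groups()
            grouped[subject].append((predicate, uri if uri is not None else text))
    return dict(grouped)


def listing(ntriples):
    """Format the grouped triples as an indented, human-readable listing."""
    lines = []
    for subject, pairs in group_by_subject(ntriples).items():
        lines.append(subject)
        lines.extend(f"  {predicate} -> {obj}" for predicate, obj in pairs)
    return "\n".join(lines)
```

Calling print(listing(SAMPLE)) shows each subject followed by everything said about it. For real-world RDF a proper parsing library is the safer choice, since the regular expression above covers only a small part of the N-Triples syntax.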

If you need some URIs to begin with, then try some of these: