Friday, 18 April 2014

cURL and content-negotiation

This is the tiniest introduction to cURL and content-negotiation. It is part of the to-be-published-in-April Linked Archival Metadata guidebook.

cURL is a command-line tool that makes it easier to see the Web as data rather than presentation. It is a sort of Web browser; more specifically, it is what is known as a user-agent. Content-negotiation is an essential technique for publishing linked data and making it accessible. Please don’t be afraid of the command line. Understanding how to use cURL and do content-negotiation by hand will take you a long way toward understanding linked data.

The first step is to download and install cURL. If you have a Macintosh or a Linux computer, then it is probably already installed. If not, then give the cURL download wizard a whirl. We’ll wait.

Next, you need to open a terminal. On Macintosh computers a terminal application called “Terminal” is located in the Utilities folder inside your Applications folder. People using Windows computers can find the “Command Prompt” application by searching for it in the Start Menu. Once cURL has been installed and a terminal has been opened, type the following command at the prompt to display a help text:

curl --help

There are many options, almost too many. It is often useful to view only one page of text at a time, and you can do this by “piping” the output through a program called “more”. Pressing the space bar moves you forward in the display, pressing “b” moves you backwards, and pressing “q” quits:

curl --help | more

Feed cURL the complete URL of Google’s home page to see how much content actually goes into their “simple” presentation:

curl http://www.google.com/ | more
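Incidentally, when cURL’s output is piped or saved to a file, it prints a progress meter to the screen. If the meter gets in the way, the -s switch (think “silent”) turns it off:

curl -s http://www.google.com/ | more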

The communication of the World Wide Web (the hypertext transfer protocol, or HTTP) is divided into two parts: 1) a header and 2) a body. By default cURL displays the body content. To see the header instead, add the -I switch (for a mnemonic, think “information”) to the command:

curl -I http://www.google.com/

The result will be a list of characteristics the remote Web server uses to describe this particular interaction between itself and you. The most important things to note are: 1) the status line and 2) the content type. The status line is the first line of the result, and it will say something like “HTTP/1.1 200 OK”, meaning there were no errors. Another line will begin with “Content-Type:” and will denote the format of the data being transferred. In most cases the content type will include something like “text/html”, meaning the content being sent is in the form of an HTML document.
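For example, the first few lines returned from Google ought to look something like this (the exact values will vary):

HTTP/1.1 200 OK
Content-Type: text/html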

Now feed cURL a URI for Walt Disney, such as one from DBpedia:

curl http://dbpedia.org/resource/Walt_Disney

The result will be empty, but asking for just the headers (the -I switch again) reveals that the status line has changed to “HTTP/1.1 303 See Other”. This means there is no content at the given URI, and the line starting with “Location:” is a pointer, an instruction, to go to a different document. In the parlance of HTTP this is called redirection:
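curl -I http://dbpedia.org/resource/Walt_Disney

Among the headers will be a line reading something like “Location: http://dbpedia.org/page/Walt_Disney”. Using cURL to go to this recommended location results in a stream of HTML: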

curl http://dbpedia.org/page/Walt_Disney | more

Most Web browsers follow HTTP redirections automatically, but cURL needs to be told to do so explicitly through the use of the -L switch. (Think “location”.) Consequently, given the original URI, the following command will display HTML even though the URI itself has no content:

curl -L http://dbpedia.org/resource/Walt_Disney | more
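If you are curious to watch the whole conversation, including the 303 response and the follow-up request, the -v switch (think “verbose”) prints the headers of each exchange as they happen:

curl -v -L http://dbpedia.org/resource/Walt_Disney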

Now remember, the Semantic Web and linked data depend on the exchange of RDF. Upon closer examination of the resulting HTML you can see “link” elements pointing to URLs with an .rdf extension. Feed one of these URLs to cURL to see an RDF representation of the Walt Disney data:

curl http://dbpedia.org/data/Walt_Disney.rdf | more
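If you would rather save the RDF to a file than page through it, the -o switch (think “output”) does the trick; the file name here is just an example:

curl -o disney.rdf http://dbpedia.org/data/Walt_Disney.rdf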

Downloading entire HTML streams, parsing them for link elements containing URLs of RDF, and then requesting the RDF is not nearly as efficient as requesting RDF from the remote server in the first place. This can be done by telling the remote server you accept RDF as a format type. This is accomplished through the use of the -H switch. (Think “header”.) For example, feed cURL the URI for Walt Disney and specify your desire for RDF/XML:

curl -H "Accept: application/rdf+xml" http://dbpedia.org/resource/Walt_Disney

Once again, the result will be empty, and upon examination of the HTTP headers (remember the -I switch) you can see that the RDF is located at a different URL, namely http://dbpedia.org/data/Walt_Disney.xml:

curl -I -H "Accept: application/rdf+xml" http://dbpedia.org/resource/Walt_Disney
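The headers returned ought to include lines something like these:

HTTP/1.1 303 See Other
Location: http://dbpedia.org/data/Walt_Disney.xml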

Finally, by adding the -L switch, you can use the URI for Walt Disney to request the RDF directly:

curl -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/Walt_Disney
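RDF/XML is not the only serialization worth asking for. Assuming the remote server supports it (DBpedia does), changing the Accept header is all it takes to request something like Turtle instead:

curl -L -H "Accept: text/turtle" http://dbpedia.org/resource/Walt_Disney | more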

That is cURL and content-negotiation in a nutshell. A user-agent submits a URI to a remote HTTP server and specifies the type of content it desires. The HTTP server responds with URLs denoting the location of the desired content. The user-agent then makes a more specific request. It is sort of like the movie: “One URI to rule them all.” In summary, remember:

  1. cURL is a command-line user-agent
  2. given a URL, cURL returns, by default, the body of an HTTP transaction
  3. the -I switch allows you to see the HTTP header
  4. the -L switch makes cURL automatically follow HTTP redirection requests
  5. the -H switch allows you to specify the type of content you wish to accept
  6. given a URI and the use of the -L and -H switches you are able to retrieve either HTML or RDF

Use cURL to see linked data in action for yourself. Here are a few more URIs to explore:

  • Walt Disney via VIAF – http://viaf.org/viaf/36927108/
  • origami via the Library of Congress – http://id.loc.gov/authorities/subjects/sh85095643
  • Paris from DBpedia – http://dbpedia.org/resource/Paris
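For example, assuming the VIAF server honors the Accept header in the same way, the pattern outlined above ought to return RDF describing Walt Disney:

curl -L -H "Accept: application/rdf+xml" http://viaf.org/viaf/36927108/ | more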
