Uploading a new Treebanking Tagset

This is the standard workflow for creating and deploying a custom treebanking tagset. To follow it, you need some working knowledge of GitHub and an XML editor installed on your machine.

First, you need to clone two repos:

https://github.com/alpheios-project/arethusa-example-data, which contains the sample data for the tagsets

https://github.com/alpheios-project/arethusa-configs, which contains the configuration files for the tagsets

In both of these repos, you will need to make a branch. Name both branches after your new tagset. For this example, we will call the branches “newtags”.

A complete tagset contains at least four files: three configuration files and one example file. It is possible to make more complex tagsets that include more files, but the basic requirement for this workflow is four. Name these files using the same convention you used for the branches, so the hypothetical tagset “newtags” will have four files whose names begin with “newtags”.

The main configuration file sits in the top level of the configs folder in the arethusa-configs repository. We will call this file “newtags.json”. It is relatively short and looks something like this:

{
  "plugins": {
    "morph": {
      "@include": "./arethusa.morph/newtags_morph.json"
    },
    "relation": {
      "@include": "./arethusa.relation/newtags_relation.json"
    }
  }
}

Notice that this file points at the two other configuration files, “newtags_morph.json” and “newtags_relation.json”. Follow this naming convention when naming your morphology and relation config files.

For instructions on creating a morphological configuration file, look here. When you have completed your morphological tagset, place the file in the “arethusa.morph” folder.

Creating a relational config file is similar to creating a morphology config file. If you want to use the default relational elements, you do not need to write a relational config file; instead, remove the “relation” section from your main config file.
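The naming convention and the optional “relation” section can be sketched in Python. This is only an illustration and not part of Arethusa: the helper name build_main_config and its custom_relations flag are hypothetical, and the output mirrors only the abbreviated main config shown earlier.

```python
import json


def build_main_config(tagset, custom_relations=True):
    """Return the main config for `tagset` as a dict (hypothetical helper)."""
    plugins = {
        # Morphology config: arethusa.morph/<tagset>_morph.json
        "morph": {"@include": f"./arethusa.morph/{tagset}_morph.json"},
    }
    if custom_relations:
        # Relation config: arethusa.relation/<tagset>_relation.json
        plugins["relation"] = {
            "@include": f"./arethusa.relation/{tagset}_relation.json"
        }
    return {"plugins": plugins}


if __name__ == "__main__":
    # Contents for newtags.json when the default relations are used,
    # so the "relation" section is left out.
    config = build_main_config("newtags", custom_relations=False)
    print(json.dumps(config, indent=2))
```

Calling build_main_config("newtags") with the default flag keeps both sections, which matches the sample config for a tagset that ships its own relation file.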

In the “treebank” folder of the example data repository, create a treebank file to be associated with your tagset. Once again, the name should match the name of the tagset; in our example, that is “newtags.xml”.
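Before opening a pull request, it can help to confirm that all of the files exist under the expected names. The following is a rough sketch under stated assumptions: the helper names expected_files and missing_files are hypothetical, and the directory arguments stand for wherever the configs folder and the “treebank” folder live in your local clones.

```python
from pathlib import Path


def expected_files(configs_dir, treebank_dir, tagset, custom_relations=True):
    """List the files a basic tagset needs, following the naming convention."""
    configs_dir = Path(configs_dir)
    files = [
        configs_dir / f"{tagset}.json",                           # main config
        configs_dir / "arethusa.morph" / f"{tagset}_morph.json",  # morphology
    ]
    if custom_relations:
        # Only needed when the tagset does not use the default relations
        files.append(configs_dir / "arethusa.relation" / f"{tagset}_relation.json")
    files.append(Path(treebank_dir) / f"{tagset}.xml")            # example treebank
    return files


def missing_files(configs_dir, treebank_dir, tagset, custom_relations=True):
    """Return the expected files that are not present on disk."""
    wanted = expected_files(configs_dir, treebank_dir, tagset, custom_relations)
    return [path for path in wanted if not path.is_file()]


if __name__ == "__main__":
    # Placeholder paths: adjust them to the layout of your local clones.
    for path in missing_files("configs", "treebank", "newtags"):
        print(f"missing: {path}")
```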

Once you have created your treebank file, open it in an XML editor, or edit the XML on Perseids.
In the root <treebank> element, there is an attribute called “format”. To point the file at your config file, change the value of format to the name of your main config file, without the “.json” extension. Our sample would look something like this:

<treebank version="1.5"
  xml:lang="lat" format="newtags"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:treebank="http://nlp.perseus.tufts.edu/syntax/treebank/1.5"
  xsi:schemaLocation="http://nlp.perseus.tufts.edu/syntax/treebank/1.5 treebank-1.5.xsd">
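A quick way to confirm that a treebank file points at the intended tagset is to read the format attribute back with Python's standard xml.etree.ElementTree module. This is a minimal sketch; the helper names treebank_format and uses_tagset are hypothetical, and the sample string is an abbreviated stand-in for a real treebank file.

```python
import xml.etree.ElementTree as ET


def treebank_format(xml_text):
    """Return the format attribute of the root element, or None if absent."""
    root = ET.fromstring(xml_text)
    return root.get("format")


def uses_tagset(xml_text, tagset):
    """True if the treebank's format attribute names the given tagset."""
    return treebank_format(xml_text) == tagset


if __name__ == "__main__":
    # Abbreviated sample; a real treebank file also contains sentences.
    sample = '<treebank version="1.5" xml:lang="lat" format="newtags"></treebank>'
    print(uses_tagset(sample, "newtags"))
```

The sketch only reads the attribute. Editing the value by hand, as described above, avoids any risk of a script rewriting the namespace declarations on the root element.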

Once you have set up those files, open a pull request against the main branch of both the arethusa-configs and the arethusa-example-data repositories. We will test and process the files, and eventually merge them into the main repositories. Once merged, they will be available for use in Arethusa via Perseids online.