Notes on the status of LSJ and Lewis and Short

July, 2013


History

Perseus and Open Source

Open Source Exclusions

Versions

Known Errors and Inconsistencies


History

As early as May 1985, before planning for Perseus had even begun, Gregory Crane delivered lectures on the topic of digitizing the Liddell-Scott-Jones Greek-English Lexicon, ninth edition (“LSJ9”). In 1993, the project corresponded with Anna Morpurgo-Davies of Oxford University and received permission to begin planning for an electronic version of LSJ9. A proposal submitted to the U.S. National Endowment for the Humanities received generous support in 1994.[1] By the fall of 1995, the Perseus Project had made a working version of LSJ9 available on the WWW.[2]

In September 1996, Perseus announced an award from the U.S. National Endowment for the Humanities for a Digital Library on Ancient Roman Culture,[3] enabling the addition of Latin texts and tools. The first release of Perseus Latin, in October 1997, included A Latin Dictionary by Charlton T. Lewis and Charles Short (“Lewis and Short” or “LS”).

Perseus and Open Source

Written mostly in Perl, the production version of the on-line Perseus text management system (the basis of Perseus 3.0, the Perseus Digital Library) evolved and grew over several years. With few precedents and examples to follow, however, the code behind this system reflected organic growth and experimentation, and became difficult to sustain, share, and modify. While all versions of the Perseus Digital Library system were designed to be open-source, each of the previous incarnations of Perseus was complex and difficult to document, which presented obstacles to new avenues of collaborative research and development.[4]

As digital library systems matured in the early 2000s, the project sought third-party solutions for delivering resources. At the time, most digital libraries concentrated on locating objects and then left it to the users to make sense of what they had found. In contrast, Perseus had increasingly focused on giving users the tools to understand what the digital library gave them: the project depended upon a range of automatic linking, information extraction and visualization services that existing, largely catalog-oriented systems could not support. The project chose to build a new digital library system, designing it from the start to be interoperable, modular, and open-source.[5] Released in May 2005, this is the system known as Perseus 4.0. This is the current version of Perseus as of July, 2013.

Open Source Exclusions

In March 2006, Perseus released downloadable XML-source texts of public domain materials. Most large reference works were excluded from this text collection, including LSJ and Lewis and Short. The decision was based on numerous factors, mainly the amount of work to be done improving these works and the overall quality of the text and markup. Some collaborators and researchers were given the data for various special projects and implementations.

Versions

There are numerous versions of the LSJ (and Lewis and Short) based on the Perseus source available on the internet. Due to the ad hoc nature of the distribution of these works and their dissemination through channels other than an explicit download link, users may not have been aware of the terms or restrictions under which Perseus text downloads are offered.[6] Many of these branched versions of the lexica undoubtedly contain significant corrections and improvements to the source data. Conversely, corrections made by Perseus editors are not found in the various derivative versions of the LSJ and Lewis and Short.

Known Errors and Inconsistencies

These files have never been proofread completely. A subset of the data was checked for quality control, but no human being has read through the files in full to check for errors. Since their digitization, we have relied on the user community and collaborators to report errors or problems.

This is a brief overview of the types of errors and problems in these files.

Data entry:
This work was professionally double-keyed by a data entry firm, but there are errors. Many of these errors can be attributed to the quality of the source documents or photocopies thereof. The small print in the original pages and unclear characters, particularly accents, made the original data entry challenging.

Tagging:
The structure and tags of these large lexica were applied through a combination of data entry and automated scripts. Some data was tagged incorrectly by the automated scripts. For instance, the script for translations endeavored to capture the first English text in a given entry that was not otherwise tagged as a cross reference or other item, which can result in the wrong text being captured as a translation. Many similar types of errors are found in the lexica files.

Cross references:
Large reference works like the LSJ and Lewis and Short contain hundreds of thousands of cross references. Perseus used an automated process to find these citations and assign them to their most likely source. This process was by no means perfect and relied on a number of parameters and formatting patterns. There were many errors and omissions.

Work has been done since initial digitization on fixing the problems with cross references, but many problems remain.

Cross references take the format of a citation enclosed in a tag as follows:

<bibl n="Hom. Il. 5.130">Il. <biblScope>5.130</biblScope></bibl>

A <bibl> tag need not point to any citation, however. The following tag contains no “n” identifier:

<bibl><author>Hsch.</author></bibl>
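The two tag shapes above can be told apart by checking for the “n” attribute. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree; the function name citation_target is hypothetical, and the inputs are the two example tags quoted above rather than excerpts from the actual lexicon files.

```python
import xml.etree.ElementTree as ET


def citation_target(bibl_xml):
    """Return the "n" attribute of a <bibl> element, or None if it has none.

    citation_target is a hypothetical helper name used only for illustration.
    """
    element = ET.fromstring(bibl_xml)
    if element.tag != "bibl":
        raise ValueError("expected a <bibl> element")
    # Element.get returns None when the attribute is absent.
    return element.get("n")


# The two example tags quoted in the text above.
with_target = '<bibl n="Hom. Il. 5.130">Il. <biblScope>5.130</biblScope></bibl>'
without_target = '<bibl><author>Hsch.</author></bibl>'

print(citation_target(with_target))
print(citation_target(without_target))
```

A result of None marks a citation that names an author or work but does not resolve to any passage.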

In early versions of Perseus, the cross reference employed a standard scholarly abbreviation for each work. These abbreviations had to be 1) unique, 2) as short as possible while still being clear, and 3) as intuitive as possible. In practice, a work was assigned a Perseus abbreviation when it was added to the digital library.

For example, a tag may read, as above, <bibl n="Hom. Il. 5.130"> where Hom. = Homer, Il. = Iliad, and 5.130 = Book 5, Line 130. These standardized abbreviations followed common conventions found in reference works such as LSJ or the OLD, or were based on prevalence in other secondary sources, such as grammars or commentaries. These abbreviations are still used in Perseus 4.0 as noted here: http://www.perseus.tufts.edu/hopper/abbrevhelp.

As Perseus grew, the list of abbreviations also grew, and conflicts arose. Some works were not easily abbreviated. Some obvious abbreviations applied to multiple authors.

There was also the problem of assigning cross references to citations not yet identified or included in the Perseus abbreviation scheme, which only included a small subset of classical works. In order to include cross references to works not yet assigned a Perseus abbreviation, Perseus assigned a standard identifier (an ABO, abstract bibliographic object)[8] based on either the TLG, the PHI, or the Stoa. Thus, there are numerous cross references that look like this:

<bibl n="Perseus:abo:tlg,0003,001:1:20"><author>Th.</author> <biblScope>1.20</biblScope></bibl>

The author is Thucydides and the work is his Histories, Book 1, Chapter 20.
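As a rough illustration, an identifier of this shape can be split into its parts. The sketch below is in Python and assumes a field layout inferred only from the single example above: a “Perseus:abo:” prefix, then a comma-separated collection, author number, and work number, then colon-separated passage components. The names AboReference and parse_abo are hypothetical, and real files may contain identifiers that do not fit this pattern.

```python
from typing import NamedTuple, Tuple


class AboReference(NamedTuple):
    """Hypothetical container for the parts of an abo-style identifier."""
    collection: str           # e.g. "tlg"
    author_id: str            # kept as a string to preserve leading zeros
    work_id: str
    passage: Tuple[str, ...]  # e.g. ("1", "20") for Book 1, Chapter 20


def parse_abo(n_value):
    """Split an identifier such as "Perseus:abo:tlg,0003,001:1:20".

    The layout is an assumption based on the one example in the text.
    """
    parts = n_value.split(":")
    if len(parts) < 3 or parts[0] != "Perseus" or parts[1] != "abo":
        raise ValueError("not an abo identifier: %r" % n_value)
    fields = parts[2].split(",")
    if len(fields) != 3:
        raise ValueError("unexpected work field: %r" % parts[2])
    collection, author_id, work_id = fields
    return AboReference(collection, author_id, work_id, tuple(parts[3:]))


reference = parse_abo("Perseus:abo:tlg,0003,001:1:20")
print(reference.collection, reference.author_id, reference.work_id, reference.passage)
```

The numeric fields are left as strings so that leading zeros, which appear to be part of the identifier, are not lost.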

An effort to convert all of the cross references to this abo-based scheme was made, so this is the more common format in the LSJ. A mixture of cross reference formats remains, however, as do many missed and incorrect cross references.
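One way to gauge how mixed the formats are in a given file is to tally the <bibl> tags by the shape of their “n” attribute. The Python sketch below is illustrative only: the <entry> wrapper in the sample, the function names, and the three category labels are assumptions, and the actual lexicon files may use different surrounding markup or XML namespaces.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def reference_format(bibl):
    """Classify a <bibl> element by its "n" attribute (hypothetical labels)."""
    n_value = bibl.get("n")
    if n_value is None:
        return "none"
    if n_value.startswith("Perseus:abo:"):
        return "abo"
    return "abbreviation"


def count_formats(xml_text):
    """Count the cross reference formats among all <bibl> elements."""
    root = ET.fromstring(xml_text)
    return Counter(reference_format(bibl) for bibl in root.iter("bibl"))


# A made-up wrapper around the three example tags shown in this section.
sample = (
    "<entry>"
    '<bibl n="Hom. Il. 5.130">Il. <biblScope>5.130</biblScope></bibl>'
    '<bibl n="Perseus:abo:tlg,0003,001:1:20"><author>Th.</author> '
    "<biblScope>1.20</biblScope></bibl>"
    "<bibl><author>Hsch.</author></bibl>"
    "</entry>"
)

print(count_formats(sample))
```

A tally like this only describes the format of each reference; it cannot tell whether a reference was assigned to the correct author and work.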

In instances where numbers alone are used as the basis for the cross reference, there are cases where the correct antecedent has been missed and another author and work have been assigned in its place.

Entry identification:
In the conversion to electronic format, Perseus made several changes to the text. One such change, immediately obvious to anyone familiar with the print version of the LSJ, is that each lemma was given its own entry. In print, these are nested within larger paragraphs.

Below is the LSJ entry for λακίζω on page 1025:
[Image: the printed LSJ entry for λακίζω]

Perseus displays these as four separate entries: λακίζω, λακίς, λάκισμα, λακιστός. This follows the convention of earlier editions of the lexicon.

Problems remain with entries that were split incorrectly, that have a misidentified stem, or that were not split at all.


[1] NEH RT-21620-94.

[2] Crane, Gregory. “New Technologies for Reading: The Lexicon and the Digital Library.” Classical World 91, no. 6 (July-August 1998): 471-501, preprint. http://hdl.handle.net/10427/57004

[3] NEH ED 20458-96.

[5] ibid.

[6] As the source XML file(s) was not included in the original open source release, some users, particularly those using versions a few steps removed from the Perseus original, may not have been aware of the terms and restrictions of Perseus text downloads or known about the Creative Commons license. Downloads offered in the context of Perseus are available “with the additional restriction that you offer Perseus any modifications you make. Perseus provides credit for all accepted changes, storing new additions in a versioning system.” This is reiterated in the header of these files, which indicates, amongst other restrictions: “You offer Perseus any modifications you make.”

[8] For more on the ABO identifiers, see Smith, Mahoney and Rydberg-Cox, 2000: http://xml.coverpages.org//perseus-hopperExtreme2000.pdf

 

 
