Perseus Digital Library Updates » News and announcements from the Perseus Digital Library

Lucian: Updating Greek and adding English

Gregory Crane

Another update for our NEH-funded Next Thirty Years of Perseus work. We have now updated Lucian. First, we have fixed issues in the Greek for Lucian works 1-52 as edited by A. H. Harmon. These were originally entered years ago (c. 2010) with a version of Abbyy Finereader that only knew modern Greek. There were some residual OCR errors as well as incorrectly accented words (usually because we did not account for enclitics). We also added the textual notes. Until now there were two versions of this Greek, but they have been consolidated.

We have also added the corresponding English translations by Harmon. These will all appear in the next upload to the Scaife Viewer, from work 1 (Phalaris) through work 52 (Disowned/Abdicatus).

Translations for all of Lucian are ready to be added, with more than one translation for most of Lucian’s works soon to be available.

Lucian text files are available here.

To examine this work by work, you can use URLs of the form: https://github.com/PerseusDL/canonical-greekLit/tree/master/data/tlg0062/tlg001.

by Gregory Crane Posted on August 10, 2023

Posted in Release | Comments Off

New translations of Thucydides added

Gregory Crane

Under our new Perseus the Next Thirty Years NEH grant, we have added a set of new translations for Thucydides, including translations in English, French, German, Italian and Latin. These are now available on Github and (with the exception of two German translations of part of Thucydides) can now be viewed in the Scaife Viewer. The opening books of Thucydides in the Zévort translation were already available; we now have the complete translation.

Thucydides. Histoire de la Guerre du Péloponnèse. Zévort, Marie Charles, translator. Paris: Charpentier, 1852.
Thucydides. Histoire de la Guerre du Péloponnèse. Bétant, Élie-Ami, translator. Paris: Librairie de L. Hachette, 1863.
Thucydides with an English translation. Smith, Charles Foster, translator. London and Cambridge, MA: Heinemann and Harvard University Press, 1919-1923.
The history of the Peloponnesian War. Dale, Henry, translator. London: Heinemann and Henry G. Bohn, 1851.
Historia Belli Peloponnesiaci. Haase, Friedrich, translator. Paris: Firmin Didot, 1869.
Vier Staatsreden aus Thucydides in deutscher Übersetzung. Gürsching, Heinrich, translator. Augsburg: Wirth, 1856.
Die Rede des Perikles für die Gefallenen. Binding, Rudolf G., translator. Mainz-Kastel: Hanns Marxen, 1937.
Della storia di Tucidide volgarizzata libri otto. Translator not named. Florence: Tipografia Galileiana, 1835.

These are in addition to translations that have been available in Perseus for many years.

Thucydides. The English works of Thomas Hobbes of Malmesbury. Hobbes, Thomas, translator. London: John Bohn, 1843.
Thucydides. History of the Peloponnesian War. Crawley, Richard, translator. London and Toronto: J. M. Dent and Sons Ltd.; New York: E. P. Dutton and Co., 1914.
Thucydides. Geschichte des Peloponnesischen Kriegs. Wahrmund, Adolf, translator. Stuttgart: Krais and Hoffmann, 1864.
Thucydides. Geschichte des Peloponnesischen Kriegs. Braun, Theodor, translator. Leipzig: Insel-Verlag, 1917.

More Thucydides materials should appear in the coming months.

by Gregory Crane Posted on August 7, 2023

Posted in Release | Comments Off

Philo of Alexandria, Translations, and Perseus the Next Thirty Years

A lot is happening at Perseus. I am writing now to point out the first result from a new NEH grant that formally began one month ago (July 1, 2023). We have released a much revised Greek edition of Philo of Alexandria and a first English translation. The data is available in the First One Thousand Years of Greek Github repository and will find its way onto the Scaife Viewer in its next build.

First, the digital transcription of the Greek text of Philo (based on the Cohn/Wendland Teubner edition). We originally digitized this roughly 10 years ago with the first open source OCR software that we found could manage Ancient Greek. There were issues with this work and we did a major revision. The new files will surely have residual issues and we look forward to finding these, but they are a big improvement.

Second, we published the four-volume translation that Charles Duke Yonge produced for the Bohn Classical Library in 1855. These are based on the Greek editions that precede the monumental work of Cohn and Wendland. In some cases, Cohn and Wendland reorganized the text and I have adjusted our version of Yonge to follow those changes.

There is a lot more in the pipeline. Our new NEH grant allows us to focus on adding to translations available in Perseus and this is only a first step. Our work with the NEH-funded Beyond Translation project (not to mention a great deal of work on translation alignment by others) has also opened up new possibilities for connecting translation and source texts. These services will begin to appear during the course of the next year.

Gregory Crane

by Gregory Crane Posted on August 2, 2023

Posted in Release | Comments Off

NEH Grant: Perseus on the Web — preparing for the next thirty years

National Endowment for the Humanities grant to the Perseus Digital Library at Tufts University, April 18, 2023

I am writing to express my gratitude to the National Endowment for the Humanities for awarding us a new grant entitled “Perseus on the Web: Preparing for the Next Thirty Years.” We will receive just under $348,881 for this project, which is scheduled to run from July 2023 through June 2026.

Development for what would become Perseus began at Harvard in 1985, with our first grant support from an equipment grant provided by Xerox Corporation. David A. Smith created the first web version of Perseus at Tufts in 1995. This new NEH grant will be active, and most of our planned development will be completed by 2025, marking thirty years since the first web version of Perseus. Looking to the next thirty years is an ambitious goal, and advances in fields such as AI may lead us to move beyond systems like Perseus. Nevertheless, David Mimno designed the first version of Perseus on which most users depend, Perseus 4.0 (“the Perseus Hopper”), twenty years ago in 2003.

A month ago, I posted about the soft release of an initial version of what we are calling Perseus 6.0. That work will continue through August of this year. Our goal is to finally transition from Perseus 4.0, “the Perseus Hopper.” By the end of summer 2023, we hope that Perseus 6.x will include the remaining key features from Perseus 4.0 (such as support for commentaries and dictionaries) as well as the scalability of Perseus 5.0 (“the Scaife Viewer”) and the new capabilities introduced in Perseus 6.0 (“Beyond Translation”), such as treebanks, aligned translations, metrical visualizations, improved linking from commentaries and lexica, new geospatial visualizations, and integration with IIIF.

This new grant will enable us to build upon the new, more modern codebase and general architecture. We will have more to say about that in the coming months. The bottom line, though, is that as we finish up work on a production version of Perseus 6, we are already in a position to begin planning for Perseus 7 in 2025 or 2026.

by Gregory Crane Posted on April 18, 2023

Posted in Announcement, Grants, Research | Comments Off

Perseus 6.0: Beyond Translation — the first version of a next generation Perseus

Gregory Crane
March 15, 2023
Medford MA, USA

Five years after the March 15, 2018, announcement of the Scaife Viewer, we are announcing Beyond Translation, the first version of the sixth generation Perseus (Perseus 6.0). The current NEH-funded phase of work runs through August 2023. We have a great deal of content to add and much to do with every aspect of the system, but the basic features of Perseus 6 are now largely in place.

You can experiment with Beyond Translation directly but it is not yet as transparent as it will (hopefully) become as to what features are available and where those features are available. A more proper splash screen will appear this summer but, in the meantime, we have put up a first draft of information about the new features and how you can get at them. We expect that documentation to evolve as well.

Read a general introduction to Perseus 6.0 here.
See an overview of new features in a single document here.
See information about Perseus 6 as a series of separate documents here.

Primary funding for Beyond Translation has come from the NEH Office of Digital Humanities program (HAA-266462-19), with major support from the Mellon Foundation (1802-05569), the Center for Hellenic Studies, the Tufts Data Intensive Science Center, the Tufts Springboard program, Tufts Technology Services, and Tufts Arts and Sciences. We particularly thank our developer partners James Tauber and Jacob Wegner.

by Gregory Crane Posted on March 15, 2023

Posted in Features, Release, Technology | Tagged Beyond Translation, Perseus 6.0 | Comments Off

Spring 2023 Course on Natural Language Processing and the Human Record

Tufts University will introduce a new course in spring 2023: “Natural Language Processing and the Human Record.” Students at Boston College and Boston University can already cross-register to take this course for credit but, insofar as space allows, it will be open to others in person and to a wider potential audience participating online. This project-based course will not only provide opportunities for students of Greek and Latin, but also for students of other historical languages. It also addresses a major gap between the curricula to which most students of historical languages have access and the realities of doing research in a digital age.

When Princeton, for example, announced a tenure-track job at the rank of Assistant Professor in Ancient Mediterranean Languages and Cultures to begin in Fall 2023, it specifically asked for someone “who can help us expand and diversify our offerings, for example by adding a language to those we already teach, and/or using digital methods and resources, and/or harnessing the insights of linguistics to illuminate broader cultural issues in the study of ancient Greece, Rome, and related ancient and later cultures.”

Language technologies allow students of the Greco-Roman world to address all three of the intellectual goals that this job posting requests. In particular, students of Ancient Greek and Latin who take advantage of contemporary digital methods will be positioned to work with a variety of languages. The figure below illustrates how we can, for example, now offer dense linguistic annotations for a growing number of sources in historical (and contemporary) languages.

Screenshot from the NEH-funded Beyond Translation Project that is building a next generation reading environment for the Perseus Digital Library.

A reading environment such as the one above depends upon a hybrid environment that integrates not only automated analysis, based on both machine learning and traditional procedural programming, but also contributions by human experts on the particular source. Knowledge of the language (whether in the form of annotated training data or heuristic rules) provides the starting point for computation and the computation can improve based on expert feedback. We need participants with strengths on both the computational and the content sides (and ideally some participants who can contribute to both sides of this process).

Few programs (if any) are, however, designed to provide students of Greco-Roman culture with the skills that they need to apply digital methods. Those students who do acquire such skills often do so as undergraduates in Computer Science, working in a job involving computation before they begin graduate school, or as something they pick up on the side during graduate school.

In the Tufts Department of Classical Studies, we will be teaching a new course with the title “Natural Language Processing and the Human Record.” It will be taught for the first time in spring 2023 as CLS 191 (and will, in subsequent years, appear with its own number as CLS 162).

This class will be taught on Monday nights from 6:00-8:30 pm so that people who are not regularly on the Tufts campus (such as in-service teachers and students from other local institutions) would be able to participate in person. An existing agreement would allow students from Boston University and Boston College to cross-register, but, if space allows, others are welcome to join. The course is officially listed as in person but we would work to make it accessible remotely as well.

Any intellectually determined student could profitably take this class. First, those who wish to focus on the computational side would need to acquire core skills in programming with Python and in working with Jupyter Notebooks but that is certainly doable, given the variety of online tutorials available, between the end of class in fall 2022 and the beginning of this class. The work required would be non-trivial but effective computational scholarship has long required, and probably long will require, a great deal of informal, on-going, self-directed study.

Second, although ability to apply a range of Python based libraries will always be a big help and extend the intellectual range, students with an interest in how to organize philological data could take this class and focus on topics such as the application and assessment of methods such as morpho-syntactic analysis (particularly the Universal Dependencies Framework), co-reference resolution, translation alignment, named entity recognition and linking, topic modeling, sentiment analysis, and ontology development. Technical terms such as these may be unfamiliar to most practitioners (and do evoke blank stares from most senior scholars), but they represent foundational new building blocks upon which the study of historical languages must be based in a digital age. Early career researchers, as well as students who wish to use their study of the past to prepare them to flourish in the modern world, face very different challenges and opportunities than those who were fashioned by the limitations and assumptions of late print culture.

The course description specifically mentions Ancient Greek and Latin because those are two languages that we know we can support but students are welcome to focus on any language where we have independent expertise to help guide and evaluate their work.

Strong contributions from the course will have an opportunity to be published, with credits for individual authors and/or each member of the team, both in the Tufts Dataverse and (as appropriate) in the new Perseus.

CLS 191 01 Seminar on Current Topics in Digital Humanities: Natural Language Processing and the Human Record
Cross listed with GRK 191 01 and LAT 191 01
G. Crane
3 SHU
In Person (although we will make remote participation possible)
Mondays, 6:00 – 8:30 pm (local time in Boston, MA, USA)

This class explores the application of natural language processing to the study of the human record and serves two complementary audiences. First, students who are familiar with, or able quickly to develop familiarity with, Python and related technologies can use these skills to develop course projects. Second, students who do not yet have this technical background but who wish to focus on how to publish born-digital versions of historical sources can take this class to develop new ways of reading, organizing and analyzing texts. Students of Greek or Latin who wish to focus on the language can take this class as GRK 191 or LAT 191. Students who wish to focus on another language (including sources in English) are welcome but should consult the instructor. We will cover recent publications and examine current applications that define the state of the art in digital humanities and digital publication. While students will be able to work on their own, we will particularly support the development of collaborative projects in which students with complementary skill sets work together.

Recommendations: CLS 162 – CLS 161, CS 10, CS 11;

For those enrolling in GRK 191 or LAT 191: three or more semesters of study recommended.

by Gregory Crane Posted on October 31, 2022

Posted in Course(s) | Tagged new course, NLP, Tufts | Comments Off

Attested Repetition in Homeric Epic

Gregory Crane
May 17, 2022

Figure 1: lines of Homeric epic that share vocabulary with the opening lines of the *Iliad*.

This paper announces the creation of a version of the Homeric Iliad and Odyssey that links each line of each poem with those other lines in the Iliad and the Odyssey that share the most significant vocabulary. Each line has at least one parallel. The line with the most parallels (Od. 2.569) has 227 parallels but that is exceptional. The average line has 24.4 parallels. Forty-eight files, one for each book in the Iliad and Odyssey, are available on GitHub and I expect to add them to other repositories in the future. This paper describes how similarity is calculated.

I had created this dataset for the sake of curiosity and I had not intended to publish it. I am publishing this now because, every time I wish to read a passage in Homer carefully, I find myself consulting this hypertextual version of the poems. I constantly see new connections or simply marvel at the flexibility of Homeric formulaic composition as it reveals itself. I expect that most readers may use this to explore, according to various different approaches, Homeric formulaic composition. The algorithm for ranking lines, known by the abbreviation tf/idf, is (as I will explain) both venerable (not considered a good thing in computer science) and simple.

There are other ways by which to find patterns of traditional composition in Homeric poetry. Each method is quite tractable, given the openly licensed data that we have at our disposal about Homeric epic.

Word embeddings: Word embeddings are a relatively recent technique, made possible by advances in machine learning that have been, in turn, made possible by advances in computing hardware. Essentially, word embeddings use the contexts in which particular words appear to identify related terms (e.g., “rock” and “stone” are similar in meaning). The most recent strategies can capture word senses (e.g., distinguish contexts where the English term “bank” designates a financial institution vs. the side of a river). Word embeddings work best when we have much larger collections than Homeric epic but we could probably make up for lack of data by using the Perseus Treebanks to show which words depend on which and in what function. Embeddings may allow us to measure the degree to which different words are actually synonymous and only appear as different options because they allow the tradition to express an identical idea in different metrical slots of the hexameter.
Collocations: We can measure the extent to which the co-occurrence of two or more words exceeds random chance. This method can allow us to identify significant phrases (e.g., “Peleides Achilles”) even when those phrases are composed of very common words.
Metrical position: We can examine the tendencies of particular words to appear in one or more metrical slots in the line (as has Sansom 2021). If a word is not only in two different lines but in the same metrical position, we can increase the weight that we assign to that shared similarity.
The first and second halves of lines: Repeated phrases tend to cluster before or after the caesura (the main break that divides virtually every Homeric hexameter into two parts) and it would make sense to compare the first and second halves of lines. Anyone looking at this dataset will see, over and over, cases where two lines with relatively modest similarity scores have identical first or second halves (especially when the shared words are themselves common individually).
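To make the collocation idea in the list above concrete, here is a minimal sketch using pointwise mutual information (PMI), one common way of asking whether two words co-occur more often than chance would predict. The toy lines and the function name `pmi_table` are invented for illustration; the dataset described here was not produced this way.

```python
import math
from collections import Counter
from itertools import combinations

# Toy data (hypothetical): each line is the set of lexemes it contains.
toy_lines = [
    {"Peleides", "Achilleus"},
    {"Peleides", "Achilleus", "de"},
    {"de", "kai"},
    {"de", "Hektor"},
]

def pmi_table(lines):
    """Pointwise mutual information for every lexeme pair that shares a line."""
    n = len(lines)
    single = Counter(lex for line in lines for lex in line)
    pairs = Counter(
        pair for line in lines for pair in combinations(sorted(line), 2)
    )
    table = {}
    for (x, y), joint in pairs.items():
        # log2 of the observed co-occurrence rate over the rate expected by chance
        expected = (single[x] / n) * (single[y] / n)
        table[(x, y)] = math.log2((joint / n) / expected)
    return table

scores = pmi_table(toy_lines)
print(scores[("Achilleus", "Peleides")])  # positive: together more often than chance
print(scores[("Achilleus", "de")])        # negative: together less often than chance
```

A real run would use the lexemes from the treebank rather than toy sets, and would usually require a minimum pair count so that a single chance co-occurrence of two rare words does not dominate the ranking.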

Each of the techniques above could retrieve more clearly formulaic expressions. In the terms of information retrieval, this would improve the precision, i.e., reducing false positives but at the expense of missing some valid results. Using tf/idf to compare whole lines sacrifices precision to find potentially interesting collocations, turning up related lines that more demanding algorithms would miss. The more general method also reveals the fuzzy line between formulaic composition and the semantics of language.

Consider, for example, the lines most similar to Od. 5.409:

Od. 5.409	Ζεύς , καὶ δὴ τόδε λαῖτμα διατμήξας ἐπέρησα ,
	… and indeed, I have cut my way through and crossed this gulf,
Od. 7.276	νηχόμενος τόδε λαῖτμα διέτμαγον , ὄφρα με γαίῃ
	by swimming I cut my way through this gulf …
Od. 5.174	ἥ με κέλεαι σχεδίῃ περάαν μέγα λαῖτμα θαλάσσης ,
	[you, Calypso] who urge me to cross the great gulf of the sea.

The variation τόδε λαῖτμα διατμήξας/διέτμαγον, of course, illustrates a single formula that starts in the same slot but varies in the number of subsequent slots it occupies (˘˘| ˉ˘˘ |ˉˉ|ˉ vs. ˘˘ |ˉ˘˘ |ˉ˘˘). What particularly caught my eye was the fact that linking these three lines immediately answered a question that I had in my own mind as a reader. What verb governs λαῖτμα? The translation above (adapted from A. T. Murray’s Loeb edition) takes the noun with both verbs. The two linked lines immediately support this decision. The comparanda are there in front of me as I read. The form διατμήξας may well most directly govern the noun λαῖτμα, but surely the Greek audience would understand that both verbs governed λαῖτμα.

If I want to understand the semantics of the verb peraô, “to cross,” I need to consider the fact that this verb governs the noun laitma, “gulf,” not only in Od. 5.174 but also, at least indirectly, at Od. 5.409. As a side note, this also highlights a fundamental weakness of the Perseus Treebanks and, indeed, of the use of trees to represent syntax. Trees are a class of graphs in which each node (other than the root) has one, and only one, parent. We cannot have a single word depend on two other words if we use a tree. We can, of course, augment what we store with the tree by adding another layer of annotation easily enough; anyone working with linguistic annotation will have multiple layers of annotation. But the fact is that none of the Greek and Latin Treebanks with which I am familiar have added such a layer. On the other hand, our print reference works are often no better: Cunliffe’s Homeric Lexicon notes that peraô governs laitma in Od. 5.174 but does not mention its connection to laitma in Od. 5.409.

It would be easy, and ultimately will be, to build an exploratory environment in which readers can modify the parameters to suit their needs and choose among the various linking techniques that I listed above. Such interactive environments are, however, difficult to maintain over time: there is, in fact, no way to be confident that any piece of code will function decades or generations from now. Modern scholarship on Homeric poetry goes back at least to Friedrich Wolf’s 1795 Prolegomena ad Homerum. In my own case, I have (to take three examples that happen to come to mind) recently looked at Homeric scholarship published in 1872 (Düntzer), 1928 (Parry), and 1968 (Hainsworth). Even those three works represent a span of nearly one century and extend back more than half a century from the present. If I am consulting scholarship published a century and a half ago, what can I do to maximize the chance that my contributions would be useful in a century and a half?

It may well be more likely than not that the field changes so much that contemporary articles and monographs play little role in the future study of Homeric epic. Nevertheless, we still participate in a conversation that extends over decades, generations and centuries. We should, I believe, do what we can to maximize the survivability of what we do. To do so, I have chosen the following:

Open license: By publishing under an open license, I make it possible for others to make and preserve copies of what I produce. The more use this work finds, the more copies will be made and the better the chances that at least one will survive.
Publication in long term repositories. The most widely used repository for Digital Philology is surely GitHub but GitHub is a corporate project and no one can be sure of its future (although it is likely that a successor platform would take on content in the future if GitHub falters). We also deposit the results in the Tufts University Digital Collections and Archive, which is tasked with long term preservation, and the European Zenodo archive.
Very basic formatting. The tables, pictures, and paragraphs in this description should be supportable for the foreseeable future.
A tab-delimited file as data format. The core results of this work are published in tab-delimited files that should be easily understood. The technical terms tf/idf and edit distance (described below) are not currently well understood among students of Greek and Latin but they are very basic concepts from computer science and readers should be able to learn what they mean even without the narrative explanation offered here. Otherwise, the data includes citations to, and accompanying lines from, the Homeric epics with a list of dictionary entries (lexemes) that each line pair share.

The work described here depends upon two data sources. First, machine-readable versions of the Iliad and Odyssey as edited by T. W. Allen more than a century ago (Allen 1920) provide the starting point for this work. Second, the Perseus Dependency Treebank Project used this text to create a linguistically annotated edition of the two epics, which includes the part of speech, syntactic function, and lexeme (the dictionary form) for each inflected form in the two epics. In this analysis, we want to capture different inflected forms of the same word because Homeric formulaic composition will change the form to suit the syntactic needs of the moment. We look for shared lexemes in different lines. The more shared lexemes, the higher the similarity score for any two lines. The tf/idf metric allows us to dampen the weight given to more common lexemes and emphasize less common lexemes.

Anyone with a list of mappings from the inflected forms in Homeric poetry to lexemes could apply this method. Results will differ, though probably not in ways that are significant for most readers of Homeric epic. Different lexica differ slightly in how they map forms to lexemes: some lump very different word senses under a single lexeme, others split a form into two different lexemes. Some editions choose different readings. In an ideal world, we would include variant readings and/or different critical editions. This would be particularly useful for those who are interested in different variants and see the Homeric epics as dynamic parts of a living tradition (multitexts, on which, see the Homer Multitext Project). Results would thus vary based on different schemes of lemmatization and different editions. Those studying Homeric epic as a multitext should create a version of this work that includes a range of different readings. The results published with only a single version of the text, however, already demonstrate the flexibility of Homeric formulaic composition.

After assigning a score to the shared vocabulary in each line of the Iliad and the Odyssey, the algorithm sorts the most highly scored lines first and then prints out the results until the similarity score drops below a preset cutoff point. Each output line contains the following eight tab-delimited fields.

index	1
base cit.	1:01.001
base line	μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος
tf/idf	10.4
edit distance	61
comp. citation	1:01.322
comp. line	ἔρχεσθον κλισίην Πηληϊάδεω Ἀχιλῆος ·
lexemes	{‘Ἀχιλλεύς’: [382, 4.287,1], ‘Πηλείδης’: [59, 6.155,1]}

The eight fields above consist of the following:

index: Each line of the Iliad and Odyssey is paired with at least one other line based on shared vocabulary.

base citation: An abbreviated version of the citation of the current line that is being compared to every other line in the Iliad and Odyssey. Citations that begin “12-1:” are from the Iliad and “12-2:” from the Odyssey (this follows the convention that the original Thesaurus Linguae Graecae (TLG) established giving Homer the serial number 12 and the Iliad and Odyssey 1 and 2 respectively). Book and line numbers are padded with extra zeroes so that users can easily sort them in a spreadsheet. Otherwise, we would jump from “12-1:1.1” to “12-1:10.1,” with book 2 only showing up after we had completed book 19. Thus, we use “12-1:01.001,” rather than “12-1:1.1”.
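The effect of zero-padding on sort order can be shown in a few lines. This is only an illustrative sketch: the helper name `pad_citation` and the widths (two digits for the book, three for the line) are assumptions read off the example “12-1:01.001”, and the function does not handle line numbers that carry a letter suffix.

```python
def pad_citation(citation, book_width=2, line_width=3):
    """Turn '12-1:1.1' into '12-1:01.001' so string sorting matches reading order."""
    work, _, passage = citation.partition(":")
    book, _, line = passage.partition(".")
    return f"{work}:{int(book):0{book_width}d}.{int(line):0{line_width}d}"

raw = ["12-1:10.1", "12-1:2.1", "12-1:1.1"]
print(sorted(raw))                            # book 10 lands before book 2
print(sorted(pad_citation(c) for c in raw))   # books now sort as 01, 02, 10
```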

base line: the text of the line that the system is comparing to all other lines of the Iliad and the Odyssey.

tf/idf: This figure, rounded to one decimal place, provides a score for the similarity between the base line and the line to which it is compared. More extensive discussion follows in a separate section. The key point is that tf/idf provides a way to give more weight to less common words and less to more common words: we do not want to give kai, “and,” the same weight as less common terms when we compare the similarity of two lines of Homer.

Edit Distance: This figure quantifies the similarity of two lines by measuring how many operations it would take to convert one line into the other. We use a standard Python library to compute edit distance, and the score varies from 0 (for completely different strings) to 100 (for strings that are identical). Very high values for edit distance can pick up small editorial changes that might otherwise not be noticed. Thus, the edit distance between the following lines is 99, rather than 100:

Il. 1.361	χειρί τέ μιν κατέρεξεν ἔπος τ’ ἔφατ’ ἔκ τ’ ὀνόμαζε·
Od. 5.181	χειρί τέ μιν κατέρεξεν ἔπος τ’ ἔφατ’ ἔκ τ’ ὀνόμαζεν·

In his edition of the Iliad (3rd edition 1920), T. W. Allen used the simple form onomaze, “she called him by name.” In his edition of the Odyssey (2nd edition, 1917), Allen chose to add the “nu-movable,” an extra consonant that can be added to the end of some forms to keep a short vowel from being elided before another vowel. The “nu-movable” has no effect on the meaning and is irrelevant at the end of the line. The difference is small but could matter if researchers were studying linguistic usage. In this particular detail, the two editions are simply inconsistent. I only noticed the difference because the edit distance was 99 rather than 100. There will surely be other places where a score just below 100 will reveal more significant changes.
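A 0-100 similarity score of this kind can be sketched with Python's standard library. The text does not say which library produced the published scores, so `difflib.SequenceMatcher` is used here purely as a stand-in, and the helper name `similarity_0_100` is invented. Under this stand-in measure the two lines quoted above also score 99.

```python
from difflib import SequenceMatcher

def similarity_0_100(a, b):
    """Similarity on a 0-100 scale; 100 means the two strings are identical."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

il_1_361 = "χειρί τέ μιν κατέρεξεν ἔπος τ’ ἔφατ’ ἔκ τ’ ὀνόμαζε·"
od_5_181 = "χειρί τέ μιν κατέρεξεν ἔπος τ’ ἔφατ’ ἔκ τ’ ὀνόμαζεν·"

print(similarity_0_100(il_1_361, il_1_361))  # identical strings
print(similarity_0_100(il_1_361, od_5_181))  # one extra letter (the nu-movable)
```

Other libraries normalize edit distance differently, so scores for less similar pairs (such as the 61 in the sample row) should not be expected to match this sketch.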

Comparison citation and comparison line: These designate the citation and text of the line that is being compared to the base text line.

Lexemes: This lists shared dictionary words (lexemes) and provides information about their relative weight. Achilles, for example, shows up in 382 lines of Homeric Epic and its tf/idf weight is 4.28, while Peleides, which appears in 59 separate lines, has a tf/idf weight of 6.15, a higher value than that assigned to Achilles because it is substantially less common. Both Achilles and Peleides appear once in both lines.
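For readers who want to work with the files directly, the following sketch splits the sample row shown above into its eight fields and checks that the tf/idf score is the sum of the shared lexeme weights. The field names and the regular expression are assumptions based only on that one sample row (including its curly quotation marks), and reading the third number in each lexeme entry as the count within the line is likewise an inference. The published files may differ in detail.

```python
import re

FIELDS = ["index", "base_cit", "base_line", "tfidf",
          "edit_distance", "comp_cit", "comp_line", "lexemes"]

# The sample row from the table above, joined with tabs.
row = "\t".join([
    "1", "1:01.001", "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος", "10.4", "61",
    "1:01.322", "ἔρχεσθον κλισίην Πηληϊάδεω Ἀχιλῆος ·",
    "{‘Ἀχιλλεύς’: [382, 4.287,1], ‘Πηλείδης’: [59, 6.155,1]}",
])

record = dict(zip(FIELDS, row.split("\t")))

# Assumed entry layout: ‘lemma’: [lines containing it, weight, count in line]
entry = re.compile(r"‘([^’]+)’:\s*\[(\d+),\s*([\d.]+),\s*(\d+)\]")
weights = {lemma: float(w) for lemma, _, w, _ in entry.findall(record["lexemes"])}

total = sum(weights.values())
print(weights)
print(round(total, 1), record["tfidf"])  # summed weights next to the stored score
```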

Deciding which lines to link: tf/idf

The tf/idf metric is, by computational standards, a very old method: the British computer scientist Karen Spärck Jones (1935-2007) published the algorithm in 1972, fifty years ago as I write this. Nevertheless, this measure remains widely used and provides a simple mechanism by which to search for a query in a collection of documents.

In the case of the work described here, an algorithm uses the normalized dictionary forms for each word in each line of the Iliad and the Odyssey as a query, follows the tf/idf metric to determine the similarity of this query line against every other line in the two poems, and then returns a list with the most highly scored lines at the top. If our query contains a word such as de, the word that shows up in the most lines of Homeric Epic (12,138), we do not want to return all the other 12,137 lines. So we assign a cutoff score (rather arbitrarily set at 10 for now). Every time we return a line, we check to see if we have dropped below the cutoff score. If so, we end the list. We return one entry before checking the cutoff score because we want to have at least one example for each line in the Iliad and the Odyssey.
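The ranking and cutoff logic just described can be sketched as follows. This is one reading of the procedure, not the code that produced the dataset: the toy line index, the transliterated lexeme names, and the function name `parallels` are invented, while the four weights are those quoted in this article.

```python
# Hypothetical toy index: line id -> set of lexemes in that line.
toy_lines = {
    "q": {"menis", "Peleides", "Achilleus"},
    "a": {"Peleides", "Achilleus"},
    "b": {"Achilleus", "de"},
    "c": {"de"},
}
# Weights quoted in the article for these four lexemes.
weights = {"de": 0.92, "Achilleus": 4.28, "Peleides": 6.15, "menis": 7.45}

def parallels(query_id, lines, weights, cutoff=10.0):
    """Rank every other line by the summed weight of shared lexemes, best first."""
    query = lines[query_id]
    scored = sorted(
        ((sum(weights[lex] for lex in query & lexemes), line_id)
         for line_id, lexemes in lines.items() if line_id != query_id),
        key=lambda pair: (-pair[0], pair[1]),
    )
    results = []
    for score, line_id in scored:
        # Always keep the top entry; after that, stop once below the cutoff.
        if results and score < cutoff:
            break
        results.append((line_id, round(score, 2)))
    return results

print(parallels("q", toy_lines, weights))  # only the line sharing both names
print(parallels("c", toy_lines, weights))  # one entry, although below the cutoff
```

Comparing every line with every other line is quadratic in the number of lines, which is manageable for two poems; a larger corpus would call for an inverted index from lexeme to lines.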

The lines that have the least similarity to the rest of Homeric Epic both appear in the catalogue of ships:

Il. 2.561	Τροιζῆν’ Ἠϊόνας τε καὶ ἀμπελόεντ’ Ἐπίδαυρον
	Troezen and Eïonae and vine-clad Epidaurus,
Il. 2.656	Λίνδον Ἰηλυσόν τε καὶ ἀργινόεντα Κάμειρον,
	Lindos and Ialysus and Cameirus, white with chalk.

The only words that these two lines share with any other lines in the Iliad and Odyssey are the connectors te and kai, which together are conventionally translated “both and.” These lines are both equally similar to every other line in the Iliad and Odyssey that has this common combination.

The two lines above also demonstrate another point: different methods detect different kinds of patterns. Readers familiar with Homeric formulaic composition will, of course, immediately see that the two lines above are very similar in form and clearly share much the same formulaic pattern. Each line contains, besides the connectors, only proper names and none of the proper names (surprisingly) shows up elsewhere in the epic poems. The two lines least similar to the rest of the epic poems when that similarity is measured by vocabulary are almost identical in terms of formulaic structure. The Perseus Treebanks do allow us to search for similarity based on the parts of speech or the syntactic labels and dependencies in each line. The work described here focuses strictly on vocabulary.

The expression tf/idf contains two acronyms: term frequency (tf) and inverse document frequency (idf). The measure assumes that we have a collection broken up into documents. Documents can, in turn, be journal articles, news stories, or chapters of a book. But documents can also be individual sentences — or, as in the case of the work described here, individual lines of poetry.
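As a rough illustration of the idf half of the measure, the sketch below treats each line of verse as a document and takes the natural logarithm of the total number of lines divided by the number of lines that contain a lexeme. This plain formula is an assumption; the exact variant behind the published weights is not spelled out here. Using the 27,793 lines of the Allen edition in the Perseus Treebanks, it gives values within about 0.01 of the weights of 4.28 and 6.15 cited above for Achilles (382 lines) and Peleides (59 lines), although it does not reproduce every published weight exactly.

```python
import math

TOTAL_LINES = 27_793  # lines in the Allen edition used by the Perseus Treebanks


def idf(lines_with_lexeme, total_lines=TOTAL_LINES):
    """Inverse document frequency with each line of verse as a document.

    Assumption: plain natural-log idf, ln(N / df). The published weights
    may have been computed with a slightly different variant.
    """
    if lines_with_lexeme <= 0:
        raise ValueError("a lexeme must occur in at least one line")
    return math.log(total_lines / lines_with_lexeme)


# The rarer the lexeme, the higher its weight.
for name, df in [("Achilles", 382), ("Peleides", 59)]:
    print(f"{name}: appears in {df} lines, idf = {idf(df):.3f}")
```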

When we compare two lines, we want to give more weight to a word such as mênis (which is conventionally translated as “wrath” or “anger” and which appears in 16 lines of Homeric poetry) than we do to the connector de (which is roughly equivalent to “and” and shows up in 12,138 lines), while occurrences of the name Achilles (which shows up in 382 lines) should be somewhere in between. The tf/idf metric yields weights of 0.92, 4.28, and 7.45 for de, Achilles, and mênis respectively. We calculate the similarity of two lines by adding the tf/idf values for each shared dictionary entry. Consider the opening line of the Odyssey and its closest match, which comes from the Catalogue of Ships in the Iliad.

Od. 1.1	ἄνδρα μοι ἔννεπε , μοῦσα , πολύτροπον , ὃς μάλα πολλὰ
	Tell me, Muse, of the man of many turns, who [wandered] much …
Il. 2.761	τίς τὰρ τῶν ὄχ’ ἄριστος ἔην σύ μοι ἔννεπε Μοῦσα
	Who was the best of these? Tell me, Muse.

These lines share the following lexemes:

{‘ἐγώ’: [2871, 2.316, 1], ‘ἐνέπω’: [38, 6.595, 1], ‘Μοῦσα’: [16, 7.46, 1]}

Each word is accompanied by the number of different lines of the Iliad and the Odyssey in which it occurs, the corresponding tf/idf score based on that frequency, and then the number of times it appears in both lines (i.e., if it occurs 2x in one line and 3x in the other, we count it as 2x). If we add the tf/idf scores for egô (“I”), enepô (“to utter”) and Mousa (“Muse”), we get a score of 16.3 (rounded to one digit).
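The arithmetic can be checked with a few lines of Python. The sketch assumes that each line is available as a list of dictionary forms and that the weights sit in a plain dict. The function name line_score is illustrative, and the two lemma lists were typed in by hand for this example rather than taken from the treebank. The three weights are copied from the listing above.

```python
from collections import Counter

EGO, ENEPO, MOUSA = "ἐγώ", "ἐνέπω", "Μοῦσα"

# Weights copied from the listing above.
WEIGHTS = {EGO: 2.316, ENEPO: 6.595, MOUSA: 7.46}

# Hand-lemmatized for illustration only; not taken from the treebank.
OD_1_1 = ["ἀνήρ", EGO, ENEPO, MOUSA, "πολύτροπος", "ὅς", "μάλα", "πολύς"]
IL_2_761 = ["τίς", "τάρ", "ὁ", "ὄχα", "ἄριστος", "εἰμί", "σύ", EGO, ENEPO, MOUSA]


def line_score(lemmas_a, lemmas_b, weights):
    """Add up the weights of the lexemes that two lines share.

    Counter & Counter keeps the smaller count for each lexeme, so a word
    found 2x in one line and 3x in the other is counted 2x.
    """
    shared = Counter(lemmas_a) & Counter(lemmas_b)
    return sum(weights.get(lexeme, 0.0) * n for lexeme, n in shared.items())


score = line_score(OD_1_1, IL_2_761, WEIGHTS)
print(f"{score:.3f}")
```

The three listed weights sum to 16.371, which suggests that the one-digit figure quoted above was truncated rather than rounded.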

Traditionally, tf/idf is applied to collections where the documents are long enough that the word frequency within documents matters. Consider a collection of 100 newspaper articles in which two words appear 20 times. The first of these words may be a typical, medium frequency word such as “president” that could apply to a number of figures while the other may be more tightly bound to a particular topic (“Macron”) and cluster in two or three documents. The tf/idf measure is designed to capture the fact that, in such a case, “Macron” tells us more about a document than does “president,” even though each appears the same number of times in the corpus.

Lines of Homer are much shorter than newspaper articles. At one extreme, 4 lines contain only 3 words:

Il. 2.706	αὐτοκασίγνητος μεγαθύμου Πρωτεσιλάου
Il. 11.427	αὐτοκασίγνητον εὐηφενέος Σώκοιο
Od. 10.137	αὐτοκασιγνήτη ὀλοόφρονος Αἰήταο·
Il. 15.678	κολλητὸν βλήτροισι δυωκαιεικοσίπηχυ

Three begin with forms of αὐτοκασίγνητος. The fourth, Il. 15.678, contains a single token that combines four distinct words (δυωκαιεικοσίπηχυ, including kai, “and”).

On the long end, the treebank counts 13 tokens in only two lines, and in each case that count is two words higher than most would list because, in adding syntactic annotation, we split οὔτε into οὔ τε.

Od. 2.127	ἡμεῖς δʼ οὔ τʼ ἐπὶ ἔργα πάρος γʼ ἴμεν οὔ τε πῃ ἄλλῃ,
Od. 18.288	ἡμεῖς δʼ οὔ τʼ ἐπὶ ἔργα πάρος γʼ ἴμεν οὔ τε πῃ ἄλλῃ,

Three lines are listed as being 14 words long. If we combined οὔ τε, the first of these, which contains the pair twice, would be 12 words long.

Il. 20.205	ὄψει δʼ οὔ τʼ ἄρ πω σὺ ἐμοὺς ἴδες οὔ τʼ ἄρʼ ἐγὼ σούς.
Od. 17.466	ἂψ δʼ ὅ γʼ ἐπʼ οὐδὸν ἰὼν κατʼ ἄρʼ ἕζετο, κὰδ δʼ ἄρα πήρην
Od. 18.110	ἂψ δʼ ὅ γʼ ἐπʼ οὐδὸν ἰὼν κατʼ ἄρʼ ἕζετο· τοὶ δʼ ἴσαν εἴσω

The other two lines are both variations on a formula that begins ἂψ δʼ ὅ γʼ ἐπʼ οὐδὸν ἰὼν κατʼ ἄρʼ ἕζετο. The second occurs relatively quickly after the first (a gap of 250 lines, < 1% of the 27,000 lines of Homeric poetry).

Three quarters of all lines are 6, 7, or 8 words long (22%, 30%, and 23% respectively). Almost 99% are 4 to 10 words long.

words/line	lines	%	tot %
3	4	0.01	0.01
4	430	1.55	1.56
5	1996	7.18	8.74
6	6042	21.74	30.48
7	8331	29.98	60.46
8	6454	23.22	83.68
9	3112	11.20	94.88
10	1111	4.00	98.88
11	264	0.95	99.83
12	43	0.15	99.98
13	2	0.01	99.99
14	3	0.01	100.00
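The percentage and cumulative-percentage columns can be recomputed from the raw counts. The following sketch simply hard-codes the counts from the table above; the variable names are illustrative. Note that the counts as listed sum to 27,792.

```python
# words per line -> number of lines, copied from the table above
LINE_COUNTS = {
    3: 4, 4: 430, 5: 1996, 6: 6042, 7: 8331, 8: 6454,
    9: 3112, 10: 1111, 11: 264, 12: 43, 13: 2, 14: 3,
}

total = sum(LINE_COUNTS.values())

rows = []
running = 0
for n_words in sorted(LINE_COUNTS):
    count = LINE_COUNTS[n_words]
    running += count  # cumulative number of lines so far
    rows.append((n_words, count, 100 * count / total, 100 * running / total))

print("words/line\tlines\t%\ttot %")
for n_words, count, pct, cum_pct in rows:
    print(f"{n_words}\t{count}\t{pct:.2f}\t{cum_pct:.2f}")

# Share of lines that are 6, 7, or 8 words long
core_share = 100 * sum(LINE_COUNTS[n] for n in (6, 7, 8)) / total
print(f"6-8 words: {core_share:.1f}%")
```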

The tight clustering of line lengths stands out from the bar chart.

Figure SLEN: There are lines with 3, 12, 13, and 14 words.

Lines with 3, 12, 13 and 14 words are so few in number that they are not even visible in the figure above.

Such lines obviously contain many fewer words than documents such as even short news articles and most words will only occur once in a line. Nevertheless, of the 27,793 lines in the Allen edition of the Homeric Epics used by the Perseus Treebanks, 3,989 (14%) have two or more instances of the same lexeme. The majority of these lexemes are, however, very high frequency words.

	lexeme	in lines 2+x	in lines 1+x
1	δέ	1055	12138
2	τε	907	4322
3	ὁ	308	5870
4	καί	262	5283
5	οὐ	246	2695
6	ἤ	139	729
7	ἐγώ	123	2871
8	σύ	99	1900
9	εἰμί	49	2118
10	ἄν	43	1449
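Counts of this kind are straightforward to produce once each line has been reduced to its dictionary forms. The sketch below is illustrative: the function name, the input format (a dict from citation to a list of lemmas), and the three made-up sample lines are assumptions, not the treebank's own data structures.

```python
from collections import Counter


def repetition_table(lines):
    """For each lexeme, count the lines containing it 2+ times and 1+ times.

    `lines` maps a citation to the list of lemmas in that line.
    Returns {lexeme: (lines_with_2_or_more, lines_with_1_or_more)}.
    """
    twice = Counter()
    once = Counter()
    for lemmas in lines.values():
        for lexeme, n in Counter(lemmas).items():
            once[lexeme] += 1
            if n >= 2:
                twice[lexeme] += 1
    return {lexeme: (twice[lexeme], once[lexeme]) for lexeme in once}


# Made-up lemma lists, for illustration only.
DE, DEKA, MEN, KAI = "δέ", "δέκα", "μέν", "καί"
sample = {
    "line 1": [DEKA, MEN, DEKA, DE],
    "line 2": [DE, KAI, DE],
    "line 3": [KAI, MEN],
}

table = repetition_table(sample)

# Number of lines with two or more instances of the same lexeme
lines_with_repeats = sum(
    1 for lemmas in sample.values() if max(Counter(lemmas).values()) >= 2
)

for lexeme, (two_plus, one_plus) in table.items():
    print(f"{lexeme}\t{two_plus}\t{one_plus}")
```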

The three connectors de (which is the neutral particle indicating a new sentence), te, and kai (which are often equivalent to “and”) account for more than 2,000 lines, more than half of all duplicates. Such repetitions are very useful but it is not clear that the line linking strategy described here is the most useful way to study such syntactic repetition. The tf/idf measure is most fully effective when documents are relatively consistent in size (as the lines of Homer are), but the lines are so short that most repetitions involve function words rather than content words. The repetition of function words can reveal syntactic patterns (especially when we combine them with metrical information and with syntactic data from the treebank) but repetition of words within a line usually does not tell us much about shared semantic content.

There are, however, cases where repetition captures formulaic repetition across lines.

Il. 2.489	οὐ δ’ εἴ μοι δέκα μὲν γλῶσσαι , δέκα δὲ στόματ’ εἶεν ,
	not though ten tongues were mine and ten mouths
Il. 23.851	κὰδ δ’ ἐτίθει δέκα μὲν πελέκεας , δέκα δ’ ἡμιπέλεκκα ,
	ten double axes he set down, and ten single;

Il. 2.489 appears during the dramatic opening to the Catalogue of Ships. The poet is describing various limitations (including not having enough tongues and mouths) that make it impossible to describe the combatants at Troy. Il. 23.851, by contrast, lists the prize that Achilles sets for the contest in archery.

The two lines above share the same formulaic pattern, with five words starting in identical metrical slots, but the differing syntactic structure clearly brings out the flexibility of Homeric formulaic composition. In Il. 2.489, the tongues and mouths are the subjects of the verb eien, a form of the verb “to be,” and this verb appears at the end of the line. In Il. 23.851, the ten double-axes and ten single-axes are the objects of the main verb, etithei, “he set down” (taking κὰδ, “down,” as modifying the verb).

Two of the three shared lexemes, the connector de and the particle men, have very little weight.

	found in lines of Homer	Weight	Shared frequency
δέ	12138	0.92	1
μέν	1872	2.699	1
δέκα	20	7.342	2

The repetition of deka doubles its weight to 14.684 — more than enough for the line to exceed the cutoff score of 10.

Likewise, in the following example, the repetition of heneka, “because of”:

Il. 3.100	εἵνεκ’ ἐμῆς ἔριδος καὶ Ἀλεξάνδρου ἕνεκ’ ἀρχῆς·
	because of my quarrel and because of the beginning of Alexandros
Il. 6.356	εἵνεκ’ ἐμεῖο κυνὸς καὶ Ἀλεξάνδρου ἕνεκ’ ἄτης,
	because of me, the bitch, and because of the madness of Alexandros

Each instance of heneka adds 5.733 to the score — the single repetition is enough by itself to create a link between the two lines. The two lines are clearly variations on one another. First, we can point out that the choice of two different forms of the preposition, heineka and then heneka, allows the same word to fill two differently shaped metrical slots (long-short in the first case, short-short in the second).

Second, each line is slightly different in meaning and even in structure. In Il. 3.100, Menelaus is speaking to the Greeks and arguing that a proposed one-on-one combat should be between himself and Alexandros. He refers to “my quarrel,” because the war is ostensibly being fought to return Helen. The form emês is a possessive adjective that means “my.” The line ends with the word archês, “beginning,” because it was Alexandros who began the war.

In Il. 6.356, Helen is speaking to Hector on his visit to Troy. She speaks of herself and of Alexandros in very harsh terms because they are responsible for the war. She, however, uses emeio, a genitive form of the first person pronoun (rather than the adjective “my”). The phrase means “of me, the dog” (properly translated by English “bitch”). Here the noun kunos modifies emeio whereas in the earlier line emês, “my,” modifies eridos, “quarrel.” And Helen refers to the “idiocy” (atês) rather than simply the more matter-of-fact “beginning” of Alexandros that has caused the war. The information conveyed in the two lines differs somewhat. The second variation is far sharper, even brutal. This offers an excellent example of Homeric traditional composition.

I have, as I will describe below, found this format useful for tasks such as computing similarities among different books of the Iliad and the Odyssey and for tracking the varying degree to which different lines in different parts of the epics share more or less vocabulary. I try to avoid stating that lines with fewer attested parallels are not formulaic and to point out, rather, the extent to which we can observe parallels in the 200,000 words of our Iliad and Odyssey.

Using the files

Each file contains one book of the Iliad or the Odyssey. Readers can simply scroll through and examine the parallels for each line. I use these files as a kind of dynamically generated commentary. Whenever I read a passage closely, I look to see how many — or how few — linked lines there are and what the parallels reveal.

Il. 1.1	μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος
	Sing, goddess, the wrath of Achilles, son of Peleus,

The first line of the Iliad shares two lexemes with sixteen other lines of Homeric epic. To make the exposition clearer, I have optimized the list below a bit. I don’t present all the information that each line in the data set includes. I have highlighted the shared lines. I have also resorted the results to make the word order and different spellings stand out. The changes only make more noticeable patterns that a careful viewer would quickly detect.

Il. 24.406	εἰ μὲν δὴ θεράπων Πηληϊάδεω Ἀχιλῆος
Il. 16.269	Μυρμιδόνες ἕταροι Πηληϊάδεω Ἀχιλῆος
Od. 24.15	εὗρον δὲ ψυχὴν Πηληϊάδεω Ἀχιλῆος
Il. 16.653	ὄφρ’ ἠῢς θεράπων Πηληϊάδεω Ἀχιλῆος
Il. 1.322	ἔρχεσθον κλισίην Πηληϊάδεω Ἀχιλῆος·
Il. 9.166	ἔλθωσ’ ἐς κλισίην Πηληϊάδεω Ἀχιλῆος.
Od. 8.75	νεῖκος Ὀδυσσῆος καὶ Πηλεΐδεω Ἀχιλῆος,
Il. 20.85	Πηλεΐδεω Ἀχιλῆος ἐναντίβιον πολεμίξειν;
Il. 15.64	Πηλεΐδεω Ἀχιλῆος· ὃ δ’ ἀνστήσει ὃν ἑταῖρον
Il. 23.542	Πηλεΐδην Ἀχιλῆα δίκῃ ἠμείψατ’ ἀναστάς·
Il. 21.557	Πηλεΐδῃ Ἀχιλῆϊ, ποσὶν δ’ ἀπὸ τείχεος ἄλλῃ
Il. 20.322	Πηλεΐδῃ Ἀχιλῆϊ· ὃ δὲ μελίην εὔχαλκον
Il. 17.105	Πηλεΐδῃ Ἀχιλῆϊ· κακῶν δέ κε φέρτατον εἴη.
Il. 17.701	Πηλεΐδῃ Ἀχιλῆϊ κακὸν ἔπος ἀγγελέοντα.
Il. 20.312	Πηλεΐδῃ Ἀχιλῆϊ δαμήμεναι, ἐσθλὸν ἐόντα.
Il. 24.59	αὐτὰρ Ἀχιλλεύς ἐστι θεᾶς γόνος , ἣν ἐγὼ αὐτὴ

The last line in the long list above would probably be considered noise to someone exploring the workings of formulaic poetry. Achilles shows up in both but the other shared word, theâ, “goddess,” refers to the Muse in Il. 1.1 and to Achilles’ mother, Thetis, in Il. 24.59. Observers with no knowledge of Greek would see this if they consulted the translations. All readers, whether they know Greek or not, would, of course, have to understand the context of Il. 1.1 and Il. 24.59 and thus to conclude that the pairing of Il. 1.1 and Il. 24.59 probably reflects a random collocation.

But even this line reveals an important feature of the formulaic system: in this, the last of sixteen examples, the name Achilles is spelled, for the first time, with two lambdas (-λλ- vs. -λ-). Homeric tradition exploits the ability to alter the spelling because the -i- in the name Achilles is short (in Greek, the letters a, i, and u can be long or short) but by doubling the following lambda, the short syllable is lengthened. The tradition offers two spellings for Achilles and the name can thus fit into different metrical slots in the line.

Most observers, with or without knowledge of ancient Greek, would immediately see the dominant feature of “Achilles the son of Peleus”: this pair of words always appears either at the end or the beginning of a line.

Readers who look closer will notice that the words are not spelled identically. Ancient Greek is highly inflected and we have the same formula not only in different positions (line beginning and end) but also in different cases: genitive (Pêlêiadeô Achilêos, “of Achilles the son of Peleus”), accusative (Pêlêiadên Achilêa, “Achilles the son of Peleus” as the object of a verb), and dative (Pêlêiadêi Achilêi, “to/for Achilles the son of Peleus”). But even without knowing the details of the case system, we can see that there are both continuities and variations in this same formulaic expression.

The tab delimited files can easily be loaded into a spreadsheet and re-sorted. You could, for example, look at those lines of Iliad 1 that have the highest similarity scores. This will identify lines with identical vocabulary, with the least common shared vocabulary generating the highest scores.
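The same re-sorting can be done without a spreadsheet. The sketch below uses Python's standard csv module. The column names (base, comparison, score) and the three rows are made up for the example and would need to be matched to the headers of the actual files.

```python
import csv
import io


def top_scoring(tsv_file, score_column="score", n=10):
    """Return the n rows with the highest similarity score.

    Assumes a header row and a numeric column named `score_column`;
    both are hypothetical and should be matched to the real files.
    """
    rows = list(csv.DictReader(tsv_file, delimiter="\t"))
    rows.sort(key=lambda row: float(row[score_column]), reverse=True)
    return rows[:n]


# Made-up rows standing in for one of the tab-delimited files.
toy = io.StringIO(
    "base\tcomparison\tscore\n"
    "x.1\ty.7\t12.4\n"
    "x.2\ty.3\t49.8\n"
    "x.3\ty.9\t10.1\n"
)
best = top_scoring(toy, n=2)
print([(row["base"], row["score"]) for row in best])
```

With a file on disk, the same function can be handed an open file object, for example one opened with encoding="utf-8" and newline="".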

Figure 2: The lines of Iliad 1 with the highest similarity scores to other lines in Homeric epic.

Lines that are not quite identical but that share less common vocabulary will sometimes score higher than identical lines with common lexemes (e.g., Il. 1.481 and Il. 1.480).

I cited the two lines that share the lowest similarity score to the rest of Homeric epic above. The lines with the highest similarity score in Homeric epic both have scores of 49.8:

Il. 2.471	ὥρῃ ἐν εἰαρινῇ ὅτε τε γλάγος ἄγγεα δεύει,
Il. 16.643	ὥρῃ ἐν εἰαρινῇ, ὅτε τε γλάγος ἄγγεα δεύει·
Il. 6.511	ῥίμφά ἑ γοῦνα φέρει μετά τ’ ἤθεα καὶ νομὸν ἵππων·
Il. 15.268	ῥίμφά ἑ γοῦνα φέρει μετά τ’ ἤθεα καὶ νομὸν ἵππων·

I also published two files with summary data. The first, booktots.tsv, publishes the average number of linked lines for each book of the Iliad and the Odyssey.

work:book	links/line
12-1:01	26.2291325695581
12-1:02	19.608893956670500
12-1:03	23.99349240780910
12-1:04	23.095588235294100
12-1:05	24.24312431243120

More linked lines reflect more attested repetition and say something about the visibly formulaic nature of each book. I add a graph that illustrates which books of the Iliad and the Odyssey have the most links per line. The range is quite wide: from under 19 links per line in Iliad 12 to almost 29 links per line in Odyssey 16. A range of 50% strikes me as large and surprising.

Figure 3: Number of links per line for each book of the Iliad and the Odyssey

The figure reveals several features. First, there is more attested repetition in the Odyssey than in the Iliad, a much bigger difference than I would have expected (although some who see the Odyssey as derivative from the Iliad may claim not to be surprised). Second, the book of the Odyssey with the least attested repetition is book 9, which tells the story of Polyphemus. That suggests to me that this story draws from traditions that are unusually distinct from the rest of the Odyssey. Third, the book of the Iliad with the most repetition is Iliad 10, which many have seen both as an outlier and as more Odyssean. Dué and Ebbott (2010) have made strong arguments that Iliad 10 stands out because it draws on traditions of ambush that are not widely attested elsewhere in the Iliad and Odyssey and that this distinctive content makes the book appear as an outlier. Their argument makes the fact that Iliad 10 has by far the most attested repetitions of any book in the Iliad all the more striking.

Second, I include a file measuring how many linked lines connect each book in the Iliad and the Odyssey with each other. I have added the number of shared words in all linked lines for any two books and, to make up for the fact that different books vary so much in size, then divided by the sum of the number of words in the two books combined. Each book then has a number ranking its closeness to the other 47 books. There is more to be done here but I think that the preliminary data is worth sharing. It is available as book2book.tsv. The following are the 10 links that, by this metric, have the lowest scores and thus are least similar.

source	srcwords	target	targwords	nlinks	links_per_word
12-1:12	3777	12-2:07	2910	3462	0.5177209510991480
12-2:07	2910	12-1:12	3777	3926	0.5871093165844180
12-1:12	3777	12-2:06	2890	5566	0.8348582570871460
12-2:06	2890	12-1:12	3777	5793	0.8689065546722660
12-1:16	7002	12-2:06	2890	8710	0.8805095026283870
12-1:12	3777	12-2:23	3231	6306	0.8998287671232880
12-2:06	2890	12-1:16	7002	9029	0.9127577840679340
12-1:17	6181	12-2:07	2910	8379	0.9216807831921680
12-1:12	3777	12-2:14	4623	7775	0.9255952380952380
12-1:12	3777	12-2:03	4274	7514	0.9333002111538940
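Judging from the rows above, the links_per_word column equals nlinks divided by the sum of srcwords and targwords. The check below recomputes it for the first two rows. This is an inference from the published numbers rather than a description of the code that generated the file, and the function name is illustrative.

```python
# (source, srcwords, target, targwords, nlinks, links_per_word), from the table above
ROWS = [
    ("12-1:12", 3777, "12-2:07", 2910, 3462, 0.5177209510991480),
    ("12-2:07", 2910, "12-1:12", 3777, 3926, 0.5871093165844180),
]


def links_per_word(nlinks, srcwords, targwords):
    """Normalize the link count by the combined size of the two books."""
    return nlinks / (srcwords + targwords)


for source, srcwords, target, targwords, nlinks, published in ROWS:
    computed = links_per_word(nlinks, srcwords, targwords)
    print(source, target, f"{computed:.6f}", f"{published:.6f}")
```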

Iliad 12 and Odyssey 7 have the least in common. Note that the score from Iliad 12 to Odyssey 7 is not quite the same as going from Odyssey 7 to Iliad 12. That results from the way we manage uncommon lines but the variation is not large.

The most closely connected books follow:

source	srcwords	target	targwords	nlinks	links_per_word
12-2:17	5319	12-2:16	4187	58336	6.136755733221120
12-2:19	5230	12-2:17	5319	64059	6.072518722153760
12-2:16	4187	12-2:17	5319	57720	6.071954555017880
12-2:17	5319	12-2:19	5230	63286	5.999241634278130
12-2:11	5444	12-2:10	4983	59526	5.708832837824880
12-2:10	4983	12-2:11	5444	59023	5.660592692049490
12-2:17	5319	12-2:15	4811	53620	5.293188548864760
12-2:04	7326	12-2:17	5319	66699	5.274733096085410
12-2:04	7326	12-2:15	4811	63941	5.268270577572710
12-2:16	4187	12-2:15	4811	47263	5.252611691487000

Odyssey 16 and 17 are the most similar books by this metric and Odyssey books are more like each other than Iliad books are. That may reflect the fact that the Odyssey does not need a stream of otherwise unknown figures who appear only to die in battle at the hands of better known heroes. But I suspect that it says something about the composition of the Odyssey.

The full file contains 2256 different links and others can compare them at greater length.

Acknowledgements

This work was made possible by the Beyond Translation Project, funded by NEH HAA-266462-19, by Harvard’s Center for Hellenic Studies, by the Data Intensive Studies Center at Tufts University, and the German Academic Exchange Service.

Citations

Perseus Iliad Treebank: https://github.com/gregorycrane/gAGDT/blob/master/data/xml/tlg0012.tlg001.perseus-grc1.tb.xml

Perseus Odyssey Treebank: https://github.com/gregorycrane/gAGDT/blob/master/data/xml/tlg0012.tlg002.perseus-grc1.tb.xml

Allen, Thomas William, Homeri Opera (Oxford 1920, 3rd edition).

Düntzer, Heinrich. Homerische Abhandlungen. Hahn, 1872, https://catalog.hathitrust.org/Record/100114579.

Dué, Casey, and Mary Ebbott. 2010. Iliad 10 and the Poetics of Ambush: A Multitext Edition with Essays and Commentary. Hellenic Studies Series 39. Washington, DC: Center for Hellenic Studies. http://nrs.harvard.edu/urn-3:hul.ebook:CHS_Due_Ebbott.Iliad_10_and_the_Poetics_of_Ambush.2010.

Hainsworth, J. B (1968). The Flexibility of the Homeric Formula. Clarendon P., 1968, https://catalog.hathitrust.org/Record/001223030.

Parry, Milman (1928). “L’Épithète Traditionnelle dans Homère: Essai sur un problème de style Homérique.” The Center for Hellenic Studies, 1928, https://chs.harvard.edu/book/parry-milman-lepithete-traditionnelle-dans-homere-essai-sur-un-probleme-de-style-homerique/.

Parry, Milman (1928b). “Les formules et la métrique d’Homère.” The Center for Hellenic Studies, https://chs.harvard.edu/book/parry-milman-les-formules-et-la-metrique-dhomere/.

Sansom, Stephen A. “Sedes as Style in Greek Hexameter: A Computational Approach.” TAPA (Society for Classical Studies), vol. 151, no. 2, 2021, pp. 439–467, https://doi.org/10.1353/apa.2021.0017.

Spärck Jones, K. (1972). “A Statistical Interpretation of Term Specificity and Its Application in Retrieval”. Journal of Documentation. 28: 11–21.

by Gregory Crane Posted on May 17, 2022

Posted in Ancient Greek | Tagged Ancient Greek, epic, Homer, NLP | Comments Off

New ways to read Greek and Persian epic and to explore diverse cultures

Gregory Crane1, Alison Babeu1, Farnoosh Shamsian2, James Tauber3, Jacob Wegner3

1 – Tufts University; 2 – Leipzig University; 3 – Eldarion.com

[Preprint of the abstract for Digital Humanities 2022: Responding to Asian Diversity – https://dh2022.adho.org/]

Our work explores the hypothesis that a new mode of reading is taking shape, one in which dense, machine actionable annotations allow readers to work directly and effectively with sources in languages that they do not know – a new middle space between reliance on translation and mastery of the source text (Crane et al. 2019, Crane 2019). This hypothesis has substantial potential importance for our ability to use source texts to explore cultural diversity in general and the diversity of Asian cultures in particular. Our particular work focuses on two challenges for a traditionally Eurocentric subject, Classics (or Classical Studies), which is still used to describe the study of Greco-Roman culture. On the one hand, university students without training in Greek and Latin in secondary school have difficulty mastering the languages and learning about the subject. In spring 2021, the Princeton Classics Department provoked controversy when it made it possible for majors to study Greco-Roman antiquity without learning any Greek or Latin — too few students, especially students of color, had access to Latin, much less Greek, before college (Wood 2021). At the same time, Classics and Classical Studies are far too narrow – we must include other classical languages – Sanskrit, Classical Chinese, Classical Arabic, etc. – if we are to continue using these terms. We report on work that addresses both challenges.

In order to explore this broad topic, we chose to focus on two complementary corpora in two Classical languages: the Homeric epics in Ancient Greek and the Shahnameh in early-modern Persian (Firdawsī 1430, 1988). The goal is both to support Persian speakers who wish to work directly with Homeric Epic and English speakers who wish to engage directly with the Shahnameh. In the case of Persian culture, the links with Greco-Roman culture are deep, the information in Greek sources about ancient Persian history is extensive, and the influence of Greek philosophy, medicine and science is extensive. At the same time, few institutions in Europe and North America, for example, teach modern Persian, much less the early modern Persian of the Shahnameh. We hope to increase the role that the Persian epic, in Persian as well as in translation, plays beyond the Persian speaking world.

The use of dense linguistic annotation to make sources accessible to a broader audience is, of course, hardly new. In his late seventeenth-century description of Ottoman Turkish, Arabic, and Persian, for example, Franciszek Meniński (1680) introduced Persian poetry to a European audience by transliterating a passage from a poem by Hafez, providing a word-by-word translation, and providing detailed explanations of the metrical, morphological, and syntactic function of each word. Contemporary linguists depend upon exhaustively annotated text to work across sources from the thousands of languages, ancient as well as modern, in the human record (Werning 2009).

Digital methods, however, fundamentally change our ability to create and use such annotation (Berti 2019, Schulz 2021). We can use natural language processing pipelines such as Stanza and spaCy (Papantoniou and Tzitzikas 2020), multilingual language models such as BERT (Bamman and Burns 2020), machine translation (Kontogianni et al. 2020), which is most effective for now between modern languages (Bowker 2021), and similar openly licensed resources. In the work that we present, we document the categories of annotation that we have found useful as starting points for augmenting machine readable texts. The Homeric epics have provided a useful starting point because a particularly rich set of preexisting, open digital resources is available upon which to build. The work with Homeric epic provides us with the framework upon which we are building work with the Shahnameh. We will report on the work with Homeric Greek and summarize progress with pre-modern Persian (building on Pizzi 1881).

Exhaustive Metrical Analysis and accompanying sound: Machine-actionable metrical analyses (Schoisswohl and Papakitsos 2020) for every syllable in every line of the Iliad and the Odyssey and readings for a substantial portion of the epics are available under an open license (Chamberlain 2021). From the time they learn the Greek alphabet learners engage immediately with the Homeric epic as performed poetry, following metrical diagram and performance.

Treebanks document features such as the dictionary entry, part of speech, and syntactic function of every word in a source (Keersmaekers, 2019). Treebanks are available for both the Iliad and Odyssey and for more than 1 million words of Greek and Latin (Bamman and Crane, 2011; Celano 2019). We can use these to identify and quantify grammatical structures that students will encounter in the corpus that they will learn.

Paradigm information identifies the morphemes within each individual word and allows learners to see which morphological patterns are most common and to prioritize their learning.

Born-digital aligned translations are created from the start to expose the linguistic structures of a source. From the first lessons, learners can explore the meaning of vocabulary by seeing passages where these words appear (Palladino 2020, Palladino et al. 2021). They can focus on words introduced in a lesson but, in using the aligned translation, they gain constant incidental exposure to words that they have not learned. With translation, the problem of the learner language becomes far more pressing and we can report on the varying challenges of articulating Greek language with translations into English and Persian (Foradi 2020).

Grammatical explanations explain the patterns that learners encounter in the annotations (Mugelli 2021) described above (e.g., the various uses of the dative in Greek or the way we translate the imperfect tense). Grammatical explanations build upon the aligned translations. Grammatical explanations cannot simply be translated but must be adapted to bridge the gap between the target language and the learner language. We report on the differences that arise for speakers of Persian and English.

We will report upon experiences of learners from both Iran and the United States, upon our experiences opening historical sources to new audiences (e.g., Classics majors who do not know Greek or Latin) and cross-cultural explanation of content (e.g., Homeric Epic for Persian speakers and Persian poetry such as the Shahnameh for English speakers).

Bibliography

Bamman, D., Burns, P. (2020). “Latin BERT: A Contextual Language Model for Classical Philology.” https://arxiv.org/abs/2009.10053

Bamman, D., Crane, G. (2011). “The Ancient Greek and Latin Dependency Treebanks,” in Language Technology for Cultural Heritage, ed. Caroline Sporleder, Antal van den Bosch, and Kalliopi Zervanou, Theory and Applications of Natural Language Processing (2011), pp. 79–98.

Berti, Monica, ed. (2019). Classical Philology. Ancient Greek and Latin in the Digital Revolution. Berlin: De Gruyter Saur.

Bowker, Lynne. (2021). “Digital humanities and translation studies.” Handbook of Translation Studies: Volume 5 5 (2021): 37-.

Celano, Giuseppe. (2019). “The Dependency Treebanks for Ancient Greek and Latin”. In Digital Classical Philology. Ancient Greek and Latin in the Digital Revolution, edited by Monica Berti, 279–298. Berlin: De Gruyter Saur.

Chamberlain, David. (2021). “A Reading of Homer (work in progress).” Greek and Roman Verse, https://hypotactic.com/my-reading-of-homer-work-in-progress/. Accessed 8 December 2021.

Crane, G. R.; Shamsian, F.; et al. (2019). “Confronting Complexity of Babel in a Global and Digital Age.” DH2019: Digital Humanities Conference, Book of Abstracts (2019), pp. 127–138. https://dev.clariah.nl/files/dh2019/boa/0611.html

Crane, Gregory. (2019). “Beyond Translation: Language Hacking and Philology.” Harvard Data Science Review 1, no. 2. https://doi.org/10.1162/99608f92.282ad764.

Firdawsī, A. & Khaleghi-Motlagh, D. (1988). The Shahnameh: Book of kings. New York: Bibliotheca Persica.

Firdawsī, A. & Ja’Far, P. C. (1430) The Book of Kings. Tehran: Cultural Heritage Organization. [Pdf] Retrieved from the Library of Congress, https://www.loc.gov/item/2021667287/.

Foradi, Maryam. (2020). Engagement with Classical Literature in the Framework of a Citizen Science Project Using Translation Alignment: Date Accuracy and Pedagogical Effectiveness, [Doctoral dissertation, University of Leipzig]

Keersmaekers, Alek. (2019). “Creating, Enriching, and Valorizing Treebanks of Ancient Greek.” 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019),
https://doi.org/10.18653/v1/W19-7812

Kontogianni, A., et al. (2020). “Computer-Assisted Translation of Egyptian-Coptic into Greek.” Journal of Integrated Information Management, http://ejournals.uniwa.gr/index.php/JIIM/article/view/4470

Meniński, F. (1680). Thesaurus Linguarum Orientalium Turcicae, Arabicae, Persicae, Praecipuas earum opes a Turcis peculariter usurpatas continens, Vienna: Franciscus a Mesgnien Meninski.

Mugelli, Gloria et al. (2021). “Learning Greek and Latin Through Digital Annotation: The EuporiaEDU System.” In: Teaching Classics in the Digital Age. Kiel: Universitätsverlag Kiel | Kiel University Publishing. S. 25–36. (= Think! Historically: Teaching History and the Dialogue of Disciplines). Online unter: https://macau.uni-kiel.de/receive/macau_mods_00001367.

Palladino, Chiara (2020). “Reading Texts in Digital Environments: Applications of Translation Alignment for Classical Language Learning.” The Journal of Interactive Technology Pedagogy, Issue 18, December 10, 2020, https://jitp.commons.gc.cuny.edu/reading-texts-in-digital-environments-applications-of-translation-alignment-for-classical-language-learning/

Palladino, C., Foradi, M., Yousef, T. (2021). “Translation Alignment for Historical Language Learning: A Case Study.” Digital Humanities Quarterly, 15 (3),
http://digitalhumanities.org/dhq/vol/15/3/000563/000563.html

Papantoniou, K., Tzitzikas, Y. (2020). “NLP for the Greek Language: a Brief Survey.” SETN 2020: 11th Hellenic Conference on Artificial Intelligence, pp. 101-109.
https://doi.org/10.1145/3411408.3411410

Pizzi, I. (1881). Manuale della lingua persiana. Grammatica, antologia e vocabolario, Leipzig: W. Gerhard. Available at https://archive.org/details/manualedellalin00pizzgoog/

Schoisswohl, O., Papakitsos, E. C. (2020). “Automated metric profiling and comparison of Ancient Greek verse epics in Hexameter.” Linguistik Online, 103(3), 159–177. https://doi.org/10.13092/lo.103.719

Schulz, Konstantin (2021). “Natural Language Processing for Teaching Ancient Languages.” In: Teaching Classics in the Digital Age. Kiel: Universitätsverlag Kiel | Kiel University Publishing. S. 37–48. (= Think! Historically: Teaching History and the Dialogue of Disciplines). Online unter: https://macau.uni-kiel.de/receive/macau_mods_00001368.

Werning, Daniel A. (2009) “Glossing Ancient Egyptian. Suggestions for Adapting the Leipzig Glossing Rules.” Lingua Aegyptia. Journal of Egyptian Language Studies, https://www.academia.edu/1484975/Glossing_Ancient_Egyptian_Suggestions_for_Adapting_the_Leipzig_Glossing_Rules.

Wood, G. (2021, June 9). Princeton Cancels Latin and Greek. The Atlantic. Retrieved December 10, 2021, from https://www.theatlantic.com/ideas/archive/2021/06/princeton-greek-latin-requirement/619136/

Acknowledgements

This work was made possible by the Beyond Translation Project, funded by NEH HAA-266462-19, by support from the Data Intensive Studies Center at Tufts University and by the German Academic Exchange Service.

by Gregory Crane Posted on April 21, 2022

Posted in Essays, Historical languages | Tagged Ancient Greek, classical studies, DH2022, epic, Homer, metrical analysis, NLP, Persian, Shahnameh | Comments Off

Visualizing progress in a historical language (2)

Gregory Crane

In part one, I discussed the frequency of different vocabulary items and ways by which learners could track how many words they knew in a particular corpus as they went through some sequence of vocabulary acquisition. I focused upon the first book of the Homeric Iliad as an immediate target corpus and on how learning the vocabulary of this chunk would transfer when readers moved to other sections of the Iliad and Odyssey.

This second part focuses on the problem of visualizing what learners have and have not seen, what new vocabulary they have just encountered, and what new and future vocabulary is above or below some threshold separating common from uncommon words. The goal here is not to provide a finished design but to present a draft that can be developed further. The technology employed is relatively simple: the visualizations are implemented as Scalable Vector Graphics (SVG) objects. The next version will probably be in some dialect of Javascript but the figures below demonstrate one model by which to convey the more granular view of vocabulary.

I assume the ability to visualize aggregate changes in vocabulary knowledge as in Figure 1 (below):

Figure 1: increasing knowledge of vocabulary in Iliad 1 as a learner moves through the 77 chapters of Clyde Pharr’s Homeric Greek.

Figure 1 shows how learners working with Clyde Pharr’s Homeric Greek encounter 100% of all vocabulary items in Iliad 1 over the course of the book’s 77 chapters. Running words belonging to known vocabulary items are represented as green, new words as blue, unseen and common words as red, and unseen but uncommon words as orange.

Figure 2 illustrates how known, new, and unknown words are distributed throughout Iliad 1. I should emphasize that the visualization is itself a general tool designed to show the distribution of different phenomena across the 48 books of Homeric Epic with a particular focus on one book at a time. It could track any variables — e.g., dialogue vs. narrative, references to Achilles vs. Hector, or to optatives or particular clusters of repeated words. In this case, we just happen to be tracking and prioritizing what words learners have and have not seen.

Figure 2: Five different features about each word in book 1 of the Iliad (details below). Green dots represent vocabulary items that the learner has encountered in the opening 10 chapters of Clyde Pharr’s *Homeric Greek*

Figure 2 (above) provides a visualization of the Homeric Epic, with a focus on book 1 of the Iliad. The visualization is a simple mockup to explore features that a more mature and technologically sophisticated visualization would exhibit. It consists of an SVG figure in an HTML file. It offers limited interactivity: mousing over the representations of individual words and books of the Iliad and Odyssey causes information to appear. In Figure 2, there is information about the word boulê, “will, plan” (as in the “plan of Zeus”), which appears as the 8th and final word in line 5 of Iliad 1.

The visualization contains two major parts. The left part, in a combination of black and grey, provides an overview of the Iliad and the Odyssey.

Figure 3: an overview of the 24 books of the Iliad and the Odyssey, with Iliad 1 highlighted. Each column corresponds to the relative length of each book in lines.

The left side of the figure above provides a representation of each book of the Iliad and of the Odyssey, 48 books in all. The length of each column reflects the relative number of lines in each book. The book illustrated in the main part of the visualization is highlighted as black, with the other 47 books as gray.

Figure 4: Detail from Figure 2: color coding used to describe status

Colors were chosen based on the recommendations of Bang Wong [“Points of View: Color Blindness,” Nature Methods 8 (2011)] to minimize the impact of different forms of color blindness and to make the color palette of the visualization as accessible as possible. Colors describe six different features:

Vocabulary items that students have seen so far (in Figure 2, the first 10 chapters of Pharr’s Homeric Greek) are lime green.
Vocabulary items introduced in the most recent lesson (chapter 10 of Pharr’s Homeric Greek) that appear more than 3 times in Iliad 1 appear as blue.
Vocabulary items introduced in the most recent lesson that appear 3 or fewer times appear as light sky-blue. The goal is to help learners prioritize the vocabulary on which they focus, emphasizing more active command for more common words.
Vocabulary items that learners have not encountered and that will appear more than 3 times in Iliad 1 appear as vermilion (a darker reddish color). Learners can skip ahead and see which frequent terms appear.
Vocabulary items that learners have not encountered and that will appear 3 times or fewer in Iliad 1 appear as orange.
Punctuation marks are listed as black. This allows readers to get an overview of sense breaks in the poetry (insofar as editors have chosen to represent these sense breaks with punctuation). At present, every form of punctuation is marked. It may well make more sense to include only full stops (and not commas) as a default.
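The six categories above amount to a small decision rule. The following is a minimal Python sketch of that rule, not the code behind the figures: the function name `classify`, the color labels, and the toy counts are all assumptions made for illustration.

```python
from collections import Counter

def classify(lemma, is_punct, seen, new, book_counts, threshold=3):
    """Return a color label for one running word in the target book.

    seen:        lemmas learned in earlier chapters
    new:         lemmas introduced in the most recent chapter
    book_counts: how often each lemma occurs in the target book
    threshold:   a lemma is "common" if it occurs more than this many times
    """
    if is_punct:
        return "black"
    common = book_counts[lemma] > threshold
    if lemma in new:
        return "blue" if common else "sky-blue"
    if lemma in seen:
        return "green"
    return "vermilion" if common else "orange"

# Toy data: these counts are invented for illustration, not taken from Iliad 1.
book_counts = Counter({"δέ": 100, "θεά": 5, "μῆνις": 2, "ἄλγος": 4, "ἑλώριον": 1})
seen = {"δέ"}            # learned in earlier chapters
new = {"θεά", "μῆνις"}   # introduced in the most recent chapter

words = ["δέ", "θεά", "μῆνις", "ἄλγος", "ἑλώριον"]
colors = [classify(w, False, seen, new, book_counts) for w in words]
print(colors)
```

Checking `new` before `seen` keeps the most recent lesson visually distinct even if an implementation has already added its vocabulary to the set of known words.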

The main section of the visualization presents one small box for each word in Iliad 1. The length of each row reflects the number of words in each line. There are breaks between every 10 lines and each column contains up to 100 lines.
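The layout just described (one box per word, one row per line, a gap after every 10 lines, a new column after every 100 lines) can be sketched in a few lines of Python that emit SVG. This is a hypothetical reconstruction under assumed pixel sizes; `layout_boxes`, `to_svg`, and every dimension below are illustrative choices, not the values used for the figures.

```python
def layout_boxes(lines, box=4, gap=1, block_gap=6,
                 lines_per_column=100, column_gap=20):
    """Place one box per word.

    lines: one list of fill colors per verse line, in order.
    Returns a list of (x, y, color) tuples.
    """
    widest = max((len(colors) for colors in lines), default=0)
    column_width = widest * (box + gap) + column_gap
    boxes = []
    for i, colors in enumerate(lines):
        column, row = divmod(i, lines_per_column)
        # Extra vertical space after every block of 10 lines.
        y = row * (box + gap) + (row // 10) * block_gap
        x0 = column * column_width
        for j, color in enumerate(colors):
            boxes.append((x0 + j * (box + gap), y, color))
    return boxes

def to_svg(boxes, box=4):
    """Serialize the boxes as a standalone SVG document."""
    width = max((x for x, _, _ in boxes), default=0) + box
    height = max((y for _, y, _ in boxes), default=0) + box
    rects = "".join(
        f'<rect x="{x}" y="{y}" width="{box}" height="{box}" fill="{color}"/>'
        for x, y, color in boxes
    )
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">{rects}</svg>')

# Toy input: 101 two-word lines, so the last line starts a second column.
lines = [["green", "blue"] for _ in range(101)]
boxes = layout_boxes(lines)
svg = to_svg(boxes)
```

The fill values must be colors that SVG understands; a label such as "vermilion" would need to be mapped to a hex code before serialization.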

Figure 5: detail from Figure 2, showing Iliad 1.1-14 and 101-114.

Figure 5 shows a close-up from Figure 2. Almost every word in lines 1-5 is green (a known vocabulary item) or black (punctuation): Pharr begins with the start of Iliad 1 and moves through the poem a few lines at a time in each chapter. The one word that is unseen (word 3 in line 5) is a form of the adjective pas, “all, each.” The grammar has not introduced third declension adjectives yet and thus this word is not yet in the vocabulary (it was covered with a note to the text).

Figure 6: mousing over the dark blue dot, the reader sees that it corresponds to the Greek word Danaans, which appears 10x in *Iliad* 1.

Figure 6 illustrates information about a particular word. The reader has moused over the 4th word in Iliad 1.109, sees the text of this line, discovers that it is a form of the name Danaans (one of several words to describe the Greeks) and that this word shows up 10 times in Iliad 1. A gloss from Pharr, the inflected form and a code for the part of speech (noun, plural, masculine, dative) follows.

Figures 7, 8, and 9 (below) show how the learner’s knowledge of vocabulary changes as they move through the textbook, with snapshots showing where they stand.

Figure 7: known vocabulary after 20 chapters of Pharr.

Figure 7 shows the state after chapter 20 of Pharr. This chapter covers Il. 1.28-32. A cluster of dark and light blue dots (depending on their frequency) in these lines shows how new vocabulary presented in this chapter was designed to fill in the gaps for these new lines. Places where the vocabulary presented in chapter 20 appears later in book 1 also appear as light or dark blue dots.

Even at the relatively low resolution of the image above, a solid block of green now appears at 1.371-379 and that block shows, not only to those of us who have worked on Homeric poetry for years but also to those learning Homeric Greek, how the oral tradition works. Iliad 1.372-375 repeats 1.13-16; 1.376-379 repeats 1.22-25. Iliad 1.371 does not repeat 1.12 but every word in 1.371 has already occurred. We can also note that 1.371-379 repeats two different chunks that are separated earlier in book 1 by a speech: Chryses’ request that the sons of Atreus accept ransom for his daughter (1.17-21). We can see at work here how the tradition can change the length of an account by selectively adding or subtracting lines.

Figure 8: known vocabulary after 30 chapters of Pharr.

Chapter 30 covers Il. 1.81-85. Almost all of the words through line 82 appear as green. New vocabulary clusters in the lines covered in this chapter.

Figure 9: known vocabulary after 40 chapters of Pharr

Figure 9 shows the state in chapter 40, which covers Il. 1.158-164. Blue dots cluster around the seven newly introduced lines. At the same time, the share of green dots, representing known vocabulary, has increased dramatically since figure 1.

At some point, learners will presumably want to move on to another text beyond the first book of the Iliad. At the moment, we are able to recalculate what vocabulary readers will have seen if they move to a new book of Homer after any given chapter of Pharr.
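The recalculation can be pictured as two steps: collect every dictionary entry introduced up to a given chapter, then count how many running words of the new book belong to that set. The Python sketch below assumes a hypothetical mapping from chapter numbers to the dictionary entries each chapter introduces; the chapter assignments and the four-word "book" are invented.

```python
def known_after(chapter_vocab, chapter):
    """Union of the dictionary entries introduced in chapters 1..chapter."""
    known = set()
    for number, lemmas in chapter_vocab.items():
        if number <= chapter:
            known.update(lemmas)
    return known

def coverage(book_lemmas, known):
    """Share of running words in a book whose dictionary entry is known."""
    if not book_lemmas:
        return 0.0
    hits = sum(1 for lemma in book_lemmas if lemma in known)
    return hits / len(book_lemmas)

# Toy data: hypothetical chapter assignments and a four-word "book".
chapter_vocab = {1: ["μῆνις", "ἀείδω"], 2: ["θεά"]}
other_book = ["θεά", "δέ", "μῆνις", "θεά"]

after_1 = coverage(other_book, known_after(chapter_vocab, 1))
after_2 = coverage(other_book, known_after(chapter_vocab, 2))
print(after_1, after_2)
```

Because the known set depends only on the chapter reached, the same set can be applied to any book for which lemmatized text is available.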

Figure 10: known vocabulary in *Iliad* 16 after 40 chapters of Pharr.

Figure 11: known vocabulary in *Odyssey* 5 after 40 chapters of Pharr.

Work to be done

The images above are, as noted earlier, only mockups to help think through what a more finished system would look like, one that lets learners track what they (should) have learned and what they will need to learn going forward. I list below some of the more obvious topics.

Support for word senses: the visualizations above assume that once learners have encountered the piles of glosses and synonyms that Pharr offers, they can recognize the meaning of that word in any subsequent passage. Since words have multiple meanings (some words have very many), that assumption is clearly false. In fact, we have several machine-readable Greek-English dictionaries such as Liddell-Scott-Jones, Cunliffe’s Homeric Dictionaries (including his dictionary of people and places), and the new Cambridge Greek Lexicon (which will hopefully appear on the Scaife Viewer in 2022). We spent a fair amount of time capturing the structure of the dictionary entries and giving each sense that had a label in the print lexicon a unique digital identifier. We could experiment with the benefits of linking to particular word senses and not just to the dictionary headwords.
Support for inflection: visualizations above assume that once learners have encountered a dictionary entry, they can understand all of its forms. Greek verbs, however, have many different inflectional patterns and traditional learners of Greek spend much of their time learning the many ways Greek verb forms can be generated. We actually have segmented words (as appropriate) into preverbs, augments, stems, and endings, with labels for different inflection categories (e.g., first vs. second declension). We could identify which forms of a new verb learners could already parse and which forms follow paradigms that they have not yet learned. In an earlier version of this work, I tracked which forms learners could parse at any given time but decided that the added complexity was not worthwhile (at least until we have a more customizable system). As it is, we have the inflectional analyses for every word in the Homeric Epics. Learners develop practice using lookup tools to parse unknown forms.
Relative frequency in each of the 48 books of Homeric Epic: The left-hand side macro-view should show how many words readers would know for each of the 48 books of Homeric Epic. Readers could see at a glance if some books stand out because they share more or fewer vocabulary items. Such a visualization requires normalization — the longest books of the Iliad are twice as long as the shortest books in the Odyssey.
Implementation as a fully interactive, browser-based visualization: The SVG implementation shown above does support some interaction but a more mature implementation would allow for many different use cases and could be customized in many different ways.
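The normalization mentioned in the third item can be illustrated briefly. In the Python sketch below, the function `per_book_shares` and the two toy "books" are hypothetical; the point is only that raw counts of known words favor long books, while shares make books of different lengths comparable.

```python
def per_book_shares(books, known):
    """For each book, return (known running words, total running words, share)."""
    result = {}
    for name, lemmas in books.items():
        hits = sum(1 for lemma in lemmas if lemma in known)
        share = hits / len(lemmas) if lemmas else 0.0
        result[name] = (hits, len(lemmas), share)
    return result

# Toy data: "a" and "b" stand in for dictionary entries.
books = {
    "long book": ["a"] * 6 + ["b"] * 6,   # 12 running words
    "short book": ["a"] * 4 + ["b"] * 2,  # 6 running words
}
shares = per_book_shares(books, known={"a"})
for name, (hits, total, share) in shares.items():
    print(f"{name}: {hits}/{total} known ({share:.0%})")
```

Here the long book contains more known running words (6 against 4) but a smaller known share (one half against two thirds), which is the distortion that normalization removes.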

A great deal of work in digital philology goes into the backend processing of textual data and into the production of new conclusions. We also need a great deal of work on the design of very basic visualizations that can reveal through our highly developed visual abilities linguistic patterns that our ears will not catch and that our eyes will not see as they scan text a word at a time. We need — and we will ultimately have — new dashboards — aggregations with multiple visualizations at a glance — that allow us to work with our textual sources at scale. Once we happen upon visualizations that reach a certain level of functionality, the field will cluster around them. Soon they will be taken for granted as if they had always been there. We are, however, a long way from that state. My goal here is to push the process along, whether that leads to improving what I have presented or to a wholly different, and more effective, approach.

Acknowledgements

by Gregory Crane Posted on April 21, 2022

Posted in Ancient Greek | Tagged Homer, NLP, visualization | Comments Off

Visualizing Progress in Homeric Greek (1)

Gregory Crane*

This paper is designed to be the first of two that explore the degree to which learners can track how much of the vocabulary as a whole in a target corpus they have encountered and to see the frequency in the rest of the corpus of each newly encountered vocabulary item. We focus here upon the Ancient Greek Iliad and the Odyssey, a corpus of just over 200,000 running words. Homeric Epic provides a useful starting point because a growing cluster of openly-licensed, digital resources for this corpus are available, including links from each form in the epic to a dictionary entry, the starting point for vocabulary analysis.

In other work, as part of this larger, NEH-funded effort that we call Beyond Translation, we are developing ways for learners to see any and all other passages where a new Greek word appears and thus to see for themselves what the word means, rather than depending upon textbook glosses — from the very first day, learners should be able to begin studying words in context by using translations aligned at the word and phrase level to the original as well as the linguistic analyses explaining the function of each word. This paper, however, focuses simply on how learners can organize vocabulary acquisition by corpus frequency and then how they can track their progress.

This paper also builds upon Clyde Pharr’s Homeric Greek (D.C. Heath & Co., 1925), a brilliant and innovative introduction to Ancient Greek, produced almost a century ago. Pharr introduces learners to the vocabulary of Iliad 1.1-5 on page 1 of his grammar and begins including small but gradually growing chunks of Iliad 1 from lesson 13 on, ultimately covering all 611 lines in this book by the end of lesson 77. Book 1 of the Iliad thus serves as the target corpus and learners acquire a basic knowledge of Greek by learning as much as possible about the 4,563 running words in this book.

Researchers studying modern languages commonly report that readers need to understand at least 95% and preferably 98% of the running words in a text to gain adequate comprehension (e.g., Schmitt et al., 2011). The 95% threshold is often cited by professional teachers of Greek and Latin. Homeric Epic may be different and more amenable to reading before readers approach 95% vocabulary knowledge. Students of Greek commonly report that they can quickly begin to read Homeric Greek with speed and facility — that was certainly my personal experience when I read Homer intensively in the 1970s, long before any digital tools were available. It may well be that the formulaic nature of Homeric poetry means that it is easier to skip words when reading — there are certainly many common epithets for which the conventional translations are (in my view) guesses based on context.

Furthermore, when we explore Ancient Greek literature as a whole, we shift from genre to genre — from Homer to Attic Tragedy, from historians such as Herodotus (who write in the Ionic dialect) to Thucydides (who writes in Attic), from the comedies of Aristophanes to the dialogues of Plato. The various authors and genres differ in style and vocabulary. It is difficult to learn enough Ancient Greek from so many sources for any but the most experienced to expect they would understand 95-98% of the words in a text they had not seen. Most readers of Ancient Greek will depend upon using online dictionaries. My goal is to help learners to internalize as much knowledge as possible and to understand these sources with as much fluency, precision and pleasure as possible.

While Pharr brings his readers into contact with Homer as soon as he feels he can, he does not confine himself to authentic Greek text. Pharr includes exercises composed in artificial, textbook Greek and asks learners to produce exercises in prose, but the textbook vocabulary is almost entirely drawn from Iliad 1 and designed to help learners acquire both active command and passive recognition of Iliad 1.

Such exercises, of course, carefully track what learners have and have not learned in any given lesson. While John Wright and Paula Debnar completely rewrote the grammatical explanation in Pharr, they carefully preserved Pharr’s exercises (Pharr 2012). Ultimately, we should be able to generate such exercises automatically from an annotated corpus — we could almost certainly do so now, given the state of natural language processing — but for now we will build upon the manually constructed exercises that Pharr constructed a century ago.

I assume that learners are working to master an existing corpus, rather than to produce and understand an effectively infinite number of utterances in a living language. This scenario can, of course, also apply to modern languages, where a learner wishes to understand a particular corpus (e.g., the screenplay of a Kurosawa movie or the lyrics of the Grateful Dead). Such learners are able to measure their progress against a corpus that is, at least for the moment, relatively fixed.

Manually curated annotations listing the dictionary form, part of speech, and syntactic function for more than 1 million running words of Ancient Greek are available on GitHub from various openly licensed treebanks. This includes the c. 200,000 running words in the Iliad and the Odyssey. We can thus begin to measure factors such as how many dictionary words and how many inflected forms we find in Homeric Epic.

Vocabulary in the Homeric Epics

The precise counts will vary according to the edition chosen (the Perseus Treebank for Homer uses Allen’s 1920 Oxford Classical Text), to the way words are counted (the Perseus Treebank treats words such as mête (“neither”) as two words: mê (“not”) and te (“also, and”)) and to decisions as to whether particular words get assigned to one or to two or more dictionary entries. These factors are unlikely to change the picture that emerges in the figure below.

Figure 1: Relative number of dictionary entries and inflected forms in the Iliad and *Odyssey*

Figure 1 plots the contribution of each dictionary entry and inflected form to an understanding of Homeric Epic as a whole. It starts with the most common dictionary forms and inflected forms, and then moves down the lists, adding the frequency of each new item and measuring how many words in Homeric epic we can now account for. Figure 1 drives home the fact that Greek is a highly inflected language, where each dictionary entry shows up, on average, in 3.6 different forms.
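The procedure behind such a plot (sort items by frequency, then accumulate) is short enough to sketch. The Python below is an illustration under assumptions: the `(form, lemma)` token pairs are a six-word toy sample, not treebank data, and `cumulative_coverage` is a hypothetical helper name.

```python
from collections import Counter

def cumulative_coverage(items):
    """Rank items by frequency and accumulate their share of all running words.

    Returns rows of (rank, item, count, cumulative count, cumulative share).
    """
    counts = Counter(items)
    total = sum(counts.values())
    rows, running = [], 0
    for rank, (item, n) in enumerate(counts.most_common(), start=1):
        running += n
        rows.append((rank, item, n, running, running / total))
    return rows

# Toy sample of (inflected form, dictionary entry) pairs.
tokens = [("δ’", "δέ"), ("δὲ", "δέ"), ("δ’", "δέ"),
          ("καὶ", "καί"), ("θεὰ", "θεά"), ("θεάν", "θεά")]
forms = [form for form, _ in tokens]
lemmas = [lemma for _, lemma in tokens]

form_rows = cumulative_coverage(forms)
lemma_rows = cumulative_coverage(lemmas)
forms_per_lemma = len(set(forms)) / len(set(lemmas))
```

Run over the full corpus, the same two lists would yield the two curves in the figure, and `forms_per_lemma` corresponds to the average number of distinct forms per dictionary entry.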

The table below shows the most common forms in Homeric epic and how much each contributes to the overall corpus of 200,518 running words. Most of the forms are indeclinable particles and prepositions. Only a few of this top 20 are, strictly speaking, inflected. Note that no attempt was made to normalize forms: thus, kai (“and”) with a grave accent, for example, counts as a form distinct from kai with an acute accent, although the two are identical in meaning. Normalization might reduce the number of forms but would not significantly change the relationship between dictionary entries and forms in Figure 1.

Rank	Form	Dictionary entry	Gloss	Count	Cumulative count	Cumulative share
1	δ’	δέ	but	7014	7014	0.03
2	καὶ	καί	and	4812	11826	0.05
3	δὲ	δέ	but	3461	15287	0.07
4	τε	τε	and	2766	18053	0.09
5	δέ	δέ	but	1640	19693	0.09
6	μὲν	μέν	on the one hand	1634	21327	0.10
7	οὐ	οὐ	not	1560	22887	0.11
8	ἐν	ἐν	in	1407	24294	0.12
9	τ’	τε	and	1222	25516	0.12
10	ὣς	ὡς	as	1215	26731	0.13
11	γὰρ	γάρ	for	967	27698	0.13
12	ἀλλ’	ἀλλά	otherwise	933	28631	0.14
13	τὸν	ὁ	the	904	29535	0.14
14	ἐπὶ	ἐπί	on	829	30364	0.15
15	οἱ	ἕ	nodef	804	31168	0.15
16	αὐτὰρ	ἀτάρ	but	759	31927	0.15
17	μοι	ἐγώ	I (first person pronoun)	745	32672	0.16
18	δὴ	δή	[interactional particle: S&H on same page]	741	33413	0.16
19	οὔ	οὐ	not	677	34090	0.17
20	μιν	μιν	him	644	34734	0.17

From the table above, we can see that most of the most frequent forms in the corpus are not inflected and thus the ratio of 3.6 forms per dictionary entry underestimates how many different forms of many words audiences actually encounter. (When, in summer 2019, Ethan Yates generated the forms that Clyde Pharr listed in the paradigms for his Homeric Greek textbook, he produced a list of almost 14,000 different inflected forms, illustrating the various paradigms. If he had included every inflected participle, the list would have been substantially longer.)

The following table shows the counts for the 20 most common dictionary words in Homeric Epic.

Rank	Dictionary entry	Gloss	Count	Cumulative count	Cumulative share
1	δέ	but	12138	12138	0.06
2	ὁ	the	5870	18008	0.08
3	καί	and	5283	23291	0.11
4	τε	and	4322	27613	0.13
5	ἐγώ	I (first person pronoun)	2871	30484	0.15
6	οὐ	not	2695	33179	0.16
7	εἰμί	to be	2118	35297	0.17
8	ἐν	in	2076	37373	0.18
9	ὅς	who	2043	39416	0.19
10	ὡς	as	2007	41423	0.20
11	σύ	you (personal pronoun)	1900	43323	0.21
12	μέν	on the one hand	1872	45195	0.22
13	ἄρα	particle: ‘so’	1772	46967	0.23
14	τις	any one	1457	48424	0.24
15	ἄν	modal particle	1449	49873	0.24
16	ἀλλά	otherwise	1430	51303	0.25
17	γάρ	for	1401	52704	0.26
18	ἐπί	on	1371	54075	0.26
19	αὐτός	unemph. 3rd pers.pronoun; -self; [the] same	1121	55196	0.27
20	πᾶς	all	1089	56285	0.28

The figure above reports that 8,792 different dictionary words account for all 200,518 running words counted in the Perseus Treebank for Homeric Epic. For those who do not know what a power law is, the figures above provide a good example. The top 20 dictionary words (about 1 out of 440, or 0.23%) account for 56,285 running words, more than a quarter of the corpus as a whole (28%).

The top thousand dictionary entries (11% of the whole) account for 165,922 words (82% of the whole). The table below lists every 50th entry for perspective and to illustrate how much progress learners could make if they based the vocabulary on frequency. (Of course, they would probably use a modified strategy where they started with the most common nouns, verbs etc. rather than the most common forms).

Rank	Dictionary entry	Gloss	Count	Cumulative count	Cumulative share
1	δέ	but	12138	12138	0.06
50	μή	not	614	80123	0.39
100	σφεῖς	personal and (ind.) reflexive pronoun	294	100538	0.50
150	προσεῖπον	to address	190	112061	0.55
200	ἄνωγα	to command	145	120344	0.60
250	πάσχω	to experience	118	126800	0.63
300	ἡμέτερος	our	99	132248	0.65
350	θέω	to run	83	136814	0.68
400	ὅστις	indef. relative or indirect interrogative	73	140722	0.70
450	ἕλκω	to draw	65	144164	0.71
500	πάλιν	back	57	147208	0.73
550	τέμνω	to cut	51	149911	0.74
600	ἔνθεν	whence; thence	45	152289	0.75
650	ἀλέομαι	to avoid	42	154446	0.77
700	φόβος	fear	39	156458	0.78
750	ὀρέγω	to reach	36	158317	0.78
800	ἄρειος	nodef	34	160061	0.79
850	δατέομαι	to divide among themselves	31	161695	0.80
900	χαμᾶζε	to the ground	29	163190	0.81
950	ἐρετμόν	oar	27	164601	0.82
1000	βλάπτω	to disable	26	165922	0.82
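The shares quoted above can be re-derived from the totals given in the text. The short Python check below uses only figures stated in this section (8,792 dictionary entries, 200,518 running words, and the cumulative counts at ranks 20 and 1000); the variable names are illustrative.

```python
TOTAL_RUNNING_WORDS = 200_518
TOTAL_DICTIONARY_ENTRIES = 8_792

def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100 * part / whole

# Top 20 dictionary entries.
top20_entries = percent(20, TOTAL_DICTIONARY_ENTRIES)
top20_words = percent(56_285, TOTAL_RUNNING_WORDS)

# Top 1000 dictionary entries.
top1000_entries = percent(1_000, TOTAL_DICTIONARY_ENTRIES)
top1000_words = percent(165_922, TOTAL_RUNNING_WORDS)

print(f"top 20:   {top20_entries:.2f}% of entries, {top20_words:.1f}% of running words")
print(f"top 1000: {top1000_entries:.1f}% of entries, {top1000_words:.1f}% of running words")
```

The last column of the tables appears to truncate rather than round, which would explain why a share of roughly 82.7% is listed as 0.82.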

The model above makes at least one assumption that would be problematic in print culture: once a dictionary word is encountered, it assumes that all forms of that word will be understood. For traditional learners, that is a major problem with Greek verbs because much of first year Greek is usually devoted to the many arcane ways Greek verbs can be produced.

In a digital environment, however, readers can see the analysis of each and every verb form long before they choose to internalize the ability to generate those forms. In this scenario, learners focus early on the outlines of Greek grammar (e.g., system of tenses, moods and voices) so that they have an understanding of what, for example, an aorist optative contributes. Full annotation of source texts would include links to a grammar explaining the function of each and every word.

The Homer treebank will, for example, tell you that the Greek words “spear” and “shoulder” are both datives and modify the verb ballô, “to strike with a thrown object” (a verb that shows up 469 times in Homer), but it will not tell you that the first dative reflects instrumentality (you strike someone “with a spear”) while the second designates location (you strike someone “on the shoulder”). We need either to use more complex tags (such as my Tufts colleague Matthew Harrington has developed) or to create a separate annotation layer (as my collaborator Farnoosh Shamsian and I are doing).

The case explored below is based upon Clyde Pharr’s Homeric Greek (D.C. Heath & Co., 1925), a brilliant and innovative introduction to Ancient Greek, produced a century ago. It took a data-driven approach in its design when data had to be collected by hand. Pharr had two fundamental principles: (1) design the grammar so that learners engage directly with the corpus of interest as soon as possible and (2) use frequency data so that learners can engage with more frequent phenomena first.

There is, however, tension between the principle of engaging as quickly as possible with the source texts and that of focusing on the most common words. On the one hand, Pharr’s first year grammar introduces Iliad 1.1-5 in lesson 13, the beginning of week 5 if learners move through the book 3 chapters each week. Pharr has, however, been preparing learners for this from the first page of the textbook. Learners are asked to learn the Greek alphabet by spelling and pronouncing the dictionary entries for each word in Il. 1.1-5, each of which is offered along with a definition.

Figure 2: Pharr, chapter 1 –dictionary forms and definitions for each word in Iliad 1.1-5

Pharr introduces the vocabulary for the first five lines of the Iliad on page 1. The immediate goal is for learners to use these words to practice reading the Greek alphabet. At the same time, they have an opportunity for incidental learning, as they are exposed to the definitions and to words that they will study in more detail much later (such as tithêmi, “put, place, cause”, the complex morphology of which is introduced in lesson 32).

Before addressing the question of word frequencies and vocabulary acquisition, I want to point out in passing that we could take a very different approach to having learners begin by sounding out words from Iliad 1.1-5 on day one. In a digital environment, for example, we can easily add sound as a guideline that learners can use as a starting point. We can certainly link a simple reading of the Greek words and definitions. We can, however, also provide metrical analysis and reading as well.

Figure 3: David Chamberlain’s metrical analysis and performance of Iliad 1.1-15 on Hypotactic.

Figure 4: Chamberlain’s analysis and performance on the new Beyond Translation version of Perseus.

Figures 3 and 4 illustrate how data published under an open license can be available in multiple venues. The use of either venue would make possible a different approach for the beginning learner, one that made the sound of the dactylic hexameter an object of study from day 1. That would certainly add to the complexity of what can be a daunting start, but it would also enable learners to begin approaching Homeric Epic as poetry immediately. In a world of smart phones, learners can also listen to recordings anywhere and there is no reason why we cannot make the sound of Homeric poetry a fundamental part of their experiences. Many of us listen to music online in languages that we do not know and do not wait to study the grammar before listening to songs that capture our imagination.

Returning to the question of vocabulary acquisition, the focus on exhaustive coverage of particular passages does, however, come at a price. Any substantial passage in Homeric epic — and any regular text — will contain words that rarely appear. Those who follow Pharr’s exercises will find themselves practicing an active command of words such as proiaptô, “hurl forward, send forth,” and helôrion, “booty, prey, spoil,” even though, in the rest of the Iliad and Odyssey, proiaptô appears only 3 times elsewhere and helôrion is never seen again.

We can, however, now easily indicate to the learner how often each of the words presented in this list actually occurs in the Iliad and Odyssey.

No.	Dictionary entry	Occurrences in Homeric Epic	Gloss
1	μῆνις	16	wrath, fury, madness, rage.
2	ἀείδω	40	sing (of), hymn, chant.
3	θεά	199	goddess.
4	Πηλείδης	59	son of Peleus, Achilles.
5	Ἀχιλλεύς	382	Achilles.
6	οὐλόμενος	14	accursed, destructive, deadly.
7	ὅς	2043	his, her(s), its (own).
8	μυρίος	32	countless, innumerable.
9	Ἀχαιός	722	Achaean, Greek.
10	ἄλγος	92	grief, pain, woe, trouble.
11	τίθημι	377	put, place, cause.
12	πολύς	874	much, many, numerous.
13	δέ	12138	post. conj. but, and, so, for.
14	ἴφθιμος	44	mighty, valiant, stout-hearted, brave.
15	ψυχή	81	soul, breath, life, spirit.
16	Ἅιδης	2	Hades, god of the lower world.
17	προιάπτω	4	hurl forward, send forth.
18	ἥρως	114	HERO, mighty warrior, protector, savior.
19	αὐτός	1121	self, him(self), her(self), it(self), same. (one), he, she, it.
20	ἑλώριον	1	booty, spoils, prey.

We could easily identify dictionary words that fall below some threshold of frequency in Homeric Epic as a whole. Learners could prioritize active mastery of dictionary words that are above that threshold and passive mastery of uncommon words (such as helôrion). Figure 5 shows the progress that learners would make as they moved through the 77 chapters of Pharr.

Figure 5: Vocabulary acquisition in Iliad 1 following Pharr with frequency cutoff of 20+x in Homeric Epic.

Figure 5 divides words into four classes. The y-axis shows the number of running words in Iliad 1 while the x-axis tracks the 77 chapters in Pharr. The orange band at the top represents unseen words that fall below the selected threshold (here 20, so that orange words occur 19 times or less in Homeric Epic). The red band designates unseen vocabulary that meets the cutoff for common words (20 or more occurrences in Homeric Epic). The blue band shows how many new running words in Iliad 1 learners can understand after mastering the vocabulary in the current chapter. The green band then allows learners to see how many words they should be able to recognize. Following Pharr’s plan, they achieve 100% coverage for all words in Iliad 1 at the end of the final chapter.
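The four bands can be computed chapter by chapter from three inputs: the running words of the target book (as dictionary entries), the entries each chapter introduces, and each entry's frequency in the wider corpus. The Python sketch below is a hypothetical reconstruction, not the code behind Figure 5. The chapter assignments and the five-word "book" are invented, while the four frequencies are the ones listed for these words earlier in this section.

```python
def bands(book_lemmas, chapter_vocab, corpus_counts, cutoff=20):
    """Return one (chapter, green, blue, red, orange) row per chapter.

    green:  running words already known before this chapter
    blue:   running words newly covered by this chapter
    red:    unseen running words at or above the frequency cutoff
    orange: unseen running words below the cutoff
    """
    known, rows = set(), []
    for chapter in sorted(chapter_vocab):
        new = set(chapter_vocab[chapter]) - known
        green = blue = red = orange = 0
        for lemma in book_lemmas:
            if lemma in known:
                green += 1
            elif lemma in new:
                blue += 1
            elif corpus_counts.get(lemma, 0) >= cutoff:
                red += 1
            else:
                orange += 1
        rows.append((chapter, green, blue, red, orange))
        known |= new
    return rows

# Frequencies as given in this section; everything else is toy data.
corpus_counts = {"δέ": 12138, "θεά": 199, "μῆνις": 16, "ἑλώριον": 1}
book = ["δέ", "θεά", "δέ", "ἑλώριον", "μῆνις"]
chapter_vocab = {1: ["δέ"], 2: ["θεά", "μῆνις"], 3: ["ἑλώριον"]}

rows = bands(book, chapter_vocab, corpus_counts)
for row in rows:
    print(row)
```

Each row sums to the number of running words in the book, which is what allows the four series to be drawn as stacked bands.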

The 4563 running words in Iliad 1 contain 1118 separate dictionary entries. If, however, we focus on dictionary words that occur 20 or more times in the Homeric epics, we reduce that total to 655 dictionary entries for active mastery. The other 463 dictionary entries only account for 593 running words (pictured in orange above), i.e., the vast majority of them only occur once in Iliad 1. Learners can concentrate on recognizing these less common words in context rather than on actively mastering them.

If we were to increase the cutoff point to dictionary words that appear 30 or more times in Homeric epic, we would reduce the number of words for active recall from 655 to 565: 90 dictionary entries from Iliad 1 occur between 20 and 29 times in Homeric Epic. Of the 4563 running words in Iliad 1, 730 would now fall into the less common category and could be recognized passively in context.
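The effect of moving the cutoff can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline used for the figures: `split_by_cutoff` is a hypothetical helper, the seven-word text is invented, and only the frequency counts come from the list above. The cutoffs 20 and 40 are chosen so that the toy data shows a lemma changing sides.

```python
from collections import Counter

def split_by_cutoff(tokens, corpus_counts, cutoff):
    """Split a text's lemmas into active (common) and passive (rare) sets.

    Returns (active_lemmas, passive_lemmas, passive_running_words).
    """
    text_counts = Counter(tokens)
    active = {lemma for lemma in text_counts
              if corpus_counts.get(lemma, 0) >= cutoff}
    passive = set(text_counts) - active
    passive_running = sum(text_counts[lemma] for lemma in passive)
    return active, passive, passive_running

# Frequencies from the list above; the running text is toy data.
corpus_counts = {"μῆνις": 16, "ἀείδω": 40, "θεά": 199,
                 "οὐλόμενος": 14, "μυρίος": 32, "ἑλώριον": 1}
tokens = ["μῆνις", "ἀείδω", "θεά", "οὐλόμενος", "μυρίος", "θεά", "ἑλώριον"]

for cutoff in (20, 40):
    active, passive, passive_running = split_by_cutoff(tokens, corpus_counts, cutoff)
    print(cutoff, len(active), len(passive), passive_running)
```

Raising the cutoff moves μυρίος (32 occurrences) from the active to the passive set, just as the 90 entries discussed above move when the cutoff rises from 20 to 30.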

Figure 6: Vocabulary acquisition in *Iliad* 1 following Pharr with frequency cutoff of 30+x in Homeric Epic.

In increasing the cutoff from 20 to 30, we increase the number of words that we do not try to actively master, but Figure 6 shows that the change is not drastic — most readers will have to look carefully to see the difference from Figure 5.

The problem with focusing on Iliad 1 (or on any one book of the Iliad or the Odyssey) is that we will miss many words that are common in Homeric Epic as a whole but do not appear in our chosen book.

Figure 7: learning the vocabulary of *Iliad* 1 vs. that of the Homeric Epics as a whole

Where the green region in Figures 5 and 6 ultimately covers every single word in Iliad 1, being able to recognize every vocabulary item in Iliad 1 only prepares learners to recognize 145,026 of the 200,518 (72%) running words in the Homeric Epics as a whole.

If we return to the cutoff of 20 occurrences in Homer, 655 lemmas fall into this category. Adding these extra 90 dictionary entries for active mastery raises the number of running words in Homeric Epic that learners can understand from 145,026 to 150,848 (from 72% to 75%).
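The percentages above are simple shares of running words. As a quick check, and as a sketch of how such a share could be computed from a frequency table, the following uses a hypothetical helper `coverage_from_counts` with invented toy counts; only the three large numbers are taken from the text.

```python
def coverage_from_counts(corpus_counts, known_lemmas):
    """Share of running words in the corpus whose lemma is known."""
    total = sum(corpus_counts.values())
    known = sum(count for lemma, count in corpus_counts.items()
                if lemma in known_lemmas)
    return known / total if total else 0.0

# Toy frequency table (hypothetical counts) to show the helper at work.
toy_counts = {"δέ": 6, "θεά": 3, "Τηλέμαχος": 2, "ἑλώριον": 1}
toy_share = coverage_from_counts(toy_counts, {"δέ", "θεά"})
print(f"{toy_share:.0%}")

# The figures quoted in the text: running words understood / total.
share_before = 145_026 / 200_518
share_after = 150_848 / 200_518
print(f"{share_before:.1%} -> {share_after:.1%}")  # 72.3% -> 75.2%
```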

The problem is that many dictionary words that are common in Homer do not appear in Iliad 1. The following list shows the 20 most common such words in Homer that do not appear in the first book of the Iliad (with Telemachus, not surprisingly, at the top of the list).

	Tot.	Freq.	Lemma	Short Gloss (after Chicago Lemmas)
1	244	244	Τηλέμαχος	Telemachus
2	484	240	ἔγχος	a spear
3	716	232	μνηστήρ	a wooer
4	929	213	ξένος	a guest
5	1121	192	τεῦχος	a weapon
6	1311	190	κελεύω	to urge, command
7	1487	176	ἀλλήλων	of one another
8	1650	163	Ἄρης	Ares
9	1796	146	δόμος	a house; a course of stone
10	1932	136	ταχύς	quick
11	2067	135	ὀτρύνω	to stir up
12	2198	131	ἄστυ	a city
13	2328	130	πατρίς	fatherland
14	2457	129	δύω	dunk
15	2583	126	ἐκεῖνος	that one over there
16	2709	126	τῷ	therefore
17	2827	118	πάσχω	to experience
18	2943	116	δῆμος	people; (originally) a country-district
19	3059	116	ἀμφότερος	each of two
20	3173	114	πεδίον	a plain

Unsurprisingly, we do not find suitors (mnêstêres) or guests (xenos): these are, like the name Telemachus, more typical of the Odyssey. But Iliad 1 also lacks the common words for spears (enchos) and weapons (teuchos), even though both appear frequently in the Iliad as a whole.
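A list like the one above can be derived from two inputs: a frequency table for the whole corpus and the set of lemmas attested in Iliad 1. The sketch below assumes that the Tot. column is the running sum of the Freq. column, which is consistent with the rows shown. The function name `missing_common_lemmas` is hypothetical, and the five-entry frequency table is a tiny excerpt built from counts quoted in this post.

```python
from itertools import accumulate

def missing_common_lemmas(corpus_counts, book_lemmas, top_n=20):
    """Most frequent corpus lemmas absent from one book.

    Returns rows of (running_total, frequency, lemma), most frequent first.
    """
    absent = [(lemma, count) for lemma, count in corpus_counts.items()
              if lemma not in book_lemmas]
    absent.sort(key=lambda pair: pair[1], reverse=True)
    absent = absent[:top_n]
    totals = accumulate(count for _, count in absent)
    return [(total, count, lemma)
            for total, (lemma, count) in zip(totals, absent)]

# Excerpt of corpus frequencies quoted in this post.
corpus_counts = {"δέ": 12138, "Ἀχιλλεύς": 382,
                 "Τηλέμαχος": 244, "ἔγχος": 240, "μνηστήρ": 232}
# Lemmas assumed to occur in the book (both appear in the Iliad 1 list above).
book_lemmas = {"δέ", "Ἀχιλλεύς"}

table = missing_common_lemmas(corpus_counts, book_lemmas, top_n=3)
for total, count, lemma in table:
    print(total, count, lemma, sep="\t")
```

With this excerpt the output reproduces the first three rows of the list: 244, 484 and 716 as running totals.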

One could adopt a different approach and learn each common dictionary word in Homeric Epic, roughly starting from the top and working downwards. (You probably would start not only with the most common words but also with the most common regular nouns and verbs.) This would require looking at examples of Homeric Greek from outside of Iliad 1 to see how words such as mnêstêres and xenos are used. That would take away from the focus on Iliad 1 and/or require additional work. But it could produce a more balanced result.

Figure 8: two approaches to acquiring Homeric vocabulary

Figure 8 illustrates two approaches to learning Homeric vocabulary. The top and bottom lines show vocabulary acquisition for learners working through the vocabulary in the 77 lessons of Pharr. If learners mastered every single dictionary word in Iliad 1 (without setting aside words of lower frequency), they would have seen 76% of all words in Homeric Epic. The top line represents 100% coverage for Iliad 1.

The two middle lines reflect results from the top-down approach, where we learn words purely by frequency (hence these lines have very smooth curves). Instead of learning the 1118 separate dictionary entries covered in Pharr, we learn the most common 1000 dictionary entries in Homer. These two lines are much closer. Learners would have only mastered 85.5% of the words in Iliad 1 but they would have seen 83% of all words in Homeric Epic. A stricter adherence to the overall frequencies of Homeric vocabulary and less focus on the vocabulary of a particular passage might prove more satisfying and effective for learners. They would learn to recognize the 15% of vocabulary that was not on their list.
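The trade-off between the two approaches can be reproduced on a toy scale. The sketch below is illustrative only: the helper names `top_n_lemmas` and `coverage` are hypothetical, and the 13-word "corpus" is invented to show the pattern, not drawn from the treebank data.

```python
from collections import Counter

def top_n_lemmas(corpus_tokens, n):
    """The n most frequent lemmas in the corpus (top-down approach)."""
    return {lemma for lemma, _ in Counter(corpus_tokens).most_common(n)}

def coverage(tokens, known_lemmas):
    """Fraction of running words whose lemma is known."""
    return sum(lemma in known_lemmas for lemma in tokens) / len(tokens)

# Toy data: one "book" plus the rest of a tiny corpus (all hypothetical).
book = ["δέ", "θεά", "μῆνις", "δέ", "Ἀχιλλεύς"]
rest = ["δέ", "Τηλέμαχος", "ξένος", "δέ", "Τηλέμαχος", "θεά", "ξένος", "Τηλέμαχος"]
corpus = book + rest

by_book = set(book)                           # learn every lemma in the book
by_freq = top_n_lemmas(corpus, len(by_book))  # learn as many lemmas, by frequency

for name, known in (("book-first", by_book), ("top-down", by_freq)):
    print(name, f"{coverage(book, known):.0%}", f"{coverage(corpus, known):.0%}")
```

For the same number of lemmas learned, the book-first list covers the whole book but less of the corpus, while the frequency list gives up some coverage of the book in exchange for more of the corpus. That is the pattern the middle lines of Figure 8 show at full scale.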

Overall, the point is that learners can now track their progress and the future value of new vocabulary in real time. Our hypothesis is that this will increase motivation and satisfaction, keeping learners more fully engaged and for a longer period of time. A next step will be to design ways to test that hypothesis. First, however, we need to present, in the second part of this paper, another method by which to track seen and unseen vocabulary more precisely.

Acknowledgements

This work was made possible by the Beyond Translation Project, funded by NEH HAA-266462-19 and by support from the Data Intensive Studies Center at Tufts University.

Citations

Iliad Treebank: https://github.com/gregorycrane/gAGDT/blob/master/data/xml/tlg0012.tlg001.perseus-grc1.tb.xml

Odyssey Treebank: https://github.com/gregorycrane/gAGDT/blob/master/data/xml/tlg0012.tlg002.perseus-grc1.tb.xml

Pharr, Clyde. Homeric Greek. Rev. ed., D.C. Heath & Co., 1925.

Pharr, Clyde, et al. Homeric Greek: A Book for Beginners. Fourth edition, University of Oklahoma Press, 2012.

Schmitt, Norbert, Xiangying Jiang, and William Grabe. "Percentage of Words Known in a Text and Reading Comprehension." Modern Language Journal, vol. 95, Feb. 2011, pp. 26-43.

by Gregory Crane Posted on April 19, 2022

Posted in Ancient Greek | Tagged Homer, NLP, visualization | Comments Off
