In part one, I discussed the frequency of different vocabulary items and ways by which learners could track how many words they knew in a particular corpus as they went through some sequence of vocabulary acquisition. I focused upon the first book of the Homeric Iliad as an immediate target corpus and on how learning the vocabulary of this chunk would transfer when readers moved to other sections of the Iliad and Odyssey.
This second part focuses on the problem of visualizing what learners have and have not seen, what new vocabulary they have just encountered, and what new and future vocabulary is above or below some threshold separating common from uncommon words. The goal here is not to provide a finished design but to present a draft that can be developed further. The technology employed is relatively simple: the visualizations are implemented as Scalable Vector Graphics (SVG) objects. The next version will probably be in some dialect of JavaScript, but the figures below demonstrate one model by which to convey the more granular view of vocabulary.
I assume the ability to visualize aggregate changes in vocabulary knowledge as in Figure 1 (below):
Figure 1: increasing knowledge of vocabulary in Iliad 1 as a learner moves through the 77 chapters of Clyde Pharr’s Homeric Greek.
Figure 1 shows how learners working with Clyde Pharr’s Homeric Greek encounter 100% of all vocabulary items in Iliad 1 over the course of the book’s 77 chapters. Running words belonging to known vocabulary items are represented as green, new words as blue, unseen and common words as red, and unseen but uncommon words as orange.
Figure 2 illustrates how known, new, and unknown words are distributed throughout Iliad 1. I should emphasize that the visualization is itself a general tool designed to show the distribution of different phenomena across the 48 books of Homeric Epic with a particular focus on one book at a time. It could track any variables — e.g., dialogue vs. narrative, references to Achilles vs. Hector, or to optatives or particular clusters of repeated words. In this case, we just happen to be tracking and prioritizing what words learners have and have not seen.
Figure 2: Five different features about each word in book 1 of the Iliad (details below). Green dots represent vocabulary items that the learner has encountered in the opening 10 chapters of Clyde Pharr’s Homeric Greek
Figure 2 (above) provides a visualization of the Homeric Epic, with a focus on book 1 of the Iliad. The visualization is a simple mockup to explore features that a more mature and technologically sophisticated visualization would exhibit. It consists of an SVG figure in an HTML file. It offers limited interactivity: mousing over the representations of individual words and books of the Iliad and Odyssey causes information to appear. (In figure 1, there is information about the word boulê, “will, plan” (as in the “plan of Zeus”), which appears as the 8th and final word in line 5 of Iliad 1.)
The visualization contains two major parts. The left part, in a combination of black and grey, provides an overview of the Iliad and the Odyssey.
Figure 3: an overview of the 24 books each of the Iliad and the Odyssey, with Iliad 1 highlighted. The length of each column corresponds to the relative length of each book in lines.
The left side of the figure above provides a representation of each book of the Iliad and of the Odyssey, 48 books in all. The length of each column reflects the relative number of lines in each book. The book illustrated in the main part of the visualization is highlighted as black, with the other 47 books as gray.
Figure 4: Detail from Figure 1: color coding used to describe status
Colors were chosen based on the recommendations of Bang Wong [“Points of View: Color Blindness,” Nature Methods 8 (2011)] to minimize the impact of different forms of color blindness and thus to make the color palette of the visualization as accessible as possible. Colors describe six different features:
Vocabulary items that students have seen so far (in figure 1, the first 10 chapters of Pharr’s Homeric Greek) are lime green.
Vocabulary items introduced in the most recent lesson (chapter 10 of Pharr’s Homeric Greek) that appear more than 3 times in Iliad 1 appear as blue.
Vocabulary items introduced in the most recent lesson that appear 3 or fewer times appear as light sky-blue. The goal is to help learners prioritize the vocabulary on which they focus, emphasizing more active command for more common words.
Vocabulary items that learners have not encountered and that will appear more than 3 times in Iliad 1 appear as vermilion (a darker reddish color). Learners can skip ahead and see which frequent terms appear.
Vocabulary items that learners have not encountered and that will appear 3 or fewer times in Iliad 1 appear as orange.
Punctuation marks are listed as black. This allows readers to get an overview of sense breaks in the poetry (insofar as editors have chosen to represent these sense breaks with punctuation). At present, every form of punctuation is marked. It may well make more sense to include only full stops (and not commas) as a default.
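The six-way color coding above amounts to a small decision rule, which can be sketched as follows. This is only an illustration under stated assumptions: the function name, its arguments, and the toy counts are hypothetical and are not the code behind the figures.

```python
# Color names follow the description above; all identifiers are illustrative.
COLORS = {
    "known": "lime green",
    "new_common": "blue",
    "new_uncommon": "light sky-blue",
    "unseen_common": "vermilion",
    "unseen_uncommon": "orange",
    "punctuation": "black",
}

def classify(lemma, is_punct, known, new, book_counts, cutoff=3):
    """Return the status of one running word.

    known: lemmas met in earlier lessons.
    new: lemmas introduced in the most recent lesson.
    book_counts: occurrences of each lemma in the target book (here Iliad 1).
    A lemma counts as common when it occurs more than `cutoff` times.
    """
    if is_punct:
        return "punctuation"
    common = book_counts.get(lemma, 0) > cutoff
    if lemma in new:
        return "new_common" if common else "new_uncommon"
    if lemma in known:
        return "known"
    return "unseen_common" if common else "unseen_uncommon"

# Toy data: these counts are invented for the example, not taken from Iliad 1.
toy_counts = {"θεά": 5, "μῆνις": 2, "βουλή": 4, "ἑλώριον": 1}
status = classify("βουλή", False, known={"μῆνις"}, new={"θεά"}, book_counts=toy_counts)
print(status, "->", COLORS[status])
```

Checking the most recent lesson before the set of known words is a design choice: it keeps newly introduced vocabulary blue even if the same lemma has also been added to the learner's cumulative list.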
The main section of the visualization presents one small box for each word in Iliad 1. The length of each row reflects the number of words in each line. There are breaks between every 10 lines and each column contains up to 100 lines.
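The layout just described (one box per word, a break after every 10 lines, up to 100 lines per column) can be sketched in a few lines. The pixel sizes, the column width, and the function names below are assumptions made for the example; the only SVG features used are the standard `rect` element and a child `title` element, which browsers typically show as a tooltip on mouseover.

```python
BOX, GAP = 4, 1  # box size and spacing in pixels (arbitrary choices)

def word_boxes(words_per_line, lines_per_column=100, lines_per_group=10,
               column_width=80):
    """Yield (line_no, word_no, x, y) for every word box.

    words_per_line[i] is the number of words in line i + 1 of the book.
    """
    step = BOX + GAP
    for i, n_words in enumerate(words_per_line):
        col, row = divmod(i, lines_per_column)
        # leave one empty row after every group of ten lines
        y = (row + row // lines_per_group) * step
        for w in range(n_words):
            yield i + 1, w + 1, col * column_width + w * step, y

def to_svg(words_per_line, fill="green"):
    """Return a minimal SVG string with one rectangle per word."""
    rects = [
        f'<rect x="{x}" y="{y}" width="{BOX}" height="{BOX}" fill="{fill}">'
        f"<title>line {line}, word {word}</title></rect>"
        for line, word, x, y in word_boxes(words_per_line)
    ]
    return '<svg xmlns="http://www.w3.org/2000/svg">' + "".join(rects) + "</svg>"

# Usage: three short lines of 7, 8 and 6 words.
svg = to_svg([7, 8, 6])
```

In a fuller version the fill would come from each word's status rather than from a single default color.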
Figure 5: detail from figure 1, showing Iliad 1.1-14 and 101-114.
Figure 5 shows a close-up from figure 1. Almost every word in lines 1-5 is green (a known vocabulary item) or black (punctuation): Pharr begins with the start of Iliad 1 and moves through the poem a few lines at a time in each chapter. The one word that is unseen (word 3 in line 5) is a form of the adjective pas, “all, each.” The grammar has not introduced third declension adjectives yet and thus this word is not yet in the vocabulary (it was covered with a note to the text).
Figure 6: mousing over the dark blue dot, the reader sees that it corresponds to the Greek word Danaans, which appears 10x in Iliad 1.
Figure 6 illustrates information about a particular word. The reader has moused over the 4th word in Iliad 1.109, sees the text of this line, and discovers that it is a form of the name Danaans (one of several words to describe the Greeks) and that this word shows up 10 times in Iliad 1. A gloss from Pharr, the inflected form and a code for the part of speech (noun, plural, masculine, dative) follow.
Figures 7, 8, and 9 (below) show how learners’ knowledge of vocabulary changes as they move through the textbook, with snapshots showing where they stand.
Figure 7: known vocabulary after 20 chapters of Pharr.
Figure 7 shows the state after chapter 20 of Pharr. This chapter covers Il. 1.28-32. A cluster of dark and light blue dots (depending on their frequency) in these lines shows how new vocabulary presented in this chapter was designed to fill in the gaps for these new lines. Places where the vocabulary presented in chapter 20 appears later in book 1 also appear as light or dark blue dots.
Even at the relatively low resolution of the image above, a solid block of green now appears at 1.371-379 and that block shows, not only to those of us who have worked on Homeric poetry for years but also to those learning Homeric Greek, how the oral tradition works. Iliad 1.372-375 repeats 1.13-16; 1.376-379 repeats 1.22-25. Iliad 1.371 does not repeat 1.12 but every word in 1.371 has already occurred. We can also note that 1.371-379 repeats two different chunks that are separated earlier in book 1 by a speech: Chryses’ request that the sons of Atreus accept ransom for his daughter (1.17-21). We can see at work here how the tradition can change the length of an account by selectively adding or subtracting lines.
Figure 8: known vocabulary after 30 chapters of Pharr.
Chapter 30 covers Il. 1.81-85. Almost all of the words through line 82 appear as green. New vocabulary clusters in the lines covered in this chapter.
Figure 9: known vocabulary after 40 chapters of Pharr
Figure 9 shows the state in chapter 40, which covers Il. 1.158-164. Blue dots cluster around the seven newly introduced lines. At the same time, the share of green dots, representing known vocabulary, has increased dramatically since figure 1.
At some point, learners will presumably want to move on to another text beyond the first book of the Iliad. At the moment, we are able to recalculate what vocabulary readers will have seen if they move to a new book of Homer after any given chapter of Pharr.
Figure 10: known vocabulary in Iliad 16 after 40 chapters of Pharr.
Figure 11: known vocabulary in Odyssey 5 after 40 chapters of Pharr.
Work to be done
The images above are, as noted earlier, only mockups to help think through what a more finished system would look like, one that lets learners track what they (should) have learned and what they will need to learn going forward. I list below some of the more obvious topics.
Support for word senses: the visualizations above assume that once learners have encountered the piles of glosses and synonyms that Pharr offers for a word, they can recognize the meaning of that word in any subsequent passage. Since words have multiple meanings (and some words have very many), that assumption is clearly false. In fact, we have several machine-readable Greek-English dictionaries such as Liddell-Scott-Jones, Cunliffe’s Homeric dictionaries (including his dictionary of people and places), and the new Cambridge Greek Lexicon (which will hopefully appear on the Scaife Viewer in 2022). We spent a fair amount of time capturing the structure of the dictionary entries and giving each sense that had a label in the print lexicon a unique digital identifier. We could experiment with the benefits of linking to particular word senses and not just to the dictionary headwords.
Support for inflection: the visualizations above assume that once learners have encountered a dictionary entry, they can understand all of its forms. Greek verbs, however, have many different inflectional patterns, and traditional learners of Greek spend much of their time learning the many ways Greek verb forms can be generated. We actually have segmented words (as appropriate) into preverbs, augments, stems, and endings, with labels for different inflection categories (e.g., first vs. second declension). We could identify which forms of a new verb learners could already parse and which forms follow paradigms that they have not yet learned. In an earlier version of this work, I tracked which forms learners could parse at any given time but decided that the added complexity was not worthwhile (at least until we have a more customizable system). As it is, we have the inflectional analyses for every word in the Homeric Epics. Learners develop practice using lookup tools to parse unknown forms.
Relative frequency in each of the 48 books of Homeric Epic: The left-hand side macro-view should show how many words readers would know for each of the 48 books of Homeric Epic. Readers could see at a glance if some books stand out because they share more or fewer vocabulary items. Such a visualization requires normalization — the longest books of the Iliad are twice as long as the shortest books in the Odyssey.
Implementation as a fully interactive, browser-based visualization: The SVG implementation shown above does support some interaction, but a more mature implementation would allow for many different use cases and could be customized in many different ways.
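The per-book normalization mentioned in the list above is simple to sketch: instead of raw counts of known running words, report the share of each book's running words that belong to known dictionary entries, so that long and short books can be compared directly. The function name, the data layout, and the toy data are assumptions for the example.

```python
def known_share_by_book(books, known_lemmas):
    """Return, for each book, the share (0..1) of running words that are known.

    books maps a book label to the list of lemmas of its running words,
    in text order; known_lemmas is the set of lemmas the learner has met.
    Dividing by book length normalizes for books of very different sizes.
    """
    return {
        label: sum(lemma in known_lemmas for lemma in lemmas) / len(lemmas)
        for label, lemmas in books.items()
        if lemmas  # skip empty books to avoid division by zero
    }

# Toy data: a handful of lemmas standing in for whole books.
toy_books = {
    "Il. 1": ["μῆνις", "ἀείδω", "θεά", "δέ"],
    "Od. 5": ["δέ", "Τηλέμαχος"],
}
shares = known_share_by_book(toy_books, {"μῆνις", "θεά", "δέ"})
```

These shares could then drive the shading or height of the 48 columns in the left-hand overview.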
A great deal of work in digital philology goes into the backend processing of textual data and into the production of new conclusions. We also need a great deal of work on the design of very basic visualizations that can reveal through our highly developed visual abilities linguistic patterns that our ears will not catch and that our eyes will not see as they scan text a word at a time. We need — and we will ultimately have — new dashboards — aggregations with multiple visualizations at a glance — that allow us to work with our textual sources at scale. Once we happen upon visualizations that reach a certain level of functionality, the field will cluster around them. Soon they will be taken for granted as if they had always been there. We are, however, a long way from that state. My goal here is to push the process along, whether that leads to improving what I have presented or to a wholly different, and more effective, approach.
This paper is designed to be the first of two that explore the degree to which learners can track how much of the vocabulary as a whole in a target corpus they have encountered and to see the frequency in the rest of the corpus of each newly encountered vocabulary item. We focus here upon the Ancient Greek Iliad and the Odyssey, a corpus of just over 200,000 running words. Homeric Epic provides a useful starting point because a growing cluster of openly licensed digital resources for this corpus is available, including links from each form in the epic to a dictionary entry, the starting point for vocabulary analysis.
In other work, as part of this larger, NEH-funded effort that we call Beyond Translation, we are developing ways for learners to see any and all other passages where a new Greek word appears and thus to see for themselves what the word means, rather than depending upon textbook glosses — from the very first day, learners should be able to begin studying words in context by using translations aligned at the word and phrase level to the original as well as the linguistic analyses explaining the function of each word. This paper, however, focuses simply on how learners can organize vocabulary acquisition by corpus frequency and then how they can track their progress.
This paper also builds upon Clyde Pharr’s Homeric Greek (D.C. Heath & Co., 1925), a brilliant and innovative introduction to Ancient Greek, produced almost a century ago. Pharr introduces learners to the vocabulary of Iliad 1.1-5 on page 1 of his grammar and begins including small but gradually growing chunks of Iliad 1 from lesson 13 on, ultimately covering all 611 lines in this book by the end of lesson 77. Book 1 of the Iliad thus serves as the target corpus, and learners acquire a basic knowledge of Greek by learning as much as possible about the 4,563 running words in this book.
Researchers studying modern languages commonly report that readers need to understand at least 95% and preferably 98% of the running words in a text to gain adequate comprehension (e.g., Schmitt et al., 2011). The 95% threshold is often cited by professional teachers of Greek and Latin. Homeric Epic may be different and more amenable to reading before readers approach 95% vocabulary knowledge. Students of Greek commonly report that they can quickly begin to read Homeric Greek with speed and facility — that was certainly my personal experience when I read Homer intensively in the 1970s, long before any digital tools were available. It may well be that the formulaic nature of Homeric poetry means that it is easier to skip words when reading — there are certainly many common epithets for which the conventional translations are (in my view) guesses based on context.
Furthermore, when we explore Ancient Greek literature as a whole, we shift from genre to genre — from Homer to Attic Tragedy, from historians such as Herodotus (who write in the Ionic dialect) to Thucydides (who writes in Attic), from the comedies of Aristophanes to the dialogues of Plato. The various authors and genres differ in style and vocabulary. It is difficult to learn enough Ancient Greek from so many sources for any but the most experienced to expect they would understand 95-98% of the words in a text they had not seen. Most readers of Ancient Greek will depend upon using online dictionaries. My goal is to help learners to internalize as much knowledge as possible and to understand these sources with as much fluency, precision and pleasure as possible.
While Pharr brings his readers into contact with Homer as soon as he feels he can, he does not confine himself to authentic Greek text. Pharr includes exercises composed in artificial, textbook Greek and asks learners to produce exercises in prose, but the textbook vocabulary is almost entirely drawn from Iliad 1 and designed to help learners acquire both active command and passive recognition of Iliad 1.
Such exercises, of course, carefully track what learners have and have not learned in any given lesson. While John Wright and Paula Debnar completely rewrote the grammatical explanation in Pharr, they carefully preserved Pharr’s exercises (Pharr 2012). Ultimately, we should be able to generate such exercises automatically from an annotated corpus — we could almost certainly do so now, given the state of natural language processing — but for now we will build upon the manually constructed exercises that Pharr constructed a century ago.
I assume that learners are working to master an existing corpus, rather than to produce and understand an effectively infinite number of utterances in a living language. This use case can, of course, describe use cases with modern language where a learner wishes to understand a particular corpus — e.g., the screenplay of a Kurosawa movie or the lyrics of the Grateful Dead. They are able to measure their progress against a corpus that is, at least for the moment, relatively fixed.
Manually curated annotations listing the dictionary form, part of speech, and syntactic function for more than 1 million running words of Ancient Greek are available on GitHub from various openly licensed treebanks. This includes the c. 200,000 running words in the Iliad and the Odyssey. We can thus begin to measure factors such as how many dictionary words and how many inflected forms we find in Homeric Epic.
Vocabulary in the Homeric Epics
The precise counts will vary according to the edition chosen (the Perseus Treebank for Homer uses Allen’s 1920 Oxford Classical Text), to the way words are counted (the Perseus Treebank treats words such as mête (“neither”) as two words: mê (“not”) and te (“also, and”)) and to decisions as to whether particular words get assigned to one or to two or more dictionary entries. These factors are unlikely to change the picture that emerges in the figure below.
Figure 1: Relative number of dictionary entries and inflected forms in the Iliad and Odyssey
Figure 1 plots the contribution of each dictionary entry and inflected form to an understanding of Homeric Epic as a whole. It starts with the most common dictionary entries and inflected forms and then moves down the lists, adding the frequency of each new item and measuring how many running words in Homeric Epic we can now account for. Figure 1 drives home the fact that Greek is a highly inflected language, where each dictionary entry shows up, on average, in 3.6 different forms.
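The calculation behind such a curve is a running sum over frequencies sorted from most to least common. A minimal sketch follows, using the counts for the five most common dictionary entries and the corpus total reported in this paper; the function name is an assumption, and this is not the script that produced Figure 1.

```python
from itertools import accumulate

def cumulative_coverage(counts, total):
    """Return (rank, running_total, share_of_corpus) for each item.

    counts: frequency of each dictionary entry (or inflected form).
    total: number of running words in the whole corpus.
    """
    ranked = sorted(counts, reverse=True)  # most frequent first
    return [
        (rank, running, running / total)
        for rank, running in enumerate(accumulate(ranked), start=1)
    ]

TOTAL = 200_518                          # running words in Homeric Epic
top5 = [12138, 5870, 5283, 4322, 2871]   # δέ, ὁ, καί, τε, ἐγώ

for rank, running, share in cumulative_coverage(top5, TOTAL):
    print(rank, running, f"{share:.1%}")
```

Running the same function over the frequencies of inflected forms rather than dictionary entries gives the second curve.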
The table below shows the 20 most common forms in Homeric Epic and how much each contributes to the overall corpus of 200,518 running words. Most of the forms are indeclinable particles and prepositions. Only a few of this top 20 are, strictly speaking, inflected. Note that no attempt was made to normalize forms: thus kai (“and”) with a grave accent, for example, appears as a form distinct from kai with an acute accent, although the two are identical in meaning. Normalization might reduce the number of forms but would not significantly change the relationship between dictionary entries and forms in Figure 1.
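| Rank | Form | Dictionary entry | Gloss | Count | Running total | Cumulative share |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | δ’ | δέ | but | 7014 | 7014 | 0.03 |
| 2 | καὶ | καί | and | 4812 | 11826 | 0.05 |
| 3 | δὲ | δέ | but | 3461 | 15287 | 0.07 |
| 4 | τε | τε | and | 2766 | 18053 | 0.09 |
| 5 | δέ | δέ | but | 1640 | 19693 | 0.09 |
| 6 | μὲν | μέν | on the one hand | 1634 | 21327 | 0.10 |
| 7 | οὐ | οὐ | not | 1560 | 22887 | 0.11 |
| 8 | ἐν | ἐν | in | 1407 | 24294 | 0.12 |
| 9 | τ’ | τε | and | 1222 | 25516 | 0.12 |
| 10 | ὣς | ὡς | as | 1215 | 26731 | 0.13 |
| 11 | γὰρ | γάρ | for | 967 | 27698 | 0.13 |
| 12 | ἀλλ’ | ἀλλά | otherwise | 933 | 28631 | 0.14 |
| 13 | τὸν | ὁ | the | 904 | 29535 | 0.14 |
| 14 | ἐπὶ | ἐπί | on | 829 | 30364 | 0.15 |
| 15 | οἱ | ἕ | nodef | 804 | 31168 | 0.15 |
| 16 | αὐτὰρ | ἀτάρ | but | 759 | 31927 | 0.15 |
| 17 | μοι | ἐγώ | I (first person pronoun) | 745 | 32672 | 0.16 |
| 18 | δὴ | δή | [interactional particle: S&H on same page] | 741 | 33413 | 0.16 |
| 19 | οὔ | οὐ | not | 677 | 34090 | 0.17 |
| 20 | μιν | μιν | him | 644 | 34734 | 0.17 |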
From the table above, we can see that most of the most frequent forms in the corpus are not inflected and thus the ratio of 3.6 forms per dictionary entry underestimates how many different forms of many words audiences actually encounter. (When, in summer 2019, Ethan Yates generated the forms that Clyde Pharr listed in the paradigms for his Homeric Greek textbook, he produced a list of almost 14,000 different inflected forms, illustrating the various paradigms. If he had included every inflected participle, the list would have been substantially longer.)
The following table shows the counts for the 20 most common dictionary words in Homeric Epic.
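| Rank | Dictionary entry | Gloss | Count | Running total | Cumulative share |
| --- | --- | --- | --- | --- | --- |
| 1 | δέ | but | 12138 | 12138 | 0.06 |
| 2 | ὁ | the | 5870 | 18008 | 0.08 |
| 3 | καί | and | 5283 | 23291 | 0.11 |
| 4 | τε | and | 4322 | 27613 | 0.13 |
| 5 | ἐγώ | I (first person pronoun) | 2871 | 30484 | 0.15 |
| 6 | οὐ | not | 2695 | 33179 | 0.16 |
| 7 | εἰμί | to be | 2118 | 35297 | 0.17 |
| 8 | ἐν | in | 2076 | 37373 | 0.18 |
| 9 | ὅς | who | 2043 | 39416 | 0.19 |
| 10 | ὡς | as | 2007 | 41423 | 0.20 |
| 11 | σύ | you (personal pronoun) | 1900 | 43323 | 0.21 |
| 12 | μέν | on the one hand | 1872 | 45195 | 0.22 |
| 13 | ἄρα | particle: ‘so’ | 1772 | 46967 | 0.23 |
| 14 | τις | any one | 1457 | 48424 | 0.24 |
| 15 | ἄν | modal particle | 1449 | 49873 | 0.24 |
| 16 | ἀλλά | otherwise | 1430 | 51303 | 0.25 |
| 17 | γάρ | for | 1401 | 52704 | 0.26 |
| 18 | ἐπί | on | 1371 | 54075 | 0.26 |
| 19 | αὐτός | unemph. 3rd pers.pronoun; -self; [the] same | 1121 | 55196 | 0.27 |
| 20 | πᾶς | all | 1089 | 56285 | 0.28 |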
The figure above reports that 8,792 different dictionary words account for all 200,518 running words counted in the Perseus Treebank for Homeric Epic. For those who do not know what a power law is, the figures above provide a good example. The top 20 dictionary words (roughly 1 out of 440, or 0.23%) account for 56,285 running words, more than a quarter of the corpus as a whole (28%).
The top thousand dictionary entries (11% of the whole) account for 165,922 words (82% of the whole). The table below lists every 50th entry for perspective and to illustrate how much progress learners could make if they based the vocabulary on frequency. (Of course, they would probably use a modified strategy where they started with the most common nouns, verbs etc. rather than the most common forms).
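| Rank | Dictionary entry | Gloss | Count | Running total | Cumulative share |
| --- | --- | --- | --- | --- | --- |
| 1 | δέ | but | 12138 | 12138 | 0.06 |
| 50 | μή | not | 614 | 80123 | 0.39 |
| 100 | σφεῖς | personal and (ind.) reflexive pronoun | 294 | 100538 | 0.50 |
| 150 | προσεῖπον | to address | 190 | 112061 | 0.55 |
| 200 | ἄνωγα | to command | 145 | 120344 | 0.60 |
| 250 | πάσχω | to experience | 118 | 126800 | 0.63 |
| 300 | ἡμέτερος | our | 99 | 132248 | 0.65 |
| 350 | θέω | to run | 83 | 136814 | 0.68 |
| 400 | ὅστις | indef. relative or indirect interrogative | 73 | 140722 | 0.70 |
| 450 | ἕλκω | to draw | 65 | 144164 | 0.71 |
| 500 | πάλιν | back | 57 | 147208 | 0.73 |
| 550 | τέμνω | to cut | 51 | 149911 | 0.74 |
| 600 | ἔνθεν | whence; thence | 45 | 152289 | 0.75 |
| 650 | ἀλέομαι | to avoid | 42 | 154446 | 0.77 |
| 700 | φόβος | fear | 39 | 156458 | 0.78 |
| 750 | ὀρέγω | to reach | 36 | 158317 | 0.78 |
| 800 | ἄρειος | nodef | 34 | 160061 | 0.79 |
| 850 | δατέομαι | to divide among themselves | 31 | 161695 | 0.80 |
| 900 | χαμᾶζε | to the ground | 29 | 163190 | 0.81 |
| 950 | ἐρετμόν | oar | 27 | 164601 | 0.82 |
| 1000 | βλάπτω | to disable | 26 | 165922 | 0.82 |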
The model above makes at least one assumption that would be problematic in print culture: once a dictionary word is encountered, it assumes that all forms of that word will be understood. For traditional learners, that is a major problem with Greek verbs because much of first year Greek is usually devoted to the many arcane ways Greek verbs can be produced.
In a digital environment, however, readers can see the analysis of each and every verb form long before they choose to internalize the ability to generate those forms. In this scenario, learners focus early on the outlines of Greek grammar (e.g., system of tenses, moods and voices) so that they have an understanding of what, for example, an aorist optative contributes. Full annotation of source texts would include links to a grammar explaining the function of each and every word.
The Homer treebank will, for example, tell you that the Greek words “spear” and “shoulder” are both datives and modify the verb ballô, “to strike with a thrown object” (a verb that shows up 469 times in Homer), but it will not tell you that the first dative reflects instrumentality (you strike someone “with a spear”) while the second designates location (you strike someone “on the shoulder”). We need either to use more complex tags (such as my Tufts colleague Matthew Harrington has developed) or to create a separate annotation layer (as my collaborator Farnoosh Shamsian and I are doing).
The case explored below is based upon Clyde Pharr’s Homeric Greek (D.C. Heath & Co., 1925), a brilliant and innovative introduction to Ancient Greek, produced a century ago. It took a data-driven approach in its design when data had to be collected by hand. Pharr had two fundamental principles: (1) design the grammar so that learners engage directly with the corpus of interest as soon as possible and (2) use frequency data so that learners can engage with more frequent phenomena first.
There is, however, tension between the principle of engaging as quickly as possible with the source texts and that of focusing on the most common words. On the one hand, Pharr’s first year grammar introduces Iliad 1.1-5 in lesson 13, the beginning of week 5 if learners move through the book 3 chapters each week. Pharr has, however, been preparing learners for this from the first page of the textbook. Learners are asked to learn the Greek alphabet by spelling and pronouncing the dictionary entries for each word in Il. 1.1-5, each of which is offered along with a definition.
Figure 2: Pharr, chapter 1, dictionary forms and definitions for each word in Iliad 1.1-5
Pharr introduces the vocabulary for the first five lines of the Iliad on page 1. The immediate goal is for learners to use these words to practice reading the Greek alphabet. At the same time, they have an opportunity for incidental learning, as they are exposed to the definitions and to words that they will study in more detail much later (such as tithêmi, “put, place, cause”, the complex morphology of which is introduced in lesson 32).
Before addressing the question of word frequencies and vocabulary acquisition, I want to point out in passing that we could take a very different approach to having learners begin by sounding out words from Iliad 1.1-5 on day one. In a digital environment, for example, we can easily add sound as a guideline that learners can use as a starting point. We can certainly link a simple reading of the Greek words and definition. We can, however, also provide metrical analysis and reading as well.
Figure 3: David Chamberlain’s metrical analysis and performance of Iliad 1.1-15 on Hypotactic.
Figure 4: Chamberlain’s analysis and performance on the new Beyond Translation version of Perseus.
Figures 3 and 4 illustrate how data published under an open license can be available in multiple venues. The use of either venue would make possible a different approach for the beginning learner, one that made the sound of the dactylic hexameter an object of study from day 1. That would certainly add to the complexity of what can be a daunting start, but it would also enable learners to begin approaching Homeric Epic as poetry immediately. In a world of smart phones, learners can also listen to recordings anywhere and there is no reason why we cannot make the sound of Homeric poetry a fundamental part of their experiences. Many of us listen to music online in languages that we do not know and do not wait to study the grammar before listening to songs that capture our imagination.
Returning to the question of vocabulary acquisition, the focus on exhaustive coverage of particular passages does, however, come at a price. Any substantial passage in Homeric epic — and any regular text — will contain words that rarely appear. Those who follow Pharr’s exercises will find themselves practicing an active command of words such as proiaptô, “hurl forward, send forth,” and helôrion, “booty, prey, spoil,” even though, in the rest of the Iliad and Odyssey, proiaptô appears only 3 times elsewhere and helôrion is never seen again.
We can, however, now easily indicate to the learner how often each of the words presented in this list actually occurs in the Iliad and Odyssey.
| No. | Dictionary entry | Occurrences in Homeric Epic | Gloss (Pharr) |
| --- | --- | --- | --- |
| 1 | μῆνις | 16 | wrath, fury, madness, rage. |
| 2 | ἀείδω | 40 | sing (of), hymn, chant. |
| 3 | θεά | 199 | goddess. |
| 4 | Πηλείδης | 59 | son of Peleus, Achilles. |
| 5 | Ἀχιλλεύς | 382 | Achilles. |
| 6 | οὐλόμενος | 14 | accursed, destructive, deadly. |
| 7 | ὅς | 2043 | his, her(s), its (own). |
| 8 | μυρίος | 32 | countless, innumerable. |
| 9 | Ἀχαιός | 722 | Achaean, Greek. |
| 10 | ἄλγος | 92 | grief, pain, woe, trouble. |
| 11 | τίθημι | 377 | put, place, cause. |
| 12 | πολύς | 874 | much, many, numerous. |
| 13 | δέ | 12138 | post. conj. but, and, so, for. |
| 14 | ἴφθιμος | 44 | mighty, valiant, stout-hearted, brave. |
| 15 | ψυχή | 81 | soul, breath, life, spirit. |
| 16 | Ἅιδης | 2 | Hades, god of the lower world. |
| 17 | προιάπτω | 4 | hurl forward, send forth. |
| 18 | ἥρως | 114 | HERO, mighty warrior, protector, savior. |
| 19 | αὐτός | 1121 | self, him(self), her(self), it(self), same. (one), he, she, it. |
| 20 | ἑλώριον | 1 | booty, spoils, prey. |
We could easily identify dictionary words that fall below some threshold of frequency in Homeric Epic as a whole. Learners could prioritize active mastery of dictionary words that are above that threshold and passive mastery of uncommon words (such as helôrion). Figure 5 shows the progress that learners would make as they moved through the 77 chapters of Pharr.
Figure 5: Vocabulary acquisition in Iliad 1 following Pharr with frequency cutoff of 20+x in Homeric Epic.
Figure 5 divides words into four classes. The y-axis shows the number of running words in Iliad 1 while the x-axis tracks the 77 chapters in Pharr. The orange band at the top represents unseen words that fall below the selected threshold (here 20, so that orange words occur 19 times or fewer in Homeric Epic). The red band designates unseen vocabulary that meets the cutoff for common words (20 or more occurrences in Homeric Epic). The blue band shows how many new running words in Iliad 1 learners can understand after mastering the vocabulary in the current chapter. The green band then allows learners to see how many words they should be able to recognize. Following Pharr’s plan, they achieve 100% coverage for all words in Iliad 1 at the end of the final chapter.
The 4563 running words in Iliad 1 contain 1118 separate dictionary entries. If, however, we focus on dictionary words that occur 20 or more times in the Homeric epics, we reduce that total to 655 dictionary entries for active mastery. The other 463 dictionary entries only account for 593 running words (pictured in orange above), i.e., the vast majority of them only occur once in Iliad 1. Learners can concentrate on recognizing these less common words in context rather than on actively mastering them.
If we were to increase the cutoff point to dictionary words that appear 30 or more times in Homeric epic, we would reduce the number of words for active recall from 655 to 565: 90 dictionary entries from Iliad 1 occur between 20 and 29 times in Homeric Epic. Of the 4563 running words in Iliad 1, 730 would now fall into the less common category and could be recognized passively in context.
Figure 6: Vocabulary acquisition in Iliad 1 following Pharr with frequency cutoff of 30+x in Homeric Epic.
In increasing the cutoff from 20 to 30, we increase the number of words that we do not try to actively master, but Figure 6 shows that the change is not drastic — most readers will have to look carefully to see the difference from Figure 5.
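The cutoff itself is a one-line filter over corpus frequencies, which makes it easy to let learners or teachers choose their own threshold. The sketch below uses a few of the counts from the Iliad 1.1-5 vocabulary list above; the function name and the labels "active" and "passive" are assumptions for the example.

```python
def split_by_cutoff(corpus_freq, cutoff=20):
    """Split lemmas into (active, passive) sets by frequency in the whole corpus.

    Lemmas occurring `cutoff` or more times are candidates for active mastery;
    the rest can be recognized passively in context.
    """
    active = {lemma for lemma, n in corpus_freq.items() if n >= cutoff}
    passive = set(corpus_freq) - active
    return active, passive

# Occurrences in Homeric Epic for six words from Iliad 1.1-5.
sample = {
    "μῆνις": 16, "ἀείδω": 40, "θεά": 199,
    "μυρίος": 32, "προιάπτω": 4, "ἑλώριον": 1,
}

active_20, passive_20 = split_by_cutoff(sample, cutoff=20)
active_35, passive_35 = split_by_cutoff(sample, cutoff=35)
```

With this small sample, raising the cutoff from 20 to 35 moves only μυρίος (32 occurrences) from the active to the passive set, which mirrors the modest difference between the two figures.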
The problem with focusing on Iliad 1 (or on any one book of the Iliad or the Odyssey) is that we will miss many words that are common in Homeric Epic as a whole but do not appear in our chosen book.
Figure 7: learning the vocabulary of Iliad 1 vs. that of the Homeric Epics as a whole
Whereas the green region in Figures 5 and 6 ultimately covers every single word in Iliad 1, being able to recognize every vocabulary item in Iliad 1 only prepares learners to recognize 145,026 of the 200,518 (72%) running words in the Homeric Epics as a whole.
If we return to the cutoff of 20 occurrences in Homer, 655 lemmas fall into this category. Adding these extra 90 dictionary entries for active mastery raises the number of running words in Homeric Epic that learners can understand from 145,026 to 150,848 (from 72% to 75%).
The problem is that many dictionary words that are common in Homer do not appear in Iliad 1. The following list shows the 20 most common such words in Homer that do not appear in the first book of the Iliad (with Telemachus, not surprisingly, at the top of the list).
| # | Tot. | Freq. | Lemma | Short Gloss (after Chicago Lemmas) |
|---|------|-------|-------|------------------------------------|
| 1 | 244 | 244 | Τηλέμαχος | Telemachus |
| 2 | 484 | 240 | ἔγχος | a spear |
| 3 | 716 | 232 | μνηστήρ | a wooer |
| 4 | 929 | 213 | ξένος | a guest |
| 5 | 1121 | 192 | τεῦχος | a weapon |
| 6 | 1311 | 190 | κελεύω | to urge, command |
| 7 | 1487 | 176 | ἀλλήλων | of one another |
| 8 | 1650 | 163 | Ἄρης | Ares |
| 9 | 1796 | 146 | δόμος | a house; a course of stone |
| 10 | 1932 | 136 | ταχύς | quick |
| 11 | 2067 | 135 | ὀτρύνω | to stir up |
| 12 | 2198 | 131 | ἄστυ | a city |
| 13 | 2328 | 130 | πατρίς | fatherland |
| 14 | 2457 | 129 | δύω | dunk |
| 15 | 2583 | 126 | ἐκεῖνος | that one over there |
| 16 | 2709 | 126 | τῷ | therefore |
| 17 | 2827 | 118 | πάσχω | to experience |
| 18 | 2943 | 116 | δῆμος | people; (originally) a country-district |
| 19 | 3059 | 116 | ἀμφότερος | each of two |
| 20 | 3173 | 114 | πεδίον | a plain |
Unsurprisingly, we do not find suitors (mnêstêres) or guests (xenos): these are, like the name Telemachus, more typical of the Odyssey. But Iliad 1 also lacks common words for spears (enchos) and weapons (teuchos) that appear frequently elsewhere in the Iliad.
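A list of this kind (rank, running total, frequency, lemma) can be produced with a short routine. The sketch below is a minimal illustration, not the program behind the table: the function name `common_words_missing_from` is hypothetical, the first three frequencies are taken from the table above, and the count for θεός is an invented placeholder.

```python
from itertools import accumulate

def common_words_missing_from(book_lemmas, corpus_counts, top_n=20):
    """Return (rank, running_total, frequency, lemma) rows for the most
    frequent corpus lemmas that never occur in the given book."""
    seen = set(book_lemmas)
    # Sort by descending frequency, breaking ties alphabetically by lemma.
    missing = sorted(
        ((count, lemma) for lemma, count in corpus_counts.items() if lemma not in seen),
        key=lambda pair: (-pair[0], pair[1]),
    )[:top_n]
    running_totals = accumulate(count for count, _ in missing)
    return [
        (rank, total, count, lemma)
        for rank, (total, (count, lemma)) in enumerate(zip(running_totals, missing), start=1)
    ]

corpus_counts = {
    "Τηλέμαχος": 244,  # from the table above
    "ἔγχος": 240,      # from the table above
    "μνηστήρ": 232,    # from the table above
    "θεός": 1000,      # placeholder count (assumption)
}
book_lemmas = ["θεός", "θεός"]  # stand-in for the running words of one book

for row in common_words_missing_from(book_lemmas, corpus_counts):
    print(row)
```

With these inputs the running totals come out as 244, 484, and 716, matching the "Tot." column in the first three rows of the table.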
One could adopt a different approach and learn each common dictionary word in Homeric Epic, roughly starting from the top and working downwards. (You probably would start not only with the most common words but also with the most common regular nouns and verbs.) This would require looking at examples of Homeric Greek from outside of Iliad 1 to see how words such as mnêstêres and xenos are used. That would take away from the focus on Iliad 1 and/or require additional work. But it could produce a more balanced result.
Figure 8: two approaches to acquiring Homeric vocabulary
Figure 8 illustrates two approaches to learning Homeric vocabulary. The top and bottom lines show vocabulary acquisition for learners working through the vocabulary in the 77 lessons of Pharr. If learners mastered every single dictionary word in Iliad 1 (without setting aside words of lower frequency), they would have seen 76% of all words in Homeric Epic. The top line represents 100% coverage for Iliad 1.
The two middle lines reflect results from the top-down approach, where we learn words purely by frequency (hence these lines have very smooth curves). Instead of learning the 1118 separate dictionary entries covered in Pharr, we learn the most common 1000 dictionary entries in Homer. These two lines are much closer. Learners would have only mastered 85.5% of the words in Iliad 1 but they would have seen 83% of all words in Homeric Epic. A stricter adherence to the overall frequencies of Homeric vocabulary and less focus on the vocabulary of a particular passage might prove more satisfying and effective for learners. They would learn to recognize the 15% of vocabulary that was not on their list.
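The trade-off between the two approaches can be made concrete with a toy comparison. The sketch below assumes a lemmatized corpus held as a plain list; the names `coverage` and `top_n_lemmas` and the single-letter "lemmas" are hypothetical stand-ins, not real Homeric data.

```python
from collections import Counter

def coverage(known_lemmas, running_lemmas):
    """Fraction of running words whose lemma is in known_lemmas."""
    if not running_lemmas:
        return 0.0
    hits = sum(1 for lemma in running_lemmas if lemma in known_lemmas)
    return hits / len(running_lemmas)

def top_n_lemmas(running_lemmas, n):
    """The n most frequent lemmas in a list of running words."""
    return {lemma for lemma, _ in Counter(running_lemmas).most_common(n)}

# Toy corpus: "book" is one section, "rest" stands in for everything else.
book = ["a", "b", "a", "c"]
rest = ["d", "d", "d", "e", "d", "b"]
corpus = book + rest

book_first = set(book)              # learn every lemma that occurs in the book
top_down = top_n_lemmas(corpus, 3)  # learn the 3 most frequent lemmas overall

for name, known in (("book-first", book_first), ("top-down", top_down)):
    print(name, coverage(known, book), coverage(known, corpus))
```

In this toy case the book-first learner covers all of the book but only half of the corpus, while the top-down learner, with the same number of lemmas, covers less of the book and more of the corpus. That is the same pattern, in miniature, that separates the outer and middle lines in Figure 8.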
Overall, the point is that learners can now track their progress and the future value of new vocabulary in real time. Our hypothesis is that this will increase motivation and satisfaction, keeping learners more fully engaged and for a longer period of time. A next step will be to design ways to test that hypothesis. First, however, we need to present, in the second part of this paper, another method by which to track seen and unseen vocabulary more precisely.
This paper makes a simple, but significant, observation. Vocabulary keeps growing in any corpus: there is no final, fixed set of words. That phenomenon appears with any natural language corpus. Here I emphasize the significance for students of Homeric epic. The Homeric Iliad and Odyssey contain about 200,000 running words and we can see how the number of dictionary entries (e.g., anêr, ‘man’) and of inflected forms derived from dictionary entries (e.g., andros, ‘of a man,’ andri) increase slowly but continuously: 8,792 dictionary entries appearing as 31,664 different inflected forms account for the 200,581 running words that appear in the Perseus Dependency Treebanks of the two epics.
Figure 1: Dictionary entries and inflected forms in the Iliad and Odyssey.
First, readers should pause and take note of the fact that the two curves in Figure 1, smooth though they are, were not produced by a mathematical formula. These are individual data points. A program sorted the dictionary entries and inflected forms in the Homeric Epic by frequency, started with the most frequent, then worked its way down the list, adding the frequency of each new term to the running total. There is an underlying pattern to vocabulary frequencies such that we generate curves that exhibit no outliers at this resolution. This is not a scatterplot that we have fitted to a curve. This figure shows two series of linguistic data that follow an extraordinarily regular pattern.
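The procedure just described (sort by frequency, then keep a running total) is short enough to sketch directly. The function name `cumulative_coverage` and the six-word sample are illustrative assumptions; the original program is not reproduced here.

```python
from collections import Counter
from itertools import accumulate

def cumulative_coverage(tokens):
    """Sort distinct items by descending frequency and return the running
    total of occurrences accounted for after each item is added."""
    frequencies = sorted(Counter(tokens).values(), reverse=True)
    return list(accumulate(frequencies))

# Tiny illustrative sample; a real run would use every lemma or form in the epics.
tokens = ["καί", "δέ", "καί", "ἀνήρ", "καί", "δέ"]
curve = cumulative_coverage(tokens)
print(curve)  # [3, 5, 6]
```

Plotting such a list against the rank of each item gives one curve of the kind shown in Figure 1. Running it once on dictionary entries and once on inflected forms would give the two series.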
Heaps’ Law is an empirical law that provides a mathematical model for the growth of vocabulary in a corpus and for the growth of dictionary entries and of forms through the epics. The mathematical model offered by Heaps’ law is based on the observation that vocabulary growth follows a power law, i.e., a few very common entries account for a disproportionate percentage of total words in a corpus. As we move through the corpus, the number of new dictionary entries and inflected forms slows but it does keep growing. If we had 500,000 running words, we would have more dictionary entries and inflected forms; if we had 1,000,000 running words, more still, and so on. Many words that appear only once in the first 200,000 words would reappear.
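Heaps’ law is usually written V(n) = K · n^β, where V(n) is the number of distinct items seen after n running words and K and β (with 0 < β < 1) are constants estimated separately for each corpus. The sketch below shows one simple way to estimate them, a least-squares line through the points in log-log space. It is an assumption about method for illustration only, not the procedure used for the figures here, and the helper names are hypothetical.

```python
import math

def vocabulary_growth(tokens):
    """Return [(n, V(n))]: distinct items seen after the first n tokens."""
    seen, points = set(), []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        points.append((n, len(seen)))
    return points

def fit_heaps(points):
    """Least-squares fit of log V = log K + beta * log n; returns (K, beta).

    Needs at least two points with different n, otherwise the slope is undefined.
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta

# Toy token stream; a real estimate needs the full sequence of running words.
tokens = "a b a c a b d a e a".split()
K, beta = fit_heaps(vocabulary_growth(tokens))
print(round(K, 3), round(beta, 3))
```

On a ten-token toy stream the fitted values mean little. The point is only that the exponent β summarizes how quickly new vocabulary keeps arriving: the closer it is to 1, the less the curve flattens.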
Figure 2: New vocabulary encountered in each 5,000-word chunk of the Iliad and the Odyssey.
Figure 2 illustrates the number of new dictionary entries encountered in each 5,000 word chunk of the Iliad and the Odyssey. The sequence above departs from the traditional ordering of books. It scans the books of the Iliad and Odyssey in this order: books 1, 10-19, 2, 20-24. Chunk 12 above contains an unexpected spike. Students of Homer may well guess that chunk 12 contains the catalogue of ships (hence, the unexpectedly large new vocabulary reflects a surge in the number of new nouns).
Chunk 10, which exhibits a smaller, but sudden, spike, covers Iliad 18.28-19.107 and, thus, the description of Hephaestus’ workshop and Achilles’ shield, both of which would plausibly introduce new vocabulary, including more verbs as well as nouns.
By contrast, chunk 9 drops precipitously after the first 8 chunks had declined according to a fairly clear pattern. This contains Iliad 16.259-17.95, most of the aristeia and death of Patroclus. At this point, the narrative may be reprising, and pointing backwards towards, language and formulae that it had cultivated in the preceding battle books.
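The chunk counts discussed above come from a simple scan: walk through the running words in fixed-size chunks and count the dictionary entries not seen in any earlier chunk. The following is a minimal sketch of that idea; the function name and the toy lemma list are illustrative assumptions, and a real run would use the lemmatized text of the epics in the chosen book order with a chunk size of 5,000.

```python
def new_vocabulary_per_chunk(lemmas, chunk_size=5000):
    """Count the lemmas first encountered in each consecutive chunk of running words."""
    seen = set()
    counts = []
    for start in range(0, len(lemmas), chunk_size):
        chunk = set(lemmas[start:start + chunk_size])
        new = chunk - seen          # lemmas not met in any earlier chunk
        counts.append(len(new))
        seen |= new
    return counts

# Toy input with a chunk size of 3: chunks are [a, b, a], [c, a, b], [d, a].
lemmas = ["a", "b", "a", "c", "a", "b", "d", "a"]
print(new_vocabulary_per_chunk(lemmas, chunk_size=3))  # [2, 1, 1]
```

Because the count depends on what has already been seen, reordering the books changes the height of individual bars, which is why the position of a passage in the scan affects how large its spike appears.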
If we reverse the order and start with the Odyssey, the basic picture remains the same.
Figure 3: counting new vocabulary beginning with the Odyssey and then scanning the Iliad.
In Figure 3, the bulge caused by the catalogue of ships in Iliad 2 is less pronounced, both because it is counted later (when some of the place names have been encountered) and because it is split between two of the 5,000-word chunks (29 and 30). The point of both Figures 2 and 3 is to show that we continue to encounter new vocabulary in the last 5,000-word chunk, even after analyzing the previous 195,000 running words.
What are the new words that we encounter after reading through 195,000 words of Homeric poetry? If we start with the Iliad and follow the order of books as described above, the final chunk begins with Odyssey 8.399 and the first 20 (of 41) unseen words are:
| # | Greek word | short definition |
|---|------------|------------------|
| 1 | πρόειμι | go forward |
| 2 | νεόπριστος | fresh-sawn |
| 3 | ὕμνος | a hymn |
| 4 | ἐπαρτύω | to fit on; fix; prepare |
| 5 | αὐτόδιον | straightway |
| 6 | οἰνοποτήρ | a wine-drinker |
| 7 | βιώσκομαι | to quicken |
| 8 | ἀποπροτέμνω | to cut off from |
| 9 | ἔμμορος | partaking in |
| 10 | μεταβαίνω | to pass over from one place to another |
| 11 | δουράτεος | of planks |
| 12 | ἀκρόπολις | the upper city |
| 13 | ἐκπρολείπω | to forsake |
| 14 | ἀμφιπίπτω | to fall around |
| 15 | εἴρερος | bondage |
| 16 | εἰσανάγω | to lead up into |
| 17 | ἐπιψαύω | to touch on the surface |
| 18 | ἀνώνυμος | without name |
| 19 | κυβερνητήρ | pilot |
| 20 | Νήριτον | Neritus (a personal name) |
Of these twenty words, only one is a proper name, i.e., new vocabulary does not mainly consist of unfamiliar people or places.
If we scan the Odyssey first, then the first 20 unseen words in our final chunk are:
| # | Greek word | short definition |
|---|------------|------------------|
| 1 | μέσφα | until |
| 2 | πολιοκρόταφος | with gray hair on the temples |
| 3 | θεόδμητος | god-built |
| 4 | ὑγιής | sound |
| 5 | κηρεσσιφόρητος | urged on by the Κῆρες |
| 6 | φυλάζω | to divide into tribes |
| 7 | νήνεμος | without wind |
| 8 | Θρῄκηθεν | (no definition given) |
| 9 | κορθύνω | to lift up |
| 10 | φῦκος | rouge |
| 11 | κλήδην | by name |
| 12 | Μυκήνηθεν | from Mycene |
| 13 | ἀφρήτωρ | without brotherhood |
| 14 | ἀνέστιος | without hearth and home |
| 15 | ὑποδεξίη | the reception of a guest |
| 16 | σέυω | to rush |
| 17 | Δηίοπιτης | Deiopites (a personal name) |
| 18 | ἀπομυθέομαι | to dissuade |
| 19 | ἀλήϊος | without share of booty |
| 20 | ἀκτήμων | without property |
This chunk begins at Iliad 8.508. It also contains only one unseen proper name.
We cannot talk about what words and expressions were active in the oral tradition. We cannot say that lines with no parallels are not formulaic, i.e., that they were not learned and repeated in multiple contexts. We can only say that we have no parallels. We can, if we can combine a good understanding of probability with a solid grounding in how poetry works, begin to build more sophisticated models about what might appear if we had more epic on more subjects or more epic poetry on subjects that are not common in the current corpus (e.g., night raids such as Iliad 10). But then we are making our assumptions and our models explicit.
If you are an accomplished performer, you should be able to begin performing Greek poetry and prose, with an understanding of every syllable of what you are reading, by the end of one semester. I base that on the preliminary results that my collaborator Farnoosh Shamsian observed after 30 hours of teaching Homeric Greek to Persian-speaking students. We desperately need passionate and compelling performances of Greek and other languages to bring these sources to life. We can use podcasts and YouTube videos to reach a global audience. We have compelling sources. We need performances in different voices by people from different backgrounds.
Performances would be published in the new “beyond-translation” version of Perseus, alongside exhaustively annotated editions of the source text. Ideally, performers would create performances in both Greek and English, with viewers able to go back and forth over the same statements in each language, comparing performances in both languages at the word and phrase level.
We could start with the Antigone — we have a lot of annotation for this play and are hoping to make it possible for learners to internalize all the grammar that they need to understand every last word in the play (insofar as anyone understands it). But we also have a lot of poetry and prose with dense linguistic annotation (treebanking) that would allow readers to get past a translation and quickly (one semester) begin to see how the Greek works.
In print culture, we trained people to work with inert editions available on paper pages that could not interact or answer questions. Now the dominant use case must be to train learners how to use a growing network of annotations to go back and forth between one or more translations and the source text (in multiple editions if that should be relevant). Learning how to perform Sophocles or Sappho is doable. Those who wish to do so or who wish to go on to study ancient languages professionally can internalize more.
Will people learn less over time? Will they have to go back and relearn things more thoroughly if they start pragmatically? Or will the incidental learning that they acquire from interacting with Greek early and often provide a broader foundation on which more active mastery of traditional paradigms and production will be more firmly grasped? Or will we simply have more people who get started and don’t quit after plodding through made-up exercises and learning paradigms and vocabulary that they may rarely see in their actual reading?
At Tufts University, the course on Classical Historians (Classics 141; details in the departmental course booklet) will focus on Classical Arabic sources composed in, and about, pre-colonial West Africa. While we will consider Arabic sources produced outside of West Africa and accounts of European travelers, we will focus primarily on two different historical sources from West Africa itself: the Tārīkh al-Sūdān and (what has traditionally been called) Tārīkh al-Fattāsh. Our goal is not just to learn about the Mali and Songhai empires but to use what we learn to create openly licensed, digital sources of various kinds that will help others explore a major historical period that has attracted far too little attention in teaching and research.
Students will have an opportunity to explore emerging, digitally enabled methods by which global audiences can begin exploring the human record. In particular, we will exploit techniques by which we can begin to make the Arabic source text itself accessible to a general audience. We will begin publishing sections of these sources in the new version of the Perseus Digital Library that we are developing with support from the NEH. The development site for this is Beyond Translation and will be augmented between now and the fall semester.
Figure 1: Conclusion of an unpublished historical source in Arabic from Mali, preserved by the Yaro family collection and hosted by the British Library – one of more than 2,000 West African manuscripts that the British Library has made available.
The course itself will meet during Tufts’ fall semester Monday evenings from 6:00-8:30. Space allowing, we hope to see students from other institutions participate, whether by direct cross-registration or by getting credit through a directed study authorized by a faculty member at their own institution.
We will also offer a weekly reading group for those who wish to go over sections of the Arabic. This can be taken as an optional addition to the Monday class or as a separate class. The Arabic reading group would be 1 credit (vs. 3 for the Monday class). Any students taking both would receive 4 credits.
During the summer, I will also be working on the digital edition of these two histories and of other sources. If others are interested in learning more and in possibly contributing, they should contact me. There are a number of ways to contribute that match a range of skillsets. The basic requirement would be an ability to read English carefully but there are also clearly opportunities for those with knowledge of French, of various aspects of Computer and Data Science, and of Classical Arabic.
I am hoping this summer to resuscitate my own Arabic and to see how far that helps me with the language of these Islamic scholars from Timbuktu. I will be using tools such as the suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi to extend my (not very advanced) knowledge of Arabic. The goals are to create exhaustive annotations, including translations aligned at the word and phrase levels, for (1) a small but extensible set of passages and (2) sets of sentences that allow readers to trace the meaning of Arabic words which cannot properly be translated.
The larger goal of this class and the larger project that it represents is to create openly licensed digital materials that are not only of immediate use but that also can be modified and, wherever possible, reused under a Creative Commons CC-BY license. Such a license is not suitable for publications that seek to represent a particular scholarly voice at a particular time. We are, however, managing the sources in Github and so each particular contribution is recorded in the versioning history. We are supporting a collaborative model of authorship that may be familiar to readers from Wikipedia. That said, individuals will be able to use the Github versioning history to document what they have done and to create hybrid publications that contain their own accounts of what they did and what critical decisions they needed to take.
The Tārīkh al-Sūdān and the Tārīkh al-Fattāsh
The fall course will focus primarily on two histories — tarikh, pl. tewarikh — composed by Islamic scholars in West Africa.
(1) The Tārīkh al-Sūdān (TS) by al-Sa’di (1594-1655+ CE) focuses on the history of the Songhay Empire from the mid-fifteenth century until 1591 and then the Moroccan invasions and subsequent administration down to 1655. Houdas (Sa’dī 1900 and 1900a) published the Arabic text and a French translation. In 1999, John Hunwick published a scholarly translation with notes that covered the 28 of 35 chapters most relevant to Timbuktu and the Songhay Empire.
Figure 2: Map of places mentioned in Chapter 5 of the Tarikh al-Fettash: interactive map developed with the Tufts Datalab.
(2) Thirteen years later, Delafosse (1913 and 1913a) published the Arabic text and a French translation for a far more challenging source that has been known as the Tārīkh al-Fattāsh (TF), the “Chronicle of the Researcher,” an account of the Songhay Empire through 1599 that thus includes the early years of the Moroccan occupation. Almost a century later, Wise and Taleb (2011) published an English translation of the Tārīkh al-fattāsh based on Delafosse’s French translation and the Arabic text. Tārīkh al-fattāsh is a novel chronicle written in the 19th century, and not the effort of three generations of scholars who worked on it starting from the early 16th century and eventually interpolated in the 19th century, as previously advanced by most scholars. This 19th century TF was composed by substantially reworking an anonymous 17th century work. With support from the NEH Translation Program, Mauro Nobili and Ali Diakite are publishing a new edition of this work that contains an English translation, the Arabic text and clear indications of how the 16th and 19th century texts relate to each other.
We will, however, also consider other sources. We have a reasonably accurate transcription for the Arabic of Ibn Battuta’s description of West Africa and an accompanying French translation, long in the public domain. We can use DeepL or Google Translate to create a quick first English version and then edit this as we align it to the Arabic original.
Where the project stands.
Work on this project began in summer 2021, when Ayah Aboelela (UMass Boston CS ’21) led preliminary work. We found not only PDF versions of the public domain Arabic/French editions of the TS and TF but also text automatically generated by Optical Character Recognition.
The French OCR-generated text was good enough as a base for further work. We applied DeepL and Google Translate to convert the French into English. The results were surprisingly good: we did find ourselves making occasional changes to the English but such correction did not materially add to the overall work of adding base TEI XML tags, marking footnote markers in the text and footnotes on the bottom of the page, adding occasional Arabic words in the notes, and adding Arabic numbers in the translation that pointed to the corresponding pages in the Arabic edition. Readers can examine a sample of such work, chapters 21-22 of the TS on Github.
The Arabic text was more problematic and, in two different OCR-workflows, had a character error rate of c. 5%. The text is still sufficiently accurate to support a range of text mining (such as topic modeling and text reuse detection). It is also good enough as a starting text if the goal is to add extensive annotation to relatively short passages and to create an initial reader. Ayah produced initial versions of curated passages where she took the time to correct errors in the OCR-generated text.
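For readers unfamiliar with the measure, a character error rate is typically computed as the edit distance between the OCR output and a hand-corrected reference, divided by the length of the reference. The sketch below illustrates that definition only. It is not the evaluation code used in the project, the two-word Arabic sample with one substituted letter is constructed for the example, and it compares raw code points, so differences in diacritics or Unicode normalization would count as errors.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # reference character missing from the output
                current[j - 1] + 1,      # extra character in the output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]

def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the corrected reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

reference = "تاريخ السودان"   # hand-corrected text (illustrative)
hypothesis = "تاريح السودان"  # simulated OCR output with one wrong letter
print(character_error_rate(reference, hypothesis))
```

A rate of about 5% means roughly one wrong character in every twenty, which in practice touches a large share of words. That is why the text is usable for aggregate text mining but needs correction before close reading.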
A few thousand words of carefully edited Arabic with aligned translation and annotation would be a useful start and allow readers at an intermediate level to familiarize themselves with the style and content of these sources before moving to passages without curated annotations and translations.
Ayah published exploratory work using natural language processing tools for Arabic available from spaCy and CAMeL on Github. She tested services for morphological analysis and disambiguation and dependency parsing for the Arabic.
Figure 4: Dependency parsing of Arabic text from the Chronicles, produced by Stanza and visualized with displaCy (NB: this system writes text out left to right rather than right to left).
For named entity recognition (NER) (determining whether names are people, places or groups), we decided to apply tools from spaCy to English generated from the French with machine translation. A relatively modest amount of training improved the accuracy of NER from 60% to 82%.
Figure 5: NER visualization, using spaCy’s visualizer displaCy, on passages from the chronicles
Not only do these two histories constantly refer to places with which most readers in the US are probably not familiar, but there are many personal names, often complex in form, and rarely familiar. Identifying these names, creating links between related characters, and linking this information directly to the source text will, we hope, make it easier for readers to see who is who and to identify characters on whom they should focus their attention.
Figure 6: Social networks for Roman history developed by Zach Sowerby, MA student in Digital Humanities in the Tufts Classics Department
Figure 6 illustrates social networks derived from primary sources about Roman history. We use simple collocation to posit connections. Our hope is to include more complex information about relationships (son-of, occupation, etc.). Zach Sowerby was working with a much larger text corpus than the two histories and his relatively rapid prototyping already reveals basic patterns of who is important and which characters are connected. We can get useful information starting with this fairly rapid work.
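Positing connections by simple collocation can be sketched as counting how often two names occur in the same passage. The example below is a hypothetical illustration of that idea rather than the code behind the figure: the function name, the choice of a passage as the unit, and the placeholder names are all assumptions, and it presumes the names in each passage have already been identified.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(passages):
    """Count how often each pair of names is mentioned in the same passage.

    passages -- iterable of lists of names already identified in each passage
    Returns a Counter mapping (name_1, name_2) pairs to shared-passage counts.
    """
    edges = Counter()
    for names in passages:
        # Deduplicate and sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(names)), 2):
            edges[pair] += 1
    return edges

# Placeholder names; real input would come from NER over the source text.
passages = [
    ["Person A", "Person B"],
    ["Person A", "Person B", "Person C"],
    ["Person C"],
]
for pair, weight in cooccurrence_edges(passages).most_common():
    print(pair, weight)
```

The resulting weighted pairs can be handed to any graph drawing tool. Typed relationships such as son-of or occupation would need richer annotation than co-occurrence alone can supply.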
In fall 2021, I taught the first iteration of this class. We spent much of our time reading Michael Gomez’s 2018 book African Dominion, now the standard English account of early and medieval West Africa and then examining the accounts in the two histories on which Gomez bases much of his work (and which he carefully cites). The map in Figure 2 (above) was produced for this class by members of the Tufts Data Lab as a model for additional student work. I will be adding results from student projects in this class during the summer.
Let’s see what we can do in the summer and then in the fall!
The study of Greco-Roman culture can exert a purposeful and transformative role in Europe’s development of a more just multinational and multiethnic society. This is a topic about which I have thought and on which I have spoken for years. Technology has begun to change the ways in which we can relate to sources in different languages from different cultural contexts. Those nascent changes are by no means deterministic – the technology can develop in more and less helpful directions and its development can move under very different influences. Likewise, the study of Greece and Rome can be appropriated by groups such as white supremacists or more narrowly Eurocentric nationalists (for whom whiteness is a necessary but far from sufficient condition). Therefore my own work seeks constructive ways by which emerging technologies for reading and the study of Greco-Roman culture can together help foster a society that is more just and affords greater happiness to its citizens.
The rest is published on the Classical Continuum, a publication venue supported by the New Alexandria Foundation and the Harvard Comparative Literature Department.
Gregory Crane Tufts University March 25 Gregory.Crane@tufts.edu
Tufts University will offer two different sections of introductory Ancient Greek in fall 2022, each of which takes a complementary approach. Both sections of the class have been designed to exploit increasingly powerful digital tools for understanding Ancient Greek and other languages — the skills that you learn will also help you exploit, and go far beyond, what you can do with translation, whether those are literary translations by human beings or the product of systems such as Google Translate or DeepL. Both sections build directly on an emerging new version of the Perseus Digital Library. Neither section has any prerequisites.
The first section will follow a textbook and will teach you to produce, as well as to understand, ancient Greek. It will, however, also give students far more exposure to ancient Greek source texts from the opening weeks of the semester. The second section, which will be online at a time to be determined, will focus on exploiting increasingly sophisticated digital tools to analyze ancient Greek sources.
Figure 1: the first line of the Iliad with exhaustive annotation in a new reading environment being developed for the Perseus Digital Library, with translations and glosses into English and Persian. More than 1 million words of Greek have this level of linguistic annotation.
The first section follows a traditional textbook but exploits a range of digital methods to enhance the experience of learning Ancient Greek, providing substantial immediate feedback as you practice traditional exercises. Instead of translating Greek into English or English into Greek and then waiting days for correction, you will be able to receive substantial feedback. We will also spend as much time as possible seeing how the vocabulary and grammar are used in actual Greek sources and minimize use of artificial textbook Greek. The goal is to give you active as well as passive command of the Greek. This section is better suited to your needs if you feel you may wish to go beyond first year Greek. It will meet Mondays and Wednesdays 1:30-2:45 PM local time.
This section will be primarily in person but will be open to those who wish to participate remotely. If you are at an institution where you can cross-register with Tufts (such as Boston College or Brandeis), you would not have to travel across town; scheduling may prevent you from taking your local introduction to Greek or you may wish to participate in this novel approach. Those seeking credit should be able to do so through Tufts’ University College.
Figure 2: A translation of Iliad 1 by Amelia Parrish (Tufts ’21) designed to be aligned at the word and phrase level with the Greek original to expose the working of the source language.
The second section will meet online at a time to be determined. It will focus entirely on reading and is designed for those who may have only one year, or even one semester, to study Ancient Greek. This second section represents a more radical departure from traditional approaches as it focuses on annotated texts themselves and could be applied to any corpus with sufficient annotation. After one semester, practice with digitally enabled tools will allow you to compare a translation of Homeric epic to the original Greek, to explore what the words really mean in Homeric Greek (and not just how they are translated), and to engage with the epics on your own. In the second semester, you will be able to move on to more syntactically complex sources such as Plato.
If space allows, we would particularly encourage participation in this online section by students from outside of Tufts. We want to understand how to apply this more radical departure from traditional pedagogy. We are building on work done by Farnoosh Shamsian, PhD student at the University of Leipzig, and participants in this class will be contributing not only to her research but to an ongoing reimagining of how we work with historical languages.
Figure 3: Metrical analysis for the Iliad and Odyssey (and much else) published by David Chamberlain, with a recording of Chamberlain reading those lines: see the original with recording at Hypotactic.com.
We are aware of no modern language programs that will provide such transferable skills. You will not only learn how to work with sources in Ancient Greek but will have tools to analyze Latin as well as modern languages such as French, German, and Italian but also Croatian and Latvian, Arabic and Mandarin. Our goal is not to help you check into a hotel or order dinner. Our goal is to allow you to work directly and quickly with not only Ancient Greek primary sources but with scholarship about these sources in a variety of modern languages. Our goal is to transform who can participate in traditional scholarship about the Greco-Roman world and then to enable new forms of scholarship and new intellectual communities that were never possible in print culture.
Description of this version of Greek 1 as it appears in the course book for the Department of Classical Studies at Tufts University.
Greek 1: Fall 2022 Introduction to Ancient Greek Section 1: Monday/Wednesday 1:30-2:45 Section 2: To be scheduled. Gregory Crane, Professor of Classical Studies, Editor-in-Chief, Perseus Digital Library Christopher Petrik, Tufts ’24 Farnoosh Shamsian, PhD Candidate, Leipzig University
The rise of digital methods and, increasingly, of machine learning has begun to enable a transformation in the study of Ancient Greek. What you can learn in an introduction to Ancient Greek can be far greater now than was ever possible before. At the same time, what you can do with what you learn will take you much farther now than was possible before. Tufts University has been at the forefront of this transformation. In taking Ancient Greek, you not only can benefit from this work but will have an opportunity to contribute yourself, creating during the course of first year Greek materials that will serve other language learners and advanced researchers alike.
You will have more exposure to authentic Greek in this introductory class than has ever been possible. Exhaustive annotation exists explaining the function of more than a million words of Ancient Greek, while a new generation of translations, designed to clarify the working of Greek for those who do not know the language, makes it possible to see how grammar and vocabulary actually work in some of the most famous works of Greek literature, from the time you learn your first words. The very same methods that you learn to begin working with Ancient Greek have been applied to dozens of other languages.
A major barrier to learning historical languages has been the slow pace and limited reach of the feedback that you receive. You do an assignment one day, hand it in the next, and then see how you did in the next class, two days or more later. When you practice what you have learned, you will often be able to get immediate feedback and then be able to practice what you have learned until you have mastered it.
We offer two different sections, each with a complementary approach aimed to serve different audiences. The first section builds off of a traditional textbook, offering all exercises online with immediate feedback. Class time will be devoted to questions that you cannot resolve on your own and to seeing how what we have learned in class helps us begin to understand real texts. Students will also begin working with short passages from the Iliad and Odyssey, Sophocles, Thucydides, Plato, and the New Testament. The second section is designed to support those who may be able to devote only a year or even a semester to the study of Ancient Greek. You will learn enough of the grammar to understand the basic working of highly inflected languages such as Ancient Greek (and Latin and Russian and many other languages) but you will spend most of your time learning how to apply the rich set of tools available to help you read Ancient Greek – and many other languages. If you do choose to continue your study beyond the first year, we will provide you with a framework by which you can do that.
Abstract: This paper consists of three complementary parts. The first section describes three instances where very technical scholarship on Greek literature overlaps with, and draws attention to, particularly dramatic historical contexts. This section describes an aspect of Greco-Roman studies that is both too demanding and too narrow — too demanding because it assumes that anglophone researchers work with scholarship in languages such as French, German, and Italian, but too narrow because it does not engage with scholarship that is not in a major European language. The second section talks about the general need for Classics and Classical Studies in a country such as the United States to extend beyond Greece and Rome. This section builds on work that I have published in the past distinguishing Greco-Roman from Classical Studies. The third section describes a more concerted attempt to expand beyond North Africa and to include sources from Sub-Saharan Africa. I report on developing for a spring 2021 course on Epic Poetry a 10,000-line Mandinka/English corpus of stories produced by West African Griots. I will also briefly discuss the use of Classical Arabic to explore locally produced sources about West African history and culture. As a first step, the fall 2021 course on Classical historians at Tufts University will center not only on sources such as Herodotus, Thucydides, Livy and Tacitus but on two histories that focus on the Songhay Empire: the Tarikh al-Fattâsh, begun c. 1593 by Mahmud Kati, and the Tarikh as-Sudân, composed by al-Sadi (c. 1594–1655). This class will expand the role of Classical Arabic in Classical Studies at Tufts.
Visitors to the Open Greek and Latin digital library often ask us about what appear to be fragments of “code” alongside the list of authors in the library’s collection:
If you expand the list of works by a given author, you’ll notice that a similar line of “code” appears next to the title of each work, but with an extra element added at the end:
And once you actually go to read a work, you’ll notice an even longer sequence of characters in the right sidebar, and an identical one in your browser’s address window:
These character sequences are called CTS URNs (Canonical Text Services Uniform Resource Names), and they are an essential component of the Open Greek and Latin infrastructure. Simply put, CTS URNs are unique identifiers that make it possible to retrieve a specific passage of text from a database. In this blog post we’ll take a closer look at how CTS URNs work, and why they are so important to building the digital Classics library of the future.
Needle in a Digital Haystack: Uniform Resource Names
Suppose you pay a visit to your local library to check out a copy of your favourite Jane Austen novel. If your library is very small (say, 200 items or fewer), you will probably be able to locate the book quite easily just by running your eyes over the shelf. But this method would quickly become impractical in a library with thousands or, in some cases, millions of items in its collection.
In order to simplify the search and retrieval process, libraries assign each book a unique call number, and then use the call numbers to arrange books in a logical order across floors and shelves. Armed with a call number and a floorplan of the library, you can easily find a specific book from among millions of others—assuming no one has misplaced or stolen it!
Call numbers are an example of metadata: information about an object, such as its location, size, or creation date, that is separate from the object’s contents. Metadata is important for keeping track of items in a collection and understanding how they relate to one another. Good metadata also makes it possible to perform statistical analyses that can yield insights into the collection as a whole.
In many ways, Uniform Resource Names, or URNs, are analogous to the call numbers in a library. Each item in a digital collection is assigned a unique URN that distinguishes it from every other item. When you log on to the collection, your computer downloads an inventory containing the URN of every available text; this is what you see when you browse the OGL library. The inventory is updated whenever a new text is added to the database, so that you never end up with “dead” links or an incomplete catalog.
When you select a text you want to read, your computer sends the URN of that text to the OGL server, which responds by sending back a copy of the text in the form of an XML document (on which more in a future post).
Finding Your Way: Canonical Text Services
In theory, a URN could be any random sequence of characters, as long as no two URNs are the same. This kind of system would tell you what texts are available to read, but nothing about the way in which the collection is organized or how different texts relate to one another. In particular, it would be difficult to group together different texts by the same author, an essential feature of both physical and digital libraries.
To solve this problem, projects such as OGL use a system known as Canonical Text Services. Despite the name, CTS has nothing to do with labeling texts as either “canonical” or “non-canonical”. Rather, CTS provides a set of rules for generating URNs that reflects the logical organization of texts into groups and subgroups.
If you examine the list of works by Lucian of Samosata in the screenshot above, you will notice that each URN begins with the same sequence of characters: urn:cts:greekLit:tlg0062:. The letters urn:cts: appear in every URN, and indicate that we are employing the CTS citation format. greekLit locates the text within one of OGL’s main subcollections, Greek Literature (other subcollections include Latin Literature, latinLit, and Hebrew Literature, hebLit). Finally, tlg0062 is the sequence that has been assigned to the author Lucian. In fact, urn:cts:greekLit:tlg0062: is a complete URN on its own: it distinguishes the author Lucian of Samosata from all other authors in the OGL library. Individual works are identified by appending a suffix to the URN of the author: tlg001, tlg002, and so forth. This way, all works by Lucian appear together as a single text group.
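To make the anatomy of these identifiers concrete, here is a minimal Python sketch that splits a CTS URN into its parts. It is not the code that OGL itself runs: the names CtsUrn and parse_cts_urn are invented for this illustration, and the sketch assumes the simple colon-and-period layout seen in the examples in this post.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    """Illustrative container; the field names are this sketch's own choice."""
    namespace: str            # e.g. "greekLit"
    textgroup: str            # e.g. "tlg0062" (the author)
    work: Optional[str]       # e.g. "tlg001"
    version: Optional[str]    # e.g. "perseus-grc2"
    passage: Optional[str]    # e.g. "5.84.1"


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN on its colons, then split the work part on periods."""
    parts = urn.split(":")
    if not 4 <= len(parts) <= 5 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work_part = parts[2], parts[3]
    if not namespace or not work_part:
        raise ValueError(f"missing namespace or text group: {urn!r}")
    # A trailing colon (as in "urn:cts:greekLit:tlg0062:") leaves an empty
    # passage component, which is treated here as "no passage".
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    levels = work_part.split(".")
    if len(levels) > 3:
        raise ValueError(f"too many work levels for this sketch: {urn!r}")
    levels += [None] * (3 - len(levels))   # pad missing work/version levels
    textgroup, work, version = levels
    return CtsUrn(namespace, textgroup, work, version, passage)


author = parse_cts_urn("urn:cts:greekLit:tlg0062:")
print(author.namespace, author.textgroup, author.work)
```

Run on the author-level URN for Lucian, the sketch reports the namespace greekLit and the text group tlg0062, with no work, version, or passage.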
This sort of system, in which smaller categories are nested within larger ones, is an example of a hierarchy. In addition to grouping together works by the same author, the hierarchical format of CTS URNs makes it possible to identify a specific passage within a text in a way that mirrors the text’s internal structure.
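The grouping that this hierarchy makes possible can be sketched in a few lines of Python. The function name group_by_textgroup is hypothetical, and the input list simply combines the Lucian work suffixes mentioned above with the identifier for Thucydides' History; a real inventory would come from the OGL server.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_textgroup(urns: Iterable[str]) -> Dict[str, List[str]]:
    """Group work-level CTS URNs under the URN of their author (text group)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for urn in urns:
        parts = urn.split(":")                 # ["urn", "cts", namespace, work part, ...]
        namespace = parts[2]
        textgroup = parts[3].split(".")[0]     # first level of the work part
        groups[f"urn:cts:{namespace}:{textgroup}:"].append(urn)
    return dict(groups)


works = [
    "urn:cts:greekLit:tlg0062.tlg001",   # a work by Lucian
    "urn:cts:greekLit:tlg0003.tlg001",   # Thucydides' History
    "urn:cts:greekLit:tlg0062.tlg002",   # another work by Lucian
]
for author_urn, author_works in group_by_textgroup(works).items():
    print(author_urn, len(author_works))
```

Because the author is encoded in the identifier itself, no separate lookup table is needed to bring the two works by Lucian together, even though they are not adjacent in the input.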
Navigating the Text
Classicists will be familiar with the system of citing passages of text by canonical reference. Depending on their genre and length, most ancient works are divided into segments such as books, chapters, or, in the case of poetry, line numbers. Longer segments, such as books, are themselves usually divided into shorter ones, so that the result once again is a nested hierarchy. For example, the citation “Thuc. 5.84.1” refers to book 5, chapter 84, section 1 of Thucydides’ History of the Peloponnesian War, which happens to be the opening scene of the famous Melian Dialogue. Longer passages can be identified by using a range: to cite the Melian Dialogue as a whole, we can write “Thuc. 5.84-116,” that is, Thucydides book 5, chapters 84 to 116.
The advantage of canonical references is that, unlike page numbers, they remain valid no matter what edition of a work is being used. They are also more suitable for citing texts in a digital environment, where the concept of physical page numbers is no longer very meaningful.
To identify a specific passage of text within the CTS framework, the URN of the text can be extended in a way that resembles the canonical references above. Here is the URN for book 5, chapter 84, section 1 of Thucydides: urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:5.84.1. In this example, the sequence perseus-grc2 identifies a particular version of the text stored in the OGL database, while 5.84.1 points to the specific passage. A longer passage can likewise be expressed as a range: urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:5.84-5.116. Note that the same hierarchical levels must be included on either side of the range: 5.84-5.116, not 5.84-116, which would prompt an error message.
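The difference between a print-style range and a URN range can be handled mechanically. The following Python sketch, with the hypothetical helpers expand_range and passage_urn, fills in the levels that a print citation leaves implicit. It assumes that the omitted levels of the end point are the leading ones, as in "5.84-116", and it is only an illustration of the rule described above, not the validation that the OGL server actually performs.

```python
def expand_range(citation: str) -> str:
    """Turn a print-style range such as "5.84-116" into "5.84-5.116"."""
    start, separator, end = citation.partition("-")
    if not separator:
        return citation                      # a single passage, nothing to expand
    start_levels = start.split(".")
    end_levels = end.split(".")
    if len(end_levels) > len(start_levels):
        raise ValueError(f"end of range is deeper than its start: {citation!r}")
    # Borrow the missing leading levels (here, the book number) from the start.
    missing = len(start_levels) - len(end_levels)
    full_end = start_levels[:missing] + end_levels
    return start + "-" + ".".join(full_end)


def passage_urn(text_urn: str, citation: str) -> str:
    """Append a passage reference or range to the URN of a text version."""
    return f"{text_urn}:{expand_range(citation)}"


thucydides = "urn:cts:greekLit:tlg0003.tlg001.perseus-grc2"
print(passage_urn(thucydides, "5.84-116"))
```

For the Melian Dialogue this yields the range URN given above, with 5.84-5.116 on the right-hand side.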
When you access a text on the OGL library, your computer is provided with information about the text’s citation structure. As you navigate through the text, your computer sends a URN of the target passage to the OGL server, which returns a copy of the passage, again in the form of an XML document. While you can progress forwards and backwards through the text sequentially, you can also enter a specific URN into your browser’s address window: assuming the URN is formatted correctly, this will take you directly to whatever passage you are interested in.
Planning for the Future
We have seen that CTS URNs provide a logical way of retrieving texts and passages of texts that reflects the organization of a collection and the internal structure of the texts themselves. But perhaps the question remains: Why is such a system important in the first place?
While a simple reading environment is possible without URNs, the CTS framework allows us to unlock the full potential of a digital edition. By assigning a unique identifier to every passage of every text, CTS URNs make possible large-scale textual analysis that in the past would have required hundreds of hours of manual tabulation. With the proper software, we can easily find out how word frequencies, metrical patterns, and even syntactical structures vary within and across texts. The discovery of statistically significant variations might, for example, help resolve disputes over the authorship of a text or quantify precisely how an author’s style developed over the course of their career.
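As a rough illustration of how passage-level identifiers support this kind of counting, the Python sketch below tallies word frequencies per chapter from a mapping of passage URNs to text. Everything in it is an assumption made for the example: the function name frequencies_by_chapter is invented, the references are taken to have the book.chapter.section shape, and the sample strings are placeholders rather than real Greek text or real results.

```python
from collections import Counter, defaultdict
from typing import Dict, Mapping


def frequencies_by_chapter(passages: Mapping[str, str]) -> Dict[str, Counter]:
    """Map each book.chapter reference to a Counter of the words found in it."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for urn, text in passages.items():
        reference = urn.rsplit(":", 1)[1]               # e.g. "5.84.1"
        chapter = ".".join(reference.split(".")[:2])    # e.g. "5.84"
        counts[chapter].update(text.lower().split())    # naive whitespace tokens
    return dict(counts)


# Placeholder passages: the URN pattern follows the examples in this post,
# but the "texts" are invented tokens, not the words of Thucydides.
base = "urn:cts:greekLit:tlg0003.tlg001.perseus-grc2"
sample = {
    f"{base}:5.84.1": "alpha beta alpha",
    f"{base}:5.84.2": "beta alpha",
    f"{base}:5.85.1": "gamma",
}
for chapter, counter in frequencies_by_chapter(sample).items():
    print(chapter, counter.most_common(2))
```

The point of the sketch is that the grouping key comes straight out of the identifier. A real analysis would need proper tokenization and lemmatization of the Greek rather than splitting on whitespace.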
In addition to this, the CTS framework helps protect a digital repository from becoming obsolete in the face of changing technology. Since URNs are just strings of characters, they will remain valid no matter how the technology for processing and displaying texts evolves in the future. By investing in this system, the Open Greek and Latin Project is positioning itself to take advantage of exciting innovations in the field of digital humanities, and to serve as an invaluable resource to Classicists for generations to come.
The Perseus Digital Library is pleased to acknowledge the recent contributions of the Center for Hellenic Studies (CHS) 2019 digital humanities summer interns and research team on the corpus of Plutarch’s Moralia. As a part of the First Thousand Years of Greek initiative, a project of Open Greek & Latin, the CHS provided support for the conversion of older Perseus data into Canonical Text Services (CTS) and EpiDoc compliant files.
All of these files will be made available in Open Greek & Latin via the Scaife Viewer.
The 2019 CHS Summer Digital Humanities Interns, Karina Cooper, Ethan Della Rocca, Sophia Elzie, and Lucy Parr, worked on proofreading files, updating the TEI-XML markup, and managing the workflow via the Perseus GitHub Greek text repository. Improvements included file naming consistency, structural review, and header standardization. Perseus and Open Greek & Latin/First Thousand Years of Greek are grateful for their efforts and attention to detail.
Angelia Hanhardt, Editorial Assistant at the CHS, trained and supervised the interns in keeping with her work on open access publication.
In the fall of 2019, Michael Konieczny, PhD, Classical Philology and CHS Library Assistant for the academic year, joined Lia in completing and reviewing the interns’ work (and has blogged about his experience).
This work added over 1.8 million words to Open Greek & Latin — over half a million in Greek. The CHS team created over 75 metadata files and converted over 200 editions and translations, accounting for over 12% of the entire Perseus Greek open-source primary text corpus.
Perseus and our collaborators are grateful to Ethan, Karina, Lucy, Sophia, Michael and Lia for their hard work on this corpus and we thank the CHS for funding their efforts.