Gregory Crane*
This paper is the first of two that explore how far learners can track how much of the vocabulary in a target corpus they have encountered and can see how frequently each newly encountered vocabulary item appears in the rest of the corpus. We focus here upon the Ancient Greek Iliad and the Odyssey, a corpus of just over 200,000 running words. Homeric Epic provides a useful starting point because a growing cluster of openly licensed digital resources for this corpus is available, including links from each form in the epics to a dictionary entry, the starting point for vocabulary analysis.
In other work, as part of this larger, NEH-funded effort that we call Beyond Translation, we are developing ways for learners to see any and all other passages where a new Greek word appears and thus to see for themselves what the word means, rather than depending upon textbook glosses — from the very first day, learners should be able to begin studying words in context by using translations aligned at the word and phrase level to the original as well as the linguistic analyses explaining the function of each word. This paper, however, focuses simply on how learners can organize vocabulary acquisition by corpus frequency and then how they can track their progress.
This paper also builds upon Clyde Pharr’s Homeric Greek (D.C. Heath & Co., 1925), a brilliant and innovative introduction to Ancient Greek, produced almost a century ago. Pharr introduces learners to the vocabulary of Iliad 1.1-5 on page 1 of his grammar and begins including small but gradually growing chunks of Iliad 1 from lesson 13 on, ultimately covering all 611 lines of this book by the end of lesson 77. Book 1 of the Iliad thus serves as the target corpus, and learners acquire a basic knowledge of Greek by learning as much as possible about the 4,563 running words in this book.
Researchers studying modern languages commonly report that readers need to understand at least 95% and preferably 98% of the running words in a text to gain adequate comprehension (e.g., Schmitt et al., 2011). The 95% threshold is often cited by professional teachers of Greek and Latin. Homeric Epic may be different and more amenable to reading before readers approach 95% vocabulary knowledge. Students of Greek commonly report that they can quickly begin to read Homeric Greek with speed and facility — that was certainly my personal experience when I read Homer intensively in the 1970s, long before any digital tools were available. It may well be that the formulaic nature of Homeric poetry means that it is easier to skip words when reading — there are certainly many common epithets for which the conventional translations are (in my view) guesses based on context.
Furthermore, when we explore Ancient Greek literature as a whole, we shift from genre to genre — from Homer to Attic Tragedy, from historians such as Herodotus (who write in the Ionic dialect) to Thucydides (who writes in Attic), from the comedies of Aristophanes to the dialogues of Plato. The various authors and genres differ in style and vocabulary. It is difficult to learn enough Ancient Greek from so many sources for any but the most experienced to expect they would understand 95-98% of the words in a text they had not seen. Most readers of Ancient Greek will depend upon using online dictionaries. My goal is to help learners to internalize as much knowledge as possible and to understand these sources with as much fluency, precision and pleasure as possible.
While Pharr brings his readers into contact with Homer as soon as he feels he can, he does not confine himself to authentic Greek text. Pharr includes exercises composed in artificial, textbook Greek and asks learners to produce exercises in prose, but the textbook vocabulary is almost entirely drawn from Iliad 1 and designed to help learners acquire both active command and passive recognition of Iliad 1.
Such exercises, of course, carefully track what learners have and have not learned in any given lesson. While John Wright and Paula Debnar completely rewrote the grammatical explanation in Pharr, they carefully preserved Pharr’s exercises (Pharr 2012). Ultimately, we should be able to generate such exercises automatically from an annotated corpus (we could almost certainly do so now, given the state of natural language processing), but for now we will build upon the exercises that Pharr constructed by hand a century ago.
I assume that learners are working to master an existing corpus, rather than to produce and understand an effectively infinite number of utterances in a living language. This scenario can, of course, also apply to a modern language when a learner wishes to understand a particular corpus (e.g., the screenplay of a Kurosawa movie or the lyrics of the Grateful Dead). Such learners are able to measure their progress against a corpus that is, at least for the moment, relatively fixed.
Manually curated annotations listing the dictionary form, part of speech, and syntactic function for more than 1 million running words of Ancient Greek are available on GitHub from various openly licensed treebanks. These include the c. 200,000 running words in the Iliad and the Odyssey. We can thus begin to measure factors such as how many dictionary words and how many inflected forms we find in Homeric Epic.
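As a minimal sketch of how such counts could be derived, the following Python tallies running words, dictionary entries and inflected forms from treebank-style XML. It assumes that each token is a `word` element with `form` and `lemma` attributes, as in the treebank files cited at the end of this paper. The three-word sample (the opening of Iliad 1.1) is typed in here for illustration rather than copied from those files, and the function name `count_vocabulary` is invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Illustrative stand-in for a treebank file (assumed layout, not a real excerpt).
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="μῆνιν" lemma="μῆνις"/>
    <word id="2" form="ἄειδε" lemma="ἀείδω"/>
    <word id="3" form="θεὰ" lemma="θεά"/>
  </sentence>
</treebank>"""


def count_vocabulary(xml_text):
    """Count occurrences of each dictionary entry and each inflected form."""
    root = ET.fromstring(xml_text)
    lemma_counts, form_counts = Counter(), Counter()
    for word in root.iter("word"):
        form, lemma = word.get("form"), word.get("lemma")
        if not form or not lemma:
            continue  # skip tokens that lack either attribute
        form_counts[form] += 1
        lemma_counts[lemma] += 1
    return lemma_counts, form_counts


lemma_counts, form_counts = count_vocabulary(SAMPLE)
running_words = sum(form_counts.values())
print(running_words, len(lemma_counts), len(form_counts))
```

A real file would presumably also require decisions about punctuation tokens and about how lemmas are keyed, which is one reason why precise counts can differ from one analysis to another.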
Vocabulary in the Homeric Epics
The precise counts will vary according to the edition chosen (the Perseus Treebank for Homer uses Allen’s 1920 Oxford Classical Text), to the way words are counted (the Perseus Treebank treats words such as mête (“neither”) as two words: mê (“not”) and te (“also, and”)) and to decisions as to whether particular words get assigned to one or to two or more dictionary entries. These factors are unlikely to change the picture that emerges in the figure below.
Figure 1 plots the contribution of each dictionary entry and inflected form to an understanding of Homeric Epic as a whole. It starts with the most common dictionary entries and inflected forms and then moves down the lists, adding the frequency of each new item and measuring how many running words in Homeric Epic we can now account for. Figure 1 drives home the fact that Greek is a highly inflected language, where each dictionary entry shows up, on average, in 3.6 different forms.
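The procedure behind such a plot can be sketched in a few lines of Python: sort the items by frequency, accumulate the counts, and divide by the total. The six (form, dictionary entry) pairs below are an invented toy sample, not corpus data, and `coverage_curve` is a hypothetical helper name.

```python
from collections import Counter
from itertools import accumulate


def coverage_curve(counts):
    """Cumulative share of running words covered, most frequent item first."""
    total = sum(counts.values())
    ordered = sorted(counts.values(), reverse=True)
    return [running / total for running in accumulate(ordered)]


# Invented toy sample of (inflected form, dictionary entry) pairs.
tokens = [
    ("δ’", "δέ"), ("δὲ", "δέ"), ("δέ", "δέ"),
    ("καὶ", "καί"), ("καὶ", "καί"),
    ("τε", "τε"),
]

lemma_counts = Counter(lemma for _, lemma in tokens)
form_counts = Counter(form for form, _ in tokens)

lemma_curve = coverage_curve(lemma_counts)  # one point per dictionary entry
form_curve = coverage_curve(form_counts)    # one point per inflected form
forms_per_lemma = len(form_counts) / len(lemma_counts)
```

Because there are more distinct forms than dictionary entries, the curve for forms rises more slowly: in the toy sample the first dictionary entry already covers half of the tokens, while the first form covers only a third.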
The table below shows the most common forms in Homeric Epic and how much each contributes to the overall corpus of 200,518 running words. Most of the forms are indeclinable particles and prepositions. Only a few of this top 20 are, strictly speaking, inflected. Note that no attempt was made to normalize forms: thus kai (“and”) with a grave accent, for example, appears as a form distinct from kai with an acute accent, although the two are identical in meaning. Normalization might reduce the number of forms but would not significantly change the relationship between dictionary entries and forms in Figure 1.
Rank | Form | Dictionary entry | Short gloss | Freq. | Tot. | Share |
1 | δ’ | δέ | but | 7014 | 7014 | 0.03 |
2 | καὶ | καί | and | 4812 | 11826 | 0.05 |
3 | δὲ | δέ | but | 3461 | 15287 | 0.07 |
4 | τε | τε | and | 2766 | 18053 | 0.09 |
5 | δέ | δέ | but | 1640 | 19693 | 0.09 |
6 | μὲν | μέν | on the one hand | 1634 | 21327 | 0.10 |
7 | οὐ | οὐ | not | 1560 | 22887 | 0.11 |
8 | ἐν | ἐν | in | 1407 | 24294 | 0.12 |
9 | τ’ | τε | and | 1222 | 25516 | 0.12 |
10 | ὣς | ὡς | as | 1215 | 26731 | 0.13 |
11 | γὰρ | γάρ | for | 967 | 27698 | 0.13 |
12 | ἀλλ’ | ἀλλά | otherwise | 933 | 28631 | 0.14 |
13 | τὸν | ὁ | the | 904 | 29535 | 0.14 |
14 | ἐπὶ | ἐπί | on | 829 | 30364 | 0.15 |
15 | οἱ | ἕ | nodef | 804 | 31168 | 0.15 |
16 | αὐτὰρ | ἀτάρ | but | 759 | 31927 | 0.15 |
17 | μοι | ἐγώ | I (first person pronoun) | 745 | 32672 | 0.16 |
18 | δὴ | δή | [interactional particle: S&H on same page] | 741 | 33413 | 0.16 |
19 | οὔ | οὐ | not | 677 | 34090 | 0.17 |
20 | μιν | μιν | him | 644 | 34734 | 0.17 |
From the table above, we can see that most of the most frequent forms in the corpus are not inflected, and thus the ratio of 3.6 forms per dictionary entry underestimates how many different forms of many words audiences actually encounter. (When, in summer 2019, Ethan Yates generated the forms that Clyde Pharr listed in the paradigms of his Homeric Greek textbook, he produced a list of almost 14,000 different inflected forms illustrating the various paradigms. If he had included every inflected participle, the list would have been substantially longer.)
The following table shows the counts for the 20 most common dictionary words in Homeric Epic.
Rank | Dictionary entry | Short gloss | Freq. | Tot. | Share |
1 | δέ | but | 12138 | 12138 | 0.06 |
2 | ὁ | the | 5870 | 18008 | 0.08 |
3 | καί | and | 5283 | 23291 | 0.11 |
4 | τε | and | 4322 | 27613 | 0.13 |
5 | ἐγώ | I (first person pronoun) | 2871 | 30484 | 0.15 |
6 | οὐ | not | 2695 | 33179 | 0.16 |
7 | εἰμί | to be | 2118 | 35297 | 0.17 |
8 | ἐν | in | 2076 | 37373 | 0.18 |
9 | ὅς | who | 2043 | 39416 | 0.19 |
10 | ὡς | as | 2007 | 41423 | 0.20 |
11 | σύ | you (personal pronoun) | 1900 | 43323 | 0.21 |
12 | μέν | on the one hand | 1872 | 45195 | 0.22 |
13 | ἄρα | particle: ‘so’ | 1772 | 46967 | 0.23 |
14 | τις | any one | 1457 | 48424 | 0.24 |
15 | ἄν | modal particle | 1449 | 49873 | 0.24 |
16 | ἀλλά | otherwise | 1430 | 51303 | 0.25 |
17 | γάρ | for | 1401 | 52704 | 0.26 |
18 | ἐπί | on | 1371 | 54075 | 0.26 |
19 | αὐτός | unemph. 3rd pers.pronoun; -self; [the] same | 1121 | 55196 | 0.27 |
20 | πᾶς | all | 1089 | 56285 | 0.28 |
The figure above reports that 8,792 different dictionary words account for all 200,518 running words counted in the Perseus Treebank for Homeric Epic. For those who do not know what a power law is, the figures above provide a good example. The top 20 dictionary words (about 1 out of 440, or 0.23%) account for 56,285 running words, more than a quarter of the corpus as a whole (28%).
The top thousand dictionary entries (11% of the whole) account for 165,922 words (82% of the whole). The table below lists every 50th entry for perspective and to illustrate how much progress learners could make if they organized their vocabulary study by frequency. (Of course, they would probably use a modified strategy where they started with the most common nouns, verbs, etc., rather than the most common forms.)
Rank | Dictionary entry | Short gloss | Freq. | Tot. | Share |
1 | δέ | but | 12138 | 12138 | 0.06 |
50 | μή | not | 614 | 80123 | 0.39 |
100 | σφεῖς | personal and (ind.) reflexive pronoun | 294 | 100538 | 0.50 |
150 | προσεῖπον | to address | 190 | 112061 | 0.55 |
200 | ἄνωγα | to command | 145 | 120344 | 0.60 |
250 | πάσχω | to experience | 118 | 126800 | 0.63 |
300 | ἡμέτερος | our | 99 | 132248 | 0.65 |
350 | θέω | to run | 83 | 136814 | 0.68 |
400 | ὅστις | indef. relative or indirect interrogative | 73 | 140722 | 0.70 |
450 | ἕλκω | to draw | 65 | 144164 | 0.71 |
500 | πάλιν | back | 57 | 147208 | 0.73 |
550 | τέμνω | to cut | 51 | 149911 | 0.74 |
600 | ἔνθεν | whence; thence | 45 | 152289 | 0.75 |
650 | ἀλέομαι | to avoid | 42 | 154446 | 0.77 |
700 | φόβος | fear | 39 | 156458 | 0.78 |
750 | ὀρέγω | to reach | 36 | 158317 | 0.78 |
800 | ἄρειος | nodef | 34 | 160061 | 0.79 |
850 | δατέομαι | to divide among themselves | 31 | 161695 | 0.80 |
900 | χαμᾶζε | to the ground | 29 | 163190 | 0.81 |
950 | ἐρετμόν | oar | 27 | 164601 | 0.82 |
1000 | βλάπτω | to disable | 26 | 165922 | 0.82 |
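The table can also be read in the other direction: given a target level of coverage, how far down the frequency list must a learner go? The sketch below copies a subset of the (rank, cumulative total) pairs from the table above and returns the first tabulated rank that reaches the target. The function name is hypothetical, and because only every 50th rank is tabulated, the true answer may lie up to 49 entries earlier.

```python
TOTAL_RUNNING_WORDS = 200_518

# (rank, cumulative running words), copied from the table above.
CHECKPOINTS = [
    (1, 12_138), (50, 80_123), (100, 100_538), (200, 120_344),
    (400, 140_722), (600, 152_289), (800, 160_061), (850, 161_695),
    (1000, 165_922),
]


def first_checkpoint_reaching(target_share):
    """Return the first tabulated rank whose coverage meets target_share."""
    for rank, cumulative in CHECKPOINTS:
        if cumulative / TOTAL_RUNNING_WORDS >= target_share:
            return rank
    return None  # target not reached within the tabulated ranks


print(first_checkpoint_reaching(0.50))  # 100
print(first_checkpoint_reaching(0.80))  # 850
print(first_checkpoint_reaching(0.90))  # None
```

On these figures, half of the running words are covered by roughly the hundredth dictionary entry, while the step from 80% towards 90% is not reached even by the thousandth.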
The model above makes at least one assumption that would be problematic in print culture: once a dictionary word is encountered, it assumes that all forms of that word will be understood. For traditional learners, that is a major problem with Greek verbs because much of first year Greek is usually devoted to the many arcane ways Greek verbs can be produced.
In a digital environment, however, readers can see the analysis of each and every verb form long before they choose to internalize the ability to generate those forms. In this scenario, learners focus early on the outlines of Greek grammar (e.g., system of tenses, moods and voices) so that they have an understanding of what, for example, an aorist optative contributes. Full annotation of source texts would include links to a grammar explaining the function of each and every word.
The Homer treebank will, for example, tell you that the Greek words for “spear” and “shoulder” are both datives and modify the verb ballô, “to strike with a thrown object” (a verb that shows up 469 times in Homer), but it will not tell you that the first dative reflects instrumentality (you strike someone “with a spear”) while the second designates location (you strike someone “on the shoulder”). We need either to use more complex tags (such as those my Tufts colleague Matthew Harrington has developed) or to create a separate annotation layer (as my collaborator Farnoosh Shamsian and I are doing).
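To make the idea of a separate annotation layer concrete, here is one possible shape for it, offered only as a sketch. The token identifiers, field names and function labels are all hypothetical and do not reproduce either the treebank schema or the annotation scheme of any project named above. The base layer records only what the text says the treebank provides (case and head), and the second layer adds the function of each dative, keyed by token identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    token_id: str  # hypothetical identifier
    gloss: str
    case: str      # empty string for words without case
    head_id: str   # identifier of the word this token modifies


# Base layer: case and head only.
BASE_LAYER = [
    Token("t1", "strike with a thrown object", "", ""),
    Token("t2", "spear", "dative", "t1"),
    Token("t3", "shoulder", "dative", "t1"),
]

# Separate layer: the function of each dative, keyed by token identifier.
DATIVE_FUNCTION = {"t2": "instrument", "t3": "location"}


def describe(token, layer):
    """Combine the base annotation with the extra layer where one exists."""
    function = layer.get(token.token_id)
    if function is None:
        return f"{token.gloss}: {token.case or 'no case'}"
    return f"{token.gloss}: {token.case} of {function}"


for token in BASE_LAYER:
    print(describe(token, DATIVE_FUNCTION))
```

Keeping the second layer apart from the first means that the base treebank stays untouched and that competing analyses of the same dative can coexist as alternative layers.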
The case explored below is based upon Clyde Pharr’s Homeric Greek (D.C. Heath & Co., 1925), a brilliant and innovative introduction to Ancient Greek, produced a century ago. It took a data-driven approach in its design at a time when data had to be collected by hand. Pharr had two fundamental principles: (1) design the grammar so that learners engage directly with the corpus of interest as soon as possible and (2) use frequency data so that learners can engage with more frequent phenomena first.
There is, however, tension between the principle of engaging as quickly as possible with the source texts and that of focusing on the most common words. Pharr’s first-year grammar introduces Iliad 1.1-5 in lesson 13, the beginning of week 5 if learners move through the book 3 chapters each week. Pharr has, however, been preparing learners for this from the first page of the textbook. Learners are asked to learn the Greek alphabet by spelling and pronouncing the dictionary entries for each word in Il. 1.1-5, each of which is offered along with a definition.
Pharr introduces the vocabulary for the first five lines of the Iliad on page 1. The immediate goal is for learners to use these words to practice reading the Greek alphabet. At the same time, they have an opportunity for incidental learning, as they are exposed to the definitions and to words that they will study in more detail much later (such as tithêmi, “put, place, cause”, the complex morphology of which is introduced in lesson 32).
Before addressing the question of word frequencies and vocabulary acquisition, I want to point out in passing that we could take a very different approach to having learners begin by sounding out words from Iliad 1.1-5 on day one. In a digital environment, for example, we can easily add sound as a guideline that learners can use as a starting point. We can certainly link a simple reading to the Greek words and definitions. We can, however, also provide a metrical analysis and reading.
Figures 3 and 4 illustrate how data published under an open license can be available in multiple venues. The use of either venue would make possible a different approach for the beginning learner, one that made the sound of the dactylic hexameter an object of study from day 1. That would certainly add to the complexity of what can be a daunting start, but it would also enable learners to begin approaching Homeric Epic as poetry immediately. In a world of smartphones, learners can also listen to recordings anywhere, and there is no reason why we cannot make the sound of Homeric poetry a fundamental part of their experience. Many of us listen to music online in languages that we do not know and do not wait to study the grammar before listening to songs that capture our imagination.
Returning to the question of vocabulary acquisition, the focus on exhaustive coverage of particular passages does, however, come at a price. Any substantial passage in Homeric epic — and any regular text — will contain words that rarely appear. Those who follow Pharr’s exercises will find themselves practicing an active command of words such as proiaptô, “hurl forward, send forth,” and helôrion, “booty, prey, spoil,” even though, in the rest of the Iliad and Odyssey, proiaptô appears only 3 times elsewhere and helôrion is never seen again.
We can, however, now easily indicate to the learner how often each of the words presented in this list actually occurs in the Iliad and Odyssey.
No. | Dictionary entry | Freq. in Homer | Gloss |
1 | μῆνις | 16 | wrath, fury, madness, rage. |
2 | ἀείδω | 40 | sing (of), hymn, chant. |
3 | θεά | 199 | goddess. |
4 | Πηλείδης | 59 | son of Peleus, Achilles. |
5 | Ἀχιλλεύς | 382 | Achilles. |
6 | οὐλόμενος | 14 | accursed, destructive, deadly. |
7 | ὅς | 2043 | his, her(s), its (own). |
8 | μυρίος | 32 | countless, innumerable. |
9 | Ἀχαιός | 722 | Achaean, Greek. |
10 | ἄλγος | 92 | grief, pain, woe, trouble. |
11 | τίθημι | 377 | put, place, cause. |
12 | πολύς | 874 | much, many, numerous. |
13 | δέ | 12138 | post. conj. but, and, so, for. |
14 | ἴφθιμος | 44 | mighty, valiant, stout-hearted, brave. |
15 | ψυχή | 81 | soul, breath, life, spirit. |
16 | Ἅιδης | 2 | Hades, god of the lower world. |
17 | προιάπτω | 4 | hurl forward, send forth. |
18 | ἥρως | 114 | HERO, mighty warrior, protector, savior. |
19 | αὐτός | 1121 | self, him(self), her(self), it(self), same. (one), he, she, it. |
20 | ἑλώριον | 1 | booty, spoils, prey. |
We could easily identify dictionary words that fall below some threshold of frequency in Homeric Epic as a whole. Learners could prioritize active mastery of dictionary words that are above that threshold and passive mastery of uncommon words (such as helôrion). Figure 5 shows the progress that learners would make as they moved through the 77 chapters of Pharr.
Figure 5 divides words into four classes. The y-axis shows the number of running words in Iliad 1, while the x-axis tracks the 77 chapters in Pharr. The orange band at the top represents unseen words that fall below the selected threshold (here 20, so that orange words occur 19 times or fewer in Homeric Epic). The red band designates unseen vocabulary that meets the cutoff for common words (20 or more occurrences in Homeric Epic). The blue band shows how many new running words in Iliad 1 learners can understand after mastering the vocabulary in the current chapter. The green band then allows learners to see how many words they should be able to recognize. Following Pharr’s plan, they achieve 100% coverage for all words in Iliad 1 at the end of the final chapter.
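The four bands can be computed chapter by chapter from three inputs: the dictionary entry of every running word in the target text, the vocabulary introduced in each chapter, and the frequency of each entry in the corpus as a whole. The sketch below is one reading of the figure, in which green counts running words learned in earlier chapters and blue counts those first learned in the current chapter. The six-word "text" and the assignment of words to chapters are invented for illustration, although the four frequencies are taken from the Iliad 1.1-5 list above.

```python
def progress_bands(target_lemmas, chapter_vocab, corpus_freq, cutoff=20):
    """For each chapter, count the running words of the target text per band."""
    known = set()
    rows = []
    for new_lemmas in chapter_vocab:
        new_lemmas = set(new_lemmas) - known
        bands = {"green": 0, "blue": 0, "red": 0, "orange": 0}
        for lemma in target_lemmas:
            if lemma in known:
                bands["green"] += 1   # learned in an earlier chapter
            elif lemma in new_lemmas:
                bands["blue"] += 1    # learned in this chapter
            elif corpus_freq.get(lemma, 0) >= cutoff:
                bands["red"] += 1     # unseen, common in the whole corpus
            else:
                bands["orange"] += 1  # unseen, below the cutoff
        rows.append(bands)
        known |= new_lemmas
    return rows


# Frequencies from the Iliad 1.1-5 list; keys are transliterations.
corpus_freq = {"menis": 16, "aeido": 40, "thea": 199, "helorion": 1}
# Invented running text and chapter assignments, for illustration only.
target_lemmas = ["menis", "aeido", "thea", "thea", "helorion", "aeido"]
chapter_vocab = [["thea"], ["aeido", "menis"], ["helorion"]]

rows = progress_bands(target_lemmas, chapter_vocab, corpus_freq)
for chapter, bands in enumerate(rows, start=1):
    print(chapter, bands)
```

After the last chapter the green and blue counts together equal the length of the text, which corresponds to the 100% coverage that Pharr’s plan reaches at the end of lesson 77.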
The 4,563 running words in Iliad 1 contain 1,118 separate dictionary entries. If, however, we focus on dictionary words that occur 20 or more times in the Homeric epics, we reduce that total to 655 dictionary entries for active mastery. The other 463 dictionary entries account for only 593 running words (pictured in orange above), i.e., the vast majority of them occur only once in Iliad 1. Learners can concentrate on recognizing these less common words in context rather than on actively mastering them.
If we were to increase the cutoff point to dictionary words that appear 30 or more times in Homeric Epic, we would reduce the number of words for active recall from 655 to 565: 90 dictionary entries from Iliad 1 occur between 20 and 29 times in Homeric Epic. Of the 4,563 running words in Iliad 1, 730 would now fall into the less common category and could be recognized passively in context.
In increasing the cutoff from 20 to 30, we increase the number of words that we do not try to actively master, but Figure 6 shows that the change is not drastic — most readers will have to look carefully to see the difference from Figure 5.
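The effect of moving the cutoff can be tried out on the small list for Iliad 1.1-5 given earlier. The sketch below copies the twenty frequencies from that list and splits the entries into those for active mastery and those for passive recognition; `split_by_cutoff` is a hypothetical helper name.

```python
# Occurrences in the Iliad and Odyssey, copied from the Iliad 1.1-5 list.
ILIAD_1_1_5 = {
    "μῆνις": 16, "ἀείδω": 40, "θεά": 199, "Πηλείδης": 59, "Ἀχιλλεύς": 382,
    "οὐλόμενος": 14, "ὅς": 2043, "μυρίος": 32, "Ἀχαιός": 722, "ἄλγος": 92,
    "τίθημι": 377, "πολύς": 874, "δέ": 12138, "ἴφθιμος": 44, "ψυχή": 81,
    "Ἅιδης": 2, "προιάπτω": 4, "ἥρως": 114, "αὐτός": 1121, "ἑλώριον": 1,
}


def split_by_cutoff(freqs, cutoff):
    """Split entries into (active, passive) lists, most frequent first."""
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    active = [lemma for lemma in ranked if freqs[lemma] >= cutoff]
    passive = [lemma for lemma in ranked if freqs[lemma] < cutoff]
    return active, passive


for cutoff in (20, 30):
    active, passive = split_by_cutoff(ILIAD_1_1_5, cutoff)
    print(cutoff, len(active), len(passive))
```

On this list both cutoffs give the same split, 15 entries for active mastery and 5 for passive recognition, since no entry falls between 20 and 29 occurrences.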
The problem with focusing on Iliad 1 (or on any one book of the Iliad or the Odyssey) is that we will miss many words that are common in Homeric Epic as a whole but do not appear in our chosen book.
While the green regions in Figures 5 and 6 ultimately cover every single word in Iliad 1, being able to recognize every vocabulary item in Iliad 1 only prepares learners to recognize 145,026 of the 200,518 (72%) running words in the Homeric Epics as a whole.
If we return to the cutoff of 20 occurrences in Homer, 655 lemmas fall into this category. Adding these extra 90 dictionary entries for active mastery raises the number of running words in Homeric Epic that learners can understand from 145,026 to 150,848 (from 72% to 75%).
The problem is that many dictionary words that are common in Homer do not appear in Iliad 1. The following list shows the 20 most common such words in Homer that do not appear in the first book of the Iliad (with Telemachus, not surprisingly, at the top of the list).
Rank | Tot. | Freq. | Lemma | Short Gloss (after Chicago Lemmas) |
1 | 244 | 244 | Τηλέμαχος | Telemachus |
2 | 484 | 240 | ἔγχος | a spear |
3 | 716 | 232 | μνηστήρ | a wooer |
4 | 929 | 213 | ξένος | a guest |
5 | 1121 | 192 | τεῦχος | a weapon |
6 | 1311 | 190 | κελεύω | to urge, command |
7 | 1487 | 176 | ἀλλήλων | of one another |
8 | 1650 | 163 | Ἄρης | Ares |
9 | 1796 | 146 | δόμος | a house; a course of stone |
10 | 1932 | 136 | ταχύς | quick |
11 | 2067 | 135 | ὀτρύνω | to stir up |
12 | 2198 | 131 | ἄστυ | a city |
13 | 2328 | 130 | πατρίς | fatherland |
14 | 2457 | 129 | δύω | dunk |
15 | 2583 | 126 | ἐκεῖνος | that one over there |
16 | 2709 | 126 | τῷ | therefore |
17 | 2827 | 118 | πάσχω | to experience |
18 | 2943 | 116 | δῆμος | people; (originally) a country-district |
19 | 3059 | 116 | ἀμφότερος | each of two |
20 | 3173 | 114 | πεδίον | a plain |
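A list of this kind amounts to a set difference followed by a sort: take every dictionary entry in the corpus, drop those that occur in the chosen book, and rank the remainder by frequency while keeping a running total. In the sketch below the five frequencies are copied from the tables in this paper, the keys are transliterations, and the function name is hypothetical.

```python
def missing_common_words(corpus_freq, book_lemmas, top_n=20):
    """Rank entries absent from the book; rows are (running total, freq, lemma)."""
    missing = [
        (lemma, freq)
        for lemma, freq in corpus_freq.items()
        if lemma not in book_lemmas
    ]
    missing.sort(key=lambda pair: pair[1], reverse=True)
    rows, running_total = [], 0
    for lemma, freq in missing[:top_n]:
        running_total += freq
        rows.append((running_total, freq, lemma))
    return rows


# Frequencies in Homeric Epic, copied from the tables in this paper.
corpus_freq = {
    "Telemachos": 244, "enchos": 240, "mnester": 232,  # absent from Iliad 1
    "thea": 199, "menis": 16,                          # present in Iliad 1
}
book_lemmas = {"thea", "menis"}

rows = missing_common_words(corpus_freq, book_lemmas)
for row in rows:
    print(row)
```

The running totals 244, 484 and 716 match the first three rows of the table above.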
Unsurprisingly, we do not find suitors (mnêstêres) or guests (xenos): these are, like the name Telemachus, more typical of the Odyssey. But Iliad 1 also lacks common words for spears (enchos) and weapons (teuchos) that appear frequently elsewhere in the Iliad.
One could adopt a different approach and learn each common dictionary word in Homeric Epic, roughly starting from the top and working downwards. (You probably would start not only with the most common words but also with the most common regular nouns and verbs.) This would require looking at examples of Homeric Greek from outside of Iliad 1 to see how words such as mnêstêres and xenos are used. That would take away from the focus on Iliad 1 and/or require additional work. But it could produce a more balanced result.
Figure 8 illustrates two approaches to learning Homeric vocabulary. The top and bottom lines show vocabulary acquisition for learners working through the vocabulary in the 77 lessons of Pharr. If learners mastered every single dictionary word in Iliad 1 (without setting aside words of lower frequency), they would have seen 76% of all words in Homeric Epic. The top line represents 100% coverage for Iliad 1.
The two middle lines reflect results from the top-down approach, where we learn words purely by frequency (hence these lines have very smooth curves). Instead of learning the 1,118 separate dictionary entries covered in Pharr, we learn the most common 1,000 dictionary entries in Homer. These two lines are much closer. Learners would have mastered only 85.5% of the words in Iliad 1, but they would have seen 83% of all words in Homeric Epic. A stricter adherence to the overall frequencies of Homeric vocabulary and less focus on the vocabulary of a particular passage might prove more satisfying and effective for learners. They would then learn to recognize in context the roughly 15% of words in Iliad 1 that were not on their list.
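The comparison between the two strategies reduces to one measurement applied four times: the share of running words, in the book and in the whole corpus, whose dictionary entry the learner knows. The sketch below uses invented toy counts (five dictionary entries, three of which occur in the "book") purely to show the shape of the calculation. Its numbers are not Homeric data, and both function names are hypothetical.

```python
def coverage(known, lemma_counts):
    """Share of running words whose dictionary entry is in `known`."""
    total = sum(lemma_counts.values())
    covered = sum(n for lemma, n in lemma_counts.items() if lemma in known)
    return covered / total


def most_frequent(lemma_counts, n):
    """The n most frequent dictionary entries."""
    return set(sorted(lemma_counts, key=lemma_counts.get, reverse=True)[:n])


# Invented toy counts: a five-entry "corpus" and a "book" drawn from it.
corpus = {"w1": 50, "w2": 30, "w3": 10, "w4": 6, "w5": 4}
book = {"w1": 5, "w3": 3, "w5": 2}

book_first = set(book)                      # learn every entry in the book
frequency_first = most_frequent(corpus, 3)  # same number of entries, by frequency

for name, known in [("book-first", book_first), ("frequency-first", frequency_first)]:
    print(name, coverage(known, book), coverage(known, corpus))
```

In the toy data the book-first learner covers all of the book but only 64% of the corpus, while the frequency-first learner covers 80% of the book and 90% of the corpus. This is the same trade-off, in miniature, as the one between the outer and the middle lines of Figure 8.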
Overall, the point is that learners can now track their progress and the future value of new vocabulary in real time. Our hypothesis is that this will increase motivation and satisfaction, keeping learners more fully engaged and for a longer period of time. A next step will be to design ways to test that hypothesis. First, however, we need to present, in the second part of this paper, another method by which to track seen and unseen vocabulary more precisely.
Acknowledgements
This work was made possible by the Beyond Translation Project, funded by NEH HAA-266462-19 and by support from the Data Intensive Studies Center at Tufts University.
Citations
Iliad Treebank: https://github.com/gregorycrane/gAGDT/blob/master/data/xml/tlg0012.tlg001.perseus-grc1.tb.xml
Odyssey Treebank: https://github.com/gregorycrane/gAGDT/blob/master/data/xml/tlg0012.tlg002.perseus-grc1.tb.xml
Pharr, Clyde. Homeric Greek. Rev. ed., D.C. Heath & Co., 1925.
Pharr, Clyde, et al. Homeric Greek: A Book for Beginners. Fourth edition, University of Oklahoma Press, 2012.
Schmitt, Norbert, Xiangying Jiang, and William Grabe. “The Percentage of Words Known in a Text and Reading Comprehension.” Modern Language Journal, vol. 95, Feb. 2011, pp. 26-43.