by Santiago Noriega
Mentor: Gina Kuperberg, Psychology; funding source: Fowler family Summer Scholars fund
A language model (LM) is a probabilistic model that can be used to predict the next word in a sequence of text when provided with the preceding context, or to determine the probability of a word following a given context. The performance of an LM depends on the body of text (corpus) that it is trained on, and different LMs use different strategies to calculate the probability of upcoming words.
Ngrams and Cloze are two methods that have been widely used to measure the predictability (probability) of words. Ngram models divide a text into overlapping n-word sequences, most often trigrams (n = 3):
Text A trigrams: [The, war, between], [war, between, the], [between, the, Zulu] …
P(between | The war) = 1
P(was | The war) = 0
In the example above, the trigram model estimates the probability that between or was follows The war. Because between always follows The war in the provided text, the model estimates the probability of between after The war to be 1, and the probability of any other continuation to be 0. Cloze probability, on the other hand, is calculated from the responses of participants who are asked to continue a text with the word they think comes next: the cloze probability of a word is the proportion of participants who produced it.
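The two estimates described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the function names are hypothetical, the token list contains only the five words of Text A that appear in the trigram example, and the participant responses are invented.

```python
from collections import Counter, defaultdict

def trigram_counts(tokens):
    """Count, for every two-word context, the words that follow it."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def trigram_prob(counts, context, word):
    """Maximum-likelihood estimate: count(w1 w2 w3) / count(w1 w2 followed by anything)."""
    following = counts.get(tuple(context))
    if not following:
        return 0.0  # context never seen, so no estimate is available
    return following[word] / sum(following.values())

def cloze_prob(responses, word):
    """Proportion of participants who continued the text with `word`."""
    if not responses:
        return 0.0
    return Counter(responses)[word] / len(responses)

# Only the opening words of Text A shown in the trigram example.
tokens = "The war between the Zulu".split()
counts = trigram_counts(tokens)
print(trigram_prob(counts, ("The", "war"), "between"))  # 1.0
print(trigram_prob(counts, ("The", "war"), "was"))      # 0.0

# Invented responses from four imaginary participants.
responses = ["between", "between", "was", "ended"]
print(cloze_prob(responses, "between"))  # 0.5
```

With so little text, every seen context has a single continuation, which is why the trigram estimates are all 0 or 1; a larger training text gives graded probabilities.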
In this study, we used a state-of-the-art language model called GPT-2 to measure speech predictability in sentence sets for which Cloze measures had previously been gathered. We compared these model-generated measures to the crowd-sourced Cloze measures and to the modeled trigram measures in order to assess the potential of GPT-2 as a reliable measure of human speech predictability. We found that the GPT-2 probability measures were strongly correlated with the Cloze and trigram measures and followed very similar patterns in their distribution across sentences. Furthermore, probability-derived measures like entropy, which is often used to estimate information density, were also strongly correlated. These results support the use of GPT-2 as an accurate measure of text predictability.
GPT-2 also offers important improvements over the two traditional methods. The amount of context that GPT-2 considers can be adjusted, which makes it more flexible than Ngram models: unlike a trigram model, it can take into account context beyond the previous two words. Furthermore, although obtaining Cloze measures is still the gold standard for measuring predictability, it is a time-consuming and expensive procedure because it requires the recruitment of participants. GPT-2, on the other hand, can be used for any text in a much more economical and timely manner.
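As a rough illustration of the probability-derived measures mentioned above, the sketch below computes the Shannon entropy of a next-word distribution and the correlation between two sets of predictability scores. The poster does not say which correlation statistic was used, so Pearson's r is an assumption here, and the function names and input values are hypothetical rather than data from the study.

```python
import math

def entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2(p)) of a next-word distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Two equally likely continuations carry 1 bit of uncertainty, four carry 2 bits.
print(entropy_bits([0.5, 0.5]))                # 1.0
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0

# Invented per-word scores standing in for Cloze and GPT-2 probabilities.
cloze_scores = [0.90, 0.10, 0.55, 0.30]
gpt2_scores = [0.80, 0.05, 0.60, 0.20]
print(pearson_r(cloze_scores, gpt2_scores))
```

Higher entropy means the context leaves more continuations open, so the upcoming word is less predictable; a correlation close to 1 would mean the two methods rank words similarly.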