Most speech-to-text (STT) systems would produce a transcript that looks something like this:
This transcript is useful, but it falls short of truly capturing the conversation. There are a few obvious omissions. Perhaps most importantly, there is a 4.1 second silence between lines 5 and 6. This gap does some heavy lifting: it shows that Speaker 2 expected Speaker 1 to understand their joke. In addition, STT systems typically ignore laughter. On line 2, Speaker 2 laughs, an early cue that they are joking. On line 7, Speaker 1 laughs, indicating that they finally understand the joke. To really capture the sequence, we need a transcript more like the one below, which marks paralanguage: everything that comes along with the words.
The system for transcribing paralinguistic features, Jeffersonian transcription, was created by Gail Jefferson. Jeffersonian transcription is slow going. On one hand, the process of re-re-re-listening to the audio to add more and more detail helps the researcher understand the mechanisms underlying the interaction. On the other hand, the laborious process limits the amount of data researchers can analyze. And it would be impossible to generate enough Jeffersonian transcripts to train a deep learning language model.
Enter: GailBot. We designed GailBot to create first-pass transcriptions of some paralinguistic features (speech rate, silences, overlaps, and laughter). It interfaces with existing STT algorithms, then applies post-processing modules that insert Jeffersonian symbols. All of this is useful, but GailBot’s most valuable characteristic is that it allows researchers to create, insert, and adjust post-processing modules; as researchers develop better algorithms or pose different research questions, they can easily create their own customized GailBot.
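To make the post-processing idea concrete, here is a minimal sketch of what such a module could look like. The data format, threshold, and function name are assumptions for illustration only, not GailBot’s actual API; the idea is simply to insert Jeffersonian silence markers wherever the word-level timestamps from the STT output leave a gap.

```python
# Minimal sketch of a Jeffersonian-style post-processing pass.
# Assumptions (not GailBot's real API): STT output is a list of
# (speaker, word, start_sec, end_sec) tuples; gaps of 0.2 s or more are
# marked as timed silences, e.g. (0.4), and shorter gaps as micro-pauses (.).

from typing import List, Tuple

Word = Tuple[str, str, float, float]  # (speaker, word, start, end)

def mark_silences(words: List[Word], micro: float = 0.2) -> List[str]:
    """Return word tokens with Jeffersonian silence markers inserted between them."""
    out: List[str] = []
    for prev, curr in zip(words, words[1:]):
        out.append(prev[1])
        gap = curr[2] - prev[3]          # silence between end of prev and start of curr
        if gap >= micro:
            out.append(f"({gap:.1f})")   # timed silence, e.g. (0.4)
        elif gap > 0:
            out.append("(.)")            # micro-pause shorter than 0.2 s
    if words:
        out.append(words[-1][1])
    return out

# Example: a 4.1 second gap between two turns is rendered as "(4.1)".
stt = [("S1", "okay", 0.0, 0.3), ("S1", "right", 0.35, 0.6), ("S2", "yeah", 4.7, 4.9)]
print(" ".join(mark_silences(stt)))  # okay (.) right (4.1) yeah
```

A custom GailBot-style module for overlaps or laughter would slot into the same place: it reads the timestamped output and rewrites the token stream with the relevant Jeffersonian symbols.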
In this episode, JP and Alex interview Michael Lee. They discuss model complexity and generative models, the differences between cognitive models and machine learning, whether and when preregistration of models is useful, and Michael’s undying love of cricket.
Follow us on Twitter: https://twitter.com/TheBayesFactor
Follow us on Facebook: https://www.facebook.com/TheBayesFactor/
Subscribe & leave us a review on iTunes: https://itunes.apple.com/us/podcast/the-bayes-factor/id1308207723
Notes and links:
Follow Michael on Twitter: https://twitter.com/mdlBayes
The workshop we attended resulted in this special issue on robust modeling in cognitive science: https://link.springer.com/journal/42113/volumes-and-issues/2-3
Lee and Vanpaemel on informative priors: https://link.springer.com/article/10.3758/s13423-017-1238-3
Evaluating Models of Gesture and Speech Production for People With Aphasia
People with aphasia use gestures not only to communicate relevant content but also to compensate for their verbal limitations. The Sketch Model (De Ruiter, 2000) assumes a flexible relationship between gesture and speech with the possibility of a compensatory use of the two modalities. In the successor of the Sketch Model, the AR-Sketch Model (De Ruiter, 2017), the relationship between iconic gestures and speech is no longer assumed to be flexible and compensatory, but instead iconic gestures are assumed to express information that is redundant to speech. In this study, we evaluated the contradictory predictions of the Sketch Model and the AR-Sketch Model using data collected from people with aphasia as well as a group of people without language impairment. We only found compensatory use of gesture in the people with aphasia, whereas the people without language impairments made very little compensatory use of gestures. Hence, the people with aphasia gestured according to the prediction of the Sketch Model, whereas the people without language impairment did not. We conclude that aphasia fundamentally changes the relationship of gesture and speech.
In their informal verbal exchanges, people tend to follow the ‘one speaker at a time’ rule (Schegloff, 1968). The use of the term ‘turn-taking’ to describe the process in which this rule operates in human conversation is relatively recent, and attributed to Yngve (1970) and Goffman (1967) by Duncan (1972).
Especially since the famous 1974 paper by Harvey Sacks, Emanuel Schegloff, & Gail Jefferson in the journal Language, which marks the birth of the sociological discipline now called Conversation Analysis (CA), turn-taking in conversation has attracted attention from a variety of disciplines.
In this chapter, I will briefly summarize the main theoretical approaches and controversies regarding turn-taking, followed by some reflections on different ways it can be
In everyday conversations, the gap between turns of conversational partners is most frequently between 0 and 200 ms. We were interested in how speakers achieve such fast transitions. We designed an experiment in which participants listened to pre-recorded questions about images presented on a screen and were asked to answer these questions. We tested whether speakers already prepare their answers while they listen to the questions and whether they can prepare for the time of articulation by anticipating when the questions end. In the experiment, it was possible to guess the answer at the beginning of the question in half of the experimental trials. We also manipulated whether it was possible to predict the length of the last word of the question. The results suggest that when listeners know the answer early, they start speech production already during the question. Speakers can also time when to speak by predicting the duration of turns. These temporal predictions can be based on the length of anticipated words and on the overall probability of turn durations.
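As a toy illustration of those two ingredients, and emphatically not the authors’ model, the sketch below predicts a turn’s total duration by summing the expected durations of the words the listener anticipates and shrinking that estimate toward an overall prior over turn durations. All constants and names are invented for illustration.

```python
# Toy illustration (not the authors' model): estimate a turn's total duration
# by combining (a) expected durations of the anticipated remaining words with
# (b) a prior over how long turns tend to be. All constants are invented.

from typing import List

def predict_turn_duration(elapsed_sec: float,
                          anticipated_words: List[str],
                          sec_per_char: float = 0.08,
                          prior_mean_sec: float = 2.0,
                          prior_weight: float = 0.3) -> float:
    """Predicted total turn duration in seconds."""
    # (a) how much longer the turn should last, given the anticipated words
    remaining = sum(len(word) * sec_per_char for word in anticipated_words)
    word_based_estimate = elapsed_sec + remaining
    # (b) shrink the estimate toward the overall distribution of turn durations
    return (1 - prior_weight) * word_based_estimate + prior_weight * prior_mean_sec

# e.g. 1.2 s heard so far, listener anticipates the turn ends with "the red one"
print(round(predict_turn_duration(1.2, ["the", "red", "one"]), 2))  # 1.94
```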
During conversation, listeners have to perform several tasks simultaneously. They have to comprehend their interlocutor’s turn, while also having to prepare their own next turn. Moreover, a careful analysis of the timing of natural conversation reveals that next speakers also time their turns very precisely. This is possible only if listeners can predict accurately when the speaker’s turn is going to end. But how are people able to predict when a turn ends? We propose that people know when a turn ends because they know how it ends. We conducted a gating study to examine whether better turn-end predictions coincide with more accurate anticipation of the last words of a turn. We used turns from an earlier button-press experiment in which people had to press a button exactly when a turn ended. We show that the proportion of correct guesses in our experiment is higher when a turn’s end was estimated better in time in the button-press experiment. When people were too late in their anticipation in the button-press experiment, they also anticipated more words in our gating study. We conclude that people made predictions in advance about the upcoming content of a turn and used this prediction to estimate the duration of the turn. We suggest an economical model of turn-end anticipation that is based on anticipation of words and syntactic frames in comprehension.
Recognizing the intention of others is important in all social interactions, especially in the service domain. Enabling a bartending robot to serve customers is particularly challenging, as the system has to recognize the social signals produced by customers and respond appropriately. Detecting whether a customer would like to order is essential for the service encounter to succeed. This detection is particularly challenging in a noisy environment with multiple customers. Thus, a bartending robot has to be able to distinguish between customers intending to order, chatting with friends, or just passing by. In order to study which signals customers use to initiate a service interaction in a bar, we recorded real-life customer-staff interactions in several German bars. These recordings were used to generate initial hypotheses about the signals customers produce when bidding for the attention of bar staff. Two experiments using snapshots and short video sequences then tested the validity of these hypothesized candidate signals. The results revealed that bar staff responded to a set of two non-verbal signals: first, customers position themselves directly at the bar counter and, second, they look at a member of staff. Both signals were necessary and, when occurring together, sufficient. The participants also showed strong agreement about when these cues occurred in the videos. Finally, a signal detection analysis revealed that ignoring a potential order is deemed worse than erroneously inviting customers to order. We conclude that (a) these two easily recognizable actions are sufficient for recognizing the intention of customers to initiate a service interaction, while other actions such as gestures and speech were not necessary, and (b) the use of reaction time experiments using natural materials is feasible and provides ecologically valid results.
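Read as a decision rule, that finding amounts to a conjunction of the two cues. The sketch below is a minimal, hypothetical rendering of that rule; the field and function names are illustrative and are not taken from the robot’s actual recognizer interface.

```python
# Minimal sketch of the two-signal decision rule the study supports:
# a customer is treated as bidding for attention exactly when they are
# positioned directly at the bar counter AND looking at a member of staff.
# Field names are hypothetical, not the robot's actual recognizer output.

from dataclasses import dataclass

@dataclass
class CustomerState:
    at_counter: bool        # vision: standing directly at the bar counter
    looking_at_staff: bool  # vision: gaze directed at a staff member

def wants_to_order(c: CustomerState) -> bool:
    # Both signals were necessary and, together, sufficient in the study;
    # speech and gesture were not required for this first detection step.
    return c.at_counter and c.looking_at_staff

print(wants_to_order(CustomerState(at_counter=True, looking_at_staff=True)))   # True
print(wants_to_order(CustomerState(at_counter=True, looking_at_staff=False)))  # False
```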
We used a new method called “Ghost-in-the-Machine” (GiM) to investigate social interactions with a robotic bartender taking orders for drinks and serving them. Using the GiM paradigm allowed us to identify how human participants recognize the intentions of customers on the basis of the output of the robotic recognizers. Specifically, we measured which recognizer modalities (e.g., speech, the distance to the bar) were relevant at different stages of the interaction. This provided insights into human social behavior necessary for the development of socially competent robots. When initiating the drink-order interaction, the most important recognizers were those based on computer vision. When drink orders were being placed, however, the most important information source was speech recognition. Interestingly, the participants used only a subset of the available information, focusing only on a few relevant recognizers while ignoring others. This reduced the risk of acting on erroneous sensor data and enabled them to complete service interactions more swiftly than a robot using all available sensor data. We also investigated socially appropriate response strategies. In their responses, the participants preferred to use the same modality as the customer’s requests; e.g., they tended to respond verbally to verbal requests. Also, they added redundancy to their responses, for instance by using echo questions. We argue that incorporating the social strategies discovered with the GiM paradigm in multimodal grammars of human–robot interactions improves the robustness and the ease-of-use of these interactions, and therefore provides a smoother user experience.
We investigate effects of priming and preference on frame of reference (FOR) selection in dialog. In a first study, we determine FOR preferences for specific object configurations to establish a baseline. In a second study, we focus on the selection of the relative or the intrinsic FOR in dialog using the same stimuli, addressing the questions of whether (a) interlocutors prime each other to use the same FOR consistently or (b) the preference for the intrinsic FOR predominates over priming effects. Our results show effects of priming (more use of the relative FOR) and a decreased preference for the intrinsic FOR. However, as FOR selection did not have an effect on target-trial accuracy, neither effect alone represents the key to successful communication in this domain. Rather, we found that successful communication depended on the adaptation of strategies between interlocutors: the more the interlocutors adapted to each other’s strategies, the more successful they were.