Author Archives: Muhammad Umair

Can Language Models Trained on Written Monologue Learn to Predict Spoken Dialogue?

Transformer-based Large Language Models (LLMs) have recently increased in popularity, in part due to their impressive performance on a number of language tasks. While LLMs can produce human-like writing, the extent to which these models can learn to predict spoken language in natural interaction remains unclear. This is a nontrivial question, as spoken and written language differ in syntax, pragmatics, and norms that interlocutors follow. Previous work suggests that while LLMs may develop an understanding of linguistic rules based on statistical regularities, they fail to acquire the knowledge required for language use. This implies that LLMs may not learn the normative structure underlying interactive spoken language, but may instead only model superficial regularities in speech. In this paper, we aim to evaluate LLMs as models of spoken dialogue. Specifically, we investigate whether LLMs can learn that the identity of a speaker in spoken dialogue influences what is likely to be said. To answer this question, we first fine-tuned two variants of a specific LLM (GPT-2) on transcripts of natural spoken dialogue in English. Then, we used these models to compute surprisal values for two-turn sequences with the same first-turn but different second-turn speakers and compared the output to human behavioral data. While the predictability of words in all fine-tuned models was influenced by speaker identity information, the models did not replicate humans’ use of this information. Our findings suggest that although LLMs may learn to generate text conforming to normative linguistic structure, they do not (yet) faithfully replicate human behavior in natural conversation.
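The surprisal measure mentioned in the abstract can be illustrated with a minimal sketch. Surprisal is the negative log probability of a word given its context, and a turn's surprisal is commonly taken as the sum over its words. The probabilities below are hypothetical placeholders, not outputs of the fine-tuned GPT-2 models, and the choice of base-2 logarithms (bits) is an assumption; the paper may use a different base or aggregation.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Surprisal (in bits) of a word that the model assigned probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def turn_surprisal(word_probs) -> float:
    """Total surprisal of a turn: the sum of its per-word surprisals."""
    return sum(surprisal_bits(p) for p in word_probs)


# Hypothetical per-word probabilities for the same second turn,
# conditioned on two different second-turn speakers (illustrative values only).
same_speaker_probs = [0.5, 0.25, 0.125]
other_speaker_probs = [0.5, 0.5, 0.25]

same_total = turn_surprisal(same_speaker_probs)
other_total = turn_surprisal(other_speaker_probs)
print(f"same speaker:  {same_total:.2f} bits")
print(f"other speaker: {other_total:.2f} bits")
```

In this toy case the second turn is less surprising under the "other speaker" condition. Comparing such totals across speaker conditions is the kind of contrast the abstract describes.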

De Beer, C., Hogrefe, M., Hielscher-Fastabend, M., & De Ruiter, J.P. (2020). Evaluating models of gesture and speech production for people with aphasia. Cognitive Science.

Evaluating Models of Gesture and Speech Production for People With Aphasia

People with aphasia use gestures not only to communicate relevant content but also to compensate for their verbal limitations. The Sketch Model (De Ruiter, 2000) assumes a flexible relationship between gesture and speech with the possibility of a compensatory use of the two modalities. In the successor of the Sketch Model, the AR-Sketch Model (De Ruiter, 2017), the relationship between iconic gestures and speech is no longer assumed to be flexible and compensatory, but instead iconic gestures are assumed to express information that is redundant to speech. In this study, we evaluated the contradictory predictions of the Sketch Model and the AR-Sketch Model using data collected from people with aphasia as well as a group of people without language impairment. We only found compensatory use of gesture in the people with aphasia, whereas the people without language impairments made very little compensatory use of gestures. Hence, the people with aphasia gestured according to the prediction of the Sketch Model, whereas the people without language impairment did not. We conclude that aphasia fundamentally changes the relationship of gesture and speech.

De Ruiter, J.P. (2019). Turn-Taking. In: Cummins, C. & Katsos, N. (eds.), Oxford Handbook of Experimental Semantics and Pragmatics.

In their informal verbal exchanges, people tend to follow the ‘one speaker at a time’ rule (Schegloff, 1968). The use of the term ‘turn-taking’ to describe the process in which this rule operates in human conversation is relatively recent, and attributed to Yngve (1970) and Goffman (1967) by Duncan (1972).

Especially since the famous 1974 paper by Harvey Sacks, Emanuel Schegloff, & Gail Jefferson in the journal Language, which marks the birth of the sociological discipline now called Conversation Analysis (CA), turn-taking in conversation has attracted attention from a variety of disciplines.

In this chapter, I will briefly summarize the main theoretical approaches and controversies regarding turn-taking, followed by some reflections on different ways it can be studied experimentally.

Magyari, L., De Ruiter, J.P., & Levinson, L. (2017). Temporal Preparation for Speaking in Question-Answer Sequences. Frontiers in Psychology 8.

In everyday conversations, the gap between turns of conversational partners is most frequently between 0 and 200 ms. We were interested in how speakers achieve such fast transitions. We designed an experiment in which participants listened to pre-recorded questions about images presented on a screen and were asked to answer these questions. We tested whether speakers already prepare their answers while they listen to questions and whether they can prepare for the time of articulation by anticipating when questions end. In the experiment, it was possible to guess the answer at the beginning of the questions in half of the experimental trials. We also manipulated whether it was possible to predict the length of the last word of the questions. The results suggest that when listeners know the answer early, they start speech production already during the questions. Speakers can also time when to speak by predicting the duration of turns. These temporal predictions can be based on the length of anticipated words and on the overall probability of turn durations.

Magyari, L., & De Ruiter, J. P. (2012). Prediction of turn-ends based on anticipation of upcoming words. Frontiers in Psychology 3, 376.

During conversation, listeners have to perform several tasks simultaneously. They have to comprehend their interlocutor’s turn, while also having to prepare their own next turn. Moreover, a careful analysis of the timing of natural conversation reveals that next speakers also time their turns very precisely. This is possible only if listeners can predict accurately when the speaker’s turn is going to end. But how are people able to predict when a turn ends? We propose that people know when a turn ends, because they know how it ends. We conducted a gating study to examine if better turn-end predictions coincide with more accurate anticipation of the last words of a turn. We used turns from an earlier button-press experiment where people had to press a button exactly when a turn ended. We show that the proportion of correct guesses in our experiment is higher when a turn’s end was estimated better in time in the button-press experiment. When people were too late in their anticipation in the button-press experiment, they also anticipated more words in our gating study. We conclude that people made predictions in advance about the upcoming content of a turn and used this prediction to estimate the duration of the turn. We suggest an economical model of turn-end anticipation that is based on anticipation of words and syntactic frames in comprehension.

Loth, S., Huth, K., & De Ruiter, J. P. (2013). Automatic detection of service initiation signals used in bars. Frontiers in Psychology, 4 (557). http://doi.org/10.3389/fpsyg.2013.00557

Recognizing the intention of others is important in all social interactions, especially in the service domain. Enabling a bartending robot to serve customers is particularly challenging as the system has to recognize the social signals produced by customers and respond appropriately. Detecting whether a customer would like to order is essential for the service encounter to succeed. This detection is particularly challenging in a noisy environment with multiple customers. Thus, a bartending robot has to be able to distinguish between customers intending to order, chatting with friends or just passing by. In order to study which signals customers use to initiate a service interaction in a bar, we recorded real-life customer-staff interactions in several German bars. These recordings were used to generate initial hypotheses about the signals customers produce when bidding for the attention of bar staff. Two experiments using snapshots and short video sequences then tested the validity of these hypothesized candidate signals. The results revealed that bar staff responded to a set of two non-verbal signals: first, customers position themselves directly at the bar counter and, secondly, they look at a member of staff. Both signals were necessary and, when occurring together, sufficient. The participants also showed a strong agreement about when these cues occurred in the videos. Finally, a signal detection analysis revealed that ignoring a potential order is deemed worse than erroneously inviting customers to order. We conclude that (a) these two easily recognizable actions are sufficient for recognizing the intention of customers to initiate a service interaction, but other actions such as gestures and speech were not necessary, and (b) the use of reaction time experiments using natural materials is feasible and provides ecologically valid results.
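The signal detection analysis mentioned above can be sketched in a few lines. The standard equal-variance measures are sensitivity d' = z(H) - z(F) and criterion c = -(z(H) + z(F)) / 2, where H is the hit rate and F the false-alarm rate. A negative c indicates a liberal bias toward responding "customer wants to order", which is the pattern expected when missing an order is treated as worse than a false invitation. The counts below are hypothetical and are not taken from the study, and whether the authors used exactly these measures is an assumption.

```python
from statistics import NormalDist


def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) under the equal-variance Gaussian model.

    Rates of exactly 0 or 1 are not handled here; NormalDist.inv_cdf
    raises an error for them, so real data may need a correction first.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion


# Hypothetical response counts, for illustration only.
d_prime, criterion = sdt_measures(
    hits=90, misses=10, false_alarms=30, correct_rejections=70
)
print(f"d' = {d_prime:.2f}, c = {criterion:.2f}")
```

With these made-up counts the criterion comes out negative, meaning the observer accepts more false alarms in order to avoid misses.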

Loth, S., Giuliani, M., Jettka, K., Kopp, S. & De Ruiter, J.P. (2018). Confidence in uncertainty: Error cost and commitment in early speech hypotheses. PLoS One.

Interactions with artificial agents often lack immediacy because agents respond slower than their users expect. Automatic speech recognisers introduce this delay by analysing a user’s utterance only after it has been completed. Early, uncertain hypotheses of incremental speech recognisers can enable artificial agents to respond more timely. However, these hypotheses may change significantly with each update. Therefore, an already initiated action may turn into an error and invoke error cost. We investigated whether humans would use uncertain hypotheses for planning ahead and/or initiating their response. We designed a Ghost-in-the-Machine study in a bar scenario. A human participant controlled a bartending robot and perceived the scene only through its recognisers. The results showed that participants used uncertain hypotheses for selecting the best matching action. This is comparable to computing the utility of dialogue moves. Participants evaluated the available evidence and the error cost of their actions prior to initiating them. If the error cost was low, the participants initiated their response with only suggestive evidence. Otherwise, they waited for additional, more confident hypotheses if they still had time to do so. If there was time pressure but only little evidence, participants grounded their understanding with echo questions. These findings contribute to a psychologically plausible policy for human-robot interaction that enables artificial agents to respond more timely and socially appropriately under uncertainty.

Loth, S., Jettka, K., Giuliani, M., & De Ruiter, J. P. (2015). Ghost-in-the-Machine reveals human social signals for human–robot interaction. Frontiers in Psychology, 6. http://doi.org/10.3389/fpsyg.2015.01641

We used a new method called “Ghost-in-the-Machine” (GiM) to investigate social interactions with a robotic bartender taking orders for drinks and serving them. Using the GiM paradigm allowed us to identify how human participants recognize the intentions of customers on the basis of the output of the robotic recognizers. Specifically, we measured which recognizer modalities (e.g., speech, the distance to the bar) were relevant at different stages of the interaction. This provided insights into human social behavior necessary for the development of socially competent robots. When initiating the drink-order interaction, the most important recognizers were those based on computer vision. When drink orders were being placed, however, the most important information source was the speech recognition. Interestingly, the participants used only a subset of the available information, focussing only on a few relevant recognizers while ignoring others. This reduced the risk of acting on erroneous sensor data and enabled them to complete service interactions more swiftly than a robot using all available sensor data. We also investigated socially appropriate response strategies. In their responses, the participants preferred to use the same modality as the customer’s requests, e.g., they tended to respond verbally to verbal requests. Also, they added redundancy to their responses, for instance by using echo questions. We argue that incorporating the social strategies discovered with the GiM paradigm in multimodal grammars of human–robot interactions improves the robustness and the ease-of-use of these interactions, and therefore provides a smoother user experience.

Johannsen, K., & De Ruiter, J. P. (2013). Reference frame selection in dialogue: priming or preference? Frontiers in Human Neuroscience, 7, 667.

We investigate effects of priming and preference on frame of reference (FOR) selection in dialog. In a first study, we determine FOR preferences for specific object configurations to establish a baseline. In a second study, we focus on the selection of the relative or the intrinsic FOR in dialog using the same stimuli and addressing the questions whether (a) interlocutors prime each other to use the same FOR consistently or (b) the preference for the intrinsic FOR predominates priming effects. Our results show effects of priming (more use of the relative FOR) and a decreased preference for the intrinsic FOR. However, as FOR selection did not have an effect on target trial accuracy, neither effect alone represents the key to successful communication in this domain. Rather, we found that successful communication depended on the adaptation of strategies between interlocutors: the more the interlocutors adapted to each other’s strategies, the more successful they were.

Johannsen, K., & De Ruiter, J. P. (2013). The role of scene type and priming in the processing and selection of a spatial frame of reference. Frontiers in Psychology, 4, 182.

The selection and processing of a spatial frame of reference (FOR) in interpreting verbal scene descriptions is of great interest to psycholinguistics. In this study, we focus on the choice between the relative and the intrinsic FOR, addressing two questions: (a) does the presence or absence of a background in the scene influence the selection of a FOR, and (b) what is the effect of a previously selected FOR on the subsequent processing of a different FOR. Our results show that if a scene includes a realistic background, this will make the selection of the relative FOR more likely. We attribute this effect to the facilitation of mental simulation, which enhances the relation between the viewer and the objects. With respect to the response accuracy, we found both a higher (with the same FOR) and a lower accuracy (with a different FOR), while for the response latencies, we only found a delay effect with a different FOR.

Human Interaction Laboratory

Tufts University