De Ruiter, J. P. (2007, January 25-26). Some multimodal signals in humans. In Van der Sluis, I., Theune, M., Reiter, E., & Krahmer, E. (Eds.), Proceedings of the Workshop on Multimodal Output Generation MOG 2007 (pp. 141-148), Aberdeen, UK.

In this paper, I will give an overview of some well-studied multimodal signals that humans produce while they communicate with other humans, and discuss the implications of those studies for HCI.

I will first discuss a conceptual framework that allows us to distinguish between functional and sensory modalities. This distinction is important, as there are multiple functional modalities using the same sensory modality (e.g., facial expression and eye-gaze in the visual modality). A second theoretically important issue is redundancy. Some signals appear to be redundant with a signal in another modality, whereas others give new information or even appear to give conflicting information (see e.g., the work of Susan Goldin-Meadow on speech-accompanying gestures). I will argue that multimodal signals are never truly redundant. First, many gestures that appear at first sight to express the same meaning as the accompanying speech generally provide extra (analog) information about manner, path, etc. Second, the simple fact that the same information is expressed in more than one modality is itself a communicative signal.

Armed with this conceptual background, I will then proceed to give an overview of some multimodal signals that have been investigated in human-human research, and the level of understanding we have of the meaning of those signals. The latter issue is especially important for potential implementations of these signals in artificial agents.

First, I will discuss pointing gestures. I will address the timing of pointing gestures relative to the speech they are meant to support, the mutual dependency between pointing gestures and speech, and the existence of alternative ways of pointing found in other cultures. The most frequent form of pointing that does not involve the index finger is a cultural practice called lip-pointing, which employs two visual functional modalities, mouth-shape and eye-gaze, simultaneously for pointing.

Next, I will address the issue of eye-gaze. A classical study by Kendon (1967) claims that there is a systematic relationship between eye-gaze (at the interlocutor) and turn-taking states. Research at our institute has shown that this relationship is weaker than has often been assumed. If the dialogue setting contains a visible object that is relevant to the dialogue (e.g., a map), the rate of eye-gaze-at-other drops dramatically and its relationship to turn-taking disappears completely. The implications for machine-generated eye-gaze are discussed.

Finally, I will explore a theoretical debate regarding spontaneous gestures. It has often been claimed that the class of gestures that McNeill (1992) calls iconic is a “window into the mind”. That is, these gestures are claimed to give the researcher (or even the interlocutor) a direct view into the speaker’s thought, without being obscured by the complex transformations that take place when a thought is turned into a verbal utterance. I will argue that this is an illusion. Gestures can be shown to be specifically designed such that the listener can be expected to interpret them. Although the transformations carried out to express a thought in gesture are indeed (partly) different from the corresponding transformations for speech, they are a) complex, and b) severely understudied. This obviously has consequences both for the gesture research agenda, and for the generation of iconic gestures by machines.

De Ruiter, J. P., Rossignol, S., Vuurpijl, L., Cunningham, D., & Levelt, W. (2003). SLOT: A research platform for investigating multimodal communication. Behavior Research Methods, Instruments, & Computers, 35(3), 408-419.

In this article, we present the spatial logistics task (SLOT) platform for investigating multimodal communication between 2 human participants. Presented are the SLOT communication task and the software and hardware that has been developed to run SLOT experiments and record the participants’ multimodal behavior. SLOT offers a high level of flexibility in varying the context of the communication and is particularly useful in studies of the relationship between pen gestures and speech. We illustrate the use of the SLOT platform by discussing the results of some early experiments. The first is an experiment on negotiation with a one-way mirror between the participants, and the second is an exploratory study of automatic recognition of spontaneous pen gestures. The results of these studies demonstrate the usefulness of the SLOT platform for conducting multimodal communication research in both human–human and human–computer interactions.

De Ruiter, J. P., Noordzij, M., Newman-Norlund, S. E., Hagoort, P., & Toni, I. (2007). On the origin of intentions. In Haggard, P., Rossetti, Y., & Kawato, M. (Eds.), Attention & Performance XXII: Sensorimotor foundations of higher cognition (pp. 593-610). Oxford: Oxford University Press.

Any model of motor control or sensorimotor transformations starts from an intention to trigger a cascade of neural computations, yet how intentions themselves are generated remains a mystery. Part of the difficulty in dealing with this mystery might be related to the received wisdom of studying sensorimotor processes and intentions in individual agents. Here we explore the use of an alternative approach, focused on understanding how we induce intentions in other people. Under the assumption that generating intentions in a third person relies on similar mechanisms to those involved in generating first-person intentions, this alternative approach might shed light on the origin of our own intentions. Therefore, we focus on the cognitive and cerebral operations supporting the generation of communicative actions, i.e. actions designed (by a Sender) to trigger (in a Receiver) the recognition of a given communicative intention. We present empirical findings indicating that communication requires the Sender to select his behavior on the basis of a prediction of how the Receiver will interpret this behavior; and that there is spatial overlap between the neural structures supporting the generation of communicative actions and the generation of first-person intentions. These results support the hypothesis that the generation of intentions might be a particular instance of our ability to induce and attribute mental states to an agent. We suggest that motor intentions are retrodictive with respect to the neurophysiological mechanisms that generate a given action, while being predictive with respect to the potential intention attribution evoked by a given action in other agents.

De Ruiter, J. P., Mitterer, H., & Enfield, N. J. (2006). Projecting the end of a speaker’s turn: A cognitive cornerstone of conversation. Language, 82(3), 515-535.

A key mechanism in the organization of turns at talk in conversation is the ability to anticipate or PROJECT the moment of completion of a current speaker’s turn. Some authors suggest that this is achieved via lexicosyntactic cues, while others argue that projection is based on intonational contours. We tested these hypotheses in an on-line experiment, manipulating the presence of symbolic (lexicosyntactic) content and intonational contour of utterances recorded in natural conversations. When hearing the original recordings, subjects can anticipate turn endings with the same degree of accuracy attested in real conversation. With intonational contour entirely removed (leaving intact words and syntax, with a completely flat pitch), there is no change in subjects’ accuracy of end-of-turn projection. But in the opposite case (with original intonational contour intact, but with no recognizable words), subjects’ performance deteriorates significantly. These results establish that the symbolic (i.e. lexicosyntactic) content of an utterance is necessary (and possibly sufficient) for projecting the moment of its completion, and thus for regulating conversational turn-taking. By contrast, and perhaps surprisingly, intonational contour is neither necessary nor sufficient for end-of-turn projection.

De Ruiter, J. P. (2004, May 25). On the primacy of language in multimodal communication. In Workshop Proceedings on Multimodal Corpora: Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces (pp. 38-41). Paris: ELRA – European Language Resources Association.

In this paper, I will argue that although the study of multimodal interaction offers exciting new prospects for Human Computer Interaction and human-human communication research, language is the primary form of communication, even in multimodal systems. I will support this claim with theoretical and empirical arguments, mainly drawn from human-human communication research, and will discuss the implications for multimodal communication research and Human-Computer Interaction.

Christiansen, M. H., & Chater, N. (2008). Language as shaped by the brain. Behavioral and Brain Sciences, 31(5), 489-508.

It is widely assumed that human learning and the structure of human languages are intimately related. This relationship is frequently suggested to derive from a language-specific biological endowment, which encodes universal, but communicatively arbitrary, principles of language structure (a Universal Grammar or UG). How might such a UG have evolved? We argue that UG could not have arisen either by biological adaptation or non-adaptationist genetic processes, resulting in a logical problem of language evolution. Specifically, as the processes of language change are much more rapid than processes of genetic change, language constitutes a “moving target” both over time and across different human populations, and, hence, cannot provide a stable environment to which language genes could have adapted. We conclude that a biologically determined UG is not evolutionarily viable. Instead, the original motivation for UG – the mesh between learners and languages – arises because language has been shaped to fit the human brain, rather than vice versa. Following Darwin, we view language itself as a complex and interdependent “organism,” which evolves under selectional pressures from human learning and processing mechanisms. That is, languages themselves are shaped by severe selectional pressure from each generation of language users and learners. This suggests that apparently arbitrary aspects of linguistic structure may result from general learning and processing biases deriving from the structure of thought processes, perceptuo-motor factors, cognitive limitations, and pragmatics.

De Ruiter, J. P., & Levinson, S. C. (2008). A biological infrastructure for communication underlies the cultural evolution of languages [Commentary]. Behavioral and Brain Sciences, 31, 518.

Universal Grammar (UG) is indeed evolutionarily implausible. But if languages are just “adapted” to a large primate brain, it is hard to see why other primates do not have complex languages. The answer is that humans have evolved a specialized and uniquely human cognitive architecture, whose main function is to compute mappings between arbitrary signals and communicative intentions. This underlies the development of language in the human species.

De Ruiter, J. P. (2006). Can gesticulation help aphasic people speak, or rather, communicate? Advances in Speech-Language Pathology, 8(2), 124-127.

As Rose (2006) discusses in the lead article, two camps can be identified in the field of gesture research: those who believe that gesticulation enhances communication by providing extra information to the listener, and those who believe that gesticulation is not communicative, but rather facilitates speaker-internal word-finding processes. I review a number of key studies relevant to this controversy, and conclude that the available empirical evidence supports the notion that gesture is a communicative device that can compensate for problems in speech by providing information in gesture. Following that, I discuss the finding by Rose and Douglas (2001) that making gestures does facilitate word production in some patients with aphasia. I argue that the gestures produced in the experiment by Rose and Douglas are not guaranteed to be of the same kind as the gestures that are produced spontaneously under naturalistic, communicative conditions, which makes it difficult to generalise from that particular study to general gesture behavior. As a final point, I encourage researchers in the area of aphasia to put more emphasis on communication in naturalistic contexts (e.g., conversation) in testing the capabilities of people with aphasia.