Take a second to listen to this conversation.
Most speech-to-text (STT) systems would produce a transcript that looks something like this:

This transcript is useful, but it falls short of truly capturing the conversation. There are a few obvious omissions. Perhaps most importantly, there is a 4.1-second silence between lines 5 and 6. This gap does some heavy lifting: it shows that Speaker 2 expected Speaker 1 to understand their joke. In addition, STT systems typically ignore laughter. On line 2, Speaker 2 laughs, an early cue that they are joking. On line 7, Speaker 1 laughs, indicating that they finally understand the joke. To really understand the sequence, we need a transcript more like the one below, which marks paralanguage – everything that comes along with the words.

The system for transcribing paralinguistic features – Jeffersonian transcription – was created by Gail Jefferson. Jeffersonian transcription is slow going. On one hand, the process of re-re-re-listening to the audio to add more and more detail helps the researcher understand the mechanisms underlying the interaction. On the other hand, the laborious process limits the amount of data researchers can analyze. And at this pace, it would be impossible to generate enough Jeffersonian transcripts to train a deep learning language model.
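For orientation, here are a few of the most common Jeffersonian conventions (a small subset of the full system):

```
(4.1)     timed silence, in seconds
(.)       micropause (roughly under 0.2 seconds)
[ ]       onset and end of overlapping speech
heh/hah   laughter particles
>word<    talk that is faster than the surrounding speech
<word>    talk that is slower than the surrounding speech
wor-      a cut-off word
```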
Enter: GailBot. We designed GailBot to create first-pass transcriptions of some paralinguistic features (speech rate, silences, overlaps, and laughter). It interfaces with existing STT algorithms and then applies post-processing modules that insert Jeffersonian symbols. All of this is useful, but GailBot's most valuable characteristic is that it allows researchers to create, insert, and adjust post-processing modules; as researchers develop better algorithms or pose different research questions, they can easily build their own customized GailBot.
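To make the module idea concrete, here is a minimal sketch of what a silence-marking post-processor might look like. Everything here – the Word record, the word-level timestamps, the mark_silences function – is an assumption for illustration, not GailBot's actual API; real STT output and GailBot's plugin interface differ in the details.

```python
from dataclasses import dataclass

@dataclass
class Word:
    """A hypothetical word-level STT token; real engines return richer objects."""
    speaker: str
    text: str
    start: float  # onset, in seconds
    end: float    # offset, in seconds

def mark_silences(words: list[Word], threshold: float = 0.2) -> str:
    """Insert Jeffersonian timed-silence markers, e.g. (4.1), wherever the
    gap between consecutive words meets or exceeds `threshold` seconds."""
    tokens: list[str] = []
    for prev, curr in zip(words, words[1:]):
        tokens.append(prev.text)
        gap = curr.start - prev.end
        if gap >= threshold:
            tokens.append(f"({gap:.1f})")
    if words:
        tokens.append(words[-1].text)
    return " ".join(tokens)

# The 4.1-second gap from the conversation above, reduced to two words:
words = [Word("S2", "joke", 10.0, 10.4), Word("S1", "oh", 14.5, 14.7)]
print(mark_silences(words))  # -> joke (4.1) oh
```

A laughter or overlap module would follow the same shape: consume timestamped tokens, detect a pattern, and splice in the corresponding Jeffersonian symbol.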
You can currently use GailBot from the command line (see here for instructions). For more detail, see the paper recently published in Dialogue and Discourse.