

Although it is clear that sighted listeners use both auditory and visual cues during speech perception, the manner in which multisensory information is combined is a matter of debate. One approach to measuring multisensory integration is to use variants of the McGurk illusion, in which discrepant auditory and visual cues produce auditory percepts that differ from those based on unimodal input. Not all listeners show the same degree of susceptibility to the McGurk illusion, and these individual differences are frequently used as a measure of audiovisual integration ability. However, despite their popularity, we join the voices of others in the field to argue that McGurk tasks are ill-suited for studying real-life multisensory speech perception: McGurk stimuli are often based on isolated syllables (which are rare in conversations) and necessarily rely on audiovisual incongruence that does not occur naturally. Furthermore, recent data show that susceptibility to McGurk tasks does not correlate with performance during natural audiovisual speech perception. Although the McGurk effect is a fascinating illusion, truly understanding the combined use of auditory and visual information during speech perception requires tasks that more closely resemble everyday communication: namely, words, sentences, and narratives with congruent auditory and visual speech cues.

Speech perception in face-to-face conversations is a prime example of multisensory integration: listeners have access not only to a speaker's voice, but to visual cues from their face, gestures, and body posture. More than 45 years ago, McGurk and MacDonald (1976) published a remarkable (and now famous) example of visual influence on auditory speech perception: when an auditory stimulus (e.g., /ba/) was presented with the face of a talker articulating a different syllable (e.g., /ga/), listeners often experienced an illusory percept distinct from both sources (e.g., /da/). Since that time, McGurk stimuli have been used in countless studies of audiovisual integration in humans (not to mention the multitude of classroom demonstrations on multisensory processing) (Marques et al., 2016). At the same time, the stimuli typically used to elicit a McGurk effect differ substantially from what we usually encounter in conversation. In this paper, we consider how to best investigate the benefits listeners receive from being able to see a speaker's face while listening to their speech during natural communication.

We start by reviewing behavioral findings regarding audiovisual speech perception and some theoretical constraints on our understanding of multisensory integration. With this background, we examine the McGurk effect to assess its usefulness for furthering our understanding of audiovisual speech perception, joining the voices of other speech scientists who argue that it is time to move beyond McGurk (Alsius et al., 2017; Getz and Toscano, 2021; Massaro, 2017).

BENEFITS OF AUDIOVISUAL SPEECH COMPARED TO AUDITORY-ONLY SPEECH

In experiments that present speech to listeners in noisy backgrounds, recognition of AV speech is significantly better than recognition of auditory-only speech (Erber, 1975; Sommers et al., 2005; Tye-Murray et al., 2007b; Van Engen et al., 2014; Van Engen et al., 2017), and listeners are able to reach predetermined performance levels at more difficult signal-to-noise ratios (SNRs) (Grant and Seitz, 2000; Macleod and Summerfield, 1987; Sumby and Pollack, 1954). In addition, AV speech has been shown to speed performance in shadowing tasks (where listeners repeat a spoken passage in real time) (Reisberg et al., 1987) and improve comprehension of short stories (Arnold and Hill, 2001). Furthermore, given that acoustically challenging speech is also associated with increased cognitive challenge (Peelle, 2018), visual information may reduce the cognitive demand associated with speech-in-noise processing (Gosselin and Gagné, 2011) (but see Brown and Strand, 2019). Below we review two key mechanisms by which audiovisual speech benefits listeners relative to unimodal speech.
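
To make the SNR manipulations mentioned above concrete: SNR is conventionally expressed in decibels as 10 × log10 of the ratio of speech power to noise power, so a more difficult (lower) SNR means relatively more noise energy. The sketch below illustrates one common way of scaling a noise signal to mix with speech at a target SNR. It is a minimal illustration only; the NumPy-based implementation, variable names, and synthetic signals are our own assumptions rather than details of any of the studies cited above.

```python
import numpy as np

def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so that the speech + noise mixture has the requested SNR,
    where SNR_dB = 10 * log10(P_speech / P_noise) and P is mean power."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Power the noise must have to achieve the target SNR.
    desired_p_noise = p_speech / (10 ** (target_snr_db / 10))
    scaled_noise = noise * np.sqrt(desired_p_noise / p_noise)
    return speech + scaled_noise

# Example with synthetic signals (1 s at 16 kHz): a tone standing in for speech,
# white noise as the masker, mixed at -5 dB SNR (noise power exceeds speech power).
fs = 16000
t = np.arange(fs) / fs
speech = 0.1 * np.sin(2 * np.pi * 440 * t)
noise = np.random.randn(fs)
mixture = mix_at_snr(speech, noise, target_snr_db=-5.0)
```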

One way to quantify the effects of such facilitation is in the context of a lexical competition framework. Lexical competition frameworks start from the assumption that listeners must select the appropriate target word from among a set of similar-sounding words (phonological neighbors), which act as competitors. Words with a relatively high number of phonological neighbors (such as “cat”) thus have higher levels of lexical competition and may rely more on cognitive processes of inhibition or selection than words with relatively few phonological neighbors (such as “orange”).
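
To make the notion of a phonological neighbor concrete, the sketch below counts neighbors under the widely used one-phoneme rule: two words are neighbors if a single phoneme substitution, addition, or deletion turns one transcription into the other. This is a minimal illustration with a toy lexicon; the function names and ARPAbet-style transcriptions are our own choices, not materials from the studies discussed here.

```python
# Minimal sketch: counting phonological neighbors in a toy lexicon.
# Assumption: words are represented as tuples of phoneme symbols, and two words
# are neighbors if they differ by exactly one phoneme substitution, addition,
# or deletion (phoneme-level edit distance of 1).

def is_neighbor(a, b):
    """Return True if phoneme sequences a and b differ by exactly one edit."""
    if abs(len(a) - len(b)) > 1 or a == b:
        return False
    if len(a) == len(b):  # substitution
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    for i in range(len(long_)):  # addition/deletion
        if short == long_[:i] + long_[i + 1:]:
            return True
    return False

def neighborhood_density(word, lexicon):
    """Number of lexicon entries that are phonological neighbors of `word`."""
    return sum(is_neighbor(lexicon[word], phones)
               for other, phones in lexicon.items() if other != word)

# Toy lexicon with ARPAbet-style transcriptions (illustrative only; real
# neighborhood counts come from a full lexical database).
LEXICON = {
    "cat": ("K", "AE", "T"),
    "bat": ("B", "AE", "T"),
    "cap": ("K", "AE", "P"),
    "cut": ("K", "AH", "T"),
    "at":  ("AE", "T"),
    "orange": ("AO", "R", "AH", "N", "JH"),
}

print(neighborhood_density("cat", LEXICON))     # 4: bat, cap, cut, at
print(neighborhood_density("orange", LEXICON))  # 0
```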
