Samuel Smith1,2, Christian J. Sumner1, Thom Baguley1, Paula C. Stacey1
1NTU Psychology, Nottingham Trent University, Nottingham, U.K.
2Hearing Sciences, University of Nottingham, Nottingham, U.K.

The comprehension of speech, whether with normal or aided hearing, is often supplemented by watching a talker’s facial movements. How are auditory and visual cues combined to provide a benefit to speech perception? Does audiovisual performance depend only on the unimodal information, or does the integration process itself vary? Does integration vary with aspects of the stimulus, or do individuals “integrate differently”? We developed a model based on signal detection theory (SDT) that allows us to test these possibilities quantitatively. Signal detection theory posits that performance is limited by noise in the sensory signal and by noise arising internally in the nervous system. The benefit of multiple cues depends on whether internal noise arises during unimodal processing or later, after multisensory integration (Micheyl and Oxenham, 2012; J Acoust Soc Am. 131:3970). We propose a model whereby the proportions of unimodal (“early”) and post-integration (“late”) noise can be estimated from unimodal and multisensory performance. This allows us to test whether differences in multisensory performance across experimental variables are best explained by variations in unisensory performance, or instead reflect a varying integration process. In previously published data (Stacey et al. 2016; Hear Res. 336:17) we found that, overall, SDT provided a good account of audiovisual speech perception. However, previous models were restricted to a single source of internal noise, and neither a purely unisensory-noise nor a purely multisensory-noise model predicted the data quantitatively. Our new model, which combines unisensory and multisensory internal noise, fits these data precisely. Furthermore, we find that the integration process shifts towards later, multisensory internal noise when the temporal fine structure of speech is removed by tone-vocoding. Thus, auditory-visual speech perception can be quantified as an optimal integration of information with multiple sources of internal noise, and the integration itself varies depending on the unisensory signals.
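To make the early/late distinction concrete, the sketch below shows the two standard SDT limiting-case predictions for combined audiovisual sensitivity: a Pythagorean sum of unimodal d′ values when internal noise is entirely early (independent in each unimodal channel, with optimal combination), and linear summation when a single noise source follows integration. The `late_noise_index` helper is a hypothetical illustration of how observed performance falling between these two limits could index the early/late noise split; it is not the estimator developed in the abstract, whose parameterization is not given here.

```python
import numpy as np

def dprime_av_early(d_a, d_v):
    """AV prediction when internal noise is entirely early: independent noise
    in each unimodal channel, cues combined optimally (Pythagorean sum)."""
    return np.sqrt(d_a**2 + d_v**2)

def dprime_av_late(d_a, d_v):
    """AV prediction when internal noise is entirely late: cues summed
    linearly before a single shared noise source (linear summation)."""
    return d_a + d_v

def late_noise_index(d_a, d_v, d_av):
    """Where observed AV sensitivity falls between the early-noise (0) and
    late-noise (1) limits. A hypothetical summary, not the model's estimator."""
    lo = dprime_av_early(d_a, d_v)
    hi = dprime_av_late(d_a, d_v)
    return (d_av - lo) / (hi - lo)

# Example: equal unimodal sensitivities of d' = 1.0 give limiting AV
# predictions of sqrt(2) ~= 1.41 (all-early noise) and 2.0 (all-late noise).
print(late_noise_index(1.0, 1.0, 1.7))  # ~0.49: roughly midway between limits
```

Under this illustrative reading, a shift of observed audiovisual d′ towards the linear-summation limit (as reported here for tone-vocoded speech) corresponds to a larger share of late, post-integration internal noise.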