Speech Perception

The human speech waveform is highly complex, and the ability of humans to produce and understand it is remarkable. Like other environmental sounds, speech carries extensive information about its source (i.e., the physical characteristics of the speaker). Unlike many other environmental sounds, speech also carries a wealth of abstract information (i.e., the intended message of the speaker). To develop a cohesive model of the processes involved in speech communication, one needs to integrate research from auditory psychophysics, cognitive psychology, linguistics, neurophysiology, cognitive development, and the philosophy of mind. Our work on speech perception embraces this interdisciplinary approach. Some of the questions that concern us are: How do humans quickly (and seemingly effortlessly) match the variable and complex waveform of speech to words in their vast mental lexicon? How do infants and second-language learners come to perceive speech sounds in a linguistically relevant manner? How does the auditory system constrain the representation of speech sounds?

Our minds are filled with words, many thousands of them in fact. The processes responsible for finding those words unfold in mere thousandths of a second. While the selection process feels effortless, the spoken language system must overcome many obstacles. One set of important questions concerns the form in which speech information is stored and how it activates entries in the lexicon. It is a stunning feat that we can find a handful of contextually appropriate words, in an array of thousands, as we construct a sentence; after all, some words sound alike and so are easily confused with one another (three words in this sentence begin with ‘con’, for example), and others are infrequently used and so seemingly stored out of the way. By priming words with speech-like stimuli constructed to isolate specific factors (such as amplitude modulation, used for auditory grouping, or sinewave synthesis, used to track and capture perceptually salient frequencies in speech), we can determine which aspects of the signal activate the words stored in our lexicon. Together with other members of the Parmly Hearing Institute, Dr. J.D. Trout has adapted routines for generating and presenting these speech-like stimuli for the purpose of probing the organization of the lexicon.
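To make the two stimulus manipulations mentioned above more concrete, here is a minimal sketch in Python. It is not the routine used at the Institute; it only illustrates the general idea. Sinewave synthesis is sketched as replacing each formant with a single sinusoid that follows that formant's frequency over time, and amplitude modulation as imposing a common periodic envelope on the result. All function names, the formant values, the frame duration, the sample rate, and the modulation rate are assumptions chosen for illustration.

```python
import math

def synthesize_sinewave_replica(formant_tracks, frame_duration=0.01, sample_rate=16000):
    """Sum one sinusoid per formant track (illustrative sketch only).

    formant_tracks: list of tracks, one per formant; each track holds one
    (frequency_hz, amplitude) pair per analysis frame.
    """
    n_frames = len(formant_tracks[0])
    samples_per_frame = int(round(frame_duration * sample_rate))
    # Accumulate phase per track so frequency changes do not cause jumps.
    phases = [0.0] * len(formant_tracks)
    output = []
    for frame in range(n_frames):
        for _ in range(samples_per_frame):
            sample = 0.0
            for k, track in enumerate(formant_tracks):
                freq, amp = track[frame]
                phases[k] += 2.0 * math.pi * freq / sample_rate
                sample += amp * math.sin(phases[k])
            output.append(sample)
    return output

def apply_amplitude_modulation(signal, mod_rate_hz, depth=1.0, sample_rate=16000):
    """Impose a sinusoidal envelope; the gain stays between 0 and 1."""
    return [
        s * (1.0 + depth * math.sin(2.0 * math.pi * mod_rate_hz * n / sample_rate))
        / (1.0 + depth)
        for n, s in enumerate(signal)
    ]

# Hypothetical formant tracks (not measured data): 20 frames of a vowel-like glide.
N_FRAMES = 20
tracks = [
    [(300.0 + 400.0 * i / (N_FRAMES - 1), 0.5) for i in range(N_FRAMES)],    # F1 rises
    [(2200.0 - 1100.0 * i / (N_FRAMES - 1), 0.3) for i in range(N_FRAMES)],  # F2 falls
    [(2500.0, 0.2) for _ in range(N_FRAMES)],                                # F3 steady
]

replica = synthesize_sinewave_replica(tracks)
modulated = apply_amplitude_modulation(replica, mod_rate_hz=100.0)
```

In a priming study of the kind described above, signals such as `replica` or `modulated` would be written to audio files and presented before target words; that presentation step is omitted here.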

Another theoretically interesting feature of speech perception is the human listener’s sensitivity to the phonetic form of spoken words. The perception of speech requires that listeners extract a stable phonetic percept from a variable speech signal. The same word can be uttered at different rates, by different voices, and in different dialects. One source of variation in the signal derives from talker-specific vocal details. In fact, no two utterances of the same word are acoustically identical, even when repeated by the same talker; they may differ in rate and specific vocal quality, for example. Sources of within-talker variability include breathy or creaky voice quality, shifting formants, fundamental frequency of phonation, changing speaking rate, variable degrees of articulatory undershoot, and imperfect repetition across tokens of the same articulatory gesture. Because there is now evidence that listeners encode specific vocal characteristics of individual talkers, repetition priming with a particular talker’s distinct utterance of a word allows us to examine the fineness of grain in the phonetic representation to which a listener is sensitive. Dr. Trout is currently examining these questions about the lexicon’s input and organization. He has published work on auditory-visual influences on phonemic restoration, and on the perceptual impact of fundamental frequency declination.

The task facing an adult listening to speech in their native language is thus rather complex. Now consider the daunting task facing infants learning their first language or adults learning a second language. Some of the variance in the acoustic input that they receive is directly relevant to the intended message of the speaker. Other variance in the input, however, is the result of extra-linguistic influences such as the particular structure of the speaker’s vocal tract. Thus, the task for the language learner is to discriminate some of the acoustic variance and to treat the remaining, potentially discriminable, variance as functionally equivalent. In other words, the language learner must create auditory categories that capture the linguistically relevant distinctions of the particular language they are attempting to learn. Dr. Andrew Lotto and his students have developed a paradigm for teaching categories of non-speech sounds to adults. These auditory categories have some of the complexity of speech. By manipulating various parameters of the training sets, Dr. Lotto and his students hope to gain insight into the processes that lead to the learning of linguistic categories (such as phonemes). These data will have practical implications for training second-language learners.

In addition to the important role played by experience and categorical processes, the operating characteristics of our perceptual systems must constrain our representations. Over the last several years, Dr. Lotto has conducted extensive empirical work on context effects in speech perception. He has demonstrated that these context effects can also be obtained in humans with non-speech sounds that share characteristics with speech sounds. These results suggest that the auditory system constrains the representation of speech sounds in meaningful ways, and that our current communication system (speech) and the sound inventories of the world’s languages take advantage of the operating characteristics of our auditory system.