Sensory systems such as hearing probably evolved to allow organisms to determine the objects in their environment, and thereby to navigate, feed, mate, and communicate. Objects vibrate and, as a result, produce sound. An auditory system provides the neural architecture that allows the organism to process sound, and thereby to learn something about these objects, or sound sources. In most situations many sound sources are present at the same time, but the sound from these various sources arrives at the organism as one complex sound field, not as separate, individual sounds. The challenge for the auditory system is to process this complex sound field so that the individual sound sources can be determined. That is, the auditory system is presented with a complex auditory scene, and the auditory images in this scene are the sound sources (Bregman, 1990). The auditory system must be capable of performing auditory scene analysis if it is to determine the sources of sounds.
Auditory scene analysis is not undertaken in the auditory periphery (the cochlea and auditory nerve). The auditory periphery provides a spectral-temporal neural code for the acoustic information contained within the auditory scene. That is, the auditory nerve relays to the auditory brainstem the coding performed by the cochlea. This neural code provides the central nervous system with information about the spectral components that make up the auditory scene in terms of their frequencies, levels, and timing. The central auditory nervous system must then analyze this peripheral neural code so that the individual sound sources that generated the scene can be determined.
What information might be preserved in the peripheral code that the central auditory system can use for auditory scene analysis? Several cues have been suggested, including frequency separation, temporal separation, spatial separation, level differences in spectral profiles, asynchronous onsets and offsets, harmonic structure, and temporal modulation (Yost, 1992). These are properties of the sound generated by sound sources that may be preserved in the peripheral code. As an example, if two sound sources each vibrate at a different frequency, the two frequencies will be mixed into a single auditory scene arriving at the listener. The auditory system could nonetheless ascertain that two frequencies are present, indicating two sound sources. We know that this is possible because, within certain boundary conditions, the auditory periphery codes the frequency content of any complex sound.
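To make this example concrete, the following sketch (a minimal illustration in Python with NumPy; the sample rate, frequencies, and duration are arbitrary choices, not values from the literature) mixes two sinusoidal "sources" into one waveform, as the sound field at the ear would, and then recovers both component frequencies from the magnitude spectrum, loosely analogous to the frequency analysis performed by the cochlea.

```python
import numpy as np

fs = 16000                      # sample rate in Hz (illustrative value)
t = np.arange(0, 0.5, 1 / fs)   # 500 ms of signal

# Two "sound sources", each vibrating at its own frequency.
source_a = np.sin(2 * np.pi * 440 * t)
source_b = np.sin(2 * np.pi * 620 * t)

# At the listener's ear the two sources arrive as one complex sound field.
scene = source_a + source_b

# A magnitude spectrum of the mixture still shows two distinct peaks,
# so the frequency of each source survives the mixing.
spectrum = np.abs(np.fft.rfft(scene))
freqs = np.fft.rfftfreq(len(scene), 1 / fs)
two_largest = freqs[np.argsort(spectrum)[-2:]]
print(sorted(two_largest))      # -> [440.0, 620.0]
```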
The example of two sound sources each producing a different frequency is the basis of a set of experiments designed to investigate auditory scene analysis. Imagine that the sound coming from each of the two sources is pulsed on and off so that the sound from one source (one frequency) is on when the sound from the other source (a different frequency) is off. The perception of this stimulus condition could be that of a single sound with an alternating pitch. However, since each sound could be from a different source, the perception could also be that of two sound sources, each producing a pulsing tone of a particular frequency. Either perception is possible, depending on the exact frequencies and timing used in the experiment. When the perception is one of two different sound sources, the percept is often described as if two perceptual streams were running side by side, each stream representing a sound source. The stimulus conditions that lead to this form of stream segregation are likely to be those that promote the segregation of the sound of one source from that of another (Bregman, 1990). Many of the parameters listed above have been studied using this auditory streaming paradigm. In general, stimulus parameters associated with frequency are more likely than other parameters to support stream segregation (Kubovy, 1987), but most of the parameters listed can support auditory stream segregation under certain conditions.
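A minimal sketch of this kind of alternating-tone streaming stimulus follows (the sample rate, tone duration, ramp length, and frequencies are illustrative choices, not values from any particular study). With a small frequency separation the sequence tends to be heard as one stream with an alternating pitch; with a large separation, as two pulsing streams.

```python
import numpy as np

fs = 16000                       # sample rate in Hz (illustrative value)
tone_dur = 0.1                   # 100 ms per tone burst
t = np.arange(0, tone_dur, 1 / fs)

def tone(freq):
    # A tone burst with a brief raised-cosine ramp to avoid onset clicks.
    ramp = int(0.005 * fs)
    env = np.ones(len(t))
    env[:ramp] = 0.5 * (1 - np.cos(np.pi * np.arange(ramp) / ramp))
    env[-ramp:] = env[:ramp][::-1]
    return env * np.sin(2 * np.pi * freq * t)

# Alternate tone A and tone B: A-B-A-B-...
freq_a, freq_b = 500, 508        # small separation: one stream is likely
# freq_a, freq_b = 500, 1000     # large separation: two streams are likely
sequence = np.concatenate([tone(f) for f in [freq_a, freq_b] * 10])
```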
Experiments to study auditory scene analysis, such as auditory streaming, require listeners to process sound over a large range of frequencies and over time. Since a great deal of work in auditory perception and psychoacoustics has concentrated on short-time processing in narrow frequency regions, less is known about auditory processing across wide regions of the spectrum and over longer periods of time. Therefore, a better understanding of cross-spectral and long-time processing is very important for revealing processes and mechanisms that may assist auditory scene analysis (Yost, 1992).
One of the traditional examples of cross-spectral processing that relates to auditory scene analysis is the processing of the pitch of complex sounds, such as the pitch of the missing fundamental (see pitch perception). For these spectrally complex stimuli, usually ones that have a harmonic structure, a major perceptual aspect of the stimulus is the perception of a single pitch (Moore, 1997). Conditions such as the pitch of the missing fundamental suggest that the auditory system uses a wide range of frequencies to determine the complex pitch and that this complex pitch may be the defining acoustic characteristic of a harmonic sound source, such as the musical note produced by a piano key. A sound consisting of the frequencies 300, 400, 500, 600, and 700 Hz would most likely have a perceived pitch of 100 Hz (the missing fundamental in the sound's spectrum). The 100-Hz pitch may help in determining the existence of a sound source with this 100-Hz harmonic structure.
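As a toy formalization of this example (one simple way to state the relationship; actual models of pitch perception are considerably more elaborate), the missing fundamental of a set of harmonically related components is the greatest common divisor of their frequencies, the highest frequency of which every component is an integer multiple:

```python
from functools import reduce
from math import gcd

components = [300, 400, 500, 600, 700]   # Hz, from the example above

# The missing fundamental is the greatest common divisor of the
# harmonic frequencies.
f0 = reduce(gcd, components)
print(f0)                                 # -> 100 (Hz)
```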
The example of the missing fundamental pitch can be generalized to describe acoustic situations in which the auditory system segregates sounds into more than one source. A naturally occurring sound source is unlikely to have all but one of its frequency components harmonically related. Thus, in the example cited above, it is unlikely that a single sound source would have a spectrum with frequency components at 300, 400, 550, 600, and 700 Hz (the 550-Hz component is the inharmonic component that replaced the 500-Hz component). In fact, when one of the harmonics of a series of harmonically related frequency components is “mistuned” from the harmonic relationship (550 Hz in the example), listeners are likely to perceive two pitches (as if there were two sound sources), one associated with the 100-Hz harmonic relationship and the other with the frequency of the mistuned harmonic (Hartmann, McAdams, and Smith, 1990). That is, the 550-Hz mistuned harmonic is perceptually segregated as a separate pitch from the 100-Hz complex pitch associated with the rest of the harmonically related components. Such dual-pitch perception suggests that there are two potential sound sources. In this case, the auditory system appears to be using a wide frequency range (300–700 Hz) to process these two pitches, and hence perceives two potential sound sources.
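A simple sketch of how a mistuned component might be flagged from the component frequencies alone follows (the 3% tolerance is an illustrative figure, not a measured perceptual threshold, and this is only a caricature of the perceptual process):

```python
components = [300, 400, 550, 600, 700]   # Hz; 550 replaces the 500-Hz harmonic
f0 = 100                                  # Hz, candidate fundamental

# A component counts as "mistuned" if it lies too far from every integer
# multiple of the candidate fundamental (tolerance is illustrative only;
# listeners can detect mistunings of just a few percent).
tolerance = 0.03                          # 3% of the component frequency
mistuned = [f for f in components
            if abs(f - f0 * round(f / f0)) > tolerance * f]
print(mistuned)                           # -> [550]
```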
The complex pitch example can also be used to address the role of stimulus onset as another potential cue for auditory scene analysis. Two sound sources may each produce a harmonically related spectrum, such that one sound source has frequency components at 150, 300, 450, 600, and 750 Hz (harmonics of 150 Hz) and the other at 233, 466, 699, and 932 Hz (harmonics of 233 Hz). When presented in isolation, these two sounds will produce pitches of 150 and 233 Hz. However, if the two stimuli are added together so that they come on and go off together, it is unlikely that two pitches will be perceived. The perception of two pitches can be recovered if one of the complex sounds comes on slightly before the other, even though both sounds remain on together thereafter. Thus, if the sound from one source comes on (or in some cases goes off) before another sound, the asynchronous onsets (or offsets) promote sound source segregation, aiding auditory scene analysis (Darwin, 1981).
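The sketch below constructs the two harmonic complexes from this example and adds them with either synchronous or asynchronous onsets (the 50-ms onset delay, sample rate, and duration are illustrative choices; the code builds the stimuli only and does not model the resulting percepts):

```python
import numpy as np

fs = 16000                           # sample rate in Hz (illustrative value)
dur = 0.4                            # 400 ms
t = np.arange(0, dur, 1 / fs)

def complex_tone(f0, harmonics):
    # Equal-amplitude sum of the given harmonics of f0.
    return sum(np.sin(2 * np.pi * f0 * k * t) for k in harmonics)

a = complex_tone(150, range(1, 6))   # 150, 300, 450, 600, 750 Hz
b = complex_tone(233, range(1, 5))   # 233, 466, 699, 932 Hz

# Synchronous onsets: the mixture tends to fuse, and hearing two pitches
# is unlikely.
fused = a + b

# Asynchronous onsets: delay complex b by ~50 ms. The onset asynchrony
# promotes hearing the two complexes as separate sources.
delay = int(0.05 * fs)
a_padded = np.concatenate([a, np.zeros(delay)])
b_delayed = np.concatenate([np.zeros(delay), b])
segregated = a_padded + b_delayed
```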
If two sounds have different temporal patterns of modulation, they may be perceptually segregated on the basis of that modulation (this is especially true for amplitude modulation; Moore and Alcantara, 1996). Detection of a tonal signal in a wideband-noise background improves if the noise is amplitude modulated, suggesting that the modulation helps segregate the tone from the noise background (Hall, Haggard, and Fernandes, 1984).
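A sketch of this kind of stimulus follows, assuming a weak tone in a wideband Gaussian masker (the tone level, modulation rate, and other values are illustrative; the code constructs the stimuli only and does not model detection):

```python
import numpy as np

fs = 16000                          # sample rate in Hz (illustrative value)
t = np.arange(0, 0.5, 1 / fs)
rng = np.random.default_rng(0)

signal = 0.1 * np.sin(2 * np.pi * 1000 * t)   # weak 1-kHz target tone
noise = rng.standard_normal(len(t))           # wideband Gaussian masker

# Unmodulated masker: the tone is comparatively hard to detect.
steady = signal + noise

# Amplitude-modulate the masker at a slow rate (10 Hz here). The coherent
# modulation gives the noise its own temporal pattern, which helps the
# auditory system segregate the steady tone from the fluctuating masker.
mod = 0.5 * (1 + np.sin(2 * np.pi * 10 * t))
modulated = signal + mod * noise
```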
Sounds from spatially separated sources may also help in determining the auditory scene. The ability of the auditory system to use sound to locate objects in the real world appears to help segregate one sound source from another (Yost, 1997). However, the ability to identify the instruments (sound sources) in an orchestral piece played over a single loudspeaker suggests that having actual sources at different locations in our immediate environment is not required for auditory scene analysis.
Thus, in order to use sound to determine something about objects in our world, the central auditory system must parse the neural code for sound into subsets of neural information, where each subset may be the neural counterpart of a sound source. Several parameters of sound sources are preserved in the neural code and may form the basis of auditory scene analysis. Determining the sources of sound requires processing across a wide range of frequencies and over time, and it is required for organisms to cope successfully with their environments.