Listening to spoken language usually seems effortless, but the processes involved are complex. A continuous acoustic signal must be translated into meaning so that the listener can understand the speaker's intent. The mapping of sound to meaning proceeds via the lexicon—our store of known words. Any utterance we hear may be novel to us, but the words it contains are familiar, and to understand the utterance we must therefore identify the words of which it is composed.
We know a great many words; an educated adult's vocabulary has been estimated at around 150,000 words. Entries in the mental lexicon may include, besides stand-alone words, grammatical morphemes such as prefixes and suffixes and multiword phrases such as idioms and clichés. Languages also differ widely in how they construct word forms, and this too will affect what is stored in the lexicon. But in any language, listening involves mapping the acoustic signal onto stored meanings.
The continuity of utterances means that boundaries between individual words in speech are not overtly marked. Speakers do not pause between words but run them into one another. The problem of segmenting a speech signal into words is compounded by the fact that words themselves are not highly distinctive. All the words we know are constructed of just a handful of different sounds; on average, the phonetic repertoire of a language contains 30–40 contrasting sounds (Maddieson, 1984). As a consequence, words inevitably resemble other words, and may have other words embedded within them (thus strange contains stray, strain, train, rain, and range). Word recognition therefore involves identifying the correct form among a large number of similar forms, in a stream in which they abut one another without a break (strange act contains jack and jacked).
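To make the scale of this embedding problem concrete, here is a minimal sketch (the toy lexicon and the ARPAbet-like transcriptions are simplifications introduced purely for illustration, not materials from any study): with word forms represented as rough phoneme sequences, every lexical entry that matches a contiguous stretch of the unsegmented input is a potential candidate.

```python
# A minimal sketch, for illustration only: a toy lexicon of rough phoneme
# sequences and an exhaustive scan for every entry embedded anywhere in an
# unsegmented input.

TOY_LEXICON = {
    "strange": ["S", "T", "R", "EY", "N", "JH"],
    "stray":   ["S", "T", "R", "EY"],
    "strain":  ["S", "T", "R", "EY", "N"],
    "train":   ["T", "R", "EY", "N"],
    "rain":    ["R", "EY", "N"],
    "range":   ["R", "EY", "N", "JH"],
    "act":     ["AE", "K", "T"],
    "jack":    ["JH", "AE", "K"],
    "jacked":  ["JH", "AE", "K", "T"],
}

def embedded_words(stream, lexicon):
    """Return (start_position, word) for every lexicon entry embedded in the stream."""
    hits = []
    for word, phones in lexicon.items():
        for start in range(len(stream) - len(phones) + 1):
            if stream[start:start + len(phones)] == phones:
                hits.append((start, word))
    return sorted(hits)

# "strange act" spoken without a break, roughly transcribed:
stream = ["S", "T", "R", "EY", "N", "JH", "AE", "K", "T"]
print(embedded_words(stream, TOY_LEXICON))
# Besides the intended 'strange' and 'act', the spurious candidates 'stray',
# 'strain', 'train', 'rain', 'range', 'jack', and 'jacked' all match.
```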
The only segmentation that is logically required is to find the words in speech. Whether listening also involves some intermediate level of coding is an issue of contention among speech researchers. Do listeners extract whole syllables from the speech stream and use this syllabic representation to contact the lexicon? Do they extract phonemes from the input, so that listening involves an intermediate stage in which heard utterances are represented as strings of phonemes? Or does listening involve matching speech input against holistic stored forms? The available evidence does not yet allow us to distinguish among these positions (and other variants).
There is agreement, however, on other aspects of the spoken-word recognition process. First, information in the signal is evaluated continuously and the results are passed to the lexicon. Coarticulatory effects that cause cues to adjacent phonemes to overlap in time are efficiently used. Thus robe, rope, wrote, road, and rogue all begin with ro-, but the vowel will in each case include anticipatory information about the place of articulation of the following consonant, and listeners can exploit this (e.g., to narrow the field of candidates to only rope and robe, eliminating rogue, road, and wrote).
Evidence for continuous evaluation comes from experiments in which listeners perform lexical decision (judging whether a spoken string is indeed a real word) on speech that has been cross-spliced so that the coarticulatory effects are no longer reliable. Thus, when listeners hear troot they should respond "no," since troot is not a word. If troot is cross-spliced so that a final -t is appended to a troo- taken from either trook or troop (which carry coarticulatory cues to an upcoming velar or bilabial consonant, respectively), then responses are slower than when the cues match. This shows that listeners are sensitive to the coarticulatory mismatch and must have processed the consonant place cues in the vowel. However, responses are slower still when the mismatching troo- comes from troop than when it comes from trook. This suggests that the processing of consonant cues in the vowel has caused activation of the compatible real word troop (Marslen-Wilson and Warren, 1994; McQueen, Norris, and Cutler, 1999).
Second, multiple candidate words are simultaneously activated during the listening process, including words that are merely accidentally present in a speech signal. Thus, hearing strange-acting may activate stray, train, range, jack, and so on, as well as the intended words.
Evidence for multiple activation comes from cross-modal priming experiments in which a word-initial fragment facilitates recognition of the different words it might become. Thus, lexical decision responses for visually presented "captain" or "captive" are both facilitated when listeners have just heard the fragment capt- (compared with some other control fragment). Moreover, both are facilitated even if only one of them matches the context (Zwitserlood, 1989).
Third, there is active competition between alternative candidate words. The more active a candidate word is, the more it may suppress its rivals, and the more competitors a word has, the more suppression it may undergo. Evidence for competition between simultaneously activated candidate words comes from experiments in which listeners must spot any real words occurring in spoken nonsense strings. If the rest of the string partially activates a competitor word, then spotting the real embedded word is slowed. For instance, listeners spot mess less rapidly in domess (which partially activates domestic, a competitor for the same portion of the signal that supports mess) than in nemess (which supports no other word; McQueen, Norris, and Cutler, 1994; see also Norris, McQueen, and Cutler, 1995; Vroomen and de Gelder, 1995; Soto-Faraco, Sebastián-Gallés, and Cutler, 2001).
Because activated and competing words need not be aligned with one another, the competition process offers a potential means of segmenting the utterance. Thus, although recognition of strange-acting may involve competition from stray, range, jack, and so on, this will eventually yield to joint inhibition from the two intended words, which receive greater support from the signal.
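A full interactive-activation simulation is beyond a short sketch, but the end state that competition is meant to reach can be illustrated as follows (the candidate spans continue the hypothetical transcription of strange act used above; this is not a simulation of any published model): among sets of mutually compatible, non-overlapping candidates, the intended words jointly account for more of the signal than any combination of spurious ones.

```python
# Not any published model: a sketch of the outcome competition should converge
# on. Candidate spans index into the 9-phoneme transcription of "strange act";
# the best parse is the set of non-overlapping candidates that together
# account for the most of the signal.

CANDIDATES = {
    "strange": (0, 6), "stray": (0, 4), "strain": (0, 5), "train": (1, 5),
    "rain": (2, 5), "range": (2, 6), "jack": (5, 8), "act": (6, 9),
}
STREAM_LENGTH = 9

def best_parse(position=0):
    """Return (phonemes_accounted_for, words) for the best parse from `position` on."""
    if position >= STREAM_LENGTH:
        return 0, []
    # Option 1: leave this phoneme unaccounted for and move on.
    best = best_parse(position + 1)
    # Option 2: start one of the candidate words here.
    for word, (start, end) in CANDIDATES.items():
        if start == position:
            covered, words = best_parse(end)
            if covered + (end - start) > best[0]:
                best = (covered + (end - start), [word] + words)
    return best

print(best_parse())   # (9, ['strange', 'act'])
```

Spurious candidates such as range or jack can enter only parses that leave part of the signal unaccounted for, which is the sense in which the intended words receive greater support from the signal.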
Adult listeners can also use information that their linguistic experience suggests is correlated with the presence of a word boundary. For instance, in English the phoneme sequence [mg] never occurs word-internally, so the occurrence of this sequence must imply a word boundary (some go, tame goose); sequences such as [pf] or [ml] or [zw] never occur syllable-internally, so these sequences imply at least a syllable boundary (cupful, seemly, beeswax). Listeners more rapidly spot embedded words whose edges are aligned with such a boundary-correlated sequence (e.g., rock is spotted more easily in foomrock than in foogrock; McQueen, 1998). Also, words that begin with a common phoneme sequence are easier to extract from a preceding context than words that begin with an infrequent sequence (e.g., in golnook versus golnag, it will be easier to spot nag, which shares its beginning with natural, navigate, narrow, nap, and many other words; van der Lugt, 2001; see also Cairns et al., 1997).
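The phonotactic cue lends itself to a toy illustration (the list of boundary-signaling pairs below is a tiny, made-up stand-in that merely follows the examples in the text): wherever a segment pair that never occurs word-internally spans two adjacent segments, a word boundary can be hypothesized between them.

```python
# A toy boundary detector; the pair list is a made-up, far-from-complete
# stand-in that follows the examples in the text.

BOUNDARY_PAIRS = {("m", "g"), ("m", "r")}   # assumed never to occur word-internally

def likely_boundaries(segments):
    """Indices i where a word boundary is likely between segments[i] and segments[i+1]."""
    return [i for i in range(len(segments) - 1)
            if (segments[i], segments[i + 1]) in BOUNDARY_PAIRS]

# 'foomrock' versus 'foogrock', crudely segmented: only the first contains a
# junction ([m][r]) that signals a boundary right before the embedded 'rock'.
print(likely_boundaries(list("fuwmrak")))   # [3] -> boundary hypothesized before 'r'
print(likely_boundaries(list("fuwgrak")))   # []  -> no phonotactic cue to a boundary
```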
These latter sources of information are, of course, necessarily language-specific. It is a characteristic of a particular vocabulary that more words begin with the na- of nag than with the noo- of nook; likewise, it is vocabulary-specific that sequences such as [pf] or [zw] or [ml] cannot occur within a syllable. Each of these three sequences is in fact legitimately syllable-internal in some language ([pf], for instance, in German: Pferd, Kopf).
Other language-specific information is also used in segmentation, notably rhythmic structure. In languages such as English and Dutch, most words begin with stressed syllables, and listeners find it easier to segment speech at the onset of stressed syllables (Cutler and Norris, 1988; Vroomen, van Zon, and de Gelder, 1996). This can be clearly seen in segmentation errors, as when a pop song line She's a must to avoid is widely misperceived as She's a muscular boy—the strong syllable void is taken to be the onset of a new word, while the weak syllables to and a- are taken to be noninitial (Cutler and Butterfield, 1992).
The stress rhythm of English and Dutch is not universal; many other languages have different rhythmic structures. Indeed, syllabically based rhythm in French is accompanied by syllabic segmentation in French listening experiments (Mehler et al., 1981; Cutler et al., 1986; Kolinsky, Morais, and Cluytens, 1995), while moraic rhythm in Japanese likewise accompanies moraic segmentation by Japanese listeners (Otake et al., 1993; Cutler and Otake, 1994).
Thus, although the type of rhythm is language-specific, its use in speech segmentation seems universal. Other constraints on segmentation appear to be universal as well, serving, for example, to limit activation of spurious embedded competitors. It is harder to spot a word if the residual context contains only consonants (thus, apple is harder to find in fapple than in vuffapple; Norris et al., 1997), an effect explained as a primitive filter selecting for possible words: vuff is not a word, but it might have been one, whereas f could never be a word. This constraint would rule out many spuriously present words in speech (such as tray and ray in stray). It is not affected by what counts as a possible word in a particular language (Norris et al., 2001; Cutler, Demuth, and McQueen, 2002) and thus appears to be universal.
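The possible-word filter can likewise be sketched in a few lines (an illustrative simplification, not any study's implementation, using orthographic vowels as a stand-in for phonemic ones): an embedded candidate survives only if the residues at its edges could themselves be words, that is, contain at least one vowel.

```python
# A minimal sketch of the possible-word filter described above, using
# orthographic vowels as a stand-in for phonemic ones.

VOWELS = set("aeiou")

def possible_residue(residue: str) -> bool:
    """An empty residue, or one containing a vowel, could itself be a word."""
    return residue == "" or any(ch in VOWELS for ch in residue)

def survives_filter(context: str, word: str) -> bool:
    """Could `word`, embedded in `context`, survive the possible-word filter?"""
    start = context.find(word)
    if start == -1:
        return False
    before, after = context[:start], context[start + len(word):]
    return possible_residue(before) and possible_residue(after)

print(survives_filter("vuffapple", "apple"))   # True:  'vuff' might have been a word
print(survives_filter("fapple", "apple"))      # False: 'f' could never be a word
print(survives_filter("stray", "ray"))         # False: residue 'st' rules 'ray' out
```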
The ability to extract words from continuous speech starts early in life, as shown by experiments in which infants listen longer to passages containing words that they had previously heard in isolation than to wholly new passages (Jusczyk and Aslin, 1995); none of the passages can be comprehended by these young listeners, but they can recognize familiar strings embedded in the fluent speech. One-year-olds also detect familiar strings less easily if they are embedded in a context without a vowel (e.g., rest is found less easily in crest than in caressed; Johnson et al., 2003); that is, they are already sensitive to the apparently universal constraint on possible words.
Finally, the efficiency with which listeners exploit language-specific structure in recognizing speech does not help them segment a second language acquired later in life. Segmentation procedures suited to the native language can be inappropriately applied to non-native input (Cutler et al., 1986; Otake et al., 1993; Cutler and Otake, 1994; Weber, 2001). This is one reason why listening to a second language is paradoxically harder than, for instance, reading that same language.
See also phonology and adult aphasia.