Live Sound University Article Thu, December 04, 2008

Native language and speech intelligibility problems

by Jeffrey A. Rocha

Summary

  • A thorough look at issues of speech intelligibility in emergency systems as well as the impact of native language differences.

In dense fog in March 1977, a pilot accepted instructions from an air traffic controller while taxiing for takeoff in Tenerife. The Dutch pilot maneuvered his aircraft into position and, believing he had the go-ahead, began his takeoff roll, striking another aircraft that was still taxiing on the runway.

Nearly 600 people perished in one of the worst aircraft accidents in history. The cause of the collision: a lack of speech intelligibility. The cockpit environment was noisy and the situation stressful, but the conditions were well within acceptable limits for proper communications between pilots and controllers.

In this case, however, the Dutch pilot misunderstood the English instructions that he received from the Spanish air traffic controller. English - a second language to both - was their only common language.

This is merely one example of the ramifications of poor speech intelligibility in our daily lives. The absolute need to comprehend instructions in life-threatening situations occurs more frequently than one might expect.

Whenever we walk into a place of public meeting, the presumption is that in the event of emergency, we will receive clear instructions. Our ability to understand these instructions directly impacts our ability to survive.

The panic and noise associated with emergencies further degrade intelligibility. Add native language differentials to this scenario and the problem intensifies.

It is clear that poor speech intelligibility in the fog at Tenerife was a major factor in a devastating accident. It is also apparent that this speech intelligibility obstacle was at least in part the result of a difference in native language between communicators.

But do even reasonably fluent speakers of a language experience a reduced ability to comprehend speech based merely upon a deviation from their primary language? The answer is yes.

The (Limited) Proof
Unfortunately, the native language dependence of speech intelligibility has gone relatively unnoticed, and consequently, little direct research is available regarding its existence or implications. The aviation accident in Tenerife offers some limited evidence of a problem.

And certainly our own personal struggles with non-native speech indicate the potential for reduced comprehension, particularly when actual practice of these second and third languages is infrequent.

A more scientific means of evaluation is required for an absolute determination of this phenomenon. The increased curiosity surrounding the potential for this problem has prompted some to look back at earlier research in an attempt to isolate some additional proof.

Professor Richard Campbell of Worcester Polytechnic Institute (WPI) in Massachusetts has conducted an annual experiment with his audio engineering classes, in which he attempts to determine the speech intelligibility map of a controlled environment in a lecture hall. Speech intelligibility evaluation recordings are available that place standard word lists within the phrase, “Write (word) on the line now.”

In this way the listener knows the exact position and temporal location of the desired word prior to its utterance. Professor Campbell plays this recording through a loudspeaker that he places in one corner of the room. Campbell then introduces pink noise into the room through a second loudspeaker positioned in an opposing corner.

The students are seated in rows and are distributed evenly throughout the room. After the word list has been completed, the students evaluate the percentage of words that they have heard correctly through the use of an answer key. These percentages are then plotted on a top view of the room to develop contours of intelligibility.
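The scoring step described above can be sketched in a few lines of Python. This is only an illustration, not Professor Campbell's actual procedure: the word list, the seat coordinates and the function names are all hypothetical, and a real map would interpolate contours between seats rather than simply print a grid.

```python
def percent_correct(responses, answer_key):
    """Percentage of key words written down correctly (case-insensitive)."""
    hits = sum(r.strip().lower() == k.strip().lower()
               for r, k in zip(responses, answer_key))
    return 100.0 * hits / len(answer_key)


def intelligibility_map(sheets, answer_key, rows, cols):
    """Place each listener's score at its (row, col) seat; empty seats stay None."""
    grid = [[None] * cols for _ in range(rows)]
    for (row, col), responses in sheets.items():
        grid[row][col] = percent_correct(responses, answer_key)
    return grid


# Hypothetical data: a four-word list and three occupied seats in a 2 x 2 block.
answer_key = ["ship", "win", "rock", "mash"]
sheets = {
    (0, 0): ["ship", "win", "rock", "mash"],  # assumed to sit near the speech loudspeaker
    (0, 1): ["ship", "win", "lock", "mash"],
    (1, 1): ["sip", "wing", "rock", "mass"],  # assumed to sit near the noise loudspeaker
}

grid = intelligibility_map(sheets, answer_key, rows=2, cols=2)
for row in grid:
    print(["--" if score is None else f"{score:.0f}%" for score in row])
```

Plotting these per-seat percentages over a floor plan is what produces the contours of intelligibility.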

Invariably, the resulting contours demonstrate a higher intelligibility close to the speech loudspeaker and a lower intelligibility close to the noise loudspeaker. A gradual degradation of speech intelligibility is observed over the area between the two sources. A marked deviation from this trend has been observed when there are non-native speakers of English present in the room.

Professor Campbell indicates that he has observed holes in the intelligibility map in the locations where non-native speakers were seated. In one particular map, an oblong “hole” corresponds to the seat locations of three Asian students who speak English as a second language. The intelligibility scores of those seated behind the Asian students actually rose as the distance from the source increased.

The professor also notes that he observed a similar depression when three Argentinean students were seated together. Yet he’s quick to point out that these students are fluent English speakers, “in the sense that rapid colloquial two-way conversation was easy for them,” he explained.

Indeed, more research is required before absolute conclusions can be drawn, but the repeated incidence of poor intelligibility scores, particularly when viewed in comparison with the scores of native speakers of English, is intriguing.

Professor Campbell intends to continue this research while introducing an additional twist. He feels that there is strong evidence supporting the notion that this language dependent intelligibility phenomenon is exacerbated through the introduction of panic.

This added stress seems to generate an environment in which the ability to focus on what is being said is significantly compromised. This inability to focus also seems to have greater impact upon non-native speakers, who require this concentration for full comprehension. While this area of study is still in its early stages, the impact - if confirmed - would be tremendous, particularly in emergency situations.

The Cause of Language Dependence
Due to the infancy of this exploration, hard evidence is tough to come by, and further research is required. But perhaps it is helpful to investigate why this phenomenon might exist and then arrive at a logical explanation for its cause.

The root of the problem lies in the phoneme - essentially, the smallest unit of speech. Phonemes are distinct sounds that are formed through the various combinations of letters employed in the written word.

“The phonemes of a language operate in an analogous way [to letters] and in fact alphabetic writing is derived originally from the phonemic system…” (“Homo Loquens”, by Dennis Fry, page 12)

While there are only 26 letters in the English alphabet, there are over 40 phonemes, 20 of which are derived from the five vowels utilized in English word construction. This suggests that the uses of vowel sounds in English word enunciation are quite diverse.

“Languages differ from each other in their phoneme systems, just as they differ in grammar and vocabulary…. What the phoneme system does is to dictate for any language what particular sounds must be recognized as distinct from each other and what sound differences should be disregarded.” (“Homo Loquens”, page 15)

The phoneme differences between languages result in situations where distinct phonemes in one language are interpreted as being the same by foreigners who are unused to the diverse pronunciations of a given lettering. This situation arises between any two languages.

Several examples of this phoneme disconnect are outlined in Fry’s “Homo Loquens” book:

1) “English uses the difference between /s/ and /sh/ as a way of distinguishing words, so that we can find pairs like save and shave, sin and shin, mass and mash. The phonemic system of Dutch or of Spanish or of a number of Indian languages does not include this distinction and as a consequence native speakers of these languages are quite unable to perceive the difference in English unless they have made a special effort to learn to do so.” (page 15)

2) “Another example is to be found in the final sounds of win and wing which are indistinguishable to a native Italian speaker…. his language [also] contains no pair of words differentiated solely by the presence of either /n/ or /ng/.” (page 15)

3) “The phonemic system in quite a number of Indian languages includes as many as six different t sounds which are all but indistinguishable to the English ear. Among them is a pair which differ from each other in the same way as the t sounds in the two English words tar and star, but this is not a difference that has any function in the English system and we are therefore unaware of its existence.” (pages 15-16)

4) “…the fact that Japanese speakers cannot detect the difference between /r/ and /l/ sounds and cannot make the distinction when talking English. A rather endearing example is that of the Japanese who when making an after-dinner speech in English confessed that he was rather nervous and ‘had butterfries in his stomach’.” (page 72)
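Fry's examples can be modeled with a small Python sketch that treats a listener's phoneme system as a set of merges, where two English phonemes are heard as one. The lexicon, the informal phoneme labels and the merge tables below are simplified assumptions made for illustration; they are not a real phonetic transcription or a complete description of any language.

```python
# Words as sequences of informal phoneme labels (an assumption, not IPA).
LEXICON = {
    "save": ("s", "ay", "v"),  "shave": ("sh", "ay", "v"),
    "sin":  ("s", "i", "n"),   "shin":  ("sh", "i", "n"),
    "mass": ("m", "a", "s"),   "mash":  ("m", "a", "sh"),
    "win":  ("w", "i", "n"),   "wing":  ("w", "i", "ng"),
    "rock": ("r", "o", "k"),   "lock":  ("l", "o", "k"),
}


def perceived(word, merges):
    """Phoneme sequence as heard by a listener who merges certain phonemes."""
    return tuple(merges.get(p, p) for p in LEXICON[word])


def confusable_words(merges):
    """Groups of words that become identical for this listener."""
    groups = {}
    for word in LEXICON:
        groups.setdefault(perceived(word, merges), []).append(word)
    return sorted(tuple(g) for g in groups.values() if len(g) > 1)


# Hypothetical merge tables based on the distinctions quoted above.
print(confusable_words({"sh": "s"}))   # no /s/ vs /sh/ distinction
print(confusable_words({"ng": "n"}))   # no /n/ vs /ng/ distinction
print(confusable_words({"l": "r"}))    # no /r/ vs /l/ distinction
print(confusable_words({}))            # full English system: nothing collapses
```

Each merge removes exactly the word pairs that depend on the missing distinction, which is why a listener may be fluent overall and still lose specific words.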

It is apparent that, from the perspective of the person speaking, there is no difference between the distinct phoneme sounds. As a consequence, the speaker feels that these sounds can be used interchangeably.

This same phenomenon can be seen in the early language developmental stages of children speaking their native language. During this developmental stage children cultivate the ability to distinguish and enunciate various phonemes.

Often a child uses dissimilar phonemes interchangeably without distinction. In these situations the child exclusively employs the more easily pronounced phoneme. When the child then hears the phoneme pronounced correctly, they typically insist that this is exactly how they had said it.

At this point they have not developed the ability to distinguish between two dissimilar phonemes. This process is much the same as that of learning a foreign language.

Once, while I was walking through a wooded area with a child of three, he informed me that there was a really big wock (rock) off to the right. Jokingly I responded, “Yes, that is a really big wock.” At which point young William informed me that obviously I had difficulty with the pronunciation of that word, for I had said it incorrectly.

I have also heard the story of a child who requested that an individual “keep quiet because the baby is sweeping (sleeping).” He replied, “Oh, the baby is sweeping?” She looked at him puzzled and stated emphatically, “Not sweeping: sweeping!”

This language development pattern serves in some form as reinforcement of the fact that the phoneme system is complex. It is learned gradually and mastered through everyday usage. The typical inability to have multiple primary languages results in a situation where phoneme variations in languages are difficult to interpret due to their abundance.

“People who have a common language have learned to adopt a particular system and moving to another language means acquiring a new and additional system of phonemic organization.” (“Homo Loquens”, page 16)

This can be difficult. It is also unclear whether or not mastering phonemic pronunciation in a language guarantees phonemic comprehension. Perhaps differentiating spoken phonemes is more difficult than actually speaking them. This hypothesis is somewhat supported by Professor Campbell’s experience with fluent non-native English speakers.

There are additional criteria that lend themselves to word comprehension. Intonation and rhythm can dramatically affect the meaning that is being conveyed by the speaker.

“The various intonations that can be given to a sentence are themselves part of the grammar of the spoken language and the information about the intonation system is another component in the linguistic knowledge stored by the brain.” (Homo Loquens, page 16)

But these variations are typically less language dependent and are fewer in number than the differing phonemes.

“So much emphasis has been placed on the phoneme level of operation because this is where the main ear-work of speech takes place. [intonation and rhythm, while important to comprehension, involve a significantly smaller number of categories]…the English system, for example, functions with six tones and only two rhythmic categories, formed by the strong syllables and the weaker ones.” (Homo Loquens, page 72)

All of this is to say nothing of the tremendous differences in sentence construction between various languages that can add to or detract from one’s ability to achieve comprehension from context. Simple things like adjectives preceding or following nouns can severely obstruct one’s ability to gather meaning.

In essence, there are several logical explanations that describe the perceived inability of non-native speakers to comprehend a familiar language, particularly when spoken in a noisy environment.

Speech Intelligibility Derivations
The goal in developing a good speech transmission system is to determine what conditions are necessary for the maximum intelligibility. This intelligibility “… is used to signify the accuracy and ease with which the articulated sounds of speech are recognized.” (Olson, page 495)

The criteria used to determine the effectiveness of this speech transmission system are intelligibility indices that are based upon signal and noise levels over specified bandwidths. The fundamental methods used to determine intelligibility involve “… pronouncing speech sounds into one end of a transmission system and having the observer write the sounds that are heard at the receiving end.” (Olson, page 495)

“According to the work of French and Steinberg and of Beranek, if the spectrum levels of speech at a listener’s ear are such that the shaded region of [the figure] lies above the threshold of hearing of the listener and above the ambient noise, but below the overload line, all syllables of the speech will be audible to the listener and the speech intelligibility will be nearly perfect. This corresponds to an articulation index of 100 percent….” The percentage articulation index is defined as the ratio (times 100) of the speech area not covered over by [noise, the threshold of hearing, or overload] to the total speech area… (Beranek pp. 408-409)
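The ratio in that definition can be illustrated with a rough Python sketch. It is a simplification under stated assumptions, not Beranek's full procedure: the speech area is approximated as a set of equally weighted frequency bands on a plain decibel scale, and every level in the example is hypothetical.

```python
from typing import NamedTuple


class Band(NamedTuple):
    """Levels in one frequency band, all in dB (hypothetical values)."""
    speech_min_db: float   # lower edge of the speech area
    speech_max_db: float   # upper edge of the speech area (speech peaks)
    noise_db: float        # ambient noise level
    threshold_db: float    # listener's threshold of hearing
    overload_db: float     # overload line


def articulation_index_percent(bands):
    """Share of the speech area not covered by noise, threshold or overload."""
    total = 0.0
    uncovered = 0.0
    for b in bands:
        floor = max(b.speech_min_db, b.noise_db, b.threshold_db)
        ceiling = min(b.speech_max_db, b.overload_db)
        total += b.speech_max_db - b.speech_min_db
        uncovered += max(0.0, ceiling - floor)
    return 100.0 * uncovered / total


# Three assumed bands with the same speech range but increasing noise.
bands = [
    Band(40.0, 70.0, noise_db=30.0, threshold_db=10.0, overload_db=100.0),  # fully clear
    Band(40.0, 70.0, noise_db=55.0, threshold_db=10.0, overload_db=100.0),  # half masked
    Band(40.0, 70.0, noise_db=80.0, threshold_db=10.0, overload_db=100.0),  # fully masked
]
print(f"Articulation index: {articulation_index_percent(bands):.0f} percent")
```

With these numbers half of the speech area remains uncovered, so the sketch reports 50 percent.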

In order to calculate these quantities for a theoretical system, the gain of the system, coupled with the directivity index of the amplification system and the reverberant characteristics of the space can be used to determine the average, peak and minimum levels of speech and noise in a given space.

The problem here is that the tests that were used to arrive at these conclusions involved native speakers of English listening to native speakers of English. Beranek and others do recognize the significance of “psychological and linguistic” factors: different native speakers, different word lengths, and trained or untrained listeners all yield dramatically different articulation results, which makes “absolute predictions of articulation scores … not possible”. Nevertheless, the contention remains that “one can say that if the calculated articulation index exceeds 60 per cent, a speech-communication system is probably satisfactory.” (Beranek, p. 415)

Much of the basis for the additional indices such as STI (a general purpose speech intelligibility index based upon SNR and reverberation), RASTI (similar to STI but requiring less data) and %ALCons (the percentage articulation loss of consonants, that is, the share of consonants that will not be detected clearly; consonant recognition is paramount to comprehension) has evolved from these early studies into speech intelligibility.
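To show how SNR and reverberation combine in an STI-style index, here is a heavily simplified Python sketch. It uses the commonly published modulation transfer relation for a diffuse reverberant field with steady noise, but treats the whole signal as a single band with equal weighting. The full method uses seven octave bands with weighting and masking corrections, so this is an illustration under stated assumptions, not a standards-compliant calculation. The function names are hypothetical.

```python
import math

# Modulation frequencies in Hz, in one-third-octave steps.
MODULATION_FREQS_HZ = [0.63, 0.8, 1.0, 1.25, 1.6, 2.0, 2.5,
                       3.15, 4.0, 5.0, 6.3, 8.0, 10.0, 12.5]


def modulation_index(f_mod_hz, rt60_s, snr_db):
    """Modulation reduction caused by reverberation and by steady noise."""
    reverb_term = 1.0 / math.sqrt(
        1.0 + (2.0 * math.pi * f_mod_hz * rt60_s / 13.8) ** 2)
    noise_term = 1.0 / (1.0 + 10.0 ** (-snr_db / 10.0))
    return reverb_term * noise_term


def simplified_sti(rt60_s, snr_db):
    """Single-band, unweighted approximation of an STI-style index (0 to 1)."""
    indices = []
    for f_mod in MODULATION_FREQS_HZ:
        m = modulation_index(f_mod, rt60_s, snr_db)
        m = min(max(m, 1e-9), 1.0 - 1e-9)             # keep the logarithm finite
        apparent_snr_db = 10.0 * math.log10(m / (1.0 - m))
        apparent_snr_db = max(-15.0, min(15.0, apparent_snr_db))
        indices.append((apparent_snr_db + 15.0) / 30.0)
    return sum(indices) / len(indices)


# Hypothetical rooms: reverberation time in seconds, SNR in dB.
for rt60_s, snr_db in [(0.5, 15.0), (2.0, 15.0), (2.0, 0.0)]:
    print(f"RT60 {rt60_s} s, SNR {snr_db} dB -> {simplified_sti(rt60_s, snr_db):.2f}")
```

Nothing in this calculation accounts for the listener's native language, which is the gap the article is concerned with.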

More recent social and technological changes require that additional steps be taken to ensure public safety. The original conclusions are all based upon the premise that the vast majority of speech takes place between native speakers of a common language, and therefore that the resulting indices will suffice for all communication.

This was perhaps true at some point in the past. As technology expands and the world in essence shrinks, diverse language histories will frequently come in contact with one another. The simple experiments conducted by Professor Campbell indicate that non-native speaking students in one controlled environment correctly identified less than 40 percent of the words. Clearly additional work must be done in this area.

Possible Solutions
It is unclear whether this language based influence on speech intelligibility is noise dependent or if in any environment there is a fundamental inability to understand some of what is being said due to a lack of phoneme distinction. In either case, some thought and analysis can yield better speech intelligibility criteria than those that exist currently.

One enhancement is to limit the number of words that are used in emergency announcements and familiarize the populace with this limited word list. This reduced emergency list would be an acoustical analog to the universal symbol for choking.

This would prove to be a tremendous benefit due to our inherent ability to mentally insert phonemes that have been masked by a noise when we are familiar with the context or perhaps even already aware of the available words. This ability is known as verbal auditory induction.

“Verbal auditory induction (phonemic restoration) employs contextual information of speech in determining the identity of the missing sound. The restored phoneme is indistinguishable to the listener from those physically present…the apparent position of the extraneous sound can be made to drift forward or backward in the sentence, although its exact location remains unclear.” (Contemporary Issues in Experimental Phonetics, p. 412)

In other words, this mental insertion of a masked phoneme is so natural that the listener is unable to identify which phoneme had been masked. The synthesized phoneme is inserted and recalled as if it were actually heard.

As a result of the noise immunity of a predictable message, we see that experimentally, the intelligibility indices increase dramatically. “The precision with which listeners identify speech elements is intimately related to the size of the vocabulary and to the sequential or contextual constraints that exist in the message. The percent correct is higher the more predictable the message, either by virtue of higher probability of occurrence or owing to the conditional probabilities associated with the linguistic and contextual structure.” (Speech Analysis, p. 303)

In addition, “…as vocabulary size increases, the signal-to-noise ratio necessary to maintain a given level of performance also increases.” (Speech Analysis, p. 304)

In general “… speech perception … is a process in which the detection procedure probably is tailored to fit the signal and the listening task. If the listener is able to impose a linguistic organization upon the sounds, he may use the information that is temporally dispersed to arrive at a decision about a given sound element. If such an association is not made, the decision tends to be made more upon the acoustic factors of the moment and in comparison to whatever standard is available.” (Speech Analysis, p. 306)

It would seem that - regardless of the native language of the listener - if the message possibilities were known and anticipated, then intelligibility would be better. This would be particularly true if the message was specifically chosen to utilize the most common phonemes that are readily recognizable and distinguishable in most languages (perhaps even only those languages that are present in a particular ethnic cross section of a region where a public building is constructed).

For example, if at a public assembly a warning signal were sounded to alert those in attendance that important instructions were to follow, and if the possible message choices were limited to perhaps three that had been made familiar to the audience earlier, then the chances that all would comprehend the message would go up dramatically.
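The benefit of a small, familiar message set can be illustrated with a toy Python sketch. Letters stand in for phonemes and "?" marks a sound masked by noise; the word lists and the matching rule are hypothetical and far cruder than real speech perception.

```python
def candidates(heard, messages):
    """Messages consistent with what was heard; '?' marks a masked sound."""
    return [
        message for message in messages
        if len(message) == len(heard)
        and all(h in ("?", c) for h, c in zip(heard, message))
    ]


# Assumed message sets: three known emergency words versus an open vocabulary.
emergency_set = ["leave", "stay", "wait"]
open_vocabulary = ["leave", "lease", "peace", "weave", "heave", "tease",
                   "stay", "wait"]

heard = "?ea?e"  # first and fourth sounds masked by noise
print(candidates(heard, emergency_set))
print(candidates(heard, open_vocabulary))
```

With the three-word set the damaged word can only be one message, while the open vocabulary leaves six possibilities, which mirrors the link between vocabulary size and required signal-to-noise ratio noted in the quotations above.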

This simple system could readily be employed and maintained consistently over a given geographical area; however, a national solution is certainly more practical.

Additional methods that can be utilized to improve intelligibility involve improving the signal-to-noise ratio. The best way to do this consistently is to increase the signal level without distortion. This could be implemented as an adaptive filtering scheme that adjusts the equalization of the system in real time so that it is optimized for the given message.

This optimization would involve the elimination of most of the low and high frequency acoustic output while focusing on the vocal range. The level over this band could be increased and if appropriate devices had been chosen, the system could reproduce the message quite a bit louder than the level at which it is run for normal playback.
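A back-of-the-envelope Python sketch shows how much level might be freed by band-limiting. It assumes the system is limited by total power, that the program is described by octave-band levels, and that the vocal range runs from the 250 Hz to the 4 kHz octave band. All three are assumptions made for illustration, and real headroom depends on the amplifiers and loudspeakers actually used.

```python
import math


def band_limit_headroom_db(band_levels_db, low_hz, high_hz):
    """Gain (dB) available in the kept bands for the same total power."""
    def power(level_db):
        return 10.0 ** (level_db / 10.0)

    total = sum(power(level) for level in band_levels_db.values())
    kept = sum(power(level) for freq, level in band_levels_db.items()
               if low_hz <= freq <= high_hz)
    if kept == 0.0:
        raise ValueError("no bands fall inside the kept range")
    return 10.0 * math.log10(total / kept)


# Hypothetical flat program spectrum: octave-band center (Hz) -> level (dB).
flat_program = {63: 80.0, 125: 80.0, 250: 80.0, 500: 80.0,
                1000: 80.0, 2000: 80.0, 4000: 80.0, 8000: 80.0}

headroom_db = band_limit_headroom_db(flat_program, low_hz=250, high_hz=4000)
print(f"Headroom gained by band-limiting: {headroom_db:.1f} dB")
```

Program material with more low-frequency energy than this flat example would free up more headroom when band-limited.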

Conclusions
It is apparent that this phoneme phenomenon deserves additional attention, particularly in light of the ever-increasing ethnic diversity that we experience in these United States. The problem is understood, but its extent is not readily quantifiable. It would seem as if it possesses the potential to be devastating.

Additional speech intelligibility research must be done with diverse subjects, and methods of counteracting the effects of non-native speaker speech intelligibility degradation should be developed and employed to give further guarantee that the safety of the general public is the principal goal of a successful sound reinforcement system.

References

Beranek, Leo L.: Acoustics. Acoustical Society of America by American Institute of Physics, 1986.

Flanagan, James L.: Speech Analysis Synthesis and Perception. 2nd Edition. Springer-Verlag, Berlin, Heidelberg, New York, 1972.

Fry, Dennis: Homo Loquens: Man as a talking animal. Cambridge University Press: Cambridge, London, New York, Melbourne, 1977.

Lass, Norman J., Ed.: Contemporary Issues in Experimental Phonetics. Academic Press: New York, San Francisco, London, 1976.

Lathi, B.P.: Modern Digital and Analog Communication Systems. 2nd Edition. Holt, Rinehart and Winston, Inc. Philadelphia, Fort Worth, Chicago, San Francisco, Montreal, Toronto, London, Sydney, Tokyo, 1989.

Olson, Harry F.: Acoustical Engineering. Professional Audio Journals, Inc.: Philadelphia, Pennsylvania, 1991.

Jeff Rocha is director of loudspeaker design for EAW