Factors That Affect Intelligibility in Sound Systems
Written by Ralph Jones
Edited by Rachel Murray, P.E.
To comment or ask further questions about this article, contact techsupport@meyersound.com
The goal of a speech reinforcement system is to deliver the speaking
voice to listeners with sufficient clarity to be understood. Given
the complexity of the speech signal, the task of providing high-quality
speech reinforcement in real-world, less-than-ideal conditions is
doubly complicated.
Here (below) is a diagram of a simplified speech reinforcement system
showing the main factors that affect intelligibility.

As the diagram indicates, a number of acoustic, electromechanical
and electronic factors need to be considered if intelligibility
is to be maintained. In order to deal with all of these factors
effectively, one must understand how each affects the speech signal.
Masking
The most common obstacle that speech system designers face is the
intrusion of unwanted sounds that inevitably interfere with the
speech signal. The effect is called “masking,”
a general term that covers a very wide variety of situations.
Masking noise can come from acoustical sources such as ventilation
equipment, traffic, crowds and, commonly, reverberation and echoes.
It can also arise electronically from thermal noise, tape hiss or
distortion products. If the sound system has unusually large peaks
in its frequency response, the speech signal can even end up masking
itself.
The relationship between the strength of the speech signal and that of the
masking sound is the signal-to-noise (S/N) ratio, expressed in decibels.
Ideally, the S/N ratio is greater than 0 dB, indicating that the
speech is louder than the noise. Just how much louder the speech
needs to be in order to be understood varies with, among other things,
the type and spectral content of the masking noise.
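As a rough illustration of the decibel arithmetic, here is a minimal Python sketch. It assumes the speech and noise levels are available as RMS amplitudes in the same units; the function names `rms` and `snr_db` are illustrative and do not come from any measurement standard or library.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def snr_db(speech_rms, noise_rms):
    """S/N ratio in dB from two RMS amplitudes (assumed to share units)."""
    return 20.0 * math.log10(speech_rms / noise_rms)

# Equal levels give 0 dB; speech at four times the noise amplitude
# works out to roughly 12 dB.
print(round(snr_db(1.0, 1.0), 1))
print(round(snr_db(4.0, 1.0), 1))
```

A negative result means the noise is louder than the speech.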
The most uniformly effective mask is broadband noise. Here is a
chart showing word articulation versus S/N when the masking source
is noise spanning 20 Hz to 4 kHz. Notice that the signal must be
12 dB louder than the broadband noise to achieve 80% word recognition.
Although narrow-band noise is less effective at masking speech
than broadband noise, the degree of masking varies with frequency.
Here is a chart showing word articulation versus S/N for two noise
bands — 135 to 400 Hz (the fundamental frequency range of
speech) and 1800 to 2500 Hz (the strongest consonant frequency range).
High-frequency noise masks only the consonants, and its effectiveness
as a mask decreases as the noise gets louder. But low-frequency
noise is a much more effective mask when the noise is louder than
the speech signal, and at high sound pressure levels it masks both
vowels and consonants. This is why the proximity effect of cardioid
microphones can be so harmful to speech intelligibility: it causes
the speech signal to mask itself. While cardioids are very useful
for minimizing noise pickup at the source, they should always be
used with a steep (12 dB/octave or greater) high-pass filter tuned to about
100 Hz (or higher, if the speaker’s voice range allows) so
that proximity effect problems are minimized.
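For readers who implement such a filter digitally, the following is one possible sketch of a 12 dB/octave high-pass at 100 Hz. It is not taken from the article: it assumes a second-order digital filter built from the widely published "Audio EQ Cookbook" biquad formulas, a Butterworth alignment (Q of about 0.707) and a 48 kHz sample rate, and all function names are illustrative.

```python
import cmath
import math

def highpass_biquad(cutoff_hz, sample_rate_hz, q=1 / math.sqrt(2)):
    """Second-order (12 dB/octave) high-pass coefficients, with a[0] == 1."""
    w0 = 2 * math.pi * cutoff_hz / sample_rate_hz
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    a0 = 1 + alpha
    b = [(1 + cos_w0) / (2 * a0), -(1 + cos_w0) / a0, (1 + cos_w0) / (2 * a0)]
    a = [1.0, -2 * cos_w0 / a0, (1 - alpha) / a0]
    return b, a

def gain_db(b, a, freq_hz, sample_rate_hz):
    """Filter gain in dB at one frequency, from the transfer function."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate_hz)
    numerator = b[0] + b[1] * z_inv + b[2] * z_inv ** 2
    denominator = a[0] + a[1] * z_inv + a[2] * z_inv ** 2
    return 20 * math.log10(abs(numerator / denominator))

def apply_filter(b, a, samples):
    """Run the filter over a list of samples (direct form I)."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

SAMPLE_RATE_HZ = 48000  # assumed sample rate, not from the article
b, a = highpass_biquad(100.0, SAMPLE_RATE_HZ)
for freq in (25, 50, 100, 200, 1000):
    print(freq, "Hz:", round(gain_db(b, a, freq, SAMPLE_RATE_HZ), 1), "dB")
```

With this alignment the response is about 3 dB down at the 100 Hz corner and falls by roughly 12 dB for each octave below it. A steeper slope would need a higher-order filter, for example two such sections in series.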
A human voice delivering a competing message, sometimes called a
“distractor,” is also very good at masking speech —
particularly at or below 0 dB S/N. In addition, the masking effect
increases with the number of distractor voices. Here is a diagram
comparing masking for one, two and three voices. Notice that, below
0 dB S/N, three voices become just as effective a source of masking
as broadband noise. Above 0 dB S/N, however, intelligibility improves
rapidly as the S/N increases. This illustrates the importance of
having sufficient power in a paging system to overcome crowd noise.
The direction from which a masking sound arrives, relative to the
direction of the speech signal, can affect the degree of masking.
If the noise comes from the same place, the masking is greatest;
it decreases as the distance between the noise and the speech increases
because this makes it easier for the brain to discriminate between
them. The masking effect is lowest when the presentation is through
headphones, with the speech in one ear and the mask in the other.
(Unfortunately, we can’t take advantage of that feature in
sound reinforcement.)
From this discussion, we can see why reverberation is so destructive
of intelligibility, especially beyond critical distance. Being itself
caused by the speech, reverb mimics the speech spectrum, but generally
with greater low-frequency energy. Sufficiently long reverb and
echoes — such as are encountered in cathedrals and large sports
arenas — can actually function like multiple distractor voices.
And by its nature, reverberant energy arrives from all angles, so
it’s hard to separate from the speech using directional cues.
Frequency Response
One of the most obvious aspects of sound system performance that
affect intelligibility is frequency response. Severely band-limited
systems deliver speech poorly. For instance, telephones are generally
limited to a 2 kHz bandwidth, and this makes it hard to distinguish
between “f” and “s” or “d” and
“t” sounds.
High-quality speech systems need to cover the frequency range of
about 80 Hz (for especially deep male voices) to about 10 kHz (for
best reproduction of consonants, which are crucial to intelligibility).
Response below 80 Hz must be eliminated to the extent possible:
not only do these frequencies fall below the range of the speech
signal, but also they will cause particularly destructive masking
at high sound levels.
It’s important, also, for the system response to be reasonably
flat throughout its range. The gradual high-frequency rolloff that
many reinforcement professionals favor for music applications will
tend to de-emphasize consonants, which are already as much as 27
dB less loud than vowels. Likewise, prominent peaks or dips in the
response can cause either self-masking or loss of consonant articulation.
Finally, the coverage of the system must be consistent throughout
the intended listener area, with minimal response cancellations
or off-axis dropoff in the critical high frequencies. This requirement
very often dictates either a distributed loudspeaker system or carefully
aimed and delayed fill speakers. Using high-Q loudspeakers will
help to elevate the S/N ratio between the speech and the reverberation
levels.
Distortion
Early studies of intelligibility in communication systems suggest
that clipping the peaks of the speech signal, and then amplifying
it to restore its peak-to-peak amplitude, improves intelligibility.
The trick works in very noisy situations because clipping generates
partials that are harmonically related to the fundamental —
and thus less likely to mask the speech — and because it both
accentuates consonants and increases the sound power of the signal.
As such, it has been helpful for band-limited communication systems
that are used in very noisy environments, such as the deck of an
aircraft carrier.
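The clip-then-restore idea can be sketched in a few lines of Python. This is only an illustration on a sine wave, not the procedure used in the studies mentioned above; the function names and the 0.25 clipping threshold are assumptions chosen for the example.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def clip_and_restore(samples, threshold):
    """Clip peaks at +/-threshold, then amplify back to the original peak.

    Assumes 0 < threshold <= the original peak level.
    """
    peak = max(abs(x) for x in samples)
    clipped = [max(-threshold, min(threshold, x)) for x in samples]
    gain = peak / threshold
    return [x * gain for x in clipped]

def infinite_clip(samples):
    """Keep only the sign of each sample, scaled to the original peak."""
    peak = max(abs(x) for x in samples)
    return [math.copysign(peak, x) if x != 0 else 0.0 for x in samples]

# One cycle of a unit-amplitude sine as a stand-in test signal.
sine = [math.sin(2 * math.pi * n / 1000) for n in range(1000)]
restored = clip_and_restore(sine, 0.25)

print("original RMS:       ", round(rms(sine), 3))
print("clipped and restored:", round(rms(restored), 3))
print("infinitely clipped:  ", round(rms(infinite_clip(sine)), 3))
```

The peak level is unchanged in all three cases, but the RMS level rises as the clipping gets harder, which is the increase in sound power described above.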
The fact is, however, that clipping the signal to improve intelligibility
works only in cases where the signal-to-noise ratio is very poor.
Here (below) is a chart showing word articulation versus S/N for
an infinitely clipped and an unclipped speech signal.

Notice that the intelligibility score for the clipped signal levels
out to around 50% at 0 dB S/N; above about +3 dB S/N, the unclipped
signal scores better.
In real-life speech reinforcement systems, clipping should be avoided.
Obviously, it will sound objectionable through a high-quality sound
system. It also will increase the masking from any noise that is
picked up by the microphone, since that noise will be clipped along
with the speech.
Another type of distortion that is very destructive to intelligibility
is intermodulation distortion. While it is easily controlled in
the electronics of a sound system, significant IM can be generated
when some types of loudspeakers (particularly two-way coaxials)
are driven at high levels. IM produces sum and difference products
that are not harmonically related to the fundamental frequency.
As such, they have a much greater masking effect than the harmonic
products of clipping.
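A small numerical example may make the distinction clearer. The sketch below considers only the simplest (second-order) sum and difference products of two tones; real loudspeakers also generate higher-order products. The tone frequencies and function names are hypothetical, chosen only for illustration.

```python
def im_products(f1_hz, f2_hz):
    """Second-order intermodulation products: (sum, difference) in Hz."""
    return f1_hz + f2_hz, abs(f1_hz - f2_hz)

def is_harmonic_of(freq_hz, fundamental_hz, tolerance=1e-6):
    """True if freq_hz is a whole-number multiple of fundamental_hz."""
    ratio = freq_hz / fundamental_hz
    return round(ratio) >= 1 and abs(ratio - round(ratio)) < tolerance

# Hypothetical pair of tones: a 200 Hz fundamental and a 1700 Hz component.
low_hz, high_hz = 200, 1700
for product_hz in im_products(low_hz, high_hz):
    print(product_hz, "Hz, harmonic of", low_hz, "Hz:",
          is_harmonic_of(product_hz, low_hz))

# By contrast, harmonic distortion of the 200 Hz tone lands on multiples.
for harmonic_hz in (400, 600, 800):
    print(harmonic_hz, "Hz, harmonic of", low_hz, "Hz:",
          is_harmonic_of(harmonic_hz, low_hz))
```

In this example the products at 1900 Hz and 1500 Hz are not multiples of either tone, whereas the clipping harmonics at 400, 600 and 800 Hz are multiples of 200 Hz.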
Time Response
Perhaps because it remains poorly understood and its effects are
more subtle, phase response in communication systems has received
scant attention. In fact, most published research about “phase”
and intelligibility actually deals with the effects of relative
polarity. It’s been shown, for instance, that when speech
is presented with noise over headphones, intelligibility increases
by about 25% if the speech signal in one ear is inverted relative
to the other ear. But this result has no application in sound reinforcement,
other than for in-ear stage monitors.
Again, Meyer encourages your comments and questions
regarding this article. Contact techsupport@meyersound.com