Study of Acoustic Correlates Associated with Emotional Speech
Serdar Yildirim
Sungbok Lee, Murtaza Bulut, Chul Min Lee, Abe Kazemzadeh, Carlos Busso, Shrikanth Narayanan
University of Southern California
Los Angeles, CA, 90089
Popular version of paper 1aSC10
Presented Monday morning, November 15, 2004
148th ASA Meeting, San Diego, CA
Human speech carries information about both the linguistic content and
the emotional/attitudinal state of the speaker. This study investigates the
acoustic characteristics of four different emotions expressed in speech. The
goal is to obtain detailed acoustic knowledge of how the speech signal is modulated
by changes from an emotionally neutral state to a specific emotionally aroused
state. Such knowledge is necessary for the automatic assessment of emotional
content and strength, as well as for emotional speech synthesis, which should
help develop more efficient and user-friendly man-machine communication systems.
For instance, consider an automated call center application where, depending
on the detected emotional state of the user during the interaction -- such as
displeasure or anger due to errors in understanding the user's requests -- the system
could transfer the troubled user to a human operator before premature man-machine
dialogue disruption. Similarly, development of speech synthesis systems capable
of emotional speech will enable the computer to interact with the user more
naturally, such as by adopting a tone of voice appropriate to the dialogue situation.
In this study, emotional speech data obtained from two semi-professional actresses are analyzed and compared. Each subject produced 211 sentences with four different emotions: neutral, sad, angry, and happy. We analyze changes in acoustic parameters such as the magnitude and variability of segmental (i.e., phonemic) duration, RMS energy, fundamental frequency, and the first three formant frequencies as a function of emotion type. Segmental duration here means the duration of a spoken phoneme; RMS energy is correlated with the loudness of speech; and the fundamental frequency and formant frequencies are related to a speaker's individual voice characteristics. The changes of these acoustic parameters over time are known to be correlated not only with what is said but also with how it is said. A change in emotion is therefore expected to be reflected in changes in these parameters, when compared to those of neutral speech. Acoustic differences among the emotions are also explored through mutual information computation, multidimensional scaling, and acoustic likelihood comparison with neutral speech. These are mathematical methods used to quantify or visualize similarity among objects.
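Two of the parameters above, RMS energy and fundamental frequency, can be extracted frame by frame from a waveform. The following is a minimal sketch of one common approach (short-time RMS plus an autocorrelation-based F0 estimate), applied here to a synthetic 220 Hz tone standing in for a recorded vowel; the function names and frame sizes are illustrative choices, not the paper's actual analysis pipeline:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def rms_energy(frames):
    """Root-mean-square energy per frame (correlates with loudness)."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def estimate_f0(frame, sr, fmin=75.0, fmax=400.0):
    """Crude autocorrelation-based F0 estimate for one voiced frame."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])  # strongest lag in the plausible pitch range
    return sr / lag

# Synthetic stand-in for a recorded vowel: 220 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(0, 0.5, 1.0 / sr)
signal = 0.5 * np.sin(2 * np.pi * 220.0 * t)

frames = frame_signal(signal, frame_len=640, hop=160)  # 40 ms windows, 10 ms hop
energies = rms_energy(frames)
f0 = estimate_f0(frames[0], sr)
print(round(f0, 1))  # close to 220 Hz
```

Tracking these values across an utterance yields the contours whose means and variability are compared across emotions below.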
Current results indicate that speech associated with anger and happiness is characterized by longer segmental durations, shorter inter-word silences, and higher pitch and energy with a wider dynamic range. Sadness is distinguished from the other emotions by lower energy and longer inter-word silences. Interestingly, the difference in formant pattern between [happiness/anger] and [neutral/sadness] is better reflected in back vowels such as /a/ (as in "father") than in front vowels. Some detailed results on segmental duration, fundamental frequency and formant frequencies, and energy are given below.
Duration
Statistical data analysis (using analysis of variance, ANOVA) showed that the effect of emotion on duration parameters such as utterance duration, inter-word silence/speech ratio, speaking rate, and average vowel duration is significant. Moreover, our results showed that angry and happy speech have longer average utterance and vowel durations compared to neutral and sad speech. In terms of the inter-word silence/speech ratio, sad speech contains more pauses between words than the other emotions. Our analysis also indicated that sad, angry, and happy speech have greater variability in speaking rate than neutral speech.
Fundamental Frequency and Formant Frequencies
Our statistical analysis indicated that the effect of emotion on fundamental frequency (F0) is significant (p < 0.001). The mean (standard deviation) of F0 is 188 (49) Hz for neutral, 195 (66) Hz for sad, 233 (84) Hz for angry, and 237 (83) Hz for happy. Earlier studies report that mean F0 is lower in sad speech compared to neutral speech [Murray93]. This tendency is not observed for this particular subject. However, it is confirmed that angry and happy speech have higher F0 values and greater variation compared to neutral speech. As we can observe from Figure 1, mean vowel F0 values for neutral speech are lower than those of the other emotion categories. It is also observed that angry/happy and sad/neutral show similar F0 values on average, suggesting similar F0 modulation within each of the two groups.
Energy
Our analysis based on RMS energy showed that sad speech has a lower median value and a smaller spread in energy than the other emotions, while angry and happy speech have higher median values and a greater spread. ANOVA again indicates that the effect of emotion is significant (p < 0.001). According to our statistical analysis, RMS energy is the best single parameter for separating the emotion classes.
Mutual Info (bits): 0.4810, 0.5202, 0.8189, 0.7988
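Mutual information of this kind measures, in bits, how much knowing an acoustic parameter reduces uncertainty about the emotion category. A minimal histogram-based sketch, using synthetic feature values rather than the paper's data (the bin count and class separations are illustrative assumptions):

```python
import numpy as np

def mutual_information(feature, labels, bins=16):
    """I(feature; label) in bits, estimated by discretizing the feature."""
    edges = np.histogram_bin_edges(feature, bins=bins)
    fx = np.digitize(feature, edges[1:-1])          # bin index 0..bins-1
    classes = np.unique(labels)
    pxy = np.zeros((bins, len(classes)))
    for j, c in enumerate(classes):
        pxy[:, j] = np.bincount(fx[labels == c], minlength=bins)
    pxy /= len(feature)                             # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)             # marginal over feature bins
    py = pxy.sum(axis=0, keepdims=True)             # marginal over classes
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

# Hypothetical per-emotion feature distributions: well-separated class means
# give high mutual information; heavily overlapping ones give low values.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(4), 500)               # four emotion categories
separated = rng.normal(loc=labels * 3.0, scale=0.5)
overlapping = rng.normal(loc=labels * 0.1, scale=1.0)

print(mutual_information(separated, labels))    # approaches log2(4) = 2 bits
print(mutual_information(overlapping, labels))  # much smaller
```

A parameter whose distributions differ sharply across emotions, such as RMS energy here, therefore carries more bits about the emotion class than one whose distributions overlap.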