SPECTRAL METHODS IN POLISH EMOTIONAL SPEECH RECOGNITION

This article presents emotion recognition based on the analysis of Polish emotional speech signals. The Polish database of emotional speech, prepared and shared by the Medical Electronics Division of the Lodz University of Technology, was used for the research. The speech signal was processed by artificial neural networks (ANN). The ANN inputs were derived from the signal spectrogram. The research was conducted for three different spectrogram divisions. The ANN consists of four layers, but the number of neurons in each layer depends on the spectrogram division. The research focused on six emotional states: a neutral state, sadness, joy, anger, fear and boredom. The average effectiveness of emotion recognition was about 80%.


INTRODUCTION
For humans, speech is the main tool of communication. Factors such as the speaker's age, language, emotions, gender and many others can influence the features of speech [14]. These factors give the listener additional information, and emotion in particular can add and communicate specific meaning. Information conveyed by voice intonation carries more than the purely textual meaning: the same sentence pronounced with different emotions can have a completely different meaning [15].
A sentence spoken without any emotion conveys no extra information to the listener, but for automatic speech recognition systems this is the ideal situation. Emotional states cause substantial changes in speech parameters, which degrade the accuracy of speech recognition systems [7].
The largest problem for emotional speech recognition applications is the number of different emotional states. It is not trivial to construct a model that covers all emotions, so researchers usually consider states such as joy, sadness, boredom, fear and anger [4,7].
This article shows the use of spectrograms in Polish emotional speech recognition. The authors focused on the above-mentioned emotional states and added a neutral state, i.e. an emotionally unmarked state [3,7,14]. The research used the Polish Emotional Speech Database, prepared by the Lodz University of Technology. The database contains 240 recordings made by professional actors (4 men and 4 women). Each speaker pronounced 5 different sentences in the 6 emotional states mentioned above [7].
The aim of the presented research was to determine whether voice spectral analysis combined with artificial neural networks is sufficient to recognise the speaker's emotional state effectively. The second objective was to determine the optimal input parameters and the overall structure of the artificial neural networks used.

DESCRIPTION OF DATABASE
In research on emotional speech analysis the Berlin Database of Emotional Speech is commonly used. It contains recordings in seven emotional states: fear, anger, boredom, joy, sadness, disgust and a neutral state. The recordings were made by ten professional actors of both sexes [1]. For Polish emotional speech, researchers typically use the database prepared by the Medical Electronics Division of the Lodz University of Technology. Like the Berlin database, it was recorded by professional actors, in this case eight: four women and four men. The files were recorded in six emotional states: joy, anger, boredom, fear, sadness and a neutral state [8]. The whole database contains 240 recordings saved in the 'wav' format, sampled at 44.1 kHz with 16 bits per sample. The database includes the following statements: 'I stop shaving from today on', 'Johnny was at the hairdresser's today', 'They have bought a new car today', 'This lamp is on the desk today' and 'His girlfriend is coming here by plane'.

VOICE SPECTRAL ANALYSIS
The most commonly used tools in speech signal processing are time-frequency methods. The set of such methods is large, but it can be divided into two main groups: time-frequency representations and time-scale representations [10]. These methods can be interpreted as short-time frequency analysis, because they allow the speech signal to be estimated over finite time intervals. The estimation is carried out on fragments of the signal cut out by a window function [18].
Voice spectral analysis of Polish emotional speech is the main subject of this research. In general, a spectrogram is a visual representation of the signal's amplitude spectrum at each instant for which the signal is defined. It is constructed by dividing the signal into parts and computing the amplitudes of the harmonic components for each part. Frequency and time are the arguments of the spectrogram [6].

SHORT-TIME FOURIER TRANSFORM (STFT)
The STFT, like the spectrogram, plays a central role in speech signal analysis. Both methods belong to the group of time-frequency representations [9]. The STFT can be treated as a special case of the Gabor transformation [18]. For a continuous signal x(t) it is defined in the frequency domain by equation (1) and in the time domain by equation (2) [10], where w(t) is the window function with Fourier spectrum W(f) and X(f) is the spectrum of the analysed signal.
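For reference, the standard textbook form of the STFT, written with the symbols defined above, is sketched below. The integration variables \tau and \nu are introduced here for illustration, and the notation and normalisation of equations (1) and (2) in [10] may differ.

```latex
% Time-domain form: window w slides along the signal x
\[
\mathrm{STFT}_x(t, f) = \int_{-\infty}^{\infty} x(\tau)\, w^{*}(\tau - t)\, e^{-j 2\pi f \tau}\, d\tau
\]
% Equivalent frequency-domain form (follows from Parseval's theorem)
\[
\mathrm{STFT}_x(t, f) = e^{-j 2\pi f t} \int_{-\infty}^{\infty} X(\nu)\, W^{*}(\nu - f)\, e^{j 2\pi \nu t}\, d\nu
\]
```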

SELECTION OF SPECTROGRAM'S PARAMETERS
Obtaining a spectrogram that enables efficient inference depends on the selection of its parameters, such as the resolution in the time domain, the window function and the window width. The best resolution in the time domain is achieved with the maximum overlap, N_o = N - 1. The frequency resolution improves in proportion to the number of elements N of the window function [5,10]. Applying the maximum overlap seems the most desirable option, but it comes with a significant increase in computational effort. The window length determines the frequency resolution according to the relation Δf = f_p/N, where f_p is the sampling frequency. Selecting an appropriate window length N is more complex [5,10]. If the signal is modulated, the length of the time interval should be chosen so that the quotient of the mean frequency width B to the mean time width A equals the quotient of the frequency increase to the time over which it occurred, as given by equation (6) [10], where (7) is the mean-square frequency width of the window function w(t) with Fourier spectrum W(f) and (8) is its mean-square time width, subject to relation (9) [10].
For discrete signal analysis (4), the parameter N is responsible for the resolution in the frequency domain [10]. The time resolution can be increased by using the maximum overlap, but, as mentioned above, this significantly increases the amount of computation. To obtain a high time resolution without overlapping, time windows with a small number of elements should be used, but this results in a low frequency resolution [16,17]. In that case the resolution in the frequency domain can be increased by complementing the window with a sequence of zeros.
All calculations are performed on the basis of equation (4). Parts of the speech signal x(n) are consecutively cut out by the window function w(n), and the Discrete Fourier Transform (DFT) is calculated for each part. Using the STFT of a discrete signal, the spectrogram can be defined by equation (5) [10]. The choice of resolution in the time and frequency domains has the main influence on spectrogram quality. A wide window in the STFT guarantees a high resolution in the frequency domain, whereas a narrow window increases the resolution in the time domain. This effect follows from the relation between the time and spectral widths of the window function. To obtain a high time resolution, a window function w(n) with a small number of elements should be applied. If the number of elements N of the window function is small, the DFT, which is evaluated at successive frequencies spaced Δf = f_p/N apart (f_p is the sampling frequency), is computed with a large frequency step. An additional disadvantage is blurring of the spectrum, caused by the large width of the main lobe in the amplitude-frequency characteristic of the time window. For a high value of N, the resolution along the frequency axis increases and the width of the main lobe decreases. The disadvantage is that the DFT is then computed with a large time step Δt = N/f_p, which has a negative influence on spectrogram precision [2]. The overlapping method is used in the STFT computation to improve the spectrogram quality [11].
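The relations discussed in this section can be summarised in standard textbook notation. This is a sketch under the assumption that the usual definitions apply; the exact forms of the numbered equations in [10] may differ, and the indices n (time) and k (frequency) are introduced here for illustration.

```latex
% Discrete STFT of x(m) with window w and N frequency bins, and the spectrogram
\[
\mathrm{STFT}_x(n, k) = \sum_{m=-\infty}^{\infty} x(m)\, w^{*}(m - n)\, e^{-j 2\pi k m / N},
\qquad k = 0, 1, \ldots, N-1
\]
\[
S_x(n, k) = \left| \mathrm{STFT}_x(n, k) \right|^{2}
\]
% Mean-square time width A and frequency width B of a window centred at t = 0, f = 0
\[
A^{2} = \frac{\int_{-\infty}^{\infty} t^{2}\, |w(t)|^{2}\, dt}{\int_{-\infty}^{\infty} |w(t)|^{2}\, dt},
\qquad
B^{2} = \frac{\int_{-\infty}^{\infty} f^{2}\, |W(f)|^{2}\, df}{\int_{-\infty}^{\infty} |W(f)|^{2}\, df},
\qquad
A \cdot B \geq \frac{1}{4\pi}
\]
% Frequency step and time step for window length N and overlap N_o
\[
\Delta f = \frac{f_p}{N}, \qquad \Delta t = \frac{N - N_o}{f_p}
\]
```

With N_o = 0 the time step reduces to Δt = N/f_p, the value quoted in the text.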
Figure 1 illustrates the overlapping method.
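The trade-off between window length, overlap and resolution can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the time step is taken as the hop between successive windows, (N - N_o)/f_p, which is an assumption consistent with Δt = N/f_p for zero overlap.

```python
def stft_resolution(fs_hz, window_len, overlap):
    """Return (delta_f in Hz, delta_t in seconds) for the given window and overlap."""
    if not 0 <= overlap < window_len:
        raise ValueError("overlap must satisfy 0 <= overlap < window_len")
    delta_f = fs_hz / window_len      # spacing of DFT frequency bins
    hop = window_len - overlap        # samples between successive windows
    delta_t = hop / fs_hz             # time step between spectrogram columns
    return delta_f, delta_t


def frame_count(n_samples, window_len, overlap):
    """Number of complete windows that fit into a signal of n_samples."""
    hop = window_len - overlap
    if n_samples < window_len:
        return 0
    return 1 + (n_samples - window_len) // hop


if __name__ == "__main__":
    # 44.1 kHz sampling, 128-sample window, 50% overlap
    print(stft_resolution(44100, 128, 64))
    # Six-sample window with 50% overlap (N_o = 3) over a 12-sample signal
    print(frame_count(12, 6, 3))
```

Raising the overlap towards N - 1 increases the number of windows, and therefore the computational effort, without changing the frequency step.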

CONDUCTED RESEARCH
The main aim of the research was to identify emotion on the basis of Polish speech signal processing. The research focused on the use of spectrograms and artificial neural networks (ANN) to achieve this goal. As mentioned above, the speech signals were taken from the database prepared by the Medical Electronics Division of the Lodz University of Technology.
The research was conducted in two ways: first regardless of the speaker's sex, and then taking the speaker's sex into account. The first step was pre-processing: the amplitude values of the individual samples were normalised to the range from -1 to 1. The second step was framing: the whole signal was divided into short frames of 20 ms. To reduce the discontinuities at the frame edges, the Hamming window was used.
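The pre-processing steps above (amplitude normalisation, 20 ms framing, Hamming windowing) can be sketched as follows. This is a minimal illustration, not the authors' MatLab code: the function names, the use of NumPy, peak normalisation and non-overlapping frames are all assumptions.

```python
import numpy as np


def normalize(signal):
    """Scale samples so that the largest absolute value is 1 (range -1 to 1)."""
    signal = np.asarray(signal, dtype=float)
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal


def frame_signal(signal, fs_hz, frame_ms=20.0):
    """Split the signal into non-overlapping frames and apply a Hamming window."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(fs_hz * frame_ms / 1000.0))   # 882 samples at 44.1 kHz
    n_frames = len(signal) // frame_len                 # incomplete tail is dropped
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return frames * np.hamming(frame_len)               # taper the frame edges


if __name__ == "__main__":
    fs = 44100
    one_second = np.random.default_rng(0).normal(size=fs)
    frames = frame_signal(normalize(one_second), fs)
    print(frames.shape)
```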
The next step was to apply the Fast Fourier Transform (FFT) to transform each segment of the speech signal from the discrete time domain to the frequency domain.
After this transformation the spectrograms were created in the MatLab application. An example of the time-frequency signal representation is shown in Figure 2. The spectrograms were created using a Hamming window of 128 samples with 50% overlap.
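A spectrogram with the stated parameters (Hamming window of 128 samples, 50% overlap) can be sketched with NumPy as below. The authors used MatLab; this function, its name and the choice of the squared magnitude as the spectrogram value are assumptions made for illustration.

```python
import numpy as np


def spectrogram(signal, window_len=128, overlap=64):
    """Return squared-magnitude STFT, shape (window_len // 2 + 1, n_frames)."""
    signal = np.asarray(signal, dtype=float)
    hop = window_len - overlap
    window = np.hamming(window_len)
    n_frames = 1 + (len(signal) - window_len) // hop
    # Cut the signal into overlapping segments, one row per segment
    frames = np.stack(
        [signal[i * hop : i * hop + window_len] for i in range(n_frames)]
    )
    # FFT of each windowed segment; rfft keeps the non-negative frequencies
    spectrum = np.fft.rfft(frames * window, axis=1)
    return (np.abs(spectrum) ** 2).T


if __name__ == "__main__":
    fs = 44100
    t = np.arange(1024) / fs
    tone = np.sin(2 * np.pi * (10 * fs / 128) * t)   # tone centred on bin 10
    spec = spectrogram(tone)
    print(spec.shape, int(np.argmax(spec[:, 0])))
```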

FEATURE EXTRACTION
The main goal of the feature extraction process was to prepare the inputs for the ANN. The process was as follows:
• The spectrogram was converted to grey scale.
• The resulting spectrogram was converted to a binary image: values below the threshold were changed to 0 and values greater than the threshold to 1.
• The whole spectrogram was divided by a matrix of 3x3, 4x4 and 5x5 separately.
• All the values in each sub-area were summed and became the input parameters for the ANN.
Figure 3 shows the division by a 4x4 matrix. The research was conducted for each of the above-mentioned divisions.
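The feature extraction steps listed above can be sketched as follows. The threshold value is not given in the text, so the default of 0.5 on a 0 to 1 grey scale is an assumption, as are the function name and the min-max scaling used for the grey-scale conversion.

```python
import numpy as np


def grid_features(spec, grid=4, threshold=0.5):
    """Binarise a spectrogram and sum the ones in each cell of a grid x grid division."""
    spec = np.asarray(spec, dtype=float)
    lo, hi = spec.min(), spec.max()
    # Grey scale: map values linearly to the range 0..1
    grey = (spec - lo) / (hi - lo) if hi > lo else np.zeros_like(spec)
    # Binary image: 1 above the threshold, 0 otherwise
    binary = (grey > threshold).astype(int)
    # Divide into grid x grid sub-areas and sum each one
    features = [
        block.sum()
        for rows in np.array_split(binary, grid, axis=0)
        for block in np.array_split(rows, grid, axis=1)
    ]
    return np.array(features)


if __name__ == "__main__":
    demo = np.zeros((8, 8))
    demo[:2, :2] = 1.0            # energy only in the top-left sub-area
    print(grid_features(demo, grid=4))
```

A 3x3, 4x4 or 5x5 division yields 9, 16 or 25 sums respectively, which matches the input layer sizes of 17 and 26 neurons reported for the 4x4 and 5x5 cases once the additional neuron is included.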

ARTIFICIAL NEURAL NETWORKS
In this article the artificial neural networks were created in the MatLab application. The mathematical formulation of the ANN is given by equation (12) [13].
The resolution in the frequency domain can be increased by complementing the window with a sequence of zeros. That is, if the window function w(m) takes non-zero values for m = 0, 1, 2, …, N_p - 1, the window should be complemented with zeros up to length N. Based on equation (4), the STFT then takes the form of equation (10) [4,10].
When the spectral representation of the signal was divided by a 4x4 matrix, the changes relative to the previously described ANN were in the input layer and the hidden layers. All other parameters remained unchanged. In this case the input layer consisted of 17 neurons. As in the previous case, the additional neuron represented either the speaker's gender or a bias. The hidden layers had 34 neurons each. An example ANN architecture is shown in Figure 4.
The last part of the research was conducted for the 5x5 spectrogram division. The input layer consisted of 26 neurons. The first hidden layer consisted of 20 neurons and the second of 10 neurons. The output layer and the other ANN parameters were the same as in the previous cases.
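The layer sizes reported above can be written down as a small feed-forward sketch. Several details are assumptions made for illustration and are not stated in the text: an output layer of six neurons (one per emotional state), sigmoid activations, random initial weights, and a 0/1 encoding of the speaker's gender. All names are hypothetical.

```python
import numpy as np

# [input, hidden 1, hidden 2, output]; the output size of 6 is an assumption
LAYER_SIZES = {
    "4x4": [17, 34, 34, 6],
    "5x5": [26, 20, 10, 6],
}


def make_input(grid_sums, gender=None):
    """Append the extra input: the gender flag if given, otherwise a bias of 1."""
    extra = 1.0 if gender is None else float(gender)
    return np.append(np.asarray(grid_sums, dtype=float), extra)


def init_network(sizes, seed=0):
    """Create one (weights, biases) pair per pair of adjacent layers."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(network, x):
    """Propagate an input vector through all layers with sigmoid activations."""
    a = np.asarray(x, dtype=float)
    for weights, biases in network:
        a = 1.0 / (1.0 + np.exp(-(a @ weights + biases)))
    return a


if __name__ == "__main__":
    net = init_network(LAYER_SIZES["4x4"])
    x = make_input(np.zeros(16), gender=1)   # 16 grid sums plus the gender flag
    print(forward(net, x).shape)
```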
The artificial neural network was trained by setting the network parameters and the stopping criteria. As mentioned, the back propagation algorithm was used. Sets of inputs and desired outputs were presented to the network so that it could learn the relationships in the data. Error correction for this data set was carried out using a local training approach. The main difference between the traditional back propagation algorithm and the one used here is the use of a semi-supervised teaching technique. The main role of the hidden layers was to adjust the weights connected with each input node. The root mean square error was used for error calculation. If the error did not meet the stopping criterion, the algorithm propagated the error from the output layer back towards the input layer. The algorithm ran until the error criterion was satisfied.
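The training loop described above (forward pass, root mean square error, backward propagation of the error until a stopping criterion is met) can be illustrated with plain supervised back propagation. This sketch does not reproduce the semi-supervised or local training variants mentioned in the text; the learning rate, error goal, epoch limit, sigmoid activations and all names are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train(X, Y, sizes, lr=0.5, rmse_goal=0.05, max_epochs=20000, seed=0):
    """Batch back propagation; stops when RMSE <= rmse_goal or at max_epochs."""
    rng = np.random.default_rng(seed)
    # One weight matrix per layer pair; the extra row holds the bias weights
    weights = [
        rng.normal(0.0, 0.5, (n_in + 1, n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]
    ones = np.ones((len(X), 1))
    history = []
    for _ in range(max_epochs):
        # Forward pass, keeping the activations of every layer
        activations = [X]
        for W in weights:
            activations.append(sigmoid(np.hstack([activations[-1], ones]) @ W))
        error = activations[-1] - Y
        history.append(float(np.sqrt(np.mean(error ** 2))))
        if history[-1] <= rmse_goal:
            break
        # Backward pass: propagate the error from the output towards the input
        delta = error * activations[-1] * (1.0 - activations[-1])
        for i in reversed(range(len(weights))):
            grad = np.hstack([activations[i], ones]).T @ delta / len(X)
            if i > 0:
                # Delta for the previous layer uses the not-yet-updated weights
                delta = (delta @ weights[i][:-1].T) * activations[i] * (1.0 - activations[i])
            weights[i] -= lr * grad
    return weights, history


if __name__ == "__main__":
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    Y = np.array([[0], [0], [0], [1]], dtype=float)   # logical AND as a toy target
    _, history = train(X, Y, sizes=[2, 3, 1], max_epochs=2000)
    print(history[0], history[-1])
```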

ACHIEVED RESULTS AND DISCUSSION
The database prepared by the Medical Electronics Division of the Lodz University of Technology was used in the research. This database contains 240 files recorded in six emotional states: joy, anger, fear, boredom, sadness and a neutral state. The experimental results for the 3x3 spectrogram division are shown in Figures 5 to 7.
Tables 1 to 3 show the confusion matrices of the proposed algorithm. The tables were prepared with and without gender division. Each column represents the instances of an original class; each row represents the emotion predicted by the ANN.
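The layout of the confusion matrices (columns for the original class, rows for the predicted emotion) can be sketched as follows. The label order, the function names and the toy data in the usage example are hypothetical and are not taken from the reported results.

```python
import numpy as np

EMOTIONS = ["joy", "anger", "fear", "boredom", "sadness", "neutral"]


def confusion_matrix(actual, predicted, n_classes=len(EMOTIONS)):
    """Rows index the predicted class, columns index the original class."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        matrix[p, a] += 1
    return matrix


def per_class_recognition(matrix):
    """Fraction of each original class (column) that was predicted correctly."""
    totals = matrix.sum(axis=0)
    rates = np.zeros(len(totals))
    nonzero = totals > 0
    rates[nonzero] = np.diag(matrix)[nonzero] / totals[nonzero]
    return rates


if __name__ == "__main__":
    actual = [0, 0, 1, 5, 5]        # toy labels, indices into EMOTIONS
    predicted = [0, 1, 1, 5, 4]     # one joy->anger and one neutral->sadness error
    cm = confusion_matrix(actual, predicted)
    print(cm)
    print(per_class_recognition(cm))
```

Averaging the per-class rates over the classes present gives a single effectiveness figure of the kind quoted in the discussion.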
It can easily be seen that, for the 3x3 division, the lowest effectiveness was achieved when women formed the test group: the average ANN effectiveness was about 73%. In the male test group the effectiveness was 77%, and when both groups were considered together the ANN predicted on average 75% correctly.
The ANN was slightly more effective when the 4x4 spectrogram division was considered; the best results were achieved in this case. The ANN predictions are shown in Figures 8 to 10. The research was again conducted with and without gender division. Tables 4 to 6 show the resulting confusion matrices.
In this case, too, the worst results were achieved in the female test group, where the average ANN effectiveness was about 76%. In the male group it was over 80%, and when both groups were considered the average result was almost 79%. The results and confusion matrices for the 5x5 spectrogram division are shown in Figures 11 to 13 and Tables 7 to 9 respectively. Again, the experiments conducted on the female group gave the worst results, with an ANN effectiveness of about 74%. In the male group this value was about 77%, and in the research without gender division it was 76%. In all experiments the neutral state was the least reliably recognised emotional state; it was frequently confused with sadness and boredom. Considering the best results, the ANN effectiveness is about 80%, which is a satisfactory result.
The experimental results confirm that the time-frequency spectral representation is an efficient tool for identifying emotional states in Polish speech. The results achieved are satisfactory. Moreover, it was shown that ANNs are a good tool for spectrogram processing.

CONCLUSIONS
The research presented a novel approach to detecting human emotions in Polish emotional speech using the spectral representation of the voice. The important features were extracted from the spectrogram and a new ANN architecture was presented. The structure of the ANN reduces both the difficulty of the classification process and the number of features used as ANN inputs. A new method of feature extraction was presented. This approach shows that emotional states can be properly identified on the basis of the time-frequency representation of the speech signal.
Fig. 1. The overlapping method. The window function, consisting of six samples, cuts out parts of the analysed signal. In this example the overlap is 50% (N_o = 3 samples). The number of overlapping samples can range from N_o = 1 to N_o = N - 1.

Table 1. Confusion matrix for 3x3 spectrogram division without gender division

Table 2. Confusion matrix for 3x3 spectrogram division for female

Table 3. Confusion matrix for 3x3 spectrogram division for male

Table 4. Confusion matrix for 4x4 spectrogram division without gender division

Table 5. Confusion matrix for 4x4 spectrogram division for female

Table 6. Confusion matrix for 4x4 spectrogram division for male

Table 7. Confusion matrix for 5x5 spectrogram division without gender division

Table 8. Confusion matrix for 5x5 spectrogram division for female

Table 9. Confusion matrix for 5x5 spectrogram division for male

Fig. 13. ANN effectiveness for 5x5 spectrogram division for male