ADVANCED TIME-FREQUENCY REPRESENTATION IN VOICE SIGNAL ANALYSIS
Dariusz Mika, Jerzy Józwik

The most commonly used time-frequency representation in voice signal analysis is the spectrogram. This representation belongs to Cohen's class, the class of time-frequency energy distributions. From the standpoint of resolution, the spectrogram is not optimal; Cohen's class contains representations with better resolution properties. All of them are obtained by smoothing the Wigner-Ville distribution, which has the best resolution but also the strongest harmful interference. The smoothing function used determines the compromise between resolution and suppression of the harmful interference terms. Another class of time-frequency energy distributions is the affine class. In terms of readability, the best properties are offered by distributions obtained by redistributing the energy of any time-frequency representation with a general methodology referred to as reassignment. Reassigned distributions efficiently combine a reduction of the interference terms, provided by a well-adapted smoothing kernel, with an increased concentration of the signal components.

INTRODUCTION
Time-frequency representations (TFR) are widely used in the analysis of non-stationary signals such as human speech signals, ECG signals and geophysical signals. The analyzed signal is represented as a joint function of time and frequency rather than as a function of time or frequency alone [2,3]. Such an analysis is an important tool for understanding many processes and phenomena in problems of estimation, detection and classification. A unified way of presenting different kinds of TFR was proposed by L. Cohen in the mid-1960s (in the context of quantum mechanics) and has since become known as Cohen's class [16,17]. There has been a rapid growth of interest in this subject. The diversity of theoretical and practical viewpoints from which it can be approached, and the numerous known results, would make a complete synthesis of this subject a voluminous document [13-15, 26-28]. Moreover, succinct [9, 13-15, 17, 33, 36, 39, 40] and detailed [18,22,23] publications already exist, and the reader is invited to refer to them.
Time-frequency (TF) analysis can be used for acoustic signal analysis, including speech signals [4]. TF analysis is used as a tool in various techniques such as speech coding, speech synthesis, speech recognition and speaker recognition. These techniques are covered by the common term speech processing. Speech processing mostly performs two fundamental operations: feature extraction [6] and classification [7]. In [13] the authors presented a survey of various feature extraction techniques in speech processing, such as Fast Fourier Transforms, Linear Predictive Coding, Mel Frequency Cepstral Coefficients, Discrete Wavelet Transforms and Wavelet Packet Transforms, and their applications in speech processing. Sparse time-frequency representations are very often used for effective analysis of speech signals [19,24,35]. In [35] a comparison between atomic decomposition methods and time-frequency distributions with respect to speech signals is presented. The authors demonstrated that the highest resolution of the analysis is achieved with Positive Time-Frequency Distributions. In [38] the authors compared speech signal analysis using the discrete Fourier transform (DFT) and the discrete cosine transform (DCT). They showed that spectrograms plotted using the DCT are clearer, and have a higher resolution, than spectrograms plotted using the DFT with the same number of points. In the context of speech signal recognition, artificial neural networks are also used. The authors of [34] compare various popular signal representations, such as the short-time Fourier transform (STFT) with linear and Mel scales, the constant-Q transform (CQT) and the continuous wavelet transform (CWT), and assess their impact on the classification performance on environmental sound datasets using convolutional neural networks. It was shown that Mel-STFT spectrograms were consistently good performers across the variations tested. Time-frequency analysis is also used in biomedical diagnostics and analysis.
In phoniatrics, time-frequency representations (primarily spectrograms and scalograms) are used for the diagnosis of diseases of the vocal organs [1,5,9,19,20,11,35,43,45]. The authors of [21] presented an overview of the methodology of automatic detection of pathological changes in the voice. In [37] the authors demonstrated the detection and discrimination of voice disorders using modulation frequency analysis. The authors of [19] applied TF representations from Cohen's class to the vocal fold onset signal in order to diagnose different phonation disorders caused by pathological changes. The vibration signals are acquired by direct optical inspection of the glottis using an endoscope and a high-speed CCD camera system. The signal was analyzed with TF representations from Cohen's class, using the cone-kernel distribution to ensure maximum smoothness over time. The authors show that even small pathological changes in the vocal folds are visible on the time-frequency plane, which allows sensitive detection of defects and supports diagnosis. In order to identify diseases and pathological changes in the voice, the authors of many works used the discrete wavelet transform (DWT) [1,41] together with a support vector machine-based classification method as the feature classification tool [1,5,6,11]. In [1,21,44] it was demonstrated that the most effective algorithm (100% recognition efficiency) is a system composed of wavelet packet transforms along with feature dimension reduction by linear discriminant analysis and a support vector machine-based classification method.

MATHEMATICAL BASICS OF ANALYSIS
The concept of affine time-frequency representation was first introduced in 1985 [8] and is based on the wavelet transform. In fact, the wavelet transform is directly connected to affine time-frequency representations by means of a smoothing operation in the time-frequency plane, which explains the great importance of affine time-frequency representations in signal analysis. The first construction of these distributions was based on group theory [6,7], a powerful tool in signal analysis, which did not, however, generate enthusiasm comparable to that caused by the study of Cohen's class distributions [16]. A second approach [28] relies on the affine smoothing of certain distributions of Cohen's class.
The spectrogram, often used in signal analysis, is the squared modulus of the short-time Fourier transform (STFT) of the signal x(t) and can be expressed as (1):

$$S_x(t,\nu) = \left| \int_{-\infty}^{+\infty} x(u)\, h^{*}(u-t)\, e^{-j2\pi\nu u}\, du \right|^{2} \tag{1}$$

where: t - time, u - integration variable in the time domain, ν - frequency, h(t) - time window, and '*' denotes complex conjugation.

The spectrogram belongs to the class of time-frequency energy distributions. The purpose of energy distributions is to distribute the energy of the signal over the two description variables: time and frequency. Among the desirable properties of a time-frequency energy distribution C_x, two are of particular importance: time and frequency covariance (2-3):

$$y(t) = x(t-t_0) \;\Rightarrow\; C_y(t,\nu) = C_x(t-t_0,\,\nu) \tag{2}$$

$$y(t) = x(t)\, e^{j2\pi\nu_0 t} \;\Rightarrow\; C_y(t,\nu) = C_x(t,\,\nu-\nu_0) \tag{3}$$

where: t - time, t_0 - shift in the time domain, ν - frequency, ν_0 - shift in the frequency domain.
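The spectrogram definition and the two covariance properties can be checked numerically. The sketch below is only an illustration in Python with NumPy (the analysis in this paper was done in Matlab); the function name `spectrogram`, the Hann window, the FFT length and the synthetic test signal are all assumptions made for the example.

```python
import numpy as np

def spectrogram(x, h, n_fft=256):
    """Return |STFT|^2 of x: one row per time instant, one column per frequency bin.

    h is an odd-length analysis window centred on each time instant (assumption).
    """
    x = np.asarray(x, dtype=complex)
    half = len(h) // 2
    lags = np.arange(-half, half + 1)
    centers = np.arange(half, len(x) - half)      # instants with a full window
    frames = x[centers[:, None] + lags[None, :]] * np.conj(h)
    return np.abs(np.fft.fft(frames, n_fft, axis=1)) ** 2

# Synthetic non-stationary signal: Gaussian-enveloped tone at normalised frequency 0.125.
n = np.arange(512)
x = np.exp(-0.5 * ((n - 200) / 30.0) ** 2) * np.exp(2j * np.pi * 0.125 * n)
h = np.hanning(63)
S_x = spectrogram(x, h)

# Time covariance: delaying the signal by t0 samples shifts the rows by t0.
t0 = 40
x_delayed = np.concatenate([np.zeros(t0), x[:-t0]])
S_delayed = spectrogram(x_delayed, h)

# Frequency covariance: modulating by k0 frequency bins shifts the columns by k0.
k0 = 16
x_modulated = x * np.exp(2j * np.pi * k0 / 256 * n)
S_modulated = spectrogram(x_modulated, h)
```

Comparing `S_delayed[t0:]` with `S_x[:-t0]`, and `S_modulated` with `np.roll(S_x, k0, axis=1)`, shows the translation of the distribution in time and in frequency respectively.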
These properties guarantee that, if the signal is delayed in time and modulated, its time-frequency distribution is translated by the same quantities in the time-frequency plane. This group of transformations is called the Weyl-Heisenberg group [3]. It has been shown that the class of time-frequency energy distributions satisfying these covariance properties has the following general expression (4):

$$C_x(t,\nu;\Pi) = \iint \Pi(s-t,\, f-\nu)\, W_x(s,f)\, ds\, df \tag{4}$$

where

$$W_x(t,\nu) = \int_{-\infty}^{+\infty} x(t+\tau/2)\, x^{*}(t-\tau/2)\, e^{-j2\pi\nu\tau}\, d\tau$$

is the Wigner-Ville distribution (WVD) of the signal x(t). In the case where Π is a smoothing function, this expression allows one to interpret C_x as a smoothed version of the WVD; consequently, such a distribution will attenuate in a particular way the interferences of the WVD.
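The interference that smoothing is meant to attenuate is easy to reproduce. The following is a minimal sketch of a discrete Wigner-Ville distribution, assuming an analytic (complex) input; the function name `wigner_ville`, the number of frequency bins and the two-tone test signal are illustrative choices, not the implementation used for the results in this paper. Because the lag variable advances in steps of two samples, bin k corresponds to the normalised frequency k / (2 * n_freq).

```python
import numpy as np

def wigner_ville(x, n_freq):
    """Discrete WVD of a complex signal; rows are frequency bins, columns are time samples."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    W = np.zeros((n_freq, n))
    for t in range(n):
        # Largest symmetric lag that stays inside the signal and the FFT length.
        m_max = min(t, n - 1 - t, n_freq // 2 - 1)
        m = np.arange(-m_max, m_max + 1)
        kernel = np.zeros(n_freq, dtype=complex)
        kernel[m % n_freq] = x[t + m] * np.conj(x[t - m])
        # The kernel is conjugate-symmetric in m, so its FFT is real.
        W[:, t] = np.fft.fft(kernel).real
    return W

# Two complex tones at normalised frequencies 0.1 and 0.3.
n = np.arange(200)
f1, f2 = 0.1, 0.3
x = np.exp(2j * np.pi * f1 * n) + np.exp(2j * np.pi * f2 * n)
n_freq = 100
W = wigner_ville(x, n_freq)

k1 = round(2 * f1 * n_freq)      # bin of the first tone
k2 = round(2 * f2 * n_freq)      # bin of the second tone
k_cross = (k1 + k2) // 2         # interference term midway between the tones
```

In this example the row `W[k_cross]` oscillates in time between large positive and negative values and averages out to almost zero, while the rows `W[k1]` and `W[k2]` stay positive. This is why smoothing along time suppresses the interference term but preserves the signal terms.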
This class is of significant importance since it includes a large number of the existing time-frequency energy distributions. The type of parameterization function ϕ used (or smoothing function Π) determines the type of signal representation. The most frequently used representations include: the pseudo Wigner-Ville distribution (PWVD), the smoothed pseudo Wigner-Ville distribution (SPWVD), the Rihaczek distribution, the Margenau-Hill distribution, the Choi-Williams distribution, the Born-Jordan distribution and the Zhao-Atlas-Marks distribution, also called the cone-shaped kernel distribution [32].
Cohen's class is based on the properties of covariance by shifts in time and in frequency. In order to favor a time-scale approach to the signal, one can instead require, among the desirable properties, covariance by translation in time and dilation [30]. The corresponding group of transformations, the counterpart of the Weyl-Heisenberg group, is the affine group. The resulting distributions can be expressed as (8):

$$\Omega_x(t,a;\Pi) = \iint \Pi\!\left(\frac{s-t}{a},\, a f\right) W_x(s,f)\, ds\, df \tag{8}$$

where a denotes the scale, which is inversely proportional to frequency. The set of such representations defines the affine class, which is the class of time-frequency energy distributions covariant by translation in time and dilation. As in the case of Cohen's class, the type of smoothing function Π determines the type of representation. The most popular are: the affine smoothed pseudo Wigner distribution (ASPWD), the Bertrand distribution, the D-Flandrin distribution and the Unterberger distribution (active and passive) [4,6,7,23,29,30,42].
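Two familiar special cases illustrate how the choice of smoothing function fixes the representation. Both identities are standard results, written here as smoothed versions of the Wigner-Ville distribution W_x of the signal. The symbols W_h (WVD of the analysis window h), ψ (analyzing wavelet), W_ψ (its WVD) and T_x (wavelet transform) are introduced only for this illustration.

```latex
% Spectrogram: member of Cohen's class, smoothing function = WVD of the window h
S_x(t,\nu) = \iint W_h(s-t,\; f-\nu)\, W_x(s,f)\, ds\, df

% Scalogram: member of the affine class, smoothing function = WVD of the wavelet \psi
\left| T_x(t,a;\psi) \right|^{2}
  = \iint W_\psi\!\left(\frac{s-t}{a},\; a f\right) W_x(s,f)\, ds\, df,
\qquad
T_x(t,a;\psi) = \frac{1}{\sqrt{|a|}} \int x(s)\, \psi^{*}\!\left(\frac{s-t}{a}\right) ds
```

In both cases the time and frequency (or scale) smoothing are tied together through a single function, h or ψ, so they cannot be adjusted independently.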
Bilinear time-frequency distributions offer a wide range of methods designed for the analysis of non-stationary signals. Nevertheless, a critical point of these methods is their readability, which means both a good concentration of the signal components and no misleading interference terms. Some efforts have been made recently in that direction, in particular a general methodology referred to as reassignment [34,32]. The original idea of reassignment was introduced in an attempt to improve the spectrogram. Indeed, like any other bilinear energy distribution, the spectrogram faces an unavoidable trade-off between the reduction of misleading interference terms and a sharp localization of signal components. The reassignment method assigns the averaged energy of the signal not to the geometrical center of the smoothing domain but to its center of gravity. Reasoning with a mechanical analogy, the local energy distribution Π(s − t, f − ν) W_x(s, f) (as a function of s and f) can be considered as a mass distribution, and it is much more accurate to assign the total mass (i.e. the spectrogram value) to the center of gravity of the domain rather than to its geometrical center. The reassignment principle can be used for any distribution belonging to Cohen's class or the affine class. Reassigned distributions efficiently combine a reduction of the interference terms, provided by a well-adapted smoothing kernel, with an increased concentration of the signal components, achieved by the reassignment.
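A minimal sketch of a reassigned spectrogram is given below, assuming a Gaussian analysis window so that the window derivative is available in closed form. The centers of gravity are estimated from two auxiliary STFTs, computed with the time-weighted window and with the derivative window. The function name, window length, `sigma`, FFT length and energy threshold are illustrative assumptions; this is not the Time-Frequency Toolbox implementation.

```python
import numpy as np

def reassigned_spectrogram(x, win_len=63, sigma=8.0, n_fft=256, rel_threshold=1e-8):
    """Return (S, R): spectrogram and reassigned spectrogram, shape (instants, bins)."""
    x = np.asarray(x, dtype=complex)
    half = win_len // 2
    lags = np.arange(-half, half + 1)
    h = np.exp(-0.5 * (lags / sigma) ** 2)     # Gaussian window h
    th = lags * h                              # time-weighted window t*h(t)
    dh = -lags / sigma ** 2 * h                # analytic derivative dh/dt

    centers = np.arange(half, len(x) - half)
    frames = x[centers[:, None] + lags[None, :]]
    F_h = np.fft.fft(frames * h, n_fft, axis=1)
    F_th = np.fft.fft(frames * th, n_fft, axis=1)
    F_dh = np.fft.fft(frames * dh, n_fft, axis=1)

    S = np.abs(F_h) ** 2
    mask = S > rel_threshold * S.max()         # skip points with negligible energy
    ratio_t = np.zeros_like(F_h)
    ratio_f = np.zeros_like(F_h)
    ratio_t[mask] = F_th[mask] / F_h[mask]
    ratio_f[mask] = F_dh[mask] / F_h[mask]

    # Centers of gravity in time (samples) and frequency (cycles per sample).
    freqs = np.arange(n_fft) / n_fft
    t_hat = centers[:, None] + ratio_t.real
    f_hat = freqs[None, :] - ratio_f.imag / (2 * np.pi)

    # Move each spectrogram value to the nearest grid point of its center of gravity.
    rows = np.clip(np.rint(t_hat).astype(int) - half, 0, len(centers) - 1)
    cols = np.rint(f_hat * n_fft).astype(int) % n_fft
    R = np.zeros_like(S)
    np.add.at(R, (rows[mask], cols[mask]), S[mask])
    return S, R

# Pure tone placed exactly on frequency bin 64 (normalised frequency 0.25).
n = np.arange(512)
x = np.exp(2j * np.pi * 0.25 * n)
S, R = reassigned_spectrogram(x)
```

For this tone the spectrogram spreads the energy over the main lobe of the window, whereas the reassigned version collects it in the single bin of the true frequency. The total energy is preserved up to the discarded low-energy points.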

METHODOLOGY AND RESULTS OF ANALYSIS
The time-frequency representations presented above were applied to speech signals with disease syndromes. The recordings come from a person suffering from larynx cancer and were made with a simple audio recorder at a sampling frequency of 10 kHz. TF analysis was performed in Matlab using the Time-Frequency Toolbox. For comparison, the results are presented for TFRs from both Cohen's class and the affine class, together with their reassigned versions. In Cohen's class, in addition to the spectrogram and its reassigned version, the PWVD and SPWVD and their reassigned versions (Fig. 2, 3, 4) are presented. In the affine class, the Morlet scalogram with reassignment and the ASPWD (Fig. 5) are presented. In order to reduce processing time, the analysis was performed on samples with the sampling rate reduced to 2 kHz, so the resulting TFR images are limited to 1 kHz (the frequency scale is a relative scale). The time scale corresponds to the number of samples. The duration of the analyzed signal was 5 seconds. In the preprocessing stage, the recorded speech signal was converted into an analytic signal using the Hilbert transform. The TFRs were computed using the Kaiser window.
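The preprocessing chain described above (rate reduction from 10 kHz to 2 kHz, conversion to an analytic signal, Kaiser window) can be sketched as follows. This is a NumPy-only illustration, not the Matlab code used in the study. The helper names, the FFT-based band-limiting, the window length, the Kaiser parameter `beta = 8.0` and the synthetic 300 Hz tone standing in for a recording are all assumptions.

```python
import numpy as np

def downsample_fft(x, factor):
    """Band-limit a real signal and reduce its sampling rate by an integer factor."""
    x = np.asarray(x, dtype=float)
    n_out = len(x) // factor
    X = np.fft.rfft(x[: n_out * factor])
    # Keep only the bins below the new Nyquist frequency, then rescale the amplitude.
    return np.fft.irfft(X[: n_out // 2 + 1], n_out) / factor

def analytic_signal(x):
    """Analytic signal via the FFT: zero the negative frequencies, double the positive ones."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1 : n // 2] = 2.0
    else:
        weights[1 : (n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights)

fs_in, fs_out = 10_000, 2_000                 # sampling rates stated in the text
t = np.arange(fs_in) / fs_in                  # 1 s synthetic stand-in for a recording
x = np.cos(2 * np.pi * 300 * t)               # hypothetical 300 Hz test tone

y = downsample_fft(x, fs_in // fs_out)        # 2 kHz rate, content up to 1 kHz
z = analytic_signal(y)                        # complex input for the TFRs
window = np.kaiser(63, 8.0)                   # Kaiser window; length and beta assumed
```

Working on the analytic signal removes the negative-frequency components, which avoids interference between positive and negative frequencies in Wigner-Ville based representations.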
Figure 1 shows the Wigner-Ville distribution of the analyzed speech signal, which is the basis for the TFRs used. Harmful interference is visible in the image. The TFRs are obtained by appropriate smoothing of the WV representation in the time and frequency domains.
The classical presentation of a sound signal, the spectrogram (Fig. 2a), is obtained by simultaneous time and frequency smoothing. This removes the harmful interference clearly visible in the WV representation (Fig. 1), but at the expense of reduced resolution and legibility of the resulting image.
The PWVD representation (Fig. 3a) is obtained by smoothing only in the frequency domain. The controlled degree of smoothing in both the time and frequency domains, characteristic of the SPWVD representation (Fig. 4a), results in a significant increase in the readability of the obtained images relative to PWVD, despite visible harmful interference.
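For reference, the usual textbook definitions of the two distributions make the difference in smoothing explicit. The window symbols h (lag window, which smooths in frequency) and g (window that smooths in time) are introduced here as assumptions for the illustration.

```latex
% Pseudo Wigner-Ville distribution: the lag window h smooths along frequency only
PW_x(t,\nu) = \int h(\tau)\, x(t+\tau/2)\, x^{*}(t-\tau/2)\, e^{-j2\pi\nu\tau}\, d\tau

% Smoothed pseudo Wigner-Ville distribution: g adds an independent smoothing along time
SPW_x(t,\nu) = \int h(\tau) \left[ \int g(s-t)\, x(s+\tau/2)\, x^{*}(s-\tau/2)\, ds \right]
               e^{-j2\pi\nu\tau}\, d\tau
```

Because g and h are chosen separately, the amount of smoothing in time and in frequency can be controlled independently, unlike in the spectrogram.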
Similarly, in the affine class, the simultaneous time and scale smoothing inherent in the wavelet transform (Morlet scalogram, Fig. 5a) results in low readability of the resulting image despite the removal of harmful interference. Controlled smoothing performed separately in time and scale, as in the ASPWD (Fig. 5b), significantly improves the readability of the resulting image.
The use of the reassigned versions of the TFRs significantly increases the resolution of the images obtained, thereby increasing their readability. The lesions, which appear slightly blurred in the spectrogram (Fig. 2a), SPWVD and ASPWD images, are clearly distinguishable in the reassigned images (Fig. 2b, 3b, 4b and 5c), which makes them easy to detect.
In the context of medical diagnostics, such an improvement of readability of the received TFR images can contribute to improved detection and evaluation of existing speech disorders.The complexity of the numerical operations occurring with such a representation results, however, in a significant lengthening of the analysis time, which excludes on-line processing mode applications.

CONCLUSIONS
Representations belonging to the class of time-frequency energy distributions are well known theoretically. However, their use in science and technology is mainly limited to the simple spectrogram, which offers no control over the degree of smoothing of the signal. The resulting TFR image is therefore blurry and, for some applications (especially where the nuances in the TFR play an important role), may not be sufficient. The use of other representations belonging to this class, especially in conjunction with their reassigned versions, allows a very accurate analysis of the signal in the time-frequency plane. The use of such TFRs, among others in medical or technical diagnostics, can contribute to more effective detection of disease syndromes and malfunctions. However, due to the complexity of the numerical operations involved in computing these representations (especially their reassigned versions), there is a need to develop optimized computational algorithms that will shorten the analysis time and allow online application.

Fig. 1. Wigner-Ville distribution of the analyzed speech signal along with harmful interference

Fig. 2. The classical presentation of a sound signal: a) spectrogram, b) its reassigned version. Arrows indicate characteristic syndromes