Emotion Classification (Anger, Fear, Happy, Neutral and Sad) from Speech Audio
This thesis is about the detection of emotions from speech patterns. The main concept behind our project is Speech Emotion Recognition (SER), the detection of human emotions from speech. It rests on the observation that the human voice often reflects the emotion behind it. The main classes of emotion we worked on include happiness, sadness, fear, neutral, anger, boredom and disgust. One of the main objectives of our project was the collection of audio data, which we obtained from several existing databases. We then extracted features from the audio files. Every audio file has underlying features that distinguish it from other audio files, such as pitch, frequency, MFCCs and spectral rolloff. For the implementation, we worked with Chroma Short-Time Fourier Transform (Chroma STFT), RMSE, Spectral Rolloff, Spectral Centroid, Spectral Flux, Spectral Bandwidth, Zero Crossing Rate (ZCR) and 20 sets of MFCCs. The core problem we addressed was maximizing the accuracy of emotion detection. The accuracy (expressed as a percentage) obtained when an algorithm is given a single feature may be very different from the accuracy obtained when the same algorithm is given a combination of two features. We chose three datasets: RAVDESS, SAVEE and EMODB Berlin. For each dataset we extracted the features and combined them to create feature vectors. Once this was done, three machine learning models, CNN, MLP and SVM, were introduced. We applied these algorithms one by one to each dataset to find the accuracy values. The three datasets gave different accuracy values for each model. We found that the EMODB Berlin dataset worked best for us.
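The feature-vector step described above can be illustrated with a minimal sketch. It covers only two of the listed features, Zero Crossing Rate and RMS energy, computed per frame and averaged over the file. It is written in plain Python for illustration and is not the extraction code used in the thesis. The frame length, hop length, function names and the synthetic test tone are all assumptions.

```python
import math


def split_frames(signal, frame_length=1024, hop_length=512):
    """Split a signal into overlapping frames.

    The frame and hop sizes are illustrative assumptions,
    not values taken from the thesis.
    """
    if len(signal) < frame_length:
        return [signal]
    return [signal[i:i + frame_length]
            for i in range(0, len(signal) - frame_length + 1, hop_length)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def rms_energy(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def feature_vector(signal):
    """Average each frame-level feature over the file and concatenate."""
    frame_list = split_frames(signal)
    zcr = sum(zero_crossing_rate(f) for f in frame_list) / len(frame_list)
    rms = sum(rms_energy(f) for f in frame_list) / len(frame_list)
    return [zcr, rms]


# Synthetic 440 Hz tone standing in for a real recording (assumption).
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate)]
features = feature_vector(tone)
```

A combined feature vector in this sense is simply the concatenation of the per-file summaries of each chosen feature, so adding MFCCs or spectral features would extend the list returned by `feature_vector`.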
When SVM was applied to the EMODB Berlin dataset, it gave an accuracy of 82.00 %, which is higher than the accuracy reported in the research paper we used as a reference for this experiment. We also worked with deep learning and chose to apply CNN to one of the datasets. WEKA does not support CNN, so we implemented it in Python; the accuracy we obtained on EMODB Berlin was 72.00 %. Finally, MLP was applied to the same dataset, and it gave the highest accuracy, 85.35 %. The underlying aim was to process the extracted feature sets with machine learning: we applied different algorithms to detect the emotions and measured the accuracy of each. This gave us accuracy values for the three machine learning algorithms across the three datasets, which we analyzed and compared. Our goal was to find the best-fit algorithm for a specific feature set, and we consider the 82.00 % and 85.35 % obtained with SVM and MLP, respectively, an achievement.
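The accuracy values quoted above are percentages of correctly classified utterances. As a minimal sketch of that calculation, the following computes overall accuracy and a per-emotion breakdown from true and predicted labels. The function names and the toy label lists are illustrative assumptions. They are not data or results from the thesis.

```python
from collections import Counter


def accuracy_percent(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each emotion, the fraction of its utterances classified correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels for illustration only (assumption, not thesis data).
y_true = ["anger", "fear", "happy", "neutral", "sad", "anger", "happy", "sad"]
y_pred = ["anger", "fear", "happy", "neutral", "sad", "happy", "happy", "neutral"]

overall = accuracy_percent(y_true, y_pred)   # 75.0 for these toy labels
recall = per_class_recall(y_true, y_pred)    # e.g. recall["anger"] is 0.5
```

The per-emotion breakdown is useful alongside the overall figure because a single accuracy value can hide the fact that some emotions are recognized much less reliably than others.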