Speech Emotion Recognition TSTNet Based on Spatial-temporal Features



[1] Xue Junxiao, Huang Shibo, Wang Yabo, et al. Speech Emotion Recognition TSTNet Based on Spatial-temporal Features[J]. Journal of Zhengzhou University (Engineering Science), 2021, 42(06): 29-34. [doi:10.13705/j.issn.1671-6833.2021.06.008]


Journal of Zhengzhou University (Engineering Science) [ISSN 1671-6833/CN 41-1339/T] Volume: 42 Issue: 2021 06 Pages: 29-34 Publication date: 2021-11-10

Title: Speech Emotion Recognition TSTNet Based on Spatial-temporal Features

Author(s): Xue Junxiao; Huang Shibo; Wang Yabo; Zhang Chaoyang; Shi Lei (School of Software, Zhengzhou University; School of Network Space Security, Zhengzhou University; School of Information Engineering, Zhengzhou University)

Keywords: speech emotion recognition; spectrogram; spatial-temporal features

CLC: -

DOI: 10.13705/j.issn.1671-6833.2021.06.008

Abstract: To address differences in tone, pitch, and speaking rate in social speech, as well as the information loss or redundancy introduced by padding, a speech emotion recognition method based on spatial-temporal features is proposed. The method combines a convolutional neural network (CNN) with a bidirectional gated recurrent unit (BiGRU) and comprises a spatial feature extraction module, a temporal feature extraction module, and a feature fusion module. Because audio clips differ in length, the audio data is first preprocessed with three zero-padding lengths to obtain spectrograms at three scales. The spatial feature extraction module then captures local features of the audio, and the temporal feature extraction module captures temporal features and semantic relationships in the audio data, yielding three spatial-temporal feature vectors. Finally, these feature vectors are fused and fed into a fully connected layer to classify the speech emotion. In experiments on the IFLYTEK speech emotion dataset, the method outperformed traditional speech emotion recognition models in accuracy, precision, recall, and F1 score.
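The multi-scale preprocessing step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three target lengths, the function names, and the choice to truncate clips longer than the target are all assumptions made for illustration, and the sketch stops at the waveform level rather than computing spectrograms. It shows why a single fixed length is a compromise: a short target discards samples, while a long target fills most of the input with zeros.

```python
from typing import Dict, List, Sequence

# Hypothetical target lengths (in samples); the abstract does not state the real values.
TARGET_LENGTHS = (8, 16, 32)


def pad_or_truncate(signal: Sequence[float], target_len: int) -> List[float]:
    """Return exactly target_len samples: cut the tail or append zeros."""
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    clipped = list(signal[:target_len])  # information loss when the clip is too long
    return clipped + [0.0] * (target_len - len(clipped))  # redundancy when too short


def multi_scale_inputs(
    signal: Sequence[float], lengths: Sequence[int] = TARGET_LENGTHS
) -> Dict[int, List[float]]:
    """Build one fixed-length view of the clip per target length."""
    return {n: pad_or_truncate(signal, n) for n in lengths}


def padding_ratio(signal: Sequence[float], target_len: int) -> float:
    """Fraction of the fixed-length view that consists of zero padding."""
    return max(target_len - len(signal), 0) / target_len


if __name__ == "__main__":
    clip = [0.1] * 12  # toy 12-sample clip
    for n, view in multi_scale_inputs(clip).items():
        print(n, len(view), padding_ratio(clip, n))
```

In a pipeline like the one described, each of the three views would presumably be converted to a spectrogram and passed through the CNN and BiGRU modules before fusion; those stages are outside the scope of this sketch.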



Last Update: 2021-12-17