[1]薛均晓,黄世博,王亚博,等.基于时空特征的语音情感识别模型TSTNet[J].郑州大学学报(工学版),2021,42(6):29-34.[doi:10.13705/j.issn.1671-6833.2021.06.008]
 Xue Junxiao,Huang Shibo,Wang Yabo,et al.Speech Emotion Recognition TSTNet Based on Spatial-temporal Features[J].Journal of Zhengzhou University (Engineering Science),2021,42(6):29-34.[doi:10.13705/j.issn.1671-6833.2021.06.008]
点击复制

基于时空特征的语音情感识别模型TSTNet()
分享到:

《郑州大学学报(工学版)》[ISSN:1671-6833/CN:41-1339/T]

卷:
42
期数:
2021年6期
页码:
29-34
栏目:
出版日期:
2021-11-10

文章信息/Info

Title:
Speech Emotion Recognition TSTNet Based on Spatial-temporal Features
作者:
薛均晓1,2,黄世博1,王亚博1,张朝阳3,石磊1,2
1.郑州大学 软件学院,河南 郑州 450002; 2.郑州大学 网络空间安全学院,河南 郑州 450002; 3.郑州大学 信息工程学院,河南 郑州 450001


Author(s):
Xue Junxiao1,2; Huang Shibo1; Wang Yabo1; Zhang Chaoyang3; Shi Lei1,2;
1.School of Software,Zhengzhou University,Zhengzhou 450002,China; 2.School of Cyberspace Security,Zhengzhou University,Zhengzhou 450002,China; 3.School of Information Engineering,Zhengzhou University,Zhengzhou 450002,China

关键词:
Keywords:
speech emotion recognition spectrogram spatial-temporal features
DOI:
10.13705/j.issn.1671-6833.2021.06.008
文献标志码:
A
摘要:
针对社交语音由于语气、音调、语速等差异以及填充信息丢失或冗余等问题,提出一种基于时空特征的语音情感识别方法。该方法利用卷积神经网络( CNN) 和双向循环神经网络( BiGRU) 技术,包含空间特征提取、时间特征提取和特征融合 3 个模块。考虑到音频数据内容长短不一,首先对音频数据进行预处理,应用 3 种补零填充方法,得到不同尺度的语谱图。设计了空间特征提取方法捕获音频的局部特征,并利用时间特征提取方法获取音频数据的时间特征和前后语义关系,从而得到 3 个时空特征向量。此外,融合了时空特征向量并通过全连接层进行语音情感分类。利用科大讯飞语音情感数据集进行了数值实验,实验结果与传统语音情感识别模型的实验结果相比,在准确率、精确率、召回率和 F1 值等 4 项指标上均取得了较好结果。

Abstract:
For differences in tone, pitch, speaking speed, etc. of social speech and information loss or redundancy during filling, a speech emotional recognition method was proposed based on spatial-temporal features. The method applied convolutional neural network (CNN) and bilateral recurrent neural network (BiGRU), including spatial feature extraction module, temporal feature extraction module and feature fusion module. Considering the different lengths of audio data content, the audio data was preprocessed first, and three zero-padded padding lengths were applied to obtain spectrograms of different scales. Then the spatial feature extraction module was designed to capture the local feature of the audio, and used the temporal feature extraction module to obtain the temporal feature and the semantic relationship of the audio data, thus obtained three spatial-temporal feature vectors. In addition, these temporal feature vectors were fused and input full connection layer for classification of speech emotion. With the numerical experiment using IFLYTEK speech emotion data sets, the experiment achieved better results in the accuracy, precision, recall, and F1 value than those of the experiment of traditional speech emotion recognition model.

参考文献/References:

[1] QAYYUM A B A,AREFEEN A,SHAHNAZ C.Convolutional neural network (CNN) based speech-emotion recognition[C]//2019 IEEE International Conference on Signal Processing,Information,Communication & Systems (SPICSCON).Piscataway:IEEE,2019:122-125.

[2] DAVIS S,MERMELSTEIN P.Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences[J].IEEE transactions on acoustics,speech,and signal processing,1980,28(4):357-366.
[3] HUANG Z W,DONG M,MAO Q R,et al.Speech emotion recognition using CNN[C]// Proceedings of the 22nd ACM International Conference on Multimedia.New York:ACM,2014:801-804.
[4] 王蔚,胡婷婷,冯亚琴.基于深度学习的自然与表演语音情感识别[J].南京大学学报(自然科学版),2019,55(4):660-666.
[5] 陈炜亮,孙晓.基于MFCCG-PCA的语音情感识别[J].北京大学学报(自然科学版),2015,51(2):269-274.
[6] SCHLOSBERG H.Three dimensions of emotion[J].Psychological review,1954,61(2):81-88.
[7] LIN Y L,WEI G.Speech emotion recognition based on HMM and SVM[C]//2005 International Conference on Machine Learning and Cybernetics.Piscataway:IEEE,2005:4898-4901.
[8] PAN Y, SHEN P, SHEN L. Speech emotion recognition using support vector machine[J]. International journal of smart home, 2012, 6(2): 101-108.
[9] MAO Q R,DONG M,HUANG Z W,et al.Learning salient features for speech emotion recognition using convolutional neural networks[J].IEEE transactions on multimedia,2014,16(8):2203-2213.
[10] TRIGEORGIS G,RINGEVAL F,BRUECKNER R,et al.Adieu features?End-to-end speech emotion recognition using a deep convolutional recurrent network[C]//2016 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).Piscataway:IEEE,2016:5200-5204.
[11] BADSHAH A M,AHMAD J,RAHIM N,et al.Speech emotion recognition from spectrograms with deep convolutional neural network[C]//2017 International Conference on Platform Technology and Service (PlatCon).Piscataway:IEEE,2017:1-5.
[12] TZIRAKIS P,ZHANG J H,SCHULLER B W.End-to-end speech emotion recognition using deep neural networks[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).Piscataway:IEEE,2018:5089-5093.
[13] ZHANG S Q,ZHANG S L,HUANG T J,et al.Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching[J].IEEE transactions on multimedia,2018,20(6):1576-1590.
[14] CHUNG J,GULCEHRE C,CHO K,et al.Empirical evaluation of gated recurrent neural networks on sequence modeling[EB/OL].(2014-12-11)[2021-03-10].https://arxiv.org/abs/1412.3555.
[15] CHOWDHURY A,ROSS A.Fusing MFCC and LPC features using 1D triplet CNN for speaker recognition in severely degraded audio signals[J].IEEE transactions on information forensics and security,2020,15:1616-1629.
[16] 赵淑芳,董小雨.基于改进的LSTM深度神经网络语音识别研究[J].郑州大学学报(工学版),2018,39(5):63-67.
[17] 李勇,金庆雨,张青川.融合位置注意力机制和改进BLSTM的食品评论情感分析[J].郑州大学学报(工学版),2020,41(1):58-62.
[18] BREIMAN L. Random forests[J]. Machine learning, 2001, 45: 5-36.
[19] 张雄,刘蓉,刘明.基于卷积特征提取与融合的语音情感识别研究[J].电子测量技术,2018,41(16):138-142.
[20] ZISAD S N,HOSSAIN M S,ANDERSSON K.Speech emotion recognition in neurological disorders using convolutional neural network[C]// International Conference on Brain Informatics.Cham:Springer,2020:287-296.
[21] 王金华,应娜,朱辰都,等.基于语谱图提取深度空间注意特征的语音情感识别算法[J].电信科学,2019,35(7):100-108.

更新日期/Last Update: 2021-12-17