
TSTNet: A Speech Emotion Recognition Model Based on Spatial-temporal Features

Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume: 42
Issue: 2021, No. 06

Pages: 29-34

Section:

Publication date: 2021-11-10

Article Info

Title: Speech Emotion Recognition TSTNet Based on Spatial-temporal Features

Author(s): Xue Junxiao¹,²; Huang Shibo¹; Wang Yabo¹; Zhang Chaoyang³; Shi Lei¹,²

Affiliations: 1. School of Software, Zhengzhou University; 2. School of Network Space Security, Zhengzhou University; 3. School of Information Engineering, Zhengzhou University

Keywords: speech emotion recognition; spectrogram; spatial-temporal features

DOI: 10.13705/j.issn.1671-6833.2021.06.008

Document code: A

Abstract: Speech emotion recognition is the task of automatically identifying emotional states from human speech. In social settings, variation in tone, pitch, and speaking rate, together with the complexity of emotion itself, makes the task highly challenging, and padding utterances to a fixed length can lose information or introduce redundancy. This paper proposes TSTNet, a deep-learning speech emotion recognition model based on spatial-temporal features. The model consists of a spatial feature extraction module, a temporal feature extraction module, and a feature fusion module. Because utterances differ in length, each speech sample is first zero-padded to three lengths (400, 800, and 1500), producing three spectrograms at different scales. Each spectrogram is passed through a convolutional neural network (CNN), which captures local spatial features, and a bidirectional gated recurrent unit network (BiGRU), which captures temporal features and the semantic relationships between preceding and following context. This yields three spatial-temporal feature vectors, which are fused and fed to a fully connected layer for emotion classification. In numerical experiments on the IFLYTEK speech emotion dataset, the method reaches a recognition accuracy of 94.69% and outperforms traditional speech emotion recognition methods, such as those based on MFCC features and random forests, in accuracy, precision, recall, and F1 score.
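The multi-scale padding and fusion steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the three lengths count spectrogram frames along the time axis, that longer inputs are truncated, and that fusion is plain concatenation; the abstract states none of these. The function names (`pad_to_length`, `multi_scale`, `placeholder_features`, `fuse`) are hypothetical, and `placeholder_features` is only a stand-in for the CNN and BiGRU branches.

```python
from typing import Iterable, List, Sequence

Frame = List[float]

# The three padding lengths named in the abstract.
# Treating them as frame counts is an assumption.
PAD_LENGTHS = (400, 800, 1500)


def pad_to_length(frames: Sequence[Frame], target_len: int) -> List[Frame]:
    """Zero-pad a sequence of spectrogram frames to target_len frames.

    Inputs longer than target_len are truncated (an assumption).
    """
    if not frames:
        raise ValueError("need at least one frame")
    n_bins = len(frames[0])
    clipped = [list(f) for f in frames[:target_len]]
    padding = [[0.0] * n_bins for _ in range(target_len - len(clipped))]
    return clipped + padding


def multi_scale(frames: Sequence[Frame],
                lengths: Sequence[int] = PAD_LENGTHS) -> List[List[Frame]]:
    """Build one padded copy of the input per target length."""
    return [pad_to_length(frames, n) for n in lengths]


def placeholder_features(spec: Sequence[Frame]) -> List[float]:
    """Stand-in for a CNN + BiGRU branch: per-bin mean over time.

    This is NOT the network from the paper; it only produces a
    fixed-size vector so the fusion step can be shown.
    """
    n_frames = len(spec)
    return [sum(column) / n_frames for column in zip(*spec)]


def fuse(vectors: Iterable[List[float]]) -> List[float]:
    """Fuse feature vectors by concatenation (an assumption)."""
    return [x for v in vectors for x in v]


if __name__ == "__main__":
    # Toy input: 600 frames with 3 frequency bins each.
    toy = [[1.0, 2.0, 3.0] for _ in range(600)]
    scales = multi_scale(toy)
    fused = fuse(placeholder_features(s) for s in scales)
    print([len(s) for s in scales])
    print(len(fused))
```

With a 600-frame input, the 400-frame copy is truncated while the 800- and 1500-frame copies are zero-padded, which is one way to read the abstract's remark about information loss and redundancy. In a real model the fused vector would feed a fully connected classification layer.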

Last Update: 2021-12-17