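[1] 薛均晓, 黄世博, 王亚博, et al. TSTNet: Speech Emotion Recognition Based on Spatio-Temporal Features [J]. Journal of Zhengzhou University (Engineering Science), 2021, 42(06): 29-34.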

TSTNet: A Speech Emotion Recognition Model Based on Spatio-Temporal Features

Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume:
42
Issue:
2021, No. 06
Pages:
29-34
Column:
Publication date:
2021-11-10

Article Info

Title:
TSTNet: Speech Emotion Recognition Based on Spatio-Temporal Features
Authors:
薛均晓, 黄世博, 王亚博, 张朝阳, 石磊
Document code:
A
Abstract:
Speech emotion recognition is the technology of automatically identifying emotional states from human speech. In social settings, the tone, pitch, and speaking rate of the speakers, together with the complexity of emotion itself, make speech emotion recognition a highly challenging task. This paper proposes TSTNet, a speech emotion recognition model based on spatio-temporal features. The model is built on deep learning and consists of a spatial feature extraction module, a temporal feature extraction module, and a feature fusion module. Each speech signal sample is padded to three lengths (400, 800, and 1500) to generate three spectrograms of different scales. These are then fed into a convolutional neural network (CNN) and a bidirectional recurrent neural network (BiGRU) to extract the spatial features, temporal features, and contextual semantic relationships of the speech, yielding three spatio-temporal feature vectors. Finally, these vectors are fused and passed to a fully connected layer for speech emotion classification. Numerical experiments on the iFLYTEK speech data set show that the proposed method achieves a recognition accuracy of 94.69%. Compared with speech emotion recognition methods such as those based on MFCC and Random Forest, the proposed method performs well on accuracy, precision, recall, and F1 score.
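The multi-scale input step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the unit of the three lengths, how samples longer than a target length are handled, or how the feature vectors are fused. The sketch assumes the lengths count frames, that short samples are right-padded with zeros, that long samples are truncated, and that fusion is plain concatenation. The names `pad_or_truncate`, `multi_scale_inputs`, and `fuse` are hypothetical.

```python
from typing import List, Sequence

# The three lengths named in the abstract (unit assumed to be frames).
TARGET_LENGTHS = (400, 800, 1500)


def pad_or_truncate(frames: Sequence[float], target_len: int,
                    pad_value: float = 0.0) -> List[float]:
    """Return a copy of `frames` with exactly `target_len` items.

    Assumption: shorter inputs are right-padded with `pad_value` and longer
    inputs are cut at `target_len`; the abstract specifies neither choice.
    """
    clipped = list(frames[:target_len])
    return clipped + [pad_value] * (target_len - len(clipped))


def multi_scale_inputs(frames: Sequence[float],
                       lengths: Sequence[int] = TARGET_LENGTHS) -> List[List[float]]:
    """Build one fixed-length copy of the sample per target length."""
    return [pad_or_truncate(frames, n) for n in lengths]


def fuse(feature_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Fuse per-scale feature vectors by concatenation (assumed fusion rule)."""
    fused: List[float] = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


if __name__ == "__main__":
    sample = [0.1] * 600  # hypothetical 600-frame utterance
    scales = multi_scale_inputs(sample)
    print([len(s) for s in scales])  # [400, 800, 1500]
```

In a full model, each of the three fixed-length copies would first be converted to a spectrogram and passed through the CNN and BiGRU branches; `fuse` would then combine the three resulting feature vectors before the fully connected classification layer.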
Last Update: 2021-12-17