[1] GRAVES A, FERNÁNDEZ S, GOMEZ F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd International Conference on Machine Learning. New York: ACM, 2006: 369-376.
[2] GRAVES A, MOHAMED A R, HINTON G. Speech recognition with deep recurrent neural networks[C]//2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2013: 6645-6649.
[3] CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition[C]//2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2016: 4960-4964.
[4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
[5] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2019-05-24)[2023-03-10]. https://arxiv.org/abs/1810.04805.
[6] GALES M J F, KNILL K M, RAGNI A, et al. Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED[C]//The 4th International Workshop on Spoken Language Technologies for Under-Resourced Languages. St. Petersburg: RFBR, 2014: 16-23.
[7] 赵淑芳, 董小雨. 基于改进的LSTM深度神经网络语音识别研究[J]. 郑州大学学报(工学版), 2018, 39(5): 63-67.
ZHAO S F, DONG X Y. Research on speech recognition based on improved LSTM deep neural network[J]. Journal of Zhengzhou University (Engineering Science), 2018, 39(5): 63-67.
[8] THOMAS S, GANAPATHY S, HERMANSKY H. Multilingual MLP features for low-resource LVCSR systems[C]//2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2012: 4269-4272.
[9] POVEY D, BURGET L, AGARWAL M, et al. The subspace Gaussian mixture model: a structured model for speech recognition[J]. Computer Speech & Language, 2011, 25(2): 404-439.
[10] IMSENG D, BOURLARD H, GARNER P N. Using KL-divergence and multilingual information to improve ASR for under-resourced languages[C]//2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2012: 4869-4872.
[11] MOHAMED A R, DAHL G E, HINTON G. Acoustic modeling using deep belief networks[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(1): 14-22.
[12] POVEY D, CHENG G F, WANG Y M, et al. Semi-orthogonal low-rank matrix factorization for deep neural networks[C]//Interspeech 2018. Hyderabad: ISCA, 2018: 3743-3747.
[13] 薛均晓, 黄世博, 王亚博, 等. 基于时空特征的语音情感识别模型TSTNet[J]. 郑州大学学报(工学版), 2021, 42(6): 28-33.
XUE J X, HUANG S B, WANG Y B, et al. Speech emotion recognition TSTNet based on spatial-temporal features[J]. Journal of Zhengzhou University (Engineering Science), 2021, 42(6): 28-33.
[14] POVEY D, PEDDINTI V, GALVEZ D, et al. Purely sequence-trained neural networks for ASR based on lattice-free MMI[C]//Interspeech 2016. San Francisco: ISCA, 2016: 2751-2755.
[15] JAITLY N, HINTON G E. Vocal tract length perturbation (VTLP) improves speech recognition[C]//Proceedings of the Workshop on Deep Learning for Audio, Speech and Language. Atlanta: ICML, 2013: 1-5.
[16] KO T, PEDDINTI V, POVEY D, et al. Audio augmentation for speech recognition[C]//Interspeech 2015. Dresden: ISCA, 2015: 3586-3589.
[17] PARK D S, CHAN W, ZHANG Y, et al. SpecAugment: a simple data augmentation method for automatic speech recognition[EB/OL]. (2019-04-18)[2023-03-10]. https://arxiv.org/abs/1904.08779.
[18] KHARITONOV E, RIVIÈRE M, SYNNAEVE G, et al. Data augmenting contrastive learning of speech representations in the time domain[C]//2021 IEEE Spoken Language Technology Workshop (SLT). Piscataway: IEEE, 2021: 215-222.
[19] XIE Q Z, LUONG M T, HOVY E, et al. Self-training with noisy student improves ImageNet classification[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2020: 10684-10695.
[20] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[EB/OL]. (2014-06-10)[2023-03-10]. https://arxiv.org/abs/1406.2661.
[21] 王坤峰, 苟超, 段艳杰, 等. 生成式对抗网络GAN的研究进展与展望[J]. 自动化学报, 2017, 43(3): 321-332.
WANG K F, GOU C, DUAN Y J, et al. Generative adversarial networks: the state of the art and beyond[J]. Acta Automatica Sinica, 2017, 43(3): 321-332.
[22] QIAN Y M, HU H, TAN T. Data augmentation using generative adversarial networks for robust speech recognition[J]. Speech Communication, 2019, 114: 1-9.
[23] SUN S N, YEH C F, OSTENDORF M, et al. Training augmentation with adversarial examples for robust speech recognition[EB/OL]. (2018-06-07)[2023-03-10]. https://arxiv.org/abs/1806.02782.
[24] SHINOHARA Y. Adversarial multi-task learning of deep neural networks for robust speech recognition[C]//Interspeech 2016. San Francisco: ISCA, 2016: 2369-2372.
[25] LIU B, NIE S, ZHANG Y P, et al. Boosting noise robustness of acoustic model via deep adversarial training[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2018: 5034-5038.
[26] LI C Y, VU N T. Improving speech recognition on noisy speech via speech enhancement with multi-discriminators CycleGAN[C]//2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Piscataway: IEEE, 2022: 830-836.
[27] 屈丹, 张文林, 杨绪魁. 实用深度学习基础[M]. 北京: 清华大学出版社, 2022.
QU D, ZHANG W L, YANG X K. Practical deep learning foundation[M]. Beijing: Tsinghua University Press, 2022.
[28] CHUNG Y A, HSU W N, TANG H, et al. An unsupervised autoregressive model for speech representation learning[C]//Interspeech 2019. Graz: ISCA, 2019: 146-150.
[29] CHUNG Y A, TANG H, GLASS J. Vector-quantized autoregressive predictive coding[C]//Interspeech 2020. Shanghai: ISCA, 2020: 3760-3764.
[30] LIU A T, LI S W, LEE H Y. TERA: self-supervised learning of transformer encoder representation for speech[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 2351-2366.
[31] HSU W N, BOLTE B, TSAI Y H H, et al. HuBERT: self-supervised speech representation learning by masked prediction of hidden units[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3451-3460.
[32] GUTMANN M, HYVÄRINEN A. Noise-contrastive estimation: a new estimation principle for unnormalized statistical models[J]. Journal of Machine Learning Research, 2010, 9: 297-304.
[33] OORD A V D, LI Y Z, VINYALS O. Representation learning with contrastive predictive coding[EB/OL]. (2019-01-22)[2023-03-10]. https://arxiv.org/abs/1807.03748.
[34] SCHNEIDER S, BAEVSKI A, COLLOBERT R, et al. Wav2vec: unsupervised pre-training for speech recognition[C]//Interspeech 2019. Graz: ISCA, 2019: 3465-3469.
[35] TJANDRA A, SAKTI S, NAKAMURA S. Sequence-to-sequence ASR optimization via reinforcement learning[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2018: 5829-5833.
[36] TJANDRA A, SAKTI S, NAKAMURA S. End-to-end speech recognition sequence training with reinforcement learning[J]. IEEE Access, 2019, 7: 79758-79769.
[37] LUO Y P, CHIU C C, JAITLY N, et al. Learning online alignments with continuous rewards policy gradient[C]//2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2017: 2801-2805.
[38] VARIANI E, RYBACH D, ALLAUZEN C, et al. Hybrid autoregressive transducer (HAT)[C]//2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2020: 6139-6143.
[39] KALA T K, SHINOZAKI T. Reinforcement learning of speech recognition system based on policy gradient and hypothesis selection[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2018: 5759-5763.
[40] CHUNG H, JEON H B, PARK J G. Semi-supervised training for sequence-to-sequence speech recognition using reinforcement learning[C]//2020 International Joint Conference on Neural Networks (IJCNN). Piscataway: IEEE, 2020: 1-6.
[41] RADZIKOWSKI K, NOWAK R, WANG L, et al. Dual supervised learning for non-native speech recognition[J]. EURASIP Journal on Audio, Speech, and Music Processing, 2019, 2019(1): 1-10.
[42] 王璐, 潘文林. 基于元学习的语音识别探究[J]. 云南民族大学学报(自然科学版), 2019, 28(5): 510-516.
WANG L, PAN W L. Speech recognition based on meta-learning[J]. Journal of Yunnan Minzu University (Natural Sciences Edition), 2019, 28(5): 510-516.
[43] 侯俊龙, 潘文林. 基于元度量学习的低资源语音识别[J]. 云南民族大学学报(自然科学版), 2021, 30(3): 272-278.
HOU J L, PAN W L. Low-resource speech recognition based on meta-metric learning[J]. Journal of Yunnan Minzu University (Natural Sciences Edition), 2021, 30(3): 272-278.
[44] KLEJCH O, FAINBERG J, BELL P. Learning to adapt: a meta-learning approach for speaker adaptation[C]//Interspeech 2018. Hyderabad: ISCA, 2018: 867-871.
[45] HSU J Y, CHEN Y J, LEE H Y. Meta learning for end-to-end low-resource speech recognition[C]//2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2020: 7844-7848.
[46] XIAO Y B, GONG K, ZHOU P, et al. Adversarial meta sampling for multilingual low-resource speech recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(16): 14112-14120.
[47] WINATA G I, CAHYAWIJAYA S, LIU Z H, et al. Learning fast adaptation on cross-accented speech recognition[C]//Interspeech 2020. Shanghai: ISCA, 2020: 1276-1280.
[48] WINATA G I, CAHYAWIJAYA S, LIN Z J, et al. Meta-transfer learning for code-switched speech recognition[EB/OL]. (2020-03-04)[2023-03-10]. https://arxiv.org/abs/2003.01901.