
Geological Named Entity Recognition Based on MacBERT and R-Drop

Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume:: 45
Issue:: 2024, No. 03

Pages:: 89-95

Publication date:: 2024-04-20

Article Info

Title:: Geological Named Entity Recognition Based on MacBERT and R-Drop

Article ID:: 1671-6833(2024)03-0089-07

Author(s):: LIU Xin¹; XU Hongzhen¹,²; LIU Aihua²; DENG Dejun¹; 1. School of Information Engineering, East China University of Technology, Nanchang 330013, China; 2. School of Software, East China University of Technology, Nanchang 330013, China

Keywords:: named entity recognition; geology; MacBERT; BiGRU; R-Drop

CLC number:: TP311

DOI:: 10.13705/j.issn.1671-6833.2024.03.002

Document code:: A

Abstract:: Deep learning methods based on the BERT pre-trained model, which are commonly used for geological named entity recognition, are character-based and do not exploit word-level information. In addition, the dropout mechanism in neural networks introduces an inconsistency between the training and inference stages. To address these issues, a geological named entity recognition model, MBCR, based on MacBERT and R-Drop is proposed. First, MacBERT learns text feature representations that make full use of both character and word information. Second, BiGRU encodes contextual features to extract complete semantic information. Finally, CRF captures the dependencies between labels and generates the optimal label sequence. In addition, R-Drop is introduced during training to further improve the model's generalization ability. The results show that, compared with BiLSTM-CRF, BERT-BiLSTM-CRF, and other models, the proposed MBCR model improves the F1-score by 2.08 to 4.62 percentage points on the NERdata dataset and by 1.26 to 17.54 percentage points on the Boson dataset.
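The R-Drop step mentioned in the abstract can be illustrated with a minimal sketch. It follows the general R-Drop formulation (two forward passes of the same input under different dropout masks, with a symmetric KL term added to the usual negative log-likelihood) for a single token scored by a plain softmax classifier. This is an assumption made for illustration: the abstract does not say how R-Drop is combined with the CRF layer in MBCR, and the function names, the weight `alpha`, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same label set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def r_drop_loss(logits_a, logits_b, gold_label, alpha=1.0):
    """R-Drop style loss for one token (illustrative, softmax head assumed).

    logits_a and logits_b stand for the outputs of two forward passes of the
    same input through the same network with different dropout masks.
    alpha is a hypothetical weight on the consistency term.
    """
    p = softmax(logits_a)
    q = softmax(logits_b)
    # Negative log-likelihood of the gold label under both passes.
    nll = -math.log(p[gold_label]) - math.log(q[gold_label])
    # Symmetric KL term that pulls the two output distributions together.
    sym_kl = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return nll + alpha * sym_kl


# Two dropout passes that disagree are penalized more than the plain NLL.
loss_with_kl = r_drop_loss([2.0, 0.5, -1.0], [1.0, 1.5, -0.5], gold_label=0)
loss_nll_only = r_drop_loss([2.0, 0.5, -1.0], [1.0, 1.5, -0.5], gold_label=0, alpha=0.0)
```

When the two passes produce identical logits, the KL term vanishes and the loss reduces to twice the ordinary negative log-likelihood, which is why the extra term only acts on the disagreement caused by dropout.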

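Memo

Memo:: Received: 2023-09-15; Revised: 2023-10-20
Funding: National Natural Science Foundation of China (62066003); Science and Technology Program of the Jiangxi Provincial Department of Education (GJJ160554); Talent Program of Fuzhou City, Jiangxi Province (2021ED008); Open Project of the Jiangxi Key Laboratory of Cyberspace Security Intelligent Perception (JKLCIP202202)
Corresponding author: XU Hongzhen (born 1976), male, from Fuzhou, Jiangxi; professor at East China University of Technology, PhD; research interests: machine learning, big data, and cloud computing. E-mail: xuhz@ecut.edu.cn.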

Last Update: 2024-04-29