[1]刘建平,初新涛,王 健,等.面向中文科学数据集的句子级语义匹配模型[J].郑州大学学报(工学版),2024,45(06):56-64.[doi:10.13705/j.issn.1671-6833.2024.03.008]
 LIU Jianping,CHU Xintao,WANG Jian,et al.Semantic Matching Model for Chinese Scientific Datasets[J].Journal of Zhengzhou University (Engineering Science),2024,45(06):56-64.[doi:10.13705/j.issn.1671-6833.2024.03.008]

面向中文科学数据集的句子级语义匹配模型

《郑州大学学报(工学版)》[ISSN:1671-6833/CN:41-1339/T]

卷: 45
期数: 2024年06期
页码: 56-64
出版日期: 2024-09-25

文章信息/Info

Title:
Semantic Matching Model for Chinese Scientific Datasets
文章编号:
1671-6833(2024)06-0056-09
作者:
刘建平1,2 初新涛1 王 健3 顾勋勋1 王 萌1 王影菲1
1. 北方民族大学 计算机科学与工程学院,宁夏 银川 750021;2. 北方民族大学 图像图形智能处理国家民委重点实验室,宁夏 银川 750021;3. 中国农业科学院 农业信息研究所,北京 100081
Author(s):
LIU Jianping1,2 CHU Xintao1 WANG Jian3 GU Xunxun1 WANG Meng1 WANG Yingfei1
1. College of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China; 2. The Key Laboratory of Images and Graphics Intelligent Processing of State Ethnic Affairs Commission, North Minzu University, Yinchuan 750021, China; 3. Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
关键词:
文本匹配; 语义匹配; 预训练模型; 科学数据集; 自然语言处理
Keywords:
text matching; semantic matching; pre-training model; scientific datasets; natural language processing
分类号:
TP3-05; TP391.1
DOI:
10.13705/j.issn.1671-6833.2024.03.008
文献标志码:
A
摘要:
针对现有以词为粒度的语义匹配模型难以理解句子级科学数据集元数据的问题,提出了一个面向中文科学数据集的句子级语义匹配(CSDSM)模型。该模型使用 CSL 数据集对 SimCSE 和 CoSENT 进行训练,生成 CoSENT 预训练模型。基于 CoSENT 模型,引入多头自注意力机制进行特征提取,通过余弦相似度与 KNN 分类结果加权求和得到最终输出。以国家地球系统科学数据中心开放的语义元数据信息作为自建科学数据集进行实验,实验结果表明:与中文 BERT 模型相比,所提模型在公共数据集 AFQMC、LCQMC、Chinese-STS-B 和 PAWS-X 上的 Spearman 指标 ρ 分别提升了 0.044 8,0.029 0,0.177 7 和 0.050 9;在自建科学数据集上的 F1 和 Acc 分别提升了 0.078 8 和 0.063 4。所提模型能够有效地解决科学数据集句子级语义匹配问题。
Abstract:
In order to address the difficulty of existing word-level semantic matching models in understanding sentence-level scientific dataset metadata, a sentence-level semantic matching (CSDSM) model for Chinese scientific datasets was proposed. The model used the CSL dataset to train and generate the CoSENT pre-training model based on SimCSE and CoSENT. Building upon the CoSENT model, a multi-head self-attention mechanism was introduced for feature extraction, and the final output was obtained by weighting the cosine similarity and KNN classification results. Experimental data from the National Earth System Science Data Center's open semantic metadata information was used as a self-built scientific dataset. The experimental results showed that compared to the Chinese BERT model, the proposed model improved the Spearman's ρ index by 0.044 8, 0.029 0, 0.177 7 and 0.050 9 on the public datasets AFQMC, LCQMC, Chinese-STS-B, and PAWS-X, respectively. Additionally, F1 and Acc on the self-built scientific dataset were improved by 0.078 8 and 0.063 4, respectively. The proposed model effectively addresses the problem of sentence-level semantic matching in scientific datasets.
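
To make the fusion step described in the abstract concrete, the sketch below shows one way a final matching score could combine embedding cosine similarity with a KNN classifier's match probability via a weighted sum. It is a minimal illustration under stated assumptions only: random vectors stand in for the CoSENT encoder outputs, and the pair-feature construction, the weight alpha, and the helper names (pair_features, csdsm_score) are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cosine_similarity(u, v):
    """Cosine similarity between two sentence embeddings."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def pair_features(emb_a, emb_b):
    """Pair feature for the KNN classifier: |a - b| and a * b (an assumed, common choice)."""
    return np.concatenate([np.abs(emb_a - emb_b), emb_a * emb_b])

def csdsm_score(emb_a, emb_b, knn, alpha=0.5):
    """Weighted sum of cosine similarity and KNN match probability (assumed fusion form)."""
    cos = cosine_similarity(emb_a, emb_b)
    p_match = knn.predict_proba(pair_features(emb_a, emb_b).reshape(1, -1))[0, 1]
    return alpha * cos + (1.0 - alpha) * p_match

# Usage sketch: fit the KNN on labelled sentence pairs, then score new pairs.
# Random embeddings stand in for encoder outputs here.
rng = np.random.default_rng(0)
train_a = rng.normal(size=(100, 768))
train_b = rng.normal(size=(100, 768))
labels = rng.integers(0, 2, size=100)          # 1 = match, 0 = non-match
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit([pair_features(a, b) for a, b in zip(train_a, train_b)], labels)
print(csdsm_score(train_a[0], train_b[0], knn))
```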

参考文献/References:

[1] 罗鹏程, 王继民, 王世奇, 等. 基于深度学习的科学数据集检索方法研究[J]. 情报理论与实践, 2022, 45(7): 49-56.
LUO P C, WANG J M, WANG S Q, et al. Research on deep learning based scientific dataset retrieval method[J]. Information Studies: Theory & Application, 2022, 45(7): 49-56.
[2] CHEN S H, XU T J. Long text QA matching model based on BiGRU-DAttention-DSSM[J]. Mathematics, 2021, 9(10): 1129.
[3] 冯皓楠, 何智勇, 马良荔. 基于图文注意力融合的主题标签推荐[J]. 郑州大学学报(工学版), 2022, 43(6): 30-35.
FENG H N, HE Z Y, MA L L. Multimodal hashtag recommendation based on image and text attention fusion[J]. Journal of Zhengzhou University (Engineering Science), 2022, 43(6): 30-35.
[4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
[5] LI Y D, ZHANG Y Q, ZHAO Z, et al. CSL: a large-scale Chinese scientific literature dataset[EB/OL]. (2022-09-12)[2023-06-11]. https://arxiv.org/abs/2209.05034.
[6] GAO T Y, YAO X C, CHEN D Q. SimCSE: simple contrastive learning of sentence embeddings[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2021: 6894-6910.
[7] LI S H, GONG B. Word embedding and text classification based on deep learning methods[J]. MATEC Web of Conferences, 2021, 336: 06022.
[8] LIU J P, CHU X T, WANG Y F, et al. Deep text retrieval models based on DNN, CNN, RNN and Transformer: a review[C]//2022 IEEE 8th International Conference on Cloud Computing and Intelligent Systems (CCIS). Piscataway: IEEE, 2022: 391-400.
[9] HUANG P S, HE X D, GAO J F, et al. Learning deep structured semantic models for web search using clickthrough data[C]//Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. New York: ACM, 2013: 2333-2338.
[10] SHEN Y L, HE X D, GAO J F, et al. A latent semantic model with convolutional-pooling structure for information retrieval[C]//Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. New York: ACM, 2014: 101-110.
[11] MOHAN S, FIORINI N, KIM S, et al. A fast deep learning model for textual relevance in biomedical information retrieval[C]//Proceedings of the 2018 World Wide Web Conference. New York: ACM, 2018: 77-86.
[12] KHURANA D, KOLI A, KHATTER K, et al. Natural language processing: state of the art, current trends and challenges[J]. Multimedia Tools and Applications, 2023, 82(3): 3713-3744.
[13] MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[C]//Proceedings of the 26th International Conference on Neural Information Processing Systems: Volume 2. New York: ACM, 2013: 3111-3119.
[14] 汪烨, 周思源, 翁知远, 等. 一种面向用户反馈的智能分析与服务设计方法[J]. 郑州大学学报(工学版), 2023, 44(3): 56-61.
WANG Y, ZHOU S Y, WENG Z Y, et al. An intelligent analysis and service design method for user feedback[J]. Journal of Zhengzhou University (Engineering Science), 2023, 44(3): 56-61.
[15] LIU P F, YUAN W Z, FU J L, et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing[J]. ACM Computing Surveys, 2021, 55(9): 195.
[16] CHOUDHARY S, GUTTIKONDA H, CHOWDHURY D R, et al. Document retrieval using deep learning[C]//2020 Systems and Information Engineering Design Symposium (SIEDS). Piscataway: IEEE, 2020: 1-6.
[17] ESTEVA A, KALE A, PAULUS R, et al. COVID-19 information retrieval with deep-learning based semantic search, question answering, and abstractive summarization[J]. NPJ Digital Medicine, 2021, 4: 68.
[18] BELTAGY I, LO K, COHAN A. SciBERT: a pretrained language model for scientific text[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Stroudsburg: Association for Computational Linguistics, 2019: 3615-3620.
[19] CHOWDHURY A, ROSENTHAL J, WARING J, et al. Applying self-supervised learning to medicine: review of the state of the art and medical implementations[J]. Informatics, 2021, 8(3): 59.
[20] REIMERS N, GUREVYCH I. Sentence-BERT: sentence embeddings using siamese BERT-networks[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Stroudsburg: Association for Computational Linguistics, 2019: 3982-3992.
[21] LI L Y, SONG D M, MA R T, et al. KNN-BERT: fine-tuning pre-trained models with KNN classifier[EB/OL]. (2021-10-06)[2023-06-11]. https://arxiv.org/abs/2110.02523.
[22] PALANIVINAYAGAM A, EL-BAYEH C Z, DAMAŠEVIČIUS R. Twenty years of machine-learning-based text classification: a systematic review[J]. Algorithms, 2023, 16(5): 236.
[23] CHICCO D. Siamese neural networks: an overview[J]. Methods in Molecular Biology, 2021, 2190: 73-94.
[24] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2019-05-24)[2023-06-11]. https://arxiv.org/abs/1810.04805.
[25] CUI Y M, CHE W X, LIU T, et al. Pre-training with whole word masking for Chinese BERT[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3504-3514.
[26] CUI Y M, YANG Z Q, LIU T. PERT: pre-training BERT with permuted language model[EB/OL]. (2022-03-14)[2023-06-11]. https://arxiv.org/abs/2203.06906.
[27] CUI Y M, CHE W X, WANG S J, et al. LERT: a linguistically-motivated pre-trained language model[EB/OL]. (2022-11-10)[2023-06-11]. https://arxiv.org/abs/2211.05344.

更新日期/Last Update: 2024-09-29