Semantic Matching Model for Chinese Scientific Datasets
LIU Jianping (1,2), CHU Xintao (1), WANG Jian (3), GU Xunxun (1), WANG Meng (1), WANG Yingfei (1)
1. College of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China; 2. The Key Laboratory of Images and Graphics Intelligent Processing of State Ethnic Affairs Commission, North Minzu University, Yinchuan 750021, China; 3. Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
Keywords: text matching; semantic matching; pre-training model; scientific datasets; natural language processing
CLC number: TP3-05; TP391.1
Abstract: Existing word-level semantic matching models struggle to understand the sentence-level metadata of scientific datasets. To address this, a sentence-level semantic matching model for Chinese scientific datasets (CSDSM) is proposed. SimCSE and CoSENT are trained on the CSL dataset to produce a CoSENT pre-trained model. Building on this CoSENT model, a multi-head self-attention mechanism is introduced for feature extraction, and the final output is obtained as a weighted sum of the cosine similarity and the KNN classification result. The open semantic metadata of the National Earth System Science Data Center is used as a self-built scientific dataset for the experiments. Compared with the Chinese BERT model, the proposed model improves Spearman's ρ by 0.0448, 0.0290, 0.1777, and 0.0509 on the public datasets AFQMC, LCQMC, Chinese-STS-B, and PAWS-X, respectively, and improves F1 and Acc on the self-built scientific dataset by 0.0788 and 0.0634, respectively. These results indicate that the proposed model effectively addresses sentence-level semantic matching for scientific datasets.
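The fusion step named in the abstract, a weighted sum of cosine similarity and a KNN classification result, can be illustrated with a minimal sketch. The abstract does not give the weight, the value of k, the distance metric, or the features fed to the KNN classifier. The names `alpha`, `k`, `pair_feature`, `knn_match_probability`, and `fused_score` below are therefore hypothetical, and the element-wise absolute difference used as the pair feature is an assumption rather than the authors' design.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_feature(emb_a, emb_b):
    """Assumed pair representation: element-wise absolute difference."""
    return [abs(a - b) for a, b in zip(emb_a, emb_b)]


def knn_match_probability(feature, train_features, train_labels, k=3):
    """Fraction of the k nearest labeled pairs (Euclidean distance) labeled 1."""
    ranked = sorted(
        zip(train_features, train_labels),
        key=lambda item: math.dist(feature, item[0]),
    )
    nearest = ranked[:k]
    return sum(label for _, label in nearest) / len(nearest)


def fused_score(emb_a, emb_b, train_features, train_labels, alpha=0.5, k=3):
    """Weighted sum: alpha * cosine similarity + (1 - alpha) * KNN vote."""
    cos = cosine(emb_a, emb_b)
    knn = knn_match_probability(
        pair_feature(emb_a, emb_b), train_features, train_labels, k
    )
    return alpha * cos + (1.0 - alpha) * knn


# Toy labeled pairs: small differences are matches (1), large ones are not (0).
TRAIN_FEATURES = [[0.0, 0.1], [0.1, 0.0], [0.9, 0.8], [1.0, 1.0]]
TRAIN_LABELS = [1, 1, 0, 0]

score = fused_score([1.0, 0.0], [1.0, 0.1], TRAIN_FEATURES, TRAIN_LABELS)
print(f"fused score: {score:.4f}")
```

In this sketch, `alpha=1.0` reduces the output to plain cosine similarity and `alpha=0.0` to the KNN vote alone. In practice the sentence embeddings would come from the trained encoder rather than the two-dimensional toy vectors used here.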


