[1] BEN Kerong, YANG Jiahui, ZHANG Xian, et al. Code Clone Detection Based on Transformer and Convolutional Neural Network[J]. Journal of Zhengzhou University (Engineering Science), 2023, 44(06): 12-18. [doi:10.13705/j.issn.1671-6833.2023.03.012]
Code Clone Detection Based on Transformer and Convolutional Neural Network
Journal of Zhengzhou University (Engineering Science) [ISSN:1671-6833/CN:41-1339/T]
- Volume: 44
- Issue: 2023, No. 06
- Pages: 12-18
- Section:
- Publication Date: 2023-09-25
Article Info
- Title: Code Clone Detection Based on Transformer and Convolutional Neural Network
- Author(s): BEN Kerong (贲可荣); YANG Jiahui (杨佳辉); ZHANG Xian (张献); ZHAO Chong (赵翀)
- Affiliation: College of Electronic Engineering, Naval University of Engineering, Wuhan 430033, China
- Keywords: code clone detection; abstract syntax tree (AST); Transformer; convolutional neural network; code feature extraction
- DOI: 10.13705/j.issn.1671-6833.2023.03.012
- Document Code: A
- Abstract:
Deep-learning-based code clone detection methods typically operate on the token sequence of the parsed code or on the entire abstract syntax tree (AST), extracting features with RNN-based sequence models; this loses important syntactic and semantic information from the source code and can induce vanishing gradients. To address this problem, a code clone detection method based on Transformer and convolutional neural network (TCCCD) was proposed. First, TCCCD parses the source code into an AST and cuts the AST into statement subtrees, which are fed to the neural network; each statement subtree is the sequence of statement nodes obtained by pre-order traversal and carries the structural and hierarchical information of the code. Second, in the network design, TCCCD uses the Transformer encoder to extract global information from the code and a convolutional neural network to capture local information. Third, the features extracted by the two networks are fused to learn a code vector representation embedding lexical, syntactic, and structural information. Finally, the Euclidean distance between two code vectors represents their degree of semantic relatedness, and a classifier is trained to detect clones. Experimental results show that precision, recall, and F1 reach 98.9%, 98.1%, and 98.5% on the OJClone dataset, and 99.1%, 91.5%, and 94.2% on the BigCloneBench dataset, respectively. Compared with related methods, precision, recall, and F1 are all improved, showing that the proposed method can detect code clones effectively.
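The pipeline the abstract describes (statement subtrees from a pre-order AST traversal, a fixed-length code vector, and a Euclidean-distance clone decision) can be sketched in miniature. This is an illustrative sketch only, not the paper's implementation: TCCCD parses C/Java code and learns its vectors with a Transformer encoder fused with a CNN, whereas here Python's stdlib `ast` module stands in for the parser, and a hand-built bag-of-node-types count (`bag_of_nodes_vector`, with an invented `vocab`) stands in for the learned embedding. The example snippets are likewise fabricated for illustration.

```python
import ast
import math

def statement_subtrees(source: str):
    # Split a parsed module into top-level statement subtrees.
    # (Illustrative: the paper cuts the AST of C/Java code.)
    return list(ast.parse(source).body)

def preorder_node_types(node: ast.AST):
    # Pre-order traversal: emit each node's type name, parent first.
    names = [type(node).__name__]
    for child in ast.iter_child_nodes(node):
        names.extend(preorder_node_types(child))
    return names

def bag_of_nodes_vector(subtrees, vocab):
    # Crude stand-in for the learned code vector: count node types.
    # TCCCD instead feeds the node sequences to a Transformer encoder
    # (global features) and a CNN (local features) and fuses them.
    counts = dict.fromkeys(vocab, 0)
    for subtree in subtrees:
        for name in preorder_node_types(subtree):
            if name in counts:
                counts[name] += 1
    return [counts[v] for v in vocab]

def euclidean(u, v):
    # Distance between two code vectors stands for semantic relatedness.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

code_a = "def f(n):\n    s = 0\n    for i in range(n):\n        s += i\n    return s\n"
code_b = "def g(m):\n    t = 0\n    for j in range(m):\n        t += j\n    return t\n"  # renamed-identifier clone of code_a
code_c = "def h(x):\n    return x * (x - 1) // 2\n"  # same task, different structure

vocab = ["FunctionDef", "For", "Return", "AugAssign", "Assign", "Call", "BinOp"]
va = bag_of_nodes_vector(statement_subtrees(code_a), vocab)
vb = bag_of_nodes_vector(statement_subtrees(code_b), vocab)
vc = bag_of_nodes_vector(statement_subtrees(code_c), vocab)

# Identifier renaming leaves the node-type structure unchanged, so the
# renamed clone sits at distance 0; the restructured version is farther.
print(euclidean(va, vb))  # 0.0
print(euclidean(va, vc))
```

Because the toy vector ignores identifiers, the renamed-identifier pair collapses to distance 0, which is why AST-level features catch clones that token-level diffing misses; the learned embedding in the paper additionally separates structurally different but semantically related code.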
Last Update: 2023-10-22