BEN Kerong, YANG Jiahui, ZHANG Xian, et al. Code Clone Detection Based on Transformer and Convolutional Neural Network[J]. Journal of Zhengzhou University (Engineering Science), 2023, 44(06): 12-18. [doi:10.13705/j.issn.1671-6833.2023.03.012]

Code Clone Detection Based on Transformer and Convolutional Neural Network

Journal of Zhengzhou University (Engineering Science) [ISSN:1671-6833/CN:41-1339/T]

Volume:
44
Issue:
2023, No. 06
Pages:
12-18
Publication Date:
2023-12-25

Article Info

Title:
Code Clone Detection Based on Transformer and Convolutional Neural Network
Author(s):
BEN Kerong, YANG Jiahui, ZHANG Xian, ZHAO Chong
College of Electronic Engineering, Naval University of Engineering, Wuhan 430033, China
Keywords:
code clone detection; abstract syntax tree (AST); Transformer; convolutional neural network; code feature extraction
CLC Number:
TP393
DOI:
10.13705/j.issn.1671-6833.2023.03.012
Document Code:
A
Abstract:
Deep-learning-based code clone detection methods typically operate on the token sequence of parsed code or on the entire abstract syntax tree (AST), extracting features with recurrent-neural-network-based sequence models; this can discard important syntactic and semantic information in the source code and induce vanishing gradients. To address this problem, a code clone detection method based on Transformer and convolutional neural network (TCCCD) was proposed. First, TCCCD parses source code into an AST and cuts the AST into statement subtrees, which are fed to the neural network; each statement subtree is the sequence of statement nodes obtained by preorder traversal and carries the structural and hierarchical information of the code. Second, in the network design, TCCCD uses the Transformer encoder to extract global information from the code and a convolutional neural network to capture local information. Third, the features extracted by the two networks are fused to learn a code vector representation that embeds lexical, syntactic, and structural information. Finally, the Euclidean distance between two code vectors represents their degree of semantic association, and a classifier is trained on it to detect clones. Experimental results show that precision, recall, and F1 reach 98.9%, 98.1%, and 98.5% on the OJClone dataset, and 99.1%, 91.5%, and 94.2% on the BigCloneBench dataset. Compared with related methods, precision, recall, and F1 all improve, demonstrating that the proposed method can effectively detect code clones.
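Two steps of the pipeline described in the abstract can be illustrated concretely: cutting an AST into per-statement preorder subtree sequences, and the Euclidean-distance clone decision. The sketch below uses Python's built-in `ast` module as a stand-in parser; the statement-type set, function names, and fixed threshold are illustrative assumptions, not the authors' implementation (the paper trains a classifier on the distance rather than thresholding it).

```python
import ast
import math

# Node types treated as "statements" when cutting the AST into subtrees
# (an assumed set for demonstration).
STMT_TYPES = (ast.FunctionDef, ast.Assign, ast.AugAssign, ast.For,
              ast.While, ast.If, ast.Return, ast.Expr)

def preorder(node):
    """Yield `node` and all of its descendants in preorder (depth-first)."""
    yield node
    for child in ast.iter_child_nodes(node):
        yield from preorder(child)

def statement_subtrees(source):
    """Cut a parsed AST into per-statement node-type sequences.

    Each subtree is the preorder sequence of node-type names rooted at a
    statement node -- the kind of token stream a sequence model consumes.
    """
    tree = ast.parse(source)
    return [[type(n).__name__ for n in preorder(stmt)]
            for stmt in preorder(tree) if isinstance(stmt, STMT_TYPES)]

def is_clone(vec_a, vec_b, threshold=1.0):
    """Toy clone decision from the Euclidean distance of two code vectors.

    A fixed threshold stands in for the trained classifier of the paper.
    """
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))
    return dist < threshold

# A simple assignment yields one statement subtree:
# statement_subtrees("x = 1") -> [["Assign", "Name", "Store", "Constant"]]
```

In the paper the two vectors compared by `is_clone` would come from fusing the Transformer-encoder (global) and CNN (local) features of each code fragment's statement-subtree sequence.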

References:

[1] CHEN Q Y, LI S P, YAN M, et al. Code clone detection: a literature review[J]. Journal of Software, 2019, 30(4): 962-980.
[2] BELLON S, KOSCHKE R, ANTONIOL G, et al. Comparison and evaluation of clone detection tools[J]. IEEE Transactions on Software Engineering, 2007, 33(9): 577-591.
[3] HINDLE A, BARR E T, SU Z D, et al. On the naturalness of software[C]//2012 34th International Conference on Software Engineering (ICSE). Piscataway: IEEE, 2012: 837-847.
[4] CORDY J R, ROY C K. The NiCad clone detector[C]//2011 IEEE 19th International Conference on Program Comprehension. Piscataway: IEEE, 2011: 219-220.
[5] LI L Q, FENG H, ZHUANG W J, et al. CCLearner: a deep learning-based clone detection approach[C]//2017 IEEE International Conference on Software Maintenance and Evolution (ICSME). Piscataway: IEEE, 2017: 249-260.
[6] WEI H H, LI M. Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code[C]//Proceedings of the 26th International Joint Conference on Artificial Intelligence. New York: ACM, 2017: 3034-3040.
[7] ALON U, LEVY O, YAHAV E. CODE2SEQ: generating sequences from structured representations of code[EB/OL]. (2018-08-04) [2022-09-11]. https://arxiv.org/abs/1808.01400.
[8] ALON U, ZILBERSTEIN M, LEVY O, et al. CODE2VEC: learning distributed representations of code[J]. Proceedings of the ACM on Programming Languages, 2019, 3: 40.
[9] ZENG J, BEN K R, LI X W, et al. Fast code clone detection based on weighted recursive autoencoders[J]. IEEE Access, 2019, 7: 125062-125078.
[10] ZHANG J, WANG X, ZHANG H Y, et al. A novel neural source code representation based on abstract syntax tree[C]//2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). Piscataway: IEEE, 2019: 783-794.
[11] MENG Y, LIU L. A deep learning approach for a source code detection model using self-attention[J]. Complexity, 2020, 2020: 1-15.
[12] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems 30. Long Beach: NIPS, 2017: 5998-6008.
[13] CORDONNIER J B, LOUKAS A, JAGGI M. On the relationship between self-attention and convolutional layers[EB/OL]. (2019-11-08) [2022-09-11]. https://arxiv.org/abs/1911.03584.
[14] GONG J J, QIU X P, CHEN X C, et al. Convolutional interaction network for natural language inference[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2018: 1576-1585.
[15] YANG G, ZHOU Y L, CHEN X, et al. Fine-grained pseudo-code generation method via code feature extraction and transformer[C]//2021 28th Asia-Pacific Software Engineering Conference (APSEC). Piscataway: IEEE, 2022: 213-222.
[16] ZHANG A L, ZHANG Q K, HUANG D Y, et al. Intrusion detection model based on CNN and BiGRU fused neural network[J]. Journal of Zhengzhou University (Engineering Science), 2022, 43(3): 37-43.
[17] YUAN Y H, HUANG L, GUO J Y, et al. OCNet: object context for semantic segmentation[J]. International Journal of Computer Vision, 2021, 129(8): 2375-2398.
[18] GEHRING J, AULI M, GRANGIER D, et al. Convolutional sequence to sequence learning[C]//Proceedings of the 34th International Conference on Machine Learning. New York: ACM, 2017: 1243-1252.

Last Update: 2023-10-22