[1] BEN Kerong, YANG Jiahui, ZHANG Xian, et al. Code Clone Detection Based on Transformer and Convolutional Neural Network[J]. Journal of Zhengzhou University (Engineering Science), 2023, 44(06): 12-18. [doi:10.13705/j.issn.1671-6833.2023.03.012]
Code Clone Detection Based on Transformer and Convolutional Neural Network
Journal of Zhengzhou University (Engineering Science) [ISSN:1671-6833/CN:41-1339/T]
- Volume: 44
- Issue: 2023, No. 06
- Pages: 12-18
- Section:
- Publication Date: 2023-09-25
Article Info
- Title: Code Clone Detection Based on Transformer and Convolutional Neural Network
- Author(s): BEN Kerong (贲可荣); YANG Jiahui (杨佳辉); ZHANG Xian (张献); ZHAO Chong (赵翀)
- Affiliation: College of Electronic Engineering, Naval University of Engineering, Wuhan 430033, China
- Keywords: code clone detection; abstract syntax tree (AST); Transformer; convolutional neural network; code feature extraction
- DOI: 10.13705/j.issn.1671-6833.2023.03.012
- Document Code: A
- Abstract:
Deep-learning-based code clone detection methods typically operate on the token sequence of the parsed code or on the entire abstract syntax tree (AST), extracting features with RNN-based sequence models; this loses important syntactic and semantic information from the source code and can induce vanishing gradients. To address this problem, a code clone detection method based on Transformer and convolutional neural network (TCCCD) was proposed. First, TCCCD parses the source code into an AST and cuts the AST into statement subtrees, which are fed to the neural network; each statement subtree is the sequence of statement nodes obtained by pre-order traversal and carries the structural and hierarchical information of the code. Second, in the network design, TCCCD uses the Transformer encoder to extract global information from the code and a convolutional neural network to capture local information. Third, the features extracted by the two networks are fused to learn a code vector representation embedding lexical, syntactic, and structural information. Finally, the Euclidean distance between two code vectors represents their degree of semantic relatedness, and a classifier is trained to detect clones. Experimental results show that precision, recall, and F1 reach 98.9%, 98.1%, and 98.5% on the OJClone dataset, and 99.1%, 91.5%, and 94.2% on the BigCloneBench dataset, respectively. Compared with related methods, precision, recall, and F1 are all improved, showing that the proposed method can detect code clones effectively.
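The pipeline the abstract describes (statement subtrees from a pre-order AST traversal, a fixed-length code vector, and a Euclidean-distance clone decision) can be sketched in miniature. This is an illustrative sketch only, not the paper's implementation: TCCCD parses C/Java code and learns its vectors with a Transformer encoder fused with a CNN, whereas here Python's stdlib `ast` module stands in for the parser, and a hand-built bag-of-node-types count (`bag_of_nodes_vector`, with an invented `vocab`) stands in for the learned embedding. The example snippets are likewise fabricated for illustration.

```python
import ast
import math

def statement_subtrees(source: str):
    # Split a parsed module into top-level statement subtrees.
    # (Illustrative: the paper cuts the AST of C/Java code.)
    return list(ast.parse(source).body)

def preorder_node_types(node: ast.AST):
    # Pre-order traversal: emit each node's type name, parent first.
    names = [type(node).__name__]
    for child in ast.iter_child_nodes(node):
        names.extend(preorder_node_types(child))
    return names

def bag_of_nodes_vector(subtrees, vocab):
    # Crude stand-in for the learned code vector: count node types.
    # TCCCD instead feeds the node sequences to a Transformer encoder
    # (global features) and a CNN (local features) and fuses them.
    counts = dict.fromkeys(vocab, 0)
    for subtree in subtrees:
        for name in preorder_node_types(subtree):
            if name in counts:
                counts[name] += 1
    return [counts[v] for v in vocab]

def euclidean(u, v):
    # Distance between two code vectors stands for semantic relatedness.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

code_a = "def f(n):\n    s = 0\n    for i in range(n):\n        s += i\n    return s\n"
code_b = "def g(m):\n    t = 0\n    for j in range(m):\n        t += j\n    return t\n"  # renamed-identifier clone of code_a
code_c = "def h(x):\n    return x * (x - 1) // 2\n"  # same task, different structure

vocab = ["FunctionDef", "For", "Return", "AugAssign", "Assign", "Call", "BinOp"]
va = bag_of_nodes_vector(statement_subtrees(code_a), vocab)
vb = bag_of_nodes_vector(statement_subtrees(code_b), vocab)
vc = bag_of_nodes_vector(statement_subtrees(code_c), vocab)

# Identifier renaming leaves the node-type structure unchanged, so the
# renamed clone sits at distance 0; the restructured version is farther.
print(euclidean(va, vb))  # 0.0
print(euclidean(va, vc))
```

Because the toy vector ignores identifiers, the renamed-identifier pair collapses to distance 0, which is why AST-level features catch clones that token-level diffing misses; the learned embedding in the paper additionally separates structurally different but semantically related code.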
Last Update: 2023-10-22