BEN Kerong, YANG Jiahui, ZHANG Xian, ZHAO Chong
Abstract:
Deep-learning-based code clone detection typically applies a model to extract features from the token sequence or the entire AST, which can lose important semantic information and cause vanishing gradients. To address these problems, a code clone detection method based on a Transformer and a CNN was proposed. First, the source code was parsed into an AST, which was then split into statement subtrees and fed into the neural network. Each statement subtree consisted of a sequence of statement nodes obtained by preorder traversal, preserving structural and hierarchical information. In the network design, the Transformer encoder was used to extract the global information of the code, while the CNN captured local information; the features extracted by the two networks were then fused. In this way, a vector containing lexical, syntactic, and structural information was learned. The Euclidean distance between vectors represented the degree of semantic association, and a classifier was trained to detect code clones. Experimental results showed that on the OJClone dataset, the Precision, Recall, and F1 values reached 98.9%, 98.1%, and 98.5%, respectively; on the BigCloneBench dataset, they reached 99.1%, 91.5%, and 94.2%, respectively. Compared with related methods, Precision, Recall, and F1 were all improved, showing that the method can effectively detect code clones.
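The fusion idea described above can be illustrated with a minimal sketch (not the authors' implementation; all shapes, names, and the single-head attention are simplifying assumptions): a Transformer-style self-attention pass supplies global context, a 1-D convolution supplies local features, the pooled outputs are concatenated, and two code fragments are compared by the Euclidean distance between their fused vectors.

```python
# Illustrative sketch only: global (self-attention) + local (1-D conv)
# feature fusion over token embeddings, compared by Euclidean distance.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 50, 16            # assumed toy vocabulary and embedding size

def self_attention(x):
    # Single-head scaled dot-product attention: every position attends
    # to every other position, capturing global information.
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

def conv1d(x, kernel):
    # Valid 1-D convolution over the sequence: local n-gram features.
    k = kernel.shape[0]
    out = np.stack([(x[i:i + k] * kernel).sum(axis=0)
                    for i in range(len(x) - k + 1)])
    return np.maximum(out, 0.0)                   # ReLU

def encode(tokens, table, kernel):
    x = table[np.asarray(tokens)]                 # (seq_len, DIM) embeddings
    g = self_attention(x).mean(axis=0)            # pooled global feature
    l = conv1d(x, kernel).mean(axis=0)            # pooled local feature
    return np.concatenate([g, l])                 # fused representation

table = rng.normal(size=(VOCAB, DIM))             # random embedding table
kernel = rng.normal(size=(3, DIM))                # one conv filter, width 3

a = encode([1, 2, 3, 4, 5], table, kernel)
b = encode([1, 2, 3, 4, 5], table, kernel)        # identical fragment
c = encode([9, 8, 7, 6, 5], table, kernel)        # different fragment

d_clone = np.linalg.norm(a - b)                   # 0.0 for identical inputs
d_other = np.linalg.norm(a - c)
print(d_clone, d_other)
```

In practice a classifier would be trained on such distances (or on the fused vectors directly) to decide whether a pair of fragments is a clone; here the distance alone already ranks the identical pair as closer than the unrelated one.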