

Research on Phishing Webpage Detection Technology Based on the CNN-BiLSTM Algorithm


Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume:: 42
Issue:: 2021, No. 06

Pages:: 15-21

Column:

Publication date:: 2021-11-10

Article Info

Title:: Research on Phishing URL Detection Technology Based on CNN-BiLSTM

Author(s):: BU Youjun (卜佑军)¹; ZHANG Qiao (张桥)¹,²; CHEN Bo (陈博)¹; ZHANG Surong (张稣荣)¹; WANG Fangyu (王方玉)²

Affiliation(s):: China People’s Liberation Army Strategic Support Force Information Engineering University; Zhengzhou University Central Plains Network Security Research Institute

Keywords:: phishing URL; URL segmentation; convolutional neural network (CNN); bidirectional long short-term memory network (BiLSTM)

DOI:: 10.13705/j.issn.1671-6833.2021.04.022

Document code:: A

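Summary:: The rapid development of the Internet has also created opportunities for criminals: network attackers use phishing webpages to steal victims' sensitive information for financial gain. The two commonly used phishing webpage detection approaches, blacklist-based detection and machine-learning-based detection, have the problems of being unable to detect newly emerging phishing webpages and of requiring manual extraction of webpage features. Researchers have therefore used convolutional neural networks (CNN) to detect phishing webpages by automatically extracting URL features. These methods have limitations: (1) when a URL is converted into a feature matrix, memory is limited, so embedding vectors for new words cannot be obtained or the useful information in sensitive words is lost; (2) the long-distance dependency features of the URL cannot be captured. To address these challenges, this paper builds on existing work and proposes a phishing detection method based on CNN and bidirectional long short-term memory (BiLSTM): the method segments URLs around sensitive words to make fuller use of the information in URL data, and adds a BiLSTM on top of the CNN to capture the long-distance dependency features of the URL. Experiments show that the method achieves high accuracy, recall, and F1 score in phishing webpage detection.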
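A minimal sketch of sensitive-word segmentation, assuming the rule is: keep each sensitive word as one token and break everything else, special characters included, into single characters. The word list `SENSITIVE_WORDS` and the function name `segment_url` are hypothetical; the paper's actual word list and implementation are not given here.

```python
import re
from typing import List

# Hypothetical sensitive-word list (an assumption, not the paper's list).
SENSITIVE_WORDS = ("login", "secure", "account", "verify", "paypal")

# Try the sensitive words first (longest first), then fall back to any
# single character, so every position in the URL yields exactly one token.
_TOKEN_RE = re.compile(
    "|".join(sorted(map(re.escape, SENSITIVE_WORDS), key=len, reverse=True)) + "|.",
    re.IGNORECASE | re.DOTALL,
)


def segment_url(url: str) -> List[str]:
    """Split a URL into lowercase tokens.

    Sensitive words stay whole; special characters and the letters of
    non-sensitive words become single-character tokens.
    """
    return [match.group(0).lower() for match in _TOKEN_RE.finditer(url)]


if __name__ == "__main__":
    print(segment_url("http://paypal-login.xyz/a"))
```

Under this assumed scheme the vocabulary is bounded by the sensitive-word list plus the character set, so a previously unseen word never lacks an embedding: it is simply spelled out character by character.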

Abstract:: To address the increasingly serious problem of phishing, a phishing URL detection method based on a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) was proposed. The method first segmented each URL with a sensitive-word segmentation scheme: the URL was split at special characters and sensitive words, and non-sensitive words were split at the character level, which preserved the information carried by special characters and sensitive words and made fuller use of the URL data. The segmented URL was then fed into the CNN and BiLSTM: the CNN extracted local spatial features of the URL, the BiLSTM extracted its bidirectional long-distance dependency features, and phishing webpages were detected from these automatically extracted features. Experimental results showed that, compared with traditional machine learning and blacklist detection methods, the proposed method achieved better detection results: accuracy was 98.84%, precision 99.71%, recall 98.04%, and the F1 score 98.86%. The method required no manual feature extraction and could identify newly emerging phishing webpages.
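As a quick consistency check on the reported figures, the F1 score can be recomputed from the stated precision and recall using the standard definition (the harmonic mean of the two). The symbols P and R are introduced here for illustration only.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.9971 \times 0.9804}{0.9971 + 0.9804}
    = \frac{1.95511}{1.97750}
    \approx 0.9887
```

This gives about 98.87%, which agrees with the reported 98.86% to within the rounding of the published precision and recall.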


Last Update: 2021-12-17
