[1] Bu Youjun, Zhang Qiao, Chen Bo, et al. Research on phishing webpage detection technology based on CNN-BiLSTM algorithm[J]. Journal of Zhengzhou University (Engineering Science), 2021, 42(06): 15-21. [doi:10.13705/j.issn.1671-6833.2021.04.022]

Research on phishing webpage detection technology based on CNN-BiLSTM algorithm

Journal of Zhengzhou University (Engineering Science) [ISSN: 1671-6833 / CN: 41-1339/T]

Volume:
42
Issue:
2021, No. 06
Pages:
15-21
Column:
Publication date:
2021-11-10

Article Info

Title:
Research on phishing webpage detection technology based on CNN-BiLSTM algorithm
Author(s):
Bu Youjun (卜佑军); Zhang Qiao (张桥); Chen Bo (陈博); Zhang Surong (张稣荣); Wang Fangyu (王方玉)
People's Liberation Army Strategic Support Force Information Engineering University; Central Plains Network Security Research Institute, Zhengzhou University

Keywords:
DOI:
10.13705/j.issn.1671-6833.2021.04.022
Document code:
A
Abstract:
The rapid development of the Internet has also created opportunities for criminals: network attackers use phishing webpages to steal victims' sensitive information for economic gain. The two detection approaches in common use today, blacklist-based detection and machine-learning-based detection, respectively cannot detect newly emerging phishing webpages and require manual extraction of webpage features. Researchers have therefore used convolutional neural networks (CNN) to detect phishing webpages by automatically extracting URL features. That approach has two limitations: (1) memory is limited when the URL is converted into a feature matrix, so embedding vectors for new words cannot be obtained or the useful information in sensitive words is lost; (2) the long-distance dependency features of the URL cannot be captured. To address these challenges, this paper builds on existing work and proposes a phishing detection method based on CNN and bi-directional long short-term memory (BiLSTM): the method segments URLs based on sensitive words to make fuller use of the information in URL data, and adds a BiLSTM on top of the convolutional neural network to capture the long-distance dependency features of URLs. Experiments show that the method achieves high accuracy, recall, and F1 score in phishing webpage detection.
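The abstract does not spell out how the sensitive-word segmentation works, so the following is only a minimal sketch of one plausible reading: split a URL on non-alphanumeric delimiters, then split each remaining chunk so that every sensitive word becomes its own token. The word list `SENSITIVE_WORDS` and the helper names `split_on_sensitive_words` and `tokenize_url` are assumptions made for illustration, not the authors' vocabulary or implementation.

```python
import re

# Hypothetical sensitive-word list (assumption); the paper's vocabulary is not given here.
SENSITIVE_WORDS = ("login", "secure", "account", "verify", "paypal", "bank")

# Any run of characters that is not a letter or digit is treated as a delimiter.
_DELIMITERS = re.compile(r"[^A-Za-z0-9]+")


def split_on_sensitive_words(chunk, words=SENSITIVE_WORDS):
    """Split one delimiter-free chunk so each sensitive word is a separate token."""
    if not chunk:
        return []
    if not words:
        return [chunk]
    # Try longer words first so a long entry wins over a shorter overlapping one.
    ordered = sorted(words, key=len, reverse=True)
    pattern = "|".join(re.escape(w) for w in ordered)
    # The capturing group makes re.split keep the matched sensitive words.
    parts = re.split(f"({pattern})", chunk)
    return [p for p in parts if p]


def tokenize_url(url, words=SENSITIVE_WORDS):
    """Lower-case a URL and return its tokens, isolating sensitive words."""
    tokens = []
    for chunk in _DELIMITERS.split(url.lower()):
        tokens.extend(split_on_sensitive_words(chunk, words))
    return tokens


print(tokenize_url("http://paypal-securelogin.example.com/verify"))
# ['http', 'paypal', 'secure', 'login', 'example', 'com', 'verify']
```

Under this reading, a fused string such as `securelogin` yields the two tokens `secure` and `login` rather than a single out-of-vocabulary word, which is the kind of information loss the abstract attributes to earlier CNN-based approaches. The resulting token sequence would then be embedded and passed to the CNN and BiLSTM layers, which are not sketched here.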
Last Update: 2021-12-17