A Systematic Literature Review on Phishing Email Detection Using Natural Language Processing Techniques
Date
2022
Publisher
IEEE
Abstract
Every year, phishing results in losses of billions of dollars and is a major threat to the Internet
economy. Phishing attacks are now most often carried out by email. To better comprehend the existing
research trend of phishing email detection, several review studies have been performed. However, it is
important to assess this issue from different perspectives. None of the existing surveys has comprehensively
studied the use of Natural Language Processing (NLP) techniques for phishing detection, except one that
shed light on the use of NLP techniques for classification and training purposes while exploring a few
alternatives. To bridge this gap, this study aims to systematically review and synthesise research on the use
of NLP for detecting phishing emails. Based on specific predefined criteria, a total of 100 research articles
published between 2006 and 2022 were identified and analysed. We study the key research areas in phishing
email detection using NLP, machine learning algorithms used in phishing email detection, text features in
phishing emails, datasets and resources that have been used for phishing email detection, and the evaluation criteria.
The findings show that the main research area in phishing detection studies is feature extraction and
selection, followed by methods for classifying and optimizing the detection of phishing emails. Amongst
the range of classification algorithms, support vector machines (SVMs) are heavily utilised for detecting
phishing emails. The most frequently used NLP techniques are found to be TF-IDF and word embeddings.
Furthermore, the most commonly used dataset for benchmarking phishing email detection methods is the
Nazario phishing corpus. Also, Python is the most commonly used programming language for phishing email detection. It is
expected that the findings of this paper can be helpful for the scientific community, especially in the field
of NLP applications to cybersecurity problems. This survey is also unique in that it relates works
to their openly available tools and resources. The analysis of the presented works revealed that little
work has been performed on Arabic-language phishing emails using NLP techniques. Therefore, many open
issues remain in Arabic phishing email detection.
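
As an illustration of TF-IDF, which the abstract names as one of the most frequently used NLP techniques, the following is a minimal sketch and not the method of any reviewed study. It assumes the classic weighting tf * ln(N / df) and whitespace tokenisation; the function names and the three toy messages are made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real pipelines clean more.
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * ln(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy, made-up messages (not taken from any corpus).
emails = [
    "verify your account now",
    "your meeting is at noon",
    "verify your password now",
]
weights = tf_idf(emails)
```

Under this weighting, a term that occurs in every message (here "your") gets weight zero, while a term found in only one message (here "account") gets the largest weight in that message. Vectors of this kind are the sort of text feature that could then be passed to a classifier such as an SVM.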
Keywords
Phishing email detection, systematic literature review, natural language processing, machine learning
Citation
Salloum, S. et al. (2022) “A Systematic Literature Review on Phishing Email Detection Using Natural Language Processing Techniques,” IEEE Access, 10.