Large-Scale Lexical Classification of Phishing Websites

David Medzinskii

Research output: Book/Report › Other report



The prominence of phishing has risen over the past years, with the number of unique attacks reaching an all-time high in 2016. Attacks can be deployed with minimal cost and effort, enabling attackers to launch large volumes of attacks in short spaces of time. The fast-paced nature of phishing makes automated detection processes critical for the safeguarding of Internet users.

This study investigates the use of machine learning for phishing detection, with features extracted from the URL only. Through experimentation, a set of 87 effective features was identified, including a significant number of novel features not found in existing research. An evaluation of classification algorithms identified that a Random Forest model with 150 trees maximized classification performance, obtaining an F1 score of 0.92 and ROC AUC of 0.97 when tested on a noisy data set of URLs obtained from spam email, a major communication channel in which phishing attacks are found. A comparison against existing research indicated that the model built in this study outperforms state-of-the-art lexical classifiers, and often outperforms classifiers that use external features too.

The obtained results were used to build a large-scale lexical classifier, Poseidon, that is able to accelerate the classification of phishing sites, reducing the load on a more expensive classification process by 99%. It is shown that Poseidon outperforms existing systems of this nature with respect to various evaluation metrics. Testing on a live feed of 2 million unlabelled URLs/day, Poseidon is able to detect 6000 phishing attacks/month, costing $0.01 per true positive when using a mainstream cloud services provider.

This study is one of the few to evaluate classification in a real-life scenario, using phishing and benign URLs retrieved from an environment in which a large proportion of phishing attacks operate.
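To make the idea of URL-only (lexical) features concrete, the following is a minimal sketch in Python using only the standard library. The function name `lexical_features` and the eight features it computes are illustrative assumptions; they are not the 87 features identified in the report, which are not listed in this abstract.

```python
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Compute a few illustrative URL-only features.

    These features are hypothetical examples of the lexical approach,
    not the feature set used in the report.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Non-empty path segments, e.g. "/a/b.php" -> ["a", "b.php"]
    segments = [seg for seg in parts.path.split("/") if seg]
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "path_depth": len(segments),
        "uses_https": parts.scheme == "https",
    }


if __name__ == "__main__":
    example = "http://paypal-login.example.com/secure/update.php?id=123"
    for name, value in lexical_features(example).items():
        print(f"{name}: {value}")
```

A feature dictionary of this kind could be converted to a numeric vector and passed to a classifier such as the Random Forest described above; that step is omitted here because the report's training setup is not described in this abstract.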
Original language: English
Place of Publication: Bath, U. K.
Publisher: Department of Computer Science, University of Bath
Number of pages: 76
Publication status: Published - May 2017

Publication series

Name: Department of Computer Science Technical Report Series
ISSN (Electronic): 1740-9497

Bibliographical note

Supervised by Dr. Julian Padget

