Large-Scale Lexical Classification of Phishing Websites

David Medzinskii

Research output: Book/Report (Other report)

Abstract

The prominence of phishing has risen over the past years, with the number of unique attacks reaching an all-time high in 2016. Attacks can be deployed with minimal cost and effort, enabling attackers to launch large volumes of attacks in short spaces of time. The fast-paced nature of phishing makes automated detection processes critical for the safeguarding of Internet users.

This study investigates the use of machine learning for phishing detection, with features extracted from the URL only. Through experimentation, a set of 87 effective features was identified, including a significant number of novel features not found in existing research. An evaluation of classification algorithms identified that a Random Forest model with 150 trees maximized classification performance, obtaining an F1 score of 0.92 and ROC AUC of 0.97 when testing on a noisy data set of URLs obtained from spam email - a major communication channel where phishing attacks are found. A comparison against existing research indicated that the model built in this study outperforms state-of-the-art lexical classifiers, and often outperforms classifiers that use external features too.

The obtained results were used to build a large-scale lexical classifier, Poseidon, that is able to accelerate the classification of phishing sites, reducing the load on a more expensive classification process by 99%. It is shown that Poseidon outperforms existing systems of this nature with respect to various evaluation metrics. Testing on a live feed of 2 million unlabelled URLs/day, Poseidon is able to detect 6000 phishing attacks/month, costing $0.01 per true positive when using a mainstream cloud services provider.

This study is one of the few to evaluate classification in a real-life scenario, using phishing and benign URLs retrieved from an environment in which a large proportion of phishing attacks operate.
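To make "features extracted from the URL only" concrete, the following is a minimal sketch of lexical feature extraction. It is not the report's feature set: the function name `lexical_features`, the token list `SUSPICIOUS_TOKENS`, and every feature shown are assumptions chosen for illustration, and the example URL is made up.

```python
from urllib.parse import urlsplit

# Hypothetical token list for illustration only; not taken from the report.
SUSPICIOUS_TOKENS = ("login", "verify", "secure", "account", "update")


def lexical_features(url: str) -> dict:
    """Compute a few illustrative URL-only features (not the report's 87)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        # Crude dotted-quad check: four all-digit groups, no range validation.
        "host_is_ipv4": host.count(".") == 3 and host.replace(".", "").isdigit(),
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "num_suspicious_tokens": sum(tok in lowered for tok in SUSPICIOUS_TOKENS),
    }


# Made-up URL that imitates a brand name inside a longer hostname.
features = lexical_features(
    "http://paypal.com.secure-login.example.net/verify/account?id=123"
)
```

Because nothing is fetched from the network, features like these are cheap to compute at high volume. Dictionaries of this kind could be converted to numeric vectors and passed to a tree ensemble, for example scikit-learn's `RandomForestClassifier(n_estimators=150)`; that pairing is an assumption here, since the abstract does not say which library was used.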
Language: English
Place of Publication: Bath, U. K.
Publisher: Department of Computer Science, University of Bath
Number of pages: 76
Status: Published - May 2017

Publication series

Name: Department of Computer Science Technical Report Series
ISSN (Electronic): 1740-9497

Fingerprint

Websites
Classifiers
Electronic mail
Testing
Learning systems
Internet
Costs

Cite this

Medzinskii, D. (2017). Large-Scale Lexical Classification of Phishing Websites. (Department of Computer Science Technical Report Series). Bath, U. K.: Department of Computer Science, University of Bath.

Large-Scale Lexical Classification of Phishing Websites. / Medzinskii, David.

Bath, U. K. : Department of Computer Science, University of Bath, 2017. 76 p. (Department of Computer Science Technical Report Series).

Medzinskii, D 2017, Large-Scale Lexical Classification of Phishing Websites. Department of Computer Science Technical Report Series, Department of Computer Science, University of Bath, Bath, U. K.
Medzinskii D. Large-Scale Lexical Classification of Phishing Websites. Bath, U. K.: Department of Computer Science, University of Bath, 2017. 76 p. (Department of Computer Science Technical Report Series).
Medzinskii, David. / Large-Scale Lexical Classification of Phishing Websites. Bath, U. K. : Department of Computer Science, University of Bath, 2017. 76 p. (Department of Computer Science Technical Report Series).
@book{8b126ad68887444393933cc55d5f30f6,
title = "Large-Scale Lexical Classification of Phishing Websites",
abstract = "The prominence of phishing has risen over the past years, with the number of unique attacks reaching an all-time high in 2016. Attacks can be deployed with minimal cost and effort, enabling attackers to launch large volumes of attacks in short spaces of time. The fast-paced nature of phishing makes automated detection processes critical for the safeguarding of Internet users. This study investigates the use of machine learning for phishing detection, with features extracted from the URL only. Through experimentation, a set of 87 effective features was identified, including a significant number of novel features not found in existing research. An evaluation of classification algorithms identified that a Random Forest model with 150 trees maximized classification performance, obtaining an F1 score of 0.92 and ROC AUC of 0.97 when testing on a noisy data set of URLs obtained from spam email - a major communication channel where phishing attacks are found. A comparison against existing research indicated that the model built in this study outperforms state-of-the-art lexical classifiers, and often outperforms classifiers that use external features too. The obtained results were used to build a large-scale lexical classifier, Poseidon, that is able to accelerate the classification of phishing sites, reducing the load on a more expensive classification process by 99{\%}. It is shown that Poseidon outperforms existing systems of this nature with respect to various evaluation metrics. Testing on a live feed of 2 million unlabelled URLs/day, Poseidon is able to detect 6000 phishing attacks/month, costing \$0.01 per true positive when using a mainstream cloud services provider. This study is one of the few to evaluate classification in a real-life scenario, using phishing and benign URLs retrieved from an environment in which a large proportion of phishing attacks operate.",
author = "David Medzinskii",
note = "Supervised by Dr. Julian Padget",
year = "2017",
month = "5",
language = "English",
series = "Department of Computer Science Technical Report Series",
publisher = "Department of Computer Science, University of Bath",

}

TY - BOOK

T1 - Large-Scale Lexical Classification of Phishing Websites

AU - Medzinskii, David

N1 - Supervised by Dr. Julian Padget

PY - 2017/5

Y1 - 2017/5

N2 - The prominence of phishing has risen over the past years, with the number of unique attacks reaching an all-time high in 2016. Attacks can be deployed with minimal cost and effort, enabling attackers to launch large volumes of attacks in short spaces of time. The fast-paced nature of phishing makes automated detection processes critical for the safeguarding of Internet users. This study investigates the use of machine learning for phishing detection, with features extracted from the URL only. Through experimentation, a set of 87 effective features was identified, including a significant number of novel features not found in existing research. An evaluation of classification algorithms identified that a Random Forest model with 150 trees maximized classification performance, obtaining an F1 score of 0.92 and ROC AUC of 0.97 when testing on a noisy data set of URLs obtained from spam email - a major communication channel where phishing attacks are found. A comparison against existing research indicated that the model built in this study outperforms state-of-the-art lexical classifiers, and often outperforms classifiers that use external features too. The obtained results were used to build a large-scale lexical classifier, Poseidon, that is able to accelerate the classification of phishing sites, reducing the load on a more expensive classification process by 99%. It is shown that Poseidon outperforms existing systems of this nature with respect to various evaluation metrics. Testing on a live feed of 2 million unlabelled URLs/day, Poseidon is able to detect 6000 phishing attacks/month, costing $0.01 per true positive when using a mainstream cloud services provider. This study is one of the few to evaluate classification in a real-life scenario, using phishing and benign URLs retrieved from an environment in which a large proportion of phishing attacks operate.

AB - The prominence of phishing has risen over the past years, with the number of unique attacks reaching an all-time high in 2016. Attacks can be deployed with minimal cost and effort, enabling attackers to launch large volumes of attacks in short spaces of time. The fast-paced nature of phishing makes automated detection processes critical for the safeguarding of Internet users. This study investigates the use of machine learning for phishing detection, with features extracted from the URL only. Through experimentation, a set of 87 effective features was identified, including a significant number of novel features not found in existing research. An evaluation of classification algorithms identified that a Random Forest model with 150 trees maximized classification performance, obtaining an F1 score of 0.92 and ROC AUC of 0.97 when testing on a noisy data set of URLs obtained from spam email - a major communication channel where phishing attacks are found. A comparison against existing research indicated that the model built in this study outperforms state-of-the-art lexical classifiers, and often outperforms classifiers that use external features too. The obtained results were used to build a large-scale lexical classifier, Poseidon, that is able to accelerate the classification of phishing sites, reducing the load on a more expensive classification process by 99%. It is shown that Poseidon outperforms existing systems of this nature with respect to various evaluation metrics. Testing on a live feed of 2 million unlabelled URLs/day, Poseidon is able to detect 6000 phishing attacks/month, costing $0.01 per true positive when using a mainstream cloud services provider. This study is one of the few to evaluate classification in a real-life scenario, using phishing and benign URLs retrieved from an environment in which a large proportion of phishing attacks operate.

M3 - Other report

T3 - Department of Computer Science Technical Report Series

BT - Large-Scale Lexical Classification of Phishing Websites

PB - Department of Computer Science, University of Bath

CY - Bath, U. K.

ER -