The prominence of phishing has risen in recent years, with the number of unique attacks reaching an all-time high in 2016. Attacks can be deployed with minimal cost and effort, enabling attackers to launch large volumes of attacks in short spaces of time. The fast-paced nature of phishing makes automated detection processes critical for safeguarding Internet users.

This study investigates the use of machine learning for phishing detection, with features extracted from the URL only. Through experimentation, a set of 87 effective features was identified, including a significant number of novel features not found in existing research. An evaluation of classification algorithms identified that a Random Forest model with 150 trees maximized classification performance, obtaining an F1 score of 0.92 and a ROC AUC of 0.97 when tested on a noisy data set of URLs obtained from spam email, a major communication channel in which phishing attacks are found. A comparison against existing research indicated that the model built in this study outperforms state-of-the-art lexical classifiers, and often outperforms classifiers that also use external features.

The obtained results were used to build a large-scale lexical classifier, Poseidon, that accelerates the classification of phishing sites, reducing the load on a more expensive classification process by 99%. It is shown that Poseidon outperforms existing systems of this nature with respect to various evaluation metrics. Tested on a live feed of 2 million unlabelled URLs per day, Poseidon is able to detect 6000 phishing attacks per month, at a cost of $0.01 per true positive when using a mainstream cloud services provider. This study is one of the few to evaluate classification in a real-life scenario, using phishing and benign URLs retrieved from an environment in which a large proportion of phishing attacks operate.
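To make the idea of "features extracted from the URL only" concrete, the following is a minimal sketch of lexical feature extraction in Python. The feature names and the choice of features are illustrative assumptions and are not the 87 features identified in the study; the helper names `shannon_entropy`, `is_ipv4` and `lexical_features` are likewise hypothetical.

```python
import ipaddress
import math
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_ipv4(host: str) -> bool:
    """True if the host is a literal IPv4 address rather than a domain name."""
    try:
        ipaddress.IPv4Address(host)
        return True
    except ValueError:
        return False


def lexical_features(url: str) -> dict:
    """Compute a few illustrative features from the URL string alone.

    No DNS lookup, page fetch, or other external query is made.
    These features are assumptions for illustration, not the study's set.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parts.path),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        "host_is_ipv4": int(is_ipv4(host)),
        "host_entropy": shannon_entropy(host),
    }


if __name__ == "__main__":
    print(lexical_features("http://192.168.0.1/login-update/index.php?id=42"))
```

Each URL is mapped to a fixed-length numeric vector, which could then be passed to a tree ensemble such as a Random Forest with 150 trees. The library and training setup used in the study are not stated here, so any particular choice (for example scikit-learn's `RandomForestClassifier(n_estimators=150)`) would be an assumption.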
|Name||Department of Computer Science Technical Report Series|