UoB at AI-SOCO 2020: Approaches to Source Code Classification and the Surprising Power of n-grams

Alexander Crosby, Harish Tayyar Madabushi

Research output: Chapter or section in a book/report/conference proceeding > Chapter in a published conference proceeding

Abstract

Authorship identification of source code is the process of identifying the composer of given source code. Code authorship identification plays an important role in many real-world scenarios, such as the detection of plagiarism and ghost writing in both education and workplace settings. Additionally, it can allow the identification of individuals or organisations that produce and distribute malware programs. In this paper we describe the experimentation and submission by team UoB to the AI-SOCO track at FIRE 2020, which achieved first place. We first perform extensive testing on a variety of techniques used in source code authorship identification including n-gram, stylometric, and abstract syntax tree derived features. We also investigate the application of CodeBERT, a new pre-trained model that demonstrates state-of-the-art performance in natural and programming language tasks. Finally, we explore the potential of ensembling multiple models together to create a single superior model. Our winning model utilises byte-level n-grams extracted from source codes to build feature vectors that represent an author’s programming style. These feature vectors are then used to train a densely connected neural network model to carry out authorship classification on previously unseen source codes, achieving an accuracy of 95.11%.
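The byte-level n-gram features mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the n-gram length, the vocabulary size, the relative-frequency normalisation, and the function names `byte_ngrams`, `build_vocabulary`, and `to_feature_vector` are all assumptions made for illustration, and the densely connected neural network that would be trained on these vectors is not shown.

```python
from collections import Counter


def byte_ngrams(source: str, n: int = 3) -> Counter:
    """Count every overlapping n-gram in the UTF-8 bytes of a source file."""
    data = source.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def build_vocabulary(sources, n: int = 3, max_features: int = 1000):
    """Keep the most frequent n-grams across a corpus (assumed selection rule)."""
    totals = Counter()
    for source in sources:
        totals.update(byte_ngrams(source, n))
    # Sort by descending count, then by byte value, so the order is deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:max_features]]


def to_feature_vector(source: str, vocabulary, n: int = 3):
    """Map one source file to relative n-gram frequencies over the vocabulary."""
    counts = byte_ngrams(source, n)
    total = sum(counts.values()) or 1  # avoid division by zero on tiny inputs
    return [counts[gram] / total for gram in vocabulary]


if __name__ == "__main__":
    corpus = [
        "for (int i = 0; i < n; i++) { sum += a[i]; }",
        "for(int i=0;i<n;++i){sum+=a[i];}",
    ]
    vocab = build_vocabulary(corpus, n=3, max_features=20)
    vectors = [to_feature_vector(code, vocab, n=3) for code in corpus]
    print(len(vocab), len(vectors[0]))
```

Working on raw bytes rather than tokens keeps whitespace, bracket placement, and operator spacing in the features, which is why the two functionally identical loops in the usage example produce different vectors. Each vector would then serve as one training input to a classifier whose output classes are the candidate authors.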
Original language: English
Title of host publication: CEUR Workshop Proceedings
Subtitle of host publication: Forum for Information Retrieval Evaluation, December 16 - 20, Hyderabad, India
Publisher: CEUR-WS
Pages: 677-693
Number of pages: 17
Volume: 2826
Publication status: Published - 20 Dec 2020
