Abstract
Authorship identification of source code is the process of identifying the author of a given piece of source code. It plays an important role in many real-world scenarios, such as the detection of plagiarism and ghostwriting in both education and workplace settings. It can also help identify individuals or organisations that produce and distribute malware. In this paper, we describe the experimentation and submission by team UoB to the AI-SOCO track at FIRE 2020, which achieved first place. We first perform extensive testing on a variety of techniques used in source code authorship identification, including n-gram, stylometric, and abstract syntax tree derived features. We also investigate the application of CodeBERT, a new pre-trained model that demonstrates state-of-the-art performance in natural and programming language tasks. Finally, we explore the potential of ensembling multiple models to create a single superior model. Our winning model utilises byte-level n-grams extracted from source code to build feature vectors that represent an author’s programming style. These feature vectors are then used to train a densely connected neural network to carry out authorship classification on previously unseen source code, achieving an accuracy of 95.11%.
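As a rough illustration of the byte-level n-gram features mentioned in the abstract, the sketch below turns source code into a fixed-length frequency vector. It is not the authors' implementation: the n-gram length (3), the vocabulary size, the frequency normalisation, and the function names `byte_ngrams`, `build_vocabulary`, and `to_feature_vector` are all assumptions made for this example, and the densely connected classifier is not shown.

```python
from collections import Counter


def byte_ngrams(source: str, n: int = 3) -> Counter:
    """Count overlapping byte-level n-grams in one source file.

    The default n=3 is an assumption; the abstract does not state it.
    """
    data = source.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def build_vocabulary(sources, n: int = 3, size: int = 1000):
    """Keep the `size` most frequent n-grams across the training files.

    Ties are broken by byte order so the vocabulary is deterministic.
    """
    totals = Counter()
    for source in sources:
        totals.update(byte_ngrams(source, n))
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:size]]


def to_feature_vector(source: str, vocabulary, n: int = 3):
    """Map one file to relative n-gram frequencies over the vocabulary."""
    counts = byte_ngrams(source, n)
    total = sum(counts.values()) or 1  # avoid dividing by zero on tiny inputs
    return [counts[gram] / total for gram in vocabulary]


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    training = ["int main(){return 0;}", "int main ( ) { return 0 ; }"]
    vocab = build_vocabulary(training, n=3, size=20)
    vector = to_feature_vector("int main(){return 1;}", vocab, n=3)
    print(len(vocab), len(vector))
```

Vectors of this kind, one per source file and labelled with the author, would then serve as the input to a classifier such as the densely connected network described above.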
Original language | English |
---|---|
Title of host publication | CEUR Workshop Proceedings |
Subtitle of host publication | Forum for Information Retrieval Evaluation, December 16 - 20, Hyderabad, India |
Publisher | CEUR-WS |
Pages | 677-693 |
Number of pages | 17 |
Volume | 2826 |
Publication status | Published - 20 Dec 2020 |