Active learning in recommender systems for predicting vulnerabilities in software
Summary
Due to the rapid advancement of digital technology and growing reliance on the internet, cybersecurity
has become a paramount issue for individuals, organizations, and governments. To address this
challenge, penetration testing has emerged as a critical tool to ensure the security of computer
systems and networks. The reconnaissance phase of penetration testing plays a crucial role in
identifying vulnerabilities in a system by gathering relevant information. Although various tools are
available to automate this process, most of them are limited to identifying reported vulnerabilities,
and they do not provide suggestions or predictions about vulnerabilities. Therefore, this research
aims to investigate the application of recommender systems to predict common vulnerabilities
during the reconnaissance phase. The main objective of this research is to investigate how active
learning affects the performance of a recommender system that identifies vulnerabilities in software
products.
Item-Based k-NN Collaborative Filtering, a recommender system, can improve the identification of
potential vulnerabilities and the effectiveness of penetration testing by analyzing information from
similar data points. This research involves a comprehensive data preprocessing phase, which utilizes
data from the National Vulnerability Database (NVD). Several recommender systems are built using
this data, which enables the prediction of potential vulnerabilities during the reconnaissance phase
of penetration testing. The performance of these recommender systems is evaluated, and the
top-performing recommender system is then extended with active learning to enhance its performance.
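As an illustration of the general technique, the following is a minimal sketch of item-based k-NN collaborative filtering on a binary product-by-vulnerability matrix. It is not the implementation used in this research: the cosine similarity measure, the function names item_knn_scores and recommend, and the toy matrix layout are all assumptions made for the example.

```python
import numpy as np

def item_knn_scores(interactions, k=2):
    """Score every (product, vulnerability) pair with item-based k-NN.

    interactions: binary matrix, rows = products, columns = vulnerabilities
    (a hypothetical layout chosen for this sketch). Requires k >= 1.
    """
    X = np.asarray(interactions, dtype=float)
    # Cosine similarity between vulnerability columns (assumed measure).
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for empty columns
    sim = (X.T @ X) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)  # an item is not its own neighbour
    # Keep only the k most similar neighbours of each item.
    if k < sim.shape[0]:
        drop = np.argsort(sim, axis=1)[:, :-k]
        np.put_along_axis(sim, drop, 0.0, axis=1)
    # Score of item j for product u: sum of sim[j, i] over the items i
    # that product u already has among j's k neighbours.
    scores = X @ sim.T
    scores[X > 0] = -np.inf  # do not recommend already-known items
    return scores

def recommend(interactions, product, n=3, k=2):
    """Return up to n vulnerability indices ranked for one product."""
    scores = item_knn_scores(interactions, k)
    ranked = np.argsort(-scores[product])
    return [int(j) for j in ranked[:n] if np.isfinite(scores[product, j])]
```

For example, with four products and four vulnerabilities, recommend([[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 0]], product=3, n=1) ranks vulnerability 1 first, because it co-occurs most often with the one vulnerability already known for that product.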
The findings of this research demonstrate that Item-Based k-NN Collaborative Filtering outperforms
other recommender systems in terms of overall performance when it comes to identifying software
vulnerabilities. Furthermore, when compared to Item-Based k-NN Collaborative Filtering prior
to active learning or with active learning and a random sampling technique, Item-Based k-NN
Collaborative Filtering with active learning incorporating a 4- or 10-batch sampling technique with
20 or 40 items added yields a statistically significant improvement in the precision score. This
indicates that a greater proportion of the predicted vulnerabilities are correct. Item-Based k-NN
Collaborative Filtering with active learning and a single-batch sampling strategy results in
a statistically significant improvement in precision, compared to Item-Based k-NN Collaborative
Filtering prior to active learning or with active learning and a random sampling technique, only
when 20 items are added, not when 40 items are added.
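The batch configurations above can be pictured with a small sketch. It assumes, without confirmation from this summary, that a k-batch strategy splits the total number of added items evenly across k rounds and re-ranks the remaining candidates after each round. The names batch_sizes, active_learning_rounds, informativeness, and label are hypothetical, and the selection criterion is left to the caller because the summary does not specify it.

```python
from typing import Callable, Dict, Hashable, Iterable, List

def batch_sizes(budget: int, n_batches: int) -> List[int]:
    """Split a budget of items as evenly as possible over n_batches rounds."""
    base, extra = divmod(budget, n_batches)
    return [base + (1 if i < extra else 0) for i in range(n_batches)]

def active_learning_rounds(
    pool: Iterable[Hashable],
    budget: int,
    n_batches: int,
    informativeness: Callable[[Hashable, Dict], float],
    label: Callable[[Hashable], object],
) -> Dict:
    """Label `budget` candidates from `pool` in `n_batches` rounds.

    informativeness(candidate, labelled) is a hypothetical scoring hook;
    higher means the candidate is expected to help the model more.
    label(candidate) stands in for the oracle that supplies the true value.
    """
    labelled: Dict = {}
    remaining = set(pool)
    for size in batch_sizes(budget, n_batches):
        # Re-rank after every batch so later picks can use earlier labels.
        ranked = sorted(
            remaining,
            key=lambda c: informativeness(c, labelled),
            reverse=True,
        )
        batch = ranked[:size]
        for candidate in batch:
            labelled[candidate] = label(candidate)
        remaining.difference_update(batch)
    return labelled
```

Under this assumption, adding 20 items in 4 batches means four rounds of 5 items each, while a single batch selects all 20 items at once without any re-ranking in between.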
Furthermore, only Item-Based k-NN Collaborative Filtering with a 10-batch sampling strategy
adding 20 items demonstrates a statistically significant improvement in nDCG scores compared to
Item-Based k-NN Collaborative Filtering prior to active learning. This implies a more accurate
ranking of the vulnerabilities. However, because it is an isolated result, it could be a type I
error (a false positive).
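To make the difference between the two metrics concrete, the sketch below uses the standard binary-relevance definitions of precision at k and nDCG at k. The exact variants and cut-offs used in this research are not stated in this summary, so the function names, the choice of k, and the example identifiers are illustrative assumptions.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k predictions that are correct."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def ndcg_at_k(ranked, relevant, k):
    """Normalised discounted cumulative gain with binary relevance."""
    # Each hit is discounted by the log of its (1-based) rank plus one.
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # Ideal ordering: all relevant items placed at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Illustrative labels only; not data from this research.
relevant = {"CWE-79", "CWE-89"}
hits_first = ["CWE-79", "CWE-89", "CWE-20", "CWE-22"]
hits_last = ["CWE-20", "CWE-22", "CWE-79", "CWE-89"]
```

Both example lists contain the same two correct predictions in the top four, so their precision at 4 is identical, but hits_first has a higher nDCG because the correct items are ranked earlier. This is why precision can improve without a matching improvement in nDCG.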
From these findings, it can be concluded that introducing active learning in Item-Based k-NN
Collaborative Filtering, using the approaches outlined, leads to a statistically significant
improvement in the precision score but not necessarily in the nDCG score.
Considering this conclusion, it is advised to use Item-Based k-NN Collaborative Filtering with
active learning to predict vulnerabilities in software products and enhance the reconnaissance phase
of penetration testing. This can be achieved by incorporating a single-batch sampling technique
with 20 items added or a 4- or 10-batch sampling technique with 20 or 40 items added.
The insights gained from this research can help individuals, organizations, and governments strengthen
their cybersecurity defences and protect against potential cyber threats.