Improving neural network trojan detection via network abstraction

Eiermann, Marcello

Publication date

2024

Summary

Deep learning-based image recognition systems have become essential in a variety of applications, including autonomous driving functions in vehicles. The increased use of third-party datasets and pretrained models opens up a new security risk: users cannot know whether the data or the model has been manipulated. Attackers can plant a backdoor during the training phase by poisoning part of the dataset with a trojan trigger. The trojaned model behaves normally on benign inputs, but inputs that contain the trigger cause the model to select the wrong output intended by the attacker. In the domain of autonomous vehicles, an attack that causes a deliberate misclassification of a road sign could have fatal consequences. One method for detecting neural network trojans is Artificial Brain Stimulation (ABS), which manually stimulates a neuron’s activation value and observes the change in output activation values. We combine ABS with the neural network abstraction tool DeepAbstract, which computes clusters and cluster representatives based on the input/output similarity of neurons. Our strategy applies the ABS analysis selectively to the subset of cluster representatives, with the aim of reducing the computational load and increasing the detection accuracy. To assess the efficacy of our method, we conducted experiments using the GTSRB dataset, trojaning multiple models with six distinct triggers of varying visibility. We analyze two research questions: whether our method can improve runtime compared to ABS, and whether it can increase the detection accuracy. One model showed an improvement in stimulation runtime, while the runtime of the other models remained equal. Across all tested models, our method consistently yields detection accuracy superior or equal to that of ABS. At best, our method increased the reverse-engineered attack success rate score by 33% and the number of detected trojaned neurons by 59%, demonstrating a clear improvement in detection accuracy.
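
As a rough illustration of the idea described in the summary, the following Python sketch groups hidden neurons by the similarity of their activations and then stimulates only one representative per group. It is not the thesis implementation, and it is not the actual ABS or DeepAbstract algorithm: the toy network, the greedy tolerance-based grouping, the `flip_rate` score, and all function and variable names are assumptions made purely for illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def hidden(W1, x):
    """Hidden-layer activations of a toy two-layer network."""
    return relu(matvec(W1, x))

def predict(W2, h):
    """Index of the largest output value for hidden activations h."""
    out = matvec(W2, h)
    return max(range(len(out)), key=lambda k: out[k])

def cluster_representatives(acts, tol):
    """Greedy grouping (an assumed stand-in for a real abstraction tool).

    acts[n] holds neuron n's activations over the sample inputs. A neuron
    joins the first cluster whose representative differs from it by at most
    tol on every input; otherwise it becomes a new representative.
    Returns {representative: [member neurons]}.
    """
    clusters = {}
    for n, a in enumerate(acts):
        for r in clusters:
            if max(abs(p - q) for p, q in zip(a, acts[r])) <= tol:
                clusters[r].append(n)
                break
        else:
            clusters[n] = [n]
    return clusters

def flip_rate(W1, W2, inputs, neuron, level):
    """Fraction of inputs whose predicted label changes when the given
    hidden neuron is forced to the activation value `level`."""
    flips = 0
    for x in inputs:
        h = hidden(W1, x)
        before = predict(W2, h)
        h[neuron] = level  # the stimulation step
        flips += predict(W2, h) != before
    return flips / len(inputs)

# Toy network (hypothetical weights): 3 inputs, 4 hidden neurons, 3 labels.
# Neurons 0 and 1 are duplicates; neuron 3 is silent on the sample inputs
# but strongly drives label 2, mimicking a compromised neuron.
W1 = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W2 = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 5.0]]
inputs = [[1.0, 0.5, 0.0], [0.2, 0.9, 0.0], [0.7, 0.1, 0.0]]

# Transpose per-input activations into per-neuron activation vectors.
acts = [list(col) for col in zip(*(hidden(W1, x) for x in inputs))]
clusters = cluster_representatives(acts, tol=0.05)

# Stimulate only the representatives instead of every neuron.
rates = {rep: flip_rate(W1, W2, inputs, rep, level=8.0) for rep in clusters}
print(clusters, rates)
```

In this toy setup, the duplicate neurons 0 and 1 share one representative, so three neurons are stimulated instead of four, and neuron 3 is the only one whose stimulation changes the prediction for every sample input. A real analysis would need a far more careful scoring and a trigger reverse-engineering step, which this sketch leaves out.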

URI

https://studenttheses.uu.nl/handle/20.500.12932/46064

Collections

Theses